Distinguished Engineer, AI Resiliency Lead

NVIDIA is the world leader in accelerated computing, pioneering solutions in AI and digital twins.
$308,000 - $471,500
Machine Learning
Principal Software Engineer
In-Person
5,000+ Employees
15+ years of experience
AI

Description For Distinguished Engineer, AI Resiliency Lead

NVIDIA is seeking a Distinguished Engineer to lead AI Resiliency initiatives, focusing on architecting and developing world-class software resiliency features for training AI models on the largest AI superclusters globally. This role combines technical leadership with hands-on development, requiring expertise in large-scale AI systems and software architecture. The position involves working with cutting-edge AI frameworks like PyTorch and JAX/XLA, ensuring near-zero downtime for critical training operations. The ideal candidate will lead a cross-functional team, driving innovations in AI software stack development and working directly with NVIDIA's senior leadership. This role offers the opportunity to impact the future of AI computing at one of technology's most innovative companies, working on projects that push the boundaries of what's possible in AI training and system resilience. The position comes with competitive compensation, including equity, and the chance to work with some of the industry's most talented engineers in a collaborative, innovation-driven environment.

Last updated 3 months ago

Responsibilities For Distinguished Engineer, AI Resiliency Lead

  • Define scalable software architecture for single-job resilient training on hundreds of thousands of GPUs
  • Design and deliver modular, resilient software features for large-scale AI training
  • Innovate resilient architecture designs to achieve stringent uptime requirements
  • Collaborate with internal partners and communicate progress to senior leadership
  • Lead a team of cross-functional experts
  • Drive and shape the end-to-end AI software stack

Requirements For Distinguished Engineer, AI Resiliency Lead

  • Python
  • Master's or Ph.D. in Computer Science, Electrical or Computer Engineering from a top-tier university, or equivalent experience
  • 15+ years of experience in software architecture or related fields
  • Deep understanding of AI-optimized systems
  • Excellent ability to collaborate and communicate across engineering teams
  • At least 5 years of hands-on experience in software development on high-complexity projects involving HPC or AI

Benefits For Distinguished Engineer, AI Resiliency Lead

  • Equity

Jobs Related To NVIDIA Distinguished Engineer, AI Resiliency Lead

Principal DGX Cloud Machine Learning Architect

Principal ML Architect role at NVIDIA focusing on optimizing generative AI models for DGX Cloud, requiring 15+ years of experience and offering competitive compensation.

Principal Engineer for AI Software Resiliency

Lead AI software resiliency development for the world's most powerful AI supercomputers at NVIDIA.

Senior Product Architect, HPC and AI

Senior Product Architect position at NVIDIA focusing on HPC and AI infrastructure design, offering competitive compensation and opportunity to shape the future of AI technology.

Principal Engineer, Distributed Machine Learning

Principal Engineer position at NVIDIA focusing on distributed machine learning and GPU acceleration for Apache Spark.

Senior Deep Learning Performance Architect

Senior Deep Learning Performance Architect role at NVIDIA focusing on developing next-generation AI architectures and optimizing deep learning performance.