Distinguished Engineer, AI Resiliency Lead

NVIDIA is the world leader in accelerated computing, pioneering solutions in AI and digital twins.
$308,000 - $471,500
Machine Learning
Principal Software Engineer
In-Person
5,000+ Employees
15+ years of experience
AI

Description For Distinguished Engineer, AI Resiliency Lead

NVIDIA is seeking a Distinguished Engineer to lead its AI Resiliency initiatives, architecting and developing software resiliency features for training AI models on the world's largest AI superclusters. The role combines technical leadership with hands-on development and requires expertise in large-scale AI systems and software architecture. The work involves AI frameworks such as PyTorch and JAX/XLA, with the goal of near-zero downtime for critical training operations. The ideal candidate will lead a cross-functional team, drive innovation across the AI software stack, and work directly with NVIDIA's senior leadership. The position offers the chance to shape AI training and system resilience alongside experienced engineers in a collaborative environment. Compensation is competitive and includes equity.

Last updated 3 months ago

Responsibilities For Distinguished Engineer, AI Resiliency Lead

  • Define scalable software architecture for single-job resilient training on hundreds of thousands of GPUs
  • Design and deliver modular, resilient software features for large-scale AI training
  • Innovate resilient architecture designs to achieve stringent uptime requirements
  • Collaborate with internal partners and communicate progress to senior leadership
  • Lead a team of cross-functional experts
  • Drive and shape the end-to-end AI software stack

Requirements For Distinguished Engineer, AI Resiliency Lead

  • Python
  • Master's or Ph.D. in Computer Science, Electrical or Computer Engineering from a top-tier university, or equivalent experience
  • 15+ years of experience in software architecture or related fields
  • Deep understanding of AI-optimized systems
  • Excellent ability to collaborate and communicate across engineering teams
  • At least 5 years of hands-on experience in software development on high-complexity projects involving HPC or AI

Benefits For Distinguished Engineer, AI Resiliency Lead

  • Equity

