Distinguished Engineer, AI Resiliency Lead

NVIDIA is the world leader in accelerated computing, pioneering solutions in AI and digital twins.
$308,000 - $471,500
Machine Learning
Principal Software Engineer
In-Person
5,000+ Employees
15+ years of experience
AI

Description For Distinguished Engineer, AI Resiliency Lead

NVIDIA is seeking a Distinguished Engineer to lead its AI Resiliency initiatives: architecting and developing software resiliency features for training AI models on the largest AI superclusters in the world. The role combines technical leadership with hands-on development and requires expertise in large-scale AI systems and software architecture. It involves working with AI frameworks such as PyTorch and JAX/XLA to ensure near-zero downtime for critical training operations. The ideal candidate will lead a cross-functional team, drive innovation across the AI software stack, and work directly with NVIDIA's senior leadership.

The role offers the opportunity to shape the future of AI computing through projects that push the limits of AI training and system resilience. It comes with competitive compensation, including equity, and the chance to work with talented engineers in a collaborative, innovation-driven environment.

Last updated a day ago

Responsibilities For Distinguished Engineer, AI Resiliency Lead

  • Define scalable software architecture for single-job resilient training on hundreds of thousands of GPUs
  • Design and deliver modular, resilient software features for large-scale AI training
  • Innovate resilient architecture designs to achieve stringent uptime requirements
  • Collaborate with internal partners and communicate progress to senior leadership
  • Lead a team of cross-functional experts
  • Drive and shape the end-to-end AI software stack

Requirements For Distinguished Engineer, AI Resiliency Lead

Python
  • Master's or Ph.D. in Computer Science, Electrical Engineering, or Computer Engineering from a top-tier university, or equivalent experience
  • 15+ years of experience in software architecture or related fields
  • Deep understanding of AI-optimized systems
  • Excellent ability to collaborate and communicate across engineering teams
  • At least 5 years of hands-on experience in software development on high-complexity projects involving HPC or AI

Benefits For Distinguished Engineer, AI Resiliency Lead

  • Equity
