Principal Engineer for AI Software Resiliency

World leader in accelerated computing, pioneering AI and digital twin technology to transform industries.
$272,000 - $425,500
Machine Learning
Principal Software Engineer
In-Person
10+ years of experience
AI · Enterprise SaaS

Description For Principal Engineer for AI Software Resiliency

NVIDIA is seeking a Principal Software Engineer to spearhead AI software resiliency development for the world's most powerful AI supercomputers. The role focuses on systems operating at a scale of 100,000+ GPUs and demands expertise in distributed systems and AI infrastructure, combining technical leadership with hands-on development.

The role involves architecting and implementing critical resiliency features for AI supercomputers, including checkpoint-recovery systems, error detection mechanisms, and performance optimization. You'll work directly with major customers and cross-functional teams to integrate these features into frameworks like PyTorch and JAX/XLA.

As a Principal Engineer, you'll lead by example in engineering excellence, fostering innovation while ensuring high code quality and rigorous testing standards. The position requires deep technical expertise combined with strong collaborative skills to work effectively across multiple engineering disciplines.

NVIDIA offers a competitive compensation package, including a base salary range of $272,000-$425,500, plus equity. The company is recognized as one of the world's most desirable technology employers, known for its pioneering work in AI computing and GPU technology. This role presents an exceptional opportunity to impact the future of AI computing infrastructure while working with cutting-edge technology and industry-leading experts.

The ideal candidate will bring extensive experience in distributed systems, AI frameworks, and large-scale infrastructure, along with a passion for developing AI-specific system architectures. This role is perfect for someone who thrives on solving complex technical challenges and wants to be at the forefront of AI technology advancement.

Last updated a day ago

Responsibilities For Principal Engineer for AI Software Resiliency

  • Serve as a trusted authority on AI software resiliency
  • Lead execution and development of software resiliency features
  • Drive engineering excellence and contribute to large software codebases
  • Work closely with multiple teams across NVIDIA
  • Collaborate directly with major customers
  • Partner with TPMs, PMs, and QA teams for feature launches

Requirements For Principal Engineer for AI Software Resiliency

  • Python
  • Master's or Ph.D. in Computer Science, Electrical Engineering, Computer Engineering, or related field
  • Minimum 10 years of experience in systems architecture or related fields
  • At least 10 years of hands-on experience in software development for distributed systems
  • 5 years of experience developing AI frameworks such as PyTorch or JAX/XLA
  • Proven track record of working effectively across multiple engineering fields
  • Deep understanding of distributed systems and large-scale AI infrastructure

Benefits For Principal Engineer for AI Software Resiliency

  • Equity
