NVIDIA is seeking a Principal Software Engineer to lead AI software resiliency development for the world's most powerful AI supercomputers. The role focuses on systems operating at a scale of 100,000+ GPUs. It demands expertise in distributed systems and AI infrastructure, combining technical leadership with hands-on development.
The role involves architecting and implementing critical resiliency features for AI supercomputers, including checkpoint-recovery systems, error detection mechanisms, and performance optimization. You'll work directly with major customers and cross-functional teams to integrate these features into frameworks like PyTorch and JAX/XLA.
As a Principal Engineer, you'll lead by example in engineering excellence, fostering innovation while ensuring high code quality and rigorous testing standards. The position requires deep technical expertise combined with strong collaborative skills to work effectively across multiple engineering disciplines.
NVIDIA offers a competitive compensation package, including a base salary range of $272,000-$425,500, plus equity. The company is recognized as one of the world's most desirable technology employers, known for its pioneering work in AI computing and GPU technology. The role offers the chance to shape the future of AI computing infrastructure while working alongside industry-leading experts.
The ideal candidate will bring extensive experience in distributed systems, AI frameworks, and large-scale infrastructure, along with a passion for developing AI-specific system architectures. The role suits someone who thrives on solving complex technical challenges and wants to help advance AI technology.