NVIDIA is seeking a Distinguished Engineer to lead its AI Resiliency initiatives, architecting and developing software resiliency features for training AI models on the world's largest AI superclusters. The role combines technical leadership with hands-on development and requires expertise in large-scale AI systems and software architecture. It involves working with AI frameworks such as PyTorch and JAX/XLA to ensure near-zero downtime for critical training operations. The ideal candidate will lead a cross-functional team, drive innovation across the AI software stack, and work directly with NVIDIA's senior leadership. The position offers the opportunity to shape the future of AI computing through projects that advance AI training and system resilience. Compensation is competitive and includes equity, and the role involves working alongside talented engineers in a collaborative, innovation-driven environment.