As a member of the GPU AI/HPC Infrastructure team at NVIDIA, you will provide leadership in the design and implementation of groundbreaking GPU compute clusters that power AI research across the company. We seek an expert to build and operate these clusters with high reliability, efficiency, and performance, and to drive foundational improvements and automation that boost researchers' productivity.
As a Site Reliability Engineer, you are responsible for the big picture of how our systems relate to each other, using a breadth of tools and approaches to tackle a broad spectrum of problems. Practices such as limiting time spent on reactive operational work, running blameless postmortems, and proactively identifying potential outages drive the iterative improvement that is key to both product quality and dynamic, engaging day-to-day work.
In this role, you will be:
- Building and improving our ecosystem around GPU-accelerated computing, including developing large-scale automation solutions
- Building and maintaining large-scale AI/HPC GPU clusters for deep learning
- Supporting researchers in running their workflows on our clusters, including performance analysis and optimization
- Designing, implementing, and supporting operational and reliability aspects of large-scale distributed systems
- Optimizing cluster operations for maximum reliability, efficiency, and performance
- Driving foundational improvements and automation to enhance researcher productivity
- Troubleshooting, diagnosing, and root-causing system failures
- Scaling systems sustainably through automation and evolving systems to improve reliability and velocity
- Participating in on-call rotation to support production systems
- Writing and reviewing code, developing documentation and capacity plans
- Implementing remediations across the software and hardware stack
Required qualifications:
- Bachelor's degree in Computer Science, Electrical Engineering, or related field (or equivalent experience)
- 5+ years of experience designing and operating large-scale compute infrastructure
- Proven experience in site reliability engineering for high-performance computing environments
- Operational experience with clusters of at least 2,000 GPUs
- Deep understanding of GPU computing and AI infrastructure
- Experience with AI/HPC advanced job schedulers (e.g., Slurm)
- Knowledge of cluster configuration management tools (e.g., BCM, Ansible)
- Experience with container technologies (e.g., Docker, Enroot)
- Programming skills in Python and Bash
Join NVIDIA's diverse and intellectually curious team, collaborating in a blame-free environment to tackle meaningful projects and drive innovation in AI and GPU computing.