NVIDIA is seeking a Senior Site Reliability Engineer for its AI Research Clusters team. As a member of the GPU AI/HPC Infrastructure team, you'll lead the design and implementation of groundbreaking GPU compute clusters that power AI research across NVIDIA. You will build and operate these clusters for high reliability, efficiency, and performance, while driving improvements and automation to enhance researcher productivity.
Key responsibilities include:
- Designing and implementing state-of-the-art GPU compute clusters
- Optimizing cluster operations for maximum reliability, efficiency, and performance
- Driving foundational improvements and automation to enhance researcher productivity
- Tackling strategic challenges in large-scale, high-performance computing environments
- Troubleshooting and diagnosing system failures
- Implementing sustainable incident response and blameless postmortems
- Participating in an on-call rotation to support production systems
- Writing and reviewing code, developing documentation and capacity plans
- Managing upgrades and automated rollbacks across all clusters
Requirements:
- Bachelor's degree in Computer Science, Electrical Engineering, or a related field
- 6+ years of experience designing and operating large-scale compute infrastructure
- Proven experience in site reliability engineering for high-performance computing environments
- Deep understanding of GPU computing and AI infrastructure
- Experience with advanced AI/HPC job schedulers (e.g., Slurm)
- Solid experience with GPU clusters and cluster configuration management tools
- In-depth understanding of container technologies
- Programming experience in Python and Bash
NVIDIA offers competitive salaries, equity, and comprehensive benefits. The company values diversity and fosters an inclusive work environment.