NVIDIA is seeking a Senior Site Reliability Engineer for its AI research clusters. As a member of the GPU AI/HPC Infrastructure team, you will provide leadership in designing and implementing groundbreaking GPU compute clusters that power all AI research across NVIDIA. You'll be responsible for building and operating these clusters with high reliability, efficiency, and performance, while driving foundational improvements and automation to enhance researcher productivity.
Key responsibilities include:
- Designing and implementing state-of-the-art GPU compute clusters
- Optimizing cluster operations for maximum reliability, efficiency, and performance
- Tackling strategic challenges in large-scale, high-performance computing environments
- Troubleshooting and diagnosing system failures
- Building automation for AI/HPC GPU cluster bring-up and operation at scale
- Implementing remediations across software and hardware stacks
Requirements:
- Bachelor's degree in Computer Science, Electrical Engineering, or related field
- Minimum 5 years of experience designing and operating large-scale compute infrastructure
- Proven experience in site reliability engineering for high-performance computing environments
- Deep understanding of GPU computing and AI infrastructure
- Experience with AI/HPC advanced job schedulers (e.g., Slurm)
- Knowledge of cluster configuration management tools and infrastructure-level applications
- In-depth understanding of container technologies
- Programming experience in Python and Bash scripting
Preferred qualifications:
- Familiarity with NVIDIA GPUs, CUDA programming, NCCL, and MLPerf benchmarking
- Experience with InfiniBand, IPoIB, and RDMA
- Understanding of fast, distributed storage systems for AI/HPC workloads
- Familiarity with deep learning frameworks like PyTorch and TensorFlow
NVIDIA offers competitive salaries, comprehensive benefits, and the opportunity to work with some of the most brilliant and talented people in the world. Join our diverse team and help shape the future of AI and accelerated computing!