As a member of the GPU AI/HPC Infrastructure team at NVIDIA, you will provide leadership in the design and implementation of groundbreaking GPU compute clusters that power AI research across the company. We seek an expert to build and operate these clusters with high reliability, efficiency, and performance, and to drive foundational improvements and automation that boost researchers' productivity.
As a Site Reliability Engineer, you are responsible for the big picture of how our systems relate to each other, using a breadth of tools and approaches to tackle a broad spectrum of problems. Practices such as limiting time spent on reactive operational work, running blameless postmortems, and proactively identifying potential outages drive the iterative improvement that is key to both product quality and dynamic, engaging day-to-day work.
In this role, you will be:
- Building and improving our ecosystem around GPU-accelerated computing, including developing large-scale automation solutions
- Building and maintaining large-scale AI/HPC GPU clusters for deep learning
- Supporting researchers in running their workflows on our clusters, including performance analysis and optimization
- Designing, implementing, and supporting operational and reliability aspects of large-scale distributed systems
- Optimizing cluster operations for maximum reliability, efficiency, and performance
- Driving foundational improvements and automation to enhance researcher productivity
- Troubleshooting, diagnosing, and root-causing system failures
- Scaling systems sustainably through automation and evolving systems to improve reliability and velocity
- Participating in on-call rotation to support production systems
- Writing and reviewing code, developing documentation and capacity plans
- Implementing remediations across the software and hardware stack
Required qualifications:
- Bachelor's degree in Computer Science, Electrical Engineering, or related field (or equivalent experience)
- 5+ years of experience designing and operating large-scale compute infrastructure
- Proven experience in site reliability engineering for high-performance computing environments
- Operational experience with clusters of at least 2,000 GPUs
- Deep understanding of GPU computing and AI infrastructure
- Experience with AI/HPC advanced job schedulers (e.g., Slurm)
- Knowledge of cluster configuration management tools (e.g., BCM, Ansible)
- Experience with container technologies (e.g., Docker, Enroot)
- Programming skills in Python and Bash
Join NVIDIA's diverse and intellectually curious team, collaborating in a blame-free environment to tackle meaningful projects and drive innovation in AI and GPU computing.