Senior Site Reliability Engineer - AI Research Clusters

NVIDIA is the world leader in accelerated computing, a field it pioneered to tackle challenges no one else can solve. Our work in AI and digital twins is transforming the world's largest industries and profoundly impacting society.
Site Reliability
Senior Software Engineer
Hybrid
5+ years of experience
AI

Description For Senior Site Reliability Engineer - AI Research Clusters

As a member of the GPU AI/HPC Infrastructure team at NVIDIA, you will provide leadership in the design and implementation of groundbreaking GPU compute clusters that power all AI research across the company. We seek an expert to build and operate these clusters with high reliability, efficiency, and performance, and to drive foundational improvements and automation that raise researchers' productivity.

As a Site Reliability Engineer, you are responsible for the big picture of how our systems relate to each other, using a breadth of tools and approaches to tackle a broad spectrum of problems. Practices such as limiting time spent on reactive operational work, blameless postmortems, and proactive identification of potential outages factor into the iterative improvement that is key to both product quality and interesting, dynamic day-to-day work.

In this role, you will be:

  • Building and improving our ecosystem around GPU-accelerated computing, including developing large-scale automation solutions
  • Maintaining and building deep learning AI-HPC GPU clusters at scale
  • Supporting researchers to run their workflows on our clusters, including performance analysis and optimizations
  • Designing, implementing, and supporting operational and reliability aspects of large-scale distributed systems
  • Optimizing cluster operations for maximum reliability, efficiency, and performance
  • Driving foundational improvements and automation to enhance researcher productivity
  • Troubleshooting, diagnosing, and root-causing system failures
  • Scaling systems sustainably through automation and evolving systems to improve reliability and velocity
  • Participating in an on-call rotation to support production systems
  • Writing and reviewing code, developing documentation and capacity plans
  • Implementing remediations across the software and hardware stack

Required qualifications:

  • Bachelor's degree in Computer Science, Electrical Engineering, or related field (or equivalent experience)
  • 5+ years of experience designing and operating large-scale compute infrastructure
  • Proven experience in site reliability engineering for high-performance computing environments
  • Operational experience with a cluster of at least 2,000 GPUs
  • Deep understanding of GPU computing and AI infrastructure
  • Experience with advanced AI/HPC job schedulers (e.g., Slurm)
  • Knowledge of cluster configuration management tools (e.g., BCM, Ansible)
  • Experience with container technologies (e.g., Docker, Enroot)
  • Programming skills in Python and Bash scripting

Join NVIDIA's diverse and intellectually curious team, collaborating in a blame-free environment to tackle meaningful projects and drive innovation in AI and GPU computing.


Responsibilities For Senior Site Reliability Engineer - AI Research Clusters

  • Design and implement state-of-the-art GPU compute clusters
  • Optimize cluster operations for maximum reliability, efficiency, and performance
  • Drive foundational improvements and automation to enhance researcher productivity
  • Troubleshoot, diagnose, and root-cause system failures
  • Scale systems sustainably through automation
  • Participate in an on-call rotation to support production systems
  • Write and review code, develop documentation and capacity plans
  • Implement remediations across the software and hardware stack

Requirements For Senior Site Reliability Engineer - AI Research Clusters

Python
Kubernetes
Linux
MySQL
