Senior Site Reliability Engineer - AI Research Clusters

NVIDIA pioneered accelerated computing to tackle challenges no one else can solve, and is the world leader in the field. Our work in AI and digital twins is transforming the world's largest industries and profoundly impacting society.
Site Reliability
Senior Software Engineer
Hybrid
5+ years of experience
AI

Description For Senior Site Reliability Engineer - AI Research Clusters

As a member of the GPU AI/HPC Infrastructure team at NVIDIA, you will provide leadership in the design and implementation of groundbreaking GPU compute clusters that power all AI research across the company. We seek an expert to build and operate these clusters with high reliability, efficiency, and performance, and to drive foundational improvements and automation that improve researchers' productivity.

As a Site Reliability Engineer, you are responsible for the big picture of how our systems relate to each other, using a breadth of tools and approaches to tackle a broad spectrum of problems. Practices such as limiting time spent on reactive operational work, holding blameless postmortems, and proactively identifying potential outages factor into iterative improvement that is key to both product quality and interesting, dynamic day-to-day work.

In this role, you will be:

  • Building and improving our ecosystem around GPU-accelerated computing, including developing large-scale automation solutions
  • Maintaining and building deep learning AI-HPC GPU clusters at scale
  • Supporting researchers to run their workflows on our clusters, including performance analysis and optimizations
  • Designing, implementing, and supporting operational and reliability aspects of large-scale distributed systems
  • Optimizing cluster operations for maximum reliability, efficiency, and performance
  • Driving foundational improvements and automation to enhance researcher productivity
  • Troubleshooting, diagnosing, and root-causing system failures
  • Scaling systems sustainably through automation and evolving systems to improve reliability and velocity
  • Participating in an on-call rotation to support production systems
  • Writing and reviewing code, and developing documentation and capacity plans
  • Implementing remediations across the software and hardware stack

Required qualifications:

  • Bachelor's degree in Computer Science, Electrical Engineering, or related field (or equivalent experience)
  • 5+ years of experience designing and operating large-scale compute infrastructure
  • Proven experience in site reliability engineering for high-performance computing environments
  • Operational experience with clusters of at least 2,000 GPUs
  • Deep understanding of GPU computing and AI infrastructure
  • Experience with AI/HPC advanced job schedulers (e.g., Slurm)
  • Knowledge of cluster configuration management tools (e.g., BCM, Ansible)
  • Experience with container technologies (e.g., Docker, Enroot)
  • Programming skills in Python and Bash scripting

Join NVIDIA's diverse and intellectually curious team, collaborating in a blame-free environment to tackle meaningful projects and drive innovation in AI and GPU computing.

Responsibilities For Senior Site Reliability Engineer - AI Research Clusters

  • Design and implement state-of-the-art GPU compute clusters
  • Optimize cluster operations for maximum reliability, efficiency, and performance
  • Drive foundational improvements and automation to enhance researcher productivity
  • Troubleshoot, diagnose, and root-cause system failures
  • Scale systems sustainably through automation
  • Participate in an on-call rotation to support production systems
  • Write and review code, and develop documentation and capacity plans
  • Implement remediations across the software and hardware stack

Requirements For Senior Site Reliability Engineer - AI Research Clusters

Python
Kubernetes
Linux
MySQL
