Senior Site Reliability Engineer - AI Research Clusters

World leader in accelerated computing, pioneering AI and digital twin technology to transform industries.
Santa Clara, CA, USA; Westford, MA 01886, USA; Austin, TX, USA
$184,000 - $425,500
Site Reliability
Senior Software Engineer
Hybrid
5,000+ Employees
6+ years of experience
AI

Description For Senior Site Reliability Engineer - AI Research Clusters

NVIDIA, the pioneer in AI and accelerated computing, is seeking a Senior Site Reliability Engineer to join its GPU AI/HPC Infrastructure team. The role centers on designing and implementing the GPU compute clusters that power NVIDIA's AI research initiatives. As an SRE, you'll maintain and optimize large-scale AI infrastructure, working with some of the most advanced computing systems in the world.

The position offers an opportunity to work with NVIDIA's state-of-the-art GPU technology and contribute to the infrastructure that enables breakthrough AI research. You'll be responsible for ensuring the reliability, efficiency, and performance of massive GPU clusters while implementing automation solutions to enhance researcher productivity. The role combines hands-on technical work with strategic thinking about system architecture and optimization.

The ideal candidate will bring deep expertise in GPU computing, AI infrastructure, and large-scale system operations. You'll work in a culture that values diversity, intellectual curiosity, and problem-solving, with opportunities to collaborate with brilliant minds in the field. The position offers competitive compensation, including a substantial base salary range of $184,000 to $425,500, plus equity and comprehensive benefits.

This is an excellent opportunity for experienced engineers who are passionate about high-performance computing and want to make a significant impact in the AI field. You'll be working with cutting-edge technology, solving complex technical challenges, and contributing to NVIDIA's mission of advancing AI and accelerated computing. The role offers both technical depth and the chance to influence the direction of critical research infrastructure.

Responsibilities For Senior Site Reliability Engineer - AI Research Clusters

  • Design and implement state-of-the-art GPU compute clusters
  • Optimize cluster operations for maximum reliability, efficiency, and performance
  • Drive foundational improvements and automation to enhance researcher productivity
  • Troubleshoot and diagnose system failures
  • Scale systems through automation
  • Participate in on-call rotation to support production systems
  • Write and review code, develop documentation and capacity plans
  • Manage upgrades and automated rollbacks across all clusters

Requirements For Senior Site Reliability Engineer - AI Research Clusters

Python
Kubernetes
Linux
  • Bachelor's degree in Computer Science, Electrical Engineering or related field
  • 6+ years of experience designing and operating large scale compute infrastructure
  • Operational experience with a cluster of at least 5,000 GPUs
  • Deep understanding of GPU computing and AI infrastructure
  • Experience with advanced AI/HPC job schedulers (Slurm)
  • Experience with cluster configuration management tools (BCM or Ansible)
  • Knowledge of container technologies such as Docker and Enroot
  • Experience with Python programming and Bash scripting

Benefits For Senior Site Reliability Engineer - AI Research Clusters

Medical Insurance
Equity
  • Competitive base salary
  • Equity compensation
  • Comprehensive benefits package
