Senior Site Reliability Engineer - AI Research Clusters

NVIDIA

World leader in accelerated computing, pioneering AI and digital twins technology to transform industries.

Santa Clara, CA, USA • Westford, MA 01886, USA • Austin, TX, USA…

$184,000 - $425,500

Site Reliability

Senior Software Engineer

Hybrid

5,000+ Employees

6+ years of experience

Description For Senior Site Reliability Engineer - AI Research Clusters

NVIDIA, the pioneer in AI and accelerated computing, is seeking a Senior Site Reliability Engineer to join its GPU AI/HPC Infrastructure team. This role is crucial in designing and implementing cutting-edge GPU compute clusters that power NVIDIA's AI research initiatives. As an SRE, you'll be at the forefront of maintaining and optimizing large-scale AI infrastructure, working with some of the most advanced computing systems in the world.

The position offers an opportunity to work with NVIDIA's state-of-the-art GPU technology and contribute to the infrastructure that enables breakthrough AI research. You'll be responsible for ensuring the reliability, efficiency, and performance of massive GPU clusters while implementing automation solutions to enhance researcher productivity. The role combines hands-on technical work with strategic thinking about system architecture and optimization.

The ideal candidate will bring deep expertise in GPU computing, AI infrastructure, and large-scale system operations. You'll work in a culture that values diversity, intellectual curiosity, and problem-solving, with opportunities to collaborate with brilliant minds in the field. The position offers competitive compensation, including a substantial base salary range of $184,000 to $425,500, plus equity and comprehensive benefits.

This is an excellent opportunity for experienced engineers who are passionate about high-performance computing and want to make a significant impact in the AI field. You'll be working with cutting-edge technology, solving complex technical challenges, and contributing to NVIDIA's mission of advancing AI and accelerated computing. The role offers both technical depth and the chance to influence the direction of critical research infrastructure.

Last updated 5 hours ago

Responsibilities For Senior Site Reliability Engineer - AI Research Clusters

Design and implement state-of-the-art GPU compute clusters
Optimize cluster operations for maximum reliability, efficiency, and performance
Drive foundational improvements and automation to enhance researcher productivity
Troubleshoot and diagnose system failures
Scale systems through automation
Participate in on-call rotation to support production systems
Write and review code, develop documentation and capacity plans
Manage upgrades and automated rollbacks across all clusters

Requirements For Senior Site Reliability Engineer - AI Research Clusters

Python

Kubernetes

Linux

Bachelor's degree in Computer Science, Electrical Engineering or related field
6+ years of experience designing and operating large scale compute infrastructure
Operational experience with clusters of at least 5,000 GPUs
Deep understanding of GPU computing and AI infrastructure
Experience with AI/HPC advanced job schedulers (Slurm)
Experience with cluster configuration management tools (BCM or Ansible)
Knowledge of container technologies such as Docker and Enroot
Experience programming in Python and scripting in Bash

Benefits For Senior Site Reliability Engineer - AI Research Clusters

Medical Insurance

Equity

Competitive base salary
Equity compensation
Comprehensive benefits package
