Senior Site Reliability Engineer - AI Research Clusters

NVIDIA is the world leader in accelerated computing, pioneering AI and digital twins to transform industries and society.
Santa Clara, CA, USA · Westford, MA 01886, USA · Austin, TX, USA
$180,000 - $339,250
Site Reliability
Senior Software Engineer
Hybrid
5,000+ Employees
6+ years of experience
AI · Enterprise SaaS

Description For Senior Site Reliability Engineer - AI Research Clusters

NVIDIA is seeking a Senior Site Reliability Engineer for its AI Research Clusters team. As a member of the GPU AI/HPC Infrastructure team, you'll lead the design and implementation of groundbreaking GPU compute clusters that power AI research across NVIDIA. The role involves building and operating these clusters for high reliability, efficiency, and performance, while driving improvements and automation to enhance researcher productivity.

Key responsibilities include:

  • Designing and implementing state-of-the-art GPU compute clusters
  • Optimizing cluster operations for maximum reliability, efficiency, and performance
  • Driving foundational improvements and automation to enhance researcher productivity
  • Tackling strategic challenges in large-scale, high-performance computing environments
  • Troubleshooting and diagnosing system failures
  • Implementing sustainable incident response and blameless postmortems
  • Participating in an on-call rotation to support production systems
  • Writing and reviewing code, developing documentation and capacity plans
  • Managing upgrades and automated rollbacks across all clusters

Requirements:

  • Bachelor's degree in Computer Science, Electrical Engineering, or a related field
  • 6+ years of experience designing and operating large-scale compute infrastructure
  • Proven experience in site reliability engineering for high-performance computing environments
  • Deep understanding of GPU computing and AI infrastructure
  • Experience with AI/HPC advanced job schedulers (e.g., Slurm)
  • Solid experience with GPU clusters and cluster configuration management tools
  • In-depth understanding of container technologies
  • Experience programming in Python and Bash scripting

NVIDIA offers competitive salaries, equity, and comprehensive benefits. The company values diversity and fosters an inclusive work environment.

Last updated 2 months ago

Responsibilities For Senior Site Reliability Engineer - AI Research Clusters

  • Design and implement state-of-the-art GPU compute clusters
  • Optimize cluster operations for maximum reliability, efficiency, and performance
  • Drive foundational improvements and automation to enhance researcher productivity
  • Tackle strategic challenges in large-scale, high-performance computing environments
  • Troubleshoot, diagnose, and root-cause system failures
  • Implement sustainable incident response and blameless postmortems
  • Participate in an on-call rotation to support production systems
  • Write and review code, develop documentation and capacity plans
  • Manage upgrades and automated rollbacks across all clusters

Requirements For Senior Site Reliability Engineer - AI Research Clusters

Python · Kubernetes · Linux
  • Bachelor's degree in Computer Science, Electrical Engineering, or a related field
  • 6+ years of experience designing and operating large-scale compute infrastructure
  • Proven experience in site reliability engineering for high-performance computing environments
  • Deep understanding of GPU computing and AI infrastructure
  • Experience with AI/HPC advanced job schedulers (e.g., Slurm)
  • Solid experience with GPU clusters and cluster configuration management tools
  • In-depth understanding of container technologies
  • Experience programming in Python and Bash scripting

Benefits For Senior Site Reliability Engineer - AI Research Clusters

  • Equity
  • Comprehensive benefits package
