Senior Site Reliability Engineer - AI Research Clusters

World leader in accelerated computing, pioneering AI and digital twin technology to transform industries.
Site Reliability
Senior Software Engineer
Hybrid
5+ years of experience
AI

Description For Senior Site Reliability Engineer - AI Research Clusters

NVIDIA is seeking a Senior Site Reliability Engineer to join its GPU AI/HPC Infrastructure team. This role focuses on designing and implementing cutting-edge GPU compute clusters that power AI research across NVIDIA. The ideal candidate will be responsible for building and operating these clusters with high reliability, efficiency, and performance while driving foundational improvements and automation to enhance researcher productivity.

As an SRE at NVIDIA, you'll be part of a diverse team that values intellectual curiosity and problem-solving. The role involves working with a broad spectrum of tools and approaches, implementing practices such as limiting reactive operational work, conducting blameless postmortems, and proactively identifying potential outages. You'll be immersed in an environment that promotes self-direction while providing necessary support and mentorship for growth.

The position requires expertise in GPU computing and AI infrastructure, with hands-on experience managing large-scale compute clusters of at least 2K GPUs. You'll work with technologies like Kubernetes, Docker, and various configuration management tools while utilizing programming skills in Python and Bash scripting. The role offers exposure to cutting-edge AI research infrastructure and the opportunity to impact NVIDIA's groundbreaking work in artificial intelligence.

NVIDIA's culture emphasizes diversity, openness, and collaboration in a blame-free environment. The company has a strong track record of innovation, from inventing the GPU that sparked the PC gaming market to revolutionizing parallel computing and advancing AI research worldwide. This role offers the chance to be part of a company that consistently evolves and adapts to new opportunities while making lasting impacts on the world.

Responsibilities For Senior Site Reliability Engineer - AI Research Clusters

  • Design and implement state-of-the-art GPU compute clusters
  • Optimize cluster operations for maximum reliability, efficiency, and performance
  • Drive foundational improvements and automation to enhance researcher productivity
  • Troubleshoot, diagnose, and root cause system failures
  • Scale systems through automation
  • Participate in on-call rotation to support production systems
  • Write and review code, develop documentation and capacity plans
  • Manage upgrades and automated rollbacks across all clusters

Requirements For Senior Site Reliability Engineer - AI Research Clusters

Python
Kubernetes
Linux
MySQL
  • Bachelor's degree in Computer Science, Electrical Engineering, or a related field
  • 5+ years of experience designing and operating large-scale compute infrastructure
  • Operational experience with a cluster of at least 2K GPUs
  • Deep understanding of GPU computing and AI infrastructure
  • Experience with AI/HPC advanced job schedulers (Slurm)
  • Working knowledge of cluster configuration management tools (BCM, Ansible)
  • In-depth understanding of container technologies (Docker, Enroot)
  • Experience programming in Python and Bash scripting

Jobs Related To NVIDIA Senior Site Reliability Engineer - AI Research Clusters

Senior Site Reliability Engineer - Cloud

Senior Site Reliability Engineer position at NVIDIA focusing on cloud infrastructure, Kubernetes, and large-scale system reliability with competitive compensation and benefits.

Senior SRE Software Engineer, Storage and Data

Senior SRE position at NVIDIA focusing on storage infrastructure reliability and performance optimization for DGX Cloud platform.

SRE Software Engineer, Storage and Data

Senior SRE position at NVIDIA focusing on storage infrastructure reliability and performance optimization for GPU cloud platforms.

Senior Site Reliability Engineer - Observability and Telemetry Platform

Senior SRE position at NVIDIA focusing on observability and telemetry platforms, offering competitive salary, equity, and remote work options.

Senior Site Reliability Engineer - DGX Cloud

Senior SRE position at NVIDIA focusing on DGX Cloud infrastructure, offering competitive compensation and opportunity to work with cutting-edge cloud technologies.