Senior Site Reliability Engineer - AI Research Clusters

World leader in accelerated computing, pioneering AI and digital twin technology that is transforming major industries.
Site Reliability
Senior Software Engineer
Hybrid
5+ years of experience
AI · Enterprise SaaS

Description For Senior Site Reliability Engineer - AI Research Clusters

NVIDIA is seeking a Senior Site Reliability Engineer to join its GPU AI/HPC Infrastructure team. This role focuses on designing and implementing cutting-edge GPU compute clusters that power AI research across NVIDIA. The position requires an expert who can build and operate clusters with high reliability, efficiency, and performance while driving improvements and automation to enhance researcher productivity.

As an SRE, you'll be responsible for overseeing system interactions, using a variety of tools and approaches to address a wide range of challenges. The role emphasizes proactive problem-solving, including limiting reactive operational work and conducting blameless postmortems. NVIDIA's culture promotes diversity, intellectual curiosity, and innovation.

The role involves building and improving the ecosystem around GPU-accelerated computing, developing large-scale automation solutions, and maintaining AI-HPC GPU clusters. You'll support researchers in optimizing their workflows and focus on performance scaling, real-time monitoring, and system reliability.

Key responsibilities include designing state-of-the-art GPU compute clusters, optimizing operations, implementing automation, and participating in an on-call rotation. The ideal candidate will have extensive experience with GPU computing and AI infrastructure, along with proven expertise in site reliability engineering for high-performance computing environments.

This position offers the opportunity to work with some of the largest and most complex systems in the world, contributing to groundbreaking advancements in AI research infrastructure. The role combines technical expertise with strategic thinking, making it perfect for those passionate about large-scale distributed systems and cutting-edge technology.

Last updated 13 days ago

Responsibilities For Senior Site Reliability Engineer - AI Research Clusters

  • Design and implement state-of-the-art GPU compute clusters
  • Optimize cluster operations for maximum reliability, efficiency, and performance
  • Drive foundational improvements and automation to enhance researcher productivity
  • Troubleshoot and diagnose system failures
  • Implement automation and evolve systems for improved reliability
  • Participate in on-call rotation to support production systems
  • Write and review code, develop documentation and capacity plans
  • Manage upgrades and automated rollbacks across clusters

Requirements For Senior Site Reliability Engineer - AI Research Clusters

Python · Kubernetes · Linux
  • Bachelor's degree in Computer Science, Electrical Engineering, or a related field
  • 5+ years of experience designing and operating large-scale compute infrastructure
  • Operational experience with clusters of at least 2,000 GPUs
  • Deep understanding of GPU computing and AI infrastructure
  • Experience with AI/HPC advanced job schedulers (Slurm)
  • Knowledge of cluster configuration management tools (BCM, Ansible)
  • Understanding of container technologies (Docker, Enroot)
  • Programming experience in Python and Bash scripting
