Senior Site Reliability Engineer - GPU Clusters

World leader in accelerated computing, pioneering AI and digital twin technology.
$180,000 - $339,250
Site Reliability
Senior Software Engineer
In-Person
5,000+ Employees
7+ years of experience
AI · Enterprise SaaS

Description For Senior Site Reliability Engineer - GPU Clusters

NVIDIA, the pioneer in GPU technology and leader in accelerated computing, is seeking a Senior Site Reliability Engineer to spearhead the management of its GPU clusters. This role sits at the intersection of AI innovation and infrastructure management, where you'll be responsible for designing and maintaining the backbone of NVIDIA's AI computing capabilities.

The position offers an opportunity to work with cutting-edge technology in AI and machine learning, managing large-scale GPU clusters that power crucial workloads across multiple teams. You'll be part of a team that's directly contributing to the advancement of artificial intelligence and high-performance computing, working with the latest GPU technologies including the GB200.

As an SRE, you'll be responsible for ensuring the reliability and performance of critical infrastructure, implementing automation solutions, and maintaining high availability of services. The role combines traditional SRE responsibilities with specialized knowledge in GPU cluster management, making it a unique opportunity for those interested in high-performance computing infrastructure.

The position offers competitive compensation ranging from $180,000 to $339,250, plus equity benefits. You'll be working from one of NVIDIA's major tech hubs, collaborating with researchers, AI engineers, and infrastructure teams. This role is perfect for someone who has a strong background in infrastructure management, a passion for operational excellence, and wants to be at the forefront of AI technology advancement.

The ideal candidate will bring experience in cloud services, containerization, and infrastructure automation, along with strong problem-solving abilities and excellent communication skills. You'll be joining a company that's driving innovation in AI, autonomous vehicles, and high-performance computing, making this an excellent opportunity for career growth and impact.

Last updated 20 days ago

Responsibilities For Senior Site Reliability Engineer - GPU Clusters

  • Design, deploy, and support large-scale, distributed GPU clusters for AI and ML workloads
  • Improve infrastructure provisioning, management, and monitoring through automation
  • Ensure high uptime and quality of service through operational excellence
  • Support globally distributed cloud environments (AWS, GCP, Azure, OCI) and on-premises infrastructure
  • Define and implement SLOs and SLIs
  • Write Root Cause Analysis reports for production incidents
  • Participate in on-call rotation
  • Drive evaluation and integration of new GPU technologies

Requirements For Senior Site Reliability Engineer - GPU Clusters

Python · Go · Kubernetes · Linux
  • BS degree in Computer Science or equivalent experience
  • 7+ years of software engineering experience
  • 3+ years managing GPU clusters or similar environments
  • Expertise in production-level cloud services
  • Proficiency with Kubernetes, Docker, or similar tools
  • Experience in Python, Go, or Ruby
  • Strong proficiency with Linux and TCP/IP
  • Experience with CI/CD, GitOps, and Infrastructure as Code
  • Strong communication and documentation skills

Benefits For Senior Site Reliability Engineer - GPU Clusters

  • Equity

Jobs Related To NVIDIA Senior Site Reliability Engineer - GPU Clusters

Senior SRE Software Engineer, Storage and Data

Senior SRE position at NVIDIA focusing on storage infrastructure reliability and performance optimization for DGX Cloud platform.

SRE Software Engineer, Storage and Data

Senior SRE position at NVIDIA focusing on storage infrastructure reliability and performance optimization for GPU cloud platforms.

Senior Production SRE Engineer - Storage

Senior Production SRE Engineer position at NVIDIA focusing on storage systems, requiring 5+ years experience in large-scale system reliability and storage architecture.

Senior Site Reliability Engineer - Observability and Telemetry Platform

Senior SRE position at NVIDIA focusing on observability and telemetry platforms, offering competitive salary, equity, and remote work options.

Senior Site Reliability Engineer - DGX Cloud

Senior SRE position at NVIDIA focusing on DGX Cloud infrastructure, offering competitive compensation and opportunity to work with cutting-edge cloud technologies.