Senior Site Reliability Engineer - GPU Clusters

World leader in accelerated computing, pioneering AI and digital twins technology.
$184,000 - $356,500
Site Reliability
Senior Software Engineer
In-Person
5,000+ Employees
7+ years of experience
AI · Enterprise SaaS

Description For Senior Site Reliability Engineer - GPU Clusters

NVIDIA, the pioneer in GPU technology and leader in accelerated computing, is seeking a Senior Site Reliability Engineer to lead the management of its large-scale GPU clusters. This role sits at the intersection of AI innovation and infrastructure management, supporting critical AI workloads across multiple teams and projects. The position offers an opportunity to work with cutting-edge technology in AI and machine learning infrastructure.

The role demands expertise in managing high-performance computing environments, with a focus on GPU clusters that power AI workloads. You'll be responsible for designing, deploying, and maintaining these systems while ensuring optimal performance and reliability. The position requires strong technical skills in cloud computing, containerization, and automation, along with the ability to work in a multi-cloud environment.

As a Senior SRE, you'll collaborate with researchers, AI engineers, and infrastructure teams, contributing to NVIDIA's mission of accelerating the next wave of artificial intelligence. The role offers competitive compensation ($184,000 - $356,500) plus equity, and the opportunity to work with a company at the forefront of AI and digital twins technology. You'll be part of a team that values operational excellence and innovation, working on projects that directly impact the future of machine learning and artificial intelligence.

The ideal candidate will bring 7+ years of software engineering experience, with specific expertise in GPU clusters or similar high-performance computing environments. This role is perfect for someone who combines technical expertise with a passion for operational excellence and automation, and who thrives in a fast-paced, innovative environment.

Last updated 22 days ago

Responsibilities For Senior Site Reliability Engineer - GPU Clusters

  • Design, deploy and support large-scale, distributed GPU clusters for AI and ML workloads
  • Improve infrastructure provisioning, management, and monitoring through automation
  • Ensure high uptime and QoS through operational excellence
  • Support globally distributed cloud environments (AWS, GCP, Azure, OCI) and on-prem
  • Define and implement SLOs and SLIs
  • Write Root Cause Analysis reports
  • Participate in on-call rotation
  • Drive evaluation and integration of new GPU technologies

Requirements For Senior Site Reliability Engineer - GPU Clusters

Python · Go · Kubernetes · Linux
  • BS degree in Computer Science or equivalent experience
  • 7+ years of software engineering experience
  • 3+ years managing GPU clusters or similar environments
  • Expertise in production-level cloud services
  • Proficiency with Kubernetes, Docker, or similar tools
  • Experience with Python, Go, or Ruby
  • Strong Linux and TCP/IP knowledge
  • Proficiency in CI/CD, GitOps, and Infrastructure as Code
  • Strong communication and documentation skills

Benefits For Senior Site Reliability Engineer - GPU Clusters

  • Equity
  • Benefits package

