Senior Site Reliability Engineer - GPU Clusters

NVIDIA

World leader in accelerated computing, pioneering AI and digital twins technology.

San Francisco, CA, USA • Boston, MA, USA • Austin, TX, USA

$184,000 - $356,500

Site Reliability

Senior Software Engineer

In-Person

5,000+ Employees

7+ years of experience

AI · Enterprise SaaS

Description For Senior Site Reliability Engineer - GPU Clusters

NVIDIA, the pioneer in GPU technology and leader in accelerated computing, is seeking a Senior Site Reliability Engineer to lead the management of its large-scale GPU clusters. This role sits at the intersection of AI innovation and infrastructure management, supporting critical AI workloads across multiple teams and projects. The position offers an opportunity to work with cutting-edge technology in AI and machine learning infrastructure.

The role demands expertise in managing high-performance computing environments, with a focus on GPU clusters that power AI workloads. You'll be responsible for designing, deploying, and maintaining these systems while ensuring optimal performance and reliability. The position requires strong technical skills in cloud computing, containerization, and automation, along with the ability to work in a multi-cloud environment.

As a Senior SRE, you'll collaborate with researchers, AI engineers, and infrastructure teams, contributing to NVIDIA's mission of accelerating the next wave of artificial intelligence. The role offers competitive compensation ($184,000 - $356,500) plus equity, and the opportunity to work at a company at the forefront of AI and digital twins technology. You'll be part of a team that values operational excellence and innovation, working on projects that directly impact the future of machine learning and artificial intelligence.

The ideal candidate will bring 7+ years of software engineering experience, with specific expertise in GPU clusters or similar high-performance computing environments. This role is perfect for someone who combines technical expertise with a passion for operational excellence and automation, and who thrives in a fast-paced, innovative environment.

Last updated 3 months ago

Responsibilities For Senior Site Reliability Engineer - GPU Clusters

Design, deploy and support large-scale, distributed GPU clusters for AI and ML workloads
Improve infrastructure provisioning, management, and monitoring through automation
Ensure high uptime and QoS through operational excellence
Support globally distributed cloud environments (AWS, GCP, Azure, OCI) and on-prem
Define and implement SLOs and SLIs
Write Root Cause Analysis reports
Participate in on-call rotation
Drive evaluation and integration of new GPU technologies

Requirements For Senior Site Reliability Engineer - GPU Clusters

Python · Kubernetes · Linux

BS degree in Computer Science or equivalent experience
7+ years of software engineering experience
3+ years managing GPU clusters or similar environments
Expertise in production-level cloud services
Proficiency with Kubernetes, Docker, or similar tools
Experience with Python, Go, or Ruby
Strong Linux and TCP/IP knowledge
Proficiency in CI/CD, GitOps, and Infrastructure as Code
Strong communication and documentation skills

Benefits For Senior Site Reliability Engineer - GPU Clusters

Equity
Benefits package

Jobs Related To NVIDIA Senior Site Reliability Engineer - GPU Clusters

Senior Site Reliability Engineer - Observability and Telemetry Platform

NVIDIA

Senior SRE role at NVIDIA focusing on observability and telemetry platforms, offering competitive compensation and the opportunity to work with cutting-edge cloud technologies.

Senior Site Reliability Engineer

NVIDIA

Senior Site Reliability Engineer role at NVIDIA, focusing on supporting and scaling generative AI applications across global infrastructure.

Senior Site Reliability Engineer - AI Research Clusters

NVIDIA

Senior Site Reliability Engineer position at NVIDIA focusing on AI research clusters, requiring 5+ years of experience in large-scale infrastructure and GPU computing.

Senior SRE Software Engineer, Storage and Data

NVIDIA

Senior SRE position at NVIDIA focusing on storage infrastructure reliability and performance optimization for DGX Cloud platform.

SRE Software Engineer, Storage and Data

NVIDIA

Senior SRE position at NVIDIA focusing on storage infrastructure reliability and performance optimization for GPU cloud platforms.