Senior Site Reliability Engineer - GPU Cloud

NVIDIA is the world leader in accelerated computing, pioneering solutions for AI, digital twins, and transforming the world's largest industries.
Site Reliability
Senior Software Engineer
In-Person
8+ years of experience
AI · Automotive · Enterprise SaaS

Description For Senior Site Reliability Engineer - GPU Cloud

NVIDIA, a pioneer in Accelerated Computing, is seeking a Senior Site Reliability Engineer for its GPU Cloud team. This role is part of a fast-paced SRE team managing cloud and on-prem infrastructure for High-Performance & Distributed Computing. The NVIDIA GPU cloud is a hosted platform for internal R&D teams and external AI/ML stack customers, spanning thousands of GPU nodes.

As a Senior SRE, you will:

  • Provide scalable and robust service-oriented infrastructure automation, monitoring, and analytics solutions for NVIDIA's on-prem and cloud-based GPU infrastructure.
  • Own the entire lifecycle of new tools and services, from requirements gathering to deployment.
  • Provide customer support on a rotation basis.

Key requirements:

  • 8+ years of experience in automating large-scale distributed system software deployments.
  • Proficiency in Go, Python, Perl, C++, Java, or C.
  • Strong command of Terraform, Kubernetes, and cloud infrastructure administration.
  • Excellent debugging, troubleshooting, and system design skills.
  • M.Sc or B.E in Computer Science or a related technical field.

NVIDIA offers a diverse work environment and is an equal opportunity employer. It does not discriminate based on race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status, or any other protected characteristic.

Join NVIDIA to work at the forefront of AI, autonomous vehicles, robotics, HPC, gaming/visualization, and cloud computing, contributing to breakthrough technologies that are transforming industries and society.

Last updated 2 months ago

Responsibilities For Senior Site Reliability Engineer - GPU Cloud

  • Provide scalable and robust service-oriented infrastructure automation, monitoring, and analytics solutions
  • Own the entire lifecycle of new tools and services
  • Provide customer support on a rotation basis

Requirements For Senior Site Reliability Engineer - GPU Cloud

Kubernetes · Go · Python · Java

  • Minimum of 8 years of experience in automating large-scale distributed system software deployments
  • Proficiency in Go, Python, Perl, C++, Java, or C
  • Strong command of Terraform, Kubernetes, and cloud infrastructure administration
  • Excellent debugging and troubleshooting skills
  • Ability to design simple and reliable systems
  • Outstanding teamwork and collaboration skills
  • Excellent interpersonal and written communication skills
  • M.Sc or B.E in Computer Science or related technical field
