Senior Site Reliability Engineer - GPU Cloud

NVIDIA

NVIDIA is the world leader in accelerated computing, pioneering solutions for AI, digital twins, and transforming the world's largest industries.

Bengaluru, Karnataka, India

Site Reliability

Senior Software Engineer

In-Person

8+ years of experience

AI · Automotive · Enterprise SaaS

This job posting may no longer be active.

Description For Senior Site Reliability Engineer - GPU Cloud

NVIDIA, a pioneer in accelerated computing, is seeking a Senior Site Reliability Engineer for its GPU Cloud team. The role is part of a fast-paced SRE team that manages cloud and on-prem infrastructure for high-performance and distributed computing. The NVIDIA GPU cloud is a hosted platform for internal R&D teams and external AI/ML stack customers, spanning thousands of GPU nodes.

As a Senior SRE, you will:

Provide scalable and robust service-oriented infrastructure automation, monitoring, and analytics solutions for NVIDIA's on-prem and cloud-based GPU infrastructure.
Own the entire lifecycle of new tools and services, from requirements gathering to deployment.
Provide customer support on a rotation basis.

Key requirements:

8+ years of experience in automating large-scale distributed system software deployments.
Proficiency in Go, Python, Perl, C++, Java, or C.
Strong command of Terraform, Kubernetes, and cloud infrastructure administration.
Excellent debugging, troubleshooting, and system design skills.
M.Sc or B.E in Computer Science or a related technical field.

NVIDIA offers a diverse work environment and is an equal opportunity employer. The company does not discriminate based on race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status, or any other protected characteristic.

Join NVIDIA to work at the forefront of AI, autonomous vehicles, robotics, HPC, gaming/visualization, and cloud computing, contributing to breakthrough technologies that are transforming industries and society.

Last updated 5 months ago

Responsibilities For Senior Site Reliability Engineer - GPU Cloud

Provide scalable and robust service-oriented infrastructure automation, monitoring, and analytics solutions
Own the entire lifecycle of new tools and services
Provide customer support on a rotation basis

Requirements For Senior Site Reliability Engineer - GPU Cloud

Kubernetes · Python · Java

Minimum of 8 years of experience in automating large-scale distributed system software deployments
Proficiency in Go, Python, Perl, C++, Java, or C
Strong command of Terraform, Kubernetes, and cloud infrastructure administration
Excellent debugging and troubleshooting skills
Ability to design simple and reliable systems
Outstanding teamwork and collaboration skills
Excellent interpersonal and written communication skills
M.Sc or B.E in Computer Science or related technical field
