Site Reliability Engineer - GPU Cloud

NVIDIA

NVIDIA is the world leader in accelerated computing, pioneering solutions for AI, digital twins, and transforming industries.

Bengaluru, Karnataka, India

Site Reliability

Mid-Level Software Engineer

In-Person

3+ years of experience

AI · Enterprise SaaS · Cloud

Description For Site Reliability Engineer - GPU Cloud

NVIDIA, a pioneer in accelerated computing, is seeking a Site Reliability Engineer for its GPU Cloud team. This role is part of a fast-paced Site Reliability Engineering (SRE) team working at the forefront of cloud and on-prem infrastructure management for High-Performance & Distributed Computing.

The NVIDIA GPU Cloud is a hosted platform for internal R&D teams and external AI/ML stack customers. The SRE team is accountable for the setup, management, reliability, and availability of this infrastructure, which spans thousands of GPU nodes. As an SRE, you will be responsible for providing scalable and robust service-oriented infrastructure automation, monitoring, and analytics solutions for NVIDIA's on-prem and cloud-based GPU infrastructure. You will own the whole lifecycle of new tools and services and provide customer support on a rotation basis.

The ideal candidate should have a minimum of 3 years of experience in automating and handling large-scale distributed system software deployments in on-prem/cloud environments. Proficiency in at least one language such as Go, Python, Perl, C++, Java, or C is required, along with a strong command of Terraform, Kubernetes, and cloud infra administration. Excellent debugging, troubleshooting, and communication skills are essential.

NVIDIA offers a dynamic work environment at the cutting edge of AI, GPU technology, and cloud computing. This role provides an opportunity to work with some of the most forward-thinking and hardworking people in the technology world. If you're creative, autonomous, and passionate about infrastructure and resolving intricate multi-faceted issues, this position at NVIDIA could be an excellent fit for you.

Join NVIDIA in shaping the future of AI, cloud computing, and high-performance distributed systems. Apply now to be part of a team that's driving innovation in some of the most exciting areas of technology today.

Last updated 2 months ago

Responsibilities For Site Reliability Engineer - GPU Cloud

Provide scalable and robust service-oriented infrastructure automation, monitoring, and analytics solutions for NVIDIA's on-prem and cloud-based GPU infrastructure
Own the whole lifecycle of new tools and services - from requirements gathering to design documentation, validation, and deployment
Provide customer support on a rotation basis

Requirements For Site Reliability Engineer - GPU Cloud

Python

Kubernetes

Linux

Minimum of 3 years of experience in automating and handling large-scale distributed system software deployments in on-prem/cloud environments
Proficiency in at least one of the following languages: Go, Python, Perl, C++, Java, or C
Strong command of Terraform, Kubernetes, and cloud infra administration
Excellent debugging and troubleshooting skills
Excellent interpersonal and written communication skills
B.E. in Computer Science or a related technical field involving coding (e.g., physics or mathematics)
