Senior Site Reliability Engineer - DGX Cloud

NVIDIA

NVIDIA is the world leader in accelerated computing, pioneering solutions for challenges no one else can solve.

Santa Clara, CA, USA

$148,000 - $276,000

Site Reliability

Senior Software Engineer

Hybrid

5+ years of experience

AI · Enterprise SaaS · Cloud

Description For Senior Site Reliability Engineer - DGX Cloud

Site Reliability Engineering (SRE) at NVIDIA is an engineering discipline focused on designing, building, and maintaining large-scale production systems with high efficiency and availability. This role demands knowledge across various domains including systems, networking, coding, database management, capacity management, continuous delivery and deployment, and cloud technologies like Kubernetes and OpenStack.

As a Senior Site Reliability Engineer for DGX Cloud, you'll be responsible for:

Designing, implementing, and supporting operational and reliability aspects of large-scale Kubernetes clusters
Engaging in the entire lifecycle of services, from inception and design to deployment and refinement
Supporting services pre-launch through system design consulting, tool development, capacity management, and launch reviews
Maintaining live services by monitoring availability, latency, and overall system health
Scaling systems sustainably through automation and evolving systems to improve reliability and velocity
Practicing sustainable incident response and blameless postmortems
Participating in an on-call rotation to support production systems

The ideal candidate will have:

BS degree in Computer Science or a related technical field involving coding, or equivalent experience
5+ years of experience
Experience with infrastructure automation, distributed systems design, and developing tools for large-scale cloud systems
Proficiency in Python, Go, Perl, or Ruby
In-depth knowledge of Linux, Networking, and Containers

NVIDIA offers a competitive base salary range of $148,000 - $276,000 USD, along with equity and comprehensive benefits. The company values diversity and fosters an inclusive work environment, encouraging collaboration, intellectual curiosity, and risk-taking in a blame-free setting.

Join NVIDIA to be part of a team tackling challenges in AI and digital twins, transforming major industries and making a profound impact on society.

Last updated 15 days ago

Responsibilities For Senior Site Reliability Engineer - DGX Cloud

Design, implement, and support operational and reliability aspects of large-scale Kubernetes clusters
Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement
Support services before they go live through activities such as system design consulting, developing software tools, platforms, and frameworks, capacity management, and launch reviews
Maintain services once they are live by measuring and monitoring availability, latency, and overall system health
Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity
Practice sustainable incident response and blameless postmortems
Be part of an on-call rotation to support production systems

Requirements For Senior Site Reliability Engineer - DGX Cloud

Kubernetes

Linux

Python

BS degree in Computer Science or related technical field involving coding, or equivalent experience
5+ years of experience
Experience with infrastructure automation and distributed systems design
Experience designing and developing tools for running large-scale private or public cloud systems in production
Experience in Python, Go, Perl, or Ruby
In-depth knowledge of Linux, networking, and containers

Benefits For Senior Site Reliability Engineer - DGX Cloud

Equity
Comprehensive benefits package
