Senior Site Reliability Engineer - DGX Cloud

NVIDIA is the world leader in accelerated computing, pioneering solutions for challenges no one else can solve.
$148,000 - $276,000
Site Reliability
Senior Software Engineer
Hybrid
5+ years of experience
AI · Enterprise SaaS · Cloud

Description For Senior Site Reliability Engineer - DGX Cloud

Site Reliability Engineering (SRE) at NVIDIA is an engineering discipline focused on designing, building, and maintaining large-scale production systems with high efficiency and availability. This role demands knowledge across various domains including systems, networking, coding, database management, capacity management, continuous delivery and deployment, and cloud technologies like Kubernetes and OpenStack.

As a Senior Site Reliability Engineer for DGX Cloud, you'll be responsible for:

  • Designing, implementing, and supporting operational and reliability aspects of large-scale Kubernetes clusters
  • Engaging in the entire lifecycle of services, from inception and design to deployment and refinement
  • Supporting services pre-launch through system design consulting, tool development, capacity management, and launch reviews
  • Maintaining live services by monitoring availability, latency, and overall system health
  • Scaling systems sustainably through automation and evolving systems to improve reliability and velocity
  • Practicing sustainable incident response and blameless postmortems
  • Participating in an on-call rotation to support production systems

The ideal candidate will have:

  • BS degree in Computer Science or a related technical field involving coding, or equivalent experience
  • 5+ years of experience
  • Experience with infrastructure automation, distributed systems design, and developing tools for large-scale cloud systems
  • Proficiency in Python, Go, Perl, or Ruby
  • In-depth knowledge of Linux, networking, and containers

NVIDIA offers a competitive base salary range of $148,000 - $276,000 USD, along with equity and comprehensive benefits. The company values diversity and fosters an inclusive work environment, encouraging collaboration, intellectual curiosity, and risk-taking in a blame-free setting.

Join NVIDIA to be part of a team tackling challenges in AI and digital twins, transforming major industries and making a profound impact on society.

Last updated 15 days ago

Responsibilities For Senior Site Reliability Engineer - DGX Cloud

  • Design, implement, and support operational and reliability aspects of large-scale Kubernetes clusters
  • Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement
  • Support services before they go live through activities such as system design consulting, developing software tools, platforms and frameworks, capacity management and launch reviews
  • Maintain services once they are live by measuring and monitoring availability, latency and overall system health
  • Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity
  • Practice sustainable incident response and blameless postmortems
  • Be part of an on-call rotation to support production systems

Requirements For Senior Site Reliability Engineer - DGX Cloud

Kubernetes
Linux
Python
Go
  • BS degree in Computer Science or related technical field involving coding, or equivalent experience
  • 5+ years of experience
  • Experience with infrastructure automation and distributed systems design
  • Experience designing and developing tools for running large-scale private or public cloud systems in production
  • Experience in Python, Go, Perl or Ruby
  • In-depth knowledge of Linux, networking, and containers

Benefits For Senior Site Reliability Engineer - DGX Cloud

  • Equity
  • Comprehensive benefits package

Jobs Related To NVIDIA Senior Site Reliability Engineer - DGX Cloud

Senior Site Reliability Engineer - Observability and Telemetry Platform

Senior SRE position at NVIDIA focusing on observability and telemetry platforms, offering competitive salary and opportunity to work with cutting-edge cloud technologies.

Senior Production SRE Engineer - Storage

Senior Production SRE Engineer position at NVIDIA focusing on storage systems, requiring 5+ years experience and expertise in large-scale system reliability and automation.

Senior Site Reliability Engineer - AI Research Clusters

Senior Site Reliability Engineer for AI Research Clusters at NVIDIA, designing and implementing GPU compute clusters for AI research.

Senior Site Reliability Engineer - GPU Clusters

NVIDIA is seeking a Senior Site Reliability Engineer to lead the design, deployment, and management of large-scale GPU clusters for AI workloads.

Senior Site Reliability Engineer, Data Science and ML Platforms

Senior Site Reliability Engineer for NVIDIA's Data Science & ML Platforms team, focusing on large-scale production systems and SRE practices.