Senior Site Reliability Engineer - DGX Cloud

NVIDIA is the world leader in accelerated computing, pioneering solutions in AI and digital twins.
$148,000 - $276,000
Site Reliability
Senior Software Engineer
Hybrid
5+ years of experience
AI · Enterprise SaaS · Cloud

Description For Senior Site Reliability Engineer - DGX Cloud

NVIDIA, the world leader in accelerated computing, is seeking a Senior Site Reliability Engineer for its DGX Cloud team. This role combines software and systems engineering practices to design, build, and maintain large-scale production systems. As an SRE at NVIDIA, you'll work to keep GPU cloud services at maximum reliability while enabling developer productivity through automation and optimization.

The position requires expertise in Kubernetes, distributed systems, and cloud technologies. You'll be responsible for the entire service lifecycle, from design through deployment and maintenance, focusing on performance at scale, monitoring, and incident response. The role offers opportunities to work with cutting-edge technology in AI and cloud computing.

NVIDIA's culture emphasizes diversity, intellectual curiosity, and problem-solving in a blame-free environment. The company encourages collaboration, big thinking, and risk-taking while providing support and mentorship for professional growth. The compensation package includes a competitive base salary range of $148,000-$276,000, plus equity and comprehensive benefits.

The ideal candidate will have 5+ years of experience, strong Linux and container expertise, and programming skills in languages like Python or Go. You'll join a team that's transforming industries through accelerated computing and AI technology, making this an excellent opportunity for those passionate about large-scale distributed systems and cutting-edge technology.

Last updated 12 days ago

Responsibilities For Senior Site Reliability Engineer - DGX Cloud

  • Design, implement and support operational and reliability aspects of large-scale Kubernetes clusters
  • Engage in service lifecycle from inception through deployment and refinement
  • Support services through system design consulting and launch reviews
  • Maintain services by monitoring availability, latency and system health
  • Scale systems through automation
  • Practice sustainable incident response and blameless postmortems
  • Participate in on-call rotation for production systems

Requirements For Senior Site Reliability Engineer - DGX Cloud

Python
Go
Linux
Kubernetes
  • BS degree in Computer Science or related technical field
  • 5+ years of experience
  • Experience with infrastructure automation and distributed systems design
  • Experience with Python, Go, Perl or Ruby
  • In-depth knowledge of Linux, networking and containers

Benefits For Senior Site Reliability Engineer - DGX Cloud

  • Equity
  • Comprehensive benefits package
