Site Reliability Engineering (SRE) at NVIDIA is an engineering discipline focused on designing, building, and maintaining large-scale production systems with high efficiency and availability. The role demands knowledge across several domains, including systems, networking, coding, database management, capacity management, continuous delivery and deployment, and cloud technologies such as Kubernetes and OpenStack.
As a Senior Site Reliability Engineer for DGX Cloud, you'll be responsible for:
- Designing, implementing, and supporting operational and reliability aspects of large-scale Kubernetes clusters
- Engaging in the entire lifecycle of services, from inception and design to deployment and refinement
- Supporting services pre-launch through system design consulting, tool development, capacity management, and launch reviews
- Maintaining live services by monitoring availability, latency, and overall system health
- Scaling systems sustainably through automation and evolving systems to improve reliability and velocity
- Practicing sustainable incident response and blameless postmortems
- Participating in an on-call rotation to support production systems
The ideal candidate will have:
- BS degree in Computer Science or a related technical field involving coding, or equivalent experience
- 5+ years of experience
- Experience with infrastructure automation, distributed systems design, and developing tools for large-scale cloud systems
- Proficiency in Python, Go, Perl, or Ruby
- In-depth knowledge of Linux, networking, and containers
NVIDIA offers a competitive base salary range of $148,000 - $276,000 USD, along with equity and comprehensive benefits. The company values diversity and fosters an inclusive work environment, encouraging collaboration, intellectual curiosity, and risk-taking in a blame-free setting.
Join NVIDIA to be part of a team tackling challenges in AI and digital twins, transforming major industries and making a profound impact on society.