NVIDIA, the world leader in accelerated computing, is seeking a Senior Site Reliability Engineer to join its Observability and Telemetry Platform team. This role combines software and systems engineering practices to design, build, and maintain large-scale production systems. As an SRE at NVIDIA, you'll help ensure that GPU cloud services maintain maximum reliability while enabling developers to implement changes efficiently. The position requires expertise in systems, networking, coding, and database management, as well as cloud technologies such as Kubernetes and OpenStack.
The role emphasizes automation, performance tuning, and system optimization, with a focus on eliminating manual work. You'll be part of a diverse, intellectually curious team that values problem-solving and openness. The position offers opportunities to work on meaningful projects with support and mentorship for continuous learning and growth.
Key responsibilities include designing and implementing large-scale observability platforms, managing service lifecycles, and maintaining system health through monitoring and automation. The ideal candidate will have 5+ years of experience in infrastructure automation and observability platforms, strong programming skills in languages like Python or Go, and deep knowledge of Linux and containerization.
NVIDIA offers a competitive compensation package with a base salary range of $148,000 to $419,750 USD, plus equity and benefits. The company promotes a blame-free environment that encourages collaboration, innovation, and risk-taking. This is an excellent opportunity for experienced engineers passionate about reliability, scalability, and system optimization to join a leading technology company transforming industries through AI and digital twins.