NVIDIA, the world leader in accelerated computing, is seeking a Senior Site Reliability Engineer to join its Cloud team. The role sits at the intersection of software and systems engineering and focuses on designing and maintaining large-scale production systems. The SRE team at NVIDIA ensures maximum reliability and uptime for both internal and external GPU cloud services.
The position offers an opportunity to work with cutting-edge technology in a culture that values diversity, intellectual curiosity, and problem-solving. You'll be responsible for managing large-scale Kubernetes clusters, implementing monitoring solutions, and ensuring system reliability through automation and proactive maintenance.
As an SRE at NVIDIA, you'll be part of a team that encourages collaboration, big thinking, and risk-taking in a blame-free environment. The role combines hands-on technical work with strategic system design, which suits engineers interested in both infrastructure and software development. You'll work with tools and technologies including Python, Go, Linux, and Kubernetes while contributing to systems that power NVIDIA's AI and cloud initiatives.
The position offers a competitive salary range of $132,000 to $310,500, along with equity and comprehensive benefits. This is an excellent opportunity for experienced engineers who want to shape the future of cloud computing and AI infrastructure while working for a technology leader.