NVIDIA, the world leader in accelerated computing, is seeking a Senior Site Reliability Engineer for its DGX Cloud team. The role combines software and systems engineering practices to design, build, and maintain large-scale production systems. As an SRE at NVIDIA, you'll keep GPU cloud services highly reliable while improving developer productivity through automation and optimization.
The position requires expertise in Kubernetes, distributed systems, and cloud technologies. You'll be responsible for the entire service lifecycle, from design through deployment and maintenance, focusing on performance at scale, monitoring, and incident response. The role offers opportunities to work with cutting-edge technology in AI and cloud computing.
NVIDIA's culture emphasizes diversity, intellectual curiosity, and problem-solving in a blame-free environment. The company encourages collaboration, ambitious thinking, and risk-taking while providing support and mentorship for professional growth. The compensation package includes a base salary range of $148,000 to $276,000, plus equity and comprehensive benefits.
The ideal candidate will have 5+ years of experience, strong Linux and container expertise, and programming skills in languages such as Python or Go. You'll join a team that's transforming industries through accelerated computing and AI, making this an excellent opportunity for those passionate about large-scale distributed systems.