NVIDIA, the pioneer in GPU technology and leader in accelerated computing, is seeking a Senior Site Reliability Engineer (SRE) to spearhead the management of its GPU clusters. This role sits at the intersection of AI innovation and infrastructure management, where you'll be responsible for designing and maintaining the backbone of NVIDIA's AI computing capabilities.
The position offers an opportunity to work with cutting-edge technology in AI and machine learning, managing large-scale GPU clusters that power crucial workloads across multiple teams. You'll be part of a team that's directly contributing to the advancement of artificial intelligence and high-performance computing, working with the latest GPU technologies including the GB200.
As an SRE, you'll be responsible for ensuring the reliability and performance of critical infrastructure, implementing automation solutions, and maintaining high availability of services. The role combines traditional SRE responsibilities with specialized knowledge in GPU cluster management, making it a unique opportunity for those interested in high-performance computing infrastructure.
The position offers competitive compensation ranging from $180,000 to $339,250, plus equity benefits. You'll be working from one of NVIDIA's major tech hubs, collaborating with researchers, AI engineers, and infrastructure teams. This role is perfect for someone who has a strong background in infrastructure management and a passion for operational excellence, and who wants to be at the forefront of AI technology advancement.
The ideal candidate will bring experience in cloud services, containerization, and infrastructure automation, along with strong problem-solving abilities and excellent communication skills. You'll be joining a company that's driving innovation in AI, autonomous vehicles, and high-performance computing, making this an excellent opportunity for career growth and impact.