NVIDIA is seeking a Senior Site Reliability Engineer to join its GPU AI/HPC Infrastructure team. The role focuses on designing and implementing the GPU compute clusters that power AI research across NVIDIA. It calls for an expert who can build and operate clusters with high reliability, efficiency, and performance, while driving improvements and automation that raise researcher productivity.
As an SRE, you'll be responsible for how systems interact as a whole, using a range of tools and approaches to address a wide variety of challenges. The role emphasizes proactive problem-solving, which includes limiting reactive operational work and conducting blameless postmortems. NVIDIA's culture promotes diversity, intellectual curiosity, and innovation in a blame-free environment.
The role involves building and improving the ecosystem around GPU-accelerated computing, developing large-scale automation solutions, and maintaining AI/HPC GPU clusters. You'll support researchers in optimizing their workflows, with a focus on performance scaling, real-time monitoring, and system reliability.
Key responsibilities include designing state-of-the-art GPU compute clusters, optimizing operations, implementing automation, and participating in an on-call rotation. The ideal candidate will have extensive experience with GPU computing and AI infrastructure, along with proven expertise in site reliability engineering for high-performance computing environments.
This position offers the opportunity to work with some of the largest and most complex systems in the world and to contribute to advances in AI research infrastructure. The role combines technical expertise with strategic thinking, making it a strong fit for those passionate about large-scale distributed systems.