NVIDIA is seeking a Senior Site Reliability Engineer to join its GPU AI/HPC Infrastructure team. This role focuses on designing and implementing the GPU compute clusters that power AI research across NVIDIA. The ideal candidate will build and operate these clusters with high reliability, efficiency, and performance while driving foundational improvements and automation to enhance researcher productivity.
As an SRE at NVIDIA, you'll be part of a diverse team that values intellectual curiosity and problem-solving. The role involves working with a broad spectrum of tools and approaches, implementing practices such as limiting reactive operational work, conducting blameless postmortems, and proactively identifying potential outages. You'll be immersed in an environment that promotes self-direction while providing necessary support and mentorship for growth.
The position requires expertise in GPU computing and AI infrastructure, including hands-on experience managing large-scale compute clusters of at least 2,000 GPUs. You'll work with technologies such as Kubernetes, Docker, and configuration management tools, and you'll write automation in Python and Bash. The role offers exposure to AI research infrastructure at scale and the opportunity to contribute directly to NVIDIA's work in artificial intelligence.
NVIDIA's culture emphasizes diversity, openness, and collaboration in a blame-free environment. The company has a strong track record of innovation, from inventing the GPU, which sparked the growth of the PC gaming market, to revolutionizing parallel computing and advancing AI research worldwide. This role offers the chance to join a company that continues to evolve and adapt to new opportunities while making a lasting impact on the world.