NVIDIA, the pioneer in GPU technology and leader in accelerated computing, is seeking a Senior Site Reliability Engineer to spearhead the management of its large-scale GPU clusters. This role sits at the intersection of AI innovation and infrastructure management, supporting critical AI workloads across multiple teams and projects. The position offers the opportunity to work with cutting-edge AI and machine learning infrastructure.
The role demands expertise in managing high-performance computing environments, with a focus on GPU clusters that power AI workloads. You'll be responsible for designing, deploying, and maintaining these systems while ensuring optimal performance and reliability. The position requires strong technical skills in cloud computing, containerization, and automation, along with the ability to work in a multi-cloud environment.
As a Senior SRE, you'll collaborate with researchers, AI engineers, and infrastructure teams, contributing to NVIDIA's mission of accelerating the next wave of artificial intelligence. The role offers competitive compensation ($184,000 to $356,500) plus equity, and the opportunity to work with a company at the forefront of AI and digital twin technology. You'll be part of a team that values operational excellence and innovation, working on projects that directly impact the future of machine learning and artificial intelligence.
The ideal candidate will bring 7+ years of software engineering experience, with specific expertise in GPU clusters or similar high-performance computing environments. This role suits someone who combines technical depth with a passion for operational excellence and automation, and who thrives in a fast-paced, innovative environment.