NVIDIA, known as "the AI computing company," is seeking a Site Reliability Engineering leader to manage the operations of its observability platform for multi-colo distributed NVIDIA GPU cloud clusters. The role sits on the NVIDIA GPU Cloud (NGC) team. NGC is a GPU-accelerated platform that enables data scientists and researchers to build, train, and deploy neural network models for complex AI challenges.
The position requires a seasoned leader who will manage all aspects of cluster operational excellence and team growth. The ideal candidate should thrive in a fast-paced iterative engineering environment and have extensive experience delivering scalable distributed systems. This role involves working across various domains including information retrieval, artificial intelligence, natural language processing, distributed computing, and large-scale system design.
As a manager, you'll be responsible for guiding the team in solving reliability challenges for both internal and external-facing systems. The role offers the opportunity to work with cutting-edge technology in AI and deep learning, while leading a team of skilled engineers. You'll collaborate with product management teams, drive technical projects, and contribute to the strategic direction of DGX Cloud Computing Services.
The position offers competitive compensation ranging from $200,000 to $385,250 USD, along with equity and comprehensive benefits. NVIDIA provides an inclusive work environment and values diversity in its workforce, making this an excellent opportunity for leaders who want to make an impact in the AI and cloud computing space.