NVIDIA, the pioneering force behind modern AI computing, is seeking a Site Reliability Engineering leader to manage its DGX Cloud Computing operations. This role sits at the intersection of cutting-edge AI technology and cloud infrastructure, overseeing the observability platform for multi-colo distributed NVIDIA GPU cloud clusters.
The position offers an opportunity to work with world-class software engineers on NVIDIA's GPU Cloud (NGC), a GPU-accelerated platform that enables data scientists and researchers to build, train, and deploy neural network models for complex AI challenges. As a leader, you'll own all aspects of cluster operational excellence, manage a team of Site Reliability Engineers, and drive technical projects in an innovative, fast-paced environment.
The role requires a strong technical background with 10+ years of engineering experience and 3+ years of leadership experience. You'll work with technologies including Kubernetes, OpenStack, and Docker, along with observability tools such as Grafana, OpenTelemetry, and Prometheus. The position offers exposure to various domains such as information retrieval, artificial intelligence, natural language processing, and distributed computing.
NVIDIA offers competitive compensation with a base salary range of $200,000 - $391,000, plus equity benefits. The company is committed to fostering a diverse work environment and values creative, autonomous engineers with a passion for technology. This role provides an exceptional opportunity to lead and influence the direction of cloud infrastructure services at one of the world's leading AI computing companies.
The ideal candidate combines technical expertise in distributed systems and cloud infrastructure with strong leadership abilities, and can mentor team members while driving technical excellence. The projects directly affect NVIDIA's cloud computing capabilities, so the role suits those looking to make a significant impact in the AI and cloud computing space.