NVIDIA is seeking an experienced Senior DevOps Engineer to lead the design, deployment, and management of large-scale GPU clusters. These clusters will power AI workloads across multiple teams and projects, with a significant impact on the future of machine learning and artificial intelligence at NVIDIA.
The ideal candidate will have a passion for operational excellence, automation, and multi-cloud environments. They will collaborate with researchers, AI engineers, and infrastructure teams to ensure GPU clusters perform efficiently, scale well, and remain reliable.
Key responsibilities include:
- Designing, deploying, and supporting large-scale, distributed GPU clusters for high-performance AI and machine learning workloads
- Continuously improving infrastructure provisioning, management, and monitoring through automation
- Ensuring high uptime and quality of service through operational excellence and proactive monitoring
- Supporting globally distributed cloud environments (AWS, GCP, Azure, OCI) and on-premises infrastructure
- Implementing and maintaining service level objectives (SLOs) and service level indicators (SLIs)
- Participating in on-call rotations and incident resolution
Requirements:
- BS in Computer Science or equivalent experience
- 7+ years of software engineering experience, with 3+ years managing GPU clusters or similar high-performance computing environments
- Expertise in cloud services, containerization (Kubernetes, Docker), and Infrastructure as Code (Terraform, Ansible)
- Proficiency in multiple programming languages and Linux systems
The role offers a competitive base salary range of $180,000 - $339,250 USD, along with equity and comprehensive benefits. Join NVIDIA's engineering team and contribute to groundbreaking developments in AI, High-Performance Computing, and Visualization.