Senior DevOps Engineer - GPU Clusters

NVIDIA is the world leader in accelerated computing, pioneering solutions for challenges no one else can solve.
Santa Clara, CA, USA · Westford, MA 01886, USA · Austin, TX, USA
$180,000 - $339,250
DevOps
Senior Software Engineer
Hybrid
5,000+ Employees
7+ years of experience
AI · Enterprise SaaS

Description For Senior DevOps Engineer - GPU Clusters

NVIDIA is seeking a highly skilled and experienced Senior DevOps Engineer to lead the design, deployment, and management of large-scale GPU clusters. These clusters will power AI workloads across multiple teams and projects, making a significant impact on the future of machine learning and artificial intelligence at NVIDIA.

The ideal candidate will have a passion for operational excellence, automation, and working in a multi-cloud environment. They will collaborate with researchers, AI engineers, and infrastructure teams to ensure GPU clusters perform efficiently, scale well, and remain reliable.

Key responsibilities include:

  • Designing, deploying, and supporting large-scale, distributed GPU clusters for high-performance AI and machine learning workloads
  • Continuously improving infrastructure provisioning, management, and monitoring through automation
  • Ensuring high uptime and quality of service through operational excellence and proactive monitoring
  • Supporting globally distributed cloud environments (AWS, GCP, Azure, OCI) and on-premises infrastructure
  • Implementing and maintaining service level objectives (SLOs) and service level indicators (SLIs)
  • Participating in on-call rotations and incident resolution

Requirements:

  • BS in Computer Science or equivalent experience
  • 7+ years of software engineering experience, with 3+ years managing GPU clusters or similar high-performance computing environments
  • Expertise in cloud services, containerization (Kubernetes, Docker), and Infrastructure as Code (Terraform, Ansible)
  • Proficiency in multiple programming languages and Linux systems

The role offers a competitive base salary range of $180,000 - $339,250 USD, along with equity and comprehensive benefits. Join NVIDIA's engineering team and contribute to groundbreaking developments in AI, High-Performance Computing, and Visualization.

Last updated 9 days ago

Responsibilities For Senior DevOps Engineer - GPU Clusters

  • Design, deploy, and support large-scale, distributed GPU clusters for AI and machine learning workloads
  • Improve infrastructure provisioning, management, and monitoring through automation
  • Ensure high uptime and quality of service through operational excellence and proactive monitoring
  • Support globally distributed cloud environments (AWS, GCP, Azure, OCI) and on-premises infrastructure
  • Define and implement service level objectives (SLOs) and service level indicators (SLIs)
  • Write high-quality Root Cause Analysis (RCA) reports for production-level incidents
  • Participate in an on-call rotation to support critical infrastructure
  • Drive evaluation and integration of new GPU technologies (such as GB200) and cloud technologies

Requirements For Senior DevOps Engineer - GPU Clusters

Kubernetes · Python · Go · Ruby · Linux
  • BS degree in Computer Science or equivalent experience
  • 7+ years of software engineering experience
  • 3+ years managing GPU clusters or similar high-performance computing environments
  • Expertise in designing, deploying, and running production-level cloud services
  • Proficiency with orchestration and containerization tools (Kubernetes, Docker)
  • Experience coding/scripting in at least two high-level programming languages (Python, Go, Ruby)
  • Strong proficiency with Linux operating systems and TCP/IP fundamentals
  • Proficient in modern CI/CD techniques, GitOps, and Infrastructure as Code (Terraform, Ansible)
  • Strong communication and documentation skills

Benefits For Senior DevOps Engineer - GPU Clusters

  • Equity
  • Comprehensive benefits package
