Senior DevOps Engineer - GPU Clusters

World leader in accelerated computing, pioneering AI and digital twin technology to transform industries.
$180,000 - $339,250
DevOps
Senior Software Engineer
In-Person
7+ years of experience
AI · Enterprise SaaS

Description For Senior DevOps Engineer - GPU Clusters

NVIDIA, the pioneer in GPU technology and AI innovation, is seeking a Senior DevOps Engineer to lead its GPU cluster infrastructure. This role sits at the intersection of high-performance computing and artificial intelligence, where you'll be responsible for designing and managing large-scale GPU clusters that power cutting-edge AI workloads.

The position offers an opportunity to work with state-of-the-art technology in a company that's driving the future of AI and machine learning. You'll be managing infrastructure that supports multiple teams and projects, making a direct impact on NVIDIA's AI initiatives. The role requires expertise in cloud technologies, infrastructure automation, and high-performance computing environments.

As a Senior DevOps Engineer, you'll be responsible for ensuring the reliability and efficiency of GPU clusters, implementing best practices in infrastructure as code, and maintaining high availability for critical systems. You'll work in a multi-cloud environment, dealing with AWS, GCP, Azure, and OCI, as well as on-premises infrastructure.

The ideal candidate has a strong background in software engineering, with specific experience in GPU cluster management or similar high-performance computing environments. You'll need proficiency in container orchestration and infrastructure automation, along with excellent problem-solving skills. The role offers competitive compensation between $180,000 and $339,250, plus equity.

This is an excellent opportunity for someone passionate about infrastructure automation and operational excellence, who wants to work at the forefront of AI technology. You'll be joining a diverse and experienced team, contributing to groundbreaking developments in artificial intelligence and high-performance computing at NVIDIA.

Last updated 20 days ago

Responsibilities For Senior DevOps Engineer - GPU Clusters

  • Design, deploy and support large-scale, distributed GPU clusters for AI and ML workloads
  • Improve infrastructure provisioning, management, and monitoring through automation
  • Ensure high uptime and QoS through operational excellence and monitoring
  • Support a multi-cloud environment (AWS, GCP, Azure, OCI) and on-premises infrastructure
  • Define and implement SLOs and SLIs
  • Write RCA reports for production incidents
  • Participate in on-call rotation
  • Drive evaluation and integration of new GPU technologies

Requirements For Senior DevOps Engineer - GPU Clusters

Python
Go
Kubernetes
Linux
  • BS degree in Computer Science or equivalent experience
  • 7+ years of software engineering experience
  • 3+ years managing GPU clusters or similar environments
  • Expertise in production-level cloud services
  • Proficiency with Kubernetes, Docker, or similar tools
  • Experience in Python, Go, or Ruby
  • Strong Linux and TCP/IP knowledge
  • Proficiency in CI/CD, GitOps, and Infrastructure as Code
  • Strong communication and documentation skills

Benefits For Senior DevOps Engineer - GPU Clusters

  • Equity
