Senior Software Engineer - GPU Clusters

World leader in accelerated computing, pioneering AI and digital twin technology.
$180,000 - $339,250
Cloud
Senior Software Engineer
In-Person
7+ years of experience
AI · Enterprise SaaS

Description For Senior Software Engineer - GPU Clusters

NVIDIA, the pioneer in GPU technology and AI innovation, is seeking a Senior Software Engineer to lead its GPU clusters initiative. This role sits at the intersection of high-performance computing and artificial intelligence, where you'll be responsible for designing and managing large-scale GPU clusters that power cutting-edge AI workloads.

The position offers an opportunity to work with state-of-the-art technology in a company that's driving the future of AI and computing. You'll be joining a team that values operational excellence and innovation, working on infrastructure that directly impacts the development of next-generation AI solutions.

As a Senior Software Engineer, you'll be responsible for ensuring the reliability and efficiency of GPU clusters across multiple cloud platforms and on-premises environments. This includes implementing automation, maintaining high availability, and continuously improving infrastructure performance. You'll work with technologies like Kubernetes, various cloud platforms (AWS, GCP, Azure, OCI), and modern DevOps tools.

The ideal candidate brings strong technical expertise in cloud infrastructure, containerization, and programming, combined with experience in GPU or high-performance computing environments. You'll need excellent problem-solving skills and the ability to work effectively in a fast-paced, collaborative environment.

This role offers competitive compensation, including a base salary range of $180,000 to $339,250, plus equity. You'll be working at the forefront of AI technology, contributing to infrastructure that powers groundbreaking developments in artificial intelligence, autonomous vehicles, and high-performance computing.

Join NVIDIA to be part of a team that's shaping the future of computing and AI, while working with some of the most advanced technology in the industry. This position provides an excellent opportunity for growth and impact in a company that's leading the AI revolution.

Last updated 20 days ago

Responsibilities For Senior Software Engineer - GPU Clusters

  • Design, deploy and support large-scale, distributed GPU clusters for AI and ML workloads
  • Improve infrastructure provisioning, management, and monitoring through automation
  • Ensure high uptime and QoS through operational excellence
  • Support globally distributed cloud environments (AWS, GCP, Azure, OCI) and on-premises environments
  • Define and implement SLOs and SLIs
  • Write RCA reports for production incidents
  • Participate in on-call rotation
  • Drive evaluation and integration of new GPU technologies

Requirements For Senior Software Engineer - GPU Clusters

Python · Go · Kubernetes · Linux
  • BS degree in Computer Science or equivalent experience
  • 7+ years of software engineering experience
  • 3+ years managing GPU clusters or similar environments
  • Expertise in production-level cloud services
  • Proficiency with Kubernetes, Docker, or similar tools
  • Experience in Python, Go, or Ruby programming
  • Strong Linux and TCP/IP knowledge
  • Proficiency in CI/CD, GitOps, and Infrastructure as Code
  • Strong communication and documentation skills

Benefits For Senior Software Engineer - GPU Clusters

  • Equity

Jobs Related To NVIDIA Senior Software Engineer - GPU Clusters

Senior Software Engineer, Kubernetes - DGX Cloud

Senior Software Engineer position at NVIDIA focusing on Kubernetes development for DGX Cloud, working on GPU resource scheduling and cluster management for AI workloads.

Senior DGX Cloud Software Engineer- Infrastructure Automation and Distributed Systems

Senior Cloud Engineer role at NVIDIA focusing on infrastructure automation and distributed systems for DGX cloud services.

Senior AI-HPC Storage Engineer

Senior AI-HPC Storage Engineer role at NVIDIA, focusing on designing and implementing advanced storage solutions for AI and high-performance computing environments.

Senior Software Engineer, Bare Metal Automation - DGX Cloud

Senior Software Engineer position at NVIDIA focusing on bare metal automation for DGX Cloud, managing GPU clusters and implementing monitoring systems for AI infrastructure.

Senior Cloud Platform Software Engineer

Senior Cloud Platform Engineer role at NVIDIA building scalable cloud services for AI workloads, requiring 12+ years of experience in platform engineering and expertise in Kubernetes.