Senior Software Engineer, Reliability and Operational Excellence - DGX Cloud

NVIDIA is the world leader in accelerated computing, pioneering solutions in AI and digital twins.
$148,000 - $276,000
Cloud
Senior Software Engineer
Remote
5+ years of experience
AI · Enterprise SaaS

Description For Senior Software Engineer, Reliability and Operational Excellence - DGX Cloud

NVIDIA, the world leader in accelerated computing, is seeking a Senior Software Engineer to join its DGX Cloud team, focusing on reliability and operational excellence. This role is critical to ensuring maximum reliability and uptime for both internal and external GPU cloud services.

The position combines systems engineering with software development, focusing on building tooling, reporting, and automation to enable operational excellence across a highly dynamic organization. You'll be working with cloud infrastructure, developing essential data pipelines, and streamlining incident management processes.

As a Senior Software Engineer in this role, you'll be at the forefront of maintaining and improving NVIDIA's cloud services, working with cutting-edge technologies including Python, Go, TypeScript, and Kubernetes. The role offers an exciting opportunity to work with distributed systems at scale while contributing to NVIDIA's groundbreaking developments in Artificial Intelligence and High-Performance Computing.

The position offers competitive compensation with a base salary range of $148,000 to $276,000, plus equity and benefits. NVIDIA provides an environment that values creativity, autonomy, and technical innovation. You'll be joining a company that's leading the way in AI and digital twins, transforming the world's largest industries and profoundly impacting society.

This role is perfect for someone who combines strong technical skills with excellent problem-solving abilities and communication skills. You'll have the opportunity to work on challenging problems, collaborate with talented peers, and make a significant impact on the reliability and efficiency of NVIDIA's cloud infrastructure. The position offers the flexibility of remote work while being part of a team that's pushing the boundaries of what's possible in cloud computing and AI.

Last updated 23 days ago

Responsibilities For Senior Software Engineer, Reliability and Operational Excellence - DGX Cloud

  • Design, build, deploy, and run internal tooling built on top of cloud infrastructure
  • Design, implement, ship, and maintain essential data pipelines for executive leadership
  • Integrate tooling with internal and customer workflows
  • Reduce the toil of running an incident, writing a postmortem, and running an on-call rotation
  • Evangelize sustainable, blameless incident prevention and incident response
  • Consult with peer teams on operations best practices

Requirements For Senior Software Engineer, Reliability and Operational Excellence - DGX Cloud

Python
Go
TypeScript
Java
Kubernetes
Linux
  • BS degree in Computer Science or related technical field
  • 5+ years of experience
  • Experience with infrastructure automation and distributed systems design
  • Experience in Python, Go, TypeScript, C/C++, or Java
  • In-depth knowledge of Linux, networking, storage, and containers
  • Track record of project initiation and collaboration

Benefits For Senior Software Engineer, Reliability and Operational Excellence - DGX Cloud

  • Equity
  • Benefits package

Jobs Related To NVIDIA Senior Software Engineer, Reliability and Operational Excellence - DGX Cloud

Senior DGX Cloud Software Engineer- Infrastructure Automation and Distributed Systems

Senior Cloud Engineer role at NVIDIA focusing on infrastructure automation and distributed systems for DGX Cloud platform, offering competitive compensation and remote work options.

Senior Software Engineer, Kubernetes - DGX Cloud

Senior Software Engineer position at NVIDIA focusing on Kubernetes development for DGX Cloud, working on GPU resource scheduling and cluster management for AI workloads.

Senior AI-HPC Storage Engineer

Senior AI-HPC Storage Engineer role at NVIDIA, focusing on designing and implementing advanced storage solutions for AI and high-performance computing environments.

Senior Software Engineer, Bare Metal Automation - DGX Cloud

Senior Software Engineer position at NVIDIA focusing on bare metal automation for DGX Cloud, managing GPU clusters and implementing monitoring systems for AI infrastructure.

Senior Cloud Platform Software Engineer

Senior Cloud Platform Engineer role at NVIDIA building scalable cloud services for AI workloads, requiring 12+ years of experience in platform engineering and expertise in Kubernetes.