Senior Software Engineer, Reliability and Operational Excellence - DGX Cloud

NVIDIA is the world leader in accelerated computing, pioneering solutions in AI and digital twins.
$148,000 - $276,000
Cloud
Senior Software Engineer
Remote
5+ years of experience
AI · Enterprise SaaS

Description For Senior Software Engineer, Reliability and Operational Excellence - DGX Cloud

NVIDIA, the world leader in accelerated computing, is seeking a Senior Software Engineer to join its DGX Cloud team, focusing on reliability and operational excellence. This role is critical to ensuring maximum reliability and uptime for both internal and external GPU cloud services.

The position combines systems engineering with software development, focusing on building tooling, reporting, and automation to enable operational excellence across a highly dynamic organization. You'll be working with cloud infrastructure, developing essential data pipelines, and streamlining incident management processes.

As a Senior Software Engineer in this role, you'll be at the forefront of maintaining and improving NVIDIA's cloud services, working with cutting-edge technologies including Python, Go, TypeScript, and Kubernetes. The role offers an exciting opportunity to work with distributed systems at scale while contributing to NVIDIA's groundbreaking developments in Artificial Intelligence and High-Performance Computing.

The position offers competitive compensation with a base salary range of $148,000 to $276,000, plus equity and benefits. NVIDIA provides an environment that values creativity, autonomy, and technical innovation. You'll be joining a company that's leading the way in AI and digital twins, transforming the world's largest industries and profoundly impacting society.

This role is perfect for someone who combines strong technical skills with excellent problem-solving abilities and communication skills. You'll have the opportunity to work on challenging problems, collaborate with talented peers, and make a significant impact on the reliability and efficiency of NVIDIA's cloud infrastructure. The position offers the flexibility of remote work while being part of a team that's pushing the boundaries of what's possible in cloud computing and AI.

Last updated 5 months ago

Responsibilities For Senior Software Engineer, Reliability and Operational Excellence - DGX Cloud

  • Design, build, deploy, and run internal tooling built on top of cloud infrastructure
  • Design, implement, ship, and maintain essential data pipelines for executive leadership
  • Integrate tooling with internal and customer workflows
  • Reduce the toil of running an incident, writing a postmortem, and being on call
  • Evangelize sustainable, blameless incident prevention and incident response
  • Consult with peer teams on operations best practices

Requirements For Senior Software Engineer, Reliability and Operational Excellence - DGX Cloud

Python
Go
TypeScript
Java
Kubernetes
Linux
  • BS degree in Computer Science or related technical field
  • 5+ years of experience
  • Experience with infrastructure automation and distributed systems design
  • Experience in Python, Go, TypeScript, C/C++, or Java
  • In-depth knowledge of Linux, networking, storage, and containers
  • Track record of project initiation and collaboration

Benefits For Senior Software Engineer, Reliability and Operational Excellence - DGX Cloud

  • Equity
  • Benefits package
