Senior Software Engineer, Reliability and Operational Excellence - DGX Cloud

NVIDIA is the world leader in accelerated computing, pioneering solutions in AI and digital twins.
$148,000 - $276,000
Cloud
Senior Software Engineer
Remote
5+ years of experience
AI · Enterprise SaaS

Description For Senior Software Engineer, Reliability and Operational Excellence - DGX Cloud

NVIDIA, the world leader in accelerated computing, is seeking a Senior Software Engineer to join its DGX Cloud team, with a focus on reliability and operational excellence. This role is critical to ensuring maximum reliability and uptime for both internal and external GPU cloud services.

The position combines systems engineering with software development, focusing on building tooling, reporting, and automation to enable operational excellence across a highly dynamic organization. You'll be working with cloud infrastructure, developing essential data pipelines, and streamlining incident management processes.

As a Senior Software Engineer in this role, you'll be at the forefront of maintaining and improving NVIDIA's cloud services, working with cutting-edge technologies including Python, Go, TypeScript, and Kubernetes. The role offers an exciting opportunity to work with distributed systems at scale while contributing to NVIDIA's groundbreaking developments in Artificial Intelligence and High-Performance Computing.

The position offers competitive compensation with a base salary range of $148,000 to $276,000, plus equity and benefits. NVIDIA provides an environment that values creativity, autonomy, and technical innovation. You'll be joining a company that's leading the way in AI and digital twins, transforming the world's largest industries and profoundly impacting society.

This role is perfect for someone who combines strong technical skills with excellent problem-solving abilities and communication skills. You'll have the opportunity to work on challenging problems, collaborate with talented peers, and make a significant impact on the reliability and efficiency of NVIDIA's cloud infrastructure. The position offers the flexibility of remote work while being part of a team that's pushing the boundaries of what's possible in cloud computing and AI.

Last updated 4 months ago

Responsibilities For Senior Software Engineer, Reliability and Operational Excellence - DGX Cloud

  • Design, build, deploy, and run internal tooling built on top of cloud infrastructure
  • Design, implement, ship, and maintain essential data pipelines for executive leadership
  • Integrate tooling with internal and customer workflows
  • Reduce the toil of running incidents, writing postmortems, and being on call
  • Evangelize sustainable, blameless incident prevention and incident response
  • Consult with peer teams on operations best practices

Requirements For Senior Software Engineer, Reliability and Operational Excellence - DGX Cloud

Python
Go
TypeScript
Java
Kubernetes
Linux
  • BS degree in Computer Science or related technical field
  • 5+ years of experience
  • Experience with infrastructure automation and distributed systems design
  • Experience in Python, Go, TypeScript, C/C++, or Java
  • In-depth knowledge of Linux, networking, storage, and containers
  • Track record of project initiation and collaboration

Benefits For Senior Software Engineer, Reliability and Operational Excellence - DGX Cloud

  • Equity
  • Benefits package
