Senior Software Engineer, Reliability and Operational Excellence - DGX Cloud

NVIDIA

NVIDIA is the world leader in accelerated computing, pioneering solutions in AI and digital twins.

San Francisco, CA, USA

$148,000 - $276,000

Cloud

Senior Software Engineer

Remote

5+ years of experience

AI · Enterprise SaaS

Description For Senior Software Engineer, Reliability and Operational Excellence - DGX Cloud

NVIDIA, the world leader in accelerated computing, is seeking a Senior Software Engineer to join its DGX Cloud team, with a focus on reliability and operational excellence. This role is critical to ensuring maximum reliability and uptime for both internal and external GPU cloud services.

The position combines systems engineering with software development, focusing on building tooling, reporting, and automation to enable operational excellence across a highly dynamic organization. You'll be working with cloud infrastructure, developing essential data pipelines, and streamlining incident management processes.

As a Senior Software Engineer in this role, you'll be at the forefront of maintaining and improving NVIDIA's cloud services, working with cutting-edge technologies including Python, Go, TypeScript, and Kubernetes. The role offers an exciting opportunity to work with distributed systems at scale while contributing to NVIDIA's groundbreaking developments in Artificial Intelligence and High-Performance Computing.

The position offers competitive compensation with a base salary range of $148,000 to $276,000, plus equity and benefits. NVIDIA provides an environment that values creativity, autonomy, and technical innovation. You'll be joining a company that's leading the way in AI and digital twins, transforming the world's largest industries and profoundly impacting society.

This role suits someone who combines strong technical skills with excellent problem-solving and communication abilities. You'll have the opportunity to work on challenging problems, collaborate with talented peers, and make a significant impact on the reliability and efficiency of NVIDIA's cloud infrastructure. The position offers the flexibility of remote work while being part of a team that's pushing the boundaries of what's possible in cloud computing and AI.

Last updated 7 months ago

Responsibilities For Senior Software Engineer, Reliability and Operational Excellence - DGX Cloud

Design, build, deploy, and run internal tooling built on top of cloud infrastructure
Design, implement, ship, and maintain essential data pipelines for executive leadership
Integrate tooling with internal and customer workflows
Reduce the toil of running incidents, writing postmortems, and being on call
Evangelize sustainable, blameless incident prevention and incident response
Consult with peer teams on operations best practices

Requirements For Senior Software Engineer, Reliability and Operational Excellence - DGX Cloud

Python · TypeScript · Java · Kubernetes · Linux

BS degree in Computer Science or related technical field
5+ years of experience
Experience with infrastructure automation and distributed systems design
Experience in Python, Go, TypeScript, C/C++, or Java
In-depth knowledge of Linux, networking, storage, and containers
Track record of project initiation and collaboration

Benefits For Senior Software Engineer, Reliability and Operational Excellence - DGX Cloud

Equity
Benefits package