Senior Software Engineer, Reliability and Operational Excellence - DGX Cloud

NVIDIA is the world leader in accelerated computing, pioneering AI and digital twins technology.
$148,000 - $276,000
Cloud
Senior Software Engineer
Remote
5+ years of experience
AI · Enterprise SaaS

Description For Senior Software Engineer, Reliability and Operational Excellence - DGX Cloud

NVIDIA is seeking a Senior Software Engineer to join its DGX Cloud team, focusing on reliability and operational excellence. This role is critical in ensuring maximum reliability and uptime for NVIDIA's internal and external GPU cloud services. The position combines systems engineering with tooling development, making it a good fit for engineers passionate about cloud infrastructure and operational efficiency.

The role involves designing and implementing tools and automation systems that will form the foundation of NVIDIA's operational excellence. You'll be working with cloud infrastructure, building data pipelines for executive decision-making, and streamlining incident management processes. This position offers the opportunity to work with cutting-edge technology in AI and high-performance computing.

The ideal candidate will bring strong experience in cloud infrastructure, distributed systems, and programming languages like Python, Go, or TypeScript. Knowledge of Linux systems, containers, and modern DevOps practices is essential. You'll be joining a company at the forefront of AI and accelerated computing innovation, with the opportunity to impact how cloud services are delivered and maintained at scale.

NVIDIA offers competitive compensation, including a base salary range of $148,000 to $276,000, plus equity benefits. The company is known for its innovative culture and commitment to pushing technological boundaries. As part of the DGX Cloud team, you'll be working on systems that power some of the most advanced AI and computing solutions in the industry.

Last updated a day ago

Responsibilities For Senior Software Engineer, Reliability and Operational Excellence - DGX Cloud

  • Design, build, deploy, and run internal tooling built on top of cloud infrastructure
  • Design, implement, ship, and maintain essential data pipelines for executive leadership
  • Integrate tooling with internal and customer workflows
  • Reduce operational toil in incident management
  • Evangelize sustainable blameless incident prevention and response
  • Provide consultation for peer teams on operations best practices

Requirements For Senior Software Engineer, Reliability and Operational Excellence - DGX Cloud

Python
Go
TypeScript
Java
Kubernetes
Linux
  • BS degree in Computer Science or related technical field
  • 5+ years of experience
  • Experience with infrastructure automation and distributed systems design
  • Experience in Python, Go, TypeScript, C/C++, or Java
  • In-depth knowledge of Linux, networking, storage, and containers
  • Track record of project initiation and collaboration

Benefits For Senior Software Engineer, Reliability and Operational Excellence - DGX Cloud

  • Equity

Jobs Related To NVIDIA Senior Software Engineer, Reliability and Operational Excellence - DGX Cloud

Senior DGX Cloud Software Engineer - Infrastructure Automation and Distributed Systems

Senior Cloud Engineer role at NVIDIA focusing on infrastructure automation and distributed systems, offering competitive compensation and the opportunity to work with cutting-edge technology.

Senior System Software Engineer - Scientific Computing PaaS

Senior System Software Engineer position at NVIDIA focusing on building a scientific computing platform on DGX Cloud, requiring expertise in cloud computing and distributed systems.

Senior Software Engineer, Kubernetes - DGX Cloud

Senior Software Engineer position at NVIDIA focusing on Kubernetes and GPU infrastructure for DGX Cloud, offering a competitive salary and the opportunity to work with cutting-edge AI technology.

Senior Software Engineer, Bare Metal Automation - DGX Cloud

Senior Software Engineer position at NVIDIA focusing on bare metal automation for DGX Cloud, managing large-scale GPU clusters for AI workloads.

Senior Software Engineer - HPC

Senior Software Engineer position at NVIDIA focusing on HPC infrastructure, requiring 10+ years of experience in designing and implementing large-scale distributed systems.