Senior Software Engineer, Reliability and Operational Excellence - DGX Cloud

NVIDIA is the world leader in accelerated computing, pioneering it to tackle challenges no one else can solve.
Santa Clara, CA, USA
$148,000 - $276,000
DevOps
Senior Software Engineer
Hybrid
5,000+ Employees
5+ years of experience
AI · Enterprise SaaS

Description For Senior Software Engineer, Reliability and Operational Excellence - DGX Cloud

DGXC SRE at NVIDIA ensures that our internal- and external-facing GPU cloud services run with maximum reliability and uptime, as promised to users, while enabling developers to change the existing system through careful preparation and planning, keeping an eye on capacity, latency, and performance.

We are looking for systems and software engineers interested in building tooling, reporting, automation, and ML to enable operational excellence across a highly dynamic organization, solving technical problems that will improve the state of operations across many teams.

What you'll be doing:

  • Design, build, deploy, and run internal tooling built on top of cloud infrastructure to provide foundations for operational excellence.
  • Design, implement, ship, and maintain essential data pipelines used by executive leadership to decide on business priorities.
  • Integrate tooling with internal and customer workflows, along with cloud service providers, to streamline the incident management process.
  • Reduce the toil of running an incident, writing a postmortem, running an oncall, etc.
  • Evangelize sustainable, blameless incident prevention and incident response.
  • Consult with and advise peer teams on operations best practices.

What we need to see:

  • BS degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics), or equivalent experience.
  • 5+ years of experience.
  • A track record showing a good balance between initiating your own projects, convincing others to collaborate with you, and collaborating well on projects initiated by others.
  • Experience with infrastructure automation and distributed systems design, including developing tools for running large-scale private or public cloud systems in production.
  • Experience in one or more of the following: Python, Go, TypeScript, C/C++, Java.
  • In-depth knowledge of one or more of the following: Linux, Networking, Storage, and Containers.

Ways to stand out from the crowd:

  • Experience building and integrating with incident tooling such as FireHydrant, Rootly, incident.io, and Blameless.
  • Experience building plugins, templates, and entity schemas in Backstage.
  • Background with infrastructure technologies such as Kubernetes, Terraform, Docker, and Helm charts.
  • Experience with basic ML and data science concepts and tooling such as Hive, Apache Beam, and Apache Spark.
  • Experience with business analytics tooling such as Looker or Tableau.
  • Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive.

NVIDIA is widely considered to be one of the technology world's most desirable employers. We have some of the most forward-thinking and hard-working people in the world working for us. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services. If you're creative and self-motivated, we want to hear from you!

Responsibilities For Senior Software Engineer, Reliability and Operational Excellence - DGX Cloud

  • Design and build internal tooling for operational excellence
  • Implement and maintain data pipelines for executive decision-making
  • Integrate tooling with workflows to streamline incident management
  • Reduce operational toil
  • Evangelize sustainable incident prevention and response
  • Consult on operations best practices

Requirements For Senior Software Engineer, Reliability and Operational Excellence - DGX Cloud

Python
Go
TypeScript
Java
Kubernetes
Linux
  • BS degree in Computer Science or related technical field
  • 5+ years of experience
  • Experience with infrastructure automation and distributed systems design
  • Experience in Python, Go, TypeScript, C/C++, or Java
  • Knowledge of Linux, Networking, Storage, and Containers

Benefits For Senior Software Engineer, Reliability and Operational Excellence - DGX Cloud

  • Equity
  • Benefits
