Senior Site Reliability Engineer - Cloud

NVIDIA is the world leader in accelerated computing, pioneering solutions in AI and digital twins.
$136,000 - $322,000
Site Reliability
Senior Software Engineer
Remote
5+ years of experience
AI · Enterprise SaaS

Description For Senior Site Reliability Engineer - Cloud

NVIDIA is seeking a Senior Site Reliability Engineer to join its Cloud team. This role is at the intersection of software and systems engineering, focusing on designing and maintaining large-scale production systems. SRE at NVIDIA is a critical discipline that ensures both internal and external GPU cloud services maintain maximum reliability and uptime.

The position demands expertise across systems, networking, coding, database management, and cloud technologies. You'll be working with cutting-edge tools and technologies, including Kubernetes and OpenStack, to support NVIDIA's GPU cloud services. The role emphasizes automation, performance tuning, and system optimization to eliminate manual work and improve efficiency.

NVIDIA's SRE culture values diversity, intellectual curiosity, and problem-solving in a blame-free environment. The team promotes collaboration and risk-taking while providing support and mentorship for professional growth. You'll be part of a team that handles critical infrastructure supporting NVIDIA's AI and accelerated computing initiatives.

The role offers competitive compensation with a base salary range of $136,000 - $322,000 USD, plus equity and benefits. You'll have the opportunity to work with state-of-the-art technology while contributing to systems that power some of the most advanced AI and computing solutions in the world. The position offers flexibility with remote work options and the chance to work with a diverse team of talented engineers.

This is an excellent opportunity for experienced SREs who want to make a significant impact in a company that's at the forefront of AI and accelerated computing technology. You'll be challenged with complex problems, have opportunities for continuous learning, and work on systems that operate at massive scale.

Last updated 4 months ago

Responsibilities For Senior Site Reliability Engineer - Cloud

  • Design, implement and support operational and reliability aspects of large-scale Kubernetes clusters
  • Engage in service lifecycle from inception through deployment and refinement
  • Support services through system design consulting and launch reviews
  • Maintain services by monitoring availability, latency and system health
  • Scale systems through automation
  • Practice sustainable incident response and blameless postmortems
  • Participate in on-call rotation for production systems

Requirements For Senior Site Reliability Engineer - Cloud

Python
Go
Linux
Kubernetes
  • BS degree in Computer Science or related technical field
  • 5+ years of experience with infrastructure automation and distributed systems design
  • Experience in Python, Go, Perl or Ruby
  • In-depth knowledge of Linux, networking and containers
  • Strong problem-solving and communication skills
  • Experience with Kubernetes, OpenStack and Docker

Benefits For Senior Site Reliability Engineer - Cloud

  • Equity
