Senior Site Reliability Engineer - Cloud

World leader in accelerated computing, pioneering AI and digital twins technology to transform industries.
$132,000 - $310,500
Site Reliability
Senior Software Engineer
Remote
5+ years of experience
AI · Enterprise SaaS · Cloud

Description For Senior Site Reliability Engineer - Cloud

NVIDIA, the world leader in accelerated computing, is seeking a Senior Site Reliability Engineer to join its Cloud team. This role sits at the intersection of software and systems engineering, focusing on designing and maintaining large-scale production systems. The SRE team at NVIDIA ensures maximum reliability and uptime for both internal and external GPU cloud services.

The position offers an opportunity to work with cutting-edge technology in a culture that values diversity, intellectual curiosity, and problem-solving. You'll be responsible for managing large-scale Kubernetes clusters, implementing monitoring solutions, and ensuring system reliability through automation and proactive maintenance.

As an SRE at NVIDIA, you'll be part of a team that encourages collaboration, big thinking, and risk-taking in a blame-free environment. The role combines hands-on technical work with strategic system design, offering a perfect balance for those interested in both infrastructure and software development. You'll work with various tools and technologies, including Python, Go, Linux, and Kubernetes, while contributing to systems that power NVIDIA's AI and cloud initiatives.

The position offers a competitive salary range of $132,000 to $310,500, along with equity and comprehensive benefits. This is an excellent opportunity for experienced engineers who want to impact the future of cloud computing and AI infrastructure while working for a technology leader.

Last updated 2 days ago

Responsibilities For Senior Site Reliability Engineer - Cloud

  • Design, implement, and support operational and reliability aspects of large-scale Kubernetes clusters
  • Engage in the service lifecycle from inception through deployment and refinement
  • Support services through system design consulting and launch reviews
  • Maintain services by monitoring availability, latency and system health
  • Scale systems through automation and improve reliability
  • Practice sustainable incident response and blameless postmortems
  • Participate in on-call rotation to support production systems

Requirements For Senior Site Reliability Engineer - Cloud

Python
Go
Linux
Kubernetes
  • BS degree in Computer Science or related technical field involving coding
  • 5+ years of experience with infrastructure automation and distributed systems design
  • Experience with Python, Go, Perl or Ruby
  • In-depth knowledge of Linux, networking, and containers
  • Systematic problem-solving approach with strong communication skills
  • Experience using or running large private and public cloud systems

Benefits For Senior Site Reliability Engineer - Cloud

  • Equity
  • Comprehensive benefits package

Jobs Related To NVIDIA Senior Site Reliability Engineer - Cloud

Senior SRE Software Engineer, Storage and Data

Senior SRE position at NVIDIA focusing on storage infrastructure reliability and performance optimization for DGX Cloud platform.

SRE Software Engineer, Storage and Data

Senior SRE position at NVIDIA focusing on storage infrastructure reliability and performance optimization for GPU cloud platforms.

Senior Site Reliability Engineer - Observability and Telemetry Platform

Senior SRE position at NVIDIA focusing on observability and telemetry platforms, offering competitive salary, equity, and remote work options.

Senior Site Reliability Engineer - DGX Cloud

Senior SRE position at NVIDIA focusing on DGX Cloud infrastructure, offering competitive compensation and opportunity to work with cutting-edge cloud technologies.

Senior Site Reliability Engineer - DGX Cloud

Senior SRE position at NVIDIA focusing on DGX Cloud infrastructure, offering competitive compensation and opportunity to work with cutting-edge AI technology.