Manager, Site Reliability Engineer - GeForce Now Cloud

NVIDIA is the world leader in accelerated computing, tackling challenges no one else can solve.
Santa Clara, CA, USA
$220,000 - $419,750
DevOps
Staff Software Engineer
Hybrid
5,000+ Employees
8+ years of experience
AI · Enterprise SaaS

Description For Manager, Site Reliability Engineer - GeForce Now Cloud

NVIDIA is seeking a Manager for Site Reliability Engineering to build and lead its cloud service team for supporting, triaging, and building generative AI-powered visual applications. The role involves developing a team of SREs, nurturing a culture of collaboration and innovation, and being responsible for supporting groundbreaking Generative AI inferencing workloads in a globally distributed environment. Key responsibilities include collaborating with service owners and various teams, participating in on-call rotations, communicating service KPIs, and ensuring the implementation of security best practices.

The ideal candidate should have:

  • MS or PhD in engineering or computer science-related field or equivalent experience
  • 8+ years of experience in operating and owning end-to-end availability of critical services
  • 6+ years of technical leadership experience
  • Solid understanding of cloud technologies, containerization, and microservices architecture
  • Experience with SLOs/SLIs, error budgeting, and KPIs

Preferred skills include:

  • Experience with AI model deployments
  • Excellent coding skills in Python or Go
  • Understanding of Deep Learning / Machine Learning / AI
  • Experience with CUDA, PyTorch, TensorRT, TensorFlow, and/or Triton

NVIDIA offers competitive salaries and a generous benefits package, and is known as one of the most desirable employers in the technology industry. The company is at the forefront of Deep Learning, Artificial Intelligence, and Autonomous Vehicles, making this an exciting opportunity for creative engineers who enjoy autonomy and are passionate about technology.

Last updated 5 days ago

Responsibilities For Manager, Site Reliability Engineer - GeForce Now Cloud

  • Develop and lead a team of SREs
  • Support and work on Generative AI inferencing workloads
  • Collaborate with service owners, architecture, research, and tools teams
  • Participate in the on-call rotation
  • Communicate and report service KPIs, priorities, and issues to leadership
  • Ensure implementation of security best practices

Requirements For Manager, Site Reliability Engineer - GeForce Now Cloud

Kubernetes
Python
Go
  • MS or PhD in engineering or computer science-related field or equivalent experience
  • 8+ years of experience operating and owning end-to-end availability of critical services
  • 6+ years of technical leadership experience
  • Experience with cloud technologies (AWS/Azure/GCP/OCI)
  • Solid understanding of containerization and microservices architecture, K8s
  • Experience with SLOs/SLIs, error budgeting, and KPIs

Benefits For Manager, Site Reliability Engineer - GeForce Now Cloud

  • Equity
  • Medical Insurance
