Manager, Site Reliability Engineer - GeForce Now Cloud

NVIDIA

NVIDIA is the world leader in accelerated computing, a field it pioneered to tackle challenges no one else can solve.

Santa Clara, CA, USA

$220,000 - $419,750

DevOps

Staff Software Engineer

Hybrid

5,000+ Employees

8+ years of experience

AI · Enterprise SaaS

Description For Manager, Site Reliability Engineer - GeForce Now Cloud

NVIDIA is seeking a Manager of Site Reliability Engineering to build and lead the cloud service team that supports, triages, and builds generative AI-powered visual applications. The role involves developing a team of SREs, nurturing a culture of collaboration and innovation, and supporting groundbreaking generative AI inferencing workloads in a globally distributed environment. Key responsibilities include collaborating with service owners and other teams, participating in on-call rotations, communicating service KPIs, and ensuring that security best practices are implemented.

The ideal candidate should have:

MS or PhD in engineering or computer science-related field or equivalent experience
8+ years of experience in operating and owning end-to-end availability of critical services
6+ years of technical leadership experience
Solid understanding of cloud technologies, containerization, and microservices architecture
Experience with SLO/SLIs, error budgeting, and KPIs

Preferred skills include:

Experience with AI model deployments
Excellent coding skills in Python or Go
Understanding of Deep Learning / Machine Learning / AI
Experience with CUDA, PyTorch, TensorRT, TensorFlow, and/or Triton

NVIDIA offers competitive salaries and a generous benefits package, and is known as one of the most desirable employers in the technology industry. The company is at the forefront of Deep Learning, Artificial Intelligence, and Autonomous Vehicles, making this an exciting opportunity for creative engineers who enjoy autonomy and are passionate about technology.

Last updated 5 days ago

Responsibilities For Manager, Site Reliability Engineer - GeForce Now Cloud

Develop and lead a team of SREs
Support and work on Generative AI inferencing workloads
Collaborate with service owners, architecture, research, and tools teams
Participate in on-call rotation
Communicate and report service KPIs, priorities, and issues to leadership
Ensure implementation of security best practices

Requirements For Manager, Site Reliability Engineer - GeForce Now Cloud

Kubernetes

Python

MS or PhD in engineering or computer science-related field or equivalent experience
8+ years of experience operating & owning end-to-end availability of critical services
6+ years of technical leadership experience
Experience with cloud technologies (AWS/Azure/GCP/OCI)
Solid understanding of containerization, microservices architecture, and Kubernetes (K8s)
Experience with SLO/SLIs, error budgeting, and KPIs

Benefits For Manager, Site Reliability Engineer - GeForce Now Cloud

Equity

Medical Insurance


