Senior Site Reliability Engineering - Infrastructure

NVIDIA

NVIDIA is the world leader in accelerated computing, pioneering solutions in AI and digital twins.

London, UK

$150,000 - $250,000

Site Reliability

Senior Software Engineer

Remote

5,000+ Employees

5+ years of experience

Enterprise SaaS · AI

Description For Senior Site Reliability Engineering - Infrastructure

NVIDIA, the world leader in accelerated computing, is seeking a Senior Site Reliability Engineer for its Infrastructure team. This role is part of NVIDIA's SRE discipline, which combines software and systems engineering practices to design, build, and maintain large-scale production systems. The position focuses on ensuring maximum reliability and uptime for NVIDIA's internal and external GPU cloud services.

As an SRE at NVIDIA, you'll work with cutting-edge technologies including Kubernetes and OpenStack, focusing on eliminating manual work through automation and performance tuning. The role demands expertise across systems, networking, coding, database management, and continuous delivery. You'll be responsible for maintaining high-efficiency production systems while enabling developers to implement changes safely.

The ideal candidate will have strong experience in infrastructure automation and distributed systems design, with expertise in languages like Python or Go. You'll need deep knowledge of Linux, networking, and containers, along with the ability to design and implement monitoring, logging, and alerting systems at scale.

NVIDIA offers a unique environment that values diversity, intellectual curiosity, and problem-solving in a blame-free setting. The company encourages self-direction while providing support and mentorship for professional growth. This is an excellent opportunity to join one of technology's most desirable employers and work on meaningful projects that impact the future of AI and accelerated computing.

Last updated a month ago

Responsibilities For Senior Site Reliability Engineering - Infrastructure

Design, implement and support operational and reliability aspects of large-scale Kubernetes clusters
Engage in service lifecycle from inception through deployment and refinement
Support services through system design consulting and launch reviews
Maintain services by monitoring availability, latency and system health
Scale systems through automation
Practice sustainable incident response and blameless postmortems
Participate in on-call rotation for production systems

Requirements For Senior Site Reliability Engineering - Infrastructure

Python

Linux

Kubernetes

BS degree in Computer Science or related technical field
5+ years of experience with infrastructure automation and distributed systems design
Experience with Python, Go, Perl or Ruby
In-depth knowledge of Linux, networking and containers
Experience with Kubernetes, OpenStack and Docker
Strong communication skills and systematic problem-solving approach
Ability to debug and optimize code and automate routine tasks

Jobs Related To NVIDIA Senior Site Reliability Engineering - Infrastructure

Senior Site Reliability Engineer - Observability and Telemetry Platform

NVIDIA

Senior SRE role at NVIDIA focusing on observability and telemetry platforms, offering competitive compensation and the opportunity to work with cutting-edge cloud technologies.

Senior Site Reliability Engineer

NVIDIA

Senior Site Reliability Engineer role at NVIDIA, focusing on supporting and scaling generative AI applications across global infrastructure.

Senior Site Reliability Engineer - Cloud

NVIDIA

Senior Site Reliability Engineer position at NVIDIA focusing on cloud infrastructure, Kubernetes, and maintaining high-reliability systems for GPU cloud services.

Senior Site Reliability Engineer - GPU Clusters

NVIDIA

Senior Site Reliability Engineer position at NVIDIA focusing on GPU cluster management for AI workloads, offering competitive compensation and the opportunity to work with cutting-edge technology.

Senior Site Reliability Engineer - AI Research Clusters

NVIDIA

Senior Site Reliability Engineer position at NVIDIA focusing on AI research clusters, requiring 5+ years of experience in large-scale infrastructure and GPU computing.