Senior Site Reliability Engineer - Observability and Telemetry Platform

NVIDIA

NVIDIA is the world leader in accelerated computing, pioneering solutions in AI and digital twins.

San Francisco, CA, USA

$144,000 - $270,250

Site Reliability

Senior Software Engineer

Hybrid

5+ years of experience

AI · Enterprise SaaS

Description For Senior Site Reliability Engineer - Observability and Telemetry Platform

NVIDIA is seeking a Senior Site Reliability Engineer to join its Observability and Telemetry Platform team. SRE at NVIDIA is a specialized discipline that combines software and systems engineering practices to design, build, and maintain large-scale production systems. The role focuses on ensuring maximum reliability and uptime for GPU cloud services while enabling efficient system changes and optimizations.

The position requires expertise in infrastructure automation, distributed systems, and observability platforms. You'll work with cutting-edge technologies including Kubernetes, OpenStack, and various observability tools like Grafana and Prometheus. The role involves designing and implementing large-scale observability solutions, maintaining service reliability, and participating in on-call rotations.

As an SRE at NVIDIA, you'll be part of a diverse, intellectually curious team that values problem-solving and openness. The company promotes self-direction and provides support for learning and growth. You'll contribute to NVIDIA's mission as the world leader in accelerated computing, working on systems that transform industries through AI and digital twins.

The role offers competitive compensation with a base salary range of $144,000 - $270,250 USD, plus equity and benefits. You'll have the opportunity to work with a team that emphasizes continuous improvement, automation, and proactive system optimization. The position combines technical depth with the chance to impact critical infrastructure supporting NVIDIA's innovative technology solutions.

Last updated 2 months ago

Responsibilities For Senior Site Reliability Engineer - Observability and Telemetry Platform

Design, implement, and support the operational and reliability aspects of a large-scale observability and telemetry collection platform
Engage in and improve the whole lifecycle of services—from inception and design through deployment, operation and refinement
Support services before they go live through system design consulting and tools development
Maintain services by measuring and monitoring availability, latency and system health
Scale systems through automation and evolve systems for improved reliability
Practice sustainable incident response and blameless postmortems
Be part of an on-call rotation to support production systems

Requirements For Senior Site Reliability Engineer - Observability and Telemetry Platform

Python

Linux

Kubernetes

BS degree in Computer Science or related technical field involving coding
5+ years of experience with infrastructure automation and distributed systems design
5+ years of experience delivering foundational infrastructure and observability platforms
Experience in Python, Go, Perl or Ruby
In-depth knowledge of Linux, networking, and containers
Experience with Grafana, OpenTelemetry, Prometheus, and similar observability tools

Benefits For Senior Site Reliability Engineer - Observability and Telemetry Platform

Equity
Benefits package
