Senior Site Reliability Engineer - Observability and Telemetry Platform

NVIDIA is the world leader in accelerated computing, pioneering solutions in AI and digital twins.
$144,000 - $270,250
Site Reliability
Senior Software Engineer
Hybrid
5+ years of experience
AI · Enterprise SaaS

Description For Senior Site Reliability Engineer - Observability and Telemetry Platform

NVIDIA is seeking a Senior Site Reliability Engineer to join its Observability and Telemetry Platform team. SRE at NVIDIA is a specialized discipline that combines software and systems engineering practices to design, build, and maintain large-scale production systems. The role focuses on ensuring maximum reliability and uptime for GPU cloud services while enabling efficient system changes and optimizations.

The position requires expertise in infrastructure automation, distributed systems, and observability platforms. You'll work with technologies including Kubernetes and OpenStack, and with observability tools such as Grafana and Prometheus. The role involves designing and implementing large-scale observability solutions, maintaining service reliability, and participating in on-call rotations.

As an SRE at NVIDIA, you'll be part of a diverse, intellectually curious team that values problem-solving and openness. The company promotes self-direction and provides support for learning and growth. You'll contribute to NVIDIA's mission as the world leader in accelerated computing, working on systems that transform industries through AI and digital twins.

The role offers competitive compensation with a base salary range of $144,000 - $270,250 USD, plus equity and benefits. You'll have the opportunity to work with a team that emphasizes continuous improvement, automation, and proactive system optimization. The position combines technical depth with the chance to impact critical infrastructure supporting NVIDIA's innovative technology solutions.

Responsibilities For Senior Site Reliability Engineer - Observability and Telemetry Platform

  • Design, implement, and support the operational and reliability aspects of a large-scale observability and telemetry collection platform
  • Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement
  • Support services before they go live through system design consulting and tools development
  • Maintain services by measuring and monitoring availability, latency and system health
  • Scale systems through automation and evolve systems for improved reliability
  • Practice sustainable incident response and blameless postmortems
  • Be part of an on-call rotation to support production systems

Requirements For Senior Site Reliability Engineer - Observability and Telemetry Platform

Python · Go · Linux · Kubernetes
  • BS degree in Computer Science or related technical field involving coding
  • 5+ years of experience with infrastructure automation and distributed systems design
  • 5+ years of experience delivering foundational infrastructure and observability platforms
  • Experience in Python, Go, Perl or Ruby
  • In-depth knowledge of Linux, networking, and containers
  • Experience with Grafana, OpenTelemetry, Prometheus, and similar observability tools

Benefits For Senior Site Reliability Engineer - Observability and Telemetry Platform

  • Equity
  • Benefits package

Jobs Related To NVIDIA Senior Site Reliability Engineer - Observability and Telemetry Platform

Senior Site Reliability Engineer

Senior Site Reliability Engineer role at NVIDIA, focusing on supporting and scaling generative AI applications across global infrastructure.

Senior Site Reliability Engineer - Cloud

Senior Site Reliability Engineer position at NVIDIA focusing on cloud infrastructure, Kubernetes, and maintaining high-reliability systems for GPU cloud services.

Senior Site Reliability Engineer - GPU Clusters

Senior Site Reliability Engineer position at NVIDIA focusing on GPU cluster management for AI workloads, offering competitive compensation and the opportunity to work with cutting-edge technology.

Senior Site Reliability Engineer - AI Research Clusters

Senior Site Reliability Engineer position at NVIDIA focusing on AI research clusters, requiring 5+ years of experience in large-scale infrastructure and GPU computing.

Senior SRE Software Engineer, Storage and Data

Senior SRE position at NVIDIA focusing on storage infrastructure reliability and performance optimization for DGX Cloud platform.