Senior Site Reliability Engineer - Observability and Telemetry Platform

NVIDIA is the world leader in accelerated computing, pioneering solutions in AI and digital twins.
$148,000 - $419,750
Site Reliability
Senior Software Engineer
Remote
5+ years of experience
AI · Enterprise SaaS

Description For Senior Site Reliability Engineer - Observability and Telemetry Platform

NVIDIA, the world leader in accelerated computing, is seeking a Senior Site Reliability Engineer to join its Observability and Telemetry Platform team. This role combines software and systems engineering practices to design, build, and maintain large-scale production systems. As an SRE at NVIDIA, you'll work on ensuring GPU cloud services maintain maximum reliability while enabling developers to implement changes efficiently. The position requires expertise in systems, networking, coding, database management, and cloud technologies like Kubernetes and OpenStack.

The role emphasizes automation, performance tuning, and system optimization, with a focus on eliminating manual work. You'll be part of a diverse, intellectually curious team that values problem-solving and openness. The position offers opportunities to work on meaningful projects with support and mentorship for continuous learning and growth.

Key responsibilities include designing and implementing large-scale observability platforms, managing service lifecycles, and maintaining system health through monitoring and automation. The ideal candidate will have 5+ years of experience in infrastructure automation and observability platforms, strong programming skills in languages like Python or Go, and deep knowledge of Linux and containerization.

NVIDIA offers a competitive compensation package with a base salary range of $148,000 - $419,750 USD, plus equity benefits. The company promotes a blame-free environment that encourages collaboration, innovation, and risk-taking. This is an excellent opportunity for experienced engineers passionate about reliability, scalability, and system optimization to join a leading technology company transforming industries through AI and digital twins.

Responsibilities For Senior Site Reliability Engineer - Observability and Telemetry Platform

  • Design, implement and support the operational and reliability aspects of a large-scale observability and telemetry collection platform
  • Engage in and improve the whole lifecycle of services—from inception and design through deployment, operation and refinement
  • Support services before they go live through system design consulting and tools development
  • Maintain services by measuring and monitoring availability, latency and system health
  • Scale systems through automation and evolve systems for improved reliability
  • Practice sustainable incident response and blameless postmortems
  • Be part of an on-call rotation to support production systems

Requirements For Senior Site Reliability Engineer - Observability and Telemetry Platform

Python · Go · Linux · Kubernetes
  • BS degree in Computer Science or a related technical field
  • 5+ years of experience with infrastructure automation and distributed systems design
  • 5+ years of experience delivering foundational infrastructure and observability platforms
  • Experience in Python, Go, Perl or Ruby
  • In-depth knowledge of Linux, networking and containers

Benefits For Senior Site Reliability Engineer - Observability and Telemetry Platform

  • Equity

Jobs Related To NVIDIA Senior Site Reliability Engineer - Observability and Telemetry Platform

Senior Production SRE Engineer - Storage

Senior Production SRE Engineer position at NVIDIA focusing on storage systems, requiring 5+ years experience and expertise in large-scale system reliability and automation.

Senior Site Reliability Engineer - AI Research Clusters

Senior Site Reliability Engineer for AI Research Clusters at NVIDIA, designing and implementing GPU compute clusters for AI research.

Senior Site Reliability Engineer - GPU Clusters

NVIDIA is seeking a Senior Site Reliability Engineer to lead the design, deployment, and management of large-scale GPU clusters for AI workloads.

Senior Site Reliability Engineer, Data Science and ML Platforms

Senior Site Reliability Engineer for NVIDIA's Data Science & ML Platforms team, focusing on large-scale production systems and SRE practices.

Senior Site Reliability Engineer - DGX Cloud

Senior Site Reliability Engineer role at NVIDIA, focusing on DGX Cloud infrastructure and large-scale Kubernetes clusters.