Senior Site Reliability Engineer - Observability and Telemetry Platform

NVIDIA is the world leader in accelerated computing, pioneering solutions in AI and digital twins.
$148,000 - $419,750
Site Reliability
Senior Software Engineer
Remote
5+ years of experience
AI · Enterprise SaaS

Description For Senior Site Reliability Engineer - Observability and Telemetry Platform

NVIDIA, the world leader in accelerated computing, is seeking a Senior Site Reliability Engineer to join its Observability and Telemetry Platform team. This role combines software and systems engineering practices to design, build, and maintain large-scale production systems. As an SRE at NVIDIA, you'll work to keep GPU cloud services highly reliable while enabling developers to ship changes efficiently. The position requires expertise in systems, networking, coding, database management, and cloud technologies like Kubernetes and OpenStack.

The role focuses on eliminating manual work through automation, performance tuning, and system optimization. You'll be part of a diverse, intellectually curious team that values problem-solving and openness. The position offers the opportunity to work on meaningful projects with support and mentorship for continuous learning and growth.

Key responsibilities include designing and implementing large-scale observability platforms, managing the complete service lifecycle, and maintaining system health through monitoring and automation. The ideal candidate brings 5+ years of experience in infrastructure automation and observability platforms, strong programming skills in languages like Python or Go, and deep knowledge of Linux and containers.

NVIDIA offers a competitive compensation package with a base salary range of $148,000 - $419,750 USD, plus equity and benefits. The company is committed to fostering a diverse work environment and provides equal opportunities to all candidates. This role offers the flexibility of remote work while being part of a team that's transforming industries through AI and digital twin technology.

Last updated 10 days ago

Responsibilities For Senior Site Reliability Engineer - Observability and Telemetry Platform

  • Design, implement, and support the operational and reliability aspects of a large-scale observability and telemetry collection platform
  • Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement
  • Support services before they go live through system design consulting and tools development
  • Maintain services by measuring and monitoring availability, latency and system health
  • Scale systems through automation and evolve systems for improved reliability
  • Practice sustainable incident response and blameless postmortems
  • Be part of an on-call rotation to support production systems

Requirements For Senior Site Reliability Engineer - Observability and Telemetry Platform

Python, Go, Linux, Kubernetes
  • BS degree in Computer Science or a related technical field
  • 5+ years of experience with infrastructure automation and distributed systems design
  • 5+ years of experience delivering foundational infrastructure and observability platforms
  • Experience in Python, Go, Perl, or Ruby
  • In-depth knowledge of Linux, networking, and containers

Benefits For Senior Site Reliability Engineer - Observability and Telemetry Platform

  • Equity
  • Benefits package
