NVIDIA, the world leader in accelerated computing, is seeking a Senior Site Reliability Engineer to join its Observability and Telemetry Platform team. This role combines software and systems engineering practices to design, build, and maintain large-scale production systems. As an SRE at NVIDIA, you'll work to ensure that GPU cloud services remain as reliable as possible while enabling developers to implement changes efficiently. The position requires expertise in systems, networking, coding, database management, and cloud technologies such as Kubernetes and OpenStack.
The role focuses on eliminating manual work through automation, performance tuning, and system optimization. You'll be part of a diverse, intellectually curious team that values problem-solving and openness. The position offers the opportunity to work on meaningful projects with support and mentorship for continuous learning and growth.
Key responsibilities include designing and implementing large-scale observability platforms, managing the complete service lifecycle, and maintaining system health through monitoring and automation. The ideal candidate brings 5+ years of experience in infrastructure automation and observability platforms, strong programming skills in languages like Python or Go, and deep knowledge of Linux and containers.
NVIDIA offers a competitive compensation package with a base salary range of $148,000 - $419,750 USD, plus equity and benefits. The company is committed to fostering a diverse work environment and provides equal opportunities to all candidates. This role offers the flexibility of remote work as part of a team that's transforming industries through AI and digital twin technology.