NVIDIA is seeking a Senior Site Reliability Engineer to join its Observability and Telemetry Platform team. Site reliability engineering (SRE) at NVIDIA is a specialized discipline that combines software and systems engineering practices to design, build, and maintain large-scale production systems. The role focuses on keeping GPU cloud services reliable and available while making system changes and optimizations efficient to carry out.
The position requires expertise in infrastructure automation, distributed systems, and observability platforms. You'll work with Kubernetes, OpenStack, and observability tools such as Grafana and Prometheus. The role involves designing and implementing large-scale observability solutions, maintaining service reliability, and participating in on-call rotations.
As an SRE at NVIDIA, you'll be part of a diverse, intellectually curious team that values problem-solving and openness. The company promotes self-direction and provides support for learning and growth. You'll contribute to NVIDIA's mission as the world leader in accelerated computing, working on systems that transform industries through AI and digital twins.
The role offers a base salary range of $144,000 to $270,250 USD, plus equity and benefits. You'll work with a team that emphasizes continuous improvement, automation, and proactive system optimization. The position combines technical depth with the chance to shape critical infrastructure that supports NVIDIA's technology.