NVIDIA, the world leader in accelerated computing, is seeking a Senior Site Reliability Engineer for its Infrastructure team. This role is part of NVIDIA's SRE discipline, which combines software and systems engineering practices to design, build, and maintain large-scale production systems. The position focuses on ensuring maximum reliability and uptime for NVIDIA's internal and external GPU cloud services.
As an SRE at NVIDIA, you'll work with cutting-edge technologies including Kubernetes and OpenStack, focusing on eliminating manual work through automation and performance tuning. The role demands expertise across systems, networking, coding, database management, and continuous delivery. You'll be responsible for maintaining high-efficiency production systems while enabling developers to implement changes safely.
The ideal candidate will have strong experience in infrastructure automation and distributed systems design, with expertise in a language such as Python or Go. The candidate will also need deep knowledge of Linux, networking, and containers, along with the ability to design and implement monitoring, logging, and alerting systems at scale.
NVIDIA offers a unique environment that values diversity, intellectual curiosity, and problem-solving in a blame-free setting. The company encourages self-direction while providing support and mentorship for professional growth. This is an excellent opportunity to join one of technology's most desirable employers and work on meaningful projects that impact the future of AI and accelerated computing.