NVIDIA, the world leader in accelerated computing, is seeking a Senior Production SRE Engineer for its Storage team. This role is integral to ensuring the reliability and performance of NVIDIA's GPU cloud services. Site Reliability Engineering at NVIDIA combines software and systems engineering practices to design, build, and maintain large-scale production systems with high efficiency and availability.
The position requires expertise in storage, data management, and cloud services, with a focus on maintaining system reliability while enabling continuous improvement. You'll work with cutting-edge AI/ML workloads, manage large-scale storage clusters, and implement sophisticated monitoring solutions. The role offers opportunities to work with state-of-the-art technology while solving complex challenges in system reliability and performance optimization.
As an SRE at NVIDIA, you'll be part of a diverse and collaborative environment that encourages intellectual curiosity and innovation. The company promotes a blame-free culture focused on learning and growth, offering opportunities to work on meaningful projects with significant impact. You'll be involved in everything from system design to production support, using AI/ML to scale systems and improve reliability.
The position offers competitive compensation, including a base salary range of $148,000 to $339,250, plus equity and comprehensive benefits. This is an excellent opportunity for experienced engineers passionate about large-scale systems, storage architecture, and site reliability engineering to join a technology leader that is transforming industries through AI and accelerated computing.