Senior Production SRE Engineer - Storage

NVIDIA is the world leader in accelerated computing, pioneering solutions in AI and digital twins.
$148,000 - $339,250
Site Reliability
Senior Software Engineer
In-Person
5,000+ Employees
5+ years of experience
AI · Enterprise SaaS

Description For Senior Production SRE Engineer - Storage

NVIDIA, the world leader in accelerated computing, is seeking a Senior Production SRE Engineer for its Storage team. This role is integral to ensuring the reliability and performance of NVIDIA's GPU cloud services. Site Reliability Engineering at NVIDIA combines software and systems engineering practices to design, build, and maintain large-scale production systems with high efficiency and availability.

The position requires expertise in storage, data management, and cloud services, with a focus on maintaining system reliability while enabling continuous improvement. You'll work with cutting-edge AI/ML workloads, manage large-scale storage clusters, and implement sophisticated monitoring solutions. The role offers opportunities to work with state-of-the-art technology while solving complex challenges in system reliability and performance optimization.

As an SRE at NVIDIA, you'll be part of a diverse and collaborative environment that encourages intellectual curiosity and innovation. The company promotes a blame-free culture focused on learning and growth, offering opportunities to work on meaningful projects with significant impact. You'll be involved in everything from system design to production support, using AI/ML to scale systems and improve reliability.

The position offers competitive compensation, including a base salary range of $148,000 - $339,250, plus equity and comprehensive benefits. This is an excellent opportunity for experienced engineers passionate about large-scale systems, storage architecture, and site reliability engineering to join a technology leader that's transforming industries through AI and accelerated computing.

Last updated 10 days ago

Responsibilities For Senior Production SRE Engineer - Storage

  • Design, implement, and support large-scale storage clusters
  • Work with AI/ML workloads to capture and correlate behavior in large clusters
  • Improve service lifecycle from inception through deployment and refinement
  • Support services through system design consulting and capacity management
  • Monitor and maintain service availability, latency, and overall system health
  • Scale systems through AI/ML and automation
  • Practice sustainable incident response and blameless postmortems
  • Participate in on-call rotation for production systems

Requirements For Senior Production SRE Engineer - Storage

Skills: Python, Go, Linux, Kubernetes
  • BS degree in Computer Science or related technical field
  • 5+ years of practical experience
  • Experience with algorithms, data structures, and large-scale Linux systems
  • Experience in C/C++, Java, Python, Go, Perl or Ruby
  • Knowledge of infrastructure configuration management tools
  • Experience with observability tools like InfluxDB, Prometheus, and Elastic stack

Benefits For Senior Production SRE Engineer - Storage

  • Equity
  • Comprehensive benefits package
