Senior Production SRE Engineer - Storage

NVIDIA is the world leader in accelerated computing, pioneering solutions in AI and digital twins.
Salary: $148,000 - $339,250
Category: Site Reliability
Level: Senior Software Engineer
Work arrangement: In-Person
Experience: 5+ years
Industry: AI · Enterprise SaaS

Description For Senior Production SRE Engineer - Storage

NVIDIA is seeking a Senior Production SRE Engineer focused on storage systems to join its team. This role combines software engineering practices with systems operations to design, build, and maintain large-scale production systems. The position requires expertise in storage, data management, and cloud services, with a focus on ensuring high reliability and uptime for NVIDIA's GPU cloud services.

The role involves working with cutting-edge AI/ML workloads and managing large-scale storage clusters. You'll be part of a diverse and collaborative team that values intellectual curiosity and problem-solving. The position offers opportunities to work with advanced technologies while maintaining and scaling critical infrastructure.

Key responsibilities include designing and implementing storage solutions, working with AI/ML workloads, and ensuring system reliability through monitoring and automation. You'll collaborate with various teams, participate in on-call rotations, and contribute to system design and improvement initiatives.

The ideal candidate should have strong experience with Linux systems, programming languages like Python or Go, and infrastructure management tools. Knowledge of Kubernetes, containers, and observability tools is highly valued. NVIDIA offers competitive compensation, including a base salary range of $148,000 - $339,250, plus equity and benefits.

This role is perfect for someone who enjoys tackling complex technical challenges, has a strong SRE mindset, and wants to work at the intersection of storage systems and AI technology. Join NVIDIA to be part of a team that's driving innovation in accelerated computing and transforming major industries through AI and digital twins.

Responsibilities For Senior Production SRE Engineer - Storage

  • Design, implement, and support large-scale storage clusters, including monitoring, logging, and alerting
  • Work with AI/ML workloads to capture and correlate behavior in large clusters
  • Improve service lifecycle from inception through deployment and refinement
  • Support services through system design consulting and capacity management
  • Maintain service availability, latency, and system health
  • Scale systems through AI/ML and automation
  • Practice sustainable incident response and blameless postmortems
  • Participate in on-call rotation

Requirements For Senior Production SRE Engineer - Storage

Skills: Python, Go, Linux, Kubernetes

  • BS degree in Computer Science or related technical field
  • 5+ years practical experience
  • Experience with algorithms, data structures, and large-scale Linux systems
  • Experience in C/C++, Java, Python, Go, Perl or Ruby
  • Knowledge of infrastructure configuration management tools
  • Experience with observability tools like InfluxDB, Prometheus, and Elastic stack

Benefits For Senior Production SRE Engineer - Storage

  • Equity
