Senior Manager - Storage Production Engineering and SRE

Technology company that invented the GPU, revolutionized computer graphics, and leads in AI computing.
$272,000 - $419,750
Site Reliability
Staff Software Engineer
In-Person
10+ years of experience
AI · Enterprise SaaS

Description For Senior Manager - Storage Production Engineering and SRE

NVIDIA, a pioneer in GPU technology and AI computing, is seeking a Senior Manager for its Storage Production Engineering and SRE team. This role combines technical leadership with people management, focusing on designing and maintaining large-scale production systems with an emphasis on storage solutions. The position requires expertise in cloud-scale storage, data management, and modern technologies like Kubernetes and containerization.

The role involves leading a team dedicated to ensuring the reliability and performance of NVIDIA's GPU cloud services, both internal and external. You'll be responsible for implementing storage solutions, managing incidents, conducting capacity planning, and driving automation initiatives. The position requires a blend of technical expertise in storage systems and leadership capabilities to guide a team of SRE professionals.

As a senior leader, you'll work closely with cross-functional teams to optimize storage systems, implement best practices, and ensure seamless integration with other technology stacks. The role demands a deep understanding of cloud storage solutions, including file, block, and object storage, and experience with platforms like AWS S3 and Azure Blob Storage.

NVIDIA offers a competitive compensation package, including a base salary range of $272,000 - $419,750, plus equity benefits. The company is known for its innovative culture and commitment to pushing technological boundaries in AI and deep learning. This role presents an opportunity to work with cutting-edge technology while leading a team that directly impacts the company's cloud infrastructure reliability and performance.

The ideal candidate will bring 10+ years of relevant experience, including 5+ years in management, along with a strong technical background in storage systems and SRE principles. This position offers the chance to work at one of technology's most desirable employers, contributing to groundbreaking developments in AI and computing technology.

Last updated a day ago

Responsibilities For Senior Manager - Storage Production Engineering and SRE

  • Lead and mentor a team of Storage SRE professionals
  • Formulate and execute strategic initiatives for storage systems
  • Supervise planning and enhancement of storage solutions
  • Oversee incident response and resolution for storage-related issues
  • Conduct capacity planning and storage demand forecasting
  • Drive automation initiatives for storage operations
  • Implement continuous improvement processes
  • Collaborate with cross-functional teams for system optimization

Requirements For Senior Manager - Storage Production Engineering and SRE

Kubernetes
  • 10+ years overall experience and 5+ years of management experience
  • Master's degree in Computer Science, IT, or a related field, or equivalent experience
  • In-depth knowledge of storage technologies and cloud-based storage solutions
  • Strong leadership and people management skills
  • Exceptional analytical and problem-solving skills
  • Prior engineering experience with hands-on coding background in storage systems
  • Proficiency in scripting and automation tools

Benefits For Senior Manager - Storage Production Engineering and SRE

  • Equity
