Senior Production Engineer - Storage

NVIDIA is the world leader in accelerated computing, pioneering solutions in AI and digital twins.
$148,000 - $356,500
DevOps
Senior Software Engineer
In-Person
5+ years of experience
AI · Enterprise SaaS

Description For Senior Production Engineer - Storage

NVIDIA is seeking a Senior Production Engineer to join the Storage team within its Site Reliability Engineering (SRE) organization. This role combines software engineering practices with systems operations to build and maintain large-scale production systems. The position focuses on ensuring reliable storage solutions and managing data efficiently for NVIDIA's GPU cloud services.

The role requires expertise across several domains, including systems, networking, storage, coding, and database management. You'll work with technologies such as Kubernetes, containers, and virtualization while ensuring high reliability and uptime for both internal and external-facing services.

As a Senior Production Engineer, you'll be responsible for designing and implementing large-scale storage clusters, working with AI/ML workloads, and improving service lifecycles. The role involves hands-on work with monitoring systems, automation, and performance optimization. You'll be part of a diverse team that values intellectual curiosity and problem-solving in a blame-free environment.

Key responsibilities include supporting services before they go live through system design consulting, developing software frameworks, and managing capacity. You'll maintain services by monitoring availability and system health, often leveraging machine learning models. The role also requires participating in an on-call rotation and practicing sustainable incident response.

The ideal candidate will have strong experience with Linux systems, infrastructure configuration management tools, and observability solutions. You'll need to demonstrate excellent debugging skills and thrive in collaborative environments. This position offers competitive compensation, including equity, and the opportunity to work with some of the most forward-thinking professionals in technology.

NVIDIA's culture promotes self-direction and provides support for learning and growth. You'll be part of an organization that brings together people with diverse backgrounds and perspectives, encouraging collaboration and innovation. This role offers the chance to work on meaningful projects while contributing to the reliability and efficiency of NVIDIA's critical storage infrastructure.

Last updated 3 days ago

Responsibilities For Senior Production Engineer - Storage

  • Design, implement, and support large-scale storage clusters
  • Work with AI/ML workloads to analyze behavior in large clusters
  • Improve service lifecycle from design through deployment and refinement
  • Support services through system design consulting and capacity management
  • Monitor and maintain service availability, latency, and system health
  • Implement automation and machine learning models for system scaling
  • Participate in on-call rotation for production systems
  • Practice sustainable incident response and blameless postmortems

Requirements For Senior Production Engineer - Storage

Python · Go · Linux · Kubernetes
  • BS degree in Computer Science or related technical field
  • 5+ years of practical experience
  • Experience with algorithms, data structures, and large-scale Linux systems
  • Experience in C/C++, Java, Python, Go, Perl, or Ruby
  • Knowledge of infrastructure tools like Ansible, Chef, Puppet, and Terraform
  • Experience with observability tools like InfluxDB, Prometheus, and the Elastic Stack

Benefits For Senior Production Engineer - Storage

  • Equity
