SRE Software Engineer, Storage and Data

NVIDIA is a technology company developing GPU-based solutions for gaming, AI, and cloud computing platforms.
Site Reliability
Senior Software Engineer
In-Person
5+ years of experience
AI · Enterprise SaaS

Description For SRE Software Engineer, Storage and Data

NVIDIA is seeking a Site Reliability Engineer (SRE) to ensure the reliability and performance of its DGX Cloud platform. This role focuses on maintaining and optimizing the storage infrastructure behind NVIDIA's GPU cloud platforms. The position combines traditional SRE responsibilities with storage system expertise, requiring both technical depth and cross-functional collaboration.

As an SRE at NVIDIA, you'll be responsible for designing, implementing, and maintaining scalable storage solutions that support mission-critical applications. The role emphasizes automation, performance optimization, and system reliability, with opportunities to work on cutting-edge AI/ML workloads and large-scale distributed systems.

The ideal candidate will bring strong technical expertise in storage systems, Linux administration, and modern DevOps practices. You'll work with various technologies including Kubernetes, cloud platforms, and monitoring tools while contributing to NVIDIA's mission of advancing GPU-accelerated computing.

This position offers the opportunity to work with some of the most advanced computing systems in the industry, solving complex challenges in system reliability and performance. NVIDIA provides competitive compensation and benefits, fostering an inclusive environment that values diversity and innovation.

The role requires participation in on-call rotations and collaboration with multiple teams, making it ideal for someone who enjoys both technical depth and cross-functional teamwork. You'll be part of a team that values self-direction while providing support and mentorship for professional growth.

Last updated 8 days ago

Responsibilities For SRE Software Engineer, Storage and Data

  • Develop strategies for storage system reliability and availability, including redundancy and disaster recovery
  • Analyze and optimize storage system performance
  • Develop and maintain automation scripts for storage provisioning
  • Implement monitoring and alerting systems
  • Participate in on-call rotation
  • Collaborate with cross-functional teams
  • Work with AI/ML workloads in large clusters

Requirements For SRE Software Engineer, Storage and Data

Key skills: Python, Go, Java, Kubernetes, Linux

  • BS degree in Computer Science or related technical field
  • 5+ years of practical experience
  • Experience with Git, RESTful APIs, and Linux service operation
  • Experience with AWS S3 and large-scale Linux systems
  • Knowledge of infrastructure configuration management tools
  • Experience with observability tools like InfluxDB, Prometheus, Grafana
  • Experience with programming, scripting, and automation languages and tools: Python, Go, Java, Bash, Ansible, YAML

Benefits For SRE Software Engineer, Storage and Data

  • Medical insurance
  • Dental insurance
  • Vision insurance
  • Competitive salaries
  • Generous benefits package

