SRE Software Engineer, Storage and Data

NVIDIA

NVIDIA is a technology company developing GPU-based solutions for gaming, AI, and cloud computing platforms.

Taipei City, Taiwan

Site Reliability

Senior Software Engineer

In-Person

5+ years of experience

AI · Enterprise SaaS

Description For SRE Software Engineer, Storage and Data

NVIDIA is seeking a Site Reliability Engineer (SRE) to ensure the reliability and performance of its DGX Cloud platform. This role focuses on maintaining and optimizing the storage infrastructure for NVIDIA's GPU cloud platforms. The position combines traditional SRE responsibilities with storage system expertise, requiring both technical depth and cross-functional collaboration.

As an SRE at NVIDIA, you'll be responsible for designing, implementing, and maintaining scalable storage solutions that support mission-critical applications. The role emphasizes automation, performance optimization, and system reliability, with opportunities to work on cutting-edge AI/ML workloads and large-scale distributed systems.

The ideal candidate will bring strong technical expertise in storage systems, Linux administration, and modern DevOps practices. You'll work with various technologies including Kubernetes, cloud platforms, and monitoring tools while contributing to NVIDIA's mission of advancing GPU-accelerated computing.

This position offers the opportunity to work with some of the most advanced computing systems in the industry, solving complex challenges in system reliability and performance. NVIDIA provides competitive compensation and benefits, fostering an inclusive environment that values diversity and innovation.

The role requires participation in on-call rotations and collaboration with multiple teams, making it ideal for someone who enjoys both technical depth and cross-functional teamwork. You'll be part of a team that values self-direction while providing support and mentorship for professional growth.

Last updated 4 months ago

Responsibilities For SRE Software Engineer, Storage and Data

Develop strategies for storage system reliability and availability, including redundancy and disaster recovery
Analyze and optimize storage system performance
Develop and maintain automation scripts for storage provisioning
Implement monitoring and alerting systems
Participate in on-call rotation
Collaborate with cross-functional teams
Work with AI/ML workloads in large clusters

Requirements For SRE Software Engineer, Storage and Data

Python

Java

Kubernetes

Linux

BS degree in Computer Science or related technical field
5+ years of practical experience
Experience with Git, RESTful APIs, and operating Linux services
Experience with AWS S3 and large-scale Linux systems
Knowledge of infrastructure configuration management tools
Experience with observability tools like InfluxDB, Prometheus, Grafana
Experience with programming, scripting, and configuration languages and tools: Ansible, Bash, Python, Go, YAML, Java

Benefits For SRE Software Engineer, Storage and Data

Medical Insurance

Dental Insurance

Vision Insurance

Competitive salaries
Generous benefits package

