Senior SRE Software Engineer, Storage and Data

NVIDIA is the world leader in accelerated computing, pioneering solutions in AI and digital twins.
Site Reliability
Senior Software Engineer
In-Person
5+ years of experience
AI · Enterprise SaaS

Description For Senior SRE Software Engineer, Storage and Data

NVIDIA is seeking a Senior SRE Software Engineer to join its Storage and Data team, focusing on the reliability and performance of the DGX Cloud platform. This role is crucial in maintaining and optimizing the storage infrastructure that supports NVIDIA's mission-critical applications and services. The position combines site reliability engineering principles with storage system expertise to build and maintain scalable, fault-tolerant solutions.

The ideal candidate will be responsible for developing reliability strategies, implementing automation, and maintaining high-performance storage systems. They will work with cutting-edge AI/ML workloads and collaborate across teams to ensure seamless integration of large-scale storage solutions. The role requires strong technical skills in storage systems, Linux administration, and modern automation practices.

NVIDIA offers the opportunity to work with state-of-the-art technology in AI and accelerated computing. The company is at the forefront of transforming major industries through AI and digital twin technology. This role provides a chance to impact critical infrastructure supporting NVIDIA's platforms while working with a team that values self-direction and provides strong mentorship for professional growth.

The position requires expertise in storage system administration and site reliability engineering, along with programming skills in languages such as Python and Go. Knowledge of modern observability tools and infrastructure configuration management is essential. Experience with cloud platforms and Kubernetes, together with strong problem-solving skills, will be valuable in this role.

Last updated 8 days ago

Responsibilities For Senior SRE Software Engineer, Storage and Data

  • Develop strategies to ensure reliability and availability of storage systems
  • Analyze and fine-tune storage systems for optimal performance
  • Develop and maintain automation scripts and tools
  • Implement monitoring and alerting systems
  • Participate in on-call rotation
  • Collaborate with cross-functional teams
  • Work with AI/ML workloads in large clusters

Requirements For Senior SRE Software Engineer, Storage and Data

Python
Go
Linux
Kubernetes
  • BS degree in Computer Science or a related technical field
  • 5+ years of equivalent practical experience
  • Experience with Git, RESTful APIs, and Linux service operation
  • Experience with Ansible, Bash, Python, Go, YAML, Java
  • Knowledge of infrastructure configuration management tools
  • Experience with observability tools like InfluxDB, Prometheus, Elastic stack, Grafana
