Senior SRE Software Engineer, Storage and Data

NVIDIA is the world leader in accelerated computing, pioneering solutions in AI and digital twins.
Site Reliability
Senior Software Engineer
In-Person
5+ years of experience
AI · Enterprise SaaS

Description For Senior SRE Software Engineer, Storage and Data

NVIDIA is seeking a Senior SRE Software Engineer to join its Storage and Data team, focusing on the reliability and performance of the DGX Cloud platform. This role is crucial to maintaining and optimizing the storage infrastructure that supports NVIDIA's mission-critical applications and services. The position combines site reliability engineering principles with storage system expertise to build and maintain scalable, fault-tolerant solutions.

The ideal candidate will be responsible for developing reliability strategies, implementing automation, and maintaining high-performance storage systems. They will work with cutting-edge AI/ML workloads and collaborate across teams to ensure seamless integration of large-scale storage solutions. The role requires strong technical skills in storage systems, Linux administration, and modern automation practices.

NVIDIA offers the opportunity to work with state-of-the-art technology in AI and accelerated computing. The company is at the forefront of transforming major industries through AI and digital twin technology. This role provides a chance to shape critical infrastructure supporting NVIDIA's platforms while working with a team that values self-direction and provides strong mentorship for professional growth.

The position requires expertise in storage system administration and site reliability engineering, along with programming skills in languages such as Python and Go. Knowledge of modern observability tools and infrastructure configuration management is essential. Experience with cloud platforms and Kubernetes, together with strong problem-solving skills, will be a valuable asset in this role.

Last updated 4 months ago

Responsibilities For Senior SRE Software Engineer, Storage and Data

  • Develop strategies to ensure reliability and availability of storage systems
  • Analyze and fine-tune storage systems for optimal performance
  • Develop and maintain automation scripts and tools
  • Implement monitoring and alerting systems
  • Participate in on-call rotation
  • Collaborate with cross-functional teams
  • Work with AI/ML workloads in large clusters

Requirements For Senior SRE Software Engineer, Storage and Data

Python · Go · Linux · Kubernetes
  • BS degree in Computer Science or related technical field
  • 5+ years of equivalent practical experience
  • Experience with Git, RESTful APIs, and Linux service operation
  • Experience with Ansible, Bash, Python, Go, YAML, and Java
  • Knowledge of infrastructure configuration management tools
  • Experience with observability tools such as InfluxDB, Prometheus, the Elastic Stack, and Grafana
