NVIDIA is seeking a Site Reliability Engineer (SRE) to ensure the reliability and performance of its DGX Cloud platform. This role focuses on maintaining and optimizing the storage infrastructure behind NVIDIA's GPU cloud platforms. The position combines traditional SRE responsibilities with storage system expertise, requiring both technical depth and cross-functional collaboration.
As an SRE at NVIDIA, you'll be responsible for designing, implementing, and maintaining scalable storage solutions that support mission-critical applications. The role emphasizes automation, performance optimization, and system reliability, with opportunities to work on cutting-edge AI/ML workloads and large-scale distributed systems.
The ideal candidate will bring strong technical expertise in storage systems, Linux administration, and modern DevOps practices. You'll work with various technologies including Kubernetes, cloud platforms, and monitoring tools while contributing to NVIDIA's mission of advancing GPU-accelerated computing.
This position offers the opportunity to work with some of the most advanced computing systems in the industry, solving complex challenges in system reliability and performance. NVIDIA provides competitive compensation and benefits and fosters an inclusive environment that values diversity and innovation.
The role requires participation in on-call rotations and collaboration with multiple teams, so it suits someone who enjoys hands-on operational work as much as teamwork. You'll be part of a team that values self-direction and provides support and mentorship for professional growth.