NVIDIA is seeking a Senior SRE Software Engineer to join its Storage and Data team, focusing on the reliability and performance of the DGX Cloud platform. The role is central to maintaining and optimizing the storage infrastructure that supports NVIDIA's mission-critical applications and services. The position combines site reliability engineering principles with storage system expertise to build and maintain scalable, fault-tolerant solutions.
The ideal candidate will be responsible for developing reliability strategies, implementing automation, and maintaining high-performance storage systems. They will work with cutting-edge AI/ML workloads and collaborate across teams to ensure seamless integration of large-scale storage solutions. The role requires strong technical skills in storage systems, Linux administration, and modern automation practices.
NVIDIA offers the opportunity to work with state-of-the-art technology in AI and accelerated computing. The company is at the forefront of transforming major industries through AI and digital twin technology. This role provides a chance to shape critical infrastructure supporting NVIDIA's platforms while working with a team that values self-direction and provides strong mentorship for professional growth.
The position requires expertise in storage system administration and site reliability engineering, along with programming skills in languages such as Python and Go. Knowledge of modern observability tools and infrastructure configuration management is essential. Experience with cloud platforms and Kubernetes, together with strong problem-solving skills, will be valuable in this role.