NVIDIA is seeking a Senior Production Engineer to join the Storage team within its Site Reliability Engineering (SRE) organization. This role combines software engineering practices with systems operations to build and maintain large-scale production systems. The position focuses on ensuring reliable storage solutions and managing data efficiently for NVIDIA's GPU cloud services.
The role requires expertise across several domains: systems, networking, storage, coding, and database management. You'll work with cutting-edge technologies such as Kubernetes, containers, and virtualization while ensuring high reliability and uptime for both internal and external-facing services.
As a Senior Production Engineer, you'll be responsible for designing and implementing large-scale storage clusters, working with AI/ML workloads, and improving service lifecycles. The role involves hands-on work with monitoring systems, automation, and performance optimization. You'll be part of a diverse team that values intellectual curiosity and problem-solving in a blame-free environment.
Key responsibilities include supporting services before they go live through system design consulting, developing software frameworks, and managing capacity. You'll maintain services by monitoring availability and system health, often leveraging machine learning models. The role requires participation in an on-call rotation and practicing sustainable incident response.
The ideal candidate will have strong experience with Linux systems, infrastructure configuration management tools, and observability solutions. You'll need to demonstrate excellent debugging skills and thrive in collaborative environments. This position offers competitive compensation, including equity, and the opportunity to work with some of the most forward-thinking professionals in technology.
NVIDIA's culture promotes self-direction and provides support for learning and growth. You'll be part of an organization that brings together people with diverse backgrounds and perspectives, encouraging collaboration and innovation. This role offers the chance to work on meaningful projects while contributing to the reliability and efficiency of NVIDIA's critical storage infrastructure.