Senior HPC DevOps Engineer

NVIDIA

NVIDIA is the world leader in accelerated computing, pioneering solutions in AI and digital twins.

Breckenridge, CO 80424, USA • Estes Park, CO 80517, USA • Reno, NV, USA…

DevOps

Senior Software Engineer

Remote

5+ years of experience

AI · Enterprise SaaS

Description For Senior HPC DevOps Engineer

NVIDIA is seeking a Senior HPC DevOps Engineer to help build next-generation supercomputers and HPC clusters. This role sits at the intersection of artificial intelligence and GPU computing, where you'll drive innovation in at-scale system design. You'll work with cutting-edge accelerated computing and deep learning platforms, collaborating with scientific researchers, developers, and customers to enhance workflows and develop new solutions.

The position involves designing and maintaining large-scale HPC/AI clusters, implementing infrastructure as code, and developing automated CI/CD pipelines. You'll be responsible for creating automation scripts, deploying monitoring solutions, and performing complex troubleshooting from bare metal to application level. As a technical leader, you'll share best practices and drive innovation through R&D activities.

The ideal candidate brings 5+ years of experience with a strong background in HPC and AI technologies, including expertise in CPUs, GPUs, and high-speed interconnects. You should be proficient in programming, familiar with tools like Jenkins and Ansible, and have deep knowledge of both Windows and Linux environments. Experience with job scheduling, storage solutions, and cloud platforms is essential.

NVIDIA offers a competitive package and a diverse, inclusive work environment. You'll be part of a company that's revolutionizing industries through AI and High-Performance Computing, working with the latest technologies and brilliant minds in the field. This role provides an opportunity to shape the future of computing while working on some of the most challenging technical problems in the industry.

Last updated 3 months ago

Responsibilities For Senior HPC DevOps Engineer

Design, implement, and maintain large-scale HPC/AI clusters with monitoring, logging, and alerting systems
Utilize and develop tools to manage infrastructure as code
Develop and maintain CI/CD pipelines
Develop automation scripts and tools
Deploy advanced monitoring solutions
Perform comprehensive troubleshooting
Serve as a technical resource and share best practices
Support R&D activities and engage in proof of concepts

Requirements For Senior HPC DevOps Engineer

Kubernetes

Linux

B.Sc. in Computer Science, Engineering, or related field with 5+ years of experience
Deep knowledge of HPC and AI solution technologies
Advanced proficiency in programming and scripting languages
Familiarity with Jenkins, Ansible, Puppet/Chef
Excellent knowledge of Windows and Linux
Deep understanding of networking protocols
Experience with job scheduling, workload management, and orchestration tools
Experience with multiple storage solutions
Expertise with virtual systems
Familiarity with cloud platforms
