Senior HPC DevOps Engineer

NVIDIA is the world leader in accelerated computing, pioneering solutions in AI and digital twins.
Breckenridge, CO 80424, USA · Estes Park, CO 80517, USA · Reno, NV, USA
DevOps
Senior Software Engineer
Remote
5+ years of experience
AI · Enterprise SaaS

Description For Senior HPC DevOps Engineer

NVIDIA is seeking a Senior HPC DevOps Engineer to help build next-generation supercomputers and HPC clusters. This role sits at the intersection of artificial intelligence and GPU computing, where you'll drive innovation in at-scale system design. You'll work with cutting-edge accelerated computing and deep learning platforms, collaborating with scientific researchers, developers, and customers to improve workflows and develop new solutions.

The position involves designing and maintaining large-scale HPC/AI clusters, implementing infrastructure as code, and developing automated CI/CD pipelines. You'll be responsible for creating automation scripts, deploying monitoring solutions, and performing complex troubleshooting from bare metal to application level. As a technical leader, you'll share best practices and drive innovation through R&D activities.

The ideal candidate brings 5+ years of experience with a strong background in HPC and AI technologies, including expertise in CPUs, GPUs, and high-speed interconnects. You should be proficient in programming, familiar with tools like Jenkins and Ansible, and have deep knowledge of both Windows and Linux environments. Experience with job scheduling, storage solutions, and cloud platforms is essential.

NVIDIA offers a competitive package and a diverse, inclusive work environment. You'll be part of a company that's revolutionizing industries through AI and High-Performance Computing, working with the latest technologies and brilliant minds in the field. This role provides an opportunity to shape the future of computing while working on some of the most challenging technical problems in the industry.

Last updated 3 months ago

Responsibilities For Senior HPC DevOps Engineer

  • Design, implement, and maintain large-scale HPC/AI clusters with monitoring, logging, and alerting systems
  • Utilize and develop tools to manage infrastructure as code
  • Develop and maintain CI/CD pipelines
  • Develop automation scripts and tools
  • Deploy advanced monitoring solutions
  • Perform comprehensive troubleshooting
  • Serve as a technical resource and share best practices
  • Support R&D activities and engage in proof of concepts

Requirements For Senior HPC DevOps Engineer

Kubernetes · Linux
  • B.Sc. in Computer Science, Engineering, or related field with 5+ years of experience
  • Deep knowledge of HPC and AI solution technologies
  • Advanced proficiency in programming and scripting languages
  • Familiarity with Jenkins, Ansible, Puppet/Chef
  • Excellent knowledge of Windows and Linux
  • Deep understanding of networking protocols
  • Experience with job schedulers and workload orchestration tools
  • Experience with multiple storage solutions
  • Expertise with virtual systems
  • Familiarity with cloud platforms
