Senior HPC DevOps Engineer

NVIDIA is the world leader in accelerated computing, pioneering solutions in AI and digital twins.
Breckenridge, CO 80424, USA · Estes Park, CO 80517, USA · Reno, NV, USA
DevOps
Senior Software Engineer
Remote
5+ years of experience
AI · Enterprise SaaS

Description For Senior HPC DevOps Engineer

NVIDIA is seeking a Senior HPC DevOps Engineer to help build next-generation supercomputers and HPC clusters. This role sits at the intersection of artificial intelligence and GPU computing, where you'll drive breakthrough innovations in at-scale system design. You'll work with cutting-edge accelerated computing and deep learning platforms, collaborating with scientific researchers, developers, and customers to enhance workflows and develop innovative solutions.

The position involves designing and maintaining large-scale HPC/AI clusters, implementing infrastructure as code, and developing automated CI/CD pipelines. You'll be responsible for creating automation scripts, deploying monitoring solutions, and performing complex troubleshooting from bare metal to application level. As a technical leader, you'll share best practices and drive innovation through R&D activities.

The ideal candidate brings 5+ years of experience with a strong background in HPC and AI technologies, including expertise in CPUs, GPUs, and high-speed interconnects. You should be proficient in programming, familiar with tools like Jenkins and Ansible, and have deep knowledge of both Windows and Linux environments. Experience with job scheduling, storage solutions, and cloud platforms is essential.

NVIDIA offers a competitive package and a diverse, inclusive work environment. You'll be part of a company that's revolutionizing industries through AI and High-Performance Computing, working with the latest technologies and brilliant minds in the field. This role provides an opportunity to shape the future of computing while working on some of the most challenging technical problems in the industry.

Last updated a day ago

Responsibilities For Senior HPC DevOps Engineer

  • Design, implement, and maintain large-scale HPC/AI clusters with monitoring, logging, and alerting systems
  • Utilize and develop tools to manage infrastructure as code
  • Develop and maintain CI/CD pipelines
  • Develop automation scripts and tools
  • Deploy advanced monitoring solutions
  • Perform comprehensive troubleshooting
  • Serve as a technical resource and share best practices
  • Support R&D activities and engage in proofs of concept

Requirements For Senior HPC DevOps Engineer

Kubernetes
Linux
  • B.Sc. in Computer Science, Engineering, or related field with 5+ years of experience
  • Deep knowledge of HPC and AI solution technologies
  • Advanced proficiency in programming and scripting languages
  • Familiarity with Jenkins, Ansible, Puppet/Chef
  • Excellent knowledge of Windows and Linux
  • Deep understanding of networking protocols
  • Experience with job scheduling and workload orchestration tools
  • Experience with multiple storage solutions
  • Expertise with virtual systems
  • Familiarity with cloud platforms
