Senior HPC DevOps Engineer

NVIDIA is the world leader in accelerated computing, pioneering AI and digital twins technology to transform industries.
DevOps
Senior Software Engineer
In-Person
5,000+ Employees
5+ years of experience
AI · Enterprise SaaS
This job posting may no longer be active.

Description For Senior HPC DevOps Engineer

NVIDIA is seeking an experienced HPC DevOps Engineer to contribute to building next-generation supercomputers and HPC clusters. This role combines cutting-edge technology with practical implementation, focusing on large-scale system design and optimization for AI and GPU computing platforms. As a Senior HPC DevOps Engineer, you'll work at the intersection of hardware and software, collaborating with scientists, developers, and customers to enhance workflows and create innovative solutions.

The position requires expertise in infrastructure management, automation, and system architecture, with opportunities to work with state-of-the-art accelerated computing and deep learning platforms. You'll be responsible for designing and maintaining large-scale HPC/AI clusters, implementing infrastructure as code, developing CI/CD pipelines, and ensuring robust monitoring systems. The role demands strong technical skills in areas like containerization, GPU computing, and high-performance networking, while also requiring leadership in best practices and innovation. At NVIDIA, you'll be part of a team pushing the boundaries of technology and making real-world impact, supported by a company culture that values diversity and inclusion.

Last updated 2 months ago

Responsibilities For Senior HPC DevOps Engineer

  • Design, implement, and maintain large-scale HPC/AI clusters with monitoring systems
  • Utilize and develop tools to manage infrastructure as code
  • Develop and maintain CI/CD pipelines
  • Develop automation scripts and tools
  • Deploy advanced monitoring solutions
  • Perform comprehensive troubleshooting
  • Serve as a technical resource and share best practices
  • Support R&D activities and engage in proofs of concept (POCs) and proofs of value (POVs)

Requirements For Senior HPC DevOps Engineer

Linux
Kubernetes
  • B.Sc. in Computer Science, Engineering, or related field with 5+ years of experience
  • Deep knowledge of HPC and AI solution technologies
  • Advanced proficiency in programming and scripting languages
  • Familiarity with Jenkins, Ansible, Puppet/Chef
  • Excellent knowledge of Windows and Linux
  • Deep understanding of networking protocols
  • Experience with job schedulers and workload orchestration tools
  • Experience with multiple storage solutions
  • Expertise with virtual systems
  • Familiarity with cloud platforms
