Senior HPC AI Cluster Engineer

NVIDIA is the world leader in accelerated computing, pioneering AI and digital twin technology.
DevOps
Senior Software Engineer
In-Person
5,000+ Employees
5+ years of experience
AI · Enterprise SaaS

Description For Senior HPC AI Cluster Engineer

NVIDIA is seeking an experienced Senior HPC AI Cluster Engineer to join its E2E software verification HPC/AI Infrastructure team. The role works at the forefront of accelerated computing and artificial intelligence, building and maintaining supercomputers and HPC clusters based on cutting-edge technologies.

The position combines deep technical expertise in HPC systems with hands-on engineering work, requiring skills across system architecture, infrastructure automation, and performance optimization. You'll be working with the latest accelerated computing and deep learning platforms, collaborating with scientific researchers and developers to improve workflows and develop innovative solutions.

As a Senior HPC AI Cluster Engineer, you'll be responsible for designing and implementing large-scale HPC/AI clusters, managing workload orchestration, developing automation tools, and ensuring optimal system performance. The role requires expertise in Linux systems, networking protocols, storage solutions, and modern DevOps practices.

NVIDIA, as the world leader in accelerated computing, offers an environment where you'll be working with cutting-edge technology and contributing to breakthroughs in AI and GPU computing. The company's focus on innovation and technical excellence makes this an ideal position for someone passionate about high-performance computing and artificial intelligence.

The role offers the opportunity to work with multiple teams across the organization, providing technical leadership and developing standardized methodologies. You'll be involved in research and development activities, participating in proof-of-concepts for future improvements, and helping shape the future of HPC/AI infrastructure.

This position is perfect for a seasoned engineer who combines strong technical skills with a strategic mindset, capable of both hands-on implementation and high-level system architecture. The role offers significant growth potential and the chance to work on some of the most advanced computing systems in the industry.

Responsibilities For Senior HPC AI Cluster Engineer

  • Design, implement and maintain large-scale HPC/AI clusters with monitoring, logging and alerting
  • Manage Linux job/workload schedulers and orchestration tools
  • Develop and maintain continuous integration and delivery pipelines
  • Develop tooling to automate deployment and management of large-scale infrastructure
  • Deploy monitoring solutions for servers, network and storage
  • Perform troubleshooting from bare metal to application level
  • Develop and document standard methodologies
  • Support R&D activities and engage in POCs/POVs

Requirements For Senior HPC AI Cluster Engineer

Python
Linux
Kubernetes
  • Degree in Computer Science, Engineering, or related field
  • 5+ years of experience
  • Knowledge of HPC and AI solution technologies
  • Experience with job scheduling and workload orchestration tools (Slurm, K8s)
  • Excellent knowledge of Windows and Linux networking and internals
  • Experience with multiple storage solutions (Lustre, GPFS, ZFS, XFS)
  • Python programming and bash scripting experience
  • Experience with automation tools (Jenkins, Ansible, Puppet/Chef)
  • Deep knowledge of networking technologies such as InfiniBand and Ethernet
  • Deep understanding of virtual systems
