HPC Operations Manager – Hardware Engineering

NVIDIA is the world leader in accelerated computing, tackling challenges no one else can solve.
Santa Clara, CA, USA · Westford, MA 01886, USA · Austin, TX, USA
$272,000 - $419,750
Cloud
Principal Software Engineer
Hybrid
5,000+ Employees
15+ years of experience
AI · Enterprise SaaS

Description For HPC Operations Manager – Hardware Engineering

NVIDIA, a leader in High-Performance Computing, Artificial Intelligence, and Visualization, is seeking an HPC Operations Manager for its Hardware Engineering team. The role involves leading a multi-national team of sysadmins and devops engineers, ensuring high reliability of HPC clusters, and collaborating with partners to develop programs for storage, networking, and compute in data centers. Key responsibilities include evaluating technologies, planning hardware deployments, managing HPC schedulers, tracking software licensing, and communicating with senior management. The ideal candidate will have extensive experience in IT infrastructure management, Linux servers, HPC schedulers, and hardware design workflows. The position offers the opportunity to work on cutting-edge technology and contribute to the development of next-generation GPUs and SoCs.

Responsibilities:

  • Lead and mentor a multi-national team of sysadmins and devops engineers
  • Ensure high reliability of HPC clusters and develop critical metrics
  • Evaluate latest technologies and recommend infrastructure evolution
  • Manage HPC scheduler (LSF) and drive high utilization
  • Collaborate with hardware engineering leaders to support chip design needs
  • Develop and manage program schedules, milestones, and deliverables
  • Communicate program status to senior management

Requirements:

  • B.S. or M.S. in Computer Science, Computer Engineering, or Information Science
  • 15+ years overall experience
  • 5+ years managing IT infrastructure teams of 10+ people
  • 10+ years of experience with Linux servers, NFS storage, and Ethernet networks
  • Knowledge of HPC schedulers (IBM LSF preferred)
  • Experience with hardware design workflows (EDA tools and methodology)
  • Project management and capacity planning skills

Preferred Skills:

  • Experience with HPC storage systems
  • InfiniBand expertise
  • Software development in a devops context
  • Knowledge of databases and analytics platforms
  • Experience with FlexLM-based software license servers
  • Established relationships with enterprise-level equipment suppliers

NVIDIA offers a competitive salary range, equity, and comprehensive benefits. The company is committed to fostering a diverse work environment and is an equal opportunity employer.


Responsibilities For HPC Operations Manager – Hardware Engineering

  • Lead, cultivate, and mentor a multi-national team of sysadmins and devops engineers
  • Ensure the highest reliability of HPC clusters
  • Evaluate the latest technologies and recommend future evolution of the infrastructure
  • Work multi-functionally with hardware engineering leaders to support their future chip design needs
  • Lead all aspects of the HPC scheduler (LSF)
  • Track software licensing servers and drive efficient license utilization
  • Develop and manage program schedules, milestones and deliverables
  • Regularly communicate program status and key issues to senior management

Requirements For HPC Operations Manager – Hardware Engineering

  • B.S. or M.S. in Computer Science, Computer Engineering, or Information Science
  • 15+ years overall experience
  • 5+ years managing IT infrastructure teams of 10+ people
  • 10+ years of experience with Linux servers, NFS storage, and Ethernet networks
  • Knowledge of HPC schedulers (IBM LSF preferred)
  • Knowledge of hardware design workflows (EDA tools and methodology)
  • Experience using project management and capacity planning software
  • Datacenter operations (rack and stack, maintenance)

Benefits For HPC Operations Manager – Hardware Engineering

  • Equity
  • Comprehensive benefits package
