HPC Operations Manager – Hardware Engineering

NVIDIA is the world leader in accelerated computing, tackling challenges no one else can solve.
Santa Clara, CA, USA · Westford, MA 01886, USA · Austin, TX, USA
$272,000 - $419,750
Cloud
Principal Software Engineer
Hybrid
5,000+ Employees
15+ years of experience
AI · Enterprise SaaS

Description For HPC Operations Manager – Hardware Engineering

NVIDIA, a leader in High-Performance Computing, Artificial Intelligence, and Visualization, is seeking an HPC Operations Manager for its Hardware Engineering team. This role involves leading a multi-national team of sysadmins and devops engineers, ensuring high reliability of HPC clusters, and collaborating with partners to develop programs for storage, networking, and compute in data centers. Key responsibilities include evaluating technologies, planning hardware deployments, managing HPC schedulers, tracking software licensing, and communicating with senior management. The ideal candidate will have extensive experience in IT infrastructure management, Linux servers, HPC schedulers, and hardware design workflows. This position offers the opportunity to work on cutting-edge technology and contribute to the development of next-generation GPUs and SoCs.

Responsibilities:

  • Lead and mentor a multi-national team of sysadmins and devops engineers
  • Ensure high reliability of HPC clusters and develop critical metrics
  • Evaluate latest technologies and recommend infrastructure evolution
  • Manage HPC scheduler (LSF) and drive high utilization
  • Collaborate with hardware engineering leaders to support chip design needs
  • Develop and manage program schedules, milestones, and deliverables
  • Communicate program status to senior management

Requirements:

  • B.S. or M.S. in Computer Science, Computer Engineering, or Information Science
  • 15+ years overall experience
  • 5+ years managing IT infrastructure teams of 10+ people
  • 10+ years of experience with Linux servers, NFS storage, and Ethernet networks
  • Knowledge of HPC schedulers (IBM LSF preferred)
  • Experience with hardware design workflows (EDA tools and methodology)
  • Project management and capacity planning skills

Preferred Skills:

  • Experience with HPC storage systems
  • InfiniBand expertise
  • Software development in a devops context
  • Knowledge of databases and analytics platforms
  • Experience with FlexLM-based software license servers
  • Established relationships with enterprise-level equipment suppliers

NVIDIA offers a competitive salary range, equity, and comprehensive benefits. The company is committed to fostering a diverse work environment and is an equal opportunity employer.

Last updated 3 months ago

Responsibilities For HPC Operations Manager – Hardware Engineering

  • Lead, cultivate, and mentor a multi-national team of sysadmins and devops engineers
  • Ensure the highest reliability of HPC clusters
  • Evaluate the latest technologies and recommend future evolution of the infrastructure
  • Work cross-functionally with hardware engineering leaders to support their future chip design needs
  • Lead all aspects of the HPC scheduler (LSF)
  • Track software licensing servers and drive efficient license utilization
  • Develop and manage program schedules, milestones, and deliverables
  • Regularly communicate program status and key issues to senior management

Requirements For HPC Operations Manager – Hardware Engineering

  • B.S. or M.S. in Computer Science, Computer Engineering, or Information Science
  • 15+ years overall experience
  • 5+ years managing IT infrastructure teams of 10+ people
  • 10+ years of experience with Linux servers, NFS storage, and Ethernet networks
  • Knowledge of HPC schedulers (IBM LSF preferred)
  • Knowledge of hardware design workflows (EDA tools and methodology)
  • Experience using project management and capacity planning software
  • Datacenter operations (rack and stack, maintenance)

Benefits For HPC Operations Manager – Hardware Engineering

  • Equity
  • Comprehensive benefits package
