High Performance Computing Engineer

A startup building large language tools, founded by Alex Smola and Mu Li, focusing on generative AI models for language, audio, and entertainment.
$150,000 - $250,000
Cloud
Senior Software Engineer
Hybrid
11 - 50 Employees
5+ years of experience
AI

Description For High Performance Computing Engineer

Boson AI, an innovative startup in the AI space, is seeking a Senior High Performance Computing Engineer to join their team in Toronto. Founded by renowned experts Alex Smola and Mu Li, the company is at the forefront of developing generative AI models for language, audio, and entertainment.

The role offers an exceptional opportunity to work with cutting-edge technology, including NVIDIA H100 and A100 GPUs, more than 20 PB of storage, terabit networking, and hundreds of computers. You'll be responsible for operating the GPUs, network, and filesystem in the datacenter deployment, which requires strong problem-solving skills and an adaptable learning mindset.

As a Senior HPC Engineer, you'll be deeply involved in managing high-end GPU clusters, handling system deployments, and maintaining critical infrastructure components. The position demands expertise in various technologies, including Slurm, MAAS, Ceph, InfiniBand, and NVIDIA DeepOps, along with strong networking knowledge.

The ideal candidate will bring substantial experience in high-performance computing, data center operations, and large hardware cluster management. Your role will be crucial in designing, deploying, and maintaining production-grade machine learning systems at scale, making this an excellent opportunity for someone passionate about infrastructure and AI technology.

Working in a hybrid environment with a competitive salary range of $150,000 - $250,000, you'll be part of a team pushing the boundaries of AI technology. This role offers the chance to work with state-of-the-art hardware and contribute to the development of next-generation AI tools.

Responsibilities For High Performance Computing Engineer

  • Manage large, private, high-end GPU clusters
  • Handle full lifecycle of physical systems including deployments, operations, and troubleshooting
  • Configure and maintain network switches (Tomahawk Ethernet, Mellanox InfiniBand)
  • Configure and maintain MAAS, Ceph, Slurm and Kubernetes
  • Configure and automate on-premises Linux-based systems using infrastructure-as-code practices
  • Configure and maintain network, including Layer 3 networking
  • Learn and deploy new tools

Requirements For High Performance Computing Engineer

Skills: Linux, Python, Kubernetes
  • Strong background in high performance computing
  • Experience with on-premises data center operations and technologies
  • Experience in managing a large hardware cluster
  • Proficiency in at least one programming language (e.g. Python) and ability to write clean, maintainable code
  • Experience in designing, deploying, and maintaining production-grade machine learning systems at scale
  • Familiarity with GPU utilization for machine learning workloads and optimization techniques
  • Experience managing firmware and system updates
