High Performance Computing Engineer

A startup building large language tools, founded by Alex Smola and Mu Li, focusing on generative AI models for language, audio, and entertainment.
$150,000 - $250,000
Cloud
Senior Software Engineer
Hybrid
11 - 50 Employees
5+ years of experience
AI

Description For High Performance Computing Engineer

Boson AI, an innovative startup in the AI space, is seeking a Senior High Performance Computing Engineer to join its team. Founded by renowned experts Alex Smola and Mu Li, the company is at the forefront of developing cutting-edge generative AI models for language, audio, and entertainment.

The role offers an exceptional opportunity to work with state-of-the-art infrastructure, including NVIDIA H100 and A100 GPUs, more than 20 PB of storage, terabit networking, and hundreds of computers. You'll be responsible for the critical infrastructure that powers the company's AI operations in Toronto.

As a High Performance Computing Engineer, you'll be deeply involved in managing and optimizing the company's GPU clusters, implementing and maintaining complex networking solutions, and ensuring the smooth operation of its substantial computing infrastructure. The position requires a blend of hardware expertise and software engineering skills, with a focus on high-performance computing environments.

The ideal candidate will bring strong problem-solving abilities and a passion for learning new technologies. You'll work with cutting-edge tools and technologies, including Slurm, MAAS, Ceph, InfiniBand, and NVIDIA DeepOps. The role offers competitive compensation ranging from $150,000 to $250,000 annually, reflecting the senior nature of the position and its critical importance to the company's infrastructure.

This is an excellent opportunity for a seasoned engineer who wants to be at the forefront of AI infrastructure, working with some of the most advanced computing systems available while contributing to the development of next-generation AI technologies.

Responsibilities For High Performance Computing Engineer

  • Manage large, private, high-end GPU clusters
  • Handle full lifecycle of physical systems including deployments, operations, and troubleshooting
  • Configure and maintain network switches (Tomahawk Ethernet, Mellanox InfiniBand)
  • Configure and maintain MAAS, Ceph, Slurm, and Kubernetes
  • Configure and automate on-premises Linux-based systems using infrastructure-as-code practices
  • Configure and maintain Layer 3 networking
  • Learn and deploy new tools

Requirements For High Performance Computing Engineer

Python
Linux
Kubernetes
  • Strong background in high performance computing
  • Experience with on-premises data center operations and technologies
  • Experience in managing a large hardware cluster
  • Proficiency in at least one programming language (e.g., Python)
  • Experience in designing, deploying, and maintaining production-grade machine learning systems at scale
  • Familiarity with GPU utilization for machine learning workloads and optimization techniques
  • Experience managing firmware and system updates
