Hardcore Engineer - Pretraining Infrastructure

xAI's mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge.
$180,000 - $440,000
Backend
In-Person
AI

Description For Hardcore Engineer - Pretraining Infrastructure

xAI is seeking a Hardcore Engineer for Pretraining Infrastructure. The role involves designing, building, and implementing large-scale distributed training systems; profiling, debugging, and optimizing multi-host GPU utilization; hardware/software/algorithm co-design; maintaining and innovating on the codebase; and building tools to boost team productivity. The ideal candidate has experience configuring and troubleshooting operating systems for maximum performance and building scalable training frameworks for AI models on HPC clusters. The team operates with a flat organizational structure, encouraging engineers to work across multiple areas and contribute directly to the company's mission. Strong communication skills and the ability to share knowledge concisely and accurately are essential. The interview process includes a coding assessment, a hands-on systems demonstration, a project deep-dive presentation, and a meet and greet with the wider team. xAI values engineering excellence, curiosity, and a strong work ethic.

Last updated 2 months ago

Responsibilities For Hardcore Engineer - Pretraining Infrastructure

  • Design, build, and implement large-scale distributed training systems
  • Profile, debug, and optimize multi-host GPU utilization
  • Contribute to hardware/software/algorithm co-design
  • Maintain and innovate on the codebase
  • Build tools to boost the productivity of the team

Requirements For Hardcore Engineer - Pretraining Infrastructure

Python
Rust
  • Experience in configuring and troubleshooting operating systems for maximum performance
  • Experience building scalable training frameworks for AI models on HPC clusters
  • Experience with scalable orchestration frameworks and tools
  • Knowledge of machine learning compilers and runtimes such as XLA, MLIR, and Triton
  • Experience with distributed training strategies such as FSDP, Megatron, and pipeline parallelism
  • Familiarity with NCCL or custom communication libraries for high-performance collective communication
  • Strong communication skills
