In this role, you will be a member of the Network AI Software team, part of the broader data center (DC) networking organization. The team develops and owns the software stack around collective communication libraries at Meta.
At a high level, the team aims to enable Meta-wide ML products and innovations to leverage our large-scale training and inference fleet through an observable, reliable, and high-performance distributed AI communication stack. Currently, one of the team's focus areas is building customized features, software benchmarks, performance tuners, and software stacks around PyTorch to improve full-stack distributed ML reliability and performance (e.g., large-scale GenAI/LLM training), from the trainer down to the network communication layer. We are seeking leaders to work on GenAI/LLM scaling reliability and performance.
Responsibilities:
Minimum Qualifications:
Preferred Qualifications:
Meta is committed to providing reasonable accommodations for candidates with disabilities, long-term conditions, mental health conditions or sincerely held religious beliefs, or who are neurodivergent or require pregnancy-related support.