Meta is seeking a Software Engineer to join its Network.AI Software team within the data center (DC) networking organization. The role focuses on developing and maintaining the NCCL (NVIDIA Collective Communications Library) software stack, which is crucial for multi-GPU and multi-node data communication in distributed ML training. The position centers on improving the reliability and performance of large-scale GenAI/LLM training systems.
The role involves PyTorch integration work and direct engagement with Meta's GPU-based ML workloads. The team's mission is to enable Meta-wide ML products and innovations by providing an observable, reliable, and high-performance distributed AI/GPU communication stack. Current focus areas include building customized features, software benchmarks, and performance tuners to enhance distributed ML reliability and performance.
The role suits someone with strong technical expertise in distributed systems, machine learning infrastructure, and high-performance computing. The ideal candidate has experience with GPU architectures, CUDA programming, and deep learning frameworks. The position offers competitive compensation, including base salary, bonus, equity, and comprehensive benefits.
Working at Meta means being at the forefront of AI infrastructure development, with the opportunity to impact billions of users through Meta's various products and platforms. The role combines deep technical challenges in distributed systems with cutting-edge machine learning applications, particularly in the rapidly evolving field of large language models and generative AI.
The position requires collaboration with various teams across Meta's infrastructure organization, working on solutions that scale across Meta's large GPU fleet. This is a chance to work on some of the most challenging problems in distributed ML training while contributing to the development of next-generation AI systems.