Software Engineer, SystemML - Scaling / Performance

Meta builds technologies that help people connect, find communities, and grow businesses through social platforms and immersive experiences.
$70,670 - $208,000
Machine Learning
Senior Software Engineer
In-Person
5,000+ Employees
5+ years of experience
AI

Description For Software Engineer, SystemML - Scaling / Performance

Meta is seeking a Software Engineer to join its Network.AI Software team within the DC networking organization. This role focuses on developing and maintaining the NCCL (NVIDIA Collective Communications Library) software stack, which is crucial for multi-GPU and multi-node data communication in distributed ML training. The position centers on improving the reliability and performance of large-scale GenAI/LLM training systems.

The role involves PyTorch integration and works directly with Meta's GPU-based ML workloads. The team's mission is to enable Meta-wide ML products and innovations by providing an observable, reliable, and high-performance distributed AI/GPU communication stack. Current focus areas include building customized features, software benchmarks, and performance tuners to enhance distributed ML reliability and performance.

This is an excellent opportunity for someone with strong technical expertise in distributed systems, machine learning infrastructure, and high-performance computing. The ideal candidate should have experience with GPU architectures, CUDA programming, and deep learning frameworks. The position offers competitive compensation including base salary, bonus, equity, and comprehensive benefits.

Working at Meta means being at the forefront of AI infrastructure development, with the opportunity to impact billions of users through Meta's various products and platforms. The role combines deep technical challenges in distributed systems with cutting-edge machine learning applications, particularly in the rapidly evolving field of large language models and generative AI.

The position requires collaboration with various teams across Meta's infrastructure organization, working on solutions that scale across Meta's large GPU fleet. This is a chance to work on some of the most challenging problems in distributed ML training while contributing to the development of next-generation AI systems.

Last updated 4 days ago

Responsibilities For Software Engineer, SystemML - Scaling / Performance

  • Enable reliable and highly scalable distributed ML training on Meta's large-scale GPU training infrastructure, with a focus on GenAI/LLM scaling
  • Develop and maintain the software stack around NCCL for multi-GPU and multi-node data communication
  • Build customized features, software benchmarks, performance tuners, and software stacks around NCCL and PyTorch

Requirements For Software Engineer, SystemML - Scaling / Performance

  • Python
  • Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
  • Specialized experience in distributed ML Training, GPU architecture, ML systems, AI infrastructure, high performance computing, performance optimizations, or Machine Learning frameworks

Benefits For Software Engineer, SystemML - Scaling / Performance

  • Medical Insurance
  • Dental Insurance
  • Vision Insurance
  • Bonus
  • Equity
  • Benefits package
