AWS (Amazon Web Services) is seeking a talented Software Engineer II to join Annapurna Labs on the Machine Learning Applications (ML Apps) team for AWS Neuron. The role sits at the intersection of cloud infrastructure and cutting-edge machine learning technology.
The position focuses on developing and optimizing the AWS Neuron software stack, which powers the AWS Inferentia and Trainium cloud-scale machine learning accelerators. You'll be responsible for enabling and performance-tuning a range of ML model families, including large language models such as GPT-2 and GPT-3, Stable Diffusion, and Vision Transformers.
As a key member of the ML Distributed Training team, you'll collaborate closely with chip architects, compiler engineers, and runtime engineers. Your primary focus will be on building distributed training support into frameworks like PyTorch and TensorFlow, working with XLA and the Neuron compiler and runtime stacks. The role requires both strong software development skills and deep machine learning knowledge.
The team operates within AWS's larger infrastructure ecosystem, where Annapurna Labs (acquired by AWS in 2015) serves as a crucial infrastructure provider. The organization spans multiple disciplines, including silicon engineering, hardware design and verification, software, and operations. Its portfolio includes AWS Nitro, ENA, EFA, Graviton and F1 EC2 instances, AWS Neuron, and the Inferentia and Trainium ML accelerators.
AWS offers a supportive and inclusive work environment with a strong emphasis on work-life balance. The company provides comprehensive benefits, mentorship opportunities, and a culture that celebrates knowledge sharing. With ten employee-led affinity groups reaching 40,000 employees globally, AWS is committed to fostering diversity and inclusion.
This role offers an opportunity to work on ML infrastructure that impacts millions of users worldwide. You'll develop solutions that help businesses use machine learning at scale, working with some of the most advanced cloud and ML technologies available today.