AWS Neuron is seeking a Software Development Engineer II to join its Machine Learning Applications team, focusing on distributed training solutions. The role is part of Annapurna Labs, which AWS acquired in 2015 and which now serves as the infrastructure provider for AWS. The position involves working on AWS Neuron, the complete software stack for the AWS Inferentia and Trainium cloud-scale machine learning accelerators.
The role focuses on developing and optimizing distributed training support for large-scale ML models, including GPT-2, GPT-3, Stable Diffusion, and Vision Transformers. You'll work closely with chip architects, compiler engineers, and runtime engineers to create and tune distributed training solutions for Trn1 systems. Experience with Python and with distributed training libraries such as FSDP and DeepSpeed is essential.
The team emphasizes work-life balance and an inclusive culture, and supports new members through mentorship and knowledge sharing. AWS offers comprehensive benefits and opportunities for career growth. The position involves working with cutting-edge ML infrastructure and contributing to systems that impact millions of users worldwide.
Key responsibilities include implementing distributed training support in major frameworks, optimizing performance for AWS silicon, and collaborating across teams to deliver high-performance ML solutions. The role requires both strong software development skills and deep ML knowledge, making it ideal for candidates with experience in both areas.
The position offers competitive compensation based on location and experience, along with equity opportunities and comprehensive benefits. AWS maintains a strong commitment to diversity and inclusion, reflected in its leadership principles and workplace culture.