Software Engineer- AI/ML, AWS Neuron Distributed Training

AWS is a leading cloud infrastructure company. Annapurna Labs, which AWS acquired in 2015, serves as its infrastructure provider.
$129,300 - $223,600
Machine Learning
Mid-Level Software Engineer
In-Person
5,000+ Employees
3+ years of experience
AI · Enterprise SaaS

Description For Software Engineer- AI/ML, AWS Neuron Distributed Training

AWS (Amazon Web Services) is seeking a talented Software Engineer II to join the Annapurna Labs team, specifically working on the Machine Learning Applications (ML Apps) team for AWS Neuron. This role is at the intersection of cloud infrastructure and cutting-edge machine learning technology.

The position focuses on developing and optimizing the AWS Neuron software stack, which powers the AWS Inferentia and Trainium cloud-scale machine learning accelerators. You'll be responsible for enabling and performance-tuning various ML model families, including large language models like GPT-2 and GPT-3, Stable Diffusion, and Vision Transformers.

As a key member of the ML Distributed Training team, you'll collaborate closely with chip architects, compiler engineers, and runtime engineers. Your primary focus will be on building distributed training support into frameworks like PyTorch and TensorFlow, working with XLA and the Neuron compiler and runtime stacks. The role requires both strong software development skills and deep machine learning knowledge.

The team operates within AWS's larger infrastructure ecosystem, where Annapurna Labs (acquired by AWS in 2015) serves as a crucial infrastructure provider. The organization spans multiple disciplines, including silicon engineering, hardware design and verification, software, and operations. Its portfolio includes AWS Nitro, ENA, EFA, Graviton and F1 EC2 instances, AWS Neuron, and the Inferentia and Trainium ML accelerators.

AWS offers a supportive and inclusive work environment with a strong emphasis on work-life balance. The company provides comprehensive benefits, mentorship opportunities, and a culture that celebrates knowledge sharing. With ten employee-led affinity groups reaching 40,000 employees globally, AWS is committed to fostering diversity and inclusion.

This role offers an exciting opportunity to work on cutting-edge ML infrastructure that impacts millions of users worldwide. You'll be at the forefront of developing solutions that help businesses leverage machine learning at scale, while working with some of the most advanced cloud and ML technologies available today.

Responsibilities For Software Engineer- AI/ML, AWS Neuron Distributed Training

  • Lead efforts to build distributed training support into PyTorch and TensorFlow using XLA
  • Work with Neuron compiler and runtime stacks
  • Tune ML models for highest performance on AWS Trainium and Inferentia silicon
  • Develop and enable various ML model families, including GPT-2, GPT-3, Stable Diffusion, and Vision Transformers
  • Work with chip architects, compiler engineers and runtime engineers

Requirements For Software Engineer- AI/ML, AWS Neuron Distributed Training

Python
  • 3+ years of non-internship professional software development experience
  • 3+ years of non-internship design or architecture experience
  • Experience programming in at least one programming language
  • Deep Learning industry experience
  • Experience with PyTorch/JAX/TensorFlow
  • Knowledge of distributed training libraries and frameworks

Benefits For Software Engineer- AI/ML, AWS Neuron Distributed Training

Medical Insurance
  • Medical, financial, and other benefits
  • Flexible working hours
  • Mentorship and career growth opportunities
  • Inclusive team culture
  • Employee-led affinity groups

Jobs Related To Amazon Software Engineer- AI/ML, AWS Neuron Distributed Training

Support Engineer - Intelligent Document Processing

Support Engineer role at Amazon focusing on AI and compliance, implementing LLMs for document validation with 2+ years experience required.

Data Scientist II, Enterprise Engineering

Data Scientist role focused on developing and implementing machine learning models for Amazon's Enterprise Engineering team.

Machine Learning Engineer, Computer Vision & Remote Sensing, Proserve

AWS seeks Computer Vision Engineer for federal services team to develop ML solutions using satellite imagery, medical imaging, and remote sensing capabilities.

Language Engineer II, Amazon Transcribe

Language Engineer II position at Amazon AWS focusing on natural language data collections and GenAI services development.
