Software Engineer- AI/ML, AWS Neuron Distributed Training

Amazon Web Services (AWS) is a leading cloud computing platform, providing a wide array of services to businesses and individuals worldwide.
$129,300 - $223,600
Machine Learning
Senior Software Engineer
Contact Company
5,000+ Employees
3+ years of experience
AI · Enterprise SaaS · Cloud

Description For Software Engineer- AI/ML, AWS Neuron Distributed Training

Do you love decomposing problems to develop products that impact millions of people around the world? Would you enjoy identifying, defining, and building software solutions that revolutionize how businesses operate? The Annapurna Labs team at Amazon Web Services (AWS) is looking for a Software Development Engineer II to build, deliver, and maintain complex products that delight our customers and raise our performance bar.

You'll design fault-tolerant systems that run at massive scale as we continue to innovate best-in-class services and applications in the AWS Cloud. Annapurna Labs was a startup acquired by AWS in 2015 and is now fully integrated. If AWS is an infrastructure company, then think of Annapurna Labs as the infrastructure provider of AWS.

This role is for a senior software engineer on the Machine Learning Applications (ML Apps) team for AWS Neuron. You'll be responsible for development, enablement, and performance tuning of a wide variety of ML model families, including massive-scale large language models like GPT-2, GPT-3, and beyond, as well as Stable Diffusion, Vision Transformers, and many more.

Key responsibilities include:

  • Leading efforts to build distributed training support into PyTorch, TensorFlow using XLA, and the Neuron compiler and runtime stacks
  • Tuning models to ensure the highest performance and maximum efficiency on AWS Trainium and Inferentia silicon and the Trn1 and Inf1 servers
  • Working closely with chip architects, compiler engineers, and runtime engineers

We offer:

  • Inclusive team culture with employee-led affinity groups
  • Emphasis on work-life balance
  • Mentorship and career growth opportunities
  • Flexibility in working hours

Join us to revolutionize cloud-scale machine learning acceleration at AWS!

Last updated 3 months ago

Responsibilities For Software Engineer- AI/ML, AWS Neuron Distributed Training

  • Build distributed training support into PyTorch, TensorFlow using XLA, and the Neuron compiler and runtime stacks
  • Tune models for the highest performance on AWS Trainium and Inferentia silicon and the Trn1 and Inf1 servers
  • Work with chip architects, compiler engineers, and runtime engineers
  • Develop, enable, and performance tune a wide variety of ML model families
  • Design fault-tolerant systems that run at massive scale

Requirements For Software Engineer- AI/ML, AWS Neuron Distributed Training

Python
  • 3+ years of non-internship professional software development experience
  • 3+ years of non-internship design or architecture experience
  • Experience programming with at least one software programming language
  • Deep Learning industry experience
  • Bachelor's degree in computer science or equivalent (preferred)
  • Experience with PyTorch/JAX/TensorFlow, distributed libraries and frameworks, and end-to-end model training (preferred)

Benefits For Software Engineer- AI/ML, AWS Neuron Distributed Training

  • Flexible working hours
  • Mentorship and career growth opportunities
  • Employee-led affinity groups
  • Work-life balance
