Software Engineer- AI/ML, AWS Neuron Distributed Training

AWS is a leading cloud infrastructure company. Annapurna Labs, which AWS acquired in 2015, serves as its infrastructure provider.
$129,300 - $223,600
Machine Learning
Mid-Level Software Engineer
In-Person
3+ years of experience
AI · Enterprise SaaS

Description For Software Engineer- AI/ML, AWS Neuron Distributed Training

AWS Neuron is seeking a talented Software Engineer to join its Machine Learning Applications (ML Apps) team. This role is part of the Annapurna Labs organization, which was acquired by AWS in 2015 and serves as the infrastructure backbone of AWS.

The position focuses on developing and optimizing AWS Neuron, the complete software stack for the AWS Inferentia and Trainium cloud-scale machine learning accelerators. You'll work with a range of ML model families, including large language models such as GPT-2 and GPT-3, Stable Diffusion, and Vision Transformers.

As a Software Engineer II, you'll collaborate with chip architects, compiler engineers, and runtime engineers to create sophisticated distributed training solutions. Your responsibilities will include implementing distributed training support in frameworks like PyTorch and TensorFlow, optimizing model performance on AWS Trainium and Inferentia silicon, and working with various ML model families.

The role offers an exciting opportunity to work at the intersection of hardware and software, directly impacting how businesses leverage machine learning at scale. You'll be part of a team that has delivered groundbreaking products like AWS Nitro, ENA, EFA, Graviton, and F1 EC2 Instances.

AWS provides a supportive and inclusive work environment with a strong focus on work-life balance. The company offers comprehensive benefits, mentorship opportunities, and a culture that celebrates diversity through various employee-led affinity groups. You'll have the chance to grow professionally while working on challenging problems that affect millions of users worldwide.

The compensation is competitive, ranging from $129,300 to $223,600 per year, depending on location and experience, plus additional benefits and potential equity. This is an excellent opportunity for someone with strong software development skills and ML knowledge who wants to make a significant impact in the cloud computing and machine learning space.

Last updated 3 months ago

Responsibilities For Software Engineer- AI/ML, AWS Neuron Distributed Training

  • Lead efforts to build distributed training support into PyTorch and TensorFlow using XLA
  • Work with Neuron compiler and runtime stacks
  • Tune ML models for highest performance on AWS Trainium and Inferentia silicon
  • Develop and enable various ML model families, including GPT-2, GPT-3, Stable Diffusion, and Vision Transformers
  • Work with chip architects, compiler engineers and runtime engineers
  • Create, build and tune distributed training solutions with Trn1

Requirements For Software Engineer- AI/ML, AWS Neuron Distributed Training

  • Python
  • 3+ years of non-internship professional software development experience
  • 3+ years of non-internship design or architecture experience
  • Experience programming with at least one software programming language
  • Deep Learning industry experience
  • Experience with PyTorch/JAX/TensorFlow
  • Knowledge of distributed training libraries and frameworks
  • Bachelor's degree in computer science or equivalent (preferred)

Benefits For Software Engineer- AI/ML, AWS Neuron Distributed Training

  • Medical insurance
  • 401(k)
  • Medical, financial, and other benefits
  • Flexible working hours
  • Mentorship and career growth opportunities
  • Employee-led affinity groups
  • Work-life balance focus
