
Sr. Software Engineer- AI/ML, AWS Neuron Distributed Training

AWS Utility Computing provides product innovations and cloud services including S3, EC2, and other foundational AWS services.
$151,300 - $261,500
Machine Learning
Senior Software Engineer
In-Person
5,000+ Employees
5+ years of experience
AI · Enterprise SaaS

Description For Sr. Software Engineer- AI/ML, AWS Neuron Distributed Training

AWS Neuron is seeking a Senior Software Engineer to join their Machine Learning Applications (ML Apps) team, focusing on distributed training solutions. This role combines deep software engineering expertise with machine learning knowledge to develop and optimize ML frameworks for AWS's custom silicon. You'll work on AWS Neuron, the complete software stack for AWS Inferentia and Trainium cloud-scale machine learning accelerators.

The position involves working with cutting-edge ML models, including large language models such as GPT-2 and GPT-3, as well as Stable Diffusion and Vision Transformers. You'll collaborate with chip architects and engineers to build distributed training solutions using technologies like FSDP and DeepSpeed. The role requires expertise in both software development and machine learning, particularly in Python-based frameworks.

As part of AWS Utility Computing, you'll contribute to foundational services that power cloud computing worldwide. The team culture emphasizes learning, diversity, and work-life harmony. Amazon offers comprehensive benefits, mentorship opportunities, and strong career growth potential.

Key responsibilities include implementing distributed training support across major ML frameworks, optimizing model performance on custom silicon, and leading technical initiatives. The ideal candidate brings 5+ years of software development experience, strong ML knowledge, and leadership experience.

This role offers the opportunity to work on next-generation AI infrastructure at scale, with competitive compensation ranging from $151,300 to $261,500 based on location, plus equity and comprehensive benefits. Join us in shaping the future of machine learning infrastructure at AWS.


Responsibilities For Sr. Software Engineer- AI/ML, AWS Neuron Distributed Training

  • Lead efforts to build distributed training and inference support into PyTorch, TensorFlow, and JAX
  • Work with chip architects, compiler engineers, and runtime engineers
  • Create, build, and tune distributed training solutions
  • Tune the performance of ML model families including GPT-2, GPT-3, Stable Diffusion, and Vision Transformers
  • Ensure the highest performance and maximize efficiency on AWS Trainium and Inferentia silicon

Requirements For Sr. Software Engineer- AI/ML, AWS Neuron Distributed Training

  • Python
  • 5+ years of non-internship professional software development experience
  • 5+ years of programming experience in at least one programming language
  • 5+ years of leading design or architecture experience
  • 5+ years of full software development life cycle experience
  • Experience as a mentor, tech lead or leading an engineering team
  • Bachelor's degree in computer science or equivalent (preferred)
  • Knowledge of machine learning frameworks and end-to-end model training

Benefits For Sr. Software Engineer- AI/ML, AWS Neuron Distributed Training

  • Medical insurance
  • 401(k)
  • Vision insurance
  • Dental insurance
  • Parental leave
  • Work-life harmony
  • Mentorship opportunities
  • Career growth resources
  • Comprehensive benefits package
