Sr. Software Engineer- AI/ML, AWS Neuron Distributed Training

AWS is the world's most comprehensive and broadly adopted cloud platform, pioneering cloud computing and continuous innovation.
$151,300 - $261,500
Machine Learning
Senior Software Engineer
In-Person
5,000+ Employees
5+ years of experience
AI · Enterprise SaaS
This job posting may no longer be active.

Description For Sr. Software Engineer- AI/ML, AWS Neuron Distributed Training

Annapurna Labs, now fully integrated with AWS after its 2015 acquisition, is seeking a Senior Machine Learning Engineer for its Distributed Training team within AWS Neuron. This role focuses on developing and optimizing distributed training solutions for AWS's cloud-scale Machine Learning accelerators, Trainium and Inferentia. The position involves working with cutting-edge ML models including LLMs like GPT and Llama, as well as Stable Diffusion and Vision Transformers.

The role requires expertise in distributed training libraries such as FSDP, DeepSpeed, and NeMo, along with strong Python skills. You'll collaborate with cross-functional teams including chip architects and compiler engineers to push the boundaries of ML training performance on AWS custom silicon.

AWS values diverse experiences and maintains an inclusive culture through employee-led affinity groups and ongoing learning experiences. The team emphasizes knowledge-sharing and mentorship, making it an ideal environment for professional growth. Work-life harmony is prioritized, ensuring success both at work and home.

The position offers competitive compensation ranging from $151,300 to $261,500 per year, depending on location and experience, plus additional benefits including equity and sign-on payments. This is an opportunity to work at the forefront of ML infrastructure, developing solutions that enable customers to solve previously unimaginable technical challenges.

As part of Annapurna Labs, you'll be working with the team responsible for critical AWS infrastructure components including AWS Nitro, Graviton, and ML Accelerators. The role combines deep technical expertise with leadership opportunities, making it perfect for experienced engineers passionate about advancing ML technology at scale.

Last updated 8 months ago

Responsibilities For Sr. Software Engineer- AI/ML, AWS Neuron Distributed Training

  • Lead efforts to build distributed training support into PyTorch and JAX using XLA
  • Optimize models to achieve peak performance on AWS custom silicon
  • Work with chip architects, compiler engineers and runtime engineers
  • Create, build and tune distributed training solutions with Trainium instances
  • Develop and enable performance tuning of ML model families including LLMs

Requirements For Sr. Software Engineer- AI/ML, AWS Neuron Distributed Training

Python
Java
  • Bachelor's degree in computer science or equivalent
  • 5+ years of non-internship professional software development experience
  • 5+ years of programming with at least one software programming language
  • 5+ years of leading design or architecture experience
  • 5+ years of full software development life cycle experience
  • Experience as a mentor, tech lead or leading an engineering team
  • Experience in machine learning, data mining, statistics or natural language processing

Benefits For Sr. Software Engineer- AI/ML, AWS Neuron Distributed Training

  • Medical, financial, and other benefits
  • Equity compensation
  • Sign-on payments
  • Mentorship and career growth opportunities
  • Work-life harmony
  • Inclusive team culture
