Software Engineer- AI/ML, AWS Neuron Distributed Training

AWS (Amazon Web Services) is a leading cloud infrastructure company that provides services to millions of customers worldwide.
$129,300 - $223,600
Machine Learning
Mid-Level Software Engineer
In-Person
5,000+ Employees
3+ years of experience
AI · Enterprise SaaS · Cloud

Description For Software Engineer- AI/ML, AWS Neuron Distributed Training

AWS Neuron is seeking a Software Engineer to join its Machine Learning Applications team, focusing on distributed training solutions. The role sits within Annapurna Labs, which AWS acquired in 2015 and which now serves as the infrastructure provider for AWS. The work centers on AWS Neuron, the complete software stack for the AWS Inferentia and Trainium cloud-scale machine learning accelerators.

The role requires expertise in developing and optimizing distributed training support for major ML frameworks such as PyTorch and TensorFlow. You'll work closely with chip architects and compiler engineers to create efficient solutions for Trn1 systems. The position also involves performance tuning of ML models, including large language models such as GPT-2 and GPT-3 and image-generation models such as Stable Diffusion.

AWS offers a collaborative environment with strong emphasis on work-life balance and professional growth. The team values knowledge sharing and mentorship, providing opportunities to work on complex projects that impact millions of users. The company provides comprehensive benefits and promotes an inclusive culture through various employee-led affinity groups.

This is an excellent opportunity for engineers passionate about machine learning infrastructure who want to work at the intersection of hardware and software optimization. You'll be part of a team that's pushing the boundaries of ML acceleration and distributed computing, while enjoying the stability and resources of one of the world's leading tech companies.

Last updated a day ago

Responsibilities For Software Engineer- AI/ML, AWS Neuron Distributed Training

  • Lead efforts building distributed training support into PyTorch and TensorFlow
  • Work with chip architects, compiler engineers and runtime engineers
  • Create, build and tune distributed training solutions with Trn1
  • Tune the performance of ML model families including GPT-2, GPT-3, and Stable Diffusion
  • Ensure the highest performance and maximum efficiency on AWS Trainium and Inferentia silicon

Requirements For Software Engineer- AI/ML, AWS Neuron Distributed Training

Python
TypeScript
  • 3+ years of non-internship professional software development experience
  • 3+ years of non-internship design or architecture experience
  • Experience programming in at least one programming language
  • Deep Learning industry experience
  • Experience with PyTorch/JAX/TensorFlow
  • Knowledge of distributed libraries and frameworks
  • Experience with end-to-end model training

Benefits For Software Engineer- AI/ML, AWS Neuron Distributed Training

  • Medical, dental, and vision insurance
  • 401k
  • Medical, financial, and other benefits
  • Flexible working hours
  • Mentorship and career growth opportunities
  • Employee-led affinity groups
  • Work-life balance focus

