Software Engineer - AI/ML, AWS Neuron Distributed Training

AWS is a leading cloud infrastructure company, with Annapurna Labs serving as AWS's infrastructure provider.
$129,300 - $223,600
Machine Learning
Mid-Level Software Engineer
In-Person
5,000+ Employees
3+ years of experience
AI · Enterprise SaaS

Description For Software Engineer - AI/ML, AWS Neuron Distributed Training

AWS Neuron is seeking a Software Engineer II to join its Machine Learning Applications team, focusing on distributed training solutions. This role is part of Annapurna Labs, acquired by AWS in 2015, which serves as the infrastructure provider for AWS. The position involves working with AWS Inferentia and Trainium, AWS's cloud-scale machine learning accelerators.

The role requires expertise in distributed training libraries like FSDP and DeepSpeed, and involves close collaboration with chip architects, compiler engineers, and runtime engineers. You'll be responsible for developing and optimizing support for various ML model families, including large language models like GPT-2/GPT-3, Stable Diffusion, and Vision Transformers.

AWS emphasizes work-life balance, mentorship, and career growth. The company maintains an inclusive culture with ten employee-led affinity groups and innovative benefit offerings. The team values knowledge sharing and supports new members through a broad mix of experience levels and tenures.

This position offers competitive compensation ranging from $129,300 to $223,600 based on geographic location, plus equity and comprehensive benefits. The role presents significant opportunities for working with large-scale systems and contributing to AWS's continued innovation in cloud infrastructure and machine learning acceleration.


Responsibilities For Software Engineer - AI/ML, AWS Neuron Distributed Training

  • Lead efforts to build distributed training support into PyTorch and TensorFlow using XLA
  • Work with Neuron compiler and runtime stacks
  • Tune models for highest performance on AWS Trainium and Inferentia silicon
  • Develop and enable ML model families including GPT-2, GPT-3, Stable Diffusion, and Vision Transformers
  • Work with chip architects, compiler engineers and runtime engineers
  • Create, build and tune distributed training solutions with Trn1

Requirements For Software Engineer - AI/ML, AWS Neuron Distributed Training

Python
  • 3+ years of non-internship professional software development experience
  • 3+ years of non-internship design or architecture experience
  • Experience programming in at least one programming language
  • Deep Learning industry experience
  • Experience with PyTorch/JAX/TensorFlow
  • Knowledge of distributed libraries and frameworks
  • End-to-end Model Training experience

Benefits For Software Engineer - AI/ML, AWS Neuron Distributed Training

Medical Insurance
  • Medical, financial, and other benefits
  • Flexible working hours
  • Mentorship and career growth opportunities
  • Employee-led affinity groups
  • Work-life balance focus

