Software Engineer - AI/ML, AWS Neuron Distributed Training - Multimodal

Amazon Web Services (AWS) is the world's most comprehensive and broadly adopted cloud platform, pioneering cloud computing and continuously innovating.
$129,300 - $223,600
Machine Learning
Senior Software Engineer
5,000+ Employees
3+ years of experience
AI · Enterprise SaaS

Description For Software Engineer - AI/ML, AWS Neuron Distributed Training - Multimodal

AWS Utility Computing (UC) provides product innovations that set AWS's services apart in the industry. As a member of the UC organization, you'll support the development and management of Compute, Database, Storage, Platform, and Productivity Apps services in AWS, including specialized security solutions. This role may involve exposure to Amazon's growing suite of generative AI services and cutting-edge cloud computing offerings.

Annapurna Labs, within AWS UC, designs silicon and software that accelerates innovation. AWS Neuron is the complete software stack for AWS Inferentia and Trainium, our cloud-scale Machine Learning accelerators. This role is for a machine learning engineer on the Distributed Training team for AWS Neuron, responsible for development, enablement, and performance tuning of various ML model families, including massive-scale Large Language Models (LLMs), Stable Diffusion, and Vision Transformers (ViT).

Key responsibilities include:

  • Leading efforts to build distributed training support into PyTorch and TensorFlow using XLA, Neuron compiler, and runtime stacks.
  • Tuning models for highest performance and efficiency on AWS Trainium and Inferentia silicon and on Trn1, Inf1, and Inf2 servers.
  • Working with chip architects, compiler engineers, and runtime engineers to create, build, and tune distributed training solutions.

The team values knowledge-sharing, mentorship, and career growth. They offer a supportive environment for new members, with opportunities for one-on-one mentoring and thorough code reviews.

AWS values diverse experiences and encourages candidates to apply even if they don't meet all qualifications. The company fosters an inclusive culture through employee-led affinity groups and ongoing learning experiences.

Work-life harmony is prioritized, with flexibility as part of the working culture. AWS strives to become Earth's Best Employer by providing resources for knowledge-sharing, mentorship, and career advancement.

Last updated 3 months ago

Responsibilities For Software Engineer - AI/ML, AWS Neuron Distributed Training - Multimodal

  • Lead efforts to build distributed training support into PyTorch and TensorFlow using XLA and the Neuron compiler and runtime stacks
  • Tune models for highest performance and efficiency on AWS Trainium and Inferentia silicon and on Trn1, Inf1, and Inf2 servers
  • Work with chip architects, compiler engineers, and runtime engineers to create, build, and tune distributed training solutions
  • Develop and enable a wide variety of ML model families, including massive-scale Large Language Models (LLMs), Stable Diffusion, and Vision Transformers (ViT)

Requirements For Software Engineer - AI/ML, AWS Neuron Distributed Training - Multimodal

Python
  • Bachelor's degree in computer science or equivalent
  • 3+ years of non-internship professional software development experience
  • 2+ years of non-internship design or architecture experience
  • Experience programming with at least one software programming language
  • Experience in machine learning, data mining, information retrieval, statistics or natural language processing
