Sr. Software Development Manager, AWS Neuron Machine Learning Distributed Training, Core Technologies and Infra (CoreTex)

Amazon Web Services (AWS) is the world's most comprehensive and broadly adopted cloud platform, pioneering cloud computing and continuously innovating.
$198,100 - $342,300
Machine Learning
Principal Software Engineer
Hybrid
5,000+ Employees
10+ years of experience
AI · Enterprise SaaS

Description For Sr. Software Development Manager, AWS Neuron Machine Learning Distributed Training, Core Technologies and Infra (CoreTex)

AWS Neuron is the complete software stack for the AWS Inferentia and Trainium (Neuron) cloud-scale machine learning accelerators. As a Sr. Software Development Manager (SDM) for the Machine Learning Distributed Training, Core Technologies and Infra org, you will lead strong teams of software engineers and managers to help design and deploy software that enables ML workloads to run seamlessly on these new products.

Key responsibilities:

  • Manage the full development lifecycle of integrations and extensions for training support in PyTorch, XLA, JAX, and distributed training libraries like FSDP.
  • Lead characterization, enablement, and development of existing and future massive-scale ML models like Claude 3, GPT-4, ViT, LLaVA, Stable Diffusion 3, and more.
  • Ensure support for key ML functionality in a combined chip/software platform.
  • Work with executive leadership and other senior management to define product directions and deliver them to customers.
  • Build massive-scale distributed training and inference solutions.

The role requires:

  • 10+ years of engineering experience
  • 5+ years of engineering team management experience
  • 10+ years of experience planning, designing, developing, and delivering consumer software
  • Experience partnering with product and program management teams
  • Experience managing multiple concurrent programs, projects, and development teams in an Agile environment

Preferred qualifications:

  • Experience designing and developing large scale, high-traffic applications
  • 5+ years of industry experience in machine learning or deep learning software, frameworks, and/or infrastructure

Amazon offers a comprehensive benefits package and values work-life harmony. The company is committed to diversity and inclusion, providing ongoing events, learning experiences, and employee-led affinity groups to foster an inclusive team culture.

Last updated 3 months ago

Responsibilities For Sr. Software Development Manager, AWS Neuron Machine Learning Distributed Training, Core Technologies and Infra (CoreTex)

  • Lead strong teams of software engineers and managers
  • Design and deploy software for ML workloads on AWS Neuron accelerators
  • Manage full development lifecycle of integrations for PyTorch, XLA, JAX, and distributed training libraries
  • Lead characterization and enablement of massive-scale ML models
  • Ensure support for key ML functionality in combined chip/software platforms
  • Work with executive leadership to define product directions
  • Build massive-scale distributed training and inference solutions

Requirements For Sr. Software Development Manager, AWS Neuron Machine Learning Distributed Training, Core Technologies and Infra (CoreTex)

Python
Java
  • 10+ years of engineering experience
  • 5+ years of engineering team management experience
  • 10+ years of experience planning, designing, developing, and delivering consumer software
  • Experience partnering with product and program management teams
  • Experience managing multiple concurrent programs, projects and development teams in an Agile environment
