AWS Neuron is the complete software stack for the AWS Inferentia and Trainium (Neuron) cloud-scale machine learning accelerators. As a Senior Software Development Manager (Sr. SDM) for the Machine Learning Distributed Training, Core Technologies and Infra org, you will lead strong teams of software engineers and managers to help design and deploy software that enables ML workloads to run seamlessly on these new products.
Key responsibilities:
- Manage the full development lifecycle of integrations and extensions for training support in PyTorch, XLA, JAX, and distributed training libraries like FSDP.
- Lead characterization, enablement, and development of existing and future massive-scale ML models like Claude 3, GPT-4, ViT, LLaVA, Stable Diffusion 3, and more.
- Ensure support for key ML functionality in a combined chip/software platform.
- Work with executive leadership and other senior management to define product directions and deliver them to customers.
- Build massive-scale distributed training and inference solutions.
The role requires:
- 10+ years of engineering experience
- 5+ years of engineering team management experience
- 10+ years of experience planning, designing, developing, and delivering consumer software
- Experience partnering with product and program management teams
- Experience managing multiple concurrent programs, projects, and development teams in an Agile environment
Preferred qualifications:
- Experience designing and developing large-scale, high-traffic applications
- 5+ years of industry experience in machine learning or deep learning software, frameworks, and/or infrastructure
Amazon offers a comprehensive benefits package and values work-life harmony. The company is committed to diversity and inclusion, providing ongoing events, learning experiences, and employee-led affinity groups to foster an inclusive team culture.