Annapurna Labs, acquired by AWS in 2015, serves as the infrastructure provider for AWS, focusing on silicon engineering, hardware design and verification, software, and operations. The Machine Learning System Operations and Automation Team is seeking candidates to write automation software for its global ML server fleet. The role involves working with cutting-edge ML products and developing autonomous software that operates at massive scale.
The position is part of the MLA Systems Fleet Triage team, which is responsible for addressing complex hardware and software failures in ML-optimized servers at scale. Team members collaborate with hardware design, firmware, and validation teams to improve test coverage and failure detection in production environments. The role combines hands-on technical work with strategic problem-solving, offering exposure to AWS's most advanced server systems.
AWS values diverse experiences and maintains an inclusive culture through employee-led affinity groups and ongoing learning experiences. The company offers flexible work arrangements that support work-life harmony and provides extensive opportunities for knowledge-sharing and mentorship. The hybrid work model allows engineers to choose between daily office attendance and flexible arrangements near US Amazon offices.
This role is ideal for candidates who enjoy solving complex technical challenges, are data-driven, and are passionate about working with cutting-edge ML infrastructure at scale. You'll be part of a team that's at the forefront of hardware/software co-design, contributing to products like AWS Nitro, Graviton, and ML Accelerators.