System Development Engineer, Annapurna Labs, Machine Learning Accelerator Systems - Fleet Triage

AWS is the world's most comprehensive and broadly adopted cloud platform, a pioneer of cloud computing with a record of continuous innovation.
DevOps
Mid-Level Software Engineer
Hybrid
5,000+ Employees
2+ years of experience
AI · Enterprise SaaS

Description For System Development Engineer, Annapurna Labs, Machine Learning Accelerator Systems - Fleet Triage

Annapurna Labs, acquired by Amazon in 2015, builds infrastructure for AWS, with a focus on silicon engineering, hardware design and verification, software, and operations. The Machine Learning System Operations and Automation Team is seeking candidates to write automation software for its global ML server fleet. The role involves working with cutting-edge ML products and developing automation software at massive scale.

The position is part of the MLA Systems Fleet Triage team, which is responsible for diagnosing and resolving complex hardware and software failures in ML-optimized servers at scale. Team members collaborate with hardware design, firmware, and validation teams to improve test coverage and failure detection in production. The role combines hands-on technical work with strategic problem-solving and offers exposure to AWS's most advanced server systems.

AWS values diverse experiences and maintains an inclusive culture through employee-led affinity groups and ongoing learning experiences. The company offers flexible work arrangements that support work-life harmony, and provides extensive opportunities for knowledge-sharing and mentorship. The hybrid work model lets engineers choose between daily office attendance and flexible arrangements near US Amazon offices.

This role is ideal for candidates who enjoy solving complex technical challenges, are data-driven, and are passionate about working with cutting-edge ML infrastructure at scale. You'll be part of a team that's at the forefront of hardware/software co-design, contributing to products like AWS Nitro, Graviton, and ML Accelerators.

Responsibilities For System Development Engineer, Annapurna Labs, Machine Learning Accelerator Systems - Fleet Triage

  • Monitor, optimize, and remediate hardware in ML servers
  • Root cause hardware failures and identify live trends
  • Implement and improve system level testing
  • Develop maintainable and reusable software
  • Build high-impact solutions for a large customer base
  • Participate in design discussions and code review
  • Work cross-functionally to drive business decisions

Requirements For System Development Engineer, Annapurna Labs, Machine Learning Accelerator Systems - Fleet Triage

  • Python
  • Go
  • Linux
  • 2+ years of professional software development experience
  • 1+ years of experience designing or architecting systems
  • 3+ years of administration experience in networking, storage systems, and operating systems
  • Knowledge of systems engineering fundamentals
  • Experience with modern programming languages (C++, C#, Java, Python, Golang, PowerShell, Ruby)

Benefits For System Development Engineer, Annapurna Labs, Machine Learning Accelerator Systems - Fleet Triage

  • Medical insurance
  • Dental insurance
  • Vision insurance
  • Flexible work hours
  • Mentorship and career growth opportunities
  • Inclusive team culture
  • Employee-led affinity groups
  • Hybrid work options
