AWS AI is revolutionizing deep learning in the cloud through Amazon SageMaker, where we build customer-facing services for data scientists and software engineers. As customers increasingly adopt LLMs and generative AI, we are developing a next-generation AI platform optimized for LLMs and distributed training.
This role is on the SageMaker HyperPod team, where you'll design, develop, and deploy distributed machine learning systems for customers worldwide. You'll work closely with ML scientists and customers to shape strategy and define roadmaps, translating requirements into technical specifications for scalable solutions.
Key responsibilities include:
- Developing innovative solutions for Large Language Model training across multi-node clusters
- Building and maintaining performant, resilient services for training large-scale foundation models
- Optimizing distributed training through performance profiling and bottleneck resolution
- Leading complex projects and serving as a technical resource throughout development
- Mentoring junior engineers and driving best practices
The ideal candidate brings:
- Strong background in large-scale software systems
- Experience with multi-threaded asynchronous C++/Go development
- Knowledge of Kubernetes, high-performance computing, and large language model training
- Passion for building platforms that train 100+ billion-parameter GPT models across thousands of GPU devices
Benefits include:
- Flexible hybrid work options
- Comprehensive mentorship and career growth opportunities
- Inclusive team culture with employee-led affinity groups
- Work-life harmony focus
- Competitive compensation package including equity and benefits
Join AWS to make a significant impact on cloud computing, serving customers worldwide while working with cutting-edge AI technology.