AWS Utility Computing (UC) is seeking a DevOps Engineer for the Machine Learning (ML) Infrastructure team to build tools that guarantee top performance of AWS ML and High Performance Computing (HPC) technologies. The role involves working with CI/CD automation and with ML and HPC benchmarks and applications in support of cutting-edge software development.
Key responsibilities include:
- Leading a team that builds and maintains infrastructure for monitoring and reporting on large-scale testing workloads.
- Using internal Amazon CI/CD tools, Linux, and AWS products to automate software delivery.
- Writing Python code to manage large clusters and run ML and HPC workload benchmarks.
- Creating dashboards using Amazon Managed Grafana, QuickSight, OpenSearch, and Athena to analyze performance data.
- Developing automatic mechanisms to alert developers about functional and performance regressions.
- Managing complex infrastructure covering various instance types, software stacks, and Linux operating systems.
- Ensuring all infrastructure setup is defined as code (IaC), reviewed, and committed to automated pipelines.
- Scheduling work using Jenkins to support the development team while optimizing cluster costs.
- Reviewing dashboard and automation results, triaging failures, and introducing new tests and platforms.
- Creating reports on the CI/CD system status for stakeholders.
The role is part of Annapurna Labs, an AWS subsidiary that builds software and hardware for ML and HPC on EC2. The team is focused on making AWS the best and most cost-effective platform for running AI and HPC workloads at scale.
AWS values diverse experiences and work-life harmony, and it fosters an inclusive team culture. The company offers mentorship and career growth opportunities, as well as employee-led affinity groups and ongoing learning experiences.