Software Development and DevOps Engineer, EFA

Amazon Web Services (AWS) is the world's most comprehensive and broadly adopted cloud platform, pioneering cloud computing and continuously innovating.
DevOps
Senior Software Engineer
In-Person
5,000+ Employees
2+ years of experience
AI · Enterprise SaaS · Cloud

Description For Software Development and DevOps Engineer, EFA

AWS Utility Computing (UC) is seeking a DevOps Engineer for the Machine Learning (ML) Infrastructure team to build tools that guarantee top performance of AWS ML and High Performance Computing (HPC) technologies. The role involves working with CI/CD automation, ML and HPC benchmarks, and applications for cutting-edge software development.

Key responsibilities include:

  • Leading a team that builds and maintains infrastructure for monitoring and reporting on large-scale testing workloads.
  • Using internal Amazon CI/CD tools, Linux, and AWS products to automate software delivery.
  • Writing Python code to manage large clusters and run ML and HPC workload benchmarks.
  • Creating dashboards using AWS Managed Grafana, QuickSight, OpenSearch, and Athena to analyze performance data.
  • Developing automatic mechanisms to alert developers about functional and performance regressions.
  • Managing complex infrastructure covering various instance types, software stacks, and Linux operating systems.
  • Ensuring all infrastructure setup is defined as code (IaC), reviewed, and committed to automated pipelines.
  • Scheduling work using Jenkins to support the development team while optimizing cluster costs.
  • Reviewing dashboard and automation results, triaging failures, and introducing new tests and platforms.
  • Creating reports on the CI/CD system status for stakeholders.

The role is part of Annapurna Labs, an AWS subsidiary that builds software and hardware for ML and HPC on EC2. The team is focused on making AWS the best and most cost-effective platform for running AI and HPC workloads at scale.

AWS values diverse experiences and work-life harmony, and fosters an inclusive team culture. The company offers mentorship and career growth opportunities, as well as employee-led affinity groups and ongoing learning experiences.

Responsibilities For Software Development and DevOps Engineer, EFA

  • Lead a team building and maintaining infrastructure for monitoring large-scale testing workloads
  • Automate software delivery using Amazon CI/CD tools, Linux, and AWS products
  • Develop Python code for managing large clusters and running ML/HPC benchmarks
  • Create dashboards using AWS tools to analyze performance data
  • Implement automatic alerting mechanisms for functional and performance regressions
  • Manage complex infrastructure across various instance types and software stacks
  • Ensure infrastructure setup is defined as code (IaC) and follows proper review processes
  • Optimize work scheduling using Jenkins to support development while managing costs
  • Review and triage automation results, introducing new tests and platforms
  • Prepare CI/CD system status reports for stakeholders

Requirements For Software Development and DevOps Engineer, EFA

Python
Linux
Kubernetes
  • 2+ years of non-internship professional software development testing experience
  • Experience programming with at least one modern language such as Java, C++, or C# including object-oriented design
  • Experience in penetration testing and exploitability-focused vulnerability assessment
  • Experience in platform-level security mitigations and hardening for Linux and Windows
  • Knowledge of CI/CD automation
  • Familiarity with ML and HPC benchmarks and applications
  • Proficiency in Python programming
  • Experience with AWS services and tools

Benefits For Software Development and DevOps Engineer, EFA

  • Flexible work hours
  • Employee-led affinity groups
  • Ongoing learning experiences
  • Mentorship opportunities
  • Career advancement resources
