Sr. Software Development Engineer, HPC/ML Networking Engineer

Amazon Web Services (AWS) is the world's most comprehensive and broadly adopted cloud platform, pioneering cloud computing innovation.
$151,300 - $261,500
Distributed Systems
Senior Software Engineer
In-Person
5,000+ Employees
5+ years of experience
AI · Enterprise SaaS

Description For Sr. Software Development Engineer, HPC/ML Networking Engineer

Join Annapurna Labs, a crucial part of AWS, where we're revolutionizing cloud computing through custom chips and software development. We're seeking a Senior Software Engineer to join our Elastic Collectives team, focusing on optimizing networking solutions for ML and HPC workloads. You'll work with cutting-edge AI/ML technologies, designing systems that scale network-intensive workloads across thousands of processors.

Our team specializes in building the collective operations layer for distributed machine learning, working with both Trainium and Nvidia stacks. You'll collaborate with principal and senior principal engineers daily, hunting for performance bottlenecks and optimizing customer ML/AI workloads. Your work will directly impact every AWS customer using large model training and inference.

We offer a supportive environment that celebrates knowledge-sharing and mentorship, with team members of various experience levels. You'll benefit from one-on-one mentoring, thorough code reviews, and projects that enhance your engineering expertise. We value work-life harmony and provide flexible working hours to ensure long-term success both personally and professionally.

At AWS, we embrace diversity and foster an inclusive culture through employee-led affinity groups and ongoing learning experiences. Our comprehensive benefits package includes medical, financial, and other support systems to help you thrive. If you're passionate about solving complex infrastructure problems, working with HPC and ML customers, and delivering meaningful solutions at scale, this role offers an exciting opportunity to shape the future of cloud computing.

The position requires strong expertise in low-latency networking, collective operations, and kernel-level programming. You'll need excellent problem-solving abilities and strong communication skills to work effectively in our collaborative environment. Join us in pushing the boundaries of what's possible in cloud computing while growing your career with one of technology's most innovative companies.

Responsibilities For Sr. Software Development Engineer, HPC/ML Networking Engineer

  • Design systems that enable scaling network-intensive workloads over thousands of CPUs, GPUs, and TPUs
  • Optimize networking for AI workloads such as LLMs
  • Design and optimize networking solutions for Machine Learning and HPC workloads
  • Collaborate with cross-functional teams
  • Engage with customers to gather feedback
  • Troubleshoot complex networking issues
  • Build the collective operations layer in the Trainium and Nvidia stacks for distributed machine learning

Requirements For Sr. Software Development Engineer, HPC/ML Networking Engineer

  • 5+ years of non-internship professional software development experience
  • 5+ years of programming experience with at least one programming language
  • 5+ years of experience leading design or architecture
  • 5+ years of full software development life cycle experience
  • Experience as a mentor or tech lead, or leading an engineering team
  • Deep understanding of Linux and kernel-level programming
  • Proficiency in C/C++
  • Experience in low-latency networking and collective operations

Benefits For Sr. Software Development Engineer, HPC/ML Networking Engineer

  • Medical insurance
  • 401(k)
  • Mental health assistance
  • Medical, financial, and other benefits
  • Flexible working hours
  • Career growth and mentorship opportunities
  • Learning experiences through the Conversations on Race and Ethnicity (CORE) and AmazeCon (gender diversity) conferences
  • Work-life balance support
