DL Communications Collectives SW Engineer

Rivos is working on software to improve the Deep Learning ecosystem and help hardware engineers build great Deep Learning parallel systems.
Santa Clara, CA, USA; Austin, TX, USA; Portland, OR, USA
Backend
Senior Software Engineer
In-Person
5+ years of experience
AI

Description For DL Communications Collectives SW Engineer

Rivos is seeking a DL Communications Collectives SW Engineer to join its team working on improving the Deep Learning ecosystem. This role involves designing and implementing highly optimized communication collectives libraries similar to UCC and NCCL. The ideal candidate will work closely with hardware and software teams to ensure efficient data communication and synchronization across multiple AI accelerators in a distributed system.

Key responsibilities include building communication components of an AI Software Stack, porting AI Software to new hardware platforms, and optimizing communication within AI applications. The engineer will design and implement various communication collectives, optimize algorithms for multi-node clusters, and ensure low-latency, high-bandwidth communication across multi-GPU setups.

The ideal candidate should have a strong background in GPU architectures, parallel and distributed algorithms, and experience with network interconnects. Proficiency in communication collectives libraries, deep learning frameworks, and low-level performance optimization on GPU architectures is crucial. The role requires excellent problem-solving skills, strong communication abilities, and the capacity to work effectively in a fast-paced, collaborative environment.

Rivos offers the opportunity to work with industry veterans, learning technical and organizational skills while contributing to open-source projects. This position is perfect for someone passionate about advancing AI technology and eager to tackle complex challenges in distributed computing and machine learning.

Last updated 2 months ago

Responsibilities For DL Communications Collectives SW Engineer

  • Build up the communication components of an AI software stack
  • Port AI software to run on a new hardware platform
  • Profile and tune communication within AI applications
  • Design, develop, and optimize communication collectives for large-scale distributed computing and machine learning frameworks
  • Implement and optimize communication algorithms tailored for our architectures and multi-node clusters
  • Ensure low-latency, high-bandwidth communication across multi-GPU setups
  • Collaborate with hardware engineers and other software teams to optimize performance
  • Implement fault tolerance and scalability mechanisms in distributed systems
  • Write unit tests and benchmark tools to validate performance and correctness of collective operations
  • Stay current with advancements in hardware and networking technologies

Requirements For DL Communications Collectives SW Engineer

Python
  • Strong understanding of GPU architectures and experience in GPU programming
  • Proficiency in designing and implementing parallel and distributed algorithms
  • Experience with network interconnects and understanding of their performance implications
  • Hands-on experience with communication collectives libraries like UCC, NCCL, or MPI
  • Strong knowledge of concurrency, synchronization, and memory consistency models
  • Experience with profiling and optimizing low-level performance on GPU architectures
  • Familiarity with deep learning frameworks and their use of communication collectives
  • Strong problem-solving skills and ability to work in a fast-paced, collaborative environment
  • Network driver experience recommended
  • Excellent problem-solving, written, and verbal communication skills
  • Strong organizational skills and a high degree of self-motivation
  • Ability to work well in a team and be productive under aggressive schedules
  • Bachelor's, Master's, or PhD in Computer Engineering, Software Engineering or Computer Science
