DL Communications Collectives SW Engineer

Rivos is working on software to improve the Deep Learning ecosystem and help hardware engineers build great Deep Learning parallel systems.
Santa Clara, CA, USA · Austin, TX, USA · Portland, OR, USA...
Backend
Senior Software Engineer
In-Person
5+ years of experience
AI

Description For DL Communications Collectives SW Engineer

Rivos is seeking a DL Communications Collectives SW Engineer to join their team working on improving the Deep Learning ecosystem. This role involves designing and implementing highly optimized communication collectives libraries similar to UCC and NCCL. The ideal candidate will work closely with hardware and software teams to ensure efficient data communication and synchronization across multiple AI accelerators in a distributed system.

Key responsibilities include building communication components of an AI Software Stack, porting AI Software to new hardware platforms, and optimizing communication within AI applications. The engineer will design and implement various communication collectives, optimize algorithms for multi-node clusters, and ensure low-latency, high-bandwidth communication across multi-GPU setups.

The ideal candidate should have a strong background in GPU architectures, parallel and distributed algorithms, and experience with network interconnects. Proficiency in communication collectives libraries, deep learning frameworks, and low-level performance optimization on GPU architectures is crucial. The role requires excellent problem-solving skills, strong communication abilities, and the capacity to work effectively in a fast-paced, collaborative environment.

Rivos offers the opportunity to work with industry veterans, learning technical and organizational skills while contributing to open-source projects. This position is perfect for someone passionate about advancing AI technology and eager to tackle complex challenges in distributed computing and machine learning.


Responsibilities For DL Communications Collectives SW Engineer

  • Build the communication components of an AI software stack
  • Port AI software to run on a new hardware platform
  • Profile and tune communication within AI applications
  • Design, develop, and optimize communication collectives for large-scale distributed computing and machine learning frameworks
  • Implement and optimize communication algorithms tailored for our architectures and multi-node clusters
  • Ensure low-latency, high-bandwidth communication across multi-GPU setups
  • Collaborate with hardware engineers and other software teams to optimize performance
  • Implement fault tolerance and scalability mechanisms in distributed systems
  • Write unit tests and benchmark tools to validate performance and correctness of collective operations
  • Stay current with advancements in hardware and networking technologies

Requirements For DL Communications Collectives SW Engineer

  • Python
  • Strong understanding of GPU architectures and experience in GPU programming
  • Proficiency in designing and implementing parallel and distributed algorithms
  • Experience with network interconnects and understanding of their performance implications
  • Hands-on experience with communication collectives libraries like UCC, NCCL, or MPI
  • Strong knowledge of concurrency, synchronization, and memory consistency models
  • Experience with profiling and optimizing low-level performance on GPU architectures
  • Familiarity with deep learning frameworks and their use of communication collectives
  • Strong problem-solving skills and ability to work in a fast-paced, collaborative environment
  • Network driver experience recommended
  • Excellent problem-solving skills and strong written and verbal communication
  • Strong organizational skills and a high degree of self-motivation
  • Ability to work well in a team and be productive under aggressive schedules
  • Bachelor's, Master's, or PhD in Computer Engineering, Software Engineering or Computer Science
