We are the GPU Communications Libraries and Networking team at NVIDIA. We deliver communication libraries like NCCL, NVSHMEM, and UCX for Deep Learning and HPC. DL and HPC applications already have huge compute demands and run at scales of up to tens of thousands of GPUs. The GPUs are connected with high-speed interconnects (e.g., NVLink, PCIe) within a node and with high-speed networking (e.g., InfiniBand, Ethernet) across nodes.
Communication performance between the GPUs has a direct impact on end-to-end application performance, and the stakes are even higher at huge scales! We are looking for a technical leader to manage our NVSHMEM and UCX libraries. This is an outstanding opportunity to push the limits of the state of the art and deliver platforms the world has never seen before.
What you will be doing:
- Lead, mentor, and grow your library engineering team, and be responsible for the planning and execution of projects as well as the quality and performance of your libraries.
- Participate in feature design and implementation.
- Interact with internal and external partners and researchers to understand their use cases and requirements.
- Collaborate with engineering teams, program and product management, and partners to define the product roadmap.
- Continuously review and identify improvement opportunities in established processes, infrastructure, and practices.
What we need to see:
- 10+ years of overall experience in the software industry, with specialization in HPC networking or system software.
- 4+ years of management experience.
- BS, MS, or Ph.D. in CS, CE, EE, or a related technical field, or equivalent experience.
- Prior development experience in systems software, communication runtimes, or high-performance networking software.
- Strong understanding of computer system architecture, operating system principles, HW-SW interactions, and performance analysis and optimization.
- Excellent C/C++ programming and debugging skills in Linux.
- Experience balancing multiple projects with competing priorities.
- Flexibility to work and communicate effectively across different teams and time zones.
Ways to stand out:
- Experience with parallel programming models (MPI, SHMEM) and communication runtimes.
- Background with RDMA, high-performance networking technologies, and network architecture.
- Experience with Deep Learning frameworks such as PyTorch and TensorFlow.
NVIDIA offers a diverse, supportive environment where everyone is inspired to do their best work. Join the team and make a lasting impact on the world.