Senior HPC and AI Networking Performance Research and Analysis Engineer

NVIDIA

NVIDIA is the world leader in accelerated computing, pioneering solutions in AI and digital twins.

Yokne'am Illit, Israel

Senior Software Engineer

In-Person

5+ years of experience

Description For Senior HPC and AI Networking Performance Research and Analysis Engineer

NVIDIA, the world leader in accelerated computing, is seeking a Senior HPC and AI Networking Performance Research and Analysis Engineer to join its Performance group. This role focuses on profiling and analyzing AI workloads on large-scale GPU and CPU clusters, specifically for distributed Deep Learning LLM training.

The position offers an opportunity to work with cutting-edge hardware and platforms, including HCAs, Switches, CPUs, GPUs, and Systems. You'll be developing and utilizing performance analysis tools to understand performance expectations, limitations, and bottlenecks in high-performance networking environments.

As a senior engineer, you'll be responsible for benchmarking and analyzing performance of AI workloads, with a particular emphasis on networking aspects. The role requires expertise in high-performance computing, deep learning frameworks, and networking protocols such as RDMA and RoCE.

The ideal candidate will have a strong background in Computer Science or Software Engineering, with at least 5 years of experience in high-performance networking. You'll need demonstrated skills in performance analysis, experience with NVIDIA technologies, and proficiency in Python, Bash, and C programming languages.

What makes this role particularly exciting is the opportunity to work on some of the most advanced AI and machine learning systems in the world. You'll be contributing to the development and optimization of large language models and distributed training systems, working with state-of-the-art technology in a collaborative environment.

NVIDIA offers a diverse and inclusive workplace, committed to fostering innovation and professional growth. This role provides an excellent opportunity to work at the forefront of AI and high-performance computing, making a significant impact on the future of technology.

Last updated 6 months ago

Responsibilities For Senior HPC and AI Networking Performance Research and Analysis Engineer

Profile and analyze AI workloads on large-scale GPU and CPU clusters for distributed Deep Learning LLM training
Research AI workloads and DL models for large-scale deep learning LLM training
Benchmark, profile, and analyze performance to find bottlenecks
Implement performance analysis tools
Collaborate with hardware and software teams
Define performance test plans and set performance expectations

Requirements For Senior HPC and AI Networking Performance Research and Analysis Engineer

Python

Linux

Kubernetes

B.Sc. in Computer Science or Software Engineering
5+ years of experience with high-performance networking (RDMA, MPI)
Demonstrated Performance Analysis skills and methodologies
Experience with NVIDIA GPUs, the CUDA library, and deep learning frameworks
Fast learner and self-starter with strong analytical skills
Programming languages: Python, Bash, and C
Experience with Linux distributions
Team player with good communication and interpersonal skills