NVIDIA, the world leader in accelerated computing, is seeking a Senior HPC and AI Networking Performance Research and Analysis Engineer to join its Performance group. This role focuses on profiling and analyzing AI workloads on large-scale GPU and CPU clusters, specifically distributed deep learning training of LLMs.
The position offers an opportunity to work with cutting-edge hardware and platforms, including HCAs, switches, CPUs, GPUs, and systems. You'll develop and use performance analysis tools to understand performance expectations, limitations, and bottlenecks in high-performance networking environments.
As a senior engineer, you'll be responsible for benchmarking and analyzing the performance of AI workloads, with a particular emphasis on networking. The role requires expertise in high-performance computing, deep learning frameworks, and networking technologies such as RDMA and RoCE.
The ideal candidate will have a strong background in Computer Science or Software Engineering, with at least 5 years of experience in high-performance networking. You'll need demonstrated skills in performance analysis, experience with NVIDIA technologies, and proficiency in Python, Bash, and C.
The role offers the opportunity to work on some of the most advanced AI and machine learning systems in the world. You'll contribute to the development and optimization of large language models and distributed training systems, working with state-of-the-art technology in a collaborative environment.
NVIDIA offers a diverse and inclusive workplace and is committed to fostering innovation and professional growth. This role is an opportunity to work at the forefront of AI and high-performance computing and to make a significant impact on the future of technology.