Senior HPC and AI Networking Performance Research and Analysis Engineer

NVIDIA is the world leader in accelerated computing, pioneering solutions in AI and digital twins.
Distributed Systems
Senior Software Engineer
In-Person
5+ years of experience
AI

Description For Senior HPC and AI Networking Performance Research and Analysis Engineer

NVIDIA, the world leader in accelerated computing, is seeking a Senior HPC and AI Networking Performance Research and Analysis Engineer to join its Performance group. This role focuses on profiling and analyzing AI workloads on large-scale GPU and CPU clusters, specifically for distributed deep learning LLM training.

The position offers an opportunity to work with cutting-edge hardware and platforms, including HCAs, Switches, CPUs, GPUs, and Systems. You'll be developing and utilizing performance analysis tools to understand performance expectations, limitations, and bottlenecks in high-performance networking environments.

As a senior engineer, you'll be responsible for benchmarking and analyzing performance of AI workloads, with a particular emphasis on networking aspects. The role requires expertise in high-performance computing, deep learning frameworks, and networking protocols such as RDMA and RoCE.

The ideal candidate will have a strong background in Computer Science or Software Engineering, with at least 5 years of experience in high-performance networking. You'll need demonstrated skills in performance analysis, experience with NVIDIA technologies, and proficiency in Python, Bash, and C programming languages.

What makes this role particularly exciting is the opportunity to work on some of the most advanced AI and machine learning systems in the world. You'll be contributing to the development and optimization of large language models and distributed training systems, working with state-of-the-art technology in a collaborative environment.

NVIDIA offers a diverse and inclusive workplace, committed to fostering innovation and professional growth. This role provides an excellent opportunity to work at the forefront of AI and high-performance computing, making a significant impact on the future of technology.

Last updated 19 days ago

Responsibilities For Senior HPC and AI Networking Performance Research and Analysis Engineer

  • Profile and analyze AI workloads on large-scale GPU and CPU clusters for distributed Deep Learning LLM training
  • Research AI workloads and DL models for large-scale deep learning LLM training
  • Benchmark, profile, and analyze performance to find bottlenecks
  • Implement performance analysis tools
  • Collaborate with hardware and software teams
  • Define performance test planning and set performance expectations

Requirements For Senior HPC and AI Networking Performance Research and Analysis Engineer

Python
Linux
Kubernetes
  • B.Sc in Computer Science or Software Engineering
  • 5+ years of experience with high-performance Networking (RDMA, MPI)
  • Demonstrated Performance Analysis skills and methodologies
  • Experience with NVIDIA GPUs, CUDA libraries, and deep learning frameworks
  • Fast, self-directed learner with strong analytical skills
  • Programming languages: Python, Bash, and C
  • Experience with Linux distributions
  • Team player with good communication and interpersonal skills
