Senior GPU Cluster Software Engineer

World leader in accelerated computing, pioneering AI and digital twins technology to transform industries.
Distributed Systems
Senior Software Engineer
Hybrid
5,000+ Employees
5+ years of experience
AI · Enterprise SaaS

Description For Senior GPU Cluster Software Engineer

NVIDIA, the world leader in accelerated computing, is seeking a Senior GPU Cluster Software Engineer to join its System Software team. This role focuses on building profiling solutions for large-scale applications running on GPU compute clusters, with the goal of improving both performance and user experience. The position combines work in distributed systems, machine learning, and high-performance computing.

As a senior engineer, you'll be responsible for developing and maintaining profiling tools that analyze real-world ML/DL applications on HPC GPU clusters. The role requires expertise in Python development, distributed systems architecture, and database management. You'll work with monitoring and visualization tools such as Kibana and Grafana, along with modern SQL and NoSQL databases.

The ideal candidate will have 5+ years of software development experience, a strong understanding of distributed systems, and familiarity with machine learning concepts. This position offers the opportunity to work on meaningful, self-directed projects, with support and mentorship for professional growth. The hybrid work arrangement at NVIDIA's Shanghai office allows for flexibility while preserving opportunities for in-person collaboration.

Working at NVIDIA means being at the forefront of AI and digital twins technology, contributing to solutions that transform major industries. The role offers exposure to the latest GPU technology and the chance to work with application owners and research teams to improve current- and future-generation GPU clusters.

Last updated 5 days ago

Responsibilities For Senior GPU Cluster Software Engineer

  • Work in an agile, fast-paced global environment to gather requirements for, architect, design, implement, test, deploy, release, and support large-scale distributed systems infrastructure
  • Build internal profiling tools for real-world ML/DL applications running on HPC GPU clusters, for failure and efficiency analysis
  • Keep up with state-of-the-art improvements in the ML/DL domain, and work with application owners and research teams to add or improve profiling capabilities that meet their needs

Requirements For Senior GPU Cluster Software Engineer

Python
Redis
Kubernetes
  • BS or higher in Computer Science or a related field (or equivalent experience) and 5+ years of software development experience in Python
  • Experience with GitLab (or another source code management system), including branch/release workflows and CI/CD pipelines
  • Solid understanding of algorithms, data structures, and runtime/space complexity
  • Experience working with distributed systems software architecture
  • Basic understanding of HPC GPU clusters and Slurm
  • Basic understanding of machine learning concepts and terminology
  • Background with SQL and NoSQL databases and data stores (Prometheus, Elasticsearch, OpenSearch, Redis, etc.)
  • Experience with distributed data pipelines, telemetry, visualization (Kibana, Grafana, etc.), and alerting (PagerDuty, etc.)

Jobs Related To NVIDIA Senior GPU Cluster Software Engineer

Senior Software Engineer-Distributed Inference

Senior Software Engineer position at NVIDIA focusing on distributed inference and AI performance optimization, offering competitive compensation and remote work options.

Senior HPC Performance Engineer

Senior HPC Performance Engineer role at NVIDIA focusing on GPU Communications Libraries and Networking, optimizing performance for deep learning and HPC applications.

Senior Generalist Software Engineer -- Omniverse

Senior Generalist Software Engineer position at NVIDIA focusing on Omniverse, computer graphics, and compute systems development in Taiwan.

Senior AI-HPC Storage Engineer

Senior AI-HPC Storage Engineer role at NVIDIA, focusing on designing and implementing distributed storage solutions for AI and HPC workloads, offering competitive compensation and benefits.

Senior System Software Engineer, NCCL - Partner Enablement

Senior System Software Engineer position at NVIDIA focusing on NCCL partner enablement, combining distributed systems expertise with customer support for AI and HPC applications.