Senior Site Reliability Engineer - AI Research Clusters

NVIDIA is the world leader in accelerated computing, pioneering AI and digital twins to transform industries and society.
Santa Clara, CA, USA · Westford, MA 01886, USA · Austin, TX, USA...
$148,000 - $276,000
DevOps
Senior Software Engineer
Hybrid
5,000+ Employees
5+ years of experience
AI · Enterprise SaaS

Description For Senior Site Reliability Engineer - AI Research Clusters

NVIDIA is seeking a Senior Site Reliability Engineer for its AI Research Clusters. As a member of the GPU AI/HPC Infrastructure team, you will lead the design and implementation of groundbreaking GPU compute clusters that power all AI research across NVIDIA. You will build and operate these clusters for high reliability, efficiency, and performance, while driving foundational improvements and automation to enhance researcher productivity.

Key responsibilities include:

  • Designing and implementing state-of-the-art GPU compute clusters
  • Optimizing cluster operations for maximum reliability, efficiency, and performance
  • Tackling strategic challenges in large-scale, high-performance computing environments
  • Troubleshooting and diagnosing system failures
  • Building automation for AI/HPC GPU cluster bring-up and scaled-up operations
  • Implementing remediations across software and hardware stacks

Requirements:

  • Bachelor's degree in Computer Science, Electrical Engineering, or related field
  • Minimum 5 years of experience designing and operating large-scale compute infrastructure
  • Proven experience in site reliability engineering for high-performance computing environments
  • Deep understanding of GPU computing and AI infrastructure
  • Experience with AI/HPC advanced job schedulers (e.g., Slurm)
  • Knowledge of cluster configuration management tools and infrastructure-level applications
  • In-depth understanding of container technologies
  • Programming experience in Python and Bash scripting

Preferred qualifications:

  • Familiarity with NVIDIA GPUs, CUDA programming, NCCL, and MLPerf benchmarking
  • Experience with InfiniBand, IPoIB, and RDMA
  • Understanding of fast, distributed storage systems for AI/HPC workloads
  • Familiarity with deep learning frameworks like PyTorch and TensorFlow

NVIDIA offers competitive salaries, comprehensive benefits, and the opportunity to work with some of the most brilliant and talented people in the world. Join our diverse team and help shape the future of AI and accelerated computing!

Responsibilities For Senior Site Reliability Engineer - AI Research Clusters

  • Design and implement state-of-the-art GPU compute clusters
  • Optimize cluster operations for maximum reliability, efficiency, and performance
  • Drive foundational improvements and automation to enhance researcher productivity
  • Tackle strategic challenges in large-scale, high-performance computing environments
  • Troubleshoot, diagnose, and root-cause system failures
  • Build automation for AI/HPC GPU cluster bring-up and scaled-up operations
  • Implement remediations across software and hardware stacks

Requirements For Senior Site Reliability Engineer - AI Research Clusters

Python · Kubernetes · Linux
  • Bachelor's degree in Computer Science, Electrical Engineering, or related field
  • Minimum 5 years of experience designing and operating large-scale compute infrastructure
  • Proven experience in site reliability engineering for high-performance computing environments
  • Deep understanding of GPU computing and AI infrastructure
  • Experience with AI/HPC advanced job schedulers (e.g., Slurm)
  • Knowledge of cluster configuration management tools and infrastructure-level applications
  • In-depth understanding of container technologies
  • Programming experience in Python and Bash scripting

Benefits For Senior Site Reliability Engineer - AI Research Clusters

  • Equity
  • Comprehensive benefits package
