Senior Site Reliability Engineer - AI Research Clusters

NVIDIA

NVIDIA is the world leader in accelerated computing, pioneering AI and digital twins to transform industries and society.

Santa Clara, CA, USA · Westford, MA 01886, USA · Austin, TX, USA...

$148,000 - $276,000

DevOps

Senior Software Engineer

Hybrid

5,000+ Employees

5+ years of experience

AI · Enterprise SaaS

Description For Senior Site Reliability Engineer - AI Research Clusters

NVIDIA is seeking a Senior Site Reliability Engineer for its AI Research Clusters. As a member of the GPU AI/HPC Infrastructure team, you will provide leadership in designing and implementing groundbreaking GPU compute clusters that power all AI research across NVIDIA. You'll be responsible for building and operating these clusters with high reliability, efficiency, and performance, while driving foundational improvements and automation to enhance researcher productivity.

Key responsibilities include:

Designing and implementing state-of-the-art GPU compute clusters
Optimizing cluster operations for maximum reliability, efficiency, and performance
Tackling strategic challenges in large-scale, high-performance computing environments
Troubleshooting and diagnosing system failures
Building automation for AI-HPC GPU Cluster bring-up and scaled-up operation
Implementing remediations across software and hardware stacks

Requirements:

Bachelor's degree in Computer Science, Electrical Engineering, or related field
Minimum 5 years of experience designing and operating large-scale compute infrastructure
Proven experience in site reliability engineering for high-performance computing environments
Deep understanding of GPU computing and AI infrastructure
Experience with AI/HPC advanced job schedulers (e.g., Slurm)
Knowledge of cluster configuration management tools and infrastructure-level applications
In-depth understanding of container technologies
Programming experience in Python and Bash scripting

Preferred qualifications:

Familiarity with NVIDIA GPUs, CUDA Programming, NCCL, and MLPerf benchmarking
Experience with InfiniBand, IPoIB, and RDMA
Understanding of fast, distributed storage systems for AI/HPC workloads
Familiarity with deep learning frameworks like PyTorch and TensorFlow

NVIDIA offers competitive salaries, comprehensive benefits, and the opportunity to work with some of the most brilliant and talented people in the world. Join our diverse team and help shape the future of AI and accelerated computing!

Last updated 19 hours ago

Responsibilities For Senior Site Reliability Engineer - AI Research Clusters

Design and implement state-of-the-art GPU compute clusters
Optimize cluster operations for maximum reliability, efficiency, and performance
Drive foundational improvements and automation to enhance researcher productivity
Tackle strategic challenges in large-scale, high-performance computing environments
Troubleshoot, diagnose, and root-cause system failures
Build automation for AI-HPC GPU Cluster bring-up and scaled-up operation
Implement remediations across software and hardware stacks

Requirements For Senior Site Reliability Engineer - AI Research Clusters

Python · Kubernetes · Linux

Bachelor's degree in Computer Science, Electrical Engineering or related field
Minimum 5 years of experience designing and operating large-scale compute infrastructure
Proven experience in site reliability engineering for high-performance computing environments
Deep understanding of GPU computing and AI infrastructure
Experience with AI/HPC advanced job schedulers (e.g., Slurm)
Knowledge of cluster configuration management tools and infrastructure-level applications
In-depth understanding of container technologies
Programming experience in Python and Bash scripting

Benefits For Senior Site Reliability Engineer - AI Research Clusters

Equity
Comprehensive benefits package

