Senior Site Reliability Engineering - Infrastructure

NVIDIA is the world leader in accelerated computing, pioneering solutions in AI and digital twins.
$150,000 - $250,000
Site Reliability
Senior Software Engineer
Remote
5,000+ Employees
5+ years of experience
Enterprise SaaS · AI

Description For Senior Site Reliability Engineering - Infrastructure

NVIDIA, the world leader in accelerated computing, is seeking a Senior Site Reliability Engineer for its Infrastructure team. This role is part of NVIDIA's SRE discipline, which combines software and systems engineering practices to design, build, and maintain large-scale production systems. The position focuses on ensuring maximum reliability and uptime for NVIDIA's internal and external GPU cloud services.

As an SRE at NVIDIA, you'll work with cutting-edge technologies including Kubernetes and OpenStack, focusing on eliminating manual work through automation and performance tuning. The role demands expertise across systems, networking, coding, database management, and continuous delivery. You'll be responsible for maintaining high-efficiency production systems while enabling developers to implement changes safely.

The ideal candidate will have strong experience in infrastructure automation and distributed systems design, with expertise in languages like Python or Go. You'll need deep knowledge of Linux, networking, and containers, along with the ability to design and implement monitoring, logging, and alerting systems at scale.

NVIDIA offers a unique environment that values diversity, intellectual curiosity, and problem-solving in a blame-free setting. The company encourages self-direction while providing support and mentorship for professional growth. This is an excellent opportunity to join one of technology's most desirable employers and work on meaningful projects that impact the future of AI and accelerated computing.

Responsibilities For Senior Site Reliability Engineering - Infrastructure

  • Design, implement, and support operational and reliability aspects of large-scale Kubernetes clusters
  • Engage in service lifecycle from inception through deployment and refinement
  • Support services through system design consulting and launch reviews
  • Maintain services by monitoring availability, latency, and system health
  • Scale systems through automation
  • Practice sustainable incident response and blameless postmortems
  • Participate in on-call rotation for production systems

Requirements For Senior Site Reliability Engineering - Infrastructure

  • BS degree in Computer Science or related technical field
  • 5+ years of experience with infrastructure automation and distributed systems design
  • Experience with Python, Go, Perl, or Ruby
  • In-depth knowledge of Linux, networking, and containers
  • Experience with Kubernetes, OpenStack, and Docker
  • Strong communication skills and systematic problem-solving approach
  • Ability to debug and optimize code and automate routine tasks

Jobs Related To NVIDIA Senior Site Reliability Engineering - Infrastructure

Senior Site Reliability Engineer - Observability and Telemetry Platform

Senior SRE role at NVIDIA focusing on observability and telemetry platforms, offering competitive compensation and the opportunity to work with cutting-edge cloud technologies.

Senior Site Reliability Engineer

Senior Site Reliability Engineer role at NVIDIA, focusing on supporting and scaling generative AI applications across global infrastructure.

Senior Site Reliability Engineer - Cloud

Senior Site Reliability Engineer position at NVIDIA focusing on cloud infrastructure, Kubernetes, and maintaining high-reliability systems for GPU cloud services.

Senior Site Reliability Engineer - GPU Clusters

Senior Site Reliability Engineer position at NVIDIA focusing on GPU cluster management for AI workloads, offering competitive compensation and the opportunity to work with cutting-edge technology.

Senior Site Reliability Engineer - AI Research Clusters

Senior Site Reliability Engineer position at NVIDIA focusing on AI research clusters, requiring 5+ years of experience in large-scale infrastructure and GPU computing.