Senior Site Reliability Engineer

World leader in accelerated computing, pioneering AI and digital twin technology to transform industries.
Site Reliability
Senior Software Engineer
Remote
8+ years of experience
AI · Enterprise SaaS · Cloud

Description For Senior Site Reliability Engineer

NVIDIA is seeking a Senior Site Reliability Engineer to join its cloud service team, which builds and supports generative AI-powered visual applications. The role pairs work on cutting-edge AI technology with the challenge of running high-performance, globally distributed systems. You'll manage infrastructure across 60+ edge locations and major cloud providers and keep AI workloads performing well on NVIDIA's GPU architectures.

The position sits at the intersection of AI and infrastructure and calls for both deep technical expertise and strategic thinking. You'll implement the SRE practices that underpin product quality, including proactive outage prevention, blameless postmortems, and continuous service improvement. You'll also collaborate with teams ranging from service owners to research groups, so the role suits someone who enjoys both technical challenges and cross-functional teamwork.

As an NVIDIAN, you'll join a company that has been at the forefront of innovation for over 25 years and is currently leading the charge in generative AI development. The role offers exposure to groundbreaking technologies and the chance to work with some of the industry's best talent in a diverse, encouraging environment. It suits someone who combines strong SRE fundamentals with an interest in AI technologies and a desire to shape the future of computing.

The ideal candidate will bring extensive experience in production environments, strong coding skills, and a deep understanding of cloud technologies. Knowledge of AI/ML technologies and experience with containerization for AI models would be particularly valuable. You'll be joining a company that's widely recognized as one of technology's most desirable employers, offering the opportunity to work on projects that are defining the next era of computing.

Last updated 11 days ago

Responsibilities For Senior Site Reliability Engineer

  • Support generative AI inferencing workloads in a globally distributed environment
  • Collaborate with service owners and with architecture, research, and tools teams
  • Monitor and support critical high-performance, large-scale services
  • Maintain services by measuring availability, latency, and system health
  • Participate in on-call rotation for production support
  • Practice incident response and blameless postmortems
  • Architect, design, and optimize services
  • Scale systems through automation

Requirements For Senior Site Reliability Engineer

Python · Go · Kubernetes
  • BS degree in Computer Science or related technical field
  • 8+ years of experience operating mission-critical services
  • Solid understanding of containerization and microservices architecture
  • Excellent understanding of Kubernetes ecosystem
  • Experience with ELK and Prometheus stacks
  • Expertise in cloud environments (AWS, Azure, GCP, OCI)
  • Technical leadership experience
  • Understanding of SLOs/SLIs and error budgeting
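
For candidates less familiar with the last requirement, the arithmetic behind SLOs, SLIs, and error budgets can be sketched in a few lines of Python, one of the languages listed above. This is only an illustration, not NVIDIA's tooling: the function name `error_budget`, its parameters, and the 99.9% target used in the note below are assumptions chosen for the example and do not come from the posting.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize error-budget consumption for one SLO window.

    Hypothetical helper for illustration only; names and inputs are assumptions.
    slo_target is the fraction of requests that must succeed (e.g. 0.999).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")

    # SLI: the measured fraction of successful requests in the window.
    sli = (total_requests - failed_requests) / total_requests

    # Error budget: how many failures the SLO tolerates in this window.
    allowed_failures = (1.0 - slo_target) * total_requests

    # Fraction of the budget already spent; above 1.0 means the SLO is missed.
    budget_consumed = failed_requests / allowed_failures

    return {
        "sli": sli,
        "allowed_failures": allowed_failures,
        "budget_consumed": budget_consumed,
        "slo_met": sli >= slo_target,
    }
```

With an assumed 99.9% availability target over 1,000,000 requests, the budget is about 1,000 failed requests, so `error_budget(0.999, 1_000_000, 250)` reports roughly a quarter of the budget consumed and the SLO still met. Teams commonly use the consumed fraction to decide when to slow releases and prioritize reliability work.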
