Senior Site Reliability Engineer

NVIDIA

World leader in accelerated computing, pioneering AI and digital twins technology to transform industries.

Bengaluru, Karnataka, India

Site Reliability

Senior Software Engineer

Remote

8+ years of experience

AI · Enterprise SaaS · Cloud

Description For Senior Site Reliability Engineer

NVIDIA is seeking a Senior Site Reliability Engineer to join its cloud service team, focusing on supporting and building generative AI-powered visual applications. This role combines the excitement of working with cutting-edge AI technology with the challenges of maintaining high-performance, globally distributed systems. You'll be responsible for managing infrastructure across 60+ edge locations and major cloud providers, ensuring optimal performance of AI workloads on NVIDIA's GPU architectures.

The position offers a unique opportunity to work at the intersection of AI and infrastructure, requiring both deep technical expertise and strategic thinking. You'll be implementing SRE practices crucial to product quality, including proactive outage prevention, blameless postmortems, and continuous service improvement. The role involves collaboration with various teams, from service owners to research groups, making it ideal for someone who enjoys both technical challenges and cross-functional teamwork.

As an NVIDIAN, you'll be part of a company that has been at the forefront of innovation for over 25 years and is currently leading the charge in generative AI development. The role offers exposure to groundbreaking technologies and the chance to work with some of the industry's best talent in a diverse, encouraging environment. This position is perfect for someone who combines strong SRE fundamentals with an interest in AI technologies and a desire to shape the future of computing.

The ideal candidate will bring extensive experience in production environments, strong coding skills, and a deep understanding of cloud technologies. Knowledge of AI/ML technologies and experience with containerization for AI models would be particularly valuable. You'll be joining a company that's widely recognized as one of technology's most desirable employers, offering the opportunity to work on projects that are defining the next era of computing.

Last updated 3 months ago

Responsibilities For Senior Site Reliability Engineer

Support generative AI inferencing workloads in a globally distributed environment
Collaborate with service owners and with architecture, research, and tools teams
Monitor and support critical high-performance, large-scale services
Maintain services by measuring availability, latency, and system health
Participate in on-call rotation for production support
Practice incident response and blameless postmortems
Architect, design, and optimize services
Scale systems through automation

Requirements For Senior Site Reliability Engineer

Python

Kubernetes

BS degree in Computer Science or related technical field
8+ years of experience in operating mission-critical services
Solid understanding of containerization and microservices architecture
Excellent understanding of Kubernetes ecosystem
Experience with ELK and Prometheus stacks
Cloud environments expertise (AWS, Azure, GCP, OCI)
Technical leadership experience
Understanding of SLOs, SLIs, and error budgets
