Principal Site Reliability Engineer, ML Capacity Planning, Acceleration

Google

Google is a leading global technology company specializing in internet-related services and products.

Sunnyvale, CA, USA • Kirkland, WA, USA • New York, NY, USA

$278,000 - $399,000

Site Reliability

Principal Software Engineer

In-Person

5,000+ Employees

15+ years of experience

AI · Enterprise SaaS

Description For Principal Site Reliability Engineer, ML Capacity Planning, Acceleration

Google is seeking a Principal Site Reliability Engineer to lead ML Acceleration initiatives, focusing on optimizing the delivery and implementation of ML resources across its global infrastructure. The role combines deep technical expertise in distributed systems, capacity planning, and ML infrastructure with strategic leadership. You will work with cross-functional teams across Data Center Construction, Networking, and Machine Delivery to maximize ML capacity delivery efficiency. As part of Google's Technical Infrastructure team, you will help maintain and develop the architecture that powers Google's product portfolio. The role requires expertise in managing complex technical projects, influencing large teams, and driving innovation across diverse stakeholders. Compensation includes base salary, bonus, equity, and comprehensive benefits. This is an opportunity to shape Google's global ML infrastructure strategy, which spans more than 20 countries across three continents.

Last updated a month ago

Responsibilities For Principal Site Reliability Engineer, ML Capacity Planning, Acceleration

Set and deliver technical projects for ambitious Google-level OKRs around ML capacity delivery into the fleet
Play a key role in overall portfolio management for existing ML capacity and related infrastructure
Support the development of the company's global ML strategy
Be responsible for a strategy that encompasses more than 20 countries across three continents and growing
Act as a key technical leader for Global Technical Infrastructure, engaging with other leaders across the region and globally

Requirements For Principal Site Reliability Engineer, ML Capacity Planning, Acceleration

Linux

Kubernetes

Bachelor's degree in Computer Science, Engineering, a related field, or equivalent practical experience
15 years of professional experience in software development, or 10 years with a relevant advanced degree
Experience influencing teams of 20 or more, with cross-functional engagement
Experience with one of the following: data center design, networking or network planning, machine delivery, or construction

Benefits For Principal Site Reliability Engineer, ML Capacity Planning, Acceleration

Medical Insurance

Dental Insurance

Vision Insurance

Parental Leave

Bonus
Equity
Comprehensive benefits
