Principal Site Reliability Engineer, ML Capacity Planning, Acceleration

Google

Google is a global technology leader providing innovative internet-related services and products.

Sunnyvale, CA, USA • Kirkland, WA, USA • New York, NY, USA

$278,000 - $399,000

Site Reliability

Principal Software Engineer

In-Person

5,000+ Employees

15+ years of experience

AI · Enterprise SaaS

Description For Principal Site Reliability Engineer, ML Capacity Planning, Acceleration

Google is seeking a Principal Site Reliability Engineer to lead ML Acceleration initiatives, focusing on optimizing the delivery and implementation of ML resources across its global infrastructure. This role combines deep technical expertise in distributed systems, capacity planning, and ML infrastructure with strategic leadership. You'll be responsible for transforming chips from global fabs into ML supercomputers within gigawatt-scale data centers. The position requires coordinating across Data Center Construction, Networking, and Machine Delivery teams to optimize ML capacity delivery. As part of Google's Technical Infrastructure team, you'll help maintain and develop the next-generation platforms that power Google's extensive product portfolio. The role offers competitive compensation, including a robust benefits package, and the opportunity to work on large-scale, impactful projects that shape the future of Google's ML infrastructure.

Last updated 2 days ago

Responsibilities For Principal Site Reliability Engineer, ML Capacity Planning, Acceleration

Set and deliver technical projects for ambitious Google-level OKRs around ML capacity delivery into the fleet
Play a key role in overall portfolio management for existing ML capacity and related infrastructure
Support the development of the company's global ML strategy
Be responsible for a strategy that encompasses more than 20 countries across three continents and growing
Act as a key technical leader for Global Technical Infrastructure, engaging with other leaders across the region and globally

Requirements For Principal Site Reliability Engineer, ML Capacity Planning, Acceleration

Linux

Kubernetes

Bachelor's degree in Computer Science, Engineering, a related field, or equivalent practical experience
15 years of professional experience in software development, or 10 years with a relevant advanced degree
Experience influencing teams of 20 or more, with cross-functional engagement
Experience with one of the following: data center design, networking or network planning, machine delivery, or construction

Benefits For Principal Site Reliability Engineer, ML Capacity Planning, Acceleration

Medical Insurance

Vision Insurance

Dental Insurance

401k

Equity

