Principal Site Reliability Engineer, ML Capacity Planning, Acceleration

Google is a leading global technology company specializing in internet-related services and products.
$278,000 - $399,000
Site Reliability
Principal Software Engineer
In-Person
5,000+ Employees
15+ years of experience
AI · Enterprise SaaS

Description For Principal Site Reliability Engineer, ML Capacity Planning, Acceleration

Google is seeking a Principal Site Reliability Engineer to lead ML Acceleration initiatives, focusing on optimizing the delivery and implementation of ML resources across its global infrastructure. This role combines deep technical expertise in distributed systems, capacity planning, and ML infrastructure with strategic leadership. The position involves working with cross-functional teams across Data Center Construction, Networking, and Machine Delivery to maximize ML capacity delivery efficiency. As part of Google's Technical Infrastructure team, you'll be instrumental in maintaining and developing the architecture that powers Google's extensive product portfolio. The role offers competitive compensation including base salary, bonus, equity, and comprehensive benefits. The position requires expertise in managing complex technical projects, influencing large teams, and driving innovation across diverse stakeholders. This is an opportunity to shape Google's global ML infrastructure strategy across more than 20 countries and three continents.

Last updated 4 days ago

Responsibilities For Principal Site Reliability Engineer, ML Capacity Planning, Acceleration

  • Set and deliver technical projects for ambitious Google-level OKRs around ML capacity delivery into the fleet
  • Play a key role in overall portfolio management for existing ML capacity and related infrastructure
  • Support the development of the company's global ML strategy
  • Be responsible for a strategy that encompasses more than 20 countries across three continents and growing
  • Act as a key technical leader for Global Technical Infrastructure, engaging with other leaders across the region and globally

Requirements For Principal Site Reliability Engineer, ML Capacity Planning, Acceleration

  • Linux
  • Kubernetes
  • Bachelor's degree in Computer Science, Engineering, a related field, or equivalent practical experience
  • 15 years of professional experience in software development, or 10 years with a relevant advanced degree
  • Experience influencing teams of 20 or more, with cross-functional engagement
  • Experience with one of the following: data center design, networking or network planning, machine delivery, or construction

Benefits For Principal Site Reliability Engineer, ML Capacity Planning, Acceleration

  • Medical Insurance
  • Dental Insurance
  • Vision Insurance
  • Parental Leave
  • Bonus
  • Equity
  • Comprehensive benefits
