Principal Site Reliability Engineer, ML Capacity Planning, Acceleration

Google is a leading global technology company specializing in internet-related services and products.
$278,000 - $399,000
Site Reliability
Principal Software Engineer
In-Person
5,000+ Employees
15+ years of experience
AI · Enterprise SaaS

Description For Principal Site Reliability Engineer, ML Capacity Planning, Acceleration

Google is seeking a Principal Site Reliability Engineer to lead ML Acceleration initiatives, focusing on optimizing the delivery and implementation of ML resources across its global infrastructure. The role combines deep technical expertise in distributed systems, capacity planning, and ML infrastructure with strategic leadership. You will work with cross-functional teams across Data Center Construction, Networking, and Machine Delivery to maximize the efficiency of ML capacity delivery. As part of Google's Technical Infrastructure team, you will help maintain and develop the architecture that powers Google's product portfolio. Compensation includes base salary, bonus, equity, and comprehensive benefits. The position requires expertise in managing complex technical projects, influencing large teams, and driving innovation across diverse stakeholders. This is an opportunity to shape Google's global ML infrastructure strategy across more than 20 countries and three continents.

Last updated a month ago

Responsibilities For Principal Site Reliability Engineer, ML Capacity Planning, Acceleration

  • Set and deliver technical projects for ambitious Google-level OKRs around ML capacity delivery into the fleet
  • Play a key role in overall portfolio management for existing ML capacity and related infrastructure
  • Support the development of the company's global ML strategy
  • Be responsible for a strategy that encompasses more than 20 countries across three continents and growing
  • Act as a key technical leader for Global Technical Infrastructure, engaging with other leaders across the region and globally

Requirements For Principal Site Reliability Engineer, ML Capacity Planning, Acceleration

  • Linux
  • Kubernetes
  • Bachelor's degree in Computer Science, Engineering, a related field, or equivalent practical experience
  • 15 years of professional experience in software development, or 10 years with a relevant advanced degree
  • Experience influencing teams of 20 or more, with cross-functional engagement
  • Experience with one of the following: data center design, networking or network planning, machine delivery, or construction

Benefits For Principal Site Reliability Engineer, ML Capacity Planning, Acceleration

  • Medical Insurance
  • Dental Insurance
  • Vision Insurance
  • Parental Leave
  • bonus
  • equity
  • benefits
