Principal Site Reliability Engineer, ML Capacity Planning, Acceleration

Google is a global technology leader providing innovative internet-related services and products.
$278,000 - $399,000
Site Reliability
Principal Software Engineer
In-Person
5,000+ Employees
15+ years of experience
AI · Enterprise SaaS

Description For Principal Site Reliability Engineer, ML Capacity Planning, Acceleration

Google is seeking a Principal Site Reliability Engineer to lead ML Acceleration initiatives, focusing on optimizing the delivery and implementation of ML resources across its global infrastructure. This role combines deep technical expertise in distributed systems, capacity planning, and ML infrastructure with strategic leadership. You'll be responsible for transforming chips from global fabs into ML supercomputers within gigawatt-scale data centers. The position requires coordinating across Data Center Construction, Networking, and Machine Delivery teams to optimize ML capacity delivery. As part of Google's Technical Infrastructure team, you'll help maintain and develop the next-generation platforms that power Google's extensive product portfolio. The role offers competitive compensation, including a robust benefits package, and the opportunity to work on large-scale, impactful projects that shape the future of Google's ML infrastructure.

Last updated 2 days ago

Responsibilities For Principal Site Reliability Engineer, ML Capacity Planning, Acceleration

  • Set and deliver technical projects for ambitious Google-level OKRs around ML capacity delivery into the fleet
  • Play a key role in overall portfolio management for existing ML capacity and related infrastructure
  • Support the development of the company's global ML strategy
  • Be responsible for a strategy that encompasses more than 20 countries across three continents and growing
  • Act as a key technical leader for Global Technical Infrastructure, engaging with other leaders across the region and globally

Requirements For Principal Site Reliability Engineer, ML Capacity Planning, Acceleration

  • Linux
  • Kubernetes
  • Bachelor's degree in Computer Science, Engineering, a related field, or equivalent practical experience
  • 15 years of professional experience in software development, or 10 years with a relevant advanced degree
  • Experience influencing teams of 20 or more, with cross-functional engagement
  • Experience with one of the following: data center design, networking or network planning, machine delivery, or construction

Benefits For Principal Site Reliability Engineer, ML Capacity Planning, Acceleration

  • Medical Insurance
  • Vision Insurance
  • Dental Insurance
  • 401k
  • Equity

