Site Reliability Engineering (SRE) at Google combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. As a Borg Lifecycle SRE, you'll be responsible for keeping Google Cloud's services reliable, with appropriate uptime, while managing the complex challenges of scale unique to Google Cloud. The role involves optimizing existing systems, building infrastructure, and automating processes.
The position requires expertise in coding, algorithms, complexity analysis, and large-scale system design. You'll work specifically on Borg, Google's cluster management system, which is crucial to Google's operations. The work spans diverse challenges across global infrastructure and includes high-impact projects that drive innovation.
The SRE team at Google embraces a culture of diversity, intellectual curiosity, and problem-solving in a blame-free environment. You'll collaborate with people from various backgrounds and perspectives, working on meaningful projects while receiving support and mentorship for professional growth. The role combines technical leadership, hands-on engineering, and production support, making it ideal for those interested in large-scale distributed systems and infrastructure management.
Key aspects of the role include managing Borg lifecycle phases, supporting different cell flavors, and participating in on-call rotations to ensure 24/7 reliability. You'll work closely with development teams and other SREs to design and implement scalable, reliable, and secure solutions that support various Google initiatives. This role offers an opportunity to impact Google's infrastructure at a global scale while working with cutting-edge technology and talented engineers.