Site Reliability Engineering (SRE) at Google combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. As an SRE, you'll be responsible for ensuring that Google Cloud's services deliver the reliability and uptime their customers need, while driving continuous improvement. The role involves tackling the complex challenges of scale unique to Google Cloud, drawing on expertise in coding, algorithms, complexity analysis, and large-scale system design.
The SRE team emphasizes a culture of diversity, intellectual curiosity, and problem-solving in a blame-free environment. You'll work on optimizing existing systems, building infrastructure, and automating processes. The position offers opportunities to collaborate with people from diverse backgrounds and perspectives, encouraging big thinking and risk-taking.
The role balances self-direction on meaningful projects with supportive mentorship for learning and growth. You'll help maintain both internally critical and externally visible systems, monitor capacity and performance, and contribute to the development of Google Cloud's infrastructure.
As a Software Engineer II in SRE, you'll participate in code reviews, documentation, and system design discussions, while also handling critical system issues and contributing to the team's technical direction. The position offers exposure to some of the most complex technical challenges in cloud computing, along with the chance to work with cutting-edge technologies at massive scale.