Site Reliability Engineering (SRE) at Google combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. As an SRE, you'll ensure Google Cloud's services maintain reliability and appropriate uptime while managing capacity and performance. The role focuses on optimizing existing systems, building infrastructure, and automating work.
You'll tackle scaling challenges unique to Google Cloud, applying expertise in coding, algorithms, complexity analysis, and large-scale system design. The SRE team values diversity, intellectual curiosity, and problem-solving in a blame-free environment. Google encourages collaboration, big thinking, and risk-taking while providing support and mentorship for growth.
The position offers the opportunity to work with distributed systems at massive scale, contribute to critical infrastructure, and be part of a team that directly impacts Google Cloud's performance and reliability. You'll work alongside diverse professionals, participate in design reviews, and help maintain Google's high standards for system reliability.
This role is perfect for engineers who enjoy the intersection of software development and systems engineering, are passionate about large-scale infrastructure, and want to work on problems that affect millions of users. You'll be part of a culture that promotes self-direction while ensuring you have the support needed to succeed in this challenging and rewarding position.