Site Reliability Engineering (SRE) at Google combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. As an SRE, you'll ensure that Google Cloud's services, both internally critical and externally visible, maintain reliability and uptime appropriate to customer needs while driving continuous improvement. The role involves tackling challenges of scale unique to Google Cloud, drawing on expertise in coding, algorithms, complexity analysis, and large-scale system design.
The position emphasizes optimizing existing systems, building infrastructure, and automating processes. Google's SRE culture values diversity, intellectual curiosity, problem-solving, and openness. It brings together people with a wide range of backgrounds and perspectives, and it encourages collaboration and risk-taking in a blame-free environment.
You'll direct your own work on meaningful projects, with support and mentorship to help you grow. Key responsibilities include managing project priorities, deadlines, and deliverables, and designing, developing, testing, deploying, maintaining, and enhancing software solutions. The role involves working with network telemetry services, implementing automated troubleshooting, improving monitoring systems, and ensuring service reliability through well-defined service level objectives (SLOs).
This is an excellent opportunity for engineers who are passionate about large-scale systems, automation, and reliability. You'll collaborate with partner teams, shape technical plans, and have a direct impact on the reliability of Google's production network, all in a supportive, growth-oriented environment.