Site Reliability Engineering (SRE) at Google Cloud combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. As an SRE, you'll ensure that Google Cloud's services (both internally critical and externally visible systems) have reliability and uptime appropriate to customer needs, with a fast rate of improvement. You'll monitor system capacity and performance, focusing on optimizing existing systems, building infrastructure, and eliminating manual work through automation.
The role offers unique challenges of scale specific to Google Cloud, allowing you to apply your expertise in coding, algorithms, complexity analysis, and large-scale system design. SRE culture values diversity, intellectual curiosity, problem-solving, and openness. The organization brings together people with varied backgrounds and perspectives, encouraging collaboration, big thinking, and risk-taking in a blame-free environment.
Key responsibilities include:
This role requires a Bachelor's degree in Computer Science or related field (or equivalent experience), and at least 2 years of experience with data structures/algorithms and software development in languages like Java, Python, Go, C, or C++. Preferred qualifications include experience with distributed systems, storage, or networking, and strong problem-solving skills.
Google is committed to diversity, equal opportunity, and creating a culture of belonging. The company offers accommodations for applicants who need them and requires English proficiency to support efficient global collaboration.