Site Reliability Engineering (SRE) combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. SRE ensures that Google's services, both internally critical and externally visible, have reliability and uptime appropriate to users' needs and a fast rate of improvement. SREs keep an ever-watchful eye on system capacity and performance. Much of the software development focuses on optimizing existing systems, building infrastructure, and eliminating work through automation.
On the SRE team, you'll have the opportunity to manage the complex challenges of scale unique to Google, while using your expertise in coding, algorithms, complexity analysis and large-scale system design. SRE's culture of diversity, intellectual curiosity, problem solving and openness is key to its success. The organization brings together people with a wide variety of backgrounds, experiences and perspectives, encouraging collaboration, big thinking, and risk-taking in a blame-free environment.
As a Databases Site Reliability Engineer, you'll be working on Google Cloud Platform's Spanner database. You'll collaborate with other teams to ensure Spanner is easy to manage and meets customers' needs with minimal operational load. You'll plan and execute projects to improve reliability and efficiency, participate in on-call rotations, and manage GCP Spanner allocations.
This role requires a mix of software engineering skills and systems knowledge, with a focus on large-scale distributed systems. You'll need experience with programming, Unix/Linux systems, and networking. The ideal candidate will also have experience with site reliability engineering, system design, and distributed computing, as well as excellent influencing skills.
Join Google's Technical Infrastructure team and be part of the backbone that keeps Google's vast product portfolio running smoothly and efficiently.