Google's Site Reliability Engineering (SRE) team is seeking a Senior Software Engineer to join its Cloud Incident Response team. The role combines software and systems engineering to build and maintain large-scale distributed systems for Google Cloud Platform (GCP). It focuses on ensuring service reliability, managing critical incidents, and driving continuous improvement through automation.
As an SRE, you'll be responsible for maintaining the stability and reliability of Google Cloud Platform through incident support and management. You'll create training programs and develop end-to-end processes that cover the full incident management lifecycle. The role also involves building tooling that improves visibility into cloud state and strengthens incident detection.
The ideal candidate will have strong experience in distributed systems, software development, and incident management. You'll be part of a team that values intellectual curiosity, problem-solving, and openness. Google's Technical Infrastructure team offers opportunities to work on meaningful projects while providing support and mentorship for professional growth.
This position requires expertise in system design, troubleshooting, and automation. You'll collaborate with various teams across GCP, contribute to pre-launch activities, and drive improvements in system reliability. The role offers the chance to work on unique scaling challenges while making a significant impact on Google Cloud's infrastructure.
Working at Google means joining a team of professionals with diverse backgrounds and perspectives. The company promotes self-direction and risk-taking in a blame-free environment, making it a good fit for engineers who want to tackle complex technical challenges while growing their careers.