Site Reliability Engineer

Guidewire

Guidewire is the platform P&C insurers trust to engage, innovate, and grow efficiently. We combine digital, core, analytics, and AI to deliver our platform as a cloud service.

Dublin, Ireland

Site Reliability

Senior Software Engineer

Remote

501 - 1,000 Employees

5+ years of experience

AI · Finance · Enterprise SaaS


Description For Site Reliability Engineer

At Guidewire, we make software that offers Property and Casualty (P&C) Insurance companies the tools to take care of their customers when they need it the most, whether that's a time of crisis, a natural disaster, an accident, or exposure to cyber risks. We build the core applications that insurance companies use to sell and underwrite policies, settle claims, and bill their customers. We also have a portfolio of innovative products serving the needs of P&C insurance companies in areas such as data management, digital online portals, and predictive analytics. We run these products on the Guidewire Cloud Platform, and we help hundreds of insurance providers all over the world to handle billions of dollars of business.

As a Site Reliability Engineer, you will be part of a team that is passionately automating everything possible to make Guidewire systems run more efficiently. The Platform team is dedicated full-time to creating and running software that improves the reliability of systems in production, serving hundreds of customers and supporting millions of transactions each day. You will be ensuring the reliability of Guidewire's flagship cloud platform and InsuranceSuite products and building tooling to help ensure efficient operations and optimal availability of all SaaS multi-tenant and customer-focused systems.

This role requires a high degree of collaboration, teamwork, ownership and responsibility. If you like to be challenged and have a passion for solving problems at scale with systems like AWS, Kubernetes and Aurora, then we would love to hear from you. The ideal candidate is someone who exemplifies the ethos of "If you have to do something more than once, automate it," and who can rapidly self-educate on new concepts and tools. Bonus points if you have prior experience doing production support of a SaaS platform and are comfortable working with bleeding-edge, highly containerized cloud-native environments in AWS.

Key responsibilities include:

Taking a purist SRE approach to shared multi-tenant infrastructure for resilient SaaS microservice-based containerized systems
Overseeing and automating the team's growing presence in AWS
Contributing to core infrastructure systems development
Performing platform reliability engineering for a complex single sign-on SAML/OAuth-based central authentication platform
Building and developing tooling to aid in driving 24x7x365 follow-the-sun operations of critical production systems
Automating deployment tasks and maintaining automation infrastructure
Creating system documentation and training materials
Building and maintaining observability tooling, metrics, and dashboarding
Improving incident management lifecycle
Enhancing platform observability with a self-healing approach to platform reliability
Collaborating with engineering teams, providing product feedback and contributing code where necessary

Last updated 8 months ago

Responsibilities For Site Reliability Engineer

Take a purist SRE approach to shared multi-tenant infrastructure for resilient SaaS microservice-based containerized systems
Oversee and automate the team's growing presence in AWS
Contribute to core infrastructure systems development with features, bug fixes, reliability improvements, etc.
Perform platform reliability engineering for a complex single sign-on SAML/OAuth-based central authentication platform
Creatively build and develop tooling to aid in driving 24x7x365 follow-the-sun operations of critical production systems
Automate deployment tasks for core product and infrastructure tools and maintain automation infrastructure
Create system documentation and training materials to empower and educate fellow team members
Build and maintain observability tooling, metrics, and dashboarding for a global platform product infrastructure
Improve incident management lifecycle to identify, mitigate, and learn from reliability risks and issues
Enhance platform observability and help create a self-healing approach to platform reliability
Collaborate with engineering teams, providing product feedback and, where necessary, contributing code to the product

Requirements For Site Reliability Engineer

Java

JavaScript

Kubernetes

Linux

Python

Bachelor's Degree in Computer Science or related field
Software engineering and task automation skills with Bash, Python, and/or Go
Solid understanding of agile software development methodologies
Deep background with Linux systems and engineering
Highly experienced with engineering and automating on Amazon Web Services (AWS)
Experience supporting web applications running on Java / Apache / Tomcat in a live production environment
Prior experience with IaC tools like Terraform/Terragrunt/Terraspace
Prior experience with DevOps/GitOps tools (Git, Bitbucket, Flux CD, TeamCity) for gate promotions
Production-at-scale support background in a heavily microservice-based environment
Hands-on engineering and ops expertise in containerization (Docker, Helm, Kubernetes/EKS, CNI and Ingress networking)
Strong understanding of single sign-on, SAML, and OAuth (bonus for hands-on experience with Okta)
Seasoned expertise in X.509 certificate technology and basic concepts of encryption
Experience working with Relational Databases such as Aurora Postgres and/or Oracle RDS
Advanced exposure to application development, web UI (design and development), JSON, application architecture
Strong experience using observability tools (logging/APM) like Datadog, CloudWatch, and PagerDuty
Familiarity with event store/stream-processing technologies like Kafka or AWS SQS
Understanding of Open Application Model systems such as KubeVela or Crossplane
Ability to read, write, and speak English
Ability to speak in public settings, interface with customers, partners and vendors confidently
Ability to travel up to 25% of the time

Benefits For Site Reliability Engineer

Top Cloud Employer on Glassdoor
Fun work environment
Culture that lives by core values of integrity, rationality, and collegiality