Site Reliability Engineer - Chaos Engineering

Xero is a beautiful, easy-to-use platform that helps small businesses and their accounting and bookkeeping advisors grow and thrive.
$18,500 - $201,700
Site Reliability
Hybrid
Enterprise SaaS · Finance

Description For Site Reliability Engineer - Chaos Engineering

Xero, a leading platform for small business accounting and bookkeeping, is seeking a Site Reliability Engineer specializing in Chaos Engineering. This role is part of the Site Reliability Engineering organization and focuses on enhancing system resilience through controlled disruption testing.

The position involves designing and implementing chaos experiments to identify potential weaknesses in system architecture before they become actual problems. You'll be responsible for building and maintaining a comprehensive chaos engineering environment that enables scalable and repeatable testing across Xero's infrastructure.

As a Chaos Engineering SRE, you'll work with cutting-edge technologies including various cloud platforms (AWS, Azure, GCP) and container orchestration tools like Kubernetes. The role requires proficiency in programming languages such as Python, Go, Java, and others, along with experience in chaos engineering tools like Gremlin or Chaos Monkey.

Xero offers an exceptional benefits package including generous paid leave, comprehensive health coverage, 401k matching, and 26 weeks of paid parental leave for primary caregivers. The company maintains a human-first culture that values diversity, inclusion, and work-life balance, making it an ideal place for engineers who want to make a meaningful impact while growing their careers.

The role combines technical expertise with collaborative leadership, as you'll be working across teams to implement improvements and educate others on chaos engineering principles. This is an opportunity to shape the reliability and resilience of systems that serve millions of small businesses worldwide while working with a supportive team that values innovation and technical excellence.

Last updated 23 days ago

Responsibilities For Site Reliability Engineer - Chaos Engineering

  • Design and implement chaos experiments to identify weaknesses in system architecture
  • Design and build a failure-mode and chaos engineering environment
  • Develop and maintain chaos engineering frameworks and tools
  • Collaborate with development and operations teams
  • Monitor system health and performance metrics
  • Educate team members on chaos engineering principles
  • Analyze system behavior during experiments and document findings
  • Continuously improve chaos engineering processes and methodologies

Requirements For Site Reliability Engineer - Chaos Engineering

  • Proficient in programming languages and platforms such as Python, Go, Java, C#, C++, and .NET for automation and tool development
  • Experienced in using chaos engineering tools like Gremlin, Chaos Monkey, or Litmus
  • Excellent analytical skills to assess system performance and identify weaknesses
  • Effective communication skills to collaborate with cross-functional teams
  • Leadership abilities to drive chaos engineering initiatives
  • Knowledge of cloud platforms (AWS, Azure, GCP) and container orchestration (Kubernetes)
  • Familiarity with monitoring and observability tools

Benefits For Site Reliability Engineer - Chaos Engineering

Medical Insurance
Dental Insurance
Vision Insurance
401k
Parental Leave
Mental Health Assistance
  • Generous paid leave
  • Employee Assistance Program
  • Health insurance
  • Life insurance
  • Income protection
  • Wellbeing and sports programmes
  • 26 weeks paid parental leave for primary caregivers
  • Employee Share Plan
  • Flexible working
  • Career development
  • 401k contribution matching
  • Dental insurance
  • Vision insurance
  • Fertility and family forming financial support
  • Office snacks and break areas
