Site Reliability Engineer

Cloud-based, all-in-one white-label marketing and sales platform serving 60K+ agencies and 500K businesses globally
Site Reliability
Mid-Level Software Engineer
Remote
1,000 - 5,000 Employees
3+ years of experience
Enterprise SaaS

Description For Site Reliability Engineer

HighLevel is a leading cloud-based marketing and sales platform serving more than 60K agencies and 500K businesses globally. The platform handles 40 billion API hits and 120 billion events each month, and manages 200+ terabytes of application data and 6 petabytes of storage across 500 microservices.

As a Site Reliability Engineer, you'll join a team focused on maintaining and improving system reliability and performance. You'll work with GCP, AWS, Kubernetes, and a range of monitoring and observability tools to keep the platform running reliably.

The role offers an opportunity to work with a diverse, global team of about 1,200 employees across 15 countries. HighLevel values work-life balance and maintains a strong company culture that fosters creativity and collaboration, whether you're working remotely or from the Dallas headquarters.

This position suits engineers who are passionate about large-scale systems, automation, and infrastructure as code. You'll use modern tooling to solve complex reliability challenges at scale.

HighLevel is committed to diversity and inclusion, creating an environment where talented individuals from all backgrounds can thrive. The company offers a unique opportunity to work on enterprise-scale challenges while maintaining a collaborative and inclusive culture.


Responsibilities For Site Reliability Engineer

  • Develop and improve observability using monitoring, logging, tracing, and alerting tools
  • Optimize system performance, troubleshoot incidents, and conduct post-mortems/RCA
  • Collaborate with developers to enhance application reliability, scalability, and performance
  • Drive cost optimisation efforts in cloud environments
  • Monitor multiple data stores (MongoDB, Redis, Elasticsearch, queue-based systems, etc.)

Requirements For Site Reliability Engineer

Kubernetes, MongoDB, Python, Redis
  • 3+ years in Site Reliability Engineering, DevOps, or Cloud Infrastructure roles
  • Hands-on experience with GCP and AWS
  • Experience with Terraform, Helm, or equivalent tools
  • Experience with Docker and Kubernetes (GKE)
  • Experience with Prometheus, Grafana, ELK, OpenTelemetry, or similar monitoring/logging tools
  • Proficiency in Python, Bash, or other shell scripting
  • Basic understanding of parsing API responses and manipulating JSON
  • Experience with Jenkins, GitHub Actions, ArgoCD, or similar tools
  • Experience with on-call rotations, SLOs, SLIs, and SLAs
  • Experience monitoring MongoDB, Redis, Elasticsearch, and queue-based systems
