Site Reliability Engineer

AION is building a next-generation decentralized AI cloud platform for high-performance computing that delivers bare-metal performance for AI training and inference.
Site Reliability
Senior Software Engineer
Hybrid
3+ years of experience
AI · Enterprise SaaS · Cloud

Description For Site Reliability Engineer

AION is revolutionizing the AI cloud platform landscape through its innovative decentralized approach to high-performance computing (HPC). As a Site Reliability Engineer at AION, you'll be at the forefront of building and maintaining the infrastructure that powers this cutting-edge platform. The company is well-funded by major VCs and led by experienced founders with previous successful exits.

The role demands a reliability-focused engineer with deep expertise in cloud-native systems and infrastructure automation. You'll be responsible for designing and implementing comprehensive monitoring solutions, creating self-healing infrastructure, and maintaining high availability across distributed systems. This position offers a unique opportunity to work with cutting-edge technologies while implementing SRE best practices at scale.

Your work will directly impact AION's mission of democratizing access to compute power for AI training, fine-tuning, inference, and data labeling. The platform's innovative Proof of Compute Contribution (PoCC) protocol and integration with Tether ensure a stable and efficient ecosystem. Working from the Bangalore office in a hybrid setup, you'll collaborate with top-tier talent from the tech industry while having the flexibility to work remotely for several months each year.

This role is perfect for someone who wants to make a significant impact at the intersection of web3 and AI, working on some of the most exciting challenges in the industry. You'll be joining at the ground floor of an AI startup, with substantial opportunity to influence both the company's and the industry's future. The position offers competitive compensation, professional growth opportunities, and the chance to work with a mission-driven team that's bridging the AI wealth gap through innovative solutions.

Last updated a day ago

Responsibilities For Site Reliability Engineer

  • Design and implement comprehensive monitoring and alerting systems across all AION platforms
  • Develop automation for infrastructure provisioning, scaling, and recovery using Terraform and Kubernetes
  • Create and maintain runbooks and playbooks for handling common operational scenarios and incidents
  • Implement service mesh solutions for observability, traffic management, and security
  • Design and implement logging for distributed systems
  • Conduct capacity planning and resource optimization across cloud environments
  • Implement CI/CD pipelines for reliable and consistent deployments
  • Design and build self-healing systems that automatically recover from common failure modes
  • Develop infrastructure for compute platform and data annotation services
  • Design and implement disaster recovery strategies and testing procedures
  • Create and maintain production, staging, and development environments
  • Collaborate with security teams on infrastructure security and compliance

Requirements For Site Reliability Engineer

Kubernetes · Python · Go · Redis
  • 3-8 years of experience in Site Reliability Engineering or DevOps
  • Deep expertise with AWS, GCP, or Azure infrastructure services
  • Advanced knowledge of Kubernetes operations, cluster management, and troubleshooting
  • Strong experience with Terraform, Pulumi, or similar IaC tools
  • Expertise in implementing comprehensive monitoring using Prometheus, Grafana, and the ELK stack
  • Experience with Istio, Linkerd, or similar service mesh technologies
  • Understanding of network architectures, DNS, load balancing, and security groups
  • Proficiency in Bash, Python, or Go for automation scripts
  • Deep understanding of Docker, containerd, and OCI specifications
  • Knowledge of infrastructure security best practices and compliance requirements

Benefits For Site Reliability Engineer

  • Competitive salary and benefits package
  • Flexible work environment
  • Professional growth and development opportunities
  • Flexibility to work from anywhere for a few months each year
