Site Reliability Engineer

AION

AION is building a next-generation decentralized AI cloud platform transforming high-performance computing, providing bare-metal performance for AI training and inference.

Bengaluru, Karnataka, India

Site Reliability

Senior Software Engineer

Hybrid

3+ years of experience

AI · Enterprise SaaS · Cloud

Description For Site Reliability Engineer

AION is revolutionizing the AI cloud platform landscape through its innovative decentralized approach to high-performance computing (HPC). As a Site Reliability Engineer at AION, you'll be at the forefront of building and maintaining the infrastructure that powers this cutting-edge platform. The company is well-funded by major VCs and led by experienced founders with previous successful exits.

The role demands a reliability-focused engineer with deep expertise in cloud-native systems and infrastructure automation. You'll be responsible for designing and implementing comprehensive monitoring solutions, creating self-healing infrastructure, and maintaining high availability across distributed systems. This position offers a unique opportunity to work with cutting-edge technologies while implementing SRE best practices at scale.

Your work will directly impact AION's mission of democratizing access to compute power for AI training, fine-tuning, inference, and data labeling. The platform's innovative Proof of Compute Contribution (PoCC) protocol and integration with Tether ensure a stable and efficient ecosystem. Working from the Bengaluru office in a hybrid setup, you'll collaborate with top-tier talent from the tech industry while having the flexibility to work remotely for several months each year.

This role is perfect for someone who wants to make a significant impact at the intersection of Web3 and AI, working on some of the most exciting challenges in the industry. You'll be joining on the ground floor of an AI startup, with substantial opportunity to influence both the company's and the industry's future. The position offers competitive compensation, professional growth opportunities, and the chance to work with a mission-driven team that's bridging the AI wealth gap through innovative solutions.

Last updated a day ago

Responsibilities For Site Reliability Engineer

Design and implement comprehensive monitoring and alerting systems across all AION platforms
Develop automation for infrastructure provisioning, scaling, and recovery using Terraform and Kubernetes
Create and maintain runbooks and playbooks for handling common operational scenarios and incidents
Implement service mesh solutions for observability, traffic management, and security
Design and implement logging for distributed systems
Conduct capacity planning and resource optimization across cloud environments
Implement CI/CD pipelines for reliable and consistent deployments
Design and build self-healing systems that automatically recover from common failure modes
Develop infrastructure for the compute platform and data annotation services
Design and implement disaster recovery strategies and testing procedures
Create and maintain production, staging, and development environments
Collaborate with security teams on infrastructure security and compliance

Requirements For Site Reliability Engineer

Kubernetes

Python

Redis

3-8 years of experience in Site Reliability Engineering or DevOps
Deep expertise with AWS, GCP, or Azure infrastructure services
Advanced knowledge of Kubernetes operations, cluster management, and troubleshooting
Strong experience with Terraform, Pulumi, or similar IaC tools
Expertise in implementing comprehensive monitoring with Prometheus, Grafana, and the ELK stack
Experience with Istio, Linkerd, or similar service mesh technologies
Understanding of network architectures, DNS, load balancing, and security groups
Proficiency in Bash, Python, or Go for automation scripts
Deep understanding of Docker, containerd, and OCI specifications
Knowledge of infrastructure security best practices and compliance requirements

Benefits For Site Reliability Engineer

Competitive salary and benefits package
Flexible work environment
Professional growth and development opportunities
Flexibility to work from anywhere for a few months each year

