Site Reliability Engineer

Enterprise-grade AI-focused GPU-as-a-service provider with a decentralized cloud computing infrastructure and network of over 40,000 GPUs.
Kuala Lumpur, Federal Territory of Kuala Lumpur, Malaysia
DevOps
Mid-Level Software Engineer
Remote
51 - 100 Employees
3+ years of experience
AI · Enterprise SaaS · Cloud

Description For Site Reliability Engineer

Aethir, the leading Enterprise-grade AI-focused GPU-as-a-service provider, is seeking a Site Reliability Engineer to join its team in Kuala Lumpur, Malaysia. The company operates a decentralized cloud computing infrastructure, managing over 40,000 top-shelf GPUs, including 3,000 NVIDIA H100s, to deliver enterprise-grade GPU computing solutions globally.

Backed by prominent Web3 investors and having raised over $130M in ecosystem funding, Aethir stands at the forefront of decentralized computing innovation. This role presents a unique opportunity to work with cutting-edge technology in a rapidly growing environment.

As an SRE, you'll be instrumental in ensuring the reliability and performance of Aethir's production systems. Your responsibilities will span from monitoring and troubleshooting to system optimization, directly impacting the service quality for AI and gaming customers worldwide. You'll work with modern technologies including Kubernetes, Docker, and cloud platforms, while collaborating with cross-functional teams to resolve complex technical challenges.

The ideal candidate brings a strong technical foundation in systems architecture and performance monitoring, combined with excellent problem-solving abilities. You'll thrive in a fast-paced startup environment where your actions directly influence the platform's success. The role offers significant growth potential, with opportunities to work alongside global teams and contribute to innovative projects in the AI and cloud computing space.

Join Aethir to be part of a transformative journey in decentralized computing, where your expertise will help shape the future of GPU-as-a-service technology. The position offers competitive benefits, including career advancement opportunities and a collaborative work environment focused on innovation and excellence.

Last updated 6 days ago

Responsibilities For Site Reliability Engineer

  • Monitor, review, and respond to faults in production systems
  • Monitor and review system architecture, process logic, and performance
  • Coordinate with the business team on operations and maintenance issues
  • Respond to production failures as the overall coordinator
  • Organize R&D, operations, and product teams for problem resolution
  • Manage failure response time and resolution
  • Conduct case studies on production issues and implement optimizations
  • Maintain system architecture and process documentation
  • Identify and implement operations improvements

Requirements For Site Reliability Engineer

  • Bachelor's degree in Computer Science, Engineering, or related field
  • Experience in operations and maintenance development
  • Strong understanding of system architecture and monitoring
  • Excellent communication and collaboration skills
  • Proficiency in Kubernetes (K8S), CI/CD, and Docker
  • Expertise in AWS (VPC, S3, EC2) or Python
  • Prior experience in structured environments preferred

Benefits For Site Reliability Engineer

  • Hypergrowth Startup Environment
  • Fantastic Career Progression Opportunities
  • Work within a Global and Local Team
  • Collaborative and innovative work environment
