Site Reliability Engineer

Enterprise-grade AI-focused GPU-as-a-service provider with a decentralized cloud computing infrastructure and network of over 40,000 GPUs.
Kuala Lumpur, Federal Territory of Kuala Lumpur, Malaysia
DevOps
Mid-Level Software Engineer
Remote
51 - 100 Employees
3+ years of experience
AI · Enterprise SaaS · Cloud

Description For Site Reliability Engineer

Aethir, the leading enterprise-grade, AI-focused GPU-as-a-service provider, is seeking a Site Reliability Engineer to join its team in Kuala Lumpur, Malaysia. The company operates a decentralized cloud computing infrastructure, managing over 40,000 high-end GPUs, including 3,000 NVIDIA H100s, to deliver enterprise-grade GPU computing solutions globally.

Backed by prominent Web3 investors and having raised over $130M in ecosystem funding, Aethir stands at the forefront of decentralized computing innovation. This role presents a unique opportunity to work with cutting-edge technology in a rapidly growing environment.

As an SRE, you'll be instrumental in ensuring the reliability and performance of Aethir's production systems. Your responsibilities will span from monitoring and troubleshooting to system optimization, directly impacting the service quality for AI and gaming customers worldwide. You'll work with modern technologies including Kubernetes, Docker, and cloud platforms, while collaborating with cross-functional teams to resolve complex technical challenges.

The ideal candidate brings a strong technical foundation in systems architecture and performance monitoring, combined with excellent problem-solving abilities. You'll thrive in a fast-paced startup environment where your actions directly influence the platform's success. The role offers significant growth potential, with opportunities to work alongside global teams and contribute to innovative projects in the AI and cloud computing space.

Join Aethir to be part of a transformative journey in decentralized computing, where your expertise will help shape the future of GPU-as-a-service technology. The position offers competitive benefits, including career advancement opportunities and a collaborative work environment focused on innovation and excellence.

Last updated 3 months ago

Responsibilities For Site Reliability Engineer

  • Monitor, review, and respond to faults in production systems
  • Monitor and review system architecture, process logic, and performance
  • Coordinate with the business team on operations and maintenance issues
  • Act as overall coordinator when responding to production failures
  • Organize R&D, operations, and product teams for problem resolution
  • Manage failure response time and resolution
  • Conduct case studies on production issues and implement optimizations
  • Maintain system architecture and process documentation
  • Identify and implement operations improvements

Requirements For Site Reliability Engineer

Skills: Kubernetes · Python
  • Bachelor's degree in Computer Science, Engineering, or related field
  • Experience in operations and maintenance development
  • Strong understanding of system architecture and monitoring
  • Excellent communication and collaboration skills
  • Proficiency in Kubernetes (K8S), CI/CD, and Docker
  • Expertise in AWS (VPC, S3, EC2) or Python
  • Prior experience in structured environments preferred

Benefits For Site Reliability Engineer

  • Hypergrowth Startup Environment
  • Fantastic Career Progression Opportunities
  • Work within a Global and Local Team
  • Collaborative and innovative work environment
