Site Reliability Engineer

Enterprise-grade AI-focused GPU-as-a-service provider with a decentralized cloud computing infrastructure.
Kuala Lumpur, Federal Territory of Kuala Lumpur, Malaysia
Site Reliability
Staff Software Engineer
Hybrid
AI · Enterprise SaaS

Description For Site Reliability Engineer

Aethir is the only enterprise-grade, AI-focused GPU-as-a-service provider in the market. Its decentralized cloud computing infrastructure connects GPU providers (containers) with enterprise clients who need powerful GPUs for professional AI/ML tasks. With a network of over 40,000 top-shelf GPUs, including 3,000 NVIDIA H100s, Aethir provides enterprise-grade GPU computing at scale.

We are seeking a Site Reliability Engineer (SRE) for our new headquarters in Kuala Lumpur, Malaysia. This role is crucial in monitoring, troubleshooting, and optimizing our production system to ensure high performance and stability for our AI and gaming customers worldwide.

Key responsibilities include:

  • Monitoring, reviewing, and responding to system faults
  • Continuously reviewing system architecture and performance
  • Coordinating with the business team to resolve operations issues
  • Promptly responding to and resolving production failures
  • Organizing teams for collaborative problem-solving
  • Conducting case studies and implementing optimizations
  • Maintaining comprehensive system documentation
  • Identifying and implementing process improvements

Requirements:

  • Bachelor's degree in Computer Science, Engineering, or related field
  • Experience in operations and maintenance development
  • Strong understanding of system architecture and troubleshooting
  • Excellent communication and collaboration skills
  • Proficiency in Kubernetes, CI/CD, and Docker
  • Expertise in AWS or Python
  • Experience in building operations infrastructure platforms

We offer benefits such as a hypergrowth startup environment, fantastic career progression opportunities, and a collaborative, innovative work environment. Join us in shaping the future of decentralized computing!

Last updated 6 months ago

Requirements For Site Reliability Engineer

  • Bachelor's degree in Computer Science, Engineering, or related field
  • Experience in operations and maintenance development
  • Strong understanding of system architecture and troubleshooting
  • Excellent communication and collaboration skills
  • Proficiency in Kubernetes, CI/CD, and Docker
  • Expertise in AWS or Python
  • Ability to work in a fast-paced startup environment

Benefits For Site Reliability Engineer

  • Hypergrowth startup environment
  • Fantastic career progression opportunities
  • Work within a global and local team
  • Collaborative and innovative work environment
