Software Engineer - Model Training Infrastructure

Anyscale commercializes Ray, a popular open-source project that is building an ecosystem of libraries for scalable machine learning.
$170,112 - $237,000
Machine Learning
Senior Software Engineer
In-Person
5+ years of experience
AI · Enterprise SaaS

Description For Software Engineer - Model Training Infrastructure

Anyscale, backed by Andreessen Horowitz, NEA, and Addition with more than $250 million in funding, is revolutionizing distributed computing through Ray, its open-source project. The company is seeking a Senior Software Engineer for its Distributed Training team to develop and maintain core ML libraries such as Ray Train and Ray Tune. The role involves building scalable ML infrastructure used by major companies such as OpenAI, Uber, and Spotify.

The position requires 5+ years of experience in production software systems and offers a competitive salary range of $170,112-$237,000. The ideal candidate will have strong ML framework knowledge, distributed systems experience, and architectural skills. They'll work on creating fault-tolerant ML libraries, engaging with the open-source community, and collaborating with global ML teams.

The role provides comprehensive benefits including healthcare (99% of premiums covered), 401k, stock options, parental leave, and education stipends. The position is based in San Francisco or Palo Alto and offers the chance to shape the future of ML infrastructure while working with industry experts. It combines technical leadership with hands-on development, making it well suited to those passionate about scaling ML applications.

Responsibilities For Software Engineer - Model Training Infrastructure

  • Develop scalable, fault-tolerant distributed machine learning libraries
  • Create an end-to-end experience for training machine learning models
  • Solve complex architectural challenges
  • Contribute to and engage with the open-source community
  • Share work through talks, tutorials, and blog posts
  • Collaborate with experts in distributed systems and machine learning
  • Work directly with end-users for product enhancement
  • Partner with engineering and product managers
  • Play a key role in building and shaping the company

Requirements For Software Engineer - Model Training Infrastructure

Python
Kubernetes
  • 5+ years of experience building, scaling, and maintaining software systems in production
  • Strong fundamentals in algorithms, data structures, and system design
  • Proficiency with machine learning frameworks and libraries (PyTorch, TensorFlow, XGBoost)
  • Experience designing fault-tolerant distributed systems
  • Solid architectural skills

Benefits For Software Engineer - Model Training Infrastructure

  • Healthcare plans, with premiums covered by Anyscale at 99%
  • 401k Retirement Plan
  • Education & Wellbeing Stipend
  • Paid Parental Leave
  • Fertility Benefits
  • Flexible Time Off
  • Commute reimbursement
  • 100% of in-office meals covered
  • Stock Options
