Software Engineering Reliability PMTS

Customer Company specializing in AI + Data + CRM, helping companies connect with customers through innovative solutions.
$211,500 - $334,600
Site Reliability
Principal Software Engineer
In-Person
5,000+ Employees
8+ years of experience
AI · Enterprise SaaS

Description For Software Engineering Reliability PMTS

Salesforce is seeking a Principal Software Engineer in Site/Product Reliability Engineering to lead reliability efforts for its AgentForce platform. The role involves working in US operations and includes weekend shifts aligned with India hours. It focuses on maintaining and scaling the availability and performance of the AgentForce platform, with particular emphasis on production support for the generative and predictive AI platform.

The role demands expertise in multi-system debugging and triage across various Salesforce platforms including Core, Service Cloud, Sales Cloud, Data Cloud, and AI Cloud, as well as integration with LLM providers like OpenAI, Azure OpenAI, and AWS Bedrock. The successful candidate will lead production triage processes, implement automated solutions, and maintain comprehensive documentation of production incidents.

Key responsibilities include establishing reliability processes, collaborating with lead engineers, investigating alerts and customer-reported issues, and ensuring that services scale. The role requires strong infrastructure and scaling management skills, particularly in handling large language models and associated services. The position offers an opportunity to work with cutting-edge AI technology while maintaining high availability and reliability standards.

Candidates should possess 8+ years of experience in production support, strong knowledge of cloud services, and expertise in implementing reliability processes across full-stack, end-to-end ML platforms. The role offers competitive compensation, with salaries ranging from $211,500 to $334,600, depending on location.

This is an excellent opportunity for a seasoned professional looking to make a significant impact in AI platform reliability while working with a company at the forefront of customer relationship management technology. The role combines technical leadership with hands-on problem-solving in a dynamic, innovative environment.

Last updated a month ago

Responsibilities For Software Engineering Reliability PMTS

  • Lead and shape the production triage process for AgentForce
  • Collaborate with cross-functional teams and external partners
  • Maintain comprehensive documentation of production issues
  • Support capacity modeling and forecasting
  • Create and maintain playbooks and knowledge articles
  • Utilize availability and trust dashboards
  • Participate in 24x7 on-call support
  • Drive improvements based on key metrics and customer feedback

Requirements For Software Engineering Reliability PMTS

Python · Linux · Kubernetes · Go
  • Bachelor's degree or equivalent in Computer Science, Engineering, or related field
  • 8+ years of experience in production support and triaging roles
  • Experience in DevOps or data center management roles
  • Strong knowledge of cloud services (AWS preferred)
  • Proficiency in scripting languages (Python, Shell, Golang)
  • Knowledge of AI model deployment and scaling
  • Experience with container technologies (Docker, Kubernetes)

Benefits For Software Engineering Reliability PMTS

Medical Insurance · Dental Insurance · Vision Insurance
  • Competitive compensation and benefits package
  • Collaborative work environment
  • Opportunity to lead and scale key initiatives within AI platform
