
Site Reliability Engineering Manager

rapidsos Boston or New York


No Relocation

Posted: May 21, 2026

Job Description

In the time it takes you to read this job description, RapidSOS will have handled ~1,380 emergencies.

At RapidSOS, we are committed to using technology to build a safer, stronger future and working together to save lives. We’re in an exciting phase of growth, welcoming new members from across the globe to our mission-driven, ambitious, and inclusive team. Our work is founded on our values of elevating purpose, inventing tomorrow, delivering with urgency, serving with integrity, and winning together, all of which support a company culture where people can innovate, collaborate, grow, and, above all, make an impact. 

RapidSOS is the leading public safety AI company that unlocks mission-critical intelligence for first responders and security teams – enabling faster, smarter and more accurate emergency response. Real-time data from the world’s largest safety network of 700M+ devices, 200+ global enterprises, and 23,000+ federal, state and local agencies fuels the RapidSOS HARMONY AI engine that delivers this intelligence to those who need it most. Learn more at www.RapidSOS.com.

What this role is about:
This is an engineering leadership role, not simply an on-call manager. The SRE Manager owns two things: keeping RapidSOS's cloud infrastructure running reliably, and helping product teams get to a place where they can run their own services without routing every operational issue through SRE. RapidSOS powers real-time emergency response by connecting life-critical data to first responders, so reliability here directly impacts outcomes in moments that matter.

You'll lead the SRE Operations team and report to the Director of SRE & Platform Engineering. The team has real roots in NOC-style operations, and the honest goal of this role is to move it toward something more engineering-focused and proactive: better tooling, better practices, more ownership at the service team level. That's a gradual transition, and you'll be the one shaping how it happens.

What you’ll do: 

  • Own the reliability, scalability, and operational health of RapidSOS Kubernetes clusters, shared services, and core AWS infrastructure; this covers upgrades, capacity planning, node scaling, and testing that multi-region failover actually works
  • Drive the IaC foundation in Terraform/Atlantis and champion infrastructure-as-code as a core engineering standard
  • Partner with Engineering Managers to set SLOs for their services, establish error budgets, and help teams build the habits to operate what they ship; the goal is for product teams to own their services, not to have SRE own everything on their behalf
  • Maintain proactive reliability work: capacity planning, failure mode analysis, runbook quality, and chaos engineering exercises; run reliability reviews before major launches and organize failure mode exercises with product teams
  • Drive blameless postmortem practice, ensuring every significant incident produces systemic improvements with clear ownership and closure
  • Run the Tier 1 on-call rotation: scheduling for primary and secondary engineers, coordination with the 3rd-party NOC, and keeping incident escalation processes smooth and manageable
  • Lead incident command on Sev-1s, escalate when needed, and keep engineering leadership informed throughout
  • Lead and grow a high-impact team by mentoring engineers, owning headcount, and thinking ahead about what the team needs as the function grows
  • Shape the team’s long-term AI strategy for infrastructure and operations by identifying opportunities for AI-driven automation and insight generation, evaluating tooling and workflows, and operationalizing best practices for scalable team-wide usage
  • Own reserved instance strategy, the team's AWS cost footprint, and error budgets and SLOs across production services, and communicate that picture clearly to engineering and product leadership
  • Work alongside Platform SRE on bigger infrastructure projects: Gateway API adoption, cross-region architecture, security changes

What we’re looking for in our ideal candidate: 

  • 7+ years in SRE, platform engineering, or DevOps, with at least two years where you were responsible for a team and not just your own work
  • You’ve been directly responsible for Kubernetes and AWS infrastructure in production environments where uptime and resilience are critical
  • Experience moving a team from reactive ops toward engineering-first reliability practices 
  • You’ve worked collaboratively with engineering teams to proactively improve reliability, scalability, and operational readiness before issues reach production
  • Ability to write Python and review production-quality scripts and tooling
  • You’ve applied SLOs, error budgets, and blameless postmortems in practice to improve reliability and drive better engineering decisions
  • Hands-on familiarity with: Terraform/Atlantis, Kubernetes/Helm/ArgoCD, Datadog, Concourse CI/GitHub Actions, RabbitMQ, and AWS (EKS, RDS/Aurora, ElastiCache, VPC networking, IAM, KMS, Route53)

What we offer: 

  • The chance to work with a passionate team on solving one of the largest challenges globally 
  • Competitive salary, benefits, and equity participation
  • A dynamic, flexible and fun start-up work environment with a highly talented team

If you're curious to learn more about RapidSOS, you can check out https://rapidsos.com/blog/ 

Starting pay for a successful applicant will depend on a variety of job-related factors, which may include experience, relevant skills, training, education, location, business needs, or market demands. The salary range for this role is $185,000 - $215,000. This role will also be eligible to receive equity options. #LI-Remote 
