
Site Reliability Engineering Manager
RapidSOS • Boston or New York
Posted: May 21, 2026
Job Description
In the time it takes you to read this job description, RapidSOS will have handled ~1,380 emergencies.
At RapidSOS, we are committed to using technology to build a safer, stronger future and working together to save lives. We’re in an exciting phase of growth, welcoming new members from across the globe to our mission-driven, ambitious, and inclusive team. Our work is founded on our values of elevating purpose, inventing tomorrow, delivering with urgency, serving with integrity, and winning together, all of which support a company culture where people can innovate, collaborate, grow, and, above all, make an impact.
RapidSOS is the leading public safety AI company that unlocks mission-critical intelligence for first responders and security teams – enabling faster, smarter and more accurate emergency response. Real-time data from the world’s largest safety network of 700M+ devices, 200+ global enterprises, and 23,000+ federal, state and local agencies fuels the RapidSOS HARMONY AI engine that delivers this intelligence to those who need it most. Learn more at www.RapidSOS.com.
What this role is about:
This is an engineering leadership role, not simply an on-call manager. The SRE Manager owns two things: keeping RapidSOS's cloud infrastructure running reliably, and helping product teams get to a place where they can run their own services without routing every operational issue through SRE. RapidSOS powers real-time emergency response by connecting life-critical data to first responders, so reliability here directly impacts outcomes in moments that matter.
You'll lead the SRE Operations team and report to the Director of SRE & Platform Engineering. The team has real roots in NOC-style operations, and the honest goal of this role is to move it toward something more engineering-focused and proactive: better tooling, better practices, more ownership at the service team level. That's a gradual transition, and you'll be the one shaping how it happens.
What you’ll do:
- Own the reliability, scalability, and operational health of RapidSOS Kubernetes clusters, shared services, and core AWS infrastructure; handle upgrades, capacity planning, and node scaling, and test that multi-region failover actually works
- Drive the IaC foundation in Terraform/Atlantis and champion infrastructure-as-code as a core engineering standard
- Partner with Engineering Managers to set SLOs for their services, establish error budgets, and help teams build the habits to operate what they ship; the goal is for product teams to own their services, not to have SRE own everything on their behalf
- Maintain proactive reliability work: capacity planning, failure mode analysis, runbook quality, and chaos engineering exercises; run reliability reviews before major launches and organize failure mode exercises with product teams
- Drive blameless postmortem practice, ensuring every significant incident produces systemic improvements with clear ownership and closure
- Run the Tier 1 on-call rotation: scheduling for primary and secondary engineers, coordination with the 3rd-party NOC, and keeping incident escalation processes smooth and manageable
- Lead incident command on Sev-1s, escalate when needed, and keep engineering leadership informed throughout
- Lead and grow a high-impact team by mentoring engineers, owning headcount, and thinking ahead about what the team needs as the function grows
- Shape the team’s long-term AI strategy for infrastructure and operations by identifying opportunities for AI-driven automation and insight generation, evaluating tooling and workflows, and operationalizing best practices for scalable team-wide usage
- Own reserved instance strategy, the team's AWS cost footprint, and error budgets and SLOs across production services, and communicate that picture clearly to engineering and product leadership
- Work alongside Platform SRE on bigger infrastructure projects: Gateway API adoption, cross-region architecture, security changes
What we’re looking for in our ideal candidate:
- 7+ years in SRE, platform engineering, or DevOps, with at least two years where you were responsible for a team and not just your own work
- You’ve been directly responsible for Kubernetes and AWS infrastructure in production environments where uptime and resilience are critical
- Experience moving a team from reactive ops toward engineering-first reliability practices
- You’ve worked collaboratively with engineering teams to proactively improve reliability, scalability, and operational readiness before issues reach production
- Ability to write Python and review production-quality scripts and tooling
- You’ve applied SLOs, error budgets, and blameless postmortems in practice to improve reliability and drive better engineering decisions
- Hands-on familiarity with: Terraform/Atlantis, Kubernetes/Helm/ArgoCD, Datadog, Concourse CI/GitHub Actions, RabbitMQ, and AWS (EKS, RDS/Aurora, ElastiCache, VPC networking, IAM, KMS, Route53)
What we offer:
- The chance to work with a passionate team on solving one of the largest challenges globally
- Competitive salary and benefits and equity participation
- A dynamic, flexible and fun start-up work environment with a highly talented team
If you're curious to learn more about RapidSOS, you can check out https://rapidsos.com/blog/
Starting pay for a successful applicant will depend on a variety of job-related factors, which may include experience, relevant skills, training, education, location, business needs, or market demands. The salary range for this role is $185,000 - $215,000. This role will also be eligible to receive equity options. #LI-Remote