Staff Site Reliability Engineer

Jobgether • Germany • Netherlands

No Relocation

Posted: May 21, 2026

Job Description

This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Staff Site Reliability Engineer in Germany. Join a highly collaborative engineering environment where reliability, scalability, and automation are central to delivering world-class developer experiences at global scale. In this role, you will help design and maintain resilient infrastructure systems supporting millions of users worldwide, while driving operational excellence across distributed cloud environments. You will work closely with engineering and infrastructure teams to improve observability, optimize performance, and build automation that reduces operational complexity. This position offers the opportunity to lead incident response initiatives, shape reliability standards, and influence infrastructure strategy across the organization. It is an ideal opportunity for a senior SRE professional who thrives in fast-moving environments and enjoys solving complex distributed systems challenges while mentoring teams and promoting engineering best practices.
Accountabilities:
Design and implement comprehensive observability solutions, including monitoring, logging, tracing, dashboards, and alerting systems to improve visibility into infrastructure health and performance.
Define, track, and improve Service Level Objectives (SLOs) and Service Level Indicators (SLIs) in collaboration with engineering and product teams.
Lead high-severity incident response efforts, coordinate troubleshooting activities, conduct blameless post-mortems, and implement long-term preventive solutions.
Build and maintain infrastructure automation and Infrastructure as Code solutions using tools such as Terraform or Pulumi.
Develop self-healing systems and automation processes that reduce operational overhead and improve system resilience.
Optimize large-scale Kubernetes and cloud-native deployments, focusing on scalability, reliability, latency reduction, and capacity planning.
Investigate and resolve complex distributed systems issues across multiple layers of the infrastructure stack.
Review architectural and system designs to ensure reliability, scalability, operational efficiency, and security best practices.
Mentor engineers across teams and help establish reliability-focused engineering culture and operational standards.
Build internal tools, integrations, and automation workflows using languages such as Python or Go to support platform operations and infrastructure improvements.

Requirements:
8–10 years of experience in Site Reliability Engineering, DevOps, Infrastructure Engineering, or related fields.
Strong software engineering skills with hands-on experience developing production-grade applications or tooling in Python or Go.
Deep expertise in distributed systems architecture, cloud-native environments, and service-oriented infrastructure design.
Extensive experience with Kubernetes, container orchestration, Docker, and modern cloud infrastructure technologies.
Proven ability to design and maintain advanced observability and monitoring ecosystems using tools such as Prometheus, Grafana, Datadog, or OpenTelemetry.
Strong background in incident management, root cause analysis, troubleshooting, and operational excellence practices.
Hands-on experience with Infrastructure as Code and automation tools such as Terraform, Pulumi, or similar technologies.
Excellent written and verbal communication skills, with the ability to explain complex technical topics clearly across teams and stakeholders.
Demonstrated leadership and mentoring experience working with engineers across multiple seniority levels.
Comfortable working across the full infrastructure stack and solving highly complex technical challenges in fast-paced environments.
Experience with Google Cloud Platform (GCP), high-throughput systems, startup environments, or technical content creation is considered a strong advantage.

Benefits:
Competitive salary package with equity opportunities.
Fully remote work environment across Europe.
Flexible time off policy and paid holidays.
Health, dental, vision, and life insurance coverage.
Paid parental, medical, and caregiver leave programs.
Short-term and long-term disability coverage.
Monthly wellness stipend to support personal well-being.
Autonomous and flexible work culture with strong ownership opportunities.
Quarterly team gatherings and collaborative company events.
Professional equipment and remote workspace support.
Opportunity to work on globally scaled infrastructure challenges using modern cloud-native technologies.
How Jobgether works: We use an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. Our system identifies the top-fitting candidates, and this shortlist is then shared directly with the hiring company. The final decision and next steps (interviews, assessments) are managed by their internal team. We appreciate your interest and wish you the best!
Data Privacy Notice: By submitting your application, you acknowledge that Jobgether will process your personal data to evaluate your candidacy and share relevant information with the hiring employer. This processing is based on legitimate interest and pre-contractual measures under applicable data protection laws (including GDPR). You may exercise your rights (access, rectification, erasure, objection) at any time.
We may use artificial intelligence (AI) tools to support parts of the hiring process, such as reviewing applications, analyzing resumes, or assessing responses. These tools assist our recruitment team but do not replace human judgment. Final hiring decisions are ultimately made by humans. If you would like more information about how your data is processed, please contact us.