
On-Call Culture Done Right: Incident Response Without Burnout

By Marc Molas · August 31, 2023 · 10 min read

On-call is one of the fastest ways to destroy an engineering team's morale if you do it wrong. And most companies do it wrong.

The symptoms are predictable: the same two people always get paged because nobody else "knows the system well enough." Engineers dread their on-call weeks. Incidents recur because nobody fixes root causes. The best engineers leave, and you can't figure out why your retention is terrible.

I've carried a pager through years of enterprise-scale incident response, and I can tell you that building a healthy on-call culture is not complicated. It requires clear thinking, a few good tools, and leadership that treats on-call as a first-class responsibility rather than an afterthought.

SLAs vs. SLOs: Know What You're Actually Managing

Before you build an on-call rotation, you need to know what you're defending. This starts with understanding the difference between SLAs and SLOs, because most teams confuse them.

An SLA (Service Level Agreement) is a contract with your customers. "We guarantee 99.9% uptime. If we breach it, you get service credits." SLAs have legal and financial consequences.

An SLO (Service Level Objective) is an internal target, set stricter than the SLA. If your SLA promises 99.9%, your SLO might target 99.95%. The downtime the SLO allows is your error budget, and the gap between SLO and SLA is the buffer you have left before a contractual breach.

If your SLO is 99.95% over a rolling 30-day window, you have roughly 21.6 minutes of allowed downtime per window (0.05% of 43,200 minutes). When you're within budget, ship features aggressively. When you're burning through it, slow down and prioritize reliability.
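To make the arithmetic concrete, here is a minimal sketch of the error budget calculation. The function names and the 7 minutes 36 seconds of downtime are illustrative assumptions, not part of any particular monitoring tool.

```python
from datetime import timedelta


def error_budget(slo: float, window: timedelta) -> timedelta:
    """Downtime an availability SLO allows over the given window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction such as 0.9995")
    return window * (1.0 - slo)


def budget_remaining(slo: float, window: timedelta, downtime: timedelta) -> timedelta:
    """Budget left after subtracting downtime already spent in the window."""
    return error_budget(slo, window) - downtime


window = timedelta(days=30)
budget = error_budget(0.9995, window)
print(f"total budget: {budget.total_seconds() / 60:.1f} min")

# Hypothetical downtime already recorded in this window.
spent = timedelta(minutes=7, seconds=36)
left = budget_remaining(0.9995, window, spent)
print(f"remaining: {left.total_seconds() / 60:.1f} min")
```

A negative result from `budget_remaining` means the SLO is already breached for the window, which is the signal to stop shipping and start fixing.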

Why this matters for on-call: your on-call engineers should know the SLOs they're defending and the current error budget status. "We have 14 minutes of budget left this month" creates urgency. "Keep the system up" is vague enough to be meaningless.

A Sustainable Rotation Starts at Four People

The most common mistake with on-call is making it too burdensome for individuals. Here's what works for teams of 5-8 engineers, which is the typical size at startups:

Weekly rotation, single primary. One person handles all pages for one week (Monday to Monday). Simple and effective with enough people in the rotation.

The minimum viable rotation is 4 people. Fewer than 4 means each person is on-call more than 25% of the time, which is unsustainable. At 5-6 people, each engineer is on-call one week in five or six, a comfortable cadence.
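A weekly single-primary rotation is simple enough to sketch in a few lines. The version below is an illustration only: the engineer names, the start date, and the decision to reject rotations smaller than four are assumptions, and real schedules would live in your paging tool rather than in a script.

```python
from datetime import date, timedelta

MIN_ROTATION_SIZE = 4  # below this, each person is on-call more than 25% of the time


def on_call_share(team_size: int) -> float:
    """Fraction of weeks each engineer spends as primary."""
    return 1 / team_size


def weekly_rotation(engineers: list, first_monday: date, weeks: int) -> list:
    """Return (shift_start, primary) pairs for Monday-to-Monday shifts."""
    if len(engineers) < MIN_ROTATION_SIZE:
        raise ValueError(f"need at least {MIN_ROTATION_SIZE} engineers")
    if first_monday.weekday() != 0:
        raise ValueError("first_monday must be a Monday")
    schedule = []
    for week in range(weeks):
        start = first_monday + timedelta(weeks=week)
        # Round-robin: the list order decides who goes first.
        schedule.append((start, engineers[week % len(engineers)]))
    return schedule


team = ["ana", "ben", "chen", "dara", "eli"]  # hypothetical names
for start, primary in weekly_rotation(team, date(2023, 9, 4), 6):
    print(start.isoformat(), primary)
print(f"share per engineer: {on_call_share(len(team)):.0%}")
```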

Follow-the-sun for distributed teams. Engineers in Europe cover 08:00-20:00 CET, Americas cover the rest. Nobody loses sleep. This is one of the real advantages of distributed teams.
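The handoff rule can be expressed as a small lookup. This sketch assumes CET as a fixed UTC+1 offset (it ignores summer time) and uses the hypothetical region labels "europe" and "americas"; a production schedule would rely on your paging tool's timezone handling.

```python
from datetime import datetime, time, timedelta, timezone

# Assumption: CET treated as a fixed UTC+1 offset, ignoring summer time.
CET = timezone(timedelta(hours=1), "CET")
EUROPE_START = time(8, 0)
EUROPE_END = time(20, 0)


def region_on_call(moment: datetime) -> str:
    """Return which region holds the pager at a timezone-aware moment."""
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    local = moment.astimezone(CET).time()
    if EUROPE_START <= local < EUROPE_END:
        return "europe"
    return "americas"


print(region_on_call(datetime(2023, 8, 31, 12, 0, tzinfo=timezone.utc)))
print(region_on_call(datetime(2023, 8, 31, 23, 0, tzinfo=timezone.utc)))
```

The half-open interval (start inclusive, end exclusive) guarantees that every minute belongs to exactly one region, so there is no gap or overlap at the handoff.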

Secondary on-call as escalation. If the primary can't resolve the incident within 30-60 minutes, it escalates to the secondary, someone with deeper system knowledge. Rotate both roles.
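As a rough illustration, the escalation rule boils down to a threshold check. The class below is a hypothetical sketch, not the configuration format of PagerDuty or Opsgenie, and the 30-minute default is just the lower end of the 30-60 minute range.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class EscalationPolicy:
    primary: str
    secondary: str
    escalate_after: timedelta = timedelta(minutes=30)

    def who_to_page(self, unresolved_for: timedelta) -> list:
        """People who should be engaged for an incident still open."""
        if unresolved_for >= self.escalate_after:
            # Primary stays on the incident; secondary joins.
            return [self.primary, self.secondary]
        return [self.primary]


policy = EscalationPolicy(primary="ana", secondary="ben")  # hypothetical names
print(policy.who_to_page(timedelta(minutes=10)))
print(policy.who_to_page(timedelta(minutes=45)))
```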

Hard rule: the on-call person is not expected to do normal sprint work at the same capacity. Being on-call means you're interruptible. If you also expect them to close 8 story points, you're setting them up to do both things badly.

The Tooling Baseline Is Four Tools, Not a Platform

You don't need a massive investment in tools, but you need the basics:

Alerting and paging: PagerDuty or Opsgenie. These handle alert routing, escalation policies, schedules, and on-call overrides. PagerDuty is the industry standard. Opsgenie (now part of Atlassian) is a solid, cheaper alternative. Do not rely on Slack notifications or email for paging. People silence Slack. People miss emails. A phone call at 3 AM from PagerDuty does not get ignored.

Runbooks: For every alert that pages someone, there should be a runbook. A runbook is a document that answers: What does this alert mean? What's the likely cause? What are the first 3 things to check? How do you mitigate it? Where are the logs and dashboards? A runbook turns a 45-minute panic session into a 10-minute diagnosis. Store them in your wiki and link them directly in the alert.
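One way to enforce the runbook rule is a small audit over your alert definitions. The sketch below assumes you can export alerts into a simple structure; the `Alert` fields, alert names, and wiki URL are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Alert:
    name: str
    pages_human: bool
    runbook_url: Optional[str] = None


def alerts_missing_runbook(alerts: list) -> list:
    """Names of paging alerts that have no runbook link."""
    return [
        alert.name
        for alert in alerts
        if alert.pages_human and not (alert.runbook_url or "").strip()
    ]


alerts = [
    Alert("payment-error-rate", True, "https://wiki.example.com/runbooks/payment-errors"),
    Alert("disk-usage-high", True),           # pages a human, no runbook
    Alert("nightly-report-delayed", False),   # does not page, so it is exempt
]
print(alerts_missing_runbook(alerts))
```

Running a check like this in CI against the alert configuration keeps new paging alerts from shipping without a runbook.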

Status page: Statuspage (Atlassian), Instatus, or even a simple static page. When something is down, your customers should learn about it from your status page, not from trying to use the product and failing. The on-call engineer should be able to update the status page in under a minute.

Incident channel: A dedicated Slack channel (or equivalent) that gets auto-created for each incident. All communication about the incident happens there. No DMs, no side threads. This creates an automatic timeline that's invaluable for the postmortem.

Blameless Postmortems: How to Actually Run One

"Blameless postmortem" has become a buzzword that many teams claim to practice and few actually do. Here's what a real one looks like:

Timing: Within 48 hours of resolution. Wait a week and people forget the details.

Attendees: Everyone involved in the incident, plus anyone who wants to learn.

Structure:

Timeline reconstruction. What happened, in what order, from first signal through resolution.
Root cause analysis. Not "who messed up" but "what in the system allowed this to happen?" A human error is never the root cause — the system that let it reach production is.
Contributing factors. What made detection slow? What made resolution hard?
Action items. Concrete, assigned, with due dates. "Improve monitoring" is not an action item. "Add an alert on payment error rate exceeding 2% over 5 minutes, assigned to Sofia, due September 15" is.
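The "concrete, assigned, with due dates" test is easy to check mechanically. Here is a minimal sketch with a hypothetical `ActionItem` type; it only checks for an owner and a due date, since judging whether a description is concrete still takes a human. The year in the example date is an assumption.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None
    due: Optional[date] = None

    def problems(self) -> list:
        """Reasons this item is not yet a real action item."""
        issues = []
        if not self.owner:
            issues.append("no owner")
        if self.due is None:
            issues.append("no due date")
        return issues


vague = ActionItem("Improve monitoring")
concrete = ActionItem(
    "Add an alert on payment error rate exceeding 2% over 5 minutes",
    owner="Sofia",
    due=date(2023, 9, 15),  # year assumed for illustration
)
print(vague.problems())
print(concrete.problems())
```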

The critical cultural element: nobody gets punished for incidents. If people fear blame, they hide information. If they hide information, you can't learn. If you can't learn, incidents recur.

Compensating On-Call Properly

This is the hill I will die on: if you don't compensate on-call engineers, you don't have a rotation — you have exploitation.

Being on-call constrains your personal time. You can't go camping without cell service. You keep your laptop accessible. Pretending it's "just part of the job" is how you lose your best people.

I hear the counter-argument: senior salaries already price this in. Sometimes they do — and if that's your model, fine, but write it into the contract. Implicit compensation is just unpaid work with extra steps.

Compensation models that work:

Flat stipend per on-call shift. 200-500 EUR per week, regardless of whether you get paged.
Per-incident bonus. Additional compensation for actual responses outside business hours.
Time off in lieu. Paged at 3 AM for 2 hours? Half-day off the next day. Non-negotiable.
Combination. Stipend + time off in lieu is the most common and most equitable model.

What matters is that it's explicit, in the employment contract, and applied consistently.

Signs Your On-Call Culture Is Broken

If any of these sound familiar, you have work to do:

People dread on-call weeks. Not mild annoyance — actual dread. They mention it in 1:1s and trade shifts constantly.
The same person always gets paged. Knowledge silo or misconfigured alerts — either way, it's unsustainable.
Incidents recur. The same failure every few weeks. Postmortem action items never get prioritized.
No compensation or recognition. On-call is expected but invisible.
On-call is used as hazing. New engineers get put on-call before they understand the system.
There are no runbooks. Every incident is a fresh investigation from scratch.

Every one of these is fixable, and none of the fixes requires a big budget. If this were my team, here's what I'd do this quarter:

Audit every alert that pages a human. If it has no runbook, write one or stop paging on it.
Get the rotation to at least four people and put the escalation path in writing.
Put on-call compensation in the employment contract — stipend, time off in lieu, or both.
Run a blameless postmortem within 48 hours of the next incident, with assigned, dated action items.

What all four require is leadership that takes operational health as seriously as feature delivery.

At Conectia, the senior engineers we embed into your teams have lived through good on-call cultures and terrible ones. They bring operational maturity — writing runbooks, setting up proper alerting, building the automation that prevents incidents instead of just responding to them. When your team has people who treat production reliability as a craft, on-call stops being a burden and becomes a normal, well-managed part of engineering life.

Need engineers who build reliable systems, not just features? Talk to a CTO — our senior LATAM engineers bring the operational maturity that turns on-call from a dreaded obligation into a sustainable practice.
