Why growing engineering teams need incident process

At 1:17 a.m., Mira gets paged for a spike in checkout errors. The company has 23 engineers, three product squads, and one person who really understands the payment retry worker. He is camping somewhere without reception because nobody believed the retry worker was a single point of failure until it failed.

The Slack thread starts with useful facts and then turns into noise. The backend lead asks for database graphs. The product manager asks whether to pause a launch email. The CTO joins from a phone and says, “Can someone give me the short version?” Mira is trying to read logs while six people ask her to narrate her thinking.

This team does not have an incident management problem because it is too large. It has one because it waited until production knowledge stopped fitting inside the founding engineers’ heads.

The common belief in startup engineering is that incident process belongs after scaling: after multiple on-call rotations, after enterprise customers, after the DevOps hire, after the first painful outage. I think that is backwards. Growing teams need an incident process before it feels necessary, because by the time it feels necessary, the habits that break incident response have already become normal.

The real threshold is not headcount

Headcount is a lazy proxy. A twelve-person engineering team can have a bigger incident coordination problem than a forty-person team if the twelve are split across time zones, own different parts of the request path, and only learn about production through Slack archaeology.

The threshold is the end of shared context. When the same three people built the API, the worker queue, the database migrations, and the deploy script, incident response can look like a messy conversation and still work. Everyone knows which dashboard lies, which cron job is fragile, and which customer integration sends malformed payloads every Monday.

Engineering team growth breaks that model quietly. A new frontend squad ships a change that doubles calls to a pricing endpoint. The platform engineer who tuned that endpoint six months ago moved to infrastructure. The on-call engineer sees CPU saturation but does not know the business event behind it. Nobody did anything reckless; the team just outgrew telepathy.

I call this the hallway incident model: production is operated by whoever happens to know the right hallway conversation. It feels efficient because there is no template, no severity language, no assigned incident commander, and no communication cadence. It is efficient only while the people who remember the hallway are awake and employed.

The implication is uncomfortable for founders and engineering managers: the first incident process belongs near the first real split in ownership, not near the first catastrophic outage. If deploys, databases, queues, customer communication, and on-call are no longer handled by the same tiny group, improvisation has become a liability disguised as speed.

The first process replaces telepathy, not judgment

Most resistance to early incident management comes from a false picture of process. People imagine a 40-page runbook, severity committees, mandatory forms, and someone policing language in the incident channel while the site burns. That version deserves the eye-roll it gets.

The first useful process is much smaller. It names who coordinates, who investigates, who talks to stakeholders, where decisions get written, and when the next update goes out. That is not bureaucracy; it is load shedding for the on-call engineer’s brain.
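To show how small that first process can be, here is a rough Python sketch that fits the five things it names into a single record. The `Incident` class, its field names, the people in the usage example, and the 20-minute default cadence are all illustrative assumptions, not a prescribed schema or any particular tool's API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Incident:
    """Hypothetical record of what a first incident process names."""
    summary: str
    coordinator: str           # who coordinates
    investigators: List[str]   # who investigates
    communicator: str          # who talks to stakeholders
    # Where decisions get written: one shared, append-only log.
    decision_log: List[str] = field(default_factory=list)
    # When the next update goes out (assumed default cadence).
    update_interval: timedelta = timedelta(minutes=20)
    declared_at: datetime = field(default_factory=datetime.now)

    def record_decision(self, author: str, text: str) -> None:
        """Append a decision so there is one written version of events."""
        self.decision_log.append(f"{author}: {text}")

    def next_update_due(self, last_update: Optional[datetime] = None) -> datetime:
        """Return when stakeholders should next hear from the communicator."""
        return (last_update or self.declared_at) + self.update_interval


# Usage with made-up names: declare the incident, then log a decision.
incident = Incident(
    summary="Checkout errors spiking",
    coordinator="alex",
    investigators=["sam"],
    communicator="jo",
    declared_at=datetime(2024, 1, 1, 1, 17),
)
incident.record_decision("alex", "Pause the launch email until errors drop")
print(incident.next_update_due())
```

The point of the sketch is the shape, not the tooling: the same five fields work just as well as a pinned message in the incident channel.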

During an outage, the most expensive resource is not CPU or database IOPS. It is the attention of the few people who can still form a correct theory under pressure. Every “any updates?” message steals from diagnosis. Every duplicate dashboard dive burns time. Every private DM creates a second version of reality.

A lightweight process feels slower for the first ten minutes because someone pauses to declare severity, create the channel, assign roles, and set a communication rhythm. That pause is exactly why it works. It prevents the incident from becoming a crowd around the keyboard.

Counterintuitively, the less often a growing team has major incidents, the earlier it needs this muscle. A team that sees production fires every week develops ugly but real instincts. A team with one serious outage per quarter has no shared memory to fall back on, so the process has to carry the memory for it.

Waiting turns heroics into policy

At a 35-engineer B2B startup I advised, the data ingestion pipeline fell behind every few weeks when a large customer uploaded end-of-month files. The “process” was that Theo, the senior DevOps engineer, would notice lag before the alert fired, log into a worker box, drain a poisoned job, and tell customer success that things were fine.

This worked until Theo went on parental leave. The next backlog lasted four hours, not because the failure mode was new, but because the response existed only as Theo’s memory. The postmortem found missing runbooks, weak alerts, and unclear ownership, but the real defect was older: the company had converted one person’s vigilance into an operating model.

That is the hero on-call trap. A small team survives early incidents through individual judgment, then mistakes that survival for resilience. Managers praise the save, customers forgive the miss, and the team gets a little more comfortable with production systems that only behave when the right person is watching.

Waiting also changes the shape of the eventual process. A process written after a bruising outage is usually a monument to the last failure: extra approval for one risky migration, a new alert for one noisy queue, a status update rule because one executive felt blindsided. Trauma writes narrow policy.

Early process has a better chance of being boring in the right way. It can define how incidents are coordinated before everyone is trying to punish the last mistake. It can protect responders without pretending to predict every failure.

This is where many scaling efforts go sideways. Engineering leaders wait for the first severe outage to justify incident management, then overcorrect with heavy ceremony. The team learns that process arrives as punishment, so the next process conversation starts with resistance before anyone reads the proposal.

Stakeholder communication fails before diagnosis does

Technical teams like to believe the hard part of an incident is finding root cause. Sometimes it is. More often in growing companies, the first visible failure is communication, because customer success, sales, support, and executives all discover the outage through different channels.

A payments API starts returning intermittent 502s at 9:42 a.m. Support sees five tickets from large accounts. Sales hears from a prospect in a live trial. The CEO gets a text from a board member whose portfolio company depends on the API. Meanwhile, the on-call engineer is still deciding whether the errors are from the load balancer or a bad deploy.

Without a declared incident owner and update cadence, every stakeholder tries to reduce uncertainty by going straight to the person with the pager. They are not being unreasonable. They are reacting to a vacuum the engineering team created.

This is why status pages and incident channels are not cosmetic. They are pressure valves. A short, honest update every fifteen or twenty minutes can keep ten people from interrupting the two people who are closest to the fix.
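One way to keep those updates short and honest is to give them a fixed shape that always ends with a commitment to the next one. The sketch below is a minimal Python illustration; the function name `format_update`, its fields, and the 15-minute default are assumptions chosen for the example, not a standard template.

```python
from datetime import datetime, timedelta


def format_update(now: datetime, impact: str, known: str, action: str,
                  cadence_minutes: int = 15) -> str:
    """Build a short status update that states when the next one is due."""
    next_update = now + timedelta(minutes=cadence_minutes)
    return (
        f"[{now:%H:%M}] Impact: {impact}\n"
        f"Known so far: {known}\n"
        f"Current action: {action}\n"
        f"Next update by {next_update:%H:%M}"
    )


# Usage with illustrative values based on the payments API scenario above.
print(format_update(
    now=datetime(2024, 1, 1, 9, 57),
    impact="Intermittent 502s on the payments API",
    known="Cause not yet confirmed",
    action="Checking the load balancer and the latest deploy",
))
```

Stating "cause not yet confirmed" is still an update. The final line is what lets stakeholders stop asking, because they know when the next answer arrives.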

The mistake is treating external communication as something to add after the technical response matures. In reality, communication is part of the technical response because it preserves the conditions needed to think. The incident commander is not there to perform confidence; they are there to keep the diagnostic loop from being torn apart by panic.

Process has to arrive before platform ownership fragments

Platform engineering usually appears when startup engineering has already accumulated enough shared pain: slow deploys, inconsistent environments, unclear service ownership, and an on-call rotation that nobody wants to join. The new platform lead inherits not just tooling gaps, but years of informal production behavior.

That inheritance is expensive. If every service has its own alert style, every team has its own definition of urgent, and every incident channel has a different rhythm, the platform team becomes the translation layer for the entire company. They spend their first year normalizing habits that could have been set when the company had fewer people and less pride invested in local customs.

This is the DevOps sponge problem: the team with the broadest access absorbs every ambiguity. If nobody knows whether a database failover is an application incident or an infrastructure incident, DevOps owns it. If nobody knows who updates customers, DevOps gets asked for impact. If nobody knows who can call a rollback, DevOps becomes the permission booth.

A small incident process drawn up early protects platform teams from becoming human middleware. It gives service owners a common operating contract before ownership fragments into a dozen different habits. The contract does not need to be elaborate; it needs to be used often enough that a new engineer can join on-call without learning production etiquette by embarrassment.

The right moment is usually earlier than the org chart suggests. If a new hire can deploy code before they can explain how the company runs an incident, engineering team growth has outrun operational maturity.

Final thoughts

The strongest argument for early incident management is not that it prevents outages. It prevents a young engineering team from teaching itself the wrong response to outages: private DMs, hidden heroes, delayed customer updates, and diagnosis performed for an audience.

A growing team does not need corporate ceremony. It needs a small, practiced way to keep authority, attention, and truth in the same room when production is failing.

The first incident process is not a sign that the startup has become corporate; it is the guardrail that lets the startup stay fast after production stops being small.

Your incident process is late when the pager gets busy