Reliability fails in planning before production


At 2:17 a.m., Maya from payments gets paged because checkout success has dropped from 99.4% to 92%. The incident channel fills with the usual cast: the SRE on call, the payments lead, a customer support manager, and eventually the CTO, who asks the question everyone hates and everyone understands: "Why did we let this get this bad?"
The answer is not in the deploy that triggered the failure. It lies in a roadmap meeting six weeks earlier, when the same payments team was asked to ship wallet support, migrate a ledger table, and absorb two unplanned compliance requests without moving the date.
Most CTOs do not underinvest in reliability because they dislike uptime. They underinvest because they treat reliability as an operational discipline owned by SRE and incident management, while the real reliability budget gets spent quietly in product planning.
That is the mistake: reliability is not primarily a production problem. It is a leadership accounting problem, and production is where the bad accounting gets audited.
The roadmap is the first incident timeline
After the checkout incident, the postmortem timeline starts at 1:58 a.m., when error rates began climbing after a schema change. That is convenient, tidy, and mostly false. The first meaningful event happened when engineering leadership accepted a quarter plan with no room for the migration work to be done safely.
Teams do not skip backfills, shadow reads, rollback testing, and load tests because they forgot what good engineering looks like. They skip them because every hour spent reducing blast radius is an hour that appears to delay a promised launch. The cost of risk reduction is visible in planning; the cost of not doing it is invisible until the pager turns it into evidence.
This is the planning blind spot. A CTO can ask for higher reliability and still approve a plan that makes it mathematically unlikely. The contradiction stays hidden because roadmaps track feature commitments, while reliability work is often stored in phrases like "hardening," "tech debt," and "follow-up."
The mechanism is simple: product dates create named accountability, while reliability risk creates diffuse accountability. When wallet support slips, one executive asks why. When reliability risk accumulates, everyone agrees it is unfortunate after the outage.
A serious CTO reads a roadmap the way an incident commander reads a dashboard. Where are the single points of failure? Which migrations have no rollback? Which teams are carrying more operational load than their staffing model admits? Those questions belong before the commit, not after the page.
SRE cannot rescue a product plan that lies
At a 140-engineer marketplace company, the search team owned ranking, indexing, and the API that powered most discovery traffic. They had one embedded SRE for two quarters. The CTO proudly described this as "bringing reliability closer to the team."
In practice, that SRE became a human buffer between an overloaded product team and the consequences of its commitments. She wrote runbooks for a brittle indexing pipeline, tuned alerts for a cache eviction pattern no one had time to fix, and spent Friday afternoons negotiating which known risks could survive another weekend.
This is the SRE absorbent layer. Leadership keeps the same delivery promises, adds an SRE, and mistakes the reduction in visible pain for an increase in actual reliability. The pages get triaged faster, the incident channel is calmer, and the underlying system remains one batch job away from a revenue-impacting outage.
SRE works when it has authority to change the system of work: block unsafe launches, reduce scope, force capacity conversations, and make error budgets painful to spend. SRE fails when it is asked to make fragile plans look professional during failure.
The uncomfortable line is this: if SRE can only advise, reliability is still owned by whoever controls scope and deadlines. A CTO who wants uptime but refuses to let reliability change the roadmap has not created an SRE function. They have created a nicer incident concierge.
Uptime targets are cheap until they spend something real
A consumer subscription company set a 99.95% uptime target for its core API after a rough quarter. The number looked mature in the board deck. It also had no connection to staffing, release policy, dependency risk, or the team's actual ability to recover at 3 a.m.
The API team had five engineers. Two were new, and only one, a senior engineer, understood the rate limiter, billing integration, and failover path. Their on-call rotation covered nights and weekends, but most incidents still routed through that same senior engineer because everyone knew he could fix things fastest.
That uptime target did not create reliability. It created a lie with decimals.
A target matters only when missing it changes behavior before the next outage. If the team burns through error budget and the roadmap continues untouched, the target is decoration. If the target forces a feature freeze, a dependency replacement, or a staffing decision, it becomes an operating constraint.
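To make the arithmetic concrete, here is a minimal sketch of how an uptime target converts into a downtime allowance and a planning decision. The 30-day window, the function names, and the thresholds in `roadmap_action` are all assumptions chosen for illustration; they are not a policy the company in the example used.

```python
def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full downtime the target tolerates over the window."""
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative means overspent)."""
    budget = allowed_downtime_minutes(slo, window_days)
    return 1.0 - downtime_minutes / budget


def roadmap_action(remaining: float) -> str:
    """Hypothetical policy: the thresholds below are illustrative assumptions."""
    if remaining <= 0:
        return "freeze feature launches"
    if remaining < 0.25:
        return "reprioritize reliability work"
    return "proceed"


if __name__ == "__main__":
    # A 99.95% target over an assumed 30-day window.
    print(round(allowed_downtime_minutes(0.9995), 1))
    # A single 30-minute outage against that budget.
    print(roadmap_action(budget_remaining(0.9995, 30.0)))
```

Under these assumptions, 99.95% allows about 21.6 minutes of downtime in 30 days, so one 30-minute outage overspends the budget. The point of `roadmap_action` is that the number only becomes an operating constraint once some function like it is allowed to return something other than "proceed".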
Many CTOs like uptime targets because they sound quantitative without forcing immediate tradeoffs. Real reliability numbers are irritating because they turn vague risk into denied work. The moment a target cannot say no to something, it stops being a target and becomes marketing copy for engineering leadership.
Incident management is not the control plane
There is a strange pattern in companies that have been burned a few times. Their incident management process gets better faster than their systems do. The war room has roles, the status page updates are crisp, the severity definitions are laminated into the onboarding wiki, and the same preventable incidents keep happening.
Call this incident theater. It is not fake work; during an outage, good coordination matters. The theater begins when leadership mistakes the visible discipline of response for the harder discipline of changing what creates incidents.
I have seen a data platform team run a near-perfect incident for a warehouse outage caused by a saturated job queue. The incident commander kept the channel clean, customer support got updates every 20 minutes, and the postmortem identified the same missing queue isolation that appeared in two earlier reviews. The action item died again because it competed with a revenue analytics launch.
Good incident management reduces confusion during harm. It does not reduce the production of harm by itself. The control plane for reliability sits in prioritization, architecture ownership, staffing, and the authority to stop unsafe work.
This is why experienced engineers roll their eyes when a CTO responds to repeated outages by asking for better postmortems. The postmortem was not the weak link. The weak link was the absence of a mechanism that made the next plan different from the last one.
The on-call rotation tells the truth
A CTO who wants the fastest read on reliability should skip the architecture diagram and inspect the last month of pages. Who got woken up? Which alerts were ignored? Which service has a runbook that says "restart and watch" because no one knows the real failure mode?
On-call is where engineering leadership's choices become a sleep schedule. A team can claim ownership of a service, but if every serious incident routes to one staff engineer in another time zone, ownership is fiction. A platform can claim self-service, but if every deploy failure requires the same three people in the release channel, self-service is branding.
The most revealing signal is not the number of alerts. It is the number of alerts that no longer surprise anyone. When an engineer mutes the same disk-pressure page for the third night because "it always clears after compaction," the team has accepted operational debt as background noise.
This is alert fatigue debt, and it compounds differently from code debt. Code debt slows delivery. Alert fatigue debt trains humans to distrust the system that is supposed to tell them when customers are in trouble.
A CTO who treats on-call health as a local team issue misses the organizational signal. Bad rotations, noisy alerts, and hero-based recovery are not symptoms of weak engineers. They are receipts for reliability decisions leadership already made.
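The review of a month of pages described above can be sketched in a few lines. This is an illustration under stated assumptions: the `Page` record, its field names, and the `action` values are hypothetical, and a real paging tool will export something shaped differently.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    alert: str      # alert name, e.g. "disk-pressure" (hypothetical field)
    responder: str  # who actually resolved or dismissed the page
    action: str     # assumed values: "fixed", "muted", "auto-resolved"


def responder_concentration(pages: list[Page]) -> tuple[str, float]:
    """Return the busiest responder and their share of all pages."""
    if not pages:
        raise ValueError("no pages to analyze")
    counts = Counter(p.responder for p in pages)
    name, n = counts.most_common(1)[0]
    return name, n / len(pages)


def unsurprising_alerts(pages: list[Page], min_repeats: int = 3) -> list[str]:
    """Alerts that were muted or left to self-clear at least min_repeats times."""
    ignored = Counter(p.alert for p in pages if p.action != "fixed")
    return sorted(a for a, n in ignored.items() if n >= min_repeats)
```

A high share from `responder_concentration` points at hero-based recovery, and anything returned by `unsurprising_alerts` is a candidate for the alert fatigue debt described above. The `min_repeats` default of 3 simply mirrors the "third night" example and is not a recommended threshold.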
Final thoughts
The common CTO mistake is not ignoring reliability. It is respecting reliability only after it becomes an incident.
Reliability work has to compete at the same altitude as product work, because that is where the tradeoff is created. If it lives only in SRE rituals, postmortem action items, and heroic on-call recovery, it will keep losing to commitments made by people who do not carry the pager.
The strongest reliability program is not the one with the cleanest incident template. It is the one where a risky roadmap gets changed before the outage makes the argument undeniable.
Production does not create leadership debt; it collects it.