
Operational Disruption: When the Downtime Taxonomy Lies

By Christian Fieg · Last updated: April 2026

Ask a production manager how their unplanned downtime breaks down by reason, and you will almost always get a confident answer with a tidy distribution: so much percent changeover, so much percent technical fault, so much percent material shortage, so much percent minor stops, a small residual of unknown. The numbers will add to a hundred, they will be drawn from the MES or the shift report, and they will be wrong — not wrong by a few percentage points around the edges, but wrong in a way that materially misleads the people trying to improve the plant. This is not a moral problem. Nobody in the plant is being dishonest. The distortion is structural, it enters the data at a specific point, and once you see where, the fix is straightforward. It is also not the fix most plants attempt first.

I have spent twenty-five years in manufacturing — starting as a maintenance engineer at Johnson Controls in 1998, then Six Sigma Black Belt in automotive electronics, then globally responsible for MES and traceability across four continents with Johnson Controls and Visteon, and since 2021 at SYMESTIC. The pattern I am about to describe has shown up in every plant I have helped instrument in all those years, across industries, across countries, across scales. It is one of the most reliable patterns in manufacturing data, and I consider it a close cousin of the OEE-honesty problem I have written about elsewhere. Where OEE lies about a single headline number, the operational-disruption taxonomy lies about the shape of the problem. Both distortions have the same root cause: a human being is asked, in the moment, to produce information they do not actually have.

What operational disruption actually is

Operational disruption is the umbrella term for any unplanned interruption of the production system — mechanical failure, process fault, material starvation, quality excursion, control-system alarm, anything that takes the line from its intended running state to a state in which it is not producing salable output. The standard infrastructure around it is well-defined: an incident is detected (by sensor, alarm, or operator observation), logged in the CMMS or MES with a reason code and timestamps, investigated if the impact is large enough to justify the effort, resolved, and closed. The headline metrics are MTBF (mean time between failures) and MTTR (mean time to repair), with OEE as the composite consequence metric. None of this is controversial. The infrastructure exists in most mid-market plants and works adequately for its basic purpose.
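As a minimal sketch of how the two headline metrics fall out of logged stoppage timestamps, the following assumes one common convention: MTBF is run time divided by the number of stops, and MTTR is downtime divided by the number of stops. The `Stoppage` record and the `mtbf_mttr` function are hypothetical names for illustration, not part of any particular MES or CMMS.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Stoppage:
    """One logged unplanned stop (hypothetical record layout)."""
    start: datetime
    end: datetime


def mtbf_mttr(stoppages, window_start, window_end):
    """Return (MTBF, MTTR) as timedeltas for one observation window.

    Assumes the stoppages do not overlap and lie fully inside the window.
    """
    if not stoppages:
        raise ValueError("no stoppages in window; MTBF and MTTR are undefined")
    downtime = sum((s.end - s.start for s in stoppages), timedelta())
    uptime = (window_end - window_start) - downtime
    n = len(stoppages)
    return uptime / n, downtime / n


# Illustrative shift: 8 hours with three stops of 10, 20 and 30 minutes.
shift_start = datetime(2026, 4, 1, 6, 0)
stops = [
    Stoppage(shift_start + timedelta(hours=1), shift_start + timedelta(hours=1, minutes=10)),
    Stoppage(shift_start + timedelta(hours=3), shift_start + timedelta(hours=3, minutes=20)),
    Stoppage(shift_start + timedelta(hours=6), shift_start + timedelta(hours=6, minutes=30)),
]
mtbf, mttr = mtbf_mttr(stops, shift_start, shift_start + timedelta(hours=8))
```

With these made-up numbers, downtime is 60 minutes and run time is 420 minutes, so MTBF comes out at 140 minutes and MTTR at 20 minutes. Note that neither number depends on the reason code at all, which is why both can look healthy while the reason distribution is distorted.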

The problem is not the infrastructure. The problem is what happens at the moment a stoppage occurs, when the operator — or a machine-side interface driven by the operator — has to produce a reason code and assign it to the stoppage event. That act of classification is the single most contaminated data point in the entire production-monitoring stack, and almost everything downstream of it that tries to improve the plant depends on it being correct.

Where the distortion enters

When a line stops, the operator has, in the best case, a clear visual cue — an alarm on the HMI, a pile of scrap at the reject station, a raw-material hopper that has run empty — and they classify the event quickly and correctly. That happens a meaningful fraction of the time, but it is not the majority case in the plants I have seen. The majority case is that the operator sees the line has stopped, does not immediately know why, has to get the line running again as the overriding priority, and classifies the stoppage with whatever label feels most plausible given what they were doing in the minutes before the stop. The classification happens in five to ten seconds, under time pressure, often without any diagnostic information about what the machine was actually doing when it stopped.

The categories operators reach for under that pressure are not randomly distributed. They systematically over-select categories that are (a) familiar, (b) not personally awkward to report, and (c) adjacent to whatever activity was happening at the time. Changeover is over-reported because changeovers are the most common non-fault stoppage and any stop that happens near a changeover window tends to get absorbed into the changeover category, even if the underlying cause was unrelated. Material shortage is over-reported because it is blame-free and easily verifiable after the fact. The "unknown" category, by contrast, is under-reported — sometimes severely — because logging a stoppage as unknown feels to the operator like admitting they were not paying attention, which is uncomfortable regardless of whether it was true.

From 25 years of MES rollouts
In almost every plant where I have compared the reported stoppage-by-reason distribution against the reality (usually by going back after the fact and cross-referencing the MES reason codes against independently captured machine-signal data), the share of events classified as "unknown" in the official report is well below the share that was genuinely unknown at the moment of classification. The missing mass has been redistributed, mostly into changeover and material shortage. This means the top categories in the official report are overstated, the "unknown" bucket is understated, and any improvement program that prioritises by category share is aimed at targets that are partly fictional.
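The cross-referencing described above can be sketched as a per-category comparison of two code lists for the same stoppage events: the reason the operator logged, and the reason an after-the-fact review of the machine signal supports. The function names and the ten sample events are invented for illustration and are not measured plant data.

```python
from collections import Counter


def reason_shares(codes):
    """Share of events per reason code, as fractions that sum to 1."""
    counts = Counter(codes)
    total = sum(counts.values())
    return {reason: n / total for reason, n in counts.items()}


def share_shift(reported_codes, signal_codes):
    """Reported share minus signal-derived share, per reason.

    Positive values mark categories the official report overstates,
    negative values mark categories it understates.
    """
    reported = reason_shares(reported_codes)
    truth = reason_shares(signal_codes)
    reasons = sorted(set(reported) | set(truth))
    return {r: reported.get(r, 0.0) - truth.get(r, 0.0) for r in reasons}


# Made-up example: the same ten stoppages, classified two ways.
reported = ["changeover"] * 4 + ["material"] * 3 + ["fault"] * 2 + ["unknown"] * 1
from_signal = ["changeover"] * 3 + ["material"] * 2 + ["fault"] * 2 + ["unknown"] * 3

shift = share_shift(reported, from_signal)
for reason, delta in shift.items():
    print(f"{reason:<12} {delta:+.0%}")
```

In this toy data the report overstates changeover and material shortage by ten points each and understates "unknown" by twenty, which is the redistribution pattern described above in miniature. A real comparison would likely weight by downtime minutes rather than event counts; counts are used here only to keep the sketch short.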

The distribution mismatch, in structural terms

The honest shape of a stoppage-by-reason distribution in a mid-market plant, when you look at it with machine-signal ground truth rather than with operator-classified codes, has a much larger unknown-or-uncategorised bucket than the official report acknowledges. It is not unusual for the real unknown share to be two or three times what the report says, with the difference having been absorbed disproportionately into the top two or three "known" categories. This is the single most common distortion in operational-disruption data, and it is present regardless of whether the plant is running a clean CMMS with a modern MES or keeping a dog-eared paper stoppage log by hand.

Why it matters is straightforward. Improvement programs in manufacturing typically follow Pareto logic: identify the top one or two stoppage categories, attack them first, move on to the next. When the top categories are overstated and the "unknown" bucket is understated, the Pareto prioritisation points at problems that are real but smaller than they appear, and it hides from view the problem that is actually the largest single contributor — the one whose root cause is literally not yet known. Many plants spend years attacking the overstated categories, get diminishing returns, and never understand why the aggregate downtime is not moving. The overstated categories get modestly better. The uncategorised part, which was the larger share all along, is still waiting to be investigated.
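To make the Pareto effect concrete, here is a small sketch that ranks the same plant's downtime twice: once from the reported shares and once from the corrected ones. All percentages are invented for illustration, chosen so that the real "unknown" share is three times the reported one and the difference comes out of the top two categories.

```python
def pareto(shares):
    """Sort categories by share, descending, with a running cumulative total."""
    ranked = sorted(shares.items(), key=lambda kv: kv[1], reverse=True)
    rows, cumulative = [], 0
    for reason, share in ranked:
        cumulative += share
        rows.append((reason, share, cumulative))
    return rows


# Illustrative shares of total downtime, in percent (not real plant data).
reported = {"changeover": 35, "material": 25, "fault": 18, "minor stops": 12, "unknown": 10}
corrected = {"changeover": 24, "material": 16, "fault": 18, "minor stops": 12, "unknown": 30}

for label, shares in (("reported", reported), ("corrected", corrected)):
    print(label)
    for reason, share, cumulative in pareto(shares):
        print(f"  {reason:<12} {share:>3}%  cumulative {cumulative:>3}%")
```

On the reported numbers the ranking starts with changeover and material shortage, and "unknown" sits last. On the corrected numbers "unknown" moves to the top of the list, so the same Pareto rule now points the first improvement project at diagnosis rather than at changeover.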

The fix is architectural, not procedural

The obvious reaction to this is to tighten the operator-classification discipline: better training, better reason-code menus, more accountability for miscategorisation, mandatory fields before the stoppage can be closed. I have seen all of these tried, often several in combination, and they do not work in any sustained way. They do not work because they are trying to solve a structural problem with a behavioural intervention. The operator still does not know why the line stopped, still has five seconds to classify it, and still has to pick from a closed list of categories. Tighter discipline mostly moves the distortion around without reducing it.

The fix that does work is architectural. Separate the primary capture of the stoppage event from the classification of its reason. The stoppage itself — the moment the line stopped, how long it was stopped, what parameter excursion or alarm preceded it, what was happening on the machine in the last seconds before the stop — is captured automatically from the machine signal with no human involvement required, and it is captured correctly because the machine is not lying about its own state. The reason code is then produced either by the system itself where the signature is unambiguous (a specific alarm code from the PLC, a specific parameter excursion pattern) or by the operator as an annotation, after the line is running again, with the full machine-signal context available. The crucial difference is that when no reliable reason can be determined, the event stays as "unknown" rather than being pushed into an adjacent category — and the "unknown" bucket becomes a legitimate, honest part of the distribution that the organisation can then prioritise investigating.
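The separation can be sketched as a small data model. This is an illustration under stated assumptions, not the SYMESTIC implementation or any vendor's API: the alarm codes `E101` and `E230`, the `ALARM_TO_REASON` table, and the `capture` and `annotate` functions are all hypothetical. The rule that an operator annotation never overwrites a signal-derived reason is likewise a design assumption made for this example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

UNKNOWN = "unknown"

# Hypothetical mapping from unambiguous PLC alarm codes to reason codes.
ALARM_TO_REASON = {
    "E101": "material_shortage",
    "E230": "technical_fault",
}


@dataclass
class StoppageEvent:
    start: datetime
    end: datetime
    alarm_code: Optional[str] = None  # last alarm seen before the stop, if any
    reason: str = UNKNOWN
    reason_source: str = "none"       # "signal", "operator", or "none"


def capture(start, end, alarm_code=None):
    """Primary capture: built from machine signals only, no operator input."""
    event = StoppageEvent(start, end, alarm_code)
    if alarm_code in ALARM_TO_REASON:
        event.reason = ALARM_TO_REASON[alarm_code]
        event.reason_source = "signal"
    return event


def annotate(event, operator_reason):
    """Operator annotation, added after the line is running again.

    Passing None means the operator could not determine a cause,
    so the event stays "unknown" instead of being forced into a category.
    """
    if event.reason_source == "signal" or operator_reason is None:
        return event
    event.reason = operator_reason
    event.reason_source = "operator"
    return event


t0 = datetime(2026, 4, 1, 9, 0)
t1 = datetime(2026, 4, 1, 9, 12)
auto = capture(t0, t1, alarm_code="E101")     # classified from the signal
undiagnosed = annotate(capture(t0, t1), None)  # stays "unknown"
annotated = annotate(capture(t0, t1), "changeover")
```

The point of the design is that `capture` always succeeds and always records the event, whether or not a reason is available, and that "unknown" is the default rather than a choice someone has to make under pressure.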

An honest "unknown" bucket of 30% is not a failure of data capture. It is an accurate report of what the plant actually knows about its own downtime, and it is the starting point for making that number smaller in the only way that genuinely works: by going and finding out.

What this looks like in the running plant

In a plant that has made this architectural separation, the MTTR and MTBF numbers are still visible and still tracked, but they behave differently than in a plant where the classification is contaminated. The Pareto of stoppage reasons shifts — often quite sharply in the first month after switching from operator-classified to signal-derived-plus-annotated capture — with the "unknown" bucket growing to its real size and the formerly top categories shrinking to theirs. This is initially uncomfortable for the production organisation, because the improvement programs targeting the formerly-top categories suddenly look less impactful than they did, and the organisation has to confront the fact that a large share of its downtime is genuinely not yet diagnosed. The discomfort passes. What replaces it is a clearer-eyed view of where the actual opportunity lies, and a reduction in the wasted-effort rate of improvement projects aimed at misidentified targets.

In the SYMESTIC product set, the two modules most relevant to honest disruption capture are Alarms (the signal-driven event capture, including the parameter and alarm context from the seconds before the stop, which is the ground truth against which operator annotation is later checked) and Production Metrics (the MTBF, MTTR, and OEE dashboards that consume the corrected distribution). The combination does two useful things in sequence: it captures the stoppage independently of the operator so the event itself is not lost or distorted, and it lets the operator annotate after the fact with the full context visible, which improves annotation quality without putting the operator under the five-second classification pressure that was causing the distortion in the first place. The underlying problem — that a human under time pressure with incomplete information cannot reliably classify stoppage reasons — is not a problem any software vendor can argue away. What a good MES can do is stop making the human do that job at the worst possible moment, and let the machine's own signal carry the primary burden of capture. That sequencing change, more than any other single thing, is what makes a plant's disruption data worth looking at.

About the author
Christian Fieg
Head of Sales at SYMESTIC. 25+ years in manufacturing: Johnson Controls, Visteon, iTAC, Dürr. Six Sigma Black Belt. Global MES rollouts across four continents. Author of "OEE: Eine Zahl, viele Lügen" (2025).