
Downtime Analysis: The Six Losses Framework

By Christian Fieg · Last updated: April 2026

What is downtime analysis?

Downtime analysis — sometimes called stoppage analysis, idle time analysis or, in the OEE vocabulary, availability loss analysis — is the structured study of every minute a production asset is not producing what it should be producing. It covers planned stops (scheduled maintenance, changeovers), unplanned stops (breakdowns, material starvation, quality holds), micro-stops (those sub-5-minute interruptions that don't look like much individually but collectively destroy availability), and the grey zone in between: machines that are powered on, consuming energy, waiting for something, and producing nothing. The textbook definition stops there. The serious definition adds one word: correctly categorised. Uncategorised downtime data is worse than no data, because it looks like information while being misinformation — and decisions get made on it anyway.

I have spent 25 years watching downtime data lie to production management teams. Six Sigma Black Belt at Johnson Controls, global MES and traceability programme for 900+ machines and 750+ users across China, Mexico, Tunisia, Macedonia, France and Russia, Center of Excellence lead for the global MES programme at Visteon, and now Head of Sales at SYMESTIC covering 15,000+ connected machines in 18 countries. Across all of those plants, the single most consistent finding is not that downtime is too high or too low — it is that the categorisation of downtime is systematically wrong, and the decisions built on it are aimed at the wrong problems. I wrote a book about this in 2025 called "OEE: One Number, Many Lies." Downtime categorisation is the most common place the lies start.

The Six Big Losses — the only framework that matters

Serious downtime analysis starts with Nakajima's Six Big Losses framework — the same one that underpins OEE and TPM. It is not one category to shoehorn every stop into; it is six structurally different loss types with different root causes and different countermeasures. Categorising a stop as "mechanical fault" when it is actually a setup overrun is not a data-entry error, it is an analytical error that points improvement work in the wrong direction for years.

| # | Loss | OEE dimension | Typical countermeasure |
|---|---|---|---|
| 1 | Equipment failures / breakdowns | Availability | Reliability-centred maintenance, predictive maintenance, root-cause analysis |
| 2 | Setup and changeover | Availability | SMED, sequencing, work standardisation |
| 3 | Idling and minor stops (< 5 min) | Performance | Root-cause analysis on micro-stop patterns (the most underanalysed category) |
| 4 | Reduced speed / slow cycles | Performance | Cycle-time tracking against design speed, parameter optimisation |
| 5 | Startup / yield losses | Quality | Setup validation, first-piece inspection, parameter recipes |
| 6 | Process defects / in-run scrap | Quality | SPC, in-line quality capture, parameter correlation |

Notice what the framework does: it forces every minute of lost output into a category that maps directly onto a countermeasure. A plant that reports "3,200 minutes of unplanned downtime last month" has a data point. A plant that reports "1,100 minutes of equipment failures, 900 minutes of setup, 600 minutes of micro-stops, 400 minutes of reduced speed, 200 minutes of startup loss" has a roadmap. Same data, completely different decision-making capability. The difference is not better measurement; it is better categorisation of the same measurement.
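The roll-up described above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's data model: the category keys and the `summarise_losses` helper are hypothetical names, and the minutes are the example figures from the paragraph above. Unknown categories are rejected on purpose, so nothing can fall into an "other" bucket.

```python
from collections import defaultdict

# Six Big Losses mapped to their OEE dimension, as in the table above.
# The key names are illustrative assumptions, not a standard vocabulary.
LOSS_DIMENSION = {
    "equipment_failure": "availability",
    "setup_changeover": "availability",
    "minor_stops": "performance",
    "reduced_speed": "performance",
    "startup_loss": "quality",
    "process_defects": "quality",
}

def summarise_losses(events):
    """Sum lost minutes per loss category and per OEE dimension.

    `events` is an iterable of (loss_category, minutes) pairs.
    """
    by_loss = defaultdict(float)
    by_dimension = defaultdict(float)
    for category, minutes in events:
        if category not in LOSS_DIMENSION:
            # No catch-all bucket: an unclassified stop is an error.
            raise ValueError(f"unknown loss category: {category!r}")
        by_loss[category] += minutes
        by_dimension[LOSS_DIMENSION[category]] += minutes
    return dict(by_loss), dict(by_dimension)

# Example figures from the text (minutes lost last month).
events = [
    ("equipment_failure", 1100),
    ("setup_changeover", 900),
    ("minor_stops", 600),
    ("reduced_speed", 400),
    ("startup_loss", 200),
]
by_loss, by_dimension = summarise_losses(events)
total = sum(by_loss.values())
```

With these figures the total is still 3,200 minutes, but it now splits into 2,000 minutes of availability loss, 1,000 of performance loss and 200 of quality loss, each pointing at a different countermeasure.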

Why most downtime data is wrong

Here is the single finding that took me fifteen years to accept and that I now lead every customer conversation with: the headline number is almost always right; the breakdown is almost always wrong. A plant that records 15 % unplanned downtime is probably genuinely running at roughly that level. But the reason distribution underneath that number — the breakdown that tells you where to invest improvement effort — is unreliable in almost every plant I have ever audited. The distortion is not random noise. It is systematic, and it has four specific causes.

The "Unknown" drain. When classifying a stop is harder than ignoring it, operators ignore it. Stops get logged against a default code — "unknown," "other," or whatever is first in the dropdown list — and the real root cause disappears into a bucket that is analytically worthless. In one plant I ran globally at Johnson Controls, 71 % of all stops were classified as "minor mechanical fault" because that reason was first in the list and selecting it took one tap instead of the eighteen seconds needed to drill through the proper classification tree. The data looked clean. Management had been optimising against it for six months. When we rebuilt the terminal interface to surface the top-5 most probable reasons based on machine state and recent patterns — one-tap classification, no drill-down — the reason distribution shifted within a week and the real failure modes became visible for the first time. They had nothing to do with "minor mechanical faults."

| Distortion | Mechanism | What it does to the data |
|---|---|---|
| The default-pick bias | Operators pick whichever reason is first in the dropdown to save time | Entire reason classes are artificially inflated; real causes disappear |
| The lunch-break batching | Micro-stops accumulated over a shift get booked en bloc at the break | Stop frequency is understated by 5–10×; MTBF looks much better than it is |
| The career-preserving category | Stops are classified as whatever is least likely to look bad for the responsible department | "Material waiting" inflates; "equipment failure" deflates |
| The < 5 minute ignore | Micro-stops are not logged at all because they don't feel worth the paperwork | Category 3 losses (idling and minor stops) are structurally invisible, and are often the biggest real loss |

The four distortions are not operator misbehaviour. They are a rational response to a badly designed measurement system. The fix is not training; training degrades within six weeks. The fix is to automate machine-state capture (via OPC UA or digital I/O gateway) so the stop itself is detected mechanically, and make the reason classification so fast — one tap, context-aware top-5 reasons — that the operator has no incentive to game it. That is the only design that produces reliable downtime data at scale. I have seen every other approach tried, and every other approach has failed.
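One simple way to produce a context-aware shortlist is to rank past reasons by how often they were recorded for the same machine state. The sketch below assumes exactly that and nothing more: `top_reasons`, the state names and the reason strings are all hypothetical, and a production system would likely weight recency and other signals as well.

```python
from collections import Counter

def top_reasons(history, machine_state, n=5):
    """Return the n most frequent past reasons for the given machine state.

    `history` is a list of (machine_state, reason) pairs taken from recent,
    already-classified stops. Ties keep first-seen order.
    """
    counts = Counter(
        reason for state, reason in history if state == machine_state
    )
    return [reason for reason, _ in counts.most_common(n)]

# Hypothetical recent history for one machine.
history = [
    ("feeder_alarm", "material jam"),
    ("feeder_alarm", "material jam"),
    ("feeder_alarm", "empty hopper"),
    ("guard_open", "operator intervention"),
    ("feeder_alarm", "sensor dirty"),
    ("feeder_alarm", "empty hopper"),
    ("feeder_alarm", "material jam"),
]
shortlist = top_reasons(history, "feeder_alarm", n=2)
```

The point of the design is that the first button on the terminal is the most probable reason for this stop, not a fixed default, so picking the fastest option no longer distorts the distribution in one fixed direction.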

The MTBF and MTTR discipline

Downtime analysis has two mathematical companions that are frequently reported, frequently misunderstood, and frequently abused: MTBF (Mean Time Between Failures) and MTTR (Mean Time To Repair). Together they tell you two different things about the same phenomenon and lead to two different countermeasures. A plant improving MTBF is working on the reliability problem — making failures happen less often. A plant improving MTTR is working on the response problem — making failures shorter when they do happen. Both matter; they require different capabilities; and the data cost of calculating either correctly is non-trivial without automated machine-state capture.

| Metric | Formula | What it tells you | Typical lever |
|---|---|---|---|
| MTBF | Total operating time / Number of failures | How reliable the asset is: how long it runs between breakdowns | Preventive maintenance, predictive maintenance, design improvements |
| MTTR | Total repair time / Number of failures | How fast the organisation responds when a breakdown happens | Response protocols, spares strategy, maintenance scheduling, operator first-response capability |

The subtle point: MTBF and MTTR are meaningful only if "failure" is consistently defined. A plant that counts every micro-stop as a failure will have dramatically worse MTBF than a plant that counts only stops > 10 minutes. Neither is wrong, but they are not comparable. The first discipline of MTBF/MTTR reporting is therefore a written definition of what counts as a failure for the metric. Without that definition the numbers are meaningless; with it, they become one of the most powerful diagnostic tools in the maintenance toolkit.
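The effect of the failure definition is easy to demonstrate. The sketch below is illustrative only: `mtbf_mttr` is a hypothetical helper, the stop durations are invented, operating time is assumed to be scheduled time minus all stop time, and stop duration is used as a stand-in for repair time.

```python
def mtbf_mttr(stop_minutes, scheduled_minutes, min_failure_minutes=0.0):
    """Compute (MTBF, MTTR) in minutes under an explicit failure definition.

    Only stops longer than `min_failure_minutes` count as failures.
    Assumptions: operating time = scheduled time minus all stop time,
    and the stop duration approximates the repair time.
    """
    failures = [m for m in stop_minutes if m > min_failure_minutes]
    if not failures:
        return None, None  # no failures under this definition
    operating = scheduled_minutes - sum(stop_minutes)
    mtbf = operating / len(failures)
    mttr = sum(failures) / len(failures)
    return mtbf, mttr

# Invented stop durations (minutes) for one 480-minute shift.
stops = [2, 3, 1, 45, 4, 30, 2, 3]
all_stops = mtbf_mttr(stops, 480, min_failure_minutes=0)
long_only = mtbf_mttr(stops, 480, min_failure_minutes=10)
```

On the same shift and the same stops, counting every stop gives an MTBF of 48.75 minutes and an MTTR of 11.25 minutes; counting only stops longer than 10 minutes gives 195 and 37.5. Neither pair is wrong, but they cannot be compared without the written definition.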

What to actually do with downtime data — the improvement hierarchy

Collecting good downtime data is 40 % of the work. The other 60 % is knowing what to do with it. The serious mistake at this stage is to Pareto the data, pick the biggest bar on the chart, and throw a project at it. That approach fails more often than it succeeds — because the biggest bar is frequently something chronic that the organisation has already tried to fix three times, or something genuinely immovable like changeover-time in a high-mix environment. The right sequence of interventions follows a hierarchy, from lowest effort to highest.

| Level | Intervention | Typical impact |
|---|---|---|
| 1. Measurement hygiene | Fix the classification before fixing anything else: automate state capture, restructure the reason tree, eliminate the default-pick bias | 2–4 percentage points OEE improvement with zero process change, purely from seeing reality correctly |
| 2. Micro-stop surfacing | Make Category-3 losses visible; they are almost always the largest real loss and almost always the least analysed | Single-digit to double-digit performance-component improvement |
| 3. Setup-time attack | SMED on the top-3 changeover patterns | 30–50 % changeover time reduction is realistic; translates to 2–5 % availability |
| 4. Breakdown elimination | RCA on the top failure modes, reliability-centred maintenance, predictive maintenance where ROI permits | Slow, compounding: 1–3 % availability per year, sustained over years |
| 5. Design-level intervention | Capital changes to the equipment: automation upgrades, tooling redesign, process replacement | Large but slow; usually the last lever, not the first |

The common mistake is to start at level 4 or 5 while levels 1 and 2 are still broken. No predictive-maintenance algorithm will outperform fixing the reason-classification system, because the algorithm is being trained on the same corrupted labels the operators provided. The hierarchy is not optional; it is the order in which the interventions actually compound. Plants that follow it see 5–10 % OEE improvement in the first year, sustained. Plants that skip to level 4 burn their capital-expenditure budget on projects that cannot pay back because the measurement foundation underneath them is unreliable.

What this looks like at SYMESTIC deployments

Across the SYMESTIC installed base, the downtime-analysis pattern is consistent. Machine-state capture is automated from day one — OPC UA for modern controls, digital I/O gateways for brownfield equipment with no native digital interface, 1–2 hours per machine, no PLC modification, no production interruption. Reason classification uses context-aware top-5 reasons derived from the machine state and the last hour's pattern, delivered as one-tap buttons on the shop floor terminal so the default-pick bias is designed out rather than trained out. Six Big Losses categorisation is the native data model, not a reporting overlay.

The outcomes from the named references tell the same story across industries. Meleghy (automotive, six plants across Germany, Spain, Czech Republic and Hungary) saw 10 % fewer stoppages, 7 % higher output and 5 % higher availability within six months of enterprise MES rollout. Carcoustics (automotive moulding and stamping, 500+ machines across Poland and Germany) landed 4 % fewer stoppages, 3 % higher output and 8 % higher availability in the same six-month window. Klocke (pharma, non-validated packaging, Weingarten) hit 12 % output improvement and 8 % availability improvement within three weeks — because the biggest gain was at level 1 of the improvement hierarchy, not level 4. Neoperl (assembly) reduced stops by 10 %, raised availability by 8 %, cut scrap by 15 % and gained 15 % productivity by correlating PLC alarms with downtime classifications and defect patterns. The capital expenditure across these four deployments was effectively zero; the improvement came from correct categorisation of data that had always been there.

FAQ

What is downtime analysis?
Downtime analysis is the structured study of every minute a production asset is not producing what it should be producing — covering planned stops (maintenance, changeovers), unplanned stops (breakdowns, material starvation), micro-stops (sub-5-minute interruptions), and the grey zone of machines powered on and waiting. It is sometimes called stoppage analysis, idle time analysis, or, in OEE vocabulary, availability loss analysis. The serious version of the discipline is not about measuring downtime; it is about correctly categorising it. Uncategorised downtime data is worse than no data.

What is the Six Big Losses framework?
Nakajima's Six Big Losses is the TPM-derived framework that structures every minute of lost output into six categories: equipment failures, setup and changeover, idling and minor stops, reduced speed, startup losses, and process defects. Losses 1 and 2 hit availability; losses 3 and 4 hit performance; losses 5 and 6 hit quality. The value of the framework is that each loss category maps directly onto a different countermeasure — so the categorisation itself points at where to invest improvement effort. A single bucket of "3,200 minutes of downtime" gives no guidance; the same number split across the six losses produces a roadmap.

Why is most downtime data wrong?
The headline number is usually right; the breakdown underneath it is usually wrong. Four systematic distortions cause it: the default-pick bias (operators choose whichever reason is first in the dropdown to save time), lunch-break batching (micro-stops get logged en bloc at the break, understating frequency), the career-preserving category (stops are classified as whatever is least embarrassing for the responsible department), and the sub-5-minute ignore (micro-stops are not logged at all because the paperwork is not worth it). In one plant I ran at Johnson Controls, 71 % of all stops were classified as "minor mechanical fault" — not because that was the true cause, but because it was first in the dropdown list.

What is the difference between MTBF and MTTR?
MTBF (Mean Time Between Failures) measures reliability — how long an asset runs between breakdowns, calculated as total operating time divided by number of failures. MTTR (Mean Time To Repair) measures response — how fast the organisation recovers when a breakdown happens, calculated as total repair time divided by number of failures. Improving MTBF is a maintenance and engineering problem; improving MTTR is a response-process and spares-strategy problem. Both metrics are meaningful only if "failure" is consistently defined — the first discipline of MTBF/MTTR reporting is a written definition of what counts as a failure.

What are micro-stops and why do they matter?
Micro-stops are stoppages shorter than 5 minutes (Category 3 in the Six Big Losses framework, booked against the performance component of OEE). They are the single most underanalysed loss category in most plants, for two reasons. First, they are frequently not logged at all because manual classification is slower than the stop itself. Second, their individual size makes them look insignificant, even though their cumulative effect is often the largest single loss in the plant. A machine that is stopped for 3 minutes in every 15 loses 20 % of its available time, and almost never in a way that shows up in the traditional downtime report. Surfacing micro-stops through automated machine-state capture is usually the single highest-leverage intervention in a downtime-analysis programme.
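To make the arithmetic concrete, here is a minimal sketch of how micro-stops could be pulled out of an automatically captured run/stop signal. The log format (a time-ordered list of state changes in seconds) and the `split_stops` helper are assumptions for illustration; the 5-minute limit is the one used in this article.

```python
MICRO_STOP_LIMIT_S = 5 * 60  # stops shorter than 5 minutes, per the text

def split_stops(samples, limit_s=MICRO_STOP_LIMIT_S):
    """Split stop durations into (micro_stops, longer_stops), in seconds.

    `samples` is a time-ordered list of (timestamp_s, running) pairs,
    one per state change. A stop still open at the end of the log is
    ignored because its duration is not yet known.
    """
    micro, longer = [], []
    stop_start = None
    for timestamp, running in samples:
        if not running and stop_start is None:
            stop_start = timestamp  # machine just stopped
        elif running and stop_start is not None:
            duration = timestamp - stop_start
            (micro if duration < limit_s else longer).append(duration)
            stop_start = None
    return micro, longer

# One hour of invented data: 12 minutes running, 3 minutes stopped, repeated.
samples = [
    (0, True), (720, False), (900, True), (1620, False), (1800, True),
    (2520, False), (2700, True), (3420, False), (3600, True),
]
micro, longer = split_stops(samples)
lost_fraction = sum(micro) / 3600
```

In this invented hour there are four 180-second micro-stops and no longer stops, so the machine loses 720 of 3,600 seconds, which is the 20 % from the example, without a single stop long enough to reach a manual downtime report.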

What is the right improvement sequence after a downtime analysis?
A five-level hierarchy, starting from the lowest effort and compounding upward. Level 1 is measurement hygiene — fix the classification before fixing anything else, and expect 2–4 percentage points of OEE improvement with zero process change from just seeing reality correctly. Level 2 is micro-stop surfacing — making the invisible Category 3 losses visible. Level 3 is setup-time attack — SMED on the top changeover patterns. Level 4 is breakdown elimination — RCA, RCM, predictive maintenance where the ROI permits. Level 5 is design-level intervention — capital changes to the equipment. Plants that follow the sequence see sustained 5–10 % OEE improvement; plants that skip to level 4 burn their capex on projects that cannot pay back because the measurement foundation is still unreliable.

How does SMED fit into downtime analysis?
SMED (Single-Minute Exchange of Die) is the Shingo-originated methodology for reducing setup and changeover times — directly addressing Category 2 of the Six Big Losses. Its core insight is the distinction between internal setup (work done while the machine is stopped) and external setup (work that can be done while the machine is still running). Moving work from internal to external is usually worth 30–50 % changeover-time reduction on the first pass. SMED only pays back if setup time is actually measured at the cycle level — and that measurement is itself a downtime-analysis output. The two disciplines are tightly coupled: SMED without reliable changeover data is guessing; downtime analysis without SMED as the follow-through on Category 2 is incomplete.

Can predictive maintenance replace downtime analysis?
No. Predictive maintenance is one intervention that downtime analysis feeds, not an alternative to it. It sits at level 4 of the improvement hierarchy, is aimed at Category 1 losses (equipment failures), and depends entirely on correctly labelled failure data to train its models. A plant that deploys predictive maintenance without first fixing the classification system at level 1 is training its algorithm on corrupted labels, and the resulting predictions will be correspondingly unreliable. Predictive maintenance amplifies good downtime analysis; it does not substitute for it.

How does SYMESTIC implement downtime analysis?
Automated machine-state capture from day one — OPC UA for modern controls, digital I/O gateways for brownfield equipment, 1–2 hours per machine with no PLC modification and no production interruption. Reason classification uses context-aware top-5 reasons surfaced on the shop floor terminal as one-tap buttons, so the default-pick bias is designed out of the workflow rather than trained out. Six Big Losses categorisation is the native data model, and MTBF/MTTR are computed fields rather than export-and-join exercises. Across the installed base the typical outcome is 5–10 % OEE improvement in the first 12 months — stacked across the five levels of the improvement hierarchy, not produced by any single intervention. See SYMESTIC Production Metrics.


Related: OEE · Six Big Losses · MTBF · MTTR · SMED · Machine Data Acquisition · Predictive Maintenance · MES · SYMESTIC Production Metrics

About the author
Christian Fieg
Head of Sales at SYMESTIC. 25+ years in manufacturing: maintenance engineer and Six Sigma Black Belt at Johnson Controls, global MES and traceability lead for 900+ machines and 750+ users across China, Mexico, Tunisia, Macedonia, France and Russia, Manager Center of Excellence for the global MES programme at Visteon, Sales Manager MES DACH at iTAC, Senior Sales Manager at Dürr. At SYMESTIC since 2021. Author of "OEE: One Number, Many Lies" (2025).