
Machine Downtime: Categories, MTBF/MTTR & How to Actually Measure It

By Martin Brandel · Last updated: April 2026

What is machine downtime?

Machine downtime is any interval in which a production asset is not producing good parts while it was scheduled to. The definition sounds trivial, and that is exactly why most plants measure it wrong. The word "downtime" is used interchangeably for three very different things — planned stops, unplanned failures and micro-stops below the radar — and each has a different root cause, a different owner and a different fix.

In OEE terms, machine downtime is the biggest single driver of losses in the Availability factor. In practical terms, it is usually the largest hidden cost in a manufacturing plant. I have spent more than 30 years connecting machines to higher-level systems, starting with Simatic S5 in 1991 and arriving at OPC UA and IoT gateways today. The constant over all those years has been this: operators and plant managers systematically underestimate how much their machines actually stand still, because the numbers they work with come from paper, memory or ERP back-flushing, never from the machine itself.

The first hour after a real-time machine connection goes live at a new customer is always the same. The dashboard shows more stops than anyone expected, shorter runs than the work instructions assume, and micro-stops that nobody knew existed. That hour is what this article is really about.

The three categories of downtime

Most plants manage one category competently, tolerate the second, and are completely blind to the third.

| Category | Trigger | Typical duration | Usually visible? |
|---|---|---|---|
| Planned downtime | Changeovers, preventive maintenance, scheduled breaks | Minutes to hours | Yes, scheduled in advance |
| Unplanned downtime | Breakdowns, tool failures, material shortages, quality holds | Minutes to days | Partially, large events only |
| Micro-stops (idling) | Jams, sensor triggers, minor adjustments, waiting for operator | Seconds to a few minutes | Almost never, without automation |

The third category is the one that decides whether a plant's OEE improvement programme succeeds. In typical assembly and packaging operations, micro-stops account for 15–30 % of all availability loss — and in almost every project I have run, the plant's baseline number for micro-stops was either "zero" or "negligible". Both were wrong by an order of magnitude.

Downtime and the Six Big Losses

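Nakajima's Six Big Losses framework maps cleanly onto downtime categories. The breakdown is useful because it turns a vague "we have too much downtime" into a structured problem with specific owners. The table covers the four losses that involve stops or speed; the remaining two, process defects and reduced yield at start-up, are quality losses rather than downtime.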

| Loss | Downtime type | Primary countermeasure |
|---|---|---|
| 1. Equipment breakdowns | Unplanned | TPM, predictive maintenance |
| 2. Setup & adjustments | Planned | SMED, standardised changeovers |
| 3. Minor stops / idling | Micro-stops | Automatic capture, root-cause analysis |
| 4. Reduced speed | Hidden losses (not classical downtime, but performance loss) | Process engineering, condition monitoring |

The key insight: losses 1 and 2 are visible, loss 3 almost never is. Plants that reach world-class OEE are the ones that stopped fighting only the visible losses.

MTBF and MTTR: the two numbers that matter

Two metrics translate machine downtime from an operational nuisance into a quantified engineering problem. Both are derived directly from the timestamps of start-stop-start cycles — which is exactly what a PLC or an IoT gateway records automatically.

MTBF = Total operating time ÷ Number of failures
Mean Time Between Failures — reliability indicator

MTTR = Total downtime ÷ Number of failures
Mean Time To Repair — maintainability indicator

A rising MTBF means the machine runs longer between stops — maintenance and equipment condition are improving. A falling MTTR means each stop is resolved faster — response time, spare-part availability and operator competence are improving. You need both trends moving in the right direction. Improving MTBF while MTTR stagnates means you are buying availability with maintenance overtime; improving MTTR while MTBF deteriorates means you are getting good at firefighting a burning building.
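As a minimal sketch of the two formulas above, the following Python computes MTBF and MTTR from a list of stop intervals. The function name `mtbf_mttr`, the shift window and the three example stops are illustrative assumptions, not data from a real line; every stop in the list is treated as one failure.

```python
from datetime import datetime, timedelta

def mtbf_mttr(stops, window_start, window_end):
    """Return (MTBF, MTTR) as timedeltas for one machine.

    stops: list of (stop_start, stop_end) datetimes inside the window.
           Each entry is counted as one failure (an assumption of this sketch).
    """
    if not stops:
        raise ValueError("no failures in window; MTBF and MTTR are undefined")
    total_downtime = sum((end - start for start, end in stops), timedelta())
    # Operating time = observed window minus the time spent stopped.
    operating_time = (window_end - window_start) - total_downtime
    n_failures = len(stops)
    return operating_time / n_failures, total_downtime / n_failures

# Hypothetical 8-hour shift with three stops of 15, 10 and 35 minutes.
shift_start = datetime(2026, 4, 1, 6, 0)
shift_end = datetime(2026, 4, 1, 14, 0)
stops = [
    (datetime(2026, 4, 1, 7, 10), datetime(2026, 4, 1, 7, 25)),
    (datetime(2026, 4, 1, 9, 40), datetime(2026, 4, 1, 9, 50)),
    (datetime(2026, 4, 1, 12, 5), datetime(2026, 4, 1, 12, 40)),
]
mtbf, mttr = mtbf_mttr(stops, shift_start, shift_end)
print("MTBF:", mtbf, "MTTR:", mttr)
```

With 60 minutes of downtime in a 480-minute shift, the operating time is 420 minutes, so the sketch yields an MTBF of 140 minutes and an MTTR of 20 minutes. Whether planned stops and micro-stops should count as "failures" is a definition each plant has to fix before comparing trends.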

Where downtime really comes from

The canonical list of "causes of downtime" in textbooks is correct but unhelpful. In 30 years of brownfield connectivity projects, the real causes cluster differently:

| Cause cluster | What it looks like in practice | Typical share |
|---|---|---|
| Upstream / downstream starvation | Machine runs fine but waits for material, parts, or the next station | 20–35 % |
| Changeovers & setups | Tool changes, format changes, recipe changes | 10–25 % |
| Mechanical failures | Breakdowns, tool wear, tolerance issues | 15–25 % |
| Quality holds & rework | Line stopped for inspection or correction of defects | 5–15 % |
| Operator-related | Breaks, shift handover, manual intervention, missing operator | 10–20 % |
| IT / control system | PLC faults, network issues, software bugs | 2–8 % |

The surprise for most plant managers is the first row. People assume downtime is dominated by breakdowns, because those are loud. In practice, starvation and upstream/downstream issues usually win — and they are the ones that never end up in the maintenance log.

Why most downtime stays invisible

Paper-based reason codes and end-of-shift logs systematically miss three things. This is the single biggest reason why OEE numbers that do not come from automatic stop capture, including many defended with "our SCADA already has it", are usually 10–20 percentage points too optimistic.

| Blind spot | Why paper can't catch it |
|---|---|
| Micro-stops under 5 minutes | Not worth writing down, but 200 of them per shift kill the day |
| Chronic short failures with the same root cause | Logged separately and never correlated, so the pattern stays invisible |
| Reason codes attributed post hoc | End-of-shift reconstruction is shaped by memory and politics, not data |

In one packaging plant where we installed an IoT gateway on a line that was "well understood" by its maintenance team, the first week revealed 412 stops below 90 seconds — none of which appeared in any existing report. The sum of those micro-stops was larger than the recorded "big" downtime for the same period.
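To make that comparison concrete, here is a small sketch that splits captured stop durations into micro-stops and larger stops and sums each group. The 90-second threshold mirrors the example above; the function name `split_stops` and the sample durations are invented for illustration and are not the plant's data.

```python
def split_stops(durations_s, micro_threshold_s=90):
    """Split stop durations (seconds) into micro-stops and larger stops."""
    micro = [d for d in durations_s if d < micro_threshold_s]
    major = [d for d in durations_s if d >= micro_threshold_s]
    return {
        "micro_count": len(micro),
        "micro_total_s": sum(micro),
        "major_count": len(major),
        "major_total_s": sum(major),
    }

# Hypothetical stop durations in seconds, as an automatic capture might log them.
durations = [12, 45, 30, 600, 20, 75, 1500, 8, 60, 15]
summary = split_stops(durations)
print(summary)
```

In this toy data the two large stops still dominate the total. On a real line the ratio only becomes visible once every short stop is counted, which is the point of the example above.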

How to actually capture downtime — the technical side

The right connection method depends on the machine's age and control system. After hundreds of brownfield integrations, the pattern is clear: there is a suitable option for every machine, no matter how old. "Our machines can't deliver data" is almost never true in 2026.

| Machine type | Connection method | Typical effort per machine |
|---|---|---|
| Modern PLC (S7-1500, TIA, Beckhoff, Rockwell) | OPC UA server, read-only access to alarm and state tags | 2–4 hours |
| Older PLC (S7-300/400, S5) | Edge gateway with protocol adapter, no PLC change | 2–4 hours |
| No PLC / relay-logic machines from the 1980s–90s | Digital-I/O gateway tapping cycle signals and status lamps | 2–4 hours |
| Standalone machines (no network infrastructure) | IoT gateway with GSM/4G uplink, no LAN needed | 2–4 hours |

In every case: no PLC reprogramming, no CE re-certification, no production interruption. That is the non-negotiable rule when connecting brownfield equipment — touch the machine's logic and you inherit it, which no plant wants.
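Whatever the connection method, the capture logic on the gateway side comes down to watching a "running" signal and recording the edges. The sketch below shows one way this could look for a polled boolean signal. The function name `detect_stops`, the sample format and the 2-second minimum are assumptions for illustration; it does not describe any specific gateway's firmware.

```python
def detect_stops(samples, min_stop_s=2.0):
    """Turn polled (timestamp_s, running) samples into stop intervals.

    samples must be sorted by time. A stop begins at the first sample where
    running is False and ends at the next sample where running is True.
    Stops shorter than min_stop_s are dropped, and a stop that is still open
    at the end of the data is not reported.
    """
    stops = []
    stop_start = None
    for t, running in samples:
        if not running and stop_start is None:
            stop_start = t  # falling edge: machine just stopped
        elif running and stop_start is not None:
            if t - stop_start >= min_stop_s:
                stops.append((stop_start, t))
            stop_start = None  # rising edge: machine is running again
    return stops

# Hypothetical signal polled once per second (with gaps where nothing changed).
samples = [
    (0, True), (1, True), (2, False), (3, False), (7, True),
    (8, True), (9, False), (10, True), (11, False), (12, False),
]
stops = detect_stops(samples)
print(stops)
```

The code only reads the signal and never writes back, in line with the read-only rule above. A real implementation would also need to handle signal debouncing and communication gaps, which are left out here.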

Reducing downtime: the right sequence

| Step | Action | Typical reduction in stop time |
|---|---|---|
| 1 | Automatic capture of every stop with timestamp, no interpretation yet | Baseline only, but the baseline is finally real |
| 2 | Classify stops via PLC alarms, not operator input | Reveals Pareto of true causes, often different from what people believed |
| 3 | Attack the top three causes with dedicated Kaizen teams | 20–40 % in 8–12 weeks |
| 4 | Introduce autonomous maintenance on the cleanest line | 10–20 % additional |
| 5 | SMED for the worst changeovers | 30–60 % on targeted changeovers |
| 6 | Predictive maintenance on bottleneck equipment | 5–15 % additional on critical machines |

The order matters. Steps 3–6 without step 1 produce improvement theatre: you optimise the wrong thing, because the baseline was never real. Step 1 alone, without steps 3–6, produces dashboards and no change.
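Step 2 produces a Pareto of causes, and step 3 picks from the top of it. A minimal sketch of that ranking, assuming each captured stop already carries a cause label derived from a PLC alarm, could look like this. The cause names and durations are made up for illustration.

```python
from collections import defaultdict

def pareto(stops):
    """Rank causes by total stop time.

    stops: iterable of (cause, duration_s).
    Returns a list of (cause, total_s, cumulative_share), largest cause first.
    """
    totals = defaultdict(float)
    for cause, duration_s in stops:
        totals[cause] += duration_s
    grand_total = sum(totals.values())
    rows, running_total = [], 0.0
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    for cause, total in ranked:
        running_total += total
        rows.append((cause, total, running_total / grand_total))
    return rows

# Hypothetical classified stops: (cause label, duration in seconds).
events = [
    ("starved_upstream", 300),
    ("jam_infeed", 120),
    ("starved_upstream", 500),
    ("guard_door_open", 80),
    ("jam_infeed", 200),
]
rows = pareto(events)
for cause, total_s, cumulative in rows:
    print(f"{cause:20s} {total_s:7.0f} s  {cumulative:6.1%}")
```

Ranking by total time rather than by count is a design choice; for micro-stops it is often worth producing both views, because a cause with many short stops can rank low by duration and still disrupt the line.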

What an MES does with downtime data

| Without MES | With SYMESTIC MES |
|---|---|
| Operators write reason codes at shift end | Stops detected automatically, PLC alarm attached to each event |
| Micro-stops invisible | Every stop > 2 seconds captured and categorised |
| MTBF / MTTR calculated manually, monthly | Live per machine, per failure mode, with trend |
| Alarm correlation impossible | Alarms tied to downtime events and quality defects |
| Maintenance informed after the stop | Notification within seconds, mobile alert with machine state |

The Neoperl reference case is a representative example: PLC-triggered stop detection, machines documenting their own technical downtime without operator intervention, and correlation of specific alarms with quality defects. Result: 10 % fewer stops, 8 % higher availability, 15 % less scrap. Those numbers are not unusual — they are what a realistic downtime programme produces in the first year, once the measurement is honest.

FAQ

What counts as "downtime" in OEE?
In the strict ISO 22400 and Nakajima definitions, downtime is any loss of scheduled run time due to stops. This explicitly includes planned stops like changeovers and preventive maintenance — they count against Availability even though they are scheduled. Reduced-speed running does not count as downtime; it is a Performance loss. The practical boundary is cleaner: if the machine is not producing good parts while it was supposed to, it is some form of loss — and downtime is the subset where the machine is actually stopped. Plants that exclude "planned" stops from their downtime number are not wrong, but they should stop calling the result "OEE"; it is closer to a Technical Efficiency metric.
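The difference between the two conventions is easy to see in numbers. The sketch below computes availability once with planned stops counted as loss and once with planned stops removed from the reference time first. The function name, the flag and the shift figures are illustrative assumptions, and the second variant is only one common way to express the "technical" view.

```python
def availability(scheduled_min, planned_stop_min, unplanned_stop_min,
                 count_planned_stops=True):
    """Share of the reference time in which the machine was running.

    count_planned_stops=True : planned stops count as loss against the full
                               scheduled time (the OEE-style view in the text).
    count_planned_stops=False: planned stops are taken out of the reference
                               time first, so only unplanned stops count.
    """
    if count_planned_stops:
        reference = scheduled_min
        downtime = planned_stop_min + unplanned_stop_min
    else:
        reference = scheduled_min - planned_stop_min
        downtime = unplanned_stop_min
    if reference <= 0:
        raise ValueError("reference time must be positive")
    return (reference - downtime) / reference

# Hypothetical shift: 480 min scheduled, 40 min planned and 50 min unplanned stops.
oee_view = availability(480, 40, 50)
technical_view = availability(480, 40, 50, count_planned_stops=False)
print(f"planned stops counted: {oee_view:.3f}, excluded: {technical_view:.3f}")
```

For the same shift the first view gives roughly 81 % and the second roughly 89 %. Neither number is wrong, but only one of them should be labelled Availability in an OEE report.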

How small a stop is a "micro-stop"?
Convention varies, but the working definition in TPM circles is any stop below 5 minutes. Some plants tighten that to below 2 minutes. The threshold matters less than the principle: below the threshold, operators are not expected to document the stop manually, which means any reporting depends entirely on automatic capture. In lines with high cycle frequency (packaging, assembly), individual micro-stops of 10–30 seconds can aggregate to 2–3 hours per shift without anyone noticing. The first time a line's true micro-stop total is displayed on a dashboard is usually the moment the improvement programme becomes real.

Do I need to re-certify the machine if I add a downtime gateway?
No — and this is the single biggest fear that blocks brownfield connectivity projects. A gateway that reads signals passively (OPC UA in read-only mode, digital I/O via tap, MQTT subscriber) does not modify the machine's control logic, does not change its safety behaviour and does not alter its CE-relevant functions. CE re-certification is only required when you modify the machine in a way that affects its conformity assessment — which read-only data capture does not. The gateway sits in the OT network as an observer, not an actor. In the hundreds of integrations I have done, no CE case has ever been triggered by downtime capture.

How fast can real-time downtime monitoring go live?
For a first line with a modern PLC, counting from the kick-off workshop: dashboards with live cycle-time, stop detection and OEE are realistic in one to two weeks. For a brownfield line with mixed vintages, plan two to four weeks per line, with the bulk of the time going into signal definition and stop classification, not hardware installation. The Klocke rollout — all packaging lines at Weingarten connected via digital-I/O gateways in three weeks — is representative of what is achievable when the approach is right. The barrier is almost never the technology; it is the organisational decision to start.

What is the ROI of automatic downtime capture?
The honest answer: it depends almost entirely on baseline OEE. A plant running at 80 % OEE has less headroom than one at 55 %. Across the customer base I have worked with, typical first-year results are 5–10 % availability gain, of which roughly half comes from eliminating previously invisible micro-stops and half from faster response to large stops. At a typical machine with €50–150 per hour of production value, the payback on the gateway and MES subscription is usually three to six months — before any deeper TPM or SMED work is even started. The business case is not made by saving maintenance hours; it is made by producing more good parts in the same calendar time.
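A rough payback estimate can be sketched as follows. The availability gain and hourly production value are taken from the ranges above; the one-off cost, the monthly fee and the scheduled hours are purely hypothetical placeholders and are not SYMESTIC prices. The gain is interpreted here as a fraction of scheduled time, which is an assumption.

```python
def payback_months(one_off_cost_eur, monthly_fee_eur, scheduled_hours_per_month,
                   availability_gain, value_per_hour_eur):
    """Months until the one-off cost is recovered by the net monthly benefit.

    availability_gain is a fraction of scheduled time (0.05 = 5 points).
    """
    extra_hours = scheduled_hours_per_month * availability_gain
    net_monthly_benefit = extra_hours * value_per_hour_eur - monthly_fee_eur
    if net_monthly_benefit <= 0:
        raise ValueError("monthly benefit does not cover the running cost")
    return one_off_cost_eur / net_monthly_benefit

# All cost inputs below are hypothetical placeholders for one machine.
months = payback_months(
    one_off_cost_eur=5600,          # assumed hardware and setup cost
    monthly_fee_eur=200,            # assumed subscription fee
    scheduled_hours_per_month=320,  # assumed two-shift operation
    availability_gain=0.05,         # lower end of the 5–10 % range
    value_per_hour_eur=100,         # middle of the €50–150 range
)
print(f"payback after about {months:.1f} months")
```

The model ignores ramp-up time and assumes every recovered hour can actually be sold, so it is best read as an upper bound on the benefit side.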


Related: OEE · TPM · SMED · MDE · BDE · Cycle Time · Alarms · Process Data

About the author
Martin Brandel
MES Consultant at SYMESTIC. 30+ years in industrial automation, from Simatic S5 in 1991 to OPC UA and cloud MES today. Specialist in brownfield machine connectivity without PLC intervention. Dipl.-Ing. Communications Engineering.