
Downtime Monitoring: Detection, Reason Codes & OEE

By Martin Brandel · Last updated: April 2026

What downtime monitoring actually is — and why the hard part is not the reporting, it's deciding that the machine has stopped

Downtime monitoring is the continuous discipline of detecting, classifying, and analysing production stoppages so that availability losses become addressable rather than merely visible in a monthly OEE report. The dashboard layer — "top-10 reasons by duration, MTBF, MTTR, shift comparison" — is the well-understood part of the problem. Every MES vendor can ship that. The part that decides whether the whole system works or quietly corrupts your OEE data for the next three years sits two layers deeper: how does the gateway actually decide, from a stream of PLC signals and sensor pulses, that "the machine has stopped" — and not just that it paused for 1.3 seconds between two cycles, or that the conveyor upstream is empty, or that a proximity sensor glitched for 40 milliseconds?

I have been setting up downtime-detection logic on factory floors since 1991 — first with Simatic S5 and COROS visualisation, today with OPC UA subscriptions and IoT gateways feeding the SYMESTIC Cloud MES. The machines have changed; the category of mistakes hasn't. A downtime-monitoring programme that is built from the dashboard downwards — reason codes first, detection logic later — reliably produces clean-looking numbers that are wrong, and the plant spends the next two years making decisions against data that systematically under- or over-counts the real problem. A programme that is built from the signal upwards — detection logic first, reason codes second, dashboards third — produces numbers that match what a supervisor observes with a clipboard. This article is about how to do the second one.

This piece is the detection-engineering companion to three earlier articles in the glossary trilogy: production data acquisition (the signal extraction layer), the downtime reason catalog (the taxonomy layer, written by Christian), and real-time production data (the management-cadence layer). Downtime monitoring sits in the operational middle of these three — where the signals meet the taxonomy, and where either the foundation holds or everything above it becomes theatre.

The three questions the gateway has to answer before a stoppage event can exist

Every downtime-detection system, regardless of vendor or protocol, answers three questions about the machine at every moment. The quality of the answers determines the quality of the data. The three questions, in the order they matter:

Question	Typical signal	What it answers
Is the machine powered and ready?	Main-switch status, drive enable, PLC heartbeat.	Distinguishes powered-off / maintenance mode from production state.
Is the machine cycling?	Cycle-complete pulse, spindle rotation, motor current, part counter.	Distinguishes running from standing. The core availability signal.
Is the machine producing?	Good-part counter, light curtain at output, weight-cell increment.	Distinguishes useful cycling from running-while-empty (starved). Crucial, routinely forgotten.

Most downtime-monitoring systems ask only the first two. The third ("is the machine actually producing something?") is missed in maybe half the deployments I have inspected. The symptom is the antipattern I call The Starved Blind Spot: the line reports 98 % availability because the injection-moulding machine kept cycling through the shift, but the output counter shows it produced nothing for 20 minutes because upstream material was empty. Availability was nominally 98 %; useful availability, with those 20 minutes taken out of an eight-hour shift, was closer to 94 %. The KPI lies in the operator's favour and the plant takes no corrective action, because the data never surfaces the problem. A mature detection layer treats "starved" as a distinct operating state, not as running, and that single distinction is often worth 3–5 OEE points of clarity on its own.
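The three questions can be sketched as a small classifier. This is a minimal illustration, not a gateway implementation: the `Signals` type, its field names, and the state labels are hypothetical, and it assumes each boolean has already been derived from the raw signals over a suitable time window.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    powered: bool    # question 1: main switch, drive enable, PLC heartbeat
    cycling: bool    # question 2: cycle pulse seen within the expected window
    producing: bool  # question 3: good-part counter incremented in that window


def classify(s: Signals) -> str:
    """Answer the three questions in order and return a coarse state label."""
    if not s.powered:
        return "off"
    if not s.cycling:
        return "stopped"
    if not s.producing:
        return "starved"  # cycling, but nothing useful comes out
    return "running"


# A machine that cycles through the shift with an empty feeder:
print(classify(Signals(powered=True, cycling=True, producing=False)))
```

A system that only asks the first two questions would report the last case as running.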

The debounce dilemma — why your micro-stop count depends on a threshold nobody discusses

When a PLC signal transitions from running to not-running, the gateway has to decide: is this a real stoppage worth recording, or is it the 400-millisecond dwell between two cycles, or a sensor glitch, or the operator nudging the emergency-stop mushroom for a second to check something? The decision is made by a debounce threshold — the minimum duration a "not running" state must persist before the gateway registers it as a stoppage event. This one integer, usually hidden in a gateway configuration file that nobody looks at after go-live, quietly determines your entire downtime statistic.

The antipattern I call The Debounce Dilemma is what happens when the threshold is set uniformly across machines of different types. Set it too short — say, 1 second — and a CNC with routine tool-change dwells generates 3000 spurious "micro-stops" per shift, flooding the reason-prompt dialog and training operators to dismiss everything without classifying it. Set it too long — say, 30 seconds — and a high-speed press with 60 real 8-second jams per shift shows zero stoppages, because each jam self-clears below the threshold. The first error over-reports the problem; the second error under-reports it; both are equally corrosive to decisions.
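To make the effect of the threshold concrete, here is a deliberately simple sketch that turns a list of cycle-pulse timestamps into stoppage events. The function name and the sample traces are invented for illustration; a real gateway works on a live stream and usually subtracts the normal cycle time from the reported duration.

```python
def detect_stoppages(pulse_times, debounce_s):
    """Return (start_s, gap_s) for every gap between pulses longer than debounce_s."""
    events = []
    for prev, cur in zip(pulse_times, pulse_times[1:]):
        gap = cur - prev
        if gap > debounce_s:
            events.append((prev, gap))
    return events


# Hypothetical high-speed press: 0.5 s cycles with one real 8 s jam.
press = [0.0, 0.5, 1.0, 1.5, 9.5, 10.0, 10.5]
# Hypothetical machine with a routine 2 s dwell between cycles.
dwell_machine = [0.0, 2.0, 4.0, 6.0, 8.0]

print(detect_stoppages(press, debounce_s=30.0))              # jam is invisible
print(detect_stoppages(press, debounce_s=3.0))               # jam is recorded
print(len(detect_stoppages(dwell_machine, debounce_s=1.0)))  # every dwell counted
```

The same code with the same threshold under-counts on one trace and over-counts on the other, which is the whole dilemma in a few lines.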

Machine class	Typical cycle dwell	Recommended debounce	Rationale
High-speed press / stamping	0.3–1 s	3–5 s	Cycle dwell is short; real jams last several seconds.
Injection moulding	15–60 s	20–40 s	Cycle time itself is long; threshold must exceed the longest normal cycle.
CNC machining centre	Variable, 30 s – 15 min	Use "end-of-program" + timeout (e.g. 3 min after M30)	Pure time debounce fails; needs state-based logic keyed off M-codes.
Assembly line (cycle-locked)	Fixed takt, 20–120 s	1.5× takt	If next cycle doesn't fire within 1.5 × takt, the line has stopped.
Packaging line	1–5 s per pack	5–10 s	Short cycles; product-change dwells need separate handling.

The rule I teach every customer I onboard: debounce is a per-machine-class parameter, not a platform setting. The gateway configuration must allow it. If your vendor's system only lets you set one global value, your micro-stop data will be wrong in one direction or the other for half your machines within the first week, and neither you nor the operators will know which half.
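One way to express the table as configuration is sketched below. The dictionary layout, the key names, and the helper `debounce_seconds` are assumptions made for this example, and the concrete values are simply picked from inside the ranges in the table; real values come from observing each machine class.

```python
# Starting points taken from the ranges in the table above (illustrative only).
DEBOUNCE_RULES = {
    "high_speed_press":   {"mode": "time", "seconds": 4.0},    # table: 3-5 s
    "injection_moulding": {"mode": "time", "seconds": 30.0},   # table: 20-40 s
    "packaging_line":     {"mode": "time", "seconds": 7.0},    # table: 5-10 s
    "assembly_line":      {"mode": "takt", "factor": 1.5},     # 1.5 x takt
    "cnc":                {"mode": "state", "timeout_s": 180.0},  # 3 min after M30
}


def debounce_seconds(machine_class, takt_s=None):
    """Resolve the effective threshold; unknown classes fail loudly on purpose."""
    rule = DEBOUNCE_RULES[machine_class]  # KeyError: no silent global default
    if rule["mode"] == "time":
        return rule["seconds"]
    if rule["mode"] == "takt":
        if takt_s is None:
            raise ValueError("takt_s is required for takt-based debounce")
        return rule["factor"] * takt_s
    # "state" mode: the timeout only starts once end-of-program has been seen.
    return rule["timeout_s"]


print(debounce_seconds("assembly_line", takt_s=60.0))
```

Raising on an unknown machine class is a design choice in this sketch: it forces someone to decide a value instead of inheriting a default.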

Short-stop, long-stop, and the classical Six Big Losses boundary

Once a stoppage event exists, the second structural decision is how to classify it by duration. The classical reference here is Seiichi Nakajima's Six Big Losses framework from the 1980s, which is still the most useful lens I know for operational prioritisation. It splits availability losses into two disjoint categories with different root causes, different owners, and different improvement strategies:

Class	Typical duration	Typical root causes	Owner
Short stops (small stops, idling)	< 5 minutes	Jams, misfeeds, sensor trips, minor clearing tasks. Self-resolved by operator, no maintenance ticket.	Operator / process engineer — stability, SOP, design.
Long stops (breakdowns)	≥ 5 minutes	Component failure, breakdown, material shortage requiring logistics, tooling breakage.	Maintenance / supply chain — MTTR, spare-parts, preventive.

The five-minute threshold is the most common industry convention and works well for most discrete manufacturing; some organisations use 10 minutes for slower processes, and that's fine as long as it is applied consistently and explicitly. The mistake to avoid — and I have seen it at several mid-market plants — is treating every stoppage as a single pool without the duration split, which mixes two problems that need different interventions into one Pareto. A Pareto of all stoppages will show "jam" at the top every time; the short/long split reveals that the top long-stop is a recurring servo-drive failure that nobody addresses because it is buried under ten thousand one-second jams on the same chart.
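The duration split is easy to apply before building the Pareto. The sketch below assumes events arrive as `(reason, duration_s)` pairs; the function name and the sample data are hypothetical.

```python
from collections import defaultdict

LONG_STOP_S = 5 * 60  # the five-minute convention described above


def pareto_by_class(events):
    """Total duration per reason, kept separate for short and long stops."""
    totals = {"short": defaultdict(float), "long": defaultdict(float)}
    for reason, duration_s in events:
        cls = "long" if duration_s >= LONG_STOP_S else "short"
        totals[cls][reason] += duration_s
    return {
        cls: sorted(by_reason.items(), key=lambda kv: kv[1], reverse=True)
        for cls, by_reason in totals.items()
    }


# Invented shift: many brief jams plus a few long stops.
events = [("jam", 20)] * 100 + [
    ("servo-drive failure", 1800),
    ("servo-drive failure", 2400),
    ("material shortage", 600),
]
result = pareto_by_class(events)
print(result["short"])
print(result["long"])
```

Ranked by event count in a single pool, "jam" would dominate; split by class, the servo-drive failure leads the long-stop list.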

The five states every mature detection system tracks, and why two is not enough

A weak downtime system tracks two states: running and stopped. A competent one tracks five. The distinction matters because each of the additional three states points to a different corrective action, and each requires different upstream signals to detect:

State	Definition	Detection signal	OEE treatment
Running	Cycling and producing.	Cycle pulse + part-counter incrementing.	Uptime.
Stopped (downtime)	Machine capable of running but not cycling. Operator or maintenance required.	No cycle pulse for > debounce window, power on.	Availability loss.
Starved	Machine cycling but upstream empty — no material to process. Not the machine's problem.	Cycle pulse present, part-counter not incrementing, or upstream-empty signal from MES/PLC.	Categorised as organisational/supply loss, not machine loss.
Blocked	Machine ready to cycle but downstream full — can't hand off. Also not the machine's fault.	Ready-to-cycle signal + downstream-full signal.	Categorised as organisational/flow loss, not machine loss.
Changeover / planned	Setup, cleaning, maintenance. Planned, not counted as availability loss.	Explicit state code from MES or operator start/stop event.	Excluded from availability (but tracked separately for SMED).

The starved-and-blocked distinction is what lets a plant fix the right problem. A press that shows 74 % availability looks like a machine problem; if the system decomposes that into 88 % genuinely-available and 14 % starved-by-upstream, the intervention moves from maintenance to material-supply logistics — two entirely different budgets, teams, and timelines. Plants without this distinction routinely overhaul machines that were never the bottleneck, while the actual constraint — a paint-drying oven three stations upstream — stays invisible for years.
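The decomposition in the press example is plain arithmetic over the time spent in each state. The following sketch assumes the timeline has already been aggregated into minutes per state; the function name, the dictionary keys, and the sample minutes are illustrative and chosen only to reproduce the 74 % / 88 % / 14 % figures.

```python
def availability_breakdown(minutes_by_state):
    """Split observed availability into machine-related and external losses."""
    # Planned time (changeover, cleaning) is excluded from the denominator.
    scheduled = sum(m for state, m in minutes_by_state.items() if state != "planned")
    running = minutes_by_state.get("running", 0)
    starved = minutes_by_state.get("starved", 0)
    blocked = minutes_by_state.get("blocked", 0)
    return {
        "observed": running / scheduled,
        "machine": (running + starved + blocked) / scheduled,
        "starved": starved / scheduled,
        "blocked": blocked / scheduled,
    }


shift = {"running": 370, "stopped": 60, "starved": 70, "blocked": 0, "planned": 20}
for name, share in availability_breakdown(shift).items():
    print(f"{name}: {share:.0%}")
```

Here "machine" is the share of scheduled time in which the machine itself was able to run, whether or not material was available.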

The cascade illusion — how one stop becomes six in your dashboard

On a cycle-linked assembly line, stopping station 4 stops stations 5, 6, and 7 within seconds (starved downstream) and stations 1, 2, and 3 within minutes (blocked upstream). A naive detection system records seven stoppage events — one real, six spurious. The plant's downtime Pareto shows seven machines with availability problems when the truth is that only one station had a failure and the other six were obediently idle because the line's control logic stopped them. This is the pattern I call The Cascade Illusion, and it is the single most common cause of misdirected maintenance effort I have seen in connected lines.

The corrective architecture at the detection layer has two parts:

Line-context awareness at the gateway. Each station's gateway knows the identity of the stations immediately upstream and downstream and subscribes to their run/stop signals. When station 4 stops, the gateway at station 5 immediately reclassifies its own stoppage as "starved by station 4" rather than as an independent stoppage. The event is still recorded, but it is tagged as secondary, not as a primary fault.
Primary-fault attribution at the cell/line level. A higher-level service on the platform correlates concurrent stoppages across a connected line, identifies the station with the earliest stop timestamp, and declares it the primary. All other stoppages within the causal window are marked as cascade events. The top-10 Pareto by primary stoppage now shows six real problems instead of thirty.

Without this correction, the dashboards are loud and wrong; with it, they are quiet and actionable. I have seen plants cut their reported daily stoppage count by 60 % in a single configuration change — not because fewer stops occurred, but because the same stops stopped being counted six times each.
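The primary-fault attribution step can be sketched as a timestamp correlation. This is a simplified illustration: the `Stop` type, the function name, and the 300-second causal window are assumptions, and a real service would also check line topology rather than rely on time alone.

```python
from dataclasses import dataclass


@dataclass
class Stop:
    station: int
    start_s: float  # stop timestamp, seconds since shift start


def attribute_primary(stops, causal_window_s):
    """Tag the earliest stop in each causal window as primary, the rest as cascade."""
    tagged = []
    primary = None
    for stop in sorted(stops, key=lambda s: s.start_s):
        if primary is None or stop.start_s - primary.start_s > causal_window_s:
            primary = stop
            tagged.append((stop.station, "primary", None))
        else:
            tagged.append((stop.station, "cascade", primary.station))
    return tagged


# Station 4 fails; downstream stations starve within seconds,
# upstream stations block within minutes.
line = [Stop(4, 0), Stop(5, 3), Stop(6, 6), Stop(7, 9),
        Stop(3, 60), Stop(2, 120), Stop(1, 180)]
for row in attribute_primary(line, causal_window_s=300):
    print(row)
```

All seven events are kept; only one of them feeds the primary-stoppage Pareto.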

Phantom stops, silent idles, and sensor hygiene — the physics layer nobody wants to talk about

Two failure modes at the physical-signal layer corrupt downtime data in opposite directions, and both require deliberate engineering at the gateway to suppress. I have named them for operational clarity because they need to be discussed by name in commissioning reviews:

The Phantom Stop. A proximity sensor misses a cycle pulse because a chip deflects the beam for 80 milliseconds, or a light curtain registers a false break because of ambient light at a specific sun angle, or an encoder line is picking up EMI from a welder three stations over. The PLC briefly sees "not running," the gateway registers a stoppage event, the operator is prompted for a reason, the machine actually never stopped, and the classification is guessed — usually as "minor jam" because nobody wants to admit they don't know. Ten of these per shift and the Pareto is dominated by a problem that does not exist. Countermeasure: require the not-running state to persist across at least two missed expected cycles, not just one debounce window — this filters instantaneous sensor noise without missing real stoppages.
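The two-missed-cycles rule can be sketched as a check on elapsed time since the last pulse. The function and parameter names are hypothetical, and the sketch assumes the expected cycle time is known and roughly constant for the machine.

```python
def is_confirmed_stop(last_pulse_s, now_s, expected_cycle_s, debounce_s, min_missed=2):
    """Confirm a stop only after the debounce window AND enough missed cycles."""
    elapsed = now_s - last_pulse_s
    missed_cycles = int(elapsed // expected_cycle_s)
    return elapsed >= debounce_s and missed_cycles >= min_missed


# Machine with a 1 s cycle. A single lost pulse leaves a gap just under 2 s.
print(is_confirmed_stop(10.0, 11.9, expected_cycle_s=1.0, debounce_s=1.5))
# A real stop keeps missing pulses.
print(is_confirmed_stop(10.0, 12.5, expected_cycle_s=1.0, debounce_s=1.5))
```

With a time-only debounce of 1.5 s, the first case would already have produced an event and an operator prompt.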

The Silent Idle. A machine is sitting for 40 minutes with power on and no operator because lunch break overran; no cycles, no alarms, no PLC fault, no reason code ever entered. The gateway records 40 minutes of unclassified downtime; the dashboard shows a reason code of "unknown" or, worse, the event disappears into a catch-all bucket. When the shift ends and the production manager reviews the day, he sees respectable availability and has no signal that a recurring scheduling problem eats 40 minutes per shift. Countermeasure: any stoppage exceeding a time threshold (say, 15 minutes) without an operator classification must trigger an escalation (visual alarm at the line, andon light, notification to the shift supervisor) until a reason code is entered or the machine starts running again. No silent 40-minute gaps. Ever.
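The escalation rule reduces to a small predicate that the gateway or platform evaluates periodically for every open stoppage. The names below are hypothetical; the 15-minute value is the example threshold from the text.

```python
ESCALATION_S = 15 * 60  # example threshold from the text


def needs_escalation(stop_start_s, now_s, reason_code, running):
    """True while a stop is unclassified, still open, and older than the threshold."""
    if running or reason_code is not None:
        return False
    return now_s - stop_start_s > ESCALATION_S


print(needs_escalation(0, 16 * 60, reason_code=None, running=False))
print(needs_escalation(0, 16 * 60, reason_code="break overrun", running=False))
```

Because the predicate turns false as soon as a reason is entered or the machine restarts, the andon and the supervisor notification clear themselves.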

These are not software-only problems; they are physical-world problems that must be solved at the signal layer by someone who has actually been in the maintenance shop and understands why a proximity sensor behaves differently at 40 °C than at 25 °C. Every downtime-monitoring implementation I have done in 35 years has required a commissioning pass where we watch the signals for 48 hours on the shop floor and tune thresholds against observed behaviour — never against a specification sheet. Vendors who promise "plug-and-play" downtime detection without this commissioning pass are selling you a dashboard, not a monitoring system.

The reason-code tax — why operator-side UX determines whether your data is worth using

Every downtime event that cannot be auto-classified from the signal stream alone arrives at the operator as a prompt: "the machine has been stopped for X seconds, please tell us why." That prompt is the make-or-break interface of the entire monitoring system. Do it badly and the data is garbage within two weeks; do it well and the plant keeps classifying reliably five years in. The antipattern I call The Reason-Code Tax is the sum of every second the operator spends classifying instead of running the line, and it is what kills reason-code quality when a system is designed by people who have never stood at a machine for eight hours.

The six rules I impose on every operator-prompt UI we commission, drawn from 35 years of watching operators use — and systematically sabotage — badly designed ones:

#	Rule	Why it matters
1	No prompt under 60 seconds. Short stops self-classify as "micro-stop / unclassified"; only stoppages beyond a per-machine threshold trigger operator input.	If you prompt on every 3-second dwell, operators dismiss everything. You lose the classification on the 4-minute stops that actually matter.
2	Two-touch maximum. Level-1 category (8 options) on first touch; Level-2 reason (3–6 options within category) on second. Never more.	Three-touch menus get abandoned. Operators pick whatever is on the first screen and move on. The tree depth is your data quality.
3	Most-recent-first ordering. The operator's last-used reason sits at position 1 on the next prompt.	Recurring problems get classified consistently when the UI remembers. Random ordering makes operators pick whatever they see first.
4	20-second default. If the operator hasn't classified within 20 s of the machine restarting, the event is auto-tagged "unclassified — pending" and flagged for shift-supervisor resolution.	Reason codes must never block the operator from starting the line again. If they do, the line starts anyway and the code is lost forever. The 20-second pending bucket preserves the event for later correction.
5	Machine-aware code list. Each machine class sees only its applicable reason codes — the press operator doesn't see "conveyor jam", the packaging operator doesn't see "tool wear".	A flat plant-wide list of 80 codes forces operators to scroll. Per-machine filtered lists of 6–12 codes are classified accurately in under 3 seconds.
6	Auto-classification where the signal allows. PLC fault codes (e.g. "servo-drive-overtemp") map directly to reason codes without operator input. Operator only classifies what the PLC cannot determine.	Every stoppage the PLC can identify itself should be classified automatically. Operators then only handle the ambiguous cases — which is where their context is actually valuable.

The positive pattern I call The 20-Second Prompt — rule 4 above — is the most counter-intuitive of the six to people who design these systems from behind a desk, and it is the single largest contributor to operator cooperation that I have observed in the field. The instinct of a software engineer is to treat the missing reason code as a blocking error; the reality of a shop floor is that the operator has a line to restart in 90 seconds and will enter whatever makes the dialog go away if you force a choice. Giving them the explicit permission to defer the classification — with an audit trail that routes unresolved events to the shift supervisor at shift end — is what produces a 95 %-classified dataset instead of a 40 %-classified one.
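Rules 1, 3, and 4 translate into very little logic. The sketch below is one possible reading, with hypothetical function names and return labels; the 60-second and 20-second values are the ones from the table, and in practice the prompt threshold would be set per machine.

```python
PROMPT_MIN_S = 60           # rule 1: no prompt for stops shorter than this
DEFER_AFTER_RESTART_S = 20  # rule 4: defer instead of blocking the restart


def prompt_action(stop_duration_s, seconds_since_restart, reason_code):
    """Decide what the operator UI does with one stoppage event."""
    if reason_code is not None:
        return "classified"
    if stop_duration_s < PROMPT_MIN_S:
        return "auto: micro-stop / unclassified"
    if seconds_since_restart is not None and seconds_since_restart >= DEFER_AFTER_RESTART_S:
        return "unclassified - pending"  # routed to the shift supervisor
    return "prompt"


def order_reasons(reasons, last_used):
    """Rule 3: put the operator's last-used reason first, keep the rest in order."""
    if last_used in reasons:
        return [last_used] + [r for r in reasons if r != last_used]
    return list(reasons)


print(prompt_action(240, seconds_since_restart=25, reason_code=None))
print(order_reasons(["jam", "misfeed", "tool wear"], last_used="misfeed"))
```

Note that no branch ever prevents the machine from restarting; the worst case is an event parked in the pending bucket.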

From a mid-market stamping retrofit, 2015: The customer was a family-owned German stamping operation with about 50 employees and a mixed machine park: six hydraulic presses from the early 1990s, four servo-driven progressive stamping lines from the 2000s, and two new high-speed blanking lines installed the year we arrived. Twelve machines total, and the mandate was to bring them all onto a unified downtime-monitoring dashboard within six weeks. The customer had signed off on a downtime-detection configuration sheet that specified "5 seconds debounce across all machines" because that was what the vendor had recommended as a sensible default. Two weeks after go-live, I was back on site because the plant manager was convinced the system was broken. He had two specific complaints. The first: the old hydraulic press in station 3 was producing 800 "micro-stops" per shift on the dashboard, which everyone on the shop floor agreed was physically impossible, because the press made maybe 800 cycles per shift in total. The second: the high-speed blanking line was showing zero stoppages for an entire week despite maintenance having logged two bearing alarms during that time. Both symptoms had the same root cause and neither was a software fault. The hydraulic press had a cycle dwell of 8–12 seconds at its current stroke rate. The 5-second debounce meant that every single production cycle was being counted as a "stop followed by a start," because the 8–12-second dwell between cycles always exceeded 5 seconds. The press was, from the gateway's perspective, stopping 800 times per shift, which was technically true and operationally meaningless. The high-speed blanking line, meanwhile, had real jams that lasted 2–4 seconds each; every one of them self-cleared before the 5-second debounce threshold elapsed, so none registered as an event, and the maintenance alarms in the PLC log had no corresponding stoppage events in the MES.
Two machines, same setting, opposite failure modes, both invisible until we sat on the shop floor for a day with a laptop and watched the signal traces. The fix was to set the hydraulic press to a 20-second debounce (longer than its normal cycle) and the blanking line to a 2-second debounce (shorter than its shortest real jam). Within three days the dashboards went from unusable to usable, and the plant manager's trust in the data came back. The lesson I have carried with me from that project, and from the several dozen similar cases since, is that there is no such thing as a "default" debounce setting that works across a real factory's machine park. It is a per-machine-class configuration parameter that must be set by someone who has watched the signal behave for long enough to know what "normal" looks like — which is almost always at least 24 hours of observation, sometimes a week for machines with seasonal or product-mix variation. Vendors who tell you their system is "plug-and-play" for downtime detection are either not doing this work or doing it badly. Every plant I have commissioned since that 2015 project gets a deliberate 48-hour signal-observation phase before any dashboard goes live, and the number of angry phone calls two weeks after go-live has dropped from routine to zero.

Downtime events as the feedstock for the rest of the MES

A downtime event produced cleanly at the detection layer is not the end of the chain; it is the beginning of four downstream workflows that the MES orchestrates automatically once the foundation is solid. This is what separates a monitoring system from a control instrument:

Downstream workflow	What the downtime event triggers
OEE calculation	Unplanned stoppage durations feed the availability component; planned stoppages are excluded; starved/blocked states reroute to supply/flow loss buckets.
Alarm management	Stoppages exceeding configured thresholds (duration, frequency, recurrence) escalate to shift supervisor, maintenance, and — for critical lines — plant management.
CMMS / maintenance ticketing	Long stops with maintenance-related reason codes auto-generate a CMMS ticket, pre-populated with machine, timestamp, PLC fault code, and the operator's context note. No manual ticket creation.
Shift log / handover	Unresolved stoppages from the outgoing shift appear in the incoming shift's handover view with context — preventing the recurring "nobody told me" pattern at 06:00.

A downtime-monitoring system that feeds all four of these downstream workflows automatically is a platform; a system that only feeds a dashboard is a report. The difference in value is roughly an order of magnitude, and it is entirely a function of whether the event at the detection layer is clean, typed, and attributable.
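The four workflows in the table can be pictured as a routing function over a typed event. Everything in this sketch is an assumption for illustration: the event fields, the route labels, and the 15-minute alarm threshold are invented, and a real platform would make each of them configurable.

```python
LONG_STOP_S = 5 * 60
ALARM_S = 15 * 60  # assumed alarm threshold, configurable in practice


def route_event(event):
    """Map one downtime event to the downstream workflows it should trigger."""
    state = event["state"]
    routes = {}
    if state == "planned":
        routes["oee"] = "excluded from availability"
    elif state in ("starved", "blocked"):
        routes["oee"] = "supply/flow loss"
    else:
        routes["oee"] = "availability loss"
    if state != "planned" and event["duration_s"] >= ALARM_S:
        routes["alarm"] = "escalate to shift supervisor"
    if (state == "stopped" and event["duration_s"] >= LONG_STOP_S
            and event.get("reason_category") == "maintenance"):
        routes["cmms"] = "create ticket"
    if not event.get("resolved", False):
        routes["handover"] = "show in incoming shift's view"
    return routes


breakdown = {"state": "stopped", "duration_s": 1800,
             "reason_category": "maintenance", "resolved": False}
print(route_event(breakdown))
```

The routing only works if the event already carries a state, a duration, and a reason category, which is why the quality of the detection layer decides what the rest of the MES can automate.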

The six antipatterns of downtime monitoring — and the detection-layer discipline that prevents each

Antipattern	Symptom	Detection-layer discipline
The Debounce Dilemma	Uniform debounce across heterogeneous machines; over-counts some, under-counts others.	Per-machine-class debounce, set from 48-hour signal observation, not from spec sheet.
The Starved Blind Spot	Availability looks high because the machine cycles; output shows otherwise because material is missing.	Ask all three gateway questions: powered, cycling, producing. Treat starved and blocked as distinct states.
The Cascade Illusion	One upstream stop registers as six downstream stops; Pareto is dominated by secondary effects.	Line-context aware gateways; primary-fault attribution at the cell layer; cascade events tagged, not counted as primary.
The Phantom Stop	Sensor glitches and EMI register as stoppages that never physically occurred; reason codes guessed.	Not-running state must persist across ≥ 2 missed expected cycles, not just one debounce window.
The Silent Idle	Machine sits for 40 minutes between shifts or during breaks; no operator present; event disappears into "unknown".	Escalation timer: any stop > 15 min without classification triggers andon + supervisor notification until resolved.
The Reason-Code Tax	Operator-prompt UX is too demanding; operators dismiss or guess; data quality collapses in weeks.	The six UX rules above — short-stop auto-classification, two-touch max, machine-filtered lists, 20-second deferrable default.

FAQ

What is downtime monitoring in one sentence?
Downtime monitoring is the continuous discipline of detecting, classifying, and analysing production stoppages so that availability losses become addressable — combining a detection layer (signal-to-event logic at the gateway), a classification layer (reason codes entered or auto-attributed), and an analysis layer (dashboards, alarms, maintenance triggers) into a single operational instrument.

How does downtime monitoring differ from the downtime reason catalog?
The reason catalog is the taxonomy — the list of categories and codes a stoppage can be classified against (covered in the companion article). Downtime monitoring is the running programme that applies that taxonomy in real time to observed signals. You need both: a catalog without monitoring is a binder on a shelf; monitoring without a catalog is free-text chaos.

How does downtime monitoring differ from machine data acquisition?
Machine data acquisition (covered in the companion article) is the general discipline of extracting signals from machines. Downtime monitoring is a specific application of those signals — the state-machine logic that turns a stream of cycle pulses and run/stop bits into typed stoppage events with durations, reasons, and primary-fault attribution. Acquisition is a prerequisite; monitoring is what you build on top.

What debounce threshold should I use?
Per-machine-class, never global. High-speed presses: 3–5 seconds. Injection moulding: 20–40 seconds. Assembly lines: 1.5× takt. CNC: state-based with M-code logic, not pure time. The final values must be tuned from 48 hours of observed signal behaviour on each machine class — not from the vendor's default. Plants that skip this step get bad data within the first two weeks of go-live, and they discover it months later when a decision is made against it.

What is The Starved Blind Spot?
The failure mode in which a machine cycles at full rate but produces nothing because upstream material is empty — and the monitoring system reports high availability because it only tracks "cycling" and not "producing." Prevented by adding the third gateway question ("is the machine producing?") and treating starved and blocked as distinct OEE loss categories separate from machine downtime.

What is The Cascade Illusion?
The pattern in which one upstream stop on a connected line registers as multiple downstream stops — because each downstream station also stops running when its predecessor stops. A naive system counts them all as independent stoppages; a correctly built system tags them as secondary-cascade events and attributes the primary fault to the originating station. The correction typically reduces reported stoppage counts by 40–60 % on interconnected lines without any change in actual machine behaviour.

What is The Reason-Code Tax?
The operational cost of the operator-side UX for classifying stoppages — every second the operator spends classifying is a second the line isn't running. A badly designed prompt UI drives operators to dismiss or guess, which corrupts the data within weeks. A well-designed prompt UI (short-stop auto-classification, two-touch maximum, machine-filtered code lists, 20-second deferrable default) produces reliable classification over years. UX discipline, not classifier sophistication, is what separates useful downtime data from garbage.

What is the difference between a short stop and a long stop?
The classical Six Big Losses framework separates short stops (under 5 minutes — jams, misfeeds, minor clearing tasks, usually self-resolved by the operator) from long stops (5 minutes or more — breakdowns, component failures, material shortages requiring logistics intervention). The split matters because the two classes have different root causes, different owners, and different improvement strategies: short stops point to process stability and design; long stops point to MTTR, spare-parts, and preventive maintenance. A Pareto that doesn't split by duration hides one of the two problems.

How does downtime monitoring connect to OEE?
Unplanned stoppage durations feed the availability component of OEE. Planned stoppages are excluded from the availability denominator. Starved and blocked states reroute to organisational/flow loss buckets, not to machine downtime, so that the machine's availability isn't penalised for problems upstream or downstream of it. Downtime monitoring done well makes OEE honest; done poorly, it makes OEE plausible — which is worse, because the number looks right but the decisions made against it are wrong.

Do I need a separate system for downtime monitoring, or should it be part of the MES?
Part of the MES. Standalone downtime tools exist and work at the detection layer, but they don't feed OEE, alarms, CMMS, or shift log automatically — each of which is where most of the operational value lives. Plants that start with a standalone downtime tool typically end up migrating to an integrated cloud-MES within 18–24 months. Starting with the integrated platform is the cheaper and faster path, particularly for mid-market manufacturers where the IT team that would integrate separate systems doesn't have the capacity.

Can downtime monitoring work on brownfield machines without OPC UA?
Yes, and the majority of the installations I commission are exactly this. Cycle pulses from 24 V digital I/O, part counters from existing light curtains, motor-run signals from contactor auxiliaries, PLC alarms from discrete output bits in the control cabinet: all of these work as detection inputs feeding into an IoT gateway. No PLC reprogramming required, no production interruption required, no vendor of the original machine involved. A 1995 mechanical press is as monitorable as a 2025 servo-press; the input signals are simpler, the gateway logic is identical.

About the author

Martin Brandel

MES Consultant & Automation Expert at SYMESTIC. Dipl.-Ing. Nachrichtentechnik. 35+ years connecting machines to higher-level systems: since 1991 with Simatic S5 and COROS visualisation, today with OPC UA, IoT gateways, and cloud-MES. Started as automation engineer at an engineering bureau (warehouse management, material-flow control, plant control with Simatic S5); then Hermos AG (1995–2000) with large projects in Eastern Europe and China (conveyor systems, process engineering, paint shops); at SYMESTIC since 2000, building and leading the automation department for 11 years (software standards for process plants in beverage and wood industries, Simatic S5 → S7/TIA retrofit projects); since 2019 MES Consultant and project lead, accompanying manufacturing customers from first enquiry to go-live. Expertise: machine data acquisition (MDE), operator data acquisition (BDE), brownfield machine integration, PLC programming (Simatic S5, S7, TIA Portal), retrofit, OPC UA, IoT-gateway integration, downtime-detection engineering, process control systems, material-flow control, industrial automation, MES project management, process engineering, CE conformity.
