
Failure Rate in Manufacturing: MTBF, MTTR, OEE and MES Data

By Martin Brandel · Last updated: March 2026

What Is the Failure Rate in Manufacturing?

The failure rate in manufacturing describes how frequently a machine, production line, or product fails within a given time period or production volume. It is a fundamental metric in reliability engineering and production management.

In practice, the term "failure rate" is used for two different things in a factory:

| Type | What it measures | Unit | OEE impact |
|---|---|---|---|
| Machine failure rate | How often a machine stops due to an unplanned technical failure. The frequency of breakdowns per operating hour or per production shift. | Failures per hour, failures per shift, or expressed as MTBF (Mean Time Between Failures) in hours. | Directly reduces Availability. Every unplanned stop is lost production time. |
| Product failure rate (scrap rate) | How many products fail to meet quality requirements. The percentage of defective parts (scrap + rework) relative to total parts produced. | Percentage (%), parts per million (ppm), or defects per million opportunities (DPMO). | Directly reduces Quality. Every defective part is wasted material, energy, and machine time. |

Both types of failure rate are critical for production performance, and both are captured by an MES (Manufacturing Execution System). This article covers both.


Machine Failure Rate: MTBF and MTTR

The machine failure rate describes the reliability of production equipment. Two key metrics define it:

| Metric | Full name | Formula | What it tells you | Good vs. bad |
|---|---|---|---|---|
| MTBF | Mean Time Between Failures | Total operating time / Number of failures | How long a machine runs on average before the next unplanned stop. | Higher = more reliable. MTBF of 50 hours = machine fails roughly once every six 8-hour shifts. MTBF of 500 hours = machine fails roughly once per month in three-shift operation. |
| MTTR | Mean Time To Repair | Total repair time / Number of failures | How long it takes on average to restore a machine to operating condition after a failure. | Lower = faster recovery. MTTR of 15 minutes = quick repair. MTTR of 4 hours = major intervention required. |
| Failure rate (λ) | Lambda | 1 / MTBF = Number of failures / Total operating time | The expected number of failures per unit of operating time. | Higher = less reliable. λ = 0.02 per hour means 2 failures per 100 operating hours. λ = 0.002 per hour means 2 failures per 1,000 operating hours. |
| Availability | (derived from MTBF and MTTR) | MTBF / (MTBF + MTTR) | The percentage of planned production time that the machine is actually operational. | MTBF = 100h, MTTR = 2h: Availability = 98%. MTBF = 20h, MTTR = 4h: Availability = 83%. |

The relationship is straightforward: a high MTBF and a low MTTR produce high availability. A low MTBF or a high MTTR (or both) produce low availability. Availability is the first factor in the OEE calculation.
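As a minimal sketch of how the four formulas above fit together, the following Python function derives MTBF, MTTR, λ, and availability from three totals. The function name and the input figures are illustrative assumptions, not values from a real machine.

```python
def reliability_metrics(operating_hours: float, repair_hours: float, failures: int) -> dict:
    """Derive MTBF, MTTR, failure rate and availability from period totals."""
    if failures <= 0:
        raise ValueError("at least one failure is needed to compute MTBF and MTTR")
    mtbf = operating_hours / failures   # hours of operation per failure
    mttr = repair_hours / failures      # hours of repair per failure
    return {
        "mtbf_h": mtbf,
        "mttr_h": mttr,
        "lambda_per_h": 1.0 / mtbf,             # failures per operating hour
        "availability": mtbf / (mtbf + mttr),   # fraction between 0 and 1
    }


# Hypothetical period: 1,000 operating hours, 10 failures, 20 hours of repair.
metrics = reliability_metrics(operating_hours=1000.0, repair_hours=20.0, failures=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these inputs, MTBF is 100 hours and MTTR is 2 hours, which corresponds to the 98% availability case in the table.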


The Bathtub Curve: How Failure Rates Change Over Time

Machine failure rates are not constant. They follow a characteristic pattern known as the bathtub curve, which describes three distinct phases in the life of a machine:

| Phase | Name | Failure rate behavior | Typical causes | Action |
|---|---|---|---|---|
| Early life (infant mortality) | Burn-in period | High failure rate that decreases over time. Many failures in the first weeks/months of operation. | Manufacturing defects in components. Assembly errors. Incorrect installation. Software bugs. Improper commissioning. | Thorough commissioning and acceptance testing. Burn-in runs before production start. Close monitoring during the first weeks of operation. |
| Useful life (normal operation) | Stable period | Low and relatively constant failure rate. Failures occur randomly and are not predictable from age alone. | Random component failures. External factors (power fluctuations, material variations). Operator errors. Environmental conditions. | Preventive maintenance at defined intervals. Condition monitoring. Alarm analysis to detect developing problems early. |
| Wear-out (end of life) | Aging period | Increasing failure rate over time. Failures become more frequent and more severe as components reach end of life. | Mechanical wear (bearings, seals, guides). Fatigue (springs, belts, chains). Degradation (insulation, electronics). Corrosion. | Predictive maintenance based on condition data. Planned replacement of wear parts before failure. Capital planning for machine replacement. |

For maintenance and production managers, the bathtub curve explains why new machines sometimes fail more than expected (early life phase) and why old machines become increasingly unreliable (wear-out phase). The goal of maintenance management is to extend the useful life phase as long as possible.
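One common way to model the three phases, which this article does not itself prescribe, is the Weibull hazard function: a shape parameter below 1 gives a falling failure rate, exactly 1 gives a constant rate, and above 1 gives a rising rate. The sketch below assumes that model; the shape and scale values are purely illustrative.

```python
def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Instantaneous failure rate h(t) = (shape / scale) * (t / scale) ** (shape - 1)."""
    if t <= 0 or shape <= 0 or scale <= 0:
        raise ValueError("t, shape and scale must all be positive")
    return (shape / scale) * (t / scale) ** (shape - 1)


# Illustrative parameters only (assumed, not measured on any machine).
phases = {
    "early life (shape 0.5)": 0.5,   # failure rate falls over time
    "useful life (shape 1.0)": 1.0,  # failure rate stays constant
    "wear-out (shape 3.0)": 3.0,     # failure rate rises over time
}
for label, shape in phases.items():
    rates = [weibull_hazard(t, shape, scale=1000.0) for t in (100.0, 500.0, 900.0)]
    print(label, [f"{r:.5f}" for r in rates])
```

Fitting such a curve to real stop data needs far more care than this sketch; it only shows how one formula can reproduce the falling, flat, and rising segments.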


Product Failure Rate (Scrap and Rework)

The product failure rate measures how many products do not meet quality specifications. In manufacturing, this is expressed as:

| Metric | Formula | What it tells you | Typical values |
|---|---|---|---|
| Scrap rate | Scrapped parts / Total parts produced × 100% | Percentage of parts that are defective and cannot be repaired. These parts are destroyed or recycled. | Automotive: 0.5% to 3%. Plastics: 1% to 5%. Metal processing: 0.5% to 2%. |
| Rework rate | Reworked parts / Total parts produced × 100% | Percentage of parts that failed initial inspection but can be repaired and brought to specification. | Varies widely by process. Assembly: 2% to 10%. Electronics: 1% to 5%. |
| First pass yield (FPY) | Good parts (first time) / Total parts produced × 100% | Percentage of parts that pass quality inspection on the first attempt without rework. | World-class: > 99%. Good: 95% to 99%. Poor: < 90%. |
| PPM (parts per million) | Defective parts / Total parts × 1,000,000 | Number of defective parts per million produced. Used in automotive for customer complaints and warranty claims. | Automotive OEM requirement: < 10 ppm. Typical supplier: 50 to 500 ppm. |

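The product failure rate (scrap + rework) directly determines the Quality factor in OEE. If 5% of parts are defective, the Quality factor is 95%. Because OEE is the product of Availability, Performance, and Quality, this lowers OEE by 5% relative to its defect-free value, not by a flat 5 percentage points. For example, if Availability × Performance is 80%, OEE falls from 80% to 76%.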
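The quality formulas above can be sketched in a few lines of Python. The function name and the counts are illustrative; the sketch also assumes that "defective" means scrap plus rework as counted in-house, whereas customer-facing ppm figures usually count only defects that reach the customer.

```python
def quality_metrics(total: int, scrap: int, rework: int) -> dict:
    """Compute scrap rate, rework rate, first pass yield and internal defect ppm."""
    if total <= 0:
        raise ValueError("total parts produced must be positive")
    if scrap < 0 or rework < 0 or scrap + rework > total:
        raise ValueError("scrap and rework must be non-negative and not exceed total")
    good_first_time = total - scrap - rework
    return {
        "scrap_rate_pct": 100.0 * scrap / total,
        "rework_rate_pct": 100.0 * rework / total,
        "fpy_pct": 100.0 * good_first_time / total,
        # Assumption: every scrapped or reworked part counts as defective.
        "defect_ppm": 1_000_000 * (scrap + rework) / total,
    }


# Hypothetical shift: 10,000 parts, 150 scrapped, 350 reworked.
for name, value in quality_metrics(total=10_000, scrap=150, rework=350).items():
    print(f"{name}: {value:g}")
```

For this hypothetical shift, the first pass yield is 95%, which is also the Quality factor used in OEE.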


How MES Data Enables Failure Rate Analysis

| MES data | Machine failure rate analysis | Product failure rate analysis |
|---|---|---|
| Automatic machine state detection | Every machine stop is detected automatically with timestamp and duration. This provides the raw data for MTBF and MTTR calculation without manual logging. | Machine state data provides context: which machine state was active when defective parts were produced? |
| PLC alarm capture | Every alarm code is recorded. Alarm Pareto shows the most frequent failure causes. Alarm ranking by duration shows which failures consume the most production time. | Alarm-quality correlation reveals which machine alarms coincide with increased product defect rates. |
| Production counts | Total parts produced provides the denominator for failure rate per unit. | Good parts, scrap, and rework are counted automatically. Scrap rate and first pass yield are calculated in real time. |
| Downtime classification | Each downtime event is classified by cause (technical failure, changeover, material shortage, etc.). Only technical failures are included in MTBF calculation. | Downtime classification separates quality-related stops (e.g., tool adjustment after defect) from other stop causes. |
| Process parameter recording | Process parameter trends can predict machine failures: increasing vibration, rising temperature, or decreasing pressure indicate developing problems. | Process parameter values per part enable correlation analysis: which parameter ranges produce defects? Which produce zero-defect results? |
| Trend analysis | MTBF and MTTR trends over weeks and months show whether machine reliability is improving (maintenance actions working) or deteriorating (wear-out phase approaching). | Scrap rate and rework rate trends over time show whether quality is improving or deteriorating. Validates the effectiveness of corrective actions. |
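The alarm Pareto mentioned in the table can be illustrated with a short Python sketch that ranks alarms both by frequency and by cumulative downtime. The alarm codes, durations, and the `(code, minutes)` record layout are hypothetical and do not reflect any specific MES export format.

```python
from collections import defaultdict


def alarm_pareto(events):
    """Rank alarm codes by occurrence count and by total downtime minutes.

    events: iterable of (alarm_code, downtime_minutes) tuples.
    Returns two lists of (alarm_code, value), each sorted in descending order.
    """
    counts = defaultdict(int)
    minutes = defaultdict(float)
    for code, duration in events:
        counts[code] += 1
        minutes[code] += duration
    by_count = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    by_minutes = sorted(minutes.items(), key=lambda kv: kv[1], reverse=True)
    return by_count, by_minutes


# Hypothetical alarm log for one week.
events = [
    ("E101", 2.0), ("E205", 45.0), ("E101", 3.0),
    ("E101", 1.5), ("E330", 12.0), ("E205", 30.0),
]
by_count, by_minutes = alarm_pareto(events)
print("Most frequent:", by_count)
print("Most downtime:", by_minutes)
```

In this made-up log the most frequent alarm is not the one that costs the most time, which is why both rankings are worth looking at.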

At Neoperl, SYMESTIC captures PLC alarms automatically and correlates them with both downtimes and quality defects. The result: 10% fewer downtimes (machine failure rate reduction) and 15% less scrap (product failure rate reduction) through targeted actions based on alarm-quality correlation data. At Meleghy, automatic downtime detection across 6 plants provides the data for MTBF analysis at every critical process step.


Failure Rate and OEE

| OEE factor | Failure rate impact | Example |
|---|---|---|
| Availability | Machine failure rate directly reduces availability. Every unplanned stop subtracts from operating time. MTBF and MTTR determine the availability percentage: Availability = MTBF / (MTBF + MTTR). | A machine with MTBF = 40h and MTTR = 2h has 95.2% availability. Improving MTBF to 80h (same MTTR) increases availability to 97.6%. |
| Performance | Micro-stops (very short failures of seconds to minutes) reduce effective cycle time but are often not classified as downtime. They reduce the performance factor without appearing in availability. | A machine with a 6-second cycle time that stops for 3 seconds every 10 cycles due to a sensor issue loses about 5% of its effective speed. This is invisible without automatic cycle time measurement. |
| Quality | Product failure rate (scrap + rework) directly reduces the quality factor. Every defective part produced is wasted machine time, material, and energy. | A 3% scrap rate means the quality factor is 97%. Reducing scrap to 1% increases quality to 99%, which raises OEE by about 2% in relative terms (for example, from 77.6% to 79.2% when Availability × Performance is 80%). |
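How a change in one factor carries through to OEE depends on the other two, because the three factors multiply. The sketch below illustrates this, along with the speed loss from micro-stops. The factor values and the 6-second cycle time are assumptions chosen for the example.

```python
def oee(availability: float, performance: float, quality: float) -> float:
    """OEE as the product of its three factors, each given as a fraction 0..1."""
    for factor in (availability, performance, quality):
        if not 0.0 <= factor <= 1.0:
            raise ValueError("each OEE factor must be between 0 and 1")
    return availability * performance * quality


def micro_stop_speed_loss(cycle_s: float, cycles_between_stops: int, stop_s: float) -> float:
    """Fraction of run time lost to a short stop that recurs every N cycles."""
    productive_s = cycle_s * cycles_between_stops
    return stop_s / (productive_s + stop_s)


# Assumed factors: 90% availability, 90% performance, quality rising from 97% to 99%.
before = oee(0.90, 0.90, 0.97)
after = oee(0.90, 0.90, 0.99)
print(f"OEE before: {before:.4f}, after: {after:.4f}, gain: {after - before:.4f}")

# Assumed 6-second cycle, one 3-second stop every 10 cycles.
print(f"Speed loss from micro-stops: {micro_stop_speed_loss(6.0, 10, 3.0):.4f}")
```

Under these assumptions, a 2-point gain in quality lifts OEE by about 1.6 points, and the micro-stops cost just under 5% of run time.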

Reducing Failure Rates: Machine and Product

| Strategy | Machine failure rate reduction | Product failure rate reduction |
|---|---|---|
| Data-driven root cause analysis | Alarm Pareto identifies the top failure causes. Systematic elimination starts with the alarm that causes the most cumulative downtime. | Scrap Pareto identifies the top defect types. Defect-cause analysis reveals which process steps or parameters produce the most defects. |
| Preventive maintenance | Time-based or cycle-based maintenance schedules replace wear parts before they fail. MTBF data determines optimal maintenance intervals. | Tool and die maintenance prevents quality drift. Calibration schedules maintain measurement accuracy. |
| Condition monitoring | Process parameters (vibration, temperature, pressure) are monitored for trends that indicate developing failures. Alert when thresholds are approached. | Process parameter monitoring detects quality drift before defective parts are produced. SPC control charts signal out-of-control conditions. |
| Alarm-quality correlation | Identifies which machine alarms precede failures. Enables predictive intervention before a full breakdown occurs. | Identifies which machine events cause quality defects. Enables targeted actions on the specific machine condition that produces scrap. |
| Standardization | Standardized operating procedures reduce operator-induced failures. Clear startup, shutdown, and changeover procedures. | Standardized process parameters for each product. Poka-yoke mechanisms prevent wrong setup or wrong material. |

Frequently Asked Questions About Failure Rate

What is the difference between failure rate and downtime?

Failure rate measures how often failures occur (frequency). Downtime measures how long the machine is stopped (duration). A machine can have a high failure rate but low total downtime if each failure is resolved quickly (low MTTR). Conversely, a machine can have a low failure rate but high total downtime if each failure takes hours to repair (high MTTR). Both metrics are needed for a complete picture. MES captures both automatically: the number of stop events (failure rate) and the duration of each stop (downtime).

How is MTBF calculated from MES data?

MTBF = Total operating time / Number of unplanned technical failures. The MES records every machine state transition with timestamp. Operating time is the sum of all "running" periods. Unplanned technical failures are all stops classified as technical (excluding planned stops like changeovers, breaks, or maintenance). Example: A machine ran for 160 hours in a week and had 8 unplanned technical stops. MTBF = 160 / 8 = 20 hours. This means the machine fails on average once every 20 hours of operation.
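The calculation described above can be sketched in Python against a simplified state log. The `(timestamp, state, stop_class)` record layout, the state names, and the sample timestamps are hypothetical; a real MES export will look different.

```python
from datetime import datetime, timedelta


def mtbf_hours(log):
    """MTBF in hours from a time-ordered list of (timestamp, state, stop_class).

    Each entry marks the start of a state; the final entry only closes the log.
    Only stops classified as "technical" count as failures.
    Returns None if no technical failure occurred.
    """
    running = timedelta()
    failures = 0
    for (start, state, stop_class), (end, _, _) in zip(log, log[1:]):
        if state == "running":
            running += end - start
        elif state == "stopped" and stop_class == "technical":
            failures += 1
    if failures == 0:
        return None
    return running.total_seconds() / 3600 / failures


t0 = datetime(2026, 3, 2, 6, 0)
h = lambda hours: t0 + timedelta(hours=hours)
log = [
    (h(0), "running", None),
    (h(10), "stopped", "technical"),    # failure 1
    (h(10.5), "running", None),
    (h(20.5), "stopped", "changeover"),  # planned stop, not a failure
    (h(21), "running", None),
    (h(41), "stopped", "technical"),    # failure 2
    (h(42), "end", None),               # closes the last interval
]
print(mtbf_hours(log))
```

In this sample log the machine runs for 40 hours in total and has two technical stops, so the MTBF is 20 hours; the changeover is excluded from the failure count.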

Can failure rate analysis be done on old machines?

Yes. Even machines without a modern PLC can contribute to failure rate analysis. A digital I/O gateway captures the basic machine state (running / stopped) with timestamp. From this data alone, the MES can calculate MTBF, MTTR, and availability. At Klocke, SYMESTIC provides this level of analysis on packaging lines connected via digital I/O gateways with LTE connectivity, without any LAN infrastructure.

What is a good MTBF for a production machine?

It depends entirely on the machine type and process. A CNC machining center might have an MTBF of 200+ hours. An automated assembly line with 40 stations might have an MTBF of 4 to 8 hours for the entire line (because any station failure stops the line). What matters is not the absolute number but the trend: is MTBF improving over time? And the comparison: which machines have significantly lower MTBF than comparable machines? These are the machines where maintenance and engineering effort should focus.
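The effect of station count on line MTBF can be approximated with a simple series model: if any station failure stops the line, and stations fail independently at constant rates, the station failure rates add up. The sketch below relies on those assumptions, and the per-station MTBF of 200 hours is a hypothetical value.

```python
def line_mtbf(station_mtbfs_h):
    """MTBF of a line in which any single station failure stops the whole line.

    Assumes independent stations with constant failure rates, so the line
    failure rate is the sum of the station failure rates (1 / MTBF each).
    """
    stations = list(station_mtbfs_h)
    if not stations or any(m <= 0 for m in stations):
        raise ValueError("provide at least one station, all with positive MTBF")
    line_failure_rate = sum(1.0 / m for m in stations)
    return 1.0 / line_failure_rate


# Hypothetical line: 40 stations, each with an MTBF of 200 hours.
print(f"Line MTBF: {line_mtbf([200.0] * 40):.1f} hours")
```

Forty stations that each fail only once every 200 hours still give a line MTBF of about 5 hours under this model. Buffers between stations, which the model ignores, soften the effect in practice.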

How does failure rate relate to TPM (Total Productive Maintenance)?

TPM aims to achieve zero breakdowns, zero defects, and zero accidents. The machine failure rate (MTBF) and the product failure rate (scrap rate) are the primary metrics for measuring TPM success. An MES provides the data infrastructure for TPM: automatic downtime detection, alarm capture, scrap tracking, and OEE calculation. Without this data, TPM activities are based on assumptions rather than facts.


About the author:
Martin Brandel
MES Consultant at SYMESTIC. Over 30 years in industrial automation. Dipl.-Ing. Nachrichtentechnik.