RCM: The 7 Questions, FMEA Logic & MES Data

By Martin Brandel · Last updated: April 2026

What is Reliability Centered Maintenance (RCM)?

Reliability Centered Maintenance (RCM) is a structured method for determining the maintenance requirements of a physical asset in its operating context. It was developed by Stanley Nowlan and Howard Heap at United Airlines in the late 1960s — originally to design the maintenance programme for the Boeing 747. Their 1978 report for the US Department of Defense remains the foundational document. The method was later formalised in SAE JA1011 (1999/2009), which defines the evaluation criteria that a process must satisfy to be called "RCM."

RCM does not start with the machine and ask "what maintenance does it need?" It starts with the function the machine performs and asks "what can cause this function to fail, what happens when it does, and what is the most cost-effective thing to do about it?" This function-first logic is what distinguishes RCM from traditional preventive maintenance, which applies time-based schedules uniformly regardless of failure consequence. RCM produces a different maintenance strategy for each failure mode: some get preventive tasks, some get condition-based monitoring, some are designed out, and some are deliberately run to failure, because the economics and consequences differ in each case.

In discrete manufacturing, RCM is the methodology that connects MTBF/MTTR data, FMEA analysis, and maintenance strategy selection into a single coherent framework. The MES provides the operating data (failure history, alarm patterns, process parameter trends, operating hours) that makes RCM analysis fact-based rather than opinion-based.

What are the 7 RCM questions (SAE JA1011)?

SAE JA1011 defines 7 questions that must be answered, in sequence, for each asset. Any process that claims to be RCM must answer all 7. These questions form the logical backbone of the entire analysis:

| # | RCM question | What it determines | MES data input | Example (hydraulic press) |
|---|---|---|---|---|
| 1 | What are the functions and performance standards of the asset in its present operating context? | The purpose the asset serves. Not "what is it?" but "what does it do?", including primary and secondary functions, with quantified performance standards. | MES defines the operating context: the press runs 2 shifts, produces 450 cycles/hour, and is part of a 3-station line where it is the bottleneck. | Primary function: "Form blanks into brackets at 450 cycles/hour with ≤ 0.3 % reject rate." Secondary: "Contain hydraulic fluid without leaks." |
| 2 | In what ways can it fail to fulfil its functions? | Functional failures, not component failures. "The press cannot form brackets" is a functional failure. "The hydraulic pump is broken" is a failure mode (Q3). | MES downtime log reveals the functional failures: "press stopped" (total loss of function), "press running but reject rate > 0.3 %" (partial loss of function). | FF1: Total loss of forming capability. FF2: Forming capability degraded (reject rate exceeds 0.3 %). FF3: Hydraulic fluid leak (secondary function failure). |
| 3 | What causes each functional failure? | Failure modes: specific events that cause the functional failure. This is the FMEA (Failure Mode and Effects Analysis) step. | MES alarm history identifies the specific failure modes: alarm #3017 (hydraulic pressure low), alarm #4022 (servo drive overload), alarm #5001 (temperature sensor fault). Each alarm points to a failure mode. | FM1: Hydraulic pump wear → pressure below 240 bar. FM2: Servo drive overheating. FM3: Die alignment drift → rejects. FM4: Hydraulic hose rupture. |
| 4 | What happens when each failure occurs? | Failure effects: the sequence of events when the failure mode occurs. Not the consequence (Q5), but the observable evidence and what the operator sees. | MES provides the effects data: "alarm #3017 fires, press stops within 3 seconds, last 2 parts are underformed (reject), line stops because press is bottleneck, average repair time 45 minutes." | FM1 effect: "Alarm fires. Press stops. 2 reject parts produced. Downstream stations idle. Technician called. Average repair: 45 min. About 340 parts of output lost (45 min at 450 cycles/hour)." |
| 5 | In what way does each failure matter? | Failure consequences, classified into 4 categories: hidden failure, safety/environmental, operational, non-operational. The consequence determines which maintenance strategy is justified. | MES provides the production impact: downtime hours, lost output, scrap cost. Combined with cost data from ERP: material cost per reject, hourly production value, overtime cost to recover lost output. | FM1 consequence: Operational. 45 min downtime × €800/hour production value = €600 per event. At 2 events per month, that is €14,400/year. Preventive action is justified if it costs less than €14,400/year. |
| 6 | What should be done to predict or prevent each failure? | Proactive tasks: scheduled restoration, scheduled discard, or on-condition (condition-based monitoring). The task must be technically feasible and worth doing. | MES data determines task feasibility: Is there a detectable degradation pattern before failure? MES process data shows: hydraulic pressure declines gradually over 2–3 weeks before reaching the failure threshold. → On-condition task is technically feasible. | FM1: On-condition task. Monitor the hydraulic pressure trend weekly via MES process data. When pressure drops below 260 bar (P-F interval = ~2 weeks), schedule pump maintenance. |
| 7 | What should be done if a suitable proactive task cannot be found? | Default actions: failure-finding task (for hidden failures), redesign, or run-to-failure (if the consequence is acceptable). | MES cost data supports the decision: if the failure consequence is €200/event and occurs once per year, and any preventive task costs more than €200/year, run-to-failure is the rational choice. | FM4 (hose rupture): Random failure with no detectable degradation pattern. No on-condition task feasible. Consequence: €600 + environmental cleanup. Decision: keep spare hose on site, accept run-to-failure with fast response. |
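
The cost reasoning in questions 5 and 7 can be sketched in a few lines of Python. This is a minimal illustration, not a costing model: the class and field names are hypothetical, the €2,000 and €250 task costs are invented for the example, and the sketch assumes the proactive task prevents every occurrence of the failure.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    """One failure mode from an RCM worksheet (field names are illustrative)."""
    name: str
    downtime_hours: float             # average downtime per event (MES downtime log)
    events_per_year: float            # failure frequency (MES failure history)
    production_value_per_hour: float  # EUR per hour of lost production (ERP)

    def cost_per_event(self) -> float:
        return self.downtime_hours * self.production_value_per_hour

    def annual_cost(self) -> float:
        return self.cost_per_event() * self.events_per_year


def worth_doing(fm: FailureMode, annual_task_cost: float) -> bool:
    """A proactive task is worth doing only if it costs less per year than
    the failures it is assumed to prevent."""
    return annual_task_cost < fm.annual_cost()


# FM1 from the table: 45 min downtime, 2 events per month, EUR 800/hour.
fm1 = FailureMode("FM1 hydraulic pump wear", 0.75, 24, 800.0)
print(fm1.cost_per_event())        # 600.0
print(fm1.annual_cost())           # 14400.0
print(worth_doing(fm1, 2_000.0))   # True: hypothetical EUR 2,000/year task

# Question 7 example: EUR 200 per event, once per year.
minor = FailureMode("minor failure", 0.25, 1, 800.0)
print(worth_doing(minor, 250.0))   # False: run-to-failure is the cheaper option
```

In a real analysis the cost per event would also include scrap, overtime, and any secondary damage, and a task rarely prevents every failure, so the comparison should be read as an upper bound on what a task may cost.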

How does RCM select the maintenance strategy for each failure mode?

RCM does not prescribe a single maintenance strategy for an entire machine. It assigns a different strategy to each failure mode based on the failure pattern and the consequence. This is the core distinction between RCM and traditional maintenance — and the reason RCM produces lower-cost, higher-reliability outcomes:

| Maintenance strategy | RCM assigns this when… | MES data required | Example | Typical assets |
|---|---|---|---|---|
| On-condition (CBM) | A detectable degradation exists before failure (P-F curve). The P-F interval is long enough to act. The task is technically feasible and worth doing. | MES process data: vibration, temperature, pressure, current, monitored per cycle. MES alarm trends: rising alarm frequency as degradation indicator. At Neoperl, PLC alarm correlation is exactly this. | Monitor spindle vibration via MES process data. When vibration exceeds 4.5 mm/s (warning threshold), schedule bearing replacement within 2 weeks (P-F interval). | Bearings, motors, pumps, drives: any component with gradual, measurable degradation. |
| Scheduled restoration | The component has a defined useful life (age-related failure pattern). Restoration at a fixed interval is cheaper than the failure consequence. | MES provides actual operating hours per machine, not calendar estimates. At Brita, digital machine signals provide the operating-hour foundation. The MES feeds real hours to trigger the task at exactly the right time. | Replace hydraulic seals every 4,000 operating hours. MES tracks operating hours automatically; CMMS triggers the work order. | Seals, filters, belts, lubricants: wear items with predictable life. |
| Scheduled discard | Same as restoration, but the component is discarded, not restored. Applies when restoration is impractical or the safety consequence demands it. | Same as above: MES operating hours trigger the discard interval. | Replace safety-critical hydraulic hoses every 6 years (manufacturer specification), regardless of condition. | Safety-critical components with manufacturer-mandated replacement intervals. |
| Failure-finding | The failure is hidden: the operator would not know it has occurred under normal circumstances. Applies to protective devices and redundant systems. | MES can automate failure-finding: test the safety interlock every shift by triggering the test cycle and verifying the response signal from the PLC. | Test the light curtain on the press every shift start. If it does not stop the press within 50 ms, it has failed, and nobody would know until someone is injured. | Safety interlocks, pressure relief valves, fire suppression, backup systems. |
| Run-to-failure | No proactive task is technically feasible or worth doing. The failure consequence is acceptable (low cost, no safety impact, no secondary damage). | MES validates the decision: if the failure occurs and costs less than predicted, run-to-failure was the right choice. If it costs more, reassign the strategy. | Indicator lights, non-critical sensors, cosmetic covers. Replace when broken. Total cost: €20 part + 10 min labour. | Non-critical, low-cost, easily replaceable components with no secondary damage potential. |
| Redesign | The failure consequence is unacceptable (safety, environmental, or extremely high cost) and no maintenance task can reduce the risk to an acceptable level. | MES failure history proves that maintenance cannot solve the problem: despite preventive and condition-based tasks, the failure mode recurs. The data justifies the capital investment in redesign. | The cooling system fails 4× per year despite all maintenance. MES data shows: each failure costs €3,200 in downtime. Redesign with a larger cooler costs €8,000. Payback: about 7.5 months (€8,000 ÷ €12,800/year). | Design-limited components where the failure rate cannot be reduced by maintenance. |
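
The selection logic in the table can be written as a small decision function. The sketch below is a deliberate simplification of the RCM decision diagram, not the SAE JA1011 wording: the function name, the boolean inputs, and the order of checks are assumptions made for illustration, and the payback helper assumes the redesign eliminates the failure completely.

```python
from enum import Enum


class Strategy(Enum):
    ON_CONDITION = "on-condition (CBM)"
    SCHEDULED_RESTORATION = "scheduled restoration"
    SCHEDULED_DISCARD = "scheduled discard"
    FAILURE_FINDING = "failure-finding"
    RUN_TO_FAILURE = "run-to-failure"
    REDESIGN = "redesign"


def select_strategy(
    *,
    hidden: bool,
    detectable_degradation: bool,
    pf_interval_long_enough: bool,
    age_related: bool,
    restorable: bool,
    task_worth_doing: bool,
    consequence_acceptable: bool,
) -> Strategy:
    """Simplified per-failure-mode strategy selection."""
    # Question 6: look for a proactive task that is feasible and worth doing.
    if task_worth_doing:
        if detectable_degradation and pf_interval_long_enough:
            return Strategy.ON_CONDITION
        if age_related:
            return (Strategy.SCHEDULED_RESTORATION if restorable
                    else Strategy.SCHEDULED_DISCARD)
    # Question 7: default actions when no proactive task qualifies.
    if hidden:
        return Strategy.FAILURE_FINDING
    if consequence_acceptable:
        return Strategy.RUN_TO_FAILURE
    return Strategy.REDESIGN


def payback_months(investment: float, failures_per_year: float,
                   cost_per_failure: float) -> float:
    """Months until a redesign pays back, assuming it removes the failure."""
    return 12 * investment / (failures_per_year * cost_per_failure)


# Pump wear: gradual pressure decline, roughly two weeks of warning.
print(select_strategy(
    hidden=False, detectable_degradation=True, pf_interval_long_enough=True,
    age_related=False, restorable=True, task_worth_doing=True,
    consequence_acceptable=False,
).value)  # on-condition (CBM)

# Cooling system example from the table.
print(payback_months(8_000, 4, 3_200))  # 7.5
```

A full RCM decision diagram branches first on the consequence category (hidden, safety/environmental, operational, non-operational) and applies different "worth doing" criteria in each branch. The flat function above collapses that into boolean flags to keep the example short.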

The key insight from Nowlan and Heap's original research, confirmed across industries since 1978, is that only a minority of failure modes follow an age-related pattern (where time-based replacement is effective). Most failure modes in complex systems are either random, with a roughly constant probability of failure over time, or show an elevated probability of failure in early life that then declines (infant mortality). This means that fixed-interval replacement, applied without analysis, often does not improve reliability and may actually introduce new failures: every intervention returns the component to its early-life phase and creates an opportunity for maintenance-induced errors. RCM addresses this by matching the maintenance strategy to the actual failure pattern, and the MES provides the failure data to determine which pattern applies.
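
As a rough illustration of how failure history can hint at the pattern, the sketch below screens the times between failures by their coefficient of variation (CV). This is a crude heuristic chosen for the example, not a method from Nowlan and Heap or from SAE JA1011. It relies on the Weibull property that CV is below 1 for wear-out behaviour, about 1 for random failures, and above 1 for early-life failures. The function name, the 0.2 tolerance, and the sample data are assumptions, and a handful of data points is far too few for a real decision.

```python
from statistics import mean, stdev


def failure_pattern(times_between_failures: list[float],
                    tolerance: float = 0.2) -> str:
    """Screen a failure history by the coefficient of variation (CV).

    CV well below 1 -> failures arrive at regular ages (age-related)
    CV close to 1   -> consistent with random failures
    CV well above 1 -> failures cluster, as in early-life behaviour
    """
    if len(times_between_failures) < 2:
        raise ValueError("need at least two intervals")
    cv = stdev(times_between_failures) / mean(times_between_failures)
    if cv < 1 - tolerance:
        return "age-related"
    if cv > 1 + tolerance:
        return "early-life"
    return "random"


# Hypothetical operating hours between seal failures: very regular.
print(failure_pattern([3900, 4100, 4000, 4050, 3950]))  # age-related

# Hypothetical hours between drive faults: clustered and erratic.
print(failure_pattern([10, 500, 30, 2000, 5, 900]))     # early-life
```

A proper analysis would fit a Weibull distribution to the failure ages of the component, account for censored data (units that have not failed yet), and report confidence bounds on the shape parameter.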

How does RCM differ from TPM and preventive maintenance?

| Dimension | RCM | TPM | Traditional preventive maintenance |
|---|---|---|---|
| Starting point | The function of the asset. "What must this machine do?" | The losses of the asset. "What is reducing our OEE?" | The machine manual. "What does the manufacturer recommend?" |
| Analysis method | FMEA-based: failure mode → consequence → task selection logic. | Loss-based: 6 big losses → root cause → countermeasure. | Schedule-based: time or meter interval → task list. |
| Task assignment | Different strategy per failure mode: some CBM, some time-based, some run-to-failure, some redesign. | Operator-driven autonomous maintenance + planned maintenance + focused improvement. | Uniform time-based schedule for all failure modes. |
| Strength | Optimises the maintenance strategy per failure mode. Eliminates unnecessary tasks. Justifies every task economically. | Engages operators in maintenance. Reduces the 6 big losses. Creates a maintenance culture. | Simple to implement. Good for assets with predictable, age-related wear patterns. |
| Limitation | Analysis-intensive. Requires detailed failure data and cross-functional team. Does not address operator behaviour or culture. | Does not systematically analyse failure modes or consequence economics. May apply uniform maintenance to all components. | Does not consider failure consequences. Over-maintains non-critical items, under-maintains critical ones. |
| MES role | MES provides the failure data, alarm history, and process parameter trends that RCM analysis is built on. | MES provides OEE, downtime Pareto, and the 6 big losses quantification that drives TPM improvement. | MES provides operating hours for meter-based schedules. |

In practice, the best maintenance programmes combine elements of all three: RCM to design the strategy, TPM to engage operators in daily maintenance, and preventive schedules for the simple wear items that RCM identified as age-related. The MES is the shared data layer across all three.

FAQ

Is RCM only for aviation and nuclear?
No. RCM originated in commercial aviation and was adopted early by the nuclear industry, but it has been applied across asset-intensive industries since the 1980s: power generation, mining, oil and gas, railways, and manufacturing. In discrete manufacturing, RCM is most valuable for critical bottleneck machines where the consequence of failure is high (line stop, scrap, customer delivery impact). A plant does not need to apply RCM to every machine. Start with the handful of machines (often around five) that account for roughly 80 % of unplanned downtime; the MES downtime Pareto identifies them. Those are the machines where RCM analysis pays for itself.
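
The downtime Pareto mentioned above is simple to compute once downtime hours per machine are exported from the MES. A minimal sketch, assuming a plain dictionary of unplanned downtime hours per machine; the function name, machine names, and figures are invented for illustration.

```python
def pareto_machines(downtime_by_machine: dict[str, float],
                    share: float = 0.8) -> list[str]:
    """Return the machines that together account for `share` of total
    downtime, starting with the worst offender."""
    total = sum(downtime_by_machine.values())
    selected: list[str] = []
    cumulative = 0.0
    ranked = sorted(downtime_by_machine.items(),
                    key=lambda item: item[1], reverse=True)
    for machine, hours in ranked:
        if cumulative >= share * total:
            break  # target share already covered
        selected.append(machine)
        cumulative += hours
    return selected


# Hypothetical unplanned downtime hours for one quarter.
downtime = {
    "press_01": 120.0,
    "cnc_04": 90.0,
    "weld_02": 60.0,
    "pack_01": 20.0,
    "saw_03": 10.0,
}
print(pareto_machines(downtime))  # ['press_01', 'cnc_04', 'weld_02']
```

Here three of the five machines cover 270 of the 300 downtime hours, so they would be the first candidates for an RCM analysis.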

How long does an RCM analysis take?
For a single complex machine (e.g., a hydraulic press with 50+ failure modes), a cross-functional team (maintenance, operations, engineering) typically needs 3–5 facilitated sessions of 4 hours each. The output is a maintenance strategy for every failure mode on that machine. The investment pays for itself if it eliminates even one unplanned stop per month on a bottleneck machine. The critical success factor is data quality: with MES failure history and alarm data, the team spends time analysing — without it, they spend time guessing.

What is the P-F curve?
The P-F curve (Potential Failure to Functional Failure curve) is the central concept in RCM condition-based maintenance. Point P is where degradation becomes detectable (e.g., vibration increases). Point F is where functional failure occurs (the machine stops). The P-F interval is the window in which you can detect the problem and act before failure. On-condition tasks are only feasible if the P-F interval is long enough to schedule and execute the maintenance. MES process data monitoring shortens the detection time: instead of checking vibration once per week with a handheld sensor, the MES monitors it every cycle. The P-F interval itself does not change, but degradation is caught close to point P, so almost the whole interval remains available for planning the repair and the on-condition task becomes more reliable.
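
The feasibility condition can be expressed as a short check on the net P-F interval, that is, the time left to act in the worst case where degradation starts right after an inspection. The sketch is a simplified rule of thumb. The function names and the 5-day response time are assumptions for illustration, while the 14-day P-F interval and the weekly check come from the hydraulic press example.

```python
def net_pf_interval(pf_interval_days: float,
                    inspection_interval_days: float) -> float:
    """Worst-case time left to act: degradation becomes detectable just
    after an inspection and is only found at the next one."""
    return pf_interval_days - inspection_interval_days


def on_condition_feasible(pf_interval_days: float,
                          inspection_interval_days: float,
                          response_time_days: float) -> bool:
    """Feasible if the net P-F interval covers the time needed to plan
    and execute the repair."""
    net = net_pf_interval(pf_interval_days, inspection_interval_days)
    return net >= response_time_days


# P-F interval of about 2 weeks, hypothetical 5 days to plan and repair.
print(on_condition_feasible(14, 7, 5))    # True: weekly check leaves 7 days
print(on_condition_feasible(14, 14, 5))   # False: check every 2 weeks leaves 0
print(net_pf_interval(14, 0))             # 14: per-cycle monitoring keeps it all
```

This is why a common rule of thumb sets the inspection interval at no more than half the P-F interval, and why per-cycle monitoring through the MES gives the planner the most room.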

How does RCM relate to predictive maintenance?
Predictive maintenance is one of the task types that RCM can assign — specifically, on-condition tasks using advanced analytics (machine learning on historical sensor data) to predict remaining useful life. RCM is the analysis framework that determines which failure modes benefit from predictive maintenance. Not every failure mode does — random failures with no detectable precursor cannot be predicted, and low-consequence failures do not justify the investment. RCM prevents the common mistake of applying predictive maintenance everywhere: it focuses the expensive monitoring on the failure modes where it actually reduces cost.


Related: TPM · Predictive Maintenance · MTBF · MTTR · MTTF · OEE Explained · SYMESTIC Alarms Module · SYMESTIC Process Data · SYMESTIC Production Metrics · MES: Definition & Functions

About the author
Martin Brandel
MES Consultant at SYMESTIC. Dipl.-Ing. Nachrichtentechnik. Over 30 years in industrial automation. Has connected machines from every decade since the 1980s to data systems, and learned that the most expensive maintenance strategy is not the one that costs the most, but the one that is applied to the wrong failure mode. The P-F curve is only useful if someone is actually measuring P.