MES Software: Vendors, Features & Costs Compared 2026
MES software compared: vendors, functions per VDI 5600, costs (cloud vs. on-premise) and implementation. Honest market overview 2026.
Root cause analysis (RCA) is a structured investigation method that traces a problem — a machine failure, a quality defect, a recurring downtime event — back to its underlying cause, not just the visible symptom. The principle: fixing the symptom means the problem returns; fixing the root cause means it does not. RCA is not a single tool — it is a family of methods (Ishikawa diagram, 5 Whys, fault tree analysis, Pareto analysis, is/is-not matrix) applied within a structured problem-solving framework (8D, A3, DMAIC). In discrete manufacturing, RCA is the daily work of continuous improvement teams, maintenance engineers, and quality managers.
The quality of an RCA depends entirely on the quality of the data it starts from. An RCA based on operator memory ("the machine stopped sometime around 2 PM, maybe because of the hydraulics") produces different results than an RCA based on MES data: "machine M-007 stopped at 14:03:17, preceded by alarm #3017 (hydraulic pressure below 240 bar) at 14:02:54, third occurrence in 72 hours, process parameter log shows pressure declining since 08:00." The MES does not perform the RCA — the human does. But the MES provides the timestamped, machine-level evidence that turns RCA from a brainstorming exercise into a forensic investigation.
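To make the contrast concrete, here is a minimal sketch of the kind of query an investigator might run against an exported alarm log. The `Alarm` record, the `alarms_in_window` helper, the dates, and the extra log rows are assumptions for illustration. Only the 14:02:54 alarm and the 14:03:17 stop come from the example above, and this is not any MES vendor's actual API.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Alarm(NamedTuple):
    timestamp: datetime
    machine: str
    number: int
    text: str


# Hypothetical alarm log: dates and additional rows are invented for the sketch.
ALARM_LOG = [
    Alarm(datetime(2026, 1, 12, 16, 20, 5), "M-007", 3017, "hydraulic pressure below 240 bar"),
    Alarm(datetime(2026, 1, 13, 9, 41, 10), "M-007", 3017, "hydraulic pressure below 240 bar"),
    Alarm(datetime(2026, 1, 14, 11, 5, 0), "M-008", 4011, "unrelated alarm on another machine"),
    Alarm(datetime(2026, 1, 14, 14, 2, 54), "M-007", 3017, "hydraulic pressure below 240 bar"),
]
STOP_TIME = datetime(2026, 1, 14, 14, 3, 17)


def alarms_in_window(log, machine, end, window):
    """Alarms on `machine` raised no more than `window` before `end`."""
    return [a for a in log
            if a.machine == machine and timedelta(0) <= end - a.timestamp <= window]


# Which alarms fired in the 5 minutes before the stop?
preceding = alarms_in_window(ALARM_LOG, "M-007", STOP_TIME, timedelta(minutes=5))

# How often did alarm #3017 occur on this machine in the last 72 hours?
last_72h = alarms_in_window(ALARM_LOG, "M-007", STOP_TIME, timedelta(hours=72))
recurrences = sum(1 for a in last_72h if a.number == 3017)

for a in preceding:
    lead = (STOP_TIME - a.timestamp).total_seconds()
    print(f"alarm #{a.number} ({a.text}) fired {lead:.0f} s before the stop; "
          f"{recurrences} occurrences in 72 h")
```

The 5-minute and 72-hour windows are arbitrary choices here. In practice the investigator would pick them based on how quickly the suspected mechanism can lead to a stop.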
| Method | How it works | Best suited for | MES data that feeds it | Limitation |
|---|---|---|---|---|
| 5 Whys | Ask "Why?" repeatedly (typically 5 times) until the answer stops being a symptom and becomes a systemic cause. Linear chain: symptom → cause 1 → cause 2 → … → root cause. | Simple, single-cause problems. Quick shopfloor investigations during shift meetings. First-response tool when a downtime event occurs. | MES provides the starting point: "Machine stopped at 14:03" (not "machine stopped sometime this afternoon"). Each "Why?" can be checked against MES data — alarm logs, process parameter trends, order history. | Assumes a single linear cause chain. Complex problems with multiple interacting causes require a different method. The quality of the 5 Whys depends entirely on the knowledge of the people in the room — and the data available to them. |
| Ishikawa diagram (Fishbone / Cause-and-effect) | Structures potential causes into categories, classically the 6 Ms: Man, Machine, Material, Method, Measurement, Milieu (Environment). The team brainstorms causes within each category. | Complex problems where the cause is unknown and could come from multiple categories. Structured team brainstorming. Automotive 8D reports (D4: root cause analysis). | MES data fills the "Machine" and "Method" branches with facts: which alarms occurred, which process parameters drifted, which order was running, which shift was active. At Neoperl, PLC (German: SPS) alarm correlation filled the "Machine" branch of the fishbone with quantified data instead of opinions. | Does not by itself prove causation; it only generates hypotheses. Each hypothesis must be verified with data or experiment. Without data, the fishbone degenerates into a list of opinions ranked by the seniority of the person who stated them. |
| Pareto analysis | Ranks causes by frequency or impact. The 80/20 rule: typically 20 % of causes account for 80 % of the losses. Prioritises which root cause to investigate first. | Recurring problems with multiple potential causes. Downtime analysis, scrap analysis, defect type ranking. The first step before any deep RCA: "Which problem costs us the most?" | MES generates the Pareto automatically: downtime by reason code, rejects by defect type, alarm frequency by alarm number. The SYMESTIC production metrics module provides this as a standard dashboard. At Meleghy, the downtime Pareto across 6 plants revealed which loss categories to attack first. | Pareto shows what to investigate — not why it happens. It is a prioritisation tool, not a root cause tool. After the Pareto identifies the #1 loss, you still need 5 Whys, Ishikawa, or fault tree to find the root cause. |
| Fault tree analysis (FTA) | Top-down deductive analysis. Starts with the failure event (top) and maps all possible combinations of causes using AND/OR logic gates. Visual tree structure. | Safety-critical failures, complex systems with redundancy, failures that require multiple conditions to occur simultaneously (AND gates). Standard in aerospace, nuclear, automotive safety. | MES alarm data and process data provide the evidence for which branches of the fault tree were active at the time of failure. Alarm timestamps and process parameter logs narrow down which combination of conditions actually occurred. | Time-intensive to construct properly. Requires deep system knowledge. Overkill for simple problems — use 5 Whys instead. Best reserved for critical, expensive, or safety-relevant failures. |
| Is/Is-not analysis | Systematic comparison: Where does the problem occur vs. where does it not? When does it occur vs. when not? On which machine/product/shift vs. which not? The differences narrow down the cause. | Sporadic problems that occur sometimes but not always. "Why does this defect happen on machine 5 but not on machine 6?" — the difference between them is likely the cause. | MES enables is/is-not analysis directly: filter by machine, shift, product, order, time period. "Scrap rate on press 5 is 3.2 %; on press 6 it is 0.8 %. Both run the same product. What is different?" The MES shows: press 5 has 12 % higher cycle time variation and alarm #4011 occurs 5× more frequently. | Requires the problem to have a clear pattern (occurs here, not there). If the problem is universal (occurs everywhere equally), is/is-not does not help — use Ishikawa or FTA instead. |
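As an illustration of the Pareto row above, the following sketch ranks downtime by reason code and adds the cumulative share. The reason codes and minutes are hypothetical, and `pareto` is an illustrative helper, not a function of any particular MES.

```python
def pareto(losses):
    """Rank loss categories by size and add the cumulative share in percent."""
    total = sum(losses.values())
    if total == 0:
        return []
    ranked = sorted(losses.items(), key=lambda item: item[1], reverse=True)
    rows, running = [], 0.0
    for reason, minutes in ranked:
        running += minutes
        rows.append((reason, minutes, round(100 * running / total, 1)))
    return rows


# Hypothetical downtime minutes per reason code for one week.
DOWNTIME_MINUTES = {
    "tool change overrun": 120,
    "hydraulic failure": 250,
    "sensor fault": 40,
    "material shortage": 60,
    "other": 30,
}

PARETO_ROWS = pareto(DOWNTIME_MINUTES)
for reason, minutes, cumulative in PARETO_ROWS:
    print(f"{reason:<22} {minutes:>4} min  cumulative {cumulative:5.1f} %")
```

The ranking only answers which loss to investigate first. The top entry is then handed to 5 Whys, Ishikawa, or fault tree analysis.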
In automotive manufacturing, RCA is not performed in isolation. It is embedded in the 8D (Eight Disciplines) problem-solving process, the standard corrective action framework required by most OEMs (VW Formel Q, BMW QMT, Ford TOPS-8D). The 8D report structures the entire investigation from symptom to verified corrective action. RCA corresponds to steps D4 and D5:
| Step | 8D discipline | What it requires | MES data input | RCA method used |
|---|---|---|---|---|
| D1 | Establish team | Cross-functional team with process knowledge. | — | — |
| D2 | Describe the problem | Precise problem description: What, Where, When, How big, Who detected. The is/is-not matrix is the standard tool. | MES provides the "When" (exact timestamp), "Where" (which machine, which station), "How big" (how many defective parts, which orders affected), and the is/is-not data (which machines/shifts are affected vs. not). | Is/Is-not analysis |
| D3 | Interim containment | Immediate action to protect the customer while the root cause is investigated. Sort, inspect, quarantine. | MES identifies the affected scope: which serial numbers, which batches, which orders were produced on the affected machine during the affected time window. At Meleghy, the bidirectional SAP integration enables exact containment scope definition. | — |
| D4 | Root cause analysis | Identify all potential root causes. Verify each with data or experiment. Distinguish occurrence cause (why did the defect happen?) from escape cause (why was it not detected?). | MES alarm history, process parameter trends, downtime logs, cycle time variation, OEE drill-downs. At Neoperl, PLC (German: SPS) alarm correlation directly identified the root cause: the alarm pattern that preceded the quality defect. | Ishikawa + 5 Whys (most common). FTA for complex/safety cases. |
| D5 | Verify corrective actions | Prove that the proposed corrective action actually eliminates the root cause — before full implementation. | MES provides the before/after comparison: defect rate before corrective action vs. after. Process parameter stability before vs. after. Alarm frequency before vs. after. This is the evidence that the corrective action works. | Data comparison (statistical test or trend comparison) |
| D6 | Implement permanent corrective action | Roll out the verified corrective action. Update control plans, work instructions, Poka Yoke devices. | MES can enforce the corrective action digitally: updated process parameter limits, new alarm thresholds, routing changes via production control module. | — |
| D7 | Prevent recurrence | Systemic actions: update FMEA, apply to similar processes/products, standardise the learning. | MES data enables horizontal deployment: "This root cause was found on press 5. Do presses 3, 4, 6 show the same alarm pattern?" If yes, apply the corrective action proactively. | — |
| D8 | Recognise the team | Close the report. Acknowledge the team's contribution. | — | — |
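The D3 containment step in the table above boils down to a filter over production records. The sketch below assumes a hypothetical `ProductionRecord` export with serial number, order, machine, and production timestamp. The field names and data are invented and do not reflect a specific MES or SAP interface.

```python
from datetime import datetime
from typing import NamedTuple


class ProductionRecord(NamedTuple):
    serial: str
    order: str
    machine: str
    produced_at: datetime


# Hypothetical production records for one afternoon.
RECORDS = [
    ProductionRecord("SN-1001", "ORD-51", "M-007", datetime(2026, 1, 14, 13, 40)),
    ProductionRecord("SN-1002", "ORD-51", "M-007", datetime(2026, 1, 14, 13, 58)),
    ProductionRecord("SN-1003", "ORD-52", "M-007", datetime(2026, 1, 14, 14, 2)),
    ProductionRecord("SN-2001", "ORD-60", "M-008", datetime(2026, 1, 14, 14, 0)),
    ProductionRecord("SN-1004", "ORD-52", "M-007", datetime(2026, 1, 14, 15, 30)),
]


def containment_scope(records, machine, start, end):
    """Records produced on `machine` within the closed interval [start, end]."""
    return [r for r in records
            if r.machine == machine and start <= r.produced_at <= end]


# Suspect window: assumed to be defined by the team during D2/D3.
suspect = containment_scope(
    RECORDS, "M-007",
    start=datetime(2026, 1, 14, 13, 50),
    end=datetime(2026, 1, 14, 14, 5),
)
suspect_serials = [r.serial for r in suspect]
suspect_orders = sorted({r.order for r in suspect})
print("quarantine serials:", suspect_serials)
print("affected orders:", suspect_orders)
```

The same filter, applied to other machines with the same alarm pattern, is one way to approach the horizontal deployment question in D7.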
| RCA step | Without MES | With MES |
|---|---|---|
| Problem detection | Operator notices the problem during the shift — or the customer reports it weeks later. Detection delay: hours to weeks. | MES detects the deviation in real time: alarm fires, process parameter exceeds limit, reject rate spikes above threshold. Detection delay: seconds. At Brita, deviations are visible on the shopfloor monitor immediately. |
| Problem description (D2) | "The machine had a lot of downtime last week." No timestamps, no quantities, no machine-level detail. The is/is-not matrix is filled from memory. | "Machine M-007 had 14 unplanned stops between Monday 06:00 and Friday 18:00, totalling 8.3 hours. 11 of 14 stops were preceded by alarm #3017. Machine M-008 (same product, same shift) had 2 unplanned stops in the same period." The is/is-not matrix is filled from data. |
| Root cause identification (D4) | Ishikawa session generates 15 potential causes. The team debates which one is most likely. The most senior person's opinion wins. Verification: "Let's try it and see." | Ishikawa session generates 15 potential causes. MES alarm data eliminates 12 of them immediately (the alarm pattern is specific to one cause). Process parameter trend confirms: hydraulic pressure has been declining for 3 days. Root cause identified in 1 hour instead of 1 week. |
| Verification (D5) | "We changed the hydraulic filter. It seems better now." No before/after data. No statistical confidence. If the problem recurs in 3 months, nobody remembers whether the corrective action ever actually worked. | MES shows: alarm #3017 frequency before filter change: 11 occurrences/week. After: 0 occurrences in 4 weeks. Unplanned downtime on M-007 before: 8.3 hours/week. After: 0.4 hours/week. The corrective action is verified with data. |
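The before/after comparison in the verification row can be sketched as a simple rate calculation. The change date and alarm timestamps below are invented so that they reproduce the figures in the table (11 alarms per week before, none in the 4 weeks after), and `weekly_rate` is an illustrative helper. This is a trend comparison, not a statistical test.

```python
from datetime import datetime, timedelta


def weekly_rate(event_times, start, end):
    """Events per week in the half-open interval [start, end)."""
    weeks = (end - start) / timedelta(weeks=1)
    if weeks <= 0:
        raise ValueError("end must be after start")
    count = sum(1 for t in event_times if start <= t < end)
    return count / weeks


# Hypothetical date of the filter change.
FILTER_CHANGE = datetime(2026, 3, 2, 6, 0)

# Hypothetical alarm #3017 timestamps: 11 alarms in the week before the change.
ALARM_TIMES = [FILTER_CHANGE - timedelta(hours=12 * k) for k in range(1, 12)]

rate_before = weekly_rate(ALARM_TIMES, FILTER_CHANGE - timedelta(weeks=1), FILTER_CHANGE)
rate_after = weekly_rate(ALARM_TIMES, FILTER_CHANGE, FILTER_CHANGE + timedelta(weeks=4))

# Downtime figures taken from the table: 8.3 h/week before, 0.4 h/week after.
downtime_reduction_pct = round(100 * (8.3 - 0.4) / 8.3, 1)

print(f"alarm #3017: {rate_before:.1f}/week before, {rate_after:.1f}/week after")
print(f"unplanned downtime reduced by {downtime_reduction_pct} %")
```

With only a few weeks of data, a comparison like this shows direction rather than statistical confidence. A longer observation window or a formal test would strengthen the D5 evidence.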
The pattern is consistent across every SYMESTIC implementation: the MES does not replace the human investigator; it arms the investigator with evidence. At Neoperl, PLC (German: SPS) alarm correlation reduced scrap by 15 %. The MES did not perform the RCA, but it provided the alarm-to-defect correlation that the CI team used to identify and eliminate the root causes. At Schmiedetechnik Plettenberg, real-time transparency enabled faster root cause identification because "deviations were no longer recognised only in retrospect."
What is the difference between RCA and troubleshooting?
Troubleshooting restores the machine to running condition as quickly as possible — it fixes the immediate symptom. RCA investigates why the failure happened and prevents it from recurring. Both are necessary: troubleshooting keeps production running today; RCA keeps it running tomorrow. The problem arises when troubleshooting replaces RCA — the machine is restarted, the shift continues, and nobody investigates why it stopped. The MES makes this visible: if the same alarm recurs 3 times per week and each time the machine is simply restarted, the MES downtime Pareto shows a recurring loss that is being "fixed" but never solved.
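One way to surface the "restarted but never solved" pattern is to count how often the same alarm recurs on the same machine within a calendar week. The sketch below uses invented events and a hypothetical `recurring_alarms` helper. The threshold of 3 follows the example in the paragraph above.

```python
from collections import Counter
from datetime import date


def recurring_alarms(events, threshold=3):
    """Map (machine, alarm number, ISO year, ISO week) to its count, keeping counts >= threshold."""
    counts = Counter()
    for day, machine, alarm_no in events:
        iso_year, iso_week, _ = day.isocalendar()
        counts[(machine, alarm_no, iso_year, iso_week)] += 1
    return {key: n for key, n in counts.items() if n >= threshold}


# Hypothetical alarm events: (date, machine, alarm number).
EVENTS = [
    (date(2026, 1, 12), "M-007", 3017),
    (date(2026, 1, 13), "M-008", 4011),
    (date(2026, 1, 14), "M-007", 3017),
    (date(2026, 1, 16), "M-007", 3017),
]

FLAGGED = recurring_alarms(EVENTS)
for (machine, alarm_no, year, week), n in FLAGGED.items():
    print(f"{machine}: alarm #{alarm_no} occurred {n}x in {year}-W{week:02d}, RCA candidate")
```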
How many root causes should an RCA find?
Most production problems have 1–3 root causes, not 10. If your Ishikawa diagram has 20 validated root causes, you have not done an RCA — you have made a list of everything that is suboptimal. The discipline of RCA is narrowing down: which specific cause, if eliminated, would prevent this specific problem from recurring? The 80/20 rule applies: address the top 1–2 causes first, verify the effect, then reassess whether the remaining causes need separate attention.
What is the difference between occurrence cause and escape cause?
The occurrence cause is why the defect was created. The escape cause is why it was not detected before it reached the customer. Both are root causes — but for different failures. Example: the occurrence cause is "hydraulic pressure dropped below 240 bar, producing an underformed part." The escape cause is "the end-of-line inspection gauge was not calibrated, so the underformed part passed." The 8D report requires both. The MES helps with the occurrence cause (process data, alarm logs). The escape cause typically involves the inspection process — gauges, operators, sampling plans.
How does RCA relate to OEE?
OEE identifies the losses. RCA explains them. The OEE dashboard shows: Availability loss = 12 %, with "hydraulic failure" as the #1 downtime reason at 4.2 %. That is the starting point. The RCA asks: why does the hydraulic system fail? The 5 Whys or Ishikawa investigates the specific mechanism. The corrective action eliminates the cause. The OEE dashboard verifies the improvement: Availability loss drops from 12 % to 8 %. OEE without RCA is a dashboard. RCA without OEE has no prioritisation. Together, they form the continuous improvement loop that drives measurable results — exactly what SYMESTIC's production metrics module and alarms module are designed to support.
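The numbers in this example are consistent with each other, as a short calculation shows. The sketch assumes that the percentages are percentage points of planned production time and that the corrective action removes the hydraulic loss completely. Both are simplifying assumptions, and the function name is illustrative.

```python
def availability_loss_after_fix(total_loss_pct, eliminated_pct):
    """Remaining availability loss once one loss category is fully eliminated."""
    if not 0 <= eliminated_pct <= total_loss_pct:
        raise ValueError("eliminated loss must be between 0 and the total loss")
    return round(total_loss_pct - eliminated_pct, 1)


# Figures from the example: 12 % availability loss, 4.2 points from hydraulic failure.
remaining_loss = availability_loss_after_fix(12.0, 4.2)
availability_before = 100.0 - 12.0
availability_after = round(100.0 - remaining_loss, 1)

print(f"availability loss: 12.0 % -> {remaining_loss} %")
print(f"availability: {availability_before} % -> {availability_after} %")
```

Eliminating the 4.2-point hydraulic loss leaves 7.8 points, which matches the drop from 12 % to roughly 8 % described above.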
Related: Ishikawa Diagram · Poka Yoke · Kaizen · PDCA Cycle · Six Sigma · PPM · OEE Explained · SYMESTIC Production Metrics · SYMESTIC Alarms Module · SYMESTIC Process Data · MES: Definition & Functions