Validation, Reliability & Test · #47 of 52

Root Cause & FMEA

Five Whys, Fishbones, and the Number That Ranks Risk

A support ticket lands on a Friday afternoon: a handful of robot hands in the field run hot. Not all of them, not predictably, just some. Your first instinct is the worst one you have. You reach for the part you suspect, the motor driver that always runs warm, and you start swapping it on a hunch. Three boards and two days later the problem is still there, and now you have changed so many things you no longer know what was wrong to begin with. You have been guessing, and guessing is a coin flip wearing a lab coat.

The cure is not a smarter hunch. It is a method that refuses to let you stop at the first plausible story. You do not fix the failure you can see; you fix the cause that, if removed, makes the failure impossible.

Debugging by intuition works right up until it does not, and then it fails silently by sending you confidently down the wrong path. Root cause analysis is the discipline that replaces the hunch with an ordering: every failure has a chain behind it, and you walk that chain backward, link by link, until you reach the link that actually started it. Get that ordering right and the rest of this lesson (the five whys, the fishbone, the fault tree, the risk number) all turn out to be the same idea wearing different clothes.

By the end, you can

Walk a failure backward with the five whys from a symptom to a process-level root cause
Sort candidate causes into the six fishbone categories (machine, method, material, measurement, environment, people)
Calculate an FMEA risk priority number from severity, occurrence, and detection and rank failure modes by it
Choose between a top-down fault tree and a bottom-up FMEA for a given question
Sequence a containment-to-prevention investigation (reproduce, discriminate, compare, confirm, fix, screen)

Intuition first

Think of a failure the way a detective thinks of a crime scene. The overheating hand is the body on the floor. It is a symptom, the thing everyone points at, and it is almost never the thing you arrest. A bad detective books the first suspect who looks guilty (the warm motor driver) and goes home. A good detective keeps asking one boring question: and what put that there? Each answer is a new suspect, and you follow the chain until you reach the one whose removal would have prevented the whole scene.

That boring question is the entire trick. "Why is the hand hot?" Because a motor stalls and dumps current as heat. "Why does it stall?" Because a tendon binds. "Why does the tendon bind?" Because a pulley was pressed in crooked. "Why crooked?" Because the press fixture has play. "Why does the fixture have play?" Because nobody inspects it on a schedule. Five questions in, you have walked from a hot chip (a symptom you could have chased for a week) to a missing inspection process (a cause you can actually fix once and for all). Notice the journey: you did not get smarter at each step, you just refused to stop early.

The methods in this lesson are scaffolding for that refusal. The five whys keep you walking. The fishbone keeps you from tunnel-visioning on one branch. The fault tree handles failures that need several things to go wrong at once. And the FMEA, done before anything fails, lets you spend your worry where it pays off instead of spreading it thin across every part on the board.

The five whys: walk the chain backward

The simplest root-cause tool is also the oldest piece of factory wisdom: when something breaks, ask "why?" and then ask it again of the answer, about five times, until you arrive at something you can change in the process rather than patch on the part. The number five is a guideline, not a law. Stop when the next "why" stops giving you something actionable, which is usually around the fourth or fifth link.

The discipline is in what counts as an answer. Each "why" must point at the answer of the one before it, not wander off to a new topic, and each answer must be a fact you can verify, not a story you find satisfying. The classic trap is stopping at a symptom dressed up as a cause. "The board overheated because the chip got too hot" is not a why, it is the same sentence twice. A real chain ends at a process that is missing or broken, because a process is the thing whose repair stops the failure from coming back.
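One way to keep a chain honest is to write it down as data and check it mechanically. The sketch below is a minimal illustration, not a standard tool: the `Why` record and its `verified` and `is_process` flags are hypothetical names, and the flag values are assumptions filled in for the overheating-hand chain. It checks only two of the rules above (every answer is verified, and the chain ends at a process); whether each "why" really points at the previous answer still needs a human reader.

```python
from dataclasses import dataclass


@dataclass
class Why:
    question: str     # the "why" being asked
    answer: str       # the answer, which the next question should ask about
    verified: bool    # True only if the answer was checked as a fact (assumed here)
    is_process: bool  # True if the answer names a missing or broken process


def check_chain(chain):
    """Return a list of problems with a five-whys chain (empty list means OK)."""
    problems = []
    for i, link in enumerate(chain, start=1):
        if not link.verified:
            problems.append(f"why {i}: answer is not verified")
    if not chain or not chain[-1].is_process:
        problems.append("chain does not end at a process you can change")
    return problems


# The overheating-hand chain from this lesson; the flags are illustrative assumptions.
chain = [
    Why("Why is the hand hot?", "A motor stalls and dumps current as heat", True, False),
    Why("Why does it stall?", "A tendon binds", True, False),
    Why("Why does the tendon bind?", "A pulley was pressed in crooked", True, False),
    Why("Why crooked?", "The press fixture has play", True, False),
    Why("Why does the fixture have play?", "Nobody inspects it on a schedule", True, True),
]

print(check_chain(chain))      # the full chain passes both checks
print(check_chain(chain[:3]))  # stopping at the crooked pulley is flagged
```

Stopping three links in is flagged because a crooked pulley is a part-level fact, not a process you can repair.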

Taiichi Ohno (1912–1990): the architect of the Toyota Production System, who made asking why five times the heart of Toyota's scientific approach to problems. He insisted that repeating the question until the cause is clear is what separates fixing a symptom from fixing a system.

The five whys has real critics, and they are worth respecting. The depth is arbitrary: nothing guarantees the fifth answer is the true root rather than just the fifth in a line. It cannot find a cause you do not already know about, so it is only as deep as the person asking. And it tends to commit to a single chain, when most real failures have several contributing causes braided together. The fix for all three weaknesses is the same: do not run the five whys as a lonely straight line. Branch it, and let several "why" trails fan out at once. That branching is exactly what the next tool draws for you.

The fishbone: sort the suspects before you chase them

When you brainstorm causes for a stubborn failure, the ideas come in a jumble, and the danger is that you grab the first one and dig. The fishbone diagram (also called the Ishikawa or cause-and-effect diagram, after the Japanese quality pioneer Kaoru Ishikawa) is a way to lay every candidate out before you commit, so you can see the gaps as well as the suspects. You draw the problem as the fish's head on the right, a spine running left toward it, and major bones branching off the spine, each bone a category of cause. Then you hang specific suspects off the bones.

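In manufacturing the canonical categories are a set traditionally named so that each starts with the letter M, which is the only reason to memorize them as a set (two appear below under their plainer modern names, Environment and People, with the original M noted):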

Machine (the equipment): the press, the reflow oven, the test jig, the placement head. Is a tool worn, mis-set, or drifting?
Method (the process): the assembly steps, the firmware build, the order of operations. Is a step ambiguous, skipped, or in the wrong sequence?
Material (what goes in): the components, the solder paste, the cable stock, even the information on the BOM. Is a lot bad, a part substituted, a spec wrong?
Measurement (how you judge): the calibration of the meter, the test limits, the sensor reading you trust. Is the gauge lying to you?
Environment (the surroundings, sometimes counted as a sixth M, "mother nature"): temperature, humidity, vibration, ESD on the floor.
People (the human factor, the original "manpower" M): training, fatigue, a step that is easy to do wrong.

The value is not the fish. It is the forcing function: by making you put a suspect under each category, the diagram surfaces the branch you would have ignored. The engineer convinced it is a bad chip (Material) is gently forced to also ask whether the test jig is mis-reading (Measurement) or the press is crooked (Machine). Each bone then gets its own five-whys walk back to a root, so the two tools nest: the fishbone gives you the branches, the five whys gives you the depth on each branch.
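The forcing function can be sketched in a few lines. This is a minimal illustration under stated assumptions: the `empty_bones` helper and the dictionary layout are hypothetical, and the suspects are the three named in the paragraph above, with the remaining bones left empty on purpose.

```python
CATEGORIES = ["machine", "method", "material", "measurement", "environment", "people"]


def empty_bones(fishbone):
    """Return the categories that have no suspect hung on them yet."""
    return [c for c in CATEGORIES if not fishbone.get(c)]


# Suspects named in the text for the overheating hand (illustrative, incomplete).
fishbone = {
    "material": ["bad chip"],
    "measurement": ["test jig mis-reading"],
    "machine": ["press is crooked"],
}

print(empty_bones(fishbone))  # ['method', 'environment', 'people']
```

The empty bones are the prompt: each one is a category nobody has proposed a suspect for yet.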

The fault tree: when several things must go wrong at once

The five whys and the fishbone both follow chains of single causes. But some failures only happen when two or more things fail together, and for those you need a tool that speaks the language of "and" and "or". That tool is fault tree analysis (FTA), and it works in the opposite direction from an FMEA, the bottom-up method covered in the next section.

You start at the top with the one undesired event you care about (say, "the hand loses grip with a part in flight") and you work downward, asking how that top event could be produced. You connect causes with boolean gates. An OR gate means any one of its inputs is enough to cause the event above it; an AND gate means the event only happens if all of its inputs occur together. Keep expanding each branch into the lower-level faults that feed it, and you build a logic diagram of every path to the top event.

The payoff is twofold. First, the AND gates show you where redundancy already protects you: if two independent things both have to fail, the combined event is far rarer than either alone. Second, when you label each basic event with a failure probability, the tree does the arithmetic. For independent events feeding an AND gate, the probabilities multiply:

$$P(\text{A and B}) = P(A)\,P(B)$$

so two independent one-in-a-thousand faults combine to one in a million. An OR gate approximately adds, as long as the probabilities are small enough that two inputs rarely occur together:

$$P(\text{A or B}) \approx P(A) + P(B)$$
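The approximation comes from dropping one small term. For two independent events, the exact OR probability is one minus the chance that neither happens:

$$
P(\text{A or B}) = 1 - \bigl(1 - P(A)\bigr)\bigl(1 - P(B)\bigr) = P(A) + P(B) - P(A)\,P(B)
$$

With two one-in-a-thousand faults the dropped term $P(A)\,P(B)$ is $10^{-6}$ against a sum of $2 \times 10^{-3}$, so the plain sum overstates the risk very slightly, which errs on the safe side.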

The contrast with FMEA is the thing to carry. FMEA is inductive and bottom-up: it starts at a single component, asks how that one part can fail, and follows the effect upward. It is exhaustive at cataloguing single-point failures but blind to combinations. FTA is deductive and top-down: it starts at a system-level disaster and finds the combinations of faults that reach it. It captures multiple-failure logic but will not enumerate every single part the way an FMEA does. In serious safety work you run both, because each sees exactly what the other misses.

FMEA: the number that ranks where to worry

The five whys, fishbone, and fault tree are reactive in spirit: something broke, now find out why. Failure mode and effects analysis (FMEA) flips the timeline. You do it up front, before anything has failed, by walking down your list of parts and functions and asking, for each one, "how could this fail, and what would that do?" The reward is a ranked list of what to harden first, produced while the design is still cheap to change.

The ranking comes from a single number. For each failure mode you rate three things on a scale of 1 to 10, then multiply them into the risk priority number:

$$\text{RPN} = S \times O \times D$$

where

Severity ($S$): how bad the effect is if it happens. A cosmetic blemish is a 1; a fire or an injury is a 10.
Occurrence ($O$): how often the cause is expected to happen. Virtually never is a 1; almost inevitable is a 10.
Detection ($D$): how likely the failure is to slip past your tests and reach the customer. Note the direction, because it trips everyone up the first time. A high $D$ is bad: it means your screens probably will not catch it. Certain detection is a 1; invisible until the field is a 10.

Multiply the three and you get an RPN from 1 to 1000. A high number is a loud signal: this mode is severe, or common, or sneaky, or some painful blend. You sort the table by RPN and attack the top of the list. And here is the lever that makes detection so worth your attention: severity is often fixed by physics (a stall will make heat), and occurrence can be stubborn to drive down, but detection you can almost always improve by adding a test. A built-in temperature check that trips before damage drops $D$ from a 9 to a 2 and slashes the RPN without touching the failure itself.
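The sort-and-attack step is easy to sketch. The example below is a minimal illustration, not a full FMEA worksheet: the `rpn` helper is a hypothetical name, and the three failure modes and their ratings are made up for the purpose of showing the ranking.

```python
def rpn(severity, occurrence, detection):
    """Risk priority number: the product of three ratings, each from 1 to 10."""
    ratings = (("severity", severity), ("occurrence", occurrence), ("detection", detection))
    for name, value in ratings:
        if not 1 <= value <= 10:
            raise ValueError(f"{name} must be between 1 and 10, got {value}")
    return severity * occurrence * detection


# Hypothetical failure modes for illustration: (name, S, O, D).
modes = [
    ("label misprinted", 1, 6, 2),
    ("motor stall overheats hand", 8, 4, 9),
    ("connector works loose", 5, 3, 4),
]

# Highest RPN first: this is the order in which to attack the list.
ranked = sorted(modes, key=lambda m: rpn(m[1], m[2], m[3]), reverse=True)
for name, s, o, d in ranked:
    print(f"{rpn(s, o, d):4d}  {name}")
```

The range check matters more than it looks: a rating of 0 or 11 silently distorts every comparison in the table.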

The other half of the discipline is timing. An FMEA written after the hardware is built is a museum piece; written during design, it steers the design. And it is a living document: you re-run it after every real failure, feeding what you actually learned back into the occurrence and detection ratings so the next revision worries about the right things. The reactive tools (five whys, fishbone) and the proactive FMEA form a loop. The FMEA predicts; the field surprises you; the root-cause walk explains the surprise; and the FMEA is updated so the surprise is now a known, ranked, defended-against mode.

See it / Try it

Below the reliability curve is an FMEA panel. Type in a severity, an occurrence, and a detection for a failure mode you are worried about, and watch the risk priority number and its band update live. Start with the overheating hand: severity is high because a hot motor can damage the hand and burn a user, so set $S$ around 8. Say it happens to a few percent of units, an occurrence of maybe 4. And right now you have no temperature screen, so it sails past test undetected, a detection of 9 or 10.

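[Interactive panel: a reliability curve with shape (k) and scale (λ) controls, followed by an FMEA risk priority number calculator. Inputs: severity, occurrence, and detection (each 1–10). Readouts: RPN = S × O × D and a risk band. Captured state: k = 1.00, λ = 1.0, RPN 96, risk band Moderate.]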

Read the RPN, then change one number to feel the lever. Drag detection down to 2 (you have added a thermal cutoff that trips on the line) and watch the RPN collapse even though the failure itself is exactly as severe and exactly as common. That is the whole reason detection earns a column of its own: it is usually the cheapest of the three to move, and moving it is how a screen pays for itself. Now nudge severity up to 10 and notice the band climb hard. Severity is the rating you respect regardless of the product, because no amount of clever detection makes a fire acceptable.
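Worked by hand with the ratings suggested above (taking detection as 9, and assuming detection stays at 2 when severity is raised), the three panel states come out as:

$$
\begin{aligned}
\text{no screen:}\quad & 8 \times 4 \times 9 = 288 \\
\text{thermal cutoff added } (D = 2)\text{:}\quad & 8 \times 4 \times 2 = 64 \\
\text{severity raised to } 10\text{:}\quad & 10 \times 4 \times 2 = 80
\end{aligned}
$$

The cutoff takes the RPN to under a quarter of its starting value. Raising severity to 10 moves the product only from 64 to 80, which is one reason severity deserves a look of its own rather than being read through the RPN alone.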

Walking the five whys on the overheating hand, you reach: hot → motor stalls → tendon binds → pulley pressed in crooked → press fixture has play → no scheduled inspection of the fixture. Which answer is the root cause to fix?

Two failure modes: Mode A is severity 9, occurrence 2, detection 2. Mode B is severity 3, occurrence 8, detection 7. Which has the higher RPN, and what is the catch?

Lab: investigate the overheating batch

Run the worked example end to end on the bench instead of by hunch.

Reproduce it instrumented: take one of the boards that runs hot and wire it up to log current, position, and temperature together on a timeline (your #28 thermal DAQ work earns its keep here), so you are watching the failure happen with numbers instead of a fingertip.
Discriminate the category: the simultaneous traces let you split electrical (current spikes with no motion), mechanical (motion that needs abnormal current), control (the firmware commanding a stall or fighting itself), and thermal-path (normal current and motion but the heat has nowhere to go) without swapping a single part.
Compare good versus bad: run the same instrumented test on a known-good hand and overlay the traces, so the difference jumps out instead of hiding in absolute numbers.
Confirm the cause: reproduce it deliberately (if you suspect the crooked pulley, re-create the bind and watch the current climb on cue).
Fix the root: only now do you change anything.
Screen: add a production temperature or stall-current check so the next unit with this cause is caught in your factory, not your customer's hand.

Notice the order: you change exactly one understood thing, after the data has told you which one, which is the opposite of the Friday-afternoon guessing you started with.
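The good-versus-bad comparison can be sketched as a small script. Everything here is an assumption for illustration: the `largest_deviation` helper, the channel names, and the sample values are made up, and scaling each channel by the good unit's peak is just one simple way to put amps, degrees, and degrees Celsius on a comparable footing.

```python
def largest_deviation(good, bad):
    """Find where the bad unit departs most from the good one.

    good, bad: dicts mapping a channel name to a list of samples taken at
    the same instants. Differences are scaled by the peak absolute value
    of the good channel so that channels with different units compare.
    Returns (channel, sample_index, scaled_difference).
    """
    worst = ("", -1, 0.0)
    for channel, good_samples in good.items():
        # Fall back to 1.0 if the good channel is all zeros.
        scale = max(abs(g) for g in good_samples) or 1.0
        for i, (g, b) in enumerate(zip(good_samples, bad[channel])):
            diff = abs(b - g) / scale
            if diff > worst[2]:
                worst = (channel, i, diff)
    return worst


# Made-up traces for illustration only (same test, same timestamps).
good = {
    "current_a": [0.5, 0.6, 0.6, 0.5],
    "position_deg": [0.0, 30.0, 60.0, 90.0],
    "temp_c": [30.0, 31.0, 32.0, 33.0],
}
bad = {
    "current_a": [0.5, 0.6, 1.8, 1.9],
    "position_deg": [0.0, 30.0, 41.0, 42.0],
    "temp_c": [30.0, 32.0, 36.0, 41.0],
}

channel, index, diff = largest_deviation(good, bad)
print(channel, index, round(diff, 2))
```

In these invented traces the current climbs while the position stops advancing, so the script points at the current channel first. A result like that narrows where to look; it does not replace the deliberate confirm step.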

Why FMEA, FTA, and 8D are one toolkit, and where the RPN math is shaky

The four tools in this lesson are not rivals, they are stages of one investigation that the formal 8D (eight disciplines) process strings together. 8D came out of Ford in the late 1980s as a team-based march from a fresh problem to a permanent fix: D0 emergency response, D1 form a team, D2 describe the problem precisely (the who/what/where/when of it), D3 put in an interim containment so the customer stops getting bad units while you investigate, D4 find and verify the root cause and the "escape point" (the control that should have caught it but did not), D5 and D6 choose and implement a permanent corrective action, D7 change the broader system so this and similar problems cannot recur, and D8 thank the team. The root-cause tools slot directly into D4: 8D explicitly names the five whys and the Ishikawa fishbone as its workhorses there. The "escape point" idea is the formal version of "add a screen": every root cause has a partner question, which is why your test did not catch it, and fixing both the cause and the escape is what makes the fix permanent.

FMEA and FTA bracket the same investigation from opposite ends. FMEA is inductive (forward, bottom-up): it starts at each component, enumerates that part's failure modes, and traces effects upward, which makes it exhaustive at single-point failures but blind to combinations. FTA is deductive (backward, top-down): it starts at a system-level undesired event and finds the boolean combinations of basic faults that reach it, capturing the multiple-failure logic an FMEA cannot. The standard practice in safety-critical work (civil aerospace, nuclear, medical) is to run both and reconcile them, with the FMEA's basic failure modes feeding the fault tree's basic events. FTA itself was born in 1962 at Bell Labs, where it was developed to evaluate the launch control system of the Minuteman missile, and its use at NASA grew after the 1986 Challenger accident underlined the limits of relying on qualitative methods such as FMEA alone.

One honesty note where the conventional formula meets its critics. The RPN as $S \times O \times D$ is convenient and the convention this lesson teaches, but the three ratings are ordinal numbers (rankings, not measured quantities), and multiplying rankings has no well-defined meaning. The practical symptom is rank reversal: because the scales are ordinal (a severity of 8 is not numerically "twice" a 4), a less serious mode can land a higher RPN than a more serious one, exactly the trap in the second Check. The modern AIAG/VDA handbook (2019) responded by replacing the raw RPN with an action priority table that considers severity first, and some functional-safety variants drop the RPN entirely in favor of quantitative diagnostic-coverage metrics. The $S \times O \times D$ convention stays in this lesson as a teaching tool and a fast triage, but the disciplined engineer never lets a low RPN overrule a high severity, and treats the number as a sortable signal, not a measurement.
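The reversal is easy to see with the two modes from the second Check. The sketch below compares a plain RPN sort with a severity-first sort. The severity-first ordering here is only a tuple sort chosen for illustration; it is not the AIAG/VDA action priority table, which is a lookup table with its own rules.

```python
def rpn(s, o, d):
    """Risk priority number for ratings on the 1-10 scales."""
    return s * o * d


# The two modes from the second Check: (name, S, O, D).
modes = [("A", 9, 2, 2), ("B", 3, 8, 7)]

# Plain RPN ranking, highest first.
by_rpn = sorted(modes, key=lambda m: rpn(m[1], m[2], m[3]), reverse=True)

# Severity first, RPN only as a tie-breaker (an illustrative rule, not the AP table).
by_severity = sorted(modes, key=lambda m: (m[1], rpn(m[1], m[2], m[3])), reverse=True)

print([(m[0], rpn(m[1], m[2], m[3])) for m in by_rpn])  # [('B', 168), ('A', 36)]
print([m[0] for m in by_severity])                      # ['A', 'B']
```

Mode B wins on RPN by a wide margin, yet Mode A is the one that can hurt someone, which is the case a severity-first rule is meant to protect.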

Grounded in Wikipedia: "Failure mode and effects analysis", "Five whys", "Ishikawa diagram", "Fault tree analysis", "Eight disciplines problem solving" (CC BY-SA).

Key takeaways

Fix the cause whose removal makes the failure impossible, not the symptom you can see. Guessing changes too many things to learn from.
The five whys walk a failure backward to a process-level root; branch them, because real failures braid several causes together.
The fishbone forces breadth by sorting suspects into categories (machine, method, material, measurement, environment, people) before you dig.
FMEA ranks risk up front with $\text{RPN} = S \times O \times D$ (each 1–10, product 1–1000); re-run it after every field failure.
Detection is usually the cheapest factor to improve: adding a screen drops $D$ and the RPN without changing how severe or common the failure is.
FMEA is bottom-up and exhaustive on single faults; fault tree analysis is top-down and captures combinations. Serious work runs both.

Practice 1 warm-up

A failure mode is rated severity 7, occurrence 5, detection 4. Compute its RPN. Then you add an automated test that catches the fault almost every time, dropping detection to 2. What is the new RPN, and what does the change tell you about where to spend effort?

Worked solution

The original RPN is the product of the three ratings:

$$\text{RPN} = S \times O \times D = 7 \times 5 \times 4 = 140.$$

After improving detection from 4 to 2:

$$\text{RPN}_\text{new} = 7 \times 5 \times 2 = 70.$$

Adding the test halved the RPN (from 140 to 70) without changing how severe the failure is or how often it occurs. The failure is exactly as dangerous and exactly as frequent; you have just made it far less likely to reach the customer. That is why detection is the lever of first resort: a screen is usually cheaper to add than a redesign that lowers severity or occurrence, and it buys a large RPN drop.
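The halving is not a coincidence of these particular numbers. Because $S$ and $O$ are unchanged, they cancel in the ratio, so the RPN scales directly with the detection rating:

$$
\frac{\text{RPN}_\text{new}}{\text{RPN}} = \frac{7 \times 5 \times 2}{7 \times 5 \times 4} = \frac{2}{4} = \frac{1}{2}
$$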

Practice 2 core

A batch of finger-driver boards intermittently fails a self-test at the end of the line. Build a fishbone: name one concrete candidate cause under each of the six categories (machine, method, material, measurement, environment, people), then say how you would use the five whys on the one you think is most likely.

Worked solution

One plausible suspect per category:

Machine: the pick-and-place head is mis-calibrated and offsets a fine-pitch part.
Method: the reflow profile was changed and now under-heats one corner of the board.
Material: a reel of capacitors is from a substituted lot with wider tolerance.
Measurement: the test jig's pogo pins are worn, so the self-test reads a real board as failing (a false reject).
Environment: humidity on the floor causes intermittent shorts or ESD damage.
People: a hand-placed connector is sometimes seated one position off.

Suppose the intermittent nature points at the test jig (Measurement), since a real hardware fault would usually fail every time. Run the five whys: jig reports a fail → contact resistance at a pin is high → the pogo pin is worn → the jig has run far past its rated insertion count → there is no schedule to replace pins by cycle count. Root cause: a missing pin-replacement process, not a bad board at all. The fix is a maintenance schedule (and the escape point is that the jig had no self-check on its own contacts).

Practice 3 stretch

You want to bound the probability that a robot hand "drops a part in flight." A drop happens if the grip controller crashes OR both of the two independent grip-force sensors fail at once. The controller crash has probability $10^{-3}$ per operation; each sensor fails independently with probability $10^{-2}$. Decide whether a fault tree or a bottom-up FMEA is the right tool for this question, then estimate the top-event probability.

Worked solution

This is a fault tree question, not an FMEA one. You are asking how a system-level undesired event arises from a combination of lower faults joined by AND and OR logic, which is exactly what top-down FTA is built for. A bottom-up FMEA would dutifully list each component's failure modes but would not naturally express the "both sensors must fail together" combination.

Build the tree. The top event ("drop") sits above an OR gate with two inputs: the controller crash, and a sub-event "both sensors fail." That sub-event sits above an AND gate fed by the two independent sensor failures.

The AND gate multiplies (independent events):

$$P(\text{both sensors}) = (10^{-2})(10^{-2}) = 10^{-4}.$$

The OR gate adds (small, roughly exclusive probabilities):

$$P(\text{drop}) \approx P(\text{crash}) + P(\text{both sensors}) = 10^{-3} + 10^{-4} = 1.1 \times 10^{-3}.$$

So the drop probability is about $1.1 \times 10^{-3}$ per operation, and the tree makes the lesson obvious: the redundant sensor pair contributes only $10^{-4}$, while the single, unredundant controller dominates at $10^{-3}$. The next reliability dollar goes to the controller (a watchdog, a redundant lockstep core), not to a third sensor, because the AND gate has already made the sensors a non-problem.
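The same tree can be checked numerically. This is a minimal sketch that assumes all basic events are independent: `p_and` and `p_or` are hypothetical helper names, and `p_or` uses the exact independent-event formula so the result can be set beside the simple sum used in the solution.

```python
from math import prod


def p_and(*probs):
    """AND gate: every independent input must occur."""
    return prod(probs)


def p_or(*probs):
    """OR gate, exact for independent inputs: 1 minus P(no input occurs)."""
    return 1.0 - prod(1.0 - p for p in probs)


p_crash = 1e-3   # controller crash, per operation
p_sensor = 1e-2  # each grip-force sensor, per operation

p_both = p_and(p_sensor, p_sensor)    # both sensors fail together
p_drop_exact = p_or(p_crash, p_both)  # top event, exact
p_drop_approx = p_crash + p_both      # top event, small-probability sum

print(f"{p_both:.3e}  {p_drop_exact:.6e}  {p_drop_approx:.6e}")
```

The exact and approximate top-event values agree to about four significant figures, so the hand calculation is more than good enough at these probabilities.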

The hard part of a failure was never the soldering or the firmware. It is the discipline to not believe your first good story. A hunch feels like progress and a swapped part feels like work, but both are coin flips until the data speaks. So you ask why one more time than is comfortable, you sort the suspects before you chase them, you rank your worry with a number you can defend, and you add the screen that catches the next one in your factory instead of your customer's hand. You do not fix the failure you can see. You hunt down the cause that, once gone, makes that failure impossible.
