Validation, Reliability & Test · #45 of 52

Characterization, Verification & Validation

Three Words People Use Interchangeably and Should Not

The reliability bathtub curve: failure rate high early, flat through midlife, then rising at the end. — The bathtub curve. Failure rate starts high from infant mortality, flattens into a long random-failure midlife, then climbs again as parts wear out. · Wyatts and McSush, Public domain

The robot hand closes around a strawberry without crushing it, holds, and sets it down. Everyone in the room claps. Then someone asks the only question that matters: "How do you know it does that, and not just that it did it once?" The demo proved nothing about grip force at 5 percent battery, nothing about the thousandth cycle, nothing about whether the spec you built to was the spec the customer actually needed.

A demo is a story. Engineering is the evidence that the story is repeatable, correct, and the right story to be telling.

Three words get thrown around as if they meant the same thing, and they cost real money when they collide: characterization, verification, and validation. They are not synonyms and they are not interchangeable. They answer three different questions, in a fixed order, and skipping one of them is how a product that passed every test on the bench still fails in a customer's hand. The whole discipline rests on getting the ordering and the questions straight.

By the end, you can

Distinguish characterization, verification, and validation by the question each one answers
Trace a validation plan from requirement → test case → coverage
Use a Shmoo / Vmin–Vmax sweep with a guardband to set a margined operating point
Choose quantitative KPIs (grip force, backlash, current, thermal rise, error rate, cycle life, MTBF) and pass/fail targets for the robot hand

Intuition first

Imagine you have built a kitchen knife and you want to sell it. Three different people walk up to your bench, and each asks a different question.

The first is a materials scientist. She does not care whether the knife is good or bad. She wants to map it: how sharp is it cold, how sharp at 80 °C, how it dulls over a thousand cuts, how the edge holds across the whole range of conditions it might ever see. She is doing characterization. She is drawing the map, not judging the territory.

The second is a quality inspector with the drawing in hand. He measures the blade length, the hardness, the edge angle, and checks each one against the number on the print. "Spec says 200 mm; it measures 200.1 mm; pass." He is doing verification. His question is: did we build it to the spec? Did we build the thing right?

The third is the cook who will actually use it. She does not have the drawing and does not care about it. She slices an onion, dices a carrot, and asks whether this knife does what a cook needs a knife to do. She is doing validation. Her question is: did we build the right thing? The spec could be perfect and perfectly met, and she could still hand it back because the handle is wrong for a real grip.

Characterization maps. Verification checks against the spec. Validation checks against the need. You need all three, and you need them in that order, because you cannot verify against a spec you have not characterized your way into, and you cannot validate a thing you have not verified.

Characterization: draw the map

Characterization is the act of measuring how the thing behaves across the conditions it will actually meet, before you decide whether that behaviour is acceptable. It is deliberately judgment-free. You are not pass/failing anything yet. You are building the dataset that everything downstream will lean on.

For the robot hand, characterization means sweeping the real variables and recording the response. How does grip force vary with battery voltage from full to nearly empty? How does positional accuracy drift as the actuators warm from cold-start to a half hour of continuous work? How does the current per finger change with the load it is holding? You hold one variable, sweep another, and write down what the hand does. Out of that comes a map: a surface of behaviour over voltage, temperature, load, and time.

The reason characterization comes first is subtle. A spec written without characterization is a guess. If nobody has measured that grip force sags 18 percent between a full and a near-empty cell, the spec will not mention it, verification will never test for it, and the first time anyone discovers it is when a customer's hand drops a glass at 5 percent battery. Characterization is how the real edges of the behaviour become visible so the spec can be written around them.

Verification: did we build it to the spec?

Verification is the disciplined comparison of what you built against what the spec said to build. The spec is the contract. Each line of it is a claim ("grip force accurate to within 5 percent of the commanded value"), and verification is the test that turns each claim into a measured pass or fail. The output of verification is a coverage statement: every requirement, tested, with a result.

Verification is internal and unforgiving in a narrow way. It does not ask whether the spec is wise. It asks only whether the build honours it. A board can pass every verification test and still be the wrong board, because verification is downstream of the spec and inherits whatever the spec got wrong. That is exactly why verification alone is not enough, and why validation has to follow it.

This is also where margining lives. A requirement that says "operates at 3.3 V" is not really verified by checking that it works at exactly 3.3 V. The supply rail moves: it sags under load, it ripples, the regulator has tolerance. So verification sweeps the operating point and looks for the window in which the part still works, then checks that the window is comfortably wider than the conditions the part will actually see. The gap between "where it stops working" and "where it has to work" is the guardband, and a verification that does not measure a guardband has not really verified anything.

Margaret Hamilton · b. 1936 Led the Apollo flight-software team and helped make software engineering a named discipline. Her insistence on rigorous, requirement-traced testing, verifying every path against the spec, is the lineage of every validation plan that maps requirements to test cases to coverage.

Validation: did we build the right thing?

Validation steps outside the spec and asks the harder question: does this product meet the actual need? The cook with the onion. The customer whose hand has to pick strawberries in a real packing shed, not in your lab. Validation can fail even when every verification test passed, because the spec itself can be wrong, incomplete, or written against an assumption that never held.

A validation plan is the structure that makes "does it meet the need" answerable instead of hand-wavy. It flows in three stages:

  NEEDS / REQUIREMENTS  →  TEST CASES  →  COVERAGE
  "hold an egg without    "grip an egg-      "12 of 12
   crushing it"            mass object,        requirements
                           force ≤ 4 N,        have a passing
                           50 trials"          test case"

You start from the need, decompose it into specific requirements, write a concrete test case for each requirement (with numbers, conditions, and a pass/fail line), and then track coverage: the fraction of requirements that have a passing test case behind them. A requirement with no test case is an unvalidated claim. A test case that does not trace back to a requirement is wasted effort. Coverage is the bookkeeping that keeps the two in lockstep.

Choosing the KPIs for the robot hand

None of this works without numbers. "Ship it when it feels reliable" is not an engineering decision, it is a vibe, and vibes do not survive contact with a thousand customers. Each property you care about needs a quantitative KPI and a target you declared before you tested. For the robot hand, the core KPIs are:

Grip-force accuracy and repeatability: how close the actual force is to the commanded force, and how tightly it clusters when you repeat the same command.
Positional accuracy and backlash: how close a fingertip lands to its target, and the dead motion when a joint reverses direction.
Current and power per actuator: what each finger draws, which feeds the power budget and the thermal model.
Thermal rise vs duty cycle: how hot the actuators get as a function of how hard and how continuously they work.
Comms error rate: the bit or message error rate on the buses tying the fingers to the controller.
Cycle life: how many open/close cycles before a defined degradation.
MTBF / fleet failure rate: the mean time between failures and the rate at which a whole deployed fleet fails.

Accuracy and repeatability are not the same KPI and the difference matters. Accuracy is how close you land to the target; repeatability is how tightly your shots cluster to each other. A hand that always grips 1.0 N too hard is inaccurate but perfectly repeatable, and you can calibrate that bias out. A hand that scatters randomly around the target is unrepeatable, and no calibration saves it. Characterization measures both; verification checks both against their spec'd targets; validation asks whether those targets were the right ones for the job.

A new finger-driver board grips with a force that is always exactly 0.8 N higher than commanded, every single trial. Which KPI is poor, and what does it imply?

Your grip controller passes every test in its verification matrix against the written spec, but in a customer trial it cannot hold a wet, slippery fruit. What failed?

Lab: build the requirement-to-coverage table

Take one need for the robot hand, say "hold a raw egg without breaking it," and walk it all the way to coverage on paper. Decompose the need into requirements with numbers (grip force at or below the shell's crush threshold, force repeatability tight enough that no trial spikes over it, a hold time). Write one concrete test case per requirement: the object, the condition, the trial count, and the explicit pass/fail line. Then tally coverage: how many of your requirements have a passing test case, and which need is still un-covered. You should finish with a small table whose last column is a fraction ("11 of 12 requirements covered") and a named gap. That gap is your next test, not your ship date.

Why a single passing run is not validation, and what a guardband really buys you

The most expensive misconception in this whole topic is "validation equals run it once." Reliability is defined as a probability that a system performs its intended function for a specified period under stated conditions, and a probability cannot be measured by a single sample. One successful grip tells you the success probability is greater than zero and essentially nothing more. To make a claim like "fewer than 1 in 1000 grips fail" you need a sample large enough to bound that rate with real statistical confidence, which is why reliability test plans trade test time and unit count against the confidence level you are willing to claim. A demo is $n = 1$ ; validation is a sample with error bars.

The same statistical honesty drives margining. A Shmoo plot sweeps two parameters (classically supply voltage against clock frequency, or for the hand, supply voltage against commanded grip force) and shades every cell where the device passes. The boundary of the passing region is the operating edge. You never want your nominal operating point sitting on that edge, because the edge itself moves with temperature, ageing, and part-to-part spread. The guardband is the deliberate distance you keep between the nominal point and the measured failure edge. If the device stops working below some minimum voltage $V_{\min}$ , and the worst-case rail the hand will ever see is $V_{\text{op}}$ , then the fractional guardband is

G = \frac{V_{\text{op}} - V_{\min}}{V_{\min}}

and a healthy design keeps $G$ comfortably positive (often $G \ge 0.1$ , a 10 percent margin) across every corner, not just at room temperature with a fresh part. The corners are the combinations of worst-case conditions: lowest voltage, highest temperature, slowest silicon, highest load, all at once. A part that passes at nominal and fails at a corner has zero guardband where it counts.

There is a historical reason this taxonomy is so strict. The U.S. military formalized reliability engineering in the 1940s precisely because vacuum-tube radar electronics kept failing in the field after passing on the bench, and in 1950 the Department of Defense's AGREE group recommended three pillars that still echo here: improve component reliability, set reliability requirements for suppliers, and collect field data and find root causes. That last pillar is validation and fleet-MTBF tracking by another name. The field is where the real need lives, and the field is the only place that tells you whether the spec you verified against was the right spec at all.

One honest conflict to flag: the grounding source is blunt that quantitative reliability prediction (MTBF, failure rates) is "by far the most uncertain design parameter in any design" and warns against "playing the numbers game." That is real. Our insistence on quantitative KPIs is not a claim that the numbers are precise prophecy. It is a claim that a declared, measured number with error bars beats a feeling, because a number can be argued with, tracked over a fleet, and corrected. The source's caution and our rule agree on the deeper point: understand why and how things fail, not just chase a single predicted when.

Grounded in Wikipedia: "Reliability engineering" (CC BY-SA).

Key takeaways

Characterization maps behaviour across conditions; it judges nothing, it just measures the real edges so the spec can be written around them.
Verification asks did we build the thing right? Build vs. spec, every requirement tested, coverage reported.
Validation asks did we build the right thing? Product vs. need; it can fail even when every verification test passed.
A validation plan flows requirement → test case → coverage; a requirement with no test case is an unvalidated claim.
Margin with a guardband from a Shmoo / Vmin–Vmax sweep, checked at the worst-case corners, not just at nominal.
Ship on numbers, not feelings: grip force, backlash, current, thermal rise, error rate, cycle life, and MTBF each need a target declared before you test.

Practice 1 warm-up

For each activity, name whether it is characterization, verification, or validation: (a) sweeping grip force versus battery voltage from 4.2 V down to 3.0 V and logging the curve; (b) checking the measured fingertip position against the ±0.5 mm number written in the spec; (c) handing the hand to a warehouse picker and asking whether it can do their actual shift of work.

Show worked solution

(a) Characterization: you are mapping behaviour across a condition (voltage) with no pass/fail judgment, just building the curve. (b) Verification: you are comparing the build against a number the spec dictates (±0.5 mm). Did we build it right? (c) Validation: you are testing the product against the real need (a real shift of real work), independent of what the spec happens to say. Did we build the right thing?

Practice 2 core

A finger driver stops gripping reliably below $V_{\min} = 3.1$ V (measured at 25 °C with a typical part). The hand's battery, under worst-case load and at end of life, can sag to $V_{\text{op}} = 3.4$ V. Compute the fractional guardband. Is roughly 10 percent enough, and what is the one thing this single-corner number does not tell you?

Show worked solution

The fractional guardband is

G = \frac{V_{\text{op}} - V_{\min}}{V_{\min}} = \frac{3.4 - 3.1}{3.1} = \frac{0.3}{3.1} \approx 0.097,

about a 9.7 percent margin, just under the 10 percent rule of thumb. That is thin and worth a hard look. But the bigger issue: $V_{\min} = 3.1$ V was measured at one corner (25 °C, a typical part). The real failure edge moves with temperature and part-to-part spread, so a hot board with a slow part might fail at, say, 3.3 V, which would collapse the guardband to almost nothing. A single-corner $V_{\min}$ does not tell you where the edge sits at the worst-case corner, and the guardband must hold there, not at room temperature.

Practice 3 stretch

Your team did a flashy demo: the hand picked up one strawberry, on camera, and everyone agreed it "feels reliable, let's ship." Write the case against shipping in terms of characterization, verification, and validation, and outline the minimum requirement-to-coverage plan you would demand first for just the grip-force property.

Show worked solution

The demo is $n = 1$ , which is none of the three disciplines done properly. Validation is not "run it once": one grip cannot bound a failure probability, so "feels reliable" is a vibe, not a measured KPI. The case against shipping:

Characterization is missing: nobody has mapped grip force versus battery voltage, temperature, and load, so we do not even know the edges of the behaviour.
Verification is missing: there is no written grip-force spec (target value, accuracy, repeatability) and therefore nothing to test the build against.
Validation is missing: one fruit, one condition, is not the real need (a range of objects, a full shift, slippery and odd shapes).

Minimum requirement-to-coverage plan for grip force alone:

Requirement: commanded grip force accurate to ±5 percent, repeatable to ±2 percent ( $1\sigma$ ), across the rail from 4.2 V to 3.0 V and from 25 °C to 60 °C actuator temp.
Test cases: a Vmin–Vmax / temperature sweep characterizing the force curve; then a verification run of, say, 50 trials at each corner checking accuracy and repeatability against the ±5 percent / ±2 percent targets; then a validation set gripping a basket of real objects (egg, strawberry, wet fruit, hard tool) for a full duty cycle.
Coverage: track the fraction of those requirements with a passing test case, and name every gap. Ship when coverage is complete at the worst-case corners, not when the camera caught a lucky grip.

The demo that earns applause and the product that earns trust are not the same artifact. One is a story told once. The other is a stack of evidence: a map of how the hand behaves at every edge, a matrix proving it was built to its spec, and a plan proving the spec was the right thing to build. Three questions, asked in order, each in its own honest words. Did we measure it, did we build it right, and did we build the right thing. Answer all three with numbers, and the strawberry survives the thousandth time, not just the first.

📐 Characterization, Verification & Validation

By the end, you can

Intuition first

Characterization: draw the map

Verification: did we build it to the spec?

Validation: did we build the right thing?

Choosing the KPIs for the robot hand

Lab: build the requirement-to-coverage table

Key takeaways

Characterization, Verification & Validation