Validation, Reliability & Test · #46 of 52

Reliability: HALT, Arrhenius, Weibull & the Bathtub

Breaking It on Purpose to Learn Where It Breaks

A thousand finger-driver boards leave your line and scatter into the world, bolted into robot hands that will flex ten million times a year. You will never see most of them again. Yet a customer two continents away will, in eighteen months, hold one that has died, and they will want to know why. You cannot wait eighteen months to find out, and you cannot test a part to death at room temperature without watching it for a decade. So you do something that feels almost violent: you take a healthy board and you cook it, shake it, and over-volt it until it screams.

You do not trust a part because it survived. You trust it because you learned exactly how to kill it, and then you designed that death out.

Reliability is the discipline of turning "it works on my bench" into a number with units of time. The whole field rests on one ordering idea: failures are not all the same, and they do not arrive at random across a product's life. They sort themselves into a sequence (early duds, then a long quiet middle, then wear-out at the end), and once you know which regime a failure belongs to, you know which tool fixes it. Get that ordering right and everything below (HALT, burn-in, Arrhenius, Weibull) snaps into place, each tool answering a different question.

By the end, you can

Distinguish HALT (overstress to find design margins) from HASS / burn-in (screen production units for infant mortality)
Use the Arrhenius rule of thumb to estimate how much a temperature rise accelerates aging
Read the Weibull shape parameter $k$ and name the failure regime it describes
Map the three Weibull regimes onto the three regions of the bathtub curve
Choose the right environmental test for a given suspected failure mode

Intuition first

Imagine you run a fleet of taxis and you want to know when they break down. You watch them for years and you notice three eras. In the first few weeks, a handful die almost immediately: a bad weld from the factory, a hose that was crimped wrong on the line. These are the lemons, and once they are gone they are gone. Then comes a long, calm middle stretch of many years where breakdowns happen, but rarely, and for no pattern you can predict: a rock through a radiator, a random electrical gremlin. Finally, near the end, the survivors all start failing together (engines, transmissions, suspension) because they are simply worn out.

Those three eras are the bathtub curve: a high failure rate at the start that drops as the lemons leave, a flat low middle, and a rising tail as wear-out sets in. Plot the failure rate against age and it dips in the middle like a bathtub seen from the side.

Now the trick. Each era needs a different response, and reliability engineering is mostly about knowing which era you are looking at:

The early duds are a screening problem. You do not redesign the taxi, you weed out the bad ones before they ship. That is burn-in.
The random middle is a design and redundancy problem. You cannot screen for it (the survivors look identical), so you build margin and backups.
The wear-out tail is a lifetime problem. You estimate when it arrives and you either over-build the worn part or schedule replacement before it lets go.

The genius of HALT, Arrhenius, and the Weibull curve is that they let you see all three eras without actually waiting a decade.

HALT and HASS: overstress versus screen

The slowest, most honest way to measure lifetime is to run parts at their normal conditions and wait. For anything reliable, that wait is absurd. So you accelerate. But there are two completely different reasons to push a part past its rating, and confusing them is the classic beginner mistake.

HALT (highly accelerated life test) deliberately overstresses a design to find its margins and its weakest link. You ramp temperature past spec, add vibration, swing the supply voltage, and you keep pushing until something breaks. The point is not to predict field life. The point is to discover where the design gives out first and how much headroom you have between "rated" and "dead". You run HALT on a handful of units during development, and every failure is a gift: it tells you the next thing to make stronger.

HASS (highly accelerated stress screen) and the older, gentler burn-in do the opposite. They screen production units to catch infant mortality before shipping. You take every unit (or a sample) and exercise it under stress long enough that the lemons fail in your factory instead of in the field. For electronics the stress is usually elevated temperature and sometimes elevated voltage, a practice also called heat soaking. The survivors are trusted to have climbed out of the steep early part of the bathtub.

So the one-line distinction to carry: HALT breaks a few prototypes on purpose to harden the design; HASS/burn-in stresses production units to catch the weak ones. One asks "how strong is this design?"; the other asks "is this particular unit a dud?"

Arrhenius: buying time with heat

To accelerate a life test you need a model that connects stress to aging, otherwise you are just guessing how a hot week maps to a cold decade. For temperature, that model is the Arrhenius relationship, borrowed from chemistry. Most aging in electronics (oxide growth, electromigration, electrolyte dry-out, corrosion) is a chemical or diffusion process whose rate climbs exponentially with temperature.

The rate constant follows

k = A \, \exp\!\left(\frac{-E_a}{k_B \, T}\right)

where $T$ is absolute temperature in $\text{kelvin}$, $E_a$ is the activation energy of the failure mechanism, $k_B$ is the Boltzmann constant, and $A$ is a prefactor. (This rate constant $k$ is not the Weibull shape parameter $k$ that appears later in the lesson.) The shape that matters: as $T$ goes up, the exponent gets less negative, and the rate climbs fast.

For bench intuition you rarely plug in $E_a$ . You lean on the rule of thumb that falls out of the equation for typical activation energies: reaction rate roughly doubles for every $10\ \text{°C}$ rise in temperature. Run a part $30\ \text{°C}$ hotter than its use condition and you are aging it about $2^3 = 8$ times faster. A week at the elevated temperature then stands in for roughly two months at the normal one. That single factor of two per ten degrees is the lever that makes accelerated life testing practical.

Svante Arrhenius · 1859-1927. Swedish chemist and Nobel laureate whose 1889 equation tied reaction rate to temperature. A century later it became the *acceleration model* that lets reliability engineers trade a hot oven for years of waiting.

The acceleration factor between a test temperature and a use temperature is just the ratio of the two rates:

\text{AF} = \exp\!\left[\frac{E_a}{k_B}\left(\frac{1}{T_\text{use}} - \frac{1}{T_\text{test}}\right)\right]

Multiply your test hours by $\text{AF}$ to get equivalent field hours. The catch (and it is a real one) is that $E_a$ is mechanism-specific. Pick the wrong activation energy and your AF can be off by an order of magnitude, which is why a careful lab confirms the mechanism before trusting the number.
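To see how strongly the choice of $E_a$ moves the answer, here is a minimal sketch of the acceleration-factor formula above next to the doubling rule of thumb. It assumes temperatures are entered in °C and activation energies in electron-volts; the function names and the three sample $E_a$ values are illustrative assumptions, not data for any particular mechanism.

```python
import math

K_B_EV = 8.617333262e-5  # Boltzmann constant in eV/K


def arrhenius_af(ea_ev, t_use_c, t_test_c):
    """Acceleration factor from the Arrhenius ratio of rates.

    ea_ev    : activation energy in eV (mechanism-specific, assumed known)
    t_use_c  : use temperature in deg C
    t_test_c : test temperature in deg C
    """
    t_use_k = t_use_c + 273.15    # the formula needs absolute temperature
    t_test_k = t_test_c + 273.15
    return math.exp((ea_ev / K_B_EV) * (1.0 / t_use_k - 1.0 / t_test_k))


def rule_of_thumb_af(t_use_c, t_test_c):
    """'Doubles every 10 deg C' estimate, for comparison only."""
    return 2.0 ** ((t_test_c - t_use_c) / 10.0)


if __name__ == "__main__":
    t_use, t_test = 40.0, 70.0
    print(f"rule of thumb: AF = {rule_of_thumb_af(t_use, t_test):.1f}")
    # Hypothetical activation energies spanning a low, middle, and high case.
    for ea in (0.3, 0.7, 1.0):
        print(f"Ea = {ea:.1f} eV: AF = {arrhenius_af(ea, t_use, t_test):.1f}")
```

Running it for a 40 °C use point and a 70 °C chamber shows the spread between the low and high $E_a$ cases, which is the reason the mechanism has to be identified before the multiplier is trusted.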

Weibull and the bathtub, made quantitative

The bathtub curve is the picture. The Weibull distribution is the math that lets you fit it to real failure-time data and read off which era you are in.

A Weibull fit has two knobs: a scale $\lambda$ (roughly, the characteristic life, the age by which about 63 percent have failed) and a shape parameter $k$ . The shape is the one that tells the story. The instantaneous failure rate, called the hazard, is

h(t) = \frac{k}{\lambda}\left(\frac{t}{\lambda}\right)^{k-1}

Look at the exponent $k - 1$ and the whole bathtub falls out:

$k \lt 1$ : the hazard decreases with time. Failures cluster early and thin out as the duds leave. This is infant mortality, the down-slope at the start of the bathtub.
$k = 1$ : the hazard is constant. Failures arrive at a steady, memoryless rate with no trend. This is the random useful-life floor, the flat bottom. (Here Weibull reduces to the plain exponential distribution.)
$k \gt 1$ : the hazard increases with time. Parts get more likely to fail as they age. This is wear-out, the rising tail.

So a single number, $k$ , classifies the regime, and the three regimes are the three regions of the bathtub. That is the deep link of this lesson: the bathtub is not a vague cartoon, it is what you get when a population is a mix of an early-dominant Weibull ( $k \lt 1$ ), a constant floor ( $k = 1$ ), and a late-dominant Weibull ( $k \gt 1$ ). Fit your field returns to a Weibull, read $k$ , and you know whether to screen harder, add redundancy, or shorten the service interval.
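The three cases can be checked numerically. The sketch below evaluates the hazard formula above at two ages and labels the regime from $k$. The tolerance band around $k = 1$ is an assumption added for illustration (a fitted $k$ is never exactly 1), and the function names are hypothetical.

```python
def weibull_hazard(t, k, lam):
    """Weibull hazard h(t) = (k/lam) * (t/lam)**(k - 1), defined for t > 0."""
    if t <= 0 or k <= 0 or lam <= 0:
        raise ValueError("t, k and lam must all be positive")
    return (k / lam) * (t / lam) ** (k - 1)


def regime(k, tol=0.05):
    """Name the bathtub region for a shape parameter k.

    tol is an assumed tolerance for treating k as 'about 1'.
    """
    if k < 1 - tol:
        return "infant mortality (hazard falling)"
    if k > 1 + tol:
        return "wear-out (hazard rising)"
    return "random useful life (hazard flat)"


if __name__ == "__main__":
    for k in (0.5, 1.0, 1.8):
        early = weibull_hazard(0.5, k, lam=1.0)
        late = weibull_hazard(2.0, k, lam=1.0)
        print(f"k = {k}: h(0.5) = {early:.3f}, h(2.0) = {late:.3f} -> {regime(k)}")
```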

Portrait of Waloddi Weibull — Waloddi Weibull · 1887-1979 Swedish engineer and former coast guard major whose 1939 distribution, with one flexible *shape* parameter, could describe infant mortality, random, and wear-out failures alike. It is the statistical backbone of the bathtub curve. read more →

See it / Try it

Sweep the shape parameter and watch the bathtub regimes appear. Drag $k$ below 1 and the hazard slopes down (infant mortality). Park it at 1 and the curve flattens (random life). Push it above 1 and the tail rises (wear-out). The scale $\lambda$ slides the characteristic life left and right without changing the regime.

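Interactive panel: Weibull hazard curve with sliders for shape $k$ and scale $\lambda$, followed by an FMEA panel that computes the risk priority number RPN = S × O × D from severity, occurrence, and detection (each scored 1–10) and reports a risk band.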

Notice that $k$ alone decides the shape of the story while $\lambda$ only stretches it in time. That is exactly why reliability reports lead with the shape parameter: it answers "what kind of failure is this?" before anyone argues about "how long until it happens?" The FMEA panel below the curve previews the next lesson, where severity, occurrence, and detection combine into a single risk-priority number.
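The risk-priority number itself is only a product of three scores. A minimal sketch, assuming each score is an integer from 1 to 10 as on the panel; the function name is hypothetical, and no risk bands are computed because this lesson does not state the band thresholds.

```python
def rpn(severity, occurrence, detection):
    """Risk priority number: the product S * O * D, each scored 1 to 10."""
    scores = (("severity", severity),
              ("occurrence", occurrence),
              ("detection", detection))
    for name, score in scores:
        if not 1 <= score <= 10:
            raise ValueError(f"{name} must be between 1 and 10, got {score}")
    return severity * occurrence * detection


if __name__ == "__main__":
    # One hypothetical combination of scores.
    print(rpn(severity=4, occurrence=4, detection=6))
```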

During development you cook three prototype boards well past their rated temperature, add vibration, and ramp the supply until each one fails, just to find the weakest part. What is this?

You fit your field-return failure times to a Weibull distribution and get a shape parameter k = 2.3. Which region of the bathtub are these failures in, and what is the fix?

Lab: a one-week accelerated screen

On the bench, build the smallest honest reliability experiment you can. Take a batch of the finger-driver boards and split them: a few go into a chamber held at, say, $30\ \text{°C}$ above their normal junction temperature with the motor rail exercised on a duty cycle, while the rest stay at room temperature as a control. Log time-to-failure for every unit (this is where your #28 thermal DAQ skills earn their keep). After the run, plot the failure times on Weibull axes: $\ln t$ on the horizontal axis and $\ln[-\ln(1 - F)]$ on the vertical, where $F$ is the cumulative failure fraction. A straight line means a single dominant mechanism, and its slope is $k$. Read the regime off the slope, apply the Arrhenius rule of thumb ($2^{30/10} = 8\times$) to translate chamber hours into field hours, and you have a defensible first estimate of service life from a single week of cooking. Document the chamber temperature, the assumed activation energy, and your control group, because every one of those is a place the number can lie.
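The Weibull-plot step can be sketched in a few lines. This version assumes every chamber unit failed during the run (no survivors, so no censoring) and estimates each unit's failure fraction with Benard's median-rank approximation, $(i - 0.3)/(n + 0.4)$; both are assumptions added here, not part of the lab description. The function name and the sample hours are hypothetical.

```python
import math


def weibull_plot_fit(failure_times):
    """Fit a straight line on Weibull axes and return (k, lam).

    x = ln(t), y = ln(-ln(1 - F)); the slope is k and the
    intercept is -k * ln(lam). Assumes complete (uncensored) data.
    """
    times = sorted(failure_times)
    n = len(times)
    if n < 2 or times[0] <= 0:
        raise ValueError("need at least two positive failure times")

    xs, ys = [], []
    for i, t in enumerate(times, start=1):
        f = (i - 0.3) / (n + 0.4)  # Benard's median-rank estimate of F
        xs.append(math.log(t))
        ys.append(math.log(-math.log(1.0 - f)))

    # Ordinary least-squares line through the plotted points.
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("failure times must not all be identical")
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))

    k = sxy / sxx
    intercept = y_mean - k * x_mean
    lam = math.exp(-intercept / k)
    return k, lam


if __name__ == "__main__":
    chamber_hours = [41, 63, 77, 95, 112, 130]  # hypothetical log
    k, lam_chamber = weibull_plot_fit(chamber_hours)
    af = 2 ** (30 / 10)  # rule-of-thumb acceleration for +30 deg C
    print(f"k = {k:.2f}, chamber life = {lam_chamber:.0f} h, "
          f"field life = {lam_chamber * af:.0f} h")
```

If some units are still alive when the week ends, they carry information too, and a fit that ignores them will be biased toward short lives; a censoring-aware method is the better choice in that case.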

Why the Weibull hazard slopes the way it does, and where the bathtub model breaks

The Weibull cumulative failure function is $F(t) = 1 - \exp[-(t/\lambda)^k]$ , and its hazard (the failure rate among survivors) is $h(t) = (k/\lambda)(t/\lambda)^{k-1}$ . The entire behavior lives in that $k-1$ exponent on time. When $k \lt 1$ the exponent is negative, so $h(t)$ falls as $t$ grows: a population riddled with weak units sheds them early and the survivors are sturdier, which is exactly infant mortality. When $k = 1$ the exponent is zero, the time-dependence vanishes, and $h(t) = 1/\lambda$ is a flat constant; the distribution collapses to the exponential, the unique memoryless lifetime where a used part is statistically as good as new. When $k \gt 1$ the exponent is positive and $h(t)$ climbs without bound, the mathematical signature of accumulating damage (fatigue cracks, electromigration, dielectric wear). A convenient anchor: at $t = \lambda$ the cumulative failure fraction is $1 - e^{-1} \approx 0.632$ regardless of $k$ , which is why $\lambda$ is called the characteristic life.
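The hazard follows from the cumulative failure function in one step, using the standard definition of hazard as the failure density divided by the surviving fraction. The symbols $S(t)$ (survival function) and $f(t)$ (failure density) are introduced here for the derivation only:

```latex
\begin{aligned}
S(t) &= 1 - F(t) = \exp\!\left[-\left(\frac{t}{\lambda}\right)^{k}\right], \\
f(t) &= \frac{dF}{dt} = \frac{k}{\lambda}\left(\frac{t}{\lambda}\right)^{k-1}\exp\!\left[-\left(\frac{t}{\lambda}\right)^{k}\right], \\
h(t) &= \frac{f(t)}{S(t)} = \frac{k}{\lambda}\left(\frac{t}{\lambda}\right)^{k-1}.
\end{aligned}
```

The exponential factor cancels in the ratio, which is why the hazard is a pure power law in $t$.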

The bathtub itself is then a superposition: an early-dominant Weibull with $k \lt 1$ , a constant exponential floor, and a late-dominant Weibull with $k \gt 1$ , summed into one hazard curve that dips in the middle. It is a model, not a law. Real products often deviate: software-heavy or well-screened electronics can skip a visible infant-mortality hump, and some populations never reach a clean wear-out wall because they are retired first. Knowing where a given product actually sits on its bathtub requires a large fleet and real return data, which is why early life estimates carry wide error bars.
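A minimal sketch of that superposition, summing an early Weibull hazard, a constant floor, and a late Weibull hazard. Every parameter value below is a made-up illustration in hours, chosen only so the dip is visible; none comes from real return data.

```python
def weibull_hazard(t, k, lam):
    """Weibull hazard h(t) = (k/lam) * (t/lam)**(k - 1), for t > 0."""
    return (k / lam) * (t / lam) ** (k - 1)


def bathtub_hazard(t, early=(0.5, 1.0e6), floor_rate=1.0e-5, late=(4.0, 5.0e4)):
    """Total hazard as the sum of three contributions.

    early      : (k, lam) with k < 1, infant mortality (assumed values)
    floor_rate : constant random-failure rate per hour (assumed value)
    late       : (k, lam) with k > 1, wear-out (assumed values)
    """
    return (weibull_hazard(t, *early)
            + floor_rate
            + weibull_hazard(t, *late))


if __name__ == "__main__":
    for hours in (10, 100, 1_000, 5_000, 20_000, 60_000):
        print(f"{hours:>6} h: h = {bathtub_hazard(hours):.2e} per hour")
```

With these assumed numbers the total hazard is high in the first hours, bottoms out in the middle, and climbs again toward the late end, which is the bathtub profile described above.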

One honesty note: the "doubles every $10\ \text{°C}$" Arrhenius rule is a rule of thumb, not a constant of nature. It corresponds to one particular activation energy near a given temperature; a mechanism with a higher $E_a$ accelerates faster than that, a lower one slower. Reputable accelerated testing names the mechanism and its $E_a$ before quoting an acceleration factor, and validates that the stress did not change why the part fails (melting a component is not "faster aging", it is a different failure entirely). The rule of thumb is a fine first estimator, but the lab confirms the mechanism before trusting the multiplier.

Grounded in Wikipedia: "Reliability engineering", "Accelerated life testing", "Weibull distribution", "Bathtub curve", "Arrhenius equation", "Burn-in" (CC BY-SA).

Key takeaways

HALT overstresses a few prototypes to find design margins and the weakest link; HASS / burn-in screens production units to catch infant mortality.
Arrhenius links temperature to aging: as a rule of thumb the rate roughly doubles per 10 °C, so heat buys you accelerated time.
The Weibull shape parameter names the regime: $k \lt 1$ decreasing hazard (infant mortality), $k = 1$ constant (random), $k \gt 1$ increasing (wear-out).
Those three regimes are the three regions of the bathtub curve.
Match the test to the suspected mode: temp cycling, humidity, vibration/shock, IP ingress, each probes a different failure mechanism.

Practice 1 warm-up

A part runs at $40\ \text{°C}$ in the field. You run an accelerated test at $70\ \text{°C}$ . Using the rule of thumb that aging rate doubles every $10\ \text{°C}$ , roughly how many field hours does one test hour represent?

Worked solution

The temperature rise is $70 - 40 = 30\ \text{°C}$ , which is three doublings. The acceleration factor is

\text{AF} = 2^{30/10} = 2^3 = 8.

So one hour in the chamber stands in for about $8$ hours at the use temperature. A one-week ($168\ \text{hour}$) test represents roughly $168 \times 8 = 1344$ field hours, about eight weeks of normal service. (Caveat: this assumes the failure mechanism matches the activation energy behind the rule of thumb and that the higher temperature did not change how the part fails.)

Practice 2 core

Three different products come back from the field. Product A fails mostly in its first two weeks, then quiets down. Product B fails at a steady trickle with no age pattern. Product C is fine for two years, then a wave of failures hits all at once. For each, name the bathtub region, the rough Weibull shape parameter, and the right response.

Worked solution

Product A is in infant mortality: failures cluster early and the rate decreases with age. Weibull $k \lt 1$ . Response: screen the duds out with burn-in or HASS, and chase the manufacturing root cause so you can eventually drop the screen.
Product B is in the random useful-life region: constant hazard, no trend. Weibull $k = 1$ (the exponential case). You cannot screen for it because survivors look identical, so the response is design margin and redundancy.
Product C is in wear-out: the hazard increases with age and the survivors fail together. Weibull $k \gt 1$ . Response: find the life limit, over-build the worn item, or schedule replacement before the wall.

Practice 3 stretch

You burn in every shipped board for $48\ \text{hours}$ at elevated temperature, and it does clear the early failures. But a colleague argues the screen is "obviously free insurance, so make it $500\ \text{hours}$ to be safe." Using the bathtub and Weibull picture, explain why a longer burn-in can make your product less reliable, not more.

Worked solution

Burn-in only helps while you are on the down-slope of the bathtub, the infant-mortality region where Weibull $k \lt 1$ and the hazard is falling. Once the duds are gone, extra stress is no longer removing weak units, it is simply aging the good ones. Every hour of stress consumes wear-out budget, sliding each surviving board along its lifetime toward the rising tail ( $k \gt 1$ ). A $500\ \text{hour}$ screen would ship boards that have already spent a chunk of their useful life in your oven, so their wear-out wall arrives sooner in the customer's hands.

The right length is the shortest burn-in that reliably clears infant mortality, found by watching when the failure rate flattens out, then stopping. And the better long-term move is to eliminate the root cause of the early failures so the screen can be shortened or removed entirely. "Longer is safer" confuses the screening problem (a few duds) with the wear-out problem (the whole population), which need opposite responses.

You will never watch most of your boards die. They will fail quietly, years from now, in hands you will never shake. The work of reliability is to bring those distant failures forward into a chamber on your bench, where heat compresses years into days and a single curve tells you whether the death was a dud, bad luck, or old age. You break a few on purpose so the rest can live, and you learn the shape of their dying so you can design that shape away.
