Validation, Reliability & Test · #46 of 48

Reliability: HALT, Arrhenius, Weibull & the Bathtub

Breaking It on Purpose to Learn Where It Breaks

A thousand finger-driver boards leave your line and scatter into the world, bolted into robot hands that will flex ten million times a year. You will never see most of them again. Yet a customer two continents away will, in eighteen months, hold one that has died, and they will want to know why. You cannot wait eighteen months to find out, and you cannot test a part to death at room temperature without watching it for a decade. So you do something that feels almost violent: you take a healthy board and you cook it, shake it, and over-volt it until it screams.

You do not trust a part because it survived. You trust it because you learned exactly how to kill it, and then you designed that death out.

Reliability is the discipline of turning “it works on my bench” into a number with units of time. The whole field rests on one ordering idea: failures are not all the same, and they do not arrive at random across a product’s life. They sort themselves into a sequence (early duds, then a long quiet middle, then wear-out at the end), and once you know which regime a failure belongs to, you know which tool fixes it. Get that ordering right and everything below (HALT, burn-in, Arrhenius, Weibull) snaps into place as answers to different questions.

By the end, you can

  1. Distinguish HALT (overstress to find design margins) from HASS / burn-in (screen production units for infant mortality)
  2. Use the Arrhenius rule of thumb to estimate how much a temperature rise accelerates aging
  3. Read the Weibull shape parameter $k$ and name the failure regime it describes
  4. Map the three Weibull regimes onto the three regions of the bathtub curve
  5. Choose the right environmental test for a given suspected failure mode

Intuition first

Imagine you run a fleet of taxis and you want to know when they break down. You watch them for years and you notice three eras. In the first few weeks, a handful die almost immediately: a bad weld from the factory, a hose that was crimped wrong on the line. These are the lemons, and once they are gone they are gone. Then comes a long, calm middle stretch of many years where breakdowns happen, but rarely, and for no pattern you can predict: a rock through a radiator, a random electrical gremlin. Finally, near the end, the survivors all start failing together (engines, transmissions, suspension) because they are simply worn out.

Those three eras are the bathtub curve: a high failure rate at the start that drops as the lemons leave, a flat low middle, and a rising tail as wear-out sets in. Plot the failure rate against age and it dips in the middle like a bathtub seen from the side.

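Now the trick. Each era needs a different response, and reliability engineering is mostly about knowing which era you are looking at:

  • Early failures (the lemons): screen them out before shipping and chase the manufacturing root cause.
  • Random failures in the long middle: you cannot screen for them, so you rely on design margin and redundancy.
  • Wear-out at the end: find the life limit, then over-build the worn item or schedule replacement before it arrives.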

The genius of HALT, Arrhenius, and the Weibull curve is that they let you see all three eras without actually waiting a decade.

HALT and HASS: overstress versus screen

The slowest, most honest way to measure lifetime is to run parts at their normal conditions and wait. For anything reliable, that wait is absurd. So you accelerate. But there are two completely different reasons to push a part past its rating, and confusing them is the classic beginner mistake.

HALT (highly accelerated life test) deliberately overstresses a design to find its margins and its weakest link. You ramp temperature past spec, add vibration, swing the supply voltage, and you keep pushing until something breaks. The point is not to predict field life. The point is to discover where the design gives out first and how much headroom you have between “rated” and “dead”. You run HALT on a handful of units during development, and every failure is a gift: it tells you the next thing to make stronger.

HASS (highly accelerated stress screen) and the older, gentler burn-in do the opposite. They screen production units to catch infant mortality before shipping. You take every unit (or a sample) and exercise it under stress long enough that the lemons fail in your factory instead of in the field. For electronics the stress is usually elevated temperature, sometimes with elevated voltage, a process also called heat soaking. The survivors are trusted to have climbed out of the steep early part of the bathtub.

So the one-line distinction to carry: HALT breaks a few prototypes on purpose to harden the design; HASS/burn-in stresses production units to catch the weak ones. One asks “how strong is this design?”; the other asks “is this particular unit a dud?”

Arrhenius: buying time with heat

To accelerate a life test you need a model that connects stress to aging, otherwise you are just guessing how a hot week maps to a cold decade. For temperature, that model is the Arrhenius relationship, borrowed from chemistry. Most aging in electronics (oxide growth, electromigration, electrolyte dry-out, corrosion) is a chemical or diffusion process whose rate climbs exponentially with temperature.

The rate constant follows

$$k = A \, \exp\!\left(\frac{-E_a}{k_B \, T}\right)$$

where $T$ is absolute temperature in kelvin, $E_a$ is the activation energy of the failure mechanism, $k_B$ is the Boltzmann constant, and $A$ is a prefactor. The shape that matters: as $T$ goes up, the exponent gets less negative, and the rate climbs fast.

For bench intuition you rarely plug in $E_a$. You lean on the rule of thumb that falls out of the equation for typical activation energies: reaction rate roughly doubles for every $10\ \text{°C}$ rise in temperature. Run a part $30\ \text{°C}$ hotter than its use condition and you are aging it about $2^3 = 8$ times faster. A week at the elevated temperature then stands in for roughly two months at the normal one. That single factor of two per ten degrees is the lever that makes accelerated life testing practical.
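The rule of thumb is simple enough to turn into a two-line calculator. The sketch below is only an illustration: the function name `rule_of_thumb_af` and the 40 °C / 70 °C temperatures are assumptions chosen to give a 30 °C rise, not values from a specific part.

```python
def rule_of_thumb_af(t_use_c: float, t_test_c: float, doubling_step_c: float = 10.0) -> float:
    """Acceleration factor under the 'rate doubles every 10 degC' rule of thumb."""
    return 2.0 ** ((t_test_c - t_use_c) / doubling_step_c)


# Assumed example temperatures: 40 degC in use, 70 degC in the chamber.
af = rule_of_thumb_af(t_use_c=40.0, t_test_c=70.0)

week_hours = 7 * 24                 # one week in the chamber
field_hours = week_hours * af       # equivalent hours at the use temperature

print(f"acceleration factor: {af:.0f}")
print(f"one chamber week ~ {field_hours:.0f} field hours ({field_hours / week_hours:.0f} weeks)")
```

Keeping the doubling step as a parameter makes it easy to see how sensitive the answer is if the real mechanism doubles every 8 °C or every 15 °C instead.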

Svante Arrhenius (1859-1927): Swedish chemist and Nobel laureate whose 1889 equation tied reaction rate to temperature. A century later it became the acceleration model that lets reliability engineers trade a hot oven for years of waiting.

The acceleration factor between a test temperature and a use temperature is just the ratio of the two rates:

$$\text{AF} = \exp\!\left[\frac{E_a}{k_B}\left(\frac{1}{T_\text{use}} - \frac{1}{T_\text{test}}\right)\right]$$

Multiply your test hours by $\text{AF}$ to get equivalent field hours. The catch (and it is a real one) is that $E_a$ is mechanism-specific. Pick the wrong activation energy and your AF can be off by an order of magnitude, which is why a careful lab confirms the mechanism before trusting the number.
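To see how strongly the acceleration factor depends on the activation energy, the formula can be evaluated for a few candidate values. This is a minimal sketch: the function name `arrhenius_af`, the 40 °C / 70 °C temperatures, and the three activation energies are illustrative assumptions, not measured values for any mechanism.

```python
import math

K_B_EV = 8.617333262e-5  # Boltzmann constant in eV per kelvin


def arrhenius_af(ea_ev: float, t_use_c: float, t_test_c: float) -> float:
    """Arrhenius acceleration factor between a use and a test temperature (in degC)."""
    t_use_k = t_use_c + 273.15
    t_test_k = t_test_c + 273.15
    return math.exp((ea_ev / K_B_EV) * (1.0 / t_use_k - 1.0 / t_test_k))


# Assumed activation energies (eV), picked only to show the spread.
for ea in (0.3, 0.7, 1.0):
    af = arrhenius_af(ea, t_use_c=40.0, t_test_c=70.0)
    print(f"Ea = {ea:.1f} eV -> AF = {af:.1f}")
```

For the same 30 °C rise, the lowest and highest of these assumed energies give acceleration factors that differ by roughly a factor of ten, which is the order-of-magnitude risk described above.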

Weibull and the bathtub, made quantitative

The bathtub curve is the picture. The Weibull distribution is the math that lets you fit it to real failure-time data and read off which era you are in.

A Weibull fit has two knobs: a scale $\lambda$ (roughly, the characteristic life, the age by which about 63 percent have failed) and a shape parameter $k$. The shape is the one that tells the story. The instantaneous failure rate, called the hazard, is

$$h(t) = \frac{k}{\lambda}\left(\frac{t}{\lambda}\right)^{k-1}$$

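Look at the exponent $k - 1$ and the whole bathtub falls out:

  • $k \lt 1$: the exponent is negative, so the hazard falls with age (infant mortality).
  • $k = 1$: the exponent is zero, so the hazard is constant (random failures).
  • $k \gt 1$: the exponent is positive, so the hazard rises with age (wear-out).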

So a single number, $k$, classifies the regime, and the three regimes are the three regions of the bathtub. That is the deep link of this lesson: the bathtub is not a vague cartoon, it is what you get when a population is a mix of an early-dominant Weibull ($k \lt 1$), a constant floor ($k = 1$), and a late-dominant Weibull ($k \gt 1$). Fit your field returns to a Weibull, read $k$, and you know whether to screen harder, add redundancy, or shorten the service interval.
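The classification can be checked numerically by evaluating the hazard early and late in life for each regime. The sketch below assumes a hypothetical characteristic life of 1000 hours and an arbitrary tolerance band around $k = 1$; both are illustration choices, not part of the Weibull model.

```python
def weibull_hazard(t: float, k: float, lam: float) -> float:
    """Weibull hazard h(t) = (k / lam) * (t / lam) ** (k - 1), valid for t > 0."""
    return (k / lam) * (t / lam) ** (k - 1.0)


def regime(k: float, tol: float = 0.05) -> str:
    """Name the bathtub region for a shape parameter; tol is an arbitrary band around 1."""
    if k < 1.0 - tol:
        return "infant mortality"
    if k > 1.0 + tol:
        return "wear-out"
    return "random (constant hazard)"


LAM = 1000.0  # hypothetical characteristic life, hours

for k in (0.5, 1.0, 2.3):
    early = weibull_hazard(100.0, k, LAM)
    late = weibull_hazard(900.0, k, LAM)
    print(f"k = {k}: h(100 h) = {early:.2e}, h(900 h) = {late:.2e} -> {regime(k)}")
```

A real fit never returns exactly $k = 1$, so in practice the decision rests on whether the confidence interval for $k$ sits clearly below, around, or above 1.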

Waloddi Weibull (1887-1979): Swedish engineer and former coast guard major whose 1939 distribution, with one flexible shape parameter, could describe infant mortality, random, and wear-out failures alike. It is the statistical backbone of the bathtub curve.

See it / Try it

Sweep the shape parameter and watch the bathtub regimes appear. Drag $k$ below 1 and the hazard slopes down (infant mortality). Park it at 1 and the curve flattens (random life). Push it above 1 and the tail rises (wear-out). The scale $\lambda$ slides the characteristic life left and right without changing the regime.

[Interactive panel: FMEA risk priority number, RPN = S × O × D. Example reading: RPN 96, risk band Moderate.]

Notice that $k$ alone decides the shape of the story while $\lambda$ only stretches it in time. That is exactly why reliability reports lead with the shape parameter: it answers “what kind of failure is this?” before anyone argues about “how long until it happens?” The FMEA panel below the curve previews the next lesson, where severity, occurrence, and detection combine into a single risk-priority number.

During development you cook three prototype boards well past their rated temperature, add vibration, and ramp the supply until each one fails, just to find the weakest part. What is this?

You fit your field-return failure times to a Weibull distribution and get a shape parameter k = 2.3. Which region of the bathtub are these failures in, and what is the fix?

Lab: a one-week accelerated screen

On the bench, build the smallest honest reliability experiment you can. Take a batch of the finger-driver boards and split them: a few go into a chamber held at, say, $30\ \text{°C}$ above their normal junction temperature with the motor rail exercised on a duty cycle, the rest stay at room temperature as a control. Log time-to-failure for every unit (this is where your #28 thermal DAQ skills earn their keep). After the run, plot the failure times on Weibull axes ($\ln t$ on the horizontal axis against $\ln(-\ln(1 - F))$ on the vertical, where $F$ is the cumulative failure fraction); a straight line means a single dominant mechanism, and its slope is $k$. Read the regime off the slope, apply the Arrhenius rule of thumb ($2^{30/10} = 8\times$) to translate chamber hours into field hours, and you have a defensible first estimate of service life from a single week of cooking. Document the chamber temperature, the assumed activation energy, and your control group, because every one of those is a place the number can lie.
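The Weibull-plot step of the lab can be sketched in plain Python. Several things here are assumptions rather than part of the lab description: the failure times are made up, the failure fraction is estimated with Bernard's median-rank approximation $(i - 0.3)/(n + 0.4)$, every unit is assumed to have failed (no censoring of survivors), and the line is fitted by ordinary least squares.

```python
import math


def fit_weibull_plot(failure_times):
    """Fit a straight line on Weibull axes; returns (k, lam) from slope and intercept."""
    times = sorted(failure_times)
    n = len(times)
    xs, ys = [], []
    for i, t in enumerate(times, start=1):
        f = (i - 0.3) / (n + 0.4)                 # Bernard's median-rank estimate of F
        xs.append(math.log(t))                    # horizontal axis: ln t
        ys.append(math.log(-math.log(1.0 - f)))   # vertical axis: ln(-ln(1 - F))
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    k = sxy / sxx                                 # slope is the shape parameter
    intercept = y_mean - k * x_mean               # intercept equals -k * ln(lam)
    lam = math.exp(-intercept / k)
    return k, lam


# Hypothetical chamber failure times in hours, for illustration only.
chamber_hours = [31.0, 52.0, 70.0, 88.0, 104.0, 121.0, 140.0, 163.0]
k_hat, lam_hat = fit_weibull_plot(chamber_hours)
print(f"shape k ~ {k_hat:.2f}, characteristic life ~ {lam_hat:.0f} chamber hours")
```

If some boards survive the week, they are censored observations and this simple ranking no longer applies; a rank-adjustment or maximum-likelihood fit would be needed instead.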

Why the Weibull hazard slopes the way it does, and where the bathtub model breaks

The Weibull cumulative failure function is $F(t) = 1 - \exp[-(t/\lambda)^k]$, and its hazard (the failure rate among survivors) is $h(t) = (k/\lambda)(t/\lambda)^{k-1}$. The entire behavior lives in that $k-1$ exponent on time. When $k \lt 1$ the exponent is negative, so $h(t)$ falls as $t$ grows: a population riddled with weak units sheds them early and the survivors are sturdier, which is exactly infant mortality. When $k = 1$ the exponent is zero, the time-dependence vanishes, and $h(t) = 1/\lambda$ is a flat constant; the distribution collapses to the exponential, the unique memoryless lifetime where a used part is statistically as good as new. When $k \gt 1$ the exponent is positive and $h(t)$ climbs without bound, the mathematical signature of accumulating damage (fatigue cracks, electromigration, dielectric wear). A convenient anchor: at $t = \lambda$ the cumulative failure fraction is $1 - e^{-1} \approx 0.632$ regardless of $k$, which is why $\lambda$ is called the characteristic life.
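The step from the cumulative failure function to the hazard is short enough to write out. This sketch uses the standard definition of the hazard as the failure density divided by the surviving fraction; the symbols $R(t)$ and $f(t)$ are introduced here for the derivation only.

$$
\begin{aligned}
R(t) &= 1 - F(t) = \exp\!\left[-\left(\frac{t}{\lambda}\right)^{k}\right] \\
f(t) &= \frac{dF}{dt} = \frac{k}{\lambda}\left(\frac{t}{\lambda}\right)^{k-1}\exp\!\left[-\left(\frac{t}{\lambda}\right)^{k}\right] \\
h(t) &= \frac{f(t)}{R(t)} = \frac{k}{\lambda}\left(\frac{t}{\lambda}\right)^{k-1}
\end{aligned}
$$

The exponential factor cancels in the last line, which is why only the power of $t$ survives and the sign of $k - 1$ alone sets the slope.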

The bathtub itself is then a superposition: an early-dominant Weibull with $k \lt 1$, a constant exponential floor, and a late-dominant Weibull with $k \gt 1$, summed into one hazard curve that dips in the middle. It is a model, not a law. Real products often deviate: software-heavy or well-screened electronics can skip a visible infant-mortality hump, and some populations never reach a clean wear-out wall because they are retired first. Knowing where a given product actually sits on its bathtub requires a large fleet and real return data, which is why early life estimates carry wide error bars.
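The superposition is easy to reproduce by adding three hazard terms. Every parameter in this sketch is a made-up illustration (shape and scale values, the floor rate, and the three sample ages), chosen only so that the dip is visible.

```python
def weibull_hazard(t: float, k: float, lam: float) -> float:
    """Weibull hazard h(t) = (k / lam) * (t / lam) ** (k - 1), valid for t > 0."""
    return (k / lam) * (t / lam) ** (k - 1.0)


def bathtub_hazard(t: float) -> float:
    """Sum of an early-dominant term, a constant floor, and a late-dominant term."""
    early = weibull_hazard(t, k=0.5, lam=20000.0)   # infant mortality, k < 1 (assumed)
    floor = 1.0 / 50000.0                           # constant random-failure rate (assumed)
    late = weibull_hazard(t, k=4.0, lam=30000.0)    # wear-out, k > 1 (assumed)
    return early + floor + late


# Sample the curve early, mid-life, and late (hours).
for age in (10.0, 5000.0, 40000.0):
    print(f"h({age:>7.0f} h) = {bathtub_hazard(age):.2e} failures per hour")
```

With these assumed numbers the hazard is highest at 10 hours, lowest at 5000 hours, and rising again at 40000 hours, which is the bathtub profile.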

One honesty note where the grounding and the bench diverge: the “doubles every $10\ \text{°C}$” Arrhenius rule is a rule of thumb, not a constant of nature. It corresponds to one particular activation energy at one particular temperature; a mechanism with a higher $E_a$ accelerates faster than that, a lower one slower. Reputable accelerated testing names the mechanism and its $E_a$ before quoting an acceleration factor, and validates that the stress did not change why the part fails (melting a component is not “faster aging”, it is a different failure entirely). The rule of thumb works as a first estimator, but the lab confirms the mechanism before trusting the multiplier.
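One way to see what the rule of thumb silently assumes is to solve the Arrhenius acceleration factor for the activation energy that gives exactly a factor of two over a 10 °C step. The function name `ea_for_doubling` and the two starting temperatures are assumptions for illustration.

```python
import math

K_B_EV = 8.617333262e-5  # Boltzmann constant in eV per kelvin


def ea_for_doubling(t_start_c: float, step_c: float = 10.0) -> float:
    """Activation energy (eV) for which the rate exactly doubles over one step."""
    t1_k = t_start_c + 273.15
    t2_k = t1_k + step_c
    # Solve 2 = exp((Ea / k_B) * (1/T1 - 1/T2)) for Ea.
    return K_B_EV * math.log(2.0) / (1.0 / t1_k - 1.0 / t2_k)


# Assumed starting temperatures: near room temperature and near a hot enclosure.
for t_c in (25.0, 85.0):
    print(f"doubling from {t_c:.0f} to {t_c + 10:.0f} degC implies Ea ~ {ea_for_doubling(t_c):.2f} eV")
```

The implied energy shifts with the starting temperature, so the same rule of thumb stands for different mechanisms at different operating points.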

Grounded in Wikipedia: “Reliability engineering”, “Accelerated life testing”, “Weibull distribution”, “Bathtub curve”, “Arrhenius equation”, “Burn-in” (CC BY-SA).

Key takeaways

  • HALT overstresses a few prototypes to find design margins and the weakest link; HASS / burn-in screens production units to catch infant mortality.
  • Arrhenius links temperature to aging: as a rule of thumb the rate roughly doubles per 10 °C, so heat buys you accelerated time.
  • The Weibull shape parameter names the regime: $k \lt 1$ decreasing hazard (infant mortality), $k = 1$ constant (random), $k \gt 1$ increasing (wear-out).
  • Those three regimes are the three regions of the bathtub curve.
  • Match the test to the suspected mode: temp cycling, humidity, vibration/shock, IP ingress, each probes a different failure mechanism.
Practice 1 warm-up

A part runs at $40\ \text{°C}$ in the field. You run an accelerated test at $70\ \text{°C}$. Using the rule of thumb that aging rate doubles every $10\ \text{°C}$, roughly how many field hours does one test hour represent?

Worked solution

The temperature rise is $70 - 40 = 30\ \text{°C}$, which is three doublings. The acceleration factor is

$$\text{AF} = 2^{30/10} = 2^3 = 8.$$

So one hour in the chamber stands in for about $8$ hours at the use temperature. A one-week ($168\ \text{hour}$) test represents $168 \times 8 = 1344$ field hours, about eight weeks of normal service. (Caveat: this assumes the failure mechanism matches the activation energy behind the rule of thumb and that the higher temperature did not change how the part fails.)

Practice 2 core

Three different products come back from the field. Product A fails mostly in its first two weeks, then quiets down. Product B fails at a steady trickle with no age pattern. Product C is fine for two years, then a wave of failures hits all at once. For each, name the bathtub region, the rough Weibull shape parameter, and the right response.

Worked solution
  • Product A is in infant mortality: failures cluster early and the rate decreases with age. Weibull $k \lt 1$. Response: screen the duds out with burn-in or HASS, and chase the manufacturing root cause so you can eventually drop the screen.
  • Product B is in the random useful-life region: constant hazard, no trend. Weibull $k = 1$ (the exponential case). You cannot screen for it because survivors look identical, so the response is design margin and redundancy.
  • Product C is in wear-out: the hazard increases with age and the survivors fail together. Weibull $k \gt 1$. Response: find the life limit, over-build the worn item, or schedule replacement before the wall.
Practice 3 stretch

You burn in every shipped board for $48\ \text{hours}$ at elevated temperature, and it does clear the early failures. But a colleague argues the screen is “obviously free insurance, so make it $500\ \text{hours}$ to be safe.” Using the bathtub and Weibull picture, explain why a longer burn-in can make your product less reliable, not more.

Worked solution

Burn-in only helps while you are on the down-slope of the bathtub, the infant-mortality region where Weibull $k \lt 1$ and the hazard is falling. Once the duds are gone, extra stress is no longer removing weak units, it is simply aging the good ones. Every hour of stress consumes wear-out budget, sliding each surviving board along its lifetime toward the rising tail ($k \gt 1$). A $500\ \text{hour}$ screen would ship boards that have already spent a chunk of their useful life in your oven, so their wear-out wall arrives sooner in the customer’s hands.

The right length is the shortest burn-in that reliably clears infant mortality, found by watching when the failure rate flattens out, then stopping. And the better long-term move is to eliminate the root cause of the early failures so the screen can be shortened or removed entirely. “Longer is safer” confuses the screening problem (a few duds) with the wear-out problem (the whole population), which need opposite responses.
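The trade-off can be made concrete with a toy model. This is a sketch under loudly stated assumptions: each board carries three independent hazards (early, random, wear-out) with made-up parameters, chamber hours age the board 8 times faster than field hours (the rule-of-thumb factor for a 30 °C rise), and reliability is judged over a hypothetical 20000 hour mission for boards that survived the screen.

```python
import math


def cumulative_hazard(age_h: float) -> float:
    """Cumulative hazard H(t) of three competing Weibull terms (all parameters assumed)."""
    early = (age_h / 1.0e6) ** 0.5   # infant mortality, k = 0.5
    floor = age_h / 2.0e5            # random failures, k = 1
    late = (age_h / 4.0e4) ** 3.0    # wear-out, k = 3
    return early + floor + late


def mission_reliability(burn_in_h: float, mission_h: float = 20000.0, af: float = 8.0) -> float:
    """Probability that a board which survived burn-in also survives the mission."""
    age_at_ship = af * burn_in_h     # equivalent field age consumed in the oven
    added = cumulative_hazard(age_at_ship + mission_h) - cumulative_hazard(age_at_ship)
    return math.exp(-added)


for burn_in in (0.0, 48.0, 500.0):
    print(f"burn-in {burn_in:>5.0f} h -> mission reliability {mission_reliability(burn_in):.3f}")
```

With these assumed parameters the 48 hour screen comes out slightly ahead of no screen, while the 500 hour screen comes out worst of the three, because the wear-out term it consumes outweighs the early failures it removes. Different parameters move the optimum, which is why the length should come from watching the failure rate flatten rather than from a fixed number.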

You will never watch most of your boards die. They will fail quietly, years from now, in hands you will never shake. The work of reliability is to bring those distant failures forward into a chamber on your bench, where heat compresses years into days and a single curve tells you whether the death was a dud, bad luck, or old age. You break a few on purpose so the rest can live, and you learn the shape of their dying so you can design that shape away.
