Differential Buses · CAN & Ethernet · #30 of 48

CAN Errors & Debug

TEC, REC, Bus-Off and the Lonely Node

You wire one CAN node onto the robot hand’s wrist board, flash firmware that should stream finger-encoder positions, and hit run. The transmit LED blinks a few times, then goes dark. The bus analyzer shows nothing. You re-check the wiring, re-flash, swap the transceiver, and still nothing. The node is healthy. The code is correct. The wires are right. And yet it has gone silent and refuses to speak.

A CAN node alone on the bus is a person talking into a phone with nobody on the other end to say “got it.”

CAN does not just send and forget. Every frame a node transmits, it also listens to, and it keeps two running scores of how often things go wrong. Those scores drive a small state machine that decides whether the node is allowed to keep talking, must talk quietly, or has to shut up entirely. Debugging CAN is mostly about reading those scores and walking the physical layer in a fixed order from the wires inward.

By the end, you can

  1. Explain how errors drive the TEC and REC counters and the error-active → error-passive → bus-off state machine
  2. Diagnose why a single node on the bench goes bus-off, and why CAN needs a second node to acknowledge
  3. Run the standard CAN debug order: ohm the termination, scope CANH/CANL with a diff probe, confirm bit rate, then decode and read counters
  4. Read a bus-resistance measurement and decide whether termination is correct before applying power

Intuition first

Picture a meeting where the rule is: every time you finish a sentence, at least one other person must nod. If somebody nods, you carry on. If nobody nods, you assume you mumbled, so you say it again. Louder. Again. Each unacknowledged sentence makes you more sure something is wrong with you, and after enough of them you conclude you are the problem and stop talking altogether.

That is almost exactly how a CAN node behaves. After it sends a frame, it watches one specific bit slot (the ACK slot) for a nod from any other node. No nod means the frame “failed,” and the node bumps an internal error score and retransmits. A node all alone on the bench gets no nods, ever, because there is no one else to provide them. Its error score climbs fast, and within a handful of frames it reaches the threshold where it switches itself off the bus. The maddening part is that nothing is broken. The node is doing exactly what the protocol tells it to do when it feels ignored.

So the first lesson of CAN debugging is counterintuitive: a node that will not transmit is often a node that is working perfectly, in an environment that violates CAN’s core assumption. CAN is a conversation, and a conversation needs at least two participants.

Two counters and a three-state machine

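Each CAN node carries two error counters, both starting at zero: the transmit error counter (TEC), which tracks errors on frames the node sends, and the receive error counter (REC), which tracks errors on frames it receives.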

A successful transfer nudges the relevant counter back down, so a healthy bus keeps both counters low and twitchy near zero. Errors push them up; good traffic walks them back. The counters are not symmetric in how hard they punish: a node that causes transmit errors climbs faster than one that merely witnesses them, so the node actually generating noise is the one most likely to be confined. That is the whole point of the mechanism, and it has a name: fault confinement. The bus protects the talkers from one broken member.

Those two numbers feed a state machine with three states:

  1. Error-active is the normal, healthy state. The node transmits freely and, when it spots an error, it announces it loudly with an active error flag (six dominant bits) so every other node hears about the problem too.
  2. Error-passive is the chastened state. When either counter reaches 128 or more, the node drops here. It may still transmit, but it must signal errors quietly with a passive error flag (six recessive bits), and after sending it must wait an extra pause before it is allowed to start again. It has lost the right to interrupt the bus.
  3. Bus-off is the timeout. When the TEC reaches 256 or more, the node takes itself completely off the bus. It transmits nothing. The bus is now protected from a node that has proven, repeatedly, that it cannot send a clean frame.

Getting back out is deliberately slow. A bus-off node only recovers after it observes a long stretch of clean idle traffic (128 occurrences of 11 consecutive recessive bits), at which point its counters reset and it may rejoin as error-active. The system would rather wait than let a flaky node thrash the bus back to a halt.
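The three states above reduce to a pair of threshold checks. The sketch below is only an illustration of those thresholds; the function name `can_error_state` is made up for this example and is not part of any CAN library.

```python
def can_error_state(tec: int, rec: int) -> str:
    """Map a (TEC, REC) pair to the fault-confinement state.

    Thresholds follow the lesson: error-passive when either counter is
    at least 128, bus-off when the TEC is at least 256.
    """
    if tec >= 256:
        return "bus-off"
    if tec >= 128 or rec >= 128:
        return "error-passive"
    return "error-active"


# Walk a few boundary values to see where each transition happens.
for tec, rec in [(0, 0), (127, 127), (128, 0), (0, 128), (255, 0), (256, 0)]:
    print(f"TEC={tec:3d} REC={rec:3d} -> {can_error_state(tec, rec)}")
```

Note that only the TEC can take a node bus-off: a large REC alone never gets past error-passive in this model.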

Early CAN controller silicon
Robert Bosch GmbH · CAN, from 1983. Bosch designed CAN and built fault confinement into the protocol itself: nodes count their own errors and voluntarily step down to error-passive, then bus-off, so one failing transceiver cannot jam a safety-critical car network.

The lonely node: why one transmitter goes bus-off

Now connect the counters to the silent node on your bench. Walk one frame.

The node wins arbitration (there is no one to arbitrate against), sends the identifier, the data, and the CRC, and arrives at the ACK slot. The transmitter sends the ACK slot as recessive (1) and waits for some other node to override it with a dominant (0) nod. With no second node present, nobody overrides it. The transmitter reads back the recessive level it sent, sees no acknowledgment, and registers an acknowledgment error. The TEC jumps by 8.

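The node does what the protocol says: retransmit. Same outcome. TEC jumps another 8. After about 16 such attempts the TEC reaches 128 and the node slips to error-passive; if it keeps counting at that rate, about 32 attempts take it to 256 and the node goes bus-off and falls silent. (The Bosch CAN specification carves out an exception for exactly this case: an error-passive transmitter whose only error is a missing ACK stops incrementing its TEC, so a strictly conformant controller may park at error-passive and retransmit forever instead of reporting bus-off. Either way, no frame ever gets through.) This is the single most common “my CAN won’t work” symptom on a bench, and the cause is not a fault at all. It is the absence of a listener.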
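The walk above can be replayed as a small simulation. This is a sketch of the lesson's simplified arithmetic (+8 per unacknowledged frame), not a model of any particular controller. The function name and the `passive_ack_exception` flag are hypothetical; the flag mimics the Bosch CAN specification's rule that an error-passive transmitter does not count a pure ACK error, which real controllers may or may not expose in the same way.

```python
def lonely_node(max_attempts: int = 40, passive_ack_exception: bool = False):
    """Simulate a transmitter that never receives an ACK.

    Returns (attempt at which error-passive is reached,
             attempt at which bus-off is reached, or None,
             final TEC).
    """
    tec = 0
    passive_at = None
    bus_off_at = None
    for attempt in range(1, max_attempts + 1):
        # Simplified rule: every unacknowledged frame adds 8 to the TEC.
        # With the exception enabled, an error-passive node (TEC >= 128)
        # no longer counts its ACK errors.
        if not (passive_ack_exception and tec >= 128):
            tec += 8
        if passive_at is None and tec >= 128:
            passive_at = attempt
        if tec >= 256:
            bus_off_at = attempt
            break
    return passive_at, bus_off_at, tec


print(lonely_node())                            # (16, 32, 256)
print(lonely_node(passive_ack_exception=True))  # (16, None, 128)
```

Both runs reach error-passive on the 16th attempt. Only the simplified rule carries on to bus-off on the 32nd; with the exception the counter parks at 128 and the node keeps retransmitting to nobody.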

The fix is to give the node someone to talk to. A second real node, a USB-CAN analyzer, or even another microcontroller in “listen and acknowledge” mode will provide the dominant ACK bit. Make sure the helper is in its normal mode: a device configured as listen-only never drives the ACK bit, so it does not count as a listener. The moment a nod arrives, the transmit succeeds, the TEC walks back down, and the node stays comfortably error-active. CAN requires at least two nodes to communicate, and the ACK mechanism is exactly where that requirement bites.

See it: ohm the termination before you trust the bus

Before you ever chase counters, you check the wires, and the cheapest check on a powered-down bus is the resistance across CANH and CANL. A correct high-speed bus has a 120 Ω terminator at each physical end. Those two resistors sit in parallel as seen by your meter, so a healthy, unpowered bus reads about 60 Ω. Toggle the terminators below and watch what the meter and the edges do.

[Interactive meter readout: bus resistance CANH-CANL 60 Ω · health (unpowered probe): healthy]

With both terminators present the meter reads ~60 Ω and the differential edges settle cleanly. Pull one terminator and the reading jumps to 120 Ω while the edges start to ring, because the open end reflects each transition back down the line. Pull both and the bus is effectively open, the reading goes very high, the ringing is severe, and frames corrupt. The number on the meter is a one-glance verdict: 60 Ω good, 120 Ω one terminator missing, very high means unterminated. You read it with the bus powered off, before any signal exists to confuse you.
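The one-glance verdict can be written down as a tiny classifier. This is a sketch only: the tolerance bands (50 to 70 Ω, 100 to 140 Ω, above 1 kΩ) are illustrative assumptions chosen for this example, not values from a standard, and both function names are hypothetical.

```python
def parallel(*resistors_ohm: float) -> float:
    """Equivalent resistance of resistors connected in parallel."""
    return 1.0 / sum(1.0 / r for r in resistors_ohm)


def classify_bus_resistance(ohms: float) -> str:
    """Classify an unpowered CANH-CANL reading (bands are assumptions)."""
    if 50 <= ohms <= 70:
        return "healthy: both terminators present"
    if 100 <= ohms <= 140:
        return "one terminator missing"
    if ohms > 1000:
        return "unterminated"
    return "unexpected: check for extra terminators, shorts, or wrong values"


print(f"two 120 ohm terminators in parallel: {parallel(120, 120):.1f} ohm")
for reading in (61, 120, 8000, 40):
    print(f"{reading:>5} ohm -> {classify_bus_resistance(reading)}")
```

A reading of about 40 Ω falls outside every band here; with these numbers it is what three 120 Ω resistors in parallel would give, which hints at an extra terminator somewhere mid-bus.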

A bench CAN node transmits a few frames, then stops and reports bus-off. It is the only node wired to the bus. What is the most likely cause?

On an unpowered high-speed CAN bus, your multimeter reads 120 Ω across CANH and CANL. What does this tell you?

Lab: walk the CAN debug order

When a bus will not communicate, resist the urge to poke at firmware first. Work from the copper inward in a fixed order:

  1. Power the bus down and ohm CANH to CANL. You expect ~60 Ω; 120 Ω or a very high reading tells you a terminator is wrong before you waste an hour.
  2. Power up and scope CANH and CANL, ideally with a differential probe so you read the actual differential the receiver sees. Dominant should swing CANH up toward 3.5 V and CANL down toward 1.5 V, about a 2 V differential; recessive collapses both toward 2.5 V.
  3. Confirm every node shares the same bit rate by reading each controller’s timing registers, because one mismatched node corrupts everyone’s frames.
  4. Only then decode the traffic and read the error counters (TEC and REC) on each node to see which one is climbing, which points straight at the culprit.

The order is deliberate: the cheapest, most decisive checks come first, and a single missing terminator or a lonely node is caught before you ever open a logic analyzer.
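For the scope step, it can help to turn the two probe readings into the differential the receiver acts on. The sketch below assumes the commonly quoted high-speed CAN receiver thresholds (at least 0.9 V differential reads as dominant, at most 0.5 V as recessive); treat those numbers as assumptions and check your transceiver's datasheet. The function name `bus_level` is hypothetical.

```python
def bus_level(can_h_volts: float, can_l_volts: float) -> str:
    """Classify a bus state from single-ended CANH and CANL readings.

    Threshold values are assumptions for illustration; real transceivers
    specify their own input thresholds.
    """
    v_diff = can_h_volts - can_l_volts
    if v_diff >= 0.9:
        return "dominant"
    if v_diff <= 0.5:
        return "recessive"
    return "undefined"


# Nominal dominant, nominal recessive, and a weak level in between.
for can_h, can_l in [(3.5, 1.5), (2.5, 2.5), (2.9, 2.2)]:
    print(f"CANH={can_h} V CANL={can_l} V -> {bus_level(can_h, can_l)}")
```

A level that lands in the undefined band is worth a closer look at termination, ground offsets, or a weak transceiver before moving on to bit rate.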

The exact counter arithmetic, the recovery rule, and why ACK errors cost 8

The CAN error-counting rules are precise, and knowing the constants turns guesswork into prediction. The headline transitions are: a node becomes error-passive when its TEC or REC is greater than or equal to 128, and it goes bus-off when its TEC is greater than or equal to 256. Formally,

$$\text{error-active} \xrightarrow{\;\text{TEC} \ge 128 \;\lor\; \text{REC} \ge 128\;} \text{error-passive} \xrightarrow{\;\text{TEC} \ge 256\;} \text{bus-off}$$

The increments are not all the same size. A transmitter that detects an error (including a missing acknowledgment) generally adds 8 to its TEC, while a receiver that detects an error adds 1 to its REC for most cases (and more in specific situations). Successful frames decrement the counter that was involved, typically by 1, which is why a bus with rare, recoverable glitches keeps its counters hovering near zero. The asymmetry (transmit errors cost much more than receive errors) is the lever that confines a faulty node: the member that keeps producing bad transmissions climbs eight times faster than the bystanders that merely observe, so it crosses the bus-off line first.
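The +8 and −1 asymmetry implies a break-even point: a transmitter that fails one frame in nine gains 8 and then loses 8, so it hovers, while anything worse drifts upward. The sketch below illustrates that with the simplified rule only (+8 on a failed transmit, −1 on a good one, floor at zero); it ignores the special cases in the standard, and the function name is hypothetical.

```python
def frames_until_bus_off(fail_every: int, max_frames: int = 100_000):
    """Frame count at which the TEC reaches 256, or None if it never does.

    Every `fail_every`-th transmitted frame fails (+8); all others
    succeed (-1, never going below zero).
    """
    tec = 0
    for frame in range(1, max_frames + 1):
        if frame % fail_every == 0:
            tec += 8
            if tec >= 256:
                return frame
        elif tec > 0:
            tec -= 1
    return None


print(frames_until_bus_off(1))  # every frame fails -> 32
print(frames_until_bus_off(5))  # one failure in five -> 315
print(frames_until_bus_off(9))  # one failure in nine -> None (break-even)
```

In this model a node with a 20 % failure rate still ends up bus-off, just ten times more slowly than one that fails every frame.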

Trace the lonely bench node with these numbers. Each unacknowledged transmit adds 8 to the TEC. Starting from 0, it reaches 128 (error-passive) after about

$$\frac{128}{8} = 16 \text{ frames}$$

and 256 (bus-off) after about

$$\frac{256}{8} = 32 \text{ frames}.$$

At a 500 kbit/s bit rate each bit lasts 2 µs, so 32 short frames of roughly 50 bits each occupy only a few milliseconds of bus time, which is why the transmit LED gives one quick flicker and then dies. Recovery from bus-off is the slow counterpart: by the standard, the node only returns after it has monitored 128 occurrences of 11 consecutive recessive bits (essentially 128 stretches of bus-idle), at which point its counters reset to zero and it rejoins as error-active. The protocol spends time, not luck, to re-admit a node that has misbehaved.
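The recovery rule translates directly into a minimum wait: 128 × 11 = 1408 recessive bit times. The sketch below computes that lower bound for a few bit rates. It assumes an otherwise idle bus, and some controllers also require firmware to request recovery, so real recovery can take longer. The function name is hypothetical.

```python
RECOVERY_SEQUENCES = 128   # occurrences required by the recovery rule
BITS_PER_SEQUENCE = 11     # consecutive recessive bits per occurrence


def min_bus_off_recovery_s(bit_rate_bps: float) -> float:
    """Lower bound on bus-off recovery time, in seconds, on an idle bus."""
    return RECOVERY_SEQUENCES * BITS_PER_SEQUENCE / bit_rate_bps


for rate in (125_000, 500_000, 1_000_000):
    ms = min_bus_off_recovery_s(rate) * 1000
    print(f"{rate:>9} bit/s -> at least {ms:.3f} ms")
```

At 500 kbit/s the bound works out to about 2.8 ms.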

One factual note: the Wikipedia article this lesson draws on phrases the passive-flag rule as “when TEC or REC is greater than 127 and less than 255,” and the bus-off rule as “TEC greater than 255.” Those are the same thresholds stated as strict inequalities on integer counters: “greater than 127” equals “greater than or equal to 128,” and “greater than 255” equals “greater than or equal to 256.” This lesson uses the greater-than-or-equal form because it is the way the ISO 11898-1 thresholds are normally written and avoids off-by-one confusion at the boundary.

Grounded in Wikipedia: “CAN bus” (CC BY-SA).

Key takeaways

  • CAN nodes carry a TEC (transmit) and REC (receive) error counter; good traffic walks them down, errors push them up.
  • The state machine is error-active → error-passive (counter ≥ 128) → bus-off (TEC ≥ 256), the heart of fault confinement.
  • A single node alone on the bench goes bus-off because no one acknowledges its frames; CAN needs at least one other node to nod.
  • Debug order: power off and ohm the termination (≈60 Ω), scope CANH/CANL with a diff probe, confirm every node's bit rate, then decode and read the counters.
  • On a powered-down bus, ~60 Ω is healthy, 120 Ω means one terminator is missing, and very high means the bus is unterminated.
Practice 1 warm-up

You power down a high-speed CAN bus and measure the resistance across CANH and CANL. You read 61 Ω. Then you read 120 Ω on a different bus. Then 8 kΩ on a third. Classify each bus.

Worked solution
  • 61 Ω → healthy. Two 120 Ω terminators in parallel are $120 \parallel 120 = 60\ \Omega$; 61 Ω is that pair plus tiny wiring resistance. Both ends terminated, signal integrity good.
  • 120 Ω → one terminator missing. You are reading a single 120 Ω resistor. One physical end is unterminated, so edges will reflect and ring. Find and fit the second terminator.
  • 8 kΩ → unterminated. No 120 Ω termination is in the measurement at all; you are seeing only the transceivers’ input impedance. The bus is effectively open and frames will corrupt. Fit a terminator at each end.
Practice 2 core

A teammate insists the wrist-board CAN node is broken: “It transmits twice then reports bus-off, every single time.” It is the only node connected, the termination ohms out at 60 Ω, and the scope shows clean 2 V dominant edges. Explain what is actually happening and the one-line fix.

Worked solution

The node is not broken; it is doing exactly what CAN demands. With no second node present, nothing drives the dominant ACK bit, so every transmitted frame logs an acknowledgment error and adds 8 to the TEC. After roughly 32 retries the TEC reaches 256 and the node takes itself bus-off and falls silent. Clean edges and 60 Ω termination confirm the physical layer is fine; the missing piece is a listener. The one-line fix: add a second node (a real ECU, a USB-CAN analyzer, or another microcontroller in receive-and-acknowledge mode) so its frames get acknowledged. The TEC then walks back down and the node stays error-active.

Practice 3 stretch

A node starts at TEC = 0 and, due to a flaky connector, every frame it transmits fails with an acknowledgment error (+8 to the TEC) with no successful frames in between to decrement it. After how many failed transmissions does it enter error-passive, and after how many does it go bus-off? Then explain why a node that only receives bad frames is far slower to be confined.

Worked solution

Error-passive is reached at TEC ≥ 128, so after

$$\frac{128}{8} = 16 \text{ failed transmissions.}$$

Bus-off is reached at TEC ≥ 256, so after

$$\frac{256}{8} = 32 \text{ failed transmissions.}$$

A receive-only error path is far slower because a detected receive error typically adds just 1 to the REC, not 8. To push the REC to the 128 error-passive threshold on receive errors alone would take on the order of 128 bad receptions, eight times as many events as on the transmit side. That asymmetry is intentional: the node causing the bad traffic (it transmits) is confined quickly, while innocent bystanders that merely witness the corruption are punished gently and stay on the bus.

The silent node is not sulking and it is not dead. It counted its unanswered frames, decided it must be the problem, and stepped aside exactly as it was designed to. Give it one other voice on the wire and the count reverses, the nods come back, and it speaks again. CAN was never built for a node to talk alone, and the cure for the loneliest fault on the bench is simply someone to listen.
