- February 07, 2025
The code review passed. Every comment is resolved. The static analyser shows zero warnings. The team signs off and the build goes to hardware.
Then the device resets at 3am with no log entry. Or the BLE connection drops after exactly 47 minutes. Or the sensor reads fine on the bench but returns garbage after 20 minutes in the field. Code review didn't catch any of it. These faults don't live in the code. They live in how the code behaves on real hardware, under real conditions.
By the end of this post, you'll know the firmware debugging techniques used after every code review on client projects: the ones that find the faults that would otherwise reach production.
Code review is a reading exercise. Engineers check logic, flag issues, and verify that the code matches the intent. It's essential — and it's limited to what the code looks like on paper.
The faults that cause the most expensive production failures don't show up on paper. They show up when two processes try to access the same data at the wrong moment. When a rarely-used error path runs out of memory. When a sensor that works perfectly at room temperature starts misbehaving at 60°C. None of that is visible in a code review.
Three categories of fault consistently survive code review on embedded projects:
Timing faults: Two parts of the system collide at exactly the wrong moment. Invisible in static review, guaranteed to appear in the field.
Memory faults: The device runs out of memory or corrupts its own data - not immediately, but after hours or days of uptime.
Hardware-interaction faults: Behaviours that only appear when the full system is running: all sensors active, wireless connected, data flowing.
Each technique below targets one of these categories. Together, they form the firmware fault detection process run after every code review before a build goes to QA.
Nearly every microcontroller used in embedded devices includes a watchdog timer. It's a safety net: if the firmware stops responding, the watchdog resets the device. Most teams set it up and move on.
The problem is that a reset with no information is nearly useless. The device came back up. The fault is gone. You have no idea what caused it.
The right approach is to configure the watchdog so that, before it resets, it saves a snapshot of what the device was doing: which task was running, what the last few log entries were, and what triggered the reset. On the next boot, that data gets transmitted before anything else happens.
On one industrial sensor project, three consecutive resets in the field were traced to a single sensor that occasionally blocked the entire system during a power fluctuation. Without the pre-reset snapshot, that fault would have taken weeks to find. With it, the root cause was identified in two hours.
What this catches: Deadlocks, frozen tasks, hardware peripherals that stop responding.
What happens without it: Field resets with no root cause. Expensive support tickets. Guesswork.
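The snapshot idea can be sketched in a few lines of C. This is a minimal illustration, not the implementation used on the project described above: the struct layout, the `SNAPSHOT_MAGIC` value, and the function names are all hypothetical, and on a real target `g_snapshot` would need to sit in a RAM region that the startup code does not clear (how to do that depends on the toolchain and linker script).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SNAPSHOT_MAGIC     0x57444731u  /* arbitrary marker, hypothetical */
#define SNAPSHOT_LOG_LINES 4
#define SNAPSHOT_LINE_LEN  32

typedef struct {
    uint32_t magic;                     /* marks the record as written */
    char     task_name[16];             /* task running before the reset */
    uint32_t reset_cause;               /* project-defined cause code */
    char     log[SNAPSHOT_LOG_LINES][SNAPSHOT_LINE_LEN];
    uint32_t checksum;                  /* guards against stale RAM contents */
} wdg_snapshot_t;

/* Assumption: on hardware this lives in reset-surviving (no-init) RAM. */
static wdg_snapshot_t g_snapshot;

/* Simple integrity check over every field before `checksum` (not a CRC). */
static uint32_t snapshot_checksum(const wdg_snapshot_t *s)
{
    const uint8_t *p = (const uint8_t *)s;
    uint32_t sum = 0;
    for (size_t i = 0; i < offsetof(wdg_snapshot_t, checksum); i++) {
        sum = sum * 31u + p[i];
    }
    return sum;
}

/* Call from the watchdog pre-reset hook, just before the reset fires. */
void snapshot_save(const char *task, uint32_t cause,
                   const char *const *recent_logs, size_t n_logs)
{
    assert(task != NULL);
    memset(&g_snapshot, 0, sizeof g_snapshot);
    g_snapshot.magic = SNAPSHOT_MAGIC;
    strncpy(g_snapshot.task_name, task, sizeof g_snapshot.task_name - 1);
    g_snapshot.reset_cause = cause;
    for (size_t i = 0; i < n_logs && i < SNAPSHOT_LOG_LINES; i++) {
        strncpy(g_snapshot.log[i], recent_logs[i], SNAPSHOT_LINE_LEN - 1);
    }
    g_snapshot.checksum = snapshot_checksum(&g_snapshot);
}

/* Call first thing on boot. Returns true and copies the record out if a
 * valid snapshot exists, then clears it so it is reported only once. */
bool snapshot_take(wdg_snapshot_t *out)
{
    bool valid = g_snapshot.magic == SNAPSHOT_MAGIC &&
                 g_snapshot.checksum == snapshot_checksum(&g_snapshot);
    if (valid) {
        *out = g_snapshot;
    }
    memset(&g_snapshot, 0, sizeof g_snapshot);
    return valid;
}
```

The magic value plus checksum is one way to tell a genuine snapshot from whatever happens to be in RAM after a cold power-up; a boot that finds no valid record simply carries on as normal.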
Every task running on the device has a memory allocation. Most firmware teams estimate these allocations based on experience and ship. The estimates are often wrong, not because the engineers are careless, but because the memory used by an error recovery path is genuinely hard to estimate without measuring it.
The solution is to run the firmware through its full operational cycle - normal operation, error recovery, OTA update, reconnection after a dropped connection - and measure how close each task comes to its memory limit at every stage. Any task using more than 90% of its allocation is a risk. Any task over 95% is a problem that needs to be fixed before production.
The table below shows what this looks like in practice:
| Task | Memory allocated | Headroom at peak usage | Risk level |
|---|---|---|---|
| Sensor sampling | 2048 bytes | 640 bytes remaining | Low |
| BLE connection task | 4096 bytes | 312 bytes remaining | High |
| OTA update handler | 3072 bytes | 88 bytes remaining | Critical |
| Data publish task | 2048 bytes | 820 bytes remaining | Low |
The OTA handler in that table came from a real connected device project. The standard update path worked fine. But the error recovery path - what happens when a firmware block fails to download - used significantly more memory than the happy path. That path was never exercised in code review. Measuring memory usage under real conditions caught it before it caused a field failure.
What this catches: Memory overflow faults triggered by error paths that are rarely exercised in testing.
What happens without it: The device works perfectly until it hits the error path in production.
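One common way to take this measurement is to fill each task's stack with a known byte pattern before the task starts, then count how much of the pattern survives after the full operational cycle. The sketch below assumes a stack that grows downward (so untouched bytes sit at the low end of the buffer) and applies the 90% and 95% thresholds described above; the function names and the fill byte are hypothetical. Many RTOSes expose a comparable high-water-mark query that can replace the manual scan.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STACK_FILL 0xA5u   /* arbitrary pattern, unlikely to occur in a long run */

typedef enum { RISK_LOW, RISK_HIGH, RISK_CRITICAL } stack_risk_t;

/* Paint the whole stack region before the task is started. */
void stack_paint(uint8_t *stack, size_t size)
{
    memset(stack, STACK_FILL, size);
}

/* Count bytes never written. Assumes the stack grows downward from
 * stack + size, so the untouched region starts at stack[0]. */
size_t stack_headroom(const uint8_t *stack, size_t size)
{
    size_t free_bytes = 0;
    while (free_bytes < size && stack[free_bytes] == STACK_FILL) {
        free_bytes++;
    }
    return free_bytes;
}

/* Over 95% used is critical, over 90% is high risk, anything else is low.
 * Integer arithmetic keeps floating point out of the firmware. */
stack_risk_t stack_risk(size_t allocated, size_t remaining)
{
    assert(allocated > 0 && remaining <= allocated);
    size_t used = allocated - remaining;
    if (used * 100u > allocated * 95u) {
        return RISK_CRITICAL;
    }
    if (used * 100u > allocated * 90u) {
        return RISK_HIGH;
    }
    return RISK_LOW;
}
```

Fed the figures from the table, `stack_risk(4096, 312)` classifies the BLE task as high risk and `stack_risk(3072, 88)` classifies the OTA handler as critical, which matches the risk column.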
When a firmware crash happens on a device in the field, the default behaviour on most embedded systems is to reset and say nothing. The device comes back up. The crash is gone. Support gets a ticket saying "the device restarted unexpectedly."
A better approach is to configure the firmware so that when a crash occurs, it captures exactly what it was doing at that moment - which task was running, what operation triggered the fault - and saves that information before resetting. On the next boot, that data gets sent before normal operation begins.
With this in place, a crash that previously looked like "random reset" becomes "the device crashed while processing incoming data over BLE, because a buffer was accessed before it was ready." That's a one-day fix, not a two-week investigation.
This is purely a configuration and logging change. It doesn't require hardware changes. It costs a few hundred bytes of storage. The return on that investment appears on the first unexplained field reset.
What this catches: Any unexpected crash, including memory violations, null pointer errors, and hardware faults.
What happens without it: Field crashes with no root cause data. Support escalations. Firmware revisions shipped blind.
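As a rough sketch of what such a crash record might hold, the example below assumes an ARM Cortex-M class core, which pushes eight registers (r0, r1, r2, r3, r12, lr, pc, xPSR) onto the stack when an exception is taken. The post does not name a target architecture, so treat that as an assumption; the record layout, the `fault_kind_t` values, and the function names are hypothetical. A real fault handler would obtain the frame pointer in a short assembly stub and write the record to reset-surviving storage.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef enum { FAULT_HARD, FAULT_MEM, FAULT_BUS, FAULT_USAGE } fault_kind_t;

typedef struct {
    uint32_t     pc;        /* instruction that faulted */
    uint32_t     lr;        /* return address: who called the faulting code */
    fault_kind_t kind;
    char         task[16];  /* task running at the time */
} crash_record_t;

/* The record should stay within the "few hundred bytes" budget. */
static_assert(sizeof(crash_record_t) <= 256, "crash record too large");

/* Word offsets in the exception frame: r0, r1, r2, r3, r12, lr, pc, xPSR. */
enum { FRAME_LR = 5, FRAME_PC = 6, FRAME_WORDS = 8 };

/* Fill a crash record from the stacked exception frame. */
void crash_capture(crash_record_t *rec, const uint32_t frame[FRAME_WORDS],
                   fault_kind_t kind, const char *task)
{
    memset(rec, 0, sizeof *rec);
    rec->pc = frame[FRAME_PC];
    rec->lr = frame[FRAME_LR];
    rec->kind = kind;
    if (task != NULL) {
        strncpy(rec->task, task, sizeof rec->task - 1);
    }
}

/* Render the record as one log line to transmit on the next boot. */
int crash_describe(const crash_record_t *rec, char *out, size_t out_len)
{
    static const char *const names[] = {
        "hard fault", "memory fault", "bus fault", "usage fault"
    };
    return snprintf(out, out_len, "%s in task '%s' at pc=0x%08lx (lr=0x%08lx)",
                    names[rec->kind], rec->task,
                    (unsigned long)rec->pc, (unsigned long)rec->lr);
}
```

The program counter can then be mapped back to a source line with the build's symbol file, which is what turns a "random reset" into a named function in a named task.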
Memory problems in embedded devices are slow. A small allocation that never gets released might consume 2KB per hour. After 10 minutes of testing, that's invisible. After 48 hours, the device crashes.
After code review, the firmware runs continuously for a minimum of 24 hours, preferably 72, with memory usage logged at regular intervals. The log shows whether memory is stable, slowly declining (a leak), or fragmenting over time.
The pattern to watch for: memory looks healthy in the current snapshot but the all-time low keeps dropping. That's the sign of fragmentation - memory is being allocated and released, but the gaps left behind are too small to reuse. The device will eventually fail to allocate what it needs, even if the total free memory looks fine.
On a healthcare device project, a memory leak only appeared after 48 hours of continuous operation. It was traced to an error path in the wireless connection logic that allocated a small buffer and returned early without releasing it - correctly written for the expected case, but leaking on the error case. Code review missed it. The 72-hour runtime test found it.
What this catches: Memory leaks and fragmentation that build over hours or days of uptime.
What happens without it: The device works in QA, passes testing, and fails after days or weeks in the field.
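The soak-test log can be checked mechanically rather than by eye. The sketch below is a deliberately simple illustration: it compares the first and last samples of a run and applies the pattern described above (current free memory falling means a leak; current free memory steady while the all-time low keeps falling is the warning sign to investigate). The sample struct, the tolerance parameter, and the function names are hypothetical, and a production check would likely look at the whole series rather than two endpoints.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t free_now;   /* free heap bytes when the sample was taken */
    uint32_t min_free;   /* all-time low of free heap since boot */
} heap_sample_t;

typedef enum { HEAP_STABLE, HEAP_LEAK, HEAP_LOW_WATER_FALLING } heap_trend_t;

/* Classify a soak-test log by comparing its first and last samples.
 * `tolerance` is the drop in bytes treated as normal jitter. */
heap_trend_t heap_trend(const heap_sample_t *s, size_t n, uint32_t tolerance)
{
    assert(s != NULL && n >= 2);
    const heap_sample_t *first = &s[0];
    const heap_sample_t *last = &s[n - 1];
    uint32_t drop_now = first->free_now > last->free_now
                            ? first->free_now - last->free_now : 0;
    uint32_t drop_min = first->min_free > last->min_free
                            ? first->min_free - last->min_free : 0;
    if (drop_now > tolerance) {
        return HEAP_LEAK;               /* free memory itself is declining */
    }
    if (drop_min > tolerance) {
        return HEAP_LOW_WATER_FALLING;  /* looks healthy, low keeps dropping */
    }
    return HEAP_STABLE;
}

/* Whole hours until a steady leak exhausts the given free memory. */
uint32_t hours_to_exhaustion(uint32_t free_bytes, uint32_t leak_per_hour)
{
    assert(leak_per_hour > 0);
    return free_bytes / leak_per_hour;
}
```

As a worked figure, a 2 KB per hour leak against a hypothetical 96 KB of free heap gives `hours_to_exhaustion(96 * 1024, 2 * 1024)`, or 48 hours, which is why a 10-minute test never sees it.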
This is the firmware validation step most teams skip, and the one most likely to cause production failures that are genuinely hard to reproduce.
Here's the scenario: a component of the firmware has a timing deadline. It needs to respond to an incoming signal within a certain window. In isolation, it does. But when the full system is running - wireless connected, sensors active, data being processed - other activity delays it past that deadline. The result is a framing error, a dropped packet, or corrupted data. It happens intermittently. It's nearly impossible to reproduce in a controlled test environment.
The fix is to measure actual timing under full load before the build leaves the engineering team, not after QA finds it or after a customer reports it. The measurement takes a few hours and requires basic lab equipment. Finding a timing violation at this stage means a configuration change. Finding it after production means a firmware revision and a customer support incident.
What this catches: Timing faults that only appear when the full system is under load.
What happens without it: Intermittent field failures that are difficult to reproduce and expensive to diagnose.
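Alongside a scope or logic analyser, the firmware can track its own worst-case latency while the full system is running. The sketch below assumes a free-running 32-bit tick or cycle counter is available to timestamp the start and end of the deadline-critical operation; where that counter comes from is hardware-specific and not shown. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t deadline_ticks;  /* allowed response window */
    uint32_t worst_ticks;     /* longest latency seen so far */
    uint32_t violations;      /* number of samples over the deadline */
    uint32_t samples;         /* total samples recorded */
} latency_stats_t;

/* Reset the statistics and set the deadline to check against. */
void latency_init(latency_stats_t *st, uint32_t deadline_ticks)
{
    assert(st != NULL);
    st->deadline_ticks = deadline_ticks;
    st->worst_ticks = 0;
    st->violations = 0;
    st->samples = 0;
}

/* Record one measurement. `start` and `end` are readings of a free-running
 * 32-bit counter; unsigned subtraction stays correct across one wraparound. */
void latency_record(latency_stats_t *st, uint32_t start, uint32_t end)
{
    uint32_t elapsed = end - start;
    st->samples++;
    if (elapsed > st->worst_ticks) {
        st->worst_ticks = elapsed;
    }
    if (elapsed > st->deadline_ticks) {
        st->violations++;
    }
}
```

Logging `worst_ticks` and `violations` at the end of a full-load run gives a concrete number to compare against the deadline, instead of waiting for an intermittent framing error to show up.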
Code review validates that the firmware handles expected inputs correctly. This technique checks what happens when it receives inputs it was never designed for.
Every external interface on the device - BLE, Wi-Fi, serial connections, OTA update streams - gets tested with malformed, oversized, undersized, and out-of-sequence inputs. Not because attackers are expected to send them, but because other devices, environmental noise, and edge cases in the real world produce unexpected inputs regularly.
The most common fault found with this technique: a handler that processes incoming data correctly at the expected length but doesn't check whether the actual received length is within bounds. A legitimate device in the field sends a slightly longer packet than expected. The firmware copies it into a fixed-size buffer. The buffer overflows into adjacent memory. The crash happens 200ms later, in an apparently unrelated part of the firmware, with no obvious connection to the oversized input.
What this catches: Buffer overflows and data corruption triggered by unexpected real-world inputs.
What happens without it: Intermittent crashes that appear random because the connection to the triggering input is not obvious.
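The fix for the fault described above is usually a length check before the copy. The handler below is a minimal sketch with hypothetical names and limits (`PAYLOAD_MIN`, `PAYLOAD_MAX`, `rx_handle`); real limits come from the protocol in use. It rejects out-of-range input before touching the destination buffer, so an oversized packet produces an error code instead of silent corruption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAYLOAD_MIN 2u    /* hypothetical: shortest valid packet */
#define PAYLOAD_MAX 20u   /* hypothetical: size of the fixed buffer */

typedef enum { RX_OK, RX_TOO_SHORT, RX_TOO_LONG } rx_status_t;

typedef struct {
    uint8_t data[PAYLOAD_MAX];
    size_t  len;
} rx_packet_t;

/* Copy an incoming payload into a fixed-size packet, checking the actual
 * received length first. The packet is left untouched on rejection. */
rx_status_t rx_handle(rx_packet_t *pkt, const uint8_t *in, size_t in_len)
{
    assert(pkt != NULL && in != NULL);
    if (in_len < PAYLOAD_MIN) {
        return RX_TOO_SHORT;
    }
    if (in_len > sizeof pkt->data) {
        return RX_TOO_LONG;   /* without this check, memcpy would overflow */
    }
    memcpy(pkt->data, in, in_len);
    pkt->len = in_len;
    return RX_OK;
}
```

A boundary test then calls the handler with lengths of 0, `PAYLOAD_MIN - 1`, `PAYLOAD_MAX`, and `PAYLOAD_MAX + 1` and confirms that only the in-range lengths are accepted.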
These six techniques run in a structured sequence after every code review sign-off, before any build ships to QA. The full sequence takes 3–4 days.
Day 1: Instrumentation setup (2–4 hours). Configure the watchdog logging, crash capture, and memory monitoring. This is a one-time setup per project that pays back on the first field incident.
Day 2: Full operational cycle run (4–8 hours). Run the firmware through every operational mode: startup, normal operation, error recovery, OTA update, reconnection. Check memory usage across every task. Identify and fix anything close to its limit.
Day 3: Timing and input validation (4–6 hours). Measure timing under full load. Run malformed input tests across all external interfaces. Document any failures as engineering bugs before they become QA findings.
Day 4+: Extended runtime soak (24–72 hours). Let the firmware run continuously. Monitor memory health. Confirm zero unexplained resets before handoff.
| Technique | What it catches | Time investment | Skip when |
|---|---|---|---|
| Watchdog crash logging | Hangs, deadlocks, frozen tasks | 2 hours setup | Never |
| Memory limit profiling | Stack overflow, undersized allocations | 1 day | Never |
| Crash capture configuration | Any unexpected crash with root cause | 2 hours | Never |
| Extended memory monitoring | Leaks and fragmentation over time | 24–72 hours | Fixed-allocation, short-lived devices |
| Timing validation under load | Intermittent timing faults | 4–6 hours | Devices with no real-time requirements |
| Input boundary testing | Buffer overflows, data corruption | 4–6 hours | Never |
Code review catches logic errors, naming issues, and implementation mismatches. What it cannot catch is runtime behaviour: what happens when two tasks access shared data at the exact same moment, what happens when the device has been running for 48 hours and memory is fragmented, what happens when an input arrives that's slightly longer than expected. These are the faults that cause the most expensive production failures, and they require runtime testing to find.
The first step is making sure the device is configured to save crash data before it resets - if it isn't, that's the fix. A crash with no log entry is almost always a system hang, a memory violation, or a hardware fault. With crash capture configured, the next occurrence produces a report that identifies which task was running and what triggered the fault. Without it, the only option is guesswork and waiting for the crash to happen again.
Minimum 72 hours for any connected device that manages memory dynamically. Memory leaks and fragmentation compound over time and won't appear in a short test. For medical devices or anything handling sensitive data, extend to 7 days. The most expensive production firmware failures are almost always faults that take time to build up - slow leaks, gradual fragmentation, timing violations that only occur under specific load conditions.
Before. These techniques are engineering-owned, not QA-owned. Finding a memory overflow or a timing violation before it reaches QA means a quick engineering fix. Finding it in QA means a firmware revision, re-flash, and delayed QA cycle. Finding it in production means a support incident, a field update, and a customer relationship problem.
Three to four days per firmware release cycle. The 72-hour soak runs largely unattended. The instrumentation setup in Day 1 is a one-time investment - once the watchdog logging and crash capture are configured on a project, they stay configured. The cost of skipping this process shows up later: a field failure that takes two weeks to diagnose costs significantly more than four days of structured testing.
More relevant, not less. A small team shipping a connected product doesn't have a QA department to catch what engineering misses. The faults this process finds are the ones that generate the most damaging customer feedback: devices that reset unexpectedly, connections that drop after hours of use, crashes that appear to be random. Finding those before shipping is a decision that directly affects product reputation and customer retention.
Code review is the gate that keeps logic errors out. The firmware fault-finding techniques in this post are the gate that keeps runtime faults out. Together, they cover the two categories of problem that cause the most expensive production failures.
None of these techniques require new tools or significant time investment. They require discipline - running the same structured sequence after every code review, on every project, before any build ships to QA.
If your firmware team is signing off on code review without running post-review fault detection, the problems are still in the code. They're waiting for the right conditions on production hardware to surface.
If you're building a connected device and want a firmware team that runs this process on every project, CoreFragment's embedded engineering team can review your current process and identify exactly where the gaps are.