6 Firmware Debugging Techniques We Use After Every Code Review

Why Is Firmware Debugging Actually Needed?

The code review passed. Every comment is resolved. The static analyser shows zero warnings. The team signs off and the build goes to hardware.

Then the device resets at 3am with no log entry. Or the BLE connection drops after exactly 47 minutes. Or the sensor reads fine on the bench but returns garbage after 20 minutes in the field. Code review didn't catch any of it. These faults don't live in the code. They live in how the code behaves on real hardware, under real conditions.

By the end of this post, you'll know the firmware debugging techniques used after every code review on client projects: the ones that find the faults that would otherwise reach production.

Why Code Review Alone Is Not Enough

Code review is a reading exercise. Engineers check logic, flag issues, and verify that the code matches the intent. It's essential, and it's limited to what the code looks like on paper.

The faults that cause the most expensive production failures don't show up on paper. They show up when two processes try to access the same data at the wrong moment. When a rarely-used error path runs out of memory. When a sensor that works perfectly at room temperature starts misbehaving at 60°C. None of that is visible in a code review.

Three categories of fault consistently survive code review on embedded projects:

  • Timing faults: Two parts of the system collide at exactly the wrong moment. Invisible in static review, guaranteed to appear in the field.

  • Memory faults: The device runs out of memory or corrupts its own data - not immediately, but after hours or days of uptime.

  • Hardware-interaction faults: Behaviours that only appear when the full system is running: all sensors active, wireless connected, data flowing.

Each technique below targets one of these categories. Together, they form the firmware fault detection process run after every code review before a build goes to QA.

Technique 1 - Make the Watchdog Timer Tell You Something Useful

Nearly every embedded device has a watchdog timer. It's a safety net: if the firmware stops responding, the watchdog resets the device. Most teams set it up and move on.

The problem is that a reset with no information is nearly useless. The device came back up. The fault is gone. You have no idea what caused it.

The right approach is to configure the watchdog so that, before it resets, it saves a snapshot of what the device was doing: which task was running, what the last few log entries were, and what triggered the reset. On the next boot, that data gets transmitted before anything else happens.

On one industrial sensor project, three consecutive resets in the field were traced to a single sensor that occasionally blocked the entire system during a power fluctuation. Without the pre-reset snapshot, that fault would have taken weeks to find. With it, the root cause was identified in two hours.
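
A pre-reset snapshot of this kind can be sketched in a few lines of C. This is a minimal illustration, not the implementation used on that project: it assumes the watchdog offers a pre-timeout interrupt and that a small RAM region survives the reset. The names wdt_snapshot_t, wdt_snapshot_save and wdt_snapshot_take are hypothetical, and a plain static variable stands in for the reset-surviving memory.

```c
#include <stdint.h>
#include <string.h>

#define SNAPSHOT_MAGIC   0x57444F47u   /* arbitrary marker value */
#define SNAPSHOT_LOG_LEN 4u

/* Hypothetical snapshot layout. On real hardware this would sit in a RAM
 * section the startup code does not zero, so it survives the reset. */
typedef struct {
    uint32_t magic;                          /* valid only if == SNAPSHOT_MAGIC */
    char     task_name[16];                  /* task running at timeout */
    uint32_t reset_reason;                   /* application-defined code */
    uint16_t last_log_ids[SNAPSHOT_LOG_LEN]; /* most recent log event IDs */
} wdt_snapshot_t;

static wdt_snapshot_t g_snapshot;   /* stand-in for reset-surviving memory */

/* Call from the watchdog pre-timeout interrupt, just before the reset. */
void wdt_snapshot_save(const char *task, uint32_t reason,
                       const uint16_t *log_ids, size_t n_ids)
{
    memset(&g_snapshot, 0, sizeof g_snapshot);
    strncpy(g_snapshot.task_name, task, sizeof g_snapshot.task_name - 1);
    g_snapshot.reset_reason = reason;
    if (n_ids > SNAPSHOT_LOG_LEN) {          /* keep only the newest entries */
        log_ids += n_ids - SNAPSHOT_LOG_LEN;
        n_ids = SNAPSHOT_LOG_LEN;
    }
    if (n_ids > 0)
        memcpy(g_snapshot.last_log_ids, log_ids, n_ids * sizeof log_ids[0]);
    g_snapshot.magic = SNAPSHOT_MAGIC;       /* written last: record complete */
}

/* Call first thing at boot. Returns 1 and copies the record out if a valid
 * snapshot is pending, then clears it so it is reported only once. */
int wdt_snapshot_take(wdt_snapshot_t *out)
{
    if (g_snapshot.magic != SNAPSHOT_MAGIC)
        return 0;
    *out = g_snapshot;
    g_snapshot.magic = 0;
    return 1;
}
```

Writing the magic value last means a snapshot interrupted halfway through is ignored at boot rather than reported as if it were complete.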

What this catches: Deadlocks, frozen tasks, hardware peripherals that stop responding.

What happens without it: Field resets with no root cause. Expensive support tickets. Guesswork.

Technique 2 - Check Memory Limits Before Hardware Ships

Every task running on the device has a memory allocation. Most firmware teams estimate these allocations based on experience and ship. The estimates are often wrong, not because the engineers are careless, but because the memory used by an error recovery path is genuinely hard to estimate without measuring it.

The solution is to run the firmware through its full operational cycle - normal operation, error recovery, OTA update, reconnection after a dropped connection - and measure how close each task comes to its memory limit at every stage. Any task using more than 90% of its allocation is a risk. Any task over 95% is a problem that needs to be fixed before production.

The table below shows what this looks like in practice:

| Task | Memory Allocated | Headroom at Peak Usage | Risk Level |
| --- | --- | --- | --- |
| Sensor sampling | 2048 bytes | 640 bytes remaining | Low |
| BLE connection task | 4096 bytes | 312 bytes remaining | High |
| OTA update handler | 3072 bytes | 88 bytes remaining | Critical |
| Data publish task | 2048 bytes | 820 bytes remaining | Low |

The OTA handler in that table came from a real connected device project. The standard update path worked fine. But the error recovery path (what happens when a firmware block fails to download) used significantly more memory than the happy path. That path was never exercised in code review. Measuring memory usage under real conditions caught it before it caused a field failure.
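
The 90% and 95% thresholds are easy to turn into an automated check. The sketch below is one possible way to do it; stack_risk_t and stack_risk are hypothetical names, and the function assumes the RTOS can report the smallest amount of free memory each task has ever had. On FreeRTOS, for example, uxTaskGetStackHighWaterMark() provides that figure in words rather than bytes, so it would need converting first.

```c
#include <stdint.h>

typedef enum { RISK_LOW, RISK_HIGH, RISK_CRITICAL } stack_risk_t;

/* Classify a task by its peak usage as a share of its allocation.
 * Thresholds follow the text: above 90% is a risk, above 95% must be fixed.
 * Inconsistent inputs are treated as critical so they get looked at. */
stack_risk_t stack_risk(uint32_t allocated_bytes, uint32_t min_free_bytes)
{
    if (allocated_bytes == 0 || min_free_bytes > allocated_bytes)
        return RISK_CRITICAL;

    uint64_t used_x100 = (uint64_t)(allocated_bytes - min_free_bytes) * 100u;

    /* Integer comparison of used/allocated against 95% and 90%. */
    if (used_x100 > (uint64_t)allocated_bytes * 95u)
        return RISK_CRITICAL;
    if (used_x100 > (uint64_t)allocated_bytes * 90u)
        return RISK_HIGH;
    return RISK_LOW;
}
```

Applied to the table, the BLE connection task (4096 bytes allocated, 312 remaining) sits at about 92% and the OTA update handler (3072 allocated, 88 remaining) at about 97%, which matches the High and Critical ratings shown.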

What this catches: Memory overflow faults triggered by error paths that are rarely exercised in testing.

What happens without it: The device works perfectly until it hits the error path in production.

Technique 3 - Configure the Device to Explain Its Own Crashes

When a firmware crash happens on a device in the field, the default behaviour on most embedded systems is to reset and say nothing. The device comes back up. The crash is gone. Support gets a ticket saying "the device restarted unexpectedly."

A better approach is to configure the firmware so that when a crash occurs, it captures exactly what it was doing at that moment - which task was running, what operation triggered the fault - and saves that information before resetting. On the next boot, that data gets sent before normal operation begins.

With this in place, a crash that previously looked like "random reset" becomes "the device crashed while processing incoming data over BLE, because a buffer was accessed before it was ready." That's a one-day fix, not a two-week investigation.

This is purely a configuration and logging change. It doesn't require hardware changes. It costs a few hundred bytes of storage. The return on that investment appears on the first unexplained field reset.
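
As a rough sketch of what such a crash record could look like, the C below stores the faulting address, the caller, and the running task, then hands the record back on the next boot. It assumes an Arm Cortex-M style exception stack frame (R0, R1, R2, R3, R12, LR, PC, xPSR) and memory that survives the reset; other architectures lay the frame out differently. fault_record_t, fault_capture and fault_report_pending are hypothetical names.

```c
#include <stdint.h>

#define FAULT_MAGIC 0xFA17C0DEu   /* arbitrary marker value */

/* Hypothetical crash record kept in memory that survives a reset. */
typedef struct {
    uint32_t magic;
    uint32_t pc;        /* address of the faulting instruction */
    uint32_t lr;        /* return address: who called the faulting code */
    uint32_t task_id;   /* application-defined ID of the running task */
    uint32_t checksum;  /* XOR of the fields above */
} fault_record_t;

static fault_record_t g_fault;    /* stand-in for reset-surviving memory */

static uint32_t fault_checksum(const fault_record_t *r)
{
    return r->magic ^ r->pc ^ r->lr ^ r->task_id;
}

/* Call from the fault handler with a pointer to the exception stack frame.
 * Assumed layout (Cortex-M): R0, R1, R2, R3, R12, LR, PC, xPSR. */
void fault_capture(const uint32_t *frame, uint32_t task_id)
{
    g_fault.magic    = FAULT_MAGIC;
    g_fault.lr       = frame[5];
    g_fault.pc       = frame[6];
    g_fault.task_id  = task_id;
    g_fault.checksum = fault_checksum(&g_fault);
}

/* Call at boot. Returns 1 and copies the record out if an intact record is
 * pending, then clears it so the same crash is not reported twice. */
int fault_report_pending(fault_record_t *out)
{
    if (g_fault.magic != FAULT_MAGIC ||
        g_fault.checksum != fault_checksum(&g_fault))
        return 0;
    *out = g_fault;
    g_fault.magic = 0;
    return 1;
}
```

The whole record is 20 bytes. The checksum guards against reporting garbage when the memory held random contents after a cold power-up.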

What this catches: Any unexpected crash, including memory violations, null pointer errors, and hardware faults.

What happens without it: Field crashes with no root cause data. Support escalations. Firmware revisions shipped blind.

Technique 4 - Test Memory Health Over 24–72 Hours, Not 10 Minutes

Memory problems in embedded devices are slow. A small allocation that never gets released might consume 2 KB per hour. After 10 minutes of testing, that's roughly 340 bytes, which is invisible. After 48 hours, it's 96 KB, and the device crashes.

After code review, the firmware runs continuously for a minimum of 24 hours, preferably 72, with memory usage logged at regular intervals. The log shows whether memory is stable, slowly declining (a leak), or fragmenting over time.

The pattern to watch for: memory looks healthy in the current snapshot but the all-time low keeps dropping. That often points to fragmentation - memory is being allocated and released, but the gaps left behind are too small to reuse. Where the allocator can report it, a shrinking largest free block is the more direct indicator. The device will eventually fail to allocate what it needs, even if the total free memory looks fine.
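
The soak log can be classified automatically at the end of the run. The sketch below is deliberately simple and rests on assumptions: the allocator can report both total free heap and the largest free block, and comparing the first and last sample is enough to show a trend. heap_sample_t, heap_trend and the tolerance parameter are hypothetical; a real run would probably look at the whole series rather than two points.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t free_bytes;     /* total free heap at sample time */
    uint32_t largest_block;  /* largest single block that could be allocated */
} heap_sample_t;

typedef enum { HEAP_STABLE, HEAP_LEAKING, HEAP_FRAGMENTING } heap_trend_t;

/* Compare the first and last sample of a soak run.
 * tolerance: bytes of drift treated as noise (an assumed tuning value). */
heap_trend_t heap_trend(const heap_sample_t *s, size_t n, uint32_t tolerance)
{
    if (n < 2)
        return HEAP_STABLE;

    const heap_sample_t *first = &s[0];
    const heap_sample_t *last  = &s[n - 1];

    /* Total free memory falling over time: something is never released. */
    if (first->free_bytes > last->free_bytes &&
        first->free_bytes - last->free_bytes > tolerance)
        return HEAP_LEAKING;

    /* Total is flat but the largest usable block shrinks: fragmentation. */
    if (first->largest_block > last->largest_block &&
        first->largest_block - last->largest_block > tolerance)
        return HEAP_FRAGMENTING;

    return HEAP_STABLE;
}
```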

On a healthcare device project, a memory leak only appeared after 48 hours of continuous operation. It was traced to an error path in the wireless connection logic that allocated a small buffer and returned early without releasing it - correctly written for the expected case, but leaking on the error case. Code review missed it. The 72-hour runtime test found it.
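
The leak described above follows a common shape: a buffer is allocated, an error is detected, and the function returns before the release. One way to guard against it in C is to route every exit through a single cleanup label. The code below is an illustration of that pattern only, not the project's code; wireless_connect, buf_alloc, buf_free and the link_ok flag are hypothetical, and the counter exists purely so a test can confirm nothing is left allocated.

```c
#include <stdlib.h>
#include <string.h>

int g_live_allocs;   /* outstanding buffers, tracked for testing only */

static void *buf_alloc(size_t n)
{
    void *p = malloc(n);
    if (p != NULL)
        g_live_allocs++;
    return p;
}

static void buf_free(void *p)
{
    if (p != NULL) {
        g_live_allocs--;
        free(p);
    }
}

/* Hypothetical connection step. link_ok stands in for "the radio answered".
 * Returns 0 on success, -1 on failure. */
int wireless_connect(int link_ok)
{
    int rc = -1;
    char *scratch = buf_alloc(64);
    if (scratch == NULL)
        return -1;            /* nothing allocated, nothing to release */

    memset(scratch, 0, 64);
    if (!link_ok)
        goto out;             /* error path: still goes through cleanup */

    rc = 0;                   /* expected-case work would happen here */
out:
    buf_free(scratch);        /* one exit releases the buffer on every path */
    return rc;
}
```

In the leaking version, the error branch would read `return -1;` instead of `goto out;`, and each failed connection would cost 64 bytes.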

What this catches: Memory leaks and fragmentation that build over hours or days of uptime.

What happens without it: The device works in QA, passes testing, and fails after days or weeks in the field.

Technique 5 - Validate Timing Under Full System Load

This is the firmware validation step most teams skip, and the one most likely to cause production failures that are genuinely hard to reproduce.

Here's the scenario: a component of the firmware has a timing deadline. It needs to respond to an incoming signal within a certain window. In isolation, it does. But when the full system is running - wireless connected, sensors active, data being processed - other activity delays it past that deadline. The result is a framing error, a dropped packet, or corrupted data. It happens intermittently. It's nearly impossible to reproduce in a controlled test environment.

The fix is to measure actual timing under full load before the build leaves the engineering team, not after QA finds it or after a customer reports it. The measurement takes a few hours and requires basic lab equipment. Finding a timing violation at this stage means a configuration change. Finding it after production means a firmware revision and a customer support incident.
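
When the measurement is done in firmware rather than with a logic analyser, a small worst-case tracker is usually enough. The sketch below assumes a free-running 32-bit tick or cycle counter is available as the time source (some cores provide a hardware cycle counter for this); latency_stats_t, latency_init and latency_record are hypothetical names.

```c
#include <stdint.h>

typedef struct {
    uint32_t deadline_ticks;  /* allowed response window */
    uint32_t worst_ticks;     /* longest latency seen so far */
    uint32_t misses;          /* responses later than the deadline */
    uint32_t samples;         /* total responses measured */
} latency_stats_t;

void latency_init(latency_stats_t *st, uint32_t deadline_ticks)
{
    st->deadline_ticks = deadline_ticks;
    st->worst_ticks = 0;
    st->misses = 0;
    st->samples = 0;
}

/* t_event: counter value when the signal arrived.
 * t_done:  counter value when the firmware finished responding.
 * Unsigned subtraction stays correct across one wrap of the counter. */
void latency_record(latency_stats_t *st, uint32_t t_event, uint32_t t_done)
{
    uint32_t elapsed = t_done - t_event;

    if (elapsed > st->worst_ticks)
        st->worst_ticks = elapsed;
    if (elapsed > st->deadline_ticks)
        st->misses++;
    st->samples++;
}
```

Run it once with the component in isolation and once with wireless, sensors, and data processing all active. The difference in worst_ticks between the two runs is the figure that matters, and any non-zero misses count under full load is a finding.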

What this catches: Timing faults that only appear when the full system is under load.

What happens without it: Intermittent field failures that are difficult to reproduce and expensive to diagnose.

Technique 6 - Send Inputs the Device Was Never Designed to Handle

Code review validates that the firmware handles expected inputs correctly. This technique checks what happens when it receives inputs it was never designed for.

Every external interface on the device - BLE, Wi-Fi, serial connections, OTA update streams - gets tested with malformed, oversized, undersized, and out-of-sequence inputs. Not because attackers are expected to send them, but because other devices, environmental noise, and edge cases in the real world produce unexpected inputs regularly.

The most common fault found with this technique: a handler that processes incoming data correctly at the expected length but doesn't check whether the actual received length is within bounds. A legitimate device in the field sends a slightly longer packet than expected. The firmware copies it into a fixed-size buffer. The buffer overflows into adjacent memory. The crash happens 200ms later, in an apparently unrelated part of the firmware, with no obvious connection to the oversized input.
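
The missing check is short, and so is a test that proves it is there. The sketch below shows a receive handler that validates the received length before copying, plus a boundary sweep that feeds it every length from zero upward. The names rx_handle, rx_boundary_sweep and rx_last_len, and the 32-byte buffer size, are assumptions made for illustration.

```c
#include <stdint.h>
#include <string.h>

#define RX_BUF_SIZE 32u

typedef enum { RX_OK, RX_ERR_EMPTY, RX_ERR_TOO_LONG } rx_status_t;

static uint8_t g_rx_buf[RX_BUF_SIZE];
static size_t  g_rx_len;

/* Hypothetical receive handler: checks the actual received length against
 * the destination buffer before copying, and rejects anything out of bounds. */
rx_status_t rx_handle(const uint8_t *data, size_t len)
{
    if (data == NULL || len == 0)
        return RX_ERR_EMPTY;
    if (len > sizeof g_rx_buf)
        return RX_ERR_TOO_LONG;

    memcpy(g_rx_buf, data, len);
    g_rx_len = len;
    return RX_OK;
}

/* Length of the last accepted packet. */
size_t rx_last_len(void)
{
    return g_rx_len;
}

/* Boundary sweep: feed every length from 0 to max_len and count how many
 * are accepted. With the check in place only 1..RX_BUF_SIZE should pass.
 * src must point to at least max_len readable bytes. */
size_t rx_boundary_sweep(const uint8_t *src, size_t max_len)
{
    size_t accepted = 0;
    for (size_t len = 0; len <= max_len; len++) {
        if (rx_handle(src, len) == RX_OK)
            accepted++;
    }
    return accepted;
}
```

Without the length check, the sweep would still appear to pass on the bench, because the overflow corrupts adjacent memory silently. Counting accepted packets makes the missing check visible immediately.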

What this catches: Buffer overflows and data corruption triggered by unexpected real-world inputs.

What happens without it: Intermittent crashes that appear random because the connection to the triggering input is not obvious.

What Firmware Debugging Looks Like as a Process

These six techniques run in a structured sequence after every code review sign-off, before any build ships to QA. The full sequence takes 3–4 days.

Day 1: Instrumentation setup (2–4 hours). Configure the watchdog logging, crash capture, and memory monitoring. This is a one-time setup per project that pays back on the first field incident.

Day 2: Full operational cycle run (4–8 hours). Run the firmware through every operational mode: startup, normal operation, error recovery, OTA update, reconnection. Check memory usage across every task. Identify and fix anything close to its limit.

Day 3: Timing and input validation (4–6 hours). Measure timing under full load. Run malformed input tests across all external interfaces. Document any failures as engineering bugs before they become QA findings.

Day 4+: Extended runtime soak (24–72 hours). Let the firmware run continuously. Monitor memory health. Confirm zero unexplained resets before handoff.

| Technique | What it catches | Time investment | Skip when |
| --- | --- | --- | --- |
| Watchdog crash logging | Hangs, deadlocks, frozen tasks | 2 hours setup | Never |
| Memory limit profiling | Stack overflow, undersized allocations | 1 day | Never |
| Crash capture configuration | Any unexpected crash with root cause | 2 hours | Never |
| Extended memory monitoring | Leaks and fragmentation over time | 24–72 hours | Fixed-allocation, short-lived devices |
| Timing validation under load | Intermittent timing faults | 4–6 hours | Devices with no real-time requirements |
| Input boundary testing | Buffer overflows, data corruption | 4–6 hours | Never |

FAQs for Firmware Debugging

What does firmware debugging after code review actually catch that the review misses?

Code review catches logic errors, naming issues, and implementation mismatches. What it cannot catch is runtime behaviour: what happens when two tasks access shared data at the exact same moment, what happens when the device has been running for 48 hours and memory is fragmented, what happens when an input arrives that's slightly longer than expected. These are the faults that cause the most expensive production failures, and they require runtime testing to find.

How do you debug a firmware crash that leaves no log entry?

The first step is making sure the device is configured to save crash data before it resets - if it isn't, that's the fix. A crash with no log entry is almost always a system hang, a memory violation, or a hardware fault. With crash capture configured, the next occurrence produces a report that identifies which task was running and what triggered the fault. Without it, the only option is guesswork and waiting for the crash to happen again.

How long should firmware testing run before production sign-off?

Minimum 72 hours for any connected device that manages memory dynamically. Memory leaks and fragmentation compound over time and won't appear in a short test. For medical devices or anything handling sensitive data, extend to 7 days. The most expensive production firmware failures are almost always faults that take time to build up - slow leaks, gradual fragmentation, timing violations that only occur under specific load conditions.

Should this process happen before or after QA?

Before. These techniques are engineering-owned, not QA-owned. Finding a memory overflow or a timing violation before it reaches QA means a quick engineering fix. Finding it in QA means a firmware revision, re-flash, and delayed QA cycle. Finding it in production means a support incident, a field update, and a customer relationship problem.

How much does this process add to a development timeline?

Three to four days per firmware release cycle. The 72-hour soak runs largely unattended. The instrumentation setup in Day 1 is a one-time investment - once the watchdog logging and crash capture are configured on a project, they stay configured. The cost of skipping this process shows up later: a field failure that takes two weeks to diagnose costs significantly more than four days of structured testing.

Is this relevant for small teams or early-stage products?

More relevant, not less. A small team shipping a connected product doesn't have a QA department to catch what engineering misses. The faults this process finds are the ones that generate the most damaging customer feedback: devices that reset unexpectedly, connections that drop after hours of use, crashes that appear to be random. Finding those before shipping is a decision that directly affects product reputation and customer retention.

Conclusion

Code review is the gate that keeps logic errors out. The firmware fault-finding techniques in this post are the gate that keeps runtime faults out. Together, they cover the two categories of problem that cause the most expensive production failures.

None of these techniques require new tools or significant time investment. They require discipline - running the same structured sequence after every code review, on every project, before any build ships to QA.

If your firmware team is signing off on code review without running post-review fault detection, the problems are still in the code. They're waiting for the right conditions on production hardware to surface.

If you're building a connected device and want a firmware team that runs this process on every project, CoreFragment's embedded engineering team can review your current process and identify exactly where the gaps are.

Author

Parthraj Gohil

Parthraj Gohil is the Founder and CEO of CoreFragment Technologies. He runs a team of IoT developers, embedded engineers, app developers and AI engineers. With more than 10 years of industry experience, he has delivered IoT firmware projects across the Healthcare, IoT, Embedded-AI, Wearables and Automotive industries.

Have Something on Your Mind? Contact Us : info@corefragment.com or +91 79 4007 1108
