Firmware Development Best Practices: Common Challenges and How to Fix Them

Why Should Firmware Engineers Follow Best Practices?

The firmware team has been working for six weeks. The code compiles. The basic use case works on the bench. Then the first hardware revision comes back from the PCB house, the team integrates the full peripheral stack, and the device starts resetting randomly. The logs show nothing. The reset happens between log entries.

That is not bad luck. It is what happens when the firmware development best practices that experienced embedded teams apply before integration are left until after it. The stack overflow that caused the reset was detectable at unit test time. The race condition in the ISR was visible in code review. The watchdog timeout that produced no useful data could have been configured to capture diagnostics from day one.

By the end of this post, you will know the seven most common firmware development challenges that derail production timelines, and the specific practices that prevent each one before it becomes a field problem.

Why Firmware Development Best Practices Matter More Than Software Development Ones

Software bugs crash an app. Firmware bugs crash a device. The difference is recovery cost.

A crashed web app gets a restart in 10 seconds. A crashed firmware that writes corrupted data to flash memory gets a factory reflash. Or, in production, a truck roll. A firmware bug that causes intermittent resets 30 days after deployment gets reproduced in a lab, diagnosed across multiple hardware revisions, and fixed in a firmware update campaign. The cost of a firmware defect that escapes to production is 10 to 50 times higher than the cost of finding it during development.

Three characteristics of embedded firmware make best practices non-optional:

Hardware dependency. Firmware runs on specific hardware with specific timing constraints. A software unit test runs on a simulator. A firmware integration test runs on the actual chip. Actual clock frequency. Actual peripheral timing. A bug that only appears on hardware is harder to reproduce, harder to diagnose, and harder to fix than one that appears on a simulator.

Concurrent execution under interrupts. Firmware typically runs with interrupts that can fire at any moment and preempt the main execution path. Race conditions between ISRs and main-loop code do not appear in sequential testing. They appear on hardware, under load, after the system has been running long enough for a timing alignment to occur.

No recovery path after deployment. A firmware bug without OTA capability is permanent. Physical access is the only fix. Even with OTA, updating 10,000 deployed devices takes time, bandwidth, and risk. The only reliable recovery is finding the bug before shipping.

Practice 1: Define the Task Architecture Before Writing Any Code

The single most expensive firmware mistake is starting with working code and finding the architecture is wrong at integration time. Task architecture (which code runs in which context, what priorities tasks have, and how they communicate) determines whether the firmware can be extended and debugged without destabilising what already works.

What a correct task architecture looks like:

Each task owns one responsibility. A sensor task reads sensors. A communication task manages the wireless stack. A storage task writes to flash. Tasks communicate through queues, not shared global variables. Each task has a defined stack size, set from measurement rather than guesswork.

The most common architecture mistake:

A single super-loop that does everything: reads sensors, processes data, manages the BLE stack, handles button presses, and updates the display, all sequentially in one main loop. This works in a prototype. It breaks under any condition where the timing of one operation affects another. A BLE event that arrives while the sensor reading is in progress gets dropped. A long flash write blocks the button handler. The display update introduces jitter in the sensor sample interval.

The fix:

Define the task list before firmware development starts. For each task, specify: its responsibility, its priority, its maximum stack depth, its communication interfaces (which queues it reads from and writes to), and its timing requirement. This document takes half a day to write. It saves weeks of integration debugging.

| Task | Priority | Stack | Communication | Timing requirement |
|---|---|---|---|---|
| Sensor sampling | High | 1 KB | Writes to sensor queue | Every 100 ms, ±5 ms |
| BLE stack task | Highest | 4 KB | Reads command queue, writes data queue | Event-driven |
| Data processing | Medium | 2 KB | Reads sensor queue, writes storage queue | Within 50 ms of sample |
| Flash storage | Low | 1 KB | Reads storage queue | Async, no deadline |
| System monitor | Low | 512 B | Reads all task health | Every 1 second |
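The task list above can also live in the code as a table, so the document and the firmware are less likely to drift apart. What follows is a minimal sketch in plain C, not tied to any RTOS. The `task_spec_t` type, its field names, and the numeric priority values are assumptions made for illustration, and deadlines such as "within 50 ms of sample" are left out for brevity.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical task specification record; names are illustrative only. */
typedef struct {
    const char *name;
    uint8_t     priority;     /* assumed: higher number = higher priority */
    uint16_t    stack_bytes;  /* initial stack allocation */
    uint32_t    period_ms;    /* 0 = event-driven or no fixed period */
} task_spec_t;

/* Values copied from the task list; priorities mapped to numbers by assumption. */
static const task_spec_t task_table[] = {
    { "sensor_sampling", 3, 1024, 100  },
    { "ble_stack",       4, 4096, 0    },
    { "data_processing", 2, 2048, 0    },
    { "flash_storage",   1, 1024, 0    },
    { "system_monitor",  1,  512, 1000 },
};

#define TASK_COUNT (sizeof task_table / sizeof task_table[0])

/* Total RAM reserved for task stacks, to compare against the RAM budget. */
uint32_t total_stack_bytes(void)
{
    uint32_t total = 0;
    for (size_t i = 0; i < TASK_COUNT; i++) {
        total += task_table[i].stack_bytes;
    }
    return total;
}
```

A table like this gives the system monitor one place to iterate over tasks, and it makes the total stack budget a number that can be checked at build or boot time.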

Practice 2: Configure the Watchdog to Capture State Before It Resets

Every firmware team enables the watchdog. Most teams configure it to reset the device and nothing else. That is the wrong configuration because a reset with no diagnostic information turns every watchdog event into a guessing game.

What the watchdog should do:

Before resetting, the watchdog pre-reset hook captures the system state: which task was running, what the last log entries were, what the stack depth of every task was at that moment, and what the reset reason register says. All of this writes to a protected region of non-volatile memory. On the next boot, the firmware reads that region and transmits the diagnostic data before starting normal operation.

With this in place, a watchdog reset produces a report. Without it, a watchdog reset produces nothing.
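One way to picture the pre-reset capture is as a fixed record with a marker and a checksum, so the next boot can tell a real report from random memory. The sketch below is host-runnable C under stated assumptions: `reset_record_t`, its fields, the magic value, and the simple rotate-and-XOR checksum are all hypothetical, and a real target would place the record in a no-init RAM section or a reserved flash region and would likely use a CRC.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RESET_RECORD_MAGIC 0x57444F47u  /* arbitrary marker, spells "WDOG" */
#define MAX_TASKS 5u

/* Hypothetical diagnostic record written just before a watchdog reset. */
typedef struct {
    uint32_t magic;
    uint32_t reset_reason;               /* copy of the MCU reset-reason register */
    uint32_t running_task_id;            /* task that was executing at capture time */
    uint32_t free_stack_words[MAX_TASKS];
    uint32_t checksum;                   /* covers every field before it */
} reset_record_t;

/* Simple rotate-and-XOR checksum over all bytes that precede the checksum field. */
static uint32_t record_checksum(const reset_record_t *r)
{
    const uint8_t *p = (const uint8_t *)r;
    uint32_t sum = 0;
    for (size_t i = 0; i < offsetof(reset_record_t, checksum); i++) {
        sum = ((sum << 1) | (sum >> 31)) ^ p[i];
    }
    return sum;
}

/* Called from the pre-reset hook: fill the record and seal it. */
void reset_record_capture(reset_record_t *r, uint32_t reason, uint32_t task_id,
                          const uint32_t *free_stack, size_t task_count)
{
    memset(r, 0, sizeof *r);
    r->magic = RESET_RECORD_MAGIC;
    r->reset_reason = reason;
    r->running_task_id = task_id;
    for (size_t i = 0; i < task_count && i < MAX_TASKS; i++) {
        r->free_stack_words[i] = free_stack[i];
    }
    r->checksum = record_checksum(r);
}

/* Called on the next boot: only transmit the record if it is intact. */
bool reset_record_valid(const reset_record_t *r)
{
    return r->magic == RESET_RECORD_MAGIC && r->checksum == record_checksum(r);
}
```

After a valid record has been transmitted, the boot code would normally clear the magic field so the same report is not sent twice.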

Common causes a configured watchdog catches immediately:

  • A task blocked on a semaphore that was never released because the releasing task crashed

  • A deadlock between two tasks competing for two resources in opposite order

  • A tight loop in an ISR that prevents the scheduler from running

  • A peripheral that stopped responding and left the firmware waiting on a poll

Practice 3: Measure Stack Depth Before Any Build Leaves Engineering

Stack overflow is the most common cause of unexplained firmware resets. It is also one of the most preventable because stack depth is measurable before hardware is needed.

Why teams get stack sizes wrong:

Most developers estimate stack sizes. Estimation is inaccurate because the maximum call depth of a task depends on which code paths execute under which conditions. The error recovery path that allocates 800 bytes of local variables only runs when a peripheral fails. It never runs in normal bench testing. It runs exactly once in production, at 3am, when the device is 200km away.

How to measure correctly:

FreeRTOS provides a watermark function, uxTaskGetStackHighWaterMark(), that returns the minimum free stack space observed since the task started, reported in words rather than bytes. Run the firmware through the full operational cycle: normal operation, error recovery paths, OTA update, BLE reconnection, power cycle. Log the watermark for every task. Any task showing less than 10% free stack is a risk. Any task below 5% is a defect.

Run this measurement before any hardware handoff to QA. The measurement takes one day. Finding a stack overflow in QA adds one to two weeks.
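The 10% and 5% thresholds are easy to turn into a check that runs in the system monitor or in a test harness. Below is a small sketch in portable C. The `classify_stack` function and the `stack_status_t` enum are hypothetical names. On a FreeRTOS target the second argument would come from `uxTaskGetStackHighWaterMark()`, which reports free space in words, so both arguments need to be in the same unit.

```c
#include <stdint.h>

typedef enum { STACK_OK, STACK_RISK, STACK_DEFECT } stack_status_t;

/* stack_size: allocated stack; min_free: lowest free space ever observed.
 * Both values must use the same unit (words or bytes). */
stack_status_t classify_stack(uint32_t stack_size, uint32_t min_free)
{
    if (stack_size == 0u) {
        return STACK_DEFECT;            /* nothing allocated: treat as a defect */
    }
    /* Integer percentage of the stack that was never touched. */
    uint32_t free_pct = (uint32_t)(((uint64_t)min_free * 100u) / stack_size);
    if (free_pct < 5u) {
        return STACK_DEFECT;
    }
    if (free_pct < 10u) {
        return STACK_RISK;
    }
    return STACK_OK;
}
```

Logging the status for every task at the end of each test cycle turns the measurement into something that can fail a build rather than a number someone has to remember to read.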

Stack sizing rule:

Set the initial stack size at twice the estimated maximum depth. After measuring the watermark under full conditions, size the final stack at 1.3 times the observed maximum depth (the allocated size minus the watermark), not at the observed depth itself. The measurement shows you the worst case you tested. The 1.3x multiplier covers the cases you did not test.
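The sizing rule is plain arithmetic, and a worked version makes the difference between the watermark and the observed depth explicit. The helper names below are hypothetical. Rounding the result up to a multiple of 8 bytes is an added assumption, based on the stack alignment that ARM's procedure call standard expects, and may not apply to other architectures.

```c
#include <stdint.h>

/* Initial allocation: twice the estimated maximum depth. */
uint32_t initial_stack_bytes(uint32_t estimated_depth_bytes)
{
    return estimated_depth_bytes * 2u;
}

/* Final allocation: 1.3 x the deepest usage actually observed.
 * allocated_bytes: current stack size; min_free_bytes: watermark in bytes. */
uint32_t final_stack_bytes(uint32_t allocated_bytes, uint32_t min_free_bytes)
{
    uint32_t peak_used = allocated_bytes - min_free_bytes;  /* observed maximum depth */
    uint32_t sized = (peak_used * 13u + 9u) / 10u;          /* multiply by 1.3, rounding up */
    return (sized + 7u) & ~7u;                              /* assumed 8-byte alignment */
}
```

For example, a task estimated at 1,024 bytes starts with 2,048. If the watermark after a full test cycle is 1,148 bytes free, the observed depth is 900 bytes, and the final stack comes out at 1,176 bytes.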

Practice 4: Treat Every ISR as a Critical Section With Measurable Latency

Interrupt service routines are where firmware race conditions live. An ISR that runs too long blocks the scheduler. An ISR that shares state with a task without a critical section produces data corruption that appears random because timing determines which execution wins.

ISR design rules:

An ISR should do one thing: capture data and signal the task that will process it. Reading a byte from a UART receive register and writing it to a ring buffer is correct ISR behaviour. Parsing a protocol frame inside an ISR is not. Calling printf inside an ISR is never acceptable. Calling any function that can block inside an ISR is a defect: depending on the RTOS and port, it can deadlock the system, corrupt scheduler state, or trigger a fault.
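The UART example can be sketched as a single-producer, single-consumer ring buffer: the ISR only pushes, the task only pops, and neither ever waits. This is a minimal sketch, not a drop-in driver. The type and function names are hypothetical, and the lock-free scheme assumes a single-core MCU where aligned 32-bit loads and stores are atomic. Multi-core or cached systems would need memory barriers on top of this.

```c
#include <stdbool.h>
#include <stdint.h>

#define RB_SIZE 64u  /* power of two, so wrap-around is a cheap mask */

typedef struct {
    volatile uint8_t  data[RB_SIZE];
    volatile uint32_t head;  /* written only by the ISR (producer) */
    volatile uint32_t tail;  /* written only by the task (consumer) */
} ring_buffer_t;

/* Called from the ISR: store one byte and return immediately.
 * Returns false when the buffer is full, so the caller can count overruns. */
bool rb_push_from_isr(ring_buffer_t *rb, uint8_t byte)
{
    uint32_t next = (rb->head + 1u) & (RB_SIZE - 1u);
    if (next == rb->tail) {
        return false;          /* full: drop the byte, never block */
    }
    rb->data[rb->head] = byte;
    rb->head = next;           /* publish only after the data is in place */
    return true;
}

/* Called from the task: fetch one byte if available. */
bool rb_pop(ring_buffer_t *rb, uint8_t *out)
{
    if (rb->tail == rb->head) {
        return false;          /* empty */
    }
    *out = rb->data[rb->tail];
    rb->tail = (rb->tail + 1u) & (RB_SIZE - 1u);
    return true;
}
```

One slot is kept empty to distinguish full from empty, so the usable capacity is RB_SIZE - 1 bytes. Protocol parsing then happens in the task that calls `rb_pop`, outside interrupt context.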

Measuring ISR latency:

Use a GPIO toggle at ISR entry and exit, measured with a logic analyser. The pulse width gives the ISR execution time, and the delay from the triggering event to the rising edge gives the entry latency. Measure under full system load (all peripherals active, BLE connected, active MQTT session) because ISR latency under idle conditions is not the number that matters. Under full load, ISR latency can be 5 to 10 times higher than under idle because of interrupt nesting and memory bus contention.

Any ISR with a latency deadline must be validated at full system load, not at idle. A UART byte that must be read within one character time at the configured baud rate is a typical example.
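The UART deadline is simple to compute, which makes it a good number to write down next to the logic analyser capture. The sketch below assumes a receiver without a hardware FIFO, so each byte must be read before the next frame completes. The function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Time to receive one UART frame, in nanoseconds.
 * bits_per_frame counts start + data + parity + stop bits (10 for 8N1). */
uint32_t uart_char_time_ns(uint32_t baud, uint32_t bits_per_frame)
{
    return (uint32_t)(((uint64_t)bits_per_frame * 1000000000ull) / baud);
}

/* True when the worst-case measured delay (entry latency plus the time
 * needed to read the data register) fits inside one character time. */
bool uart_isr_meets_deadline(uint32_t worst_case_ns, uint32_t baud,
                             uint32_t bits_per_frame)
{
    return worst_case_ns < uart_char_time_ns(baud, bits_per_frame);
}
```

At 115,200 baud with 8N1 framing, one character time is about 86.8 µs. A worst-case delay of 20 µs measured at idle leaves plenty of margin. The same ISR measured at 100 µs under full load does not.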

Shared state protection:

Any variable written inside an ISR and read outside it must be protected. On ARM Cortex-M processors, this means marking the variable as volatile and wrapping all accesses outside the ISR with critical section guards, or passing data through a queue rather than a shared variable. A shared variable that is not protected produces intermittent corruption that only reproduces when a specific timing condition occurs. That is exactly the class of bug that causes the most expensive production incidents.
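A critical section guard around a multi-field read might look like the sketch below. It is written to run on a host, so `irq_save_and_disable` and `irq_restore` are hypothetical stand-ins that only track a flag. On a Cortex-M target they would typically be built from the CMSIS intrinsics `__get_PRIMASK()`, `__disable_irq()`, and `__set_PRIMASK()`. The `sample_t` type is also an assumption for illustration.

```c
#include <stdint.h>

/* Host-side stand-ins for the interrupt mask (assumption for this sketch). */
static uint32_t irq_mask;  /* 0 = interrupts enabled, 1 = masked */

uint32_t irq_save_and_disable(void)
{
    uint32_t previous = irq_mask;
    irq_mask = 1u;
    return previous;
}

void irq_restore(uint32_t previous)
{
    irq_mask = previous;   /* restores the prior state, so guards can nest */
}

typedef struct {
    uint32_t timestamp_ms;
    int16_t  x, y, z;
} sample_t;

static volatile sample_t latest_sample;  /* written by the ISR, read by a task */

/* ISR side: update every field of the shared sample. */
void sample_isr(uint32_t now_ms, int16_t x, int16_t y, int16_t z)
{
    latest_sample.timestamp_ms = now_ms;
    latest_sample.x = x;
    latest_sample.y = y;
    latest_sample.z = z;
}

/* Task side: copy all fields with interrupts masked, so the ISR cannot
 * update the sample halfway through the copy. */
sample_t read_latest_sample(void)
{
    sample_t copy;
    uint32_t state = irq_save_and_disable();
    copy.timestamp_ms = latest_sample.timestamp_ms;
    copy.x = latest_sample.x;
    copy.y = latest_sample.y;
    copy.z = latest_sample.z;
    irq_restore(state);
    return copy;
}
```

Without the guard, the task could read a new timestamp with old axis values. The critical section is kept to four loads so the added interrupt latency stays small.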

Practice 5: Run a Power Profile Before Any Hardware Goes to QA

Power bugs are invisible on the bench. A device drawing 8mA in what should be a 5µA sleep state looks exactly like a device drawing 5µA unless someone measures it.

Most firmware power problems are not complex. They fall into three categories:

Wrong sleep state. The firmware requests a light sleep state because it is simpler to configure. The hardware supports a deep sleep state that draws 100 times less current. Fix: identify the deepest sleep state that satisfies all required wake sources and configure it explicitly before any power testing.

Peripheral not shut down before sleep. A sensor in software-disable mode still draws standby current. A UART peripheral with its clock running draws leakage current. Fix: audit every peripheral initialised in the firmware, confirm its power state before sleep entry, and add explicit shutdown calls for every peripheral not needed during sleep.

Scheduler wake events. At a typical 1 kHz tick rate, the FreeRTOS tick interrupt fires every millisecond, waking the processor 1,000 times per second during what should be deep sleep. Fix: enable tickless idle before any power validation. On ports that ship with a built-in tickless implementation, this is a one-line configuration change with a significant power impact on sleep-dominant devices.
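For reference, the tickless idle switch in FreeRTOS is a configuration macro. The fragment below is a sketch of the relevant lines in `FreeRTOSConfig.h`. It assumes a port with a built-in tickless implementation, such as the Cortex-M ports. Other ports need their own `portSUPPRESS_TICKS_AND_SLEEP()` implementation, and the threshold value shown is only an example.

```c
/* FreeRTOSConfig.h (fragment) */

/* 1 = use the port's built-in tickless idle implementation. */
#define configUSE_TICKLESS_IDLE                 1

/* Minimum number of idle ticks expected before the kernel bothers to
 * stop the tick and sleep; the value here is illustrative. */
#define configEXPECTED_IDLE_TIME_BEFORE_SLEEP   2
```

The built-in implementation only stops the tick. Entering the deepest sleep state and shutting down peripherals still has to be handled by the application, so a current measurement is needed to confirm the change had the intended effect.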

Take a current measurement before any build goes to QA. The measurement takes two hours. Discovering a power defect in user testing on a battery-powered device takes weeks to reproduce and diagnose.

| Power problem | Typical current waste | Fix | Effort |
|---|---|---|---|
| Wrong sleep state | 50–500µA | Configure correct sleep mode | 2-4 hours |
| Peripheral not shut down | 20–300µA per peripheral | Add shutdown before sleep entry | 1-2 hours |
| Scheduler tick active | 100–800µA | Enable tickless idle | 1 hour |
| Sensor left in standby | 50–200µA | Add hardware power switch or software power-down | 2-4 hours |
| Radio not duty cycled | 500–5,000µA | Tune connection interval or duty cycle | Half day |
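To see why a few hundred microamps matter, it helps to turn the measured currents into an average and a battery life. The sketch below does that with integer arithmetic. The function names and every figure in the example (8 mA active for 50 ms every 10 seconds, a 220 mAh cell) are hypothetical, and the model ignores self-discharge and voltage droop.

```c
#include <stdint.h>

/* Average current in microamps for a device that is active for active_ms
 * out of every period_ms and asleep for the rest. */
uint32_t average_current_ua(uint32_t active_ua, uint32_t active_ms,
                            uint32_t sleep_ua, uint32_t period_ms)
{
    if (period_ms == 0u || active_ms > period_ms) {
        return 0u;  /* invalid duty cycle */
    }
    uint64_t charge = (uint64_t)active_ua * active_ms
                    + (uint64_t)sleep_ua * (period_ms - active_ms);
    return (uint32_t)(charge / period_ms);
}

/* Idealised battery life in hours for a given capacity and average draw. */
uint32_t battery_life_hours(uint32_t capacity_mah, uint32_t average_ua)
{
    if (average_ua == 0u) {
        return 0u;
    }
    return (uint32_t)(((uint64_t)capacity_mah * 1000u) / average_ua);
}
```

With a 5 µA sleep current, the example averages 44 µA and lasts about 5,000 hours. With one peripheral left on at 300 µA, the average rises to 338 µA and the same cell lasts about 650 hours. Nothing about the device's behaviour on the bench would reveal the difference.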

Practice 6: Use Static Analysis and Code Review Together, Not Interchangeably

Static analysis and code review find different things. Treating one as a substitute for the other leaves a class of defects undetected either way.

What static analysis finds:

Uninitialised variables. Buffer overflows at known indices. Integer overflow in arithmetic. Null pointer dereferences on known null paths. Memory leaks in allocations without matching frees. These are mechanical defects that a tool catches faster and more reliably than a human reviewer.
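A short example shows the kind of code these checks protect. The function below is a hypothetical length-prefixed payload parser, written with the guards in place. Each comment notes the defect a static analyser would be expected to flag if that guard were removed. The frame format and names are assumptions for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAYLOAD_MAX 32u

/* Copies a length-prefixed payload out of a received frame.
 * frame[0] holds the payload length; the payload follows it.
 * Returns the number of bytes copied, or -1 for a malformed frame. */
int extract_payload(const uint8_t *frame, size_t frame_len,
                    uint8_t out[PAYLOAD_MAX])
{
    if (frame == NULL || out == NULL || frame_len < 1u) {
        return -1;                 /* guard: null pointer dereference */
    }
    size_t len = frame[0];
    if (len > PAYLOAD_MAX) {
        return -1;                 /* guard: write past the end of out[] */
    }
    if (len > frame_len - 1u) {
        return -1;                 /* guard: read past the end of the frame */
    }
    memcpy(out, &frame[1], len);
    return (int)len;
}
```

None of these guards needs knowledge of the product. That is why a tool is the right place to catch them, leaving reviewers free to ask whether the frame format itself is correct.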

Configure the static analyser to treat all warnings as errors before any code is reviewed. A reviewer facing 40 unresolved warnings triages warnings instead of reviewing logic.

What code review finds:

Wrong algorithm for the requirements. Missing error handling for a case the spec covers. A race condition between two tasks that the analyser cannot see because it requires understanding the execution context. A design that works for the prototype but breaks when a hardware revision changes a timing assumption.

The review process that catches hardware-firmware integration bugs:

Before review, the author writes a one-paragraph description of the change: what it does, why it does it this way, and what the failure mode would be if the assumption is wrong. Reviewers who understand the stated assumption can challenge it. Reviewers who are simply reading code miss the assumption entirely.

Practice 7: Build the OTA Update Infrastructure Before the Feature Set

The most common firmware architecture mistake in IoT products is treating OTA as a feature to add after the core functionality is working. OTA is not a feature. OTA is the infrastructure that makes every other feature fixable after deployment.

A firmware architecture that was not designed for OTA from the start requires three things to add it later: a flash memory repartitioning (which requires reflashing every device already in the field), a bootloader change (which cannot be deployed over OTA because the bootloader must be in place first), and a firmware architecture change to the update process itself.

What OTA requires at the architecture stage:

  • A bootloader partition that is never overwritten by OTA and handles image verification and partition switching

  • A dual-application partition layout with active and staging partitions

  • A flash memory map that reserves space for both partitions before the feature firmware is written

  • A signing key workflow in the build pipeline before the first firmware is compiled

Define all four of these before the first line of application firmware is written. The cost is one day of architecture planning. The alternative is a repartitioning operation on every deployed device, which on a fleet of any meaningful size is not feasible.
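The flash memory map is worth expressing as data and checking automatically, because an overlap found after deployment cannot be fixed. The sketch below assumes a hypothetical 1 MB flash with a bootloader, two equally sized application partitions, and a diagnostics region. All names, offsets, and sizes are illustrative and do not represent any particular chip's layout.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    const char *name;
    uint32_t    offset;  /* start address relative to the base of flash */
    uint32_t    size;    /* length in bytes */
} partition_t;

#define FLASH_SIZE (1024u * 1024u)  /* hypothetical 1 MB part */

/* Hypothetical layout, listed in ascending address order. */
static const partition_t flash_map[] = {
    { "bootloader",  0x00000u, 0x08000u },  /*  32 KB, never written by OTA */
    { "app_active",  0x08000u, 0x70000u },  /* 448 KB */
    { "app_staging", 0x78000u, 0x70000u },  /* 448 KB, same size as active */
    { "diagnostics", 0xE8000u, 0x08000u },  /*  32 KB */
};

/* Checks that partitions are in ascending order, non-empty,
 * non-overlapping, and contained within the flash device. */
bool flash_map_valid(const partition_t *map, size_t count, uint32_t flash_size)
{
    uint32_t next_free = 0u;
    for (size_t i = 0; i < count; i++) {
        if (map[i].size == 0u || map[i].offset < next_free) {
            return false;                       /* empty or overlapping */
        }
        if (map[i].offset > flash_size ||
            map[i].size > flash_size - map[i].offset) {
            return false;                       /* runs past the end of flash */
        }
        next_free = map[i].offset + map[i].size;
    }
    return true;
}
```

Running a check like this in the build, along with an assertion that the active and staging partitions are the same size, catches a layout mistake on the day it is made.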

Firmware Development Best Practices: Pre-Build Checklist

Run through this checklist on every project before any build goes beyond the initial prototype stage.

Architecture:

  • Task list defined with responsibilities, priorities, stack sizes, and communication interfaces

  • No shared global variables between tasks — all state passed through queues or protected by mutexes

  • ISR responsibilities limited to data capture and task signalling only

  • OTA flash partition layout defined and frozen before application firmware begins

Fault handling:

  • Watchdog enabled with pre-reset state capture writing to non-volatile storage

  • HardFault handler captures PC, LR, and fault registers before reset

  • Stack watermark measurement integrated into the development cycle — not just done once

  • All malloc failure paths handled — never silently ignored

Power:

  • Sleep state selection verified against required wake sources

  • Tickless idle enabled on any RTOS-based device with sleep as a design requirement

  • Current measurement taken before hardware leaves engineering

Code quality:

  • Static analysis configured with all warnings as errors

  • Code review process includes a change description written by the author before review

  • All ISR-to-task shared state protected with critical sections or queues

Conclusion

Firmware development best practices are not style guidelines for writing cleaner code. They are the decisions that determine whether a device works in the field or generates a support incident six months after shipping.

Task architecture, watchdog configuration, stack measurement, ISR design, power validation, static analysis with code review, and OTA infrastructure each address a specific class of failure that shows up in production if it is not dealt with during development. None of them is complex. All of them require discipline to apply before the pressure of a launch deadline arrives.

If you are building a firmware team or scoping a connected product and want an embedded engineering team that applies these practices on every project, CoreFragment's firmware team has shipped firmware across healthcare devices, industrial sensors, wearables, and connected consumer products. Share your device architecture and you will get a direct assessment of where the risks are and what the right firmware approach looks like before development starts.

Author

Parthraj Gohil

Parthraj Gohil is the Founder and CEO of CoreFragment Technologies. He runs a team of IoT developers, embedded engineers, app developers, and AI engineers. With more than 10 years of industry experience, he has delivered IoT firmware projects across the healthcare, IoT, embedded AI, wearables, and automotive industries.
