- June 26, 2024
Your IoT product shipped 2,000 units. A firmware bug surfaces in the field: a memory leak that causes a reset after 72 hours of uptime. The fix is two lines of code. Deploying it requires a firmware update to every device in the field, over Wi-Fi or cellular, while devices are actively in use, without a single one going offline permanently.
Push a broken update and you have 2,000 bricked devices. No remote recovery. A technician visit per unit. At $150 per truck roll, that is $300,000 to fix what was a two-line change.
By the end of this post, you will know exactly how to design OTA updates for IoT devices that handle the update correctly, roll back safely when something goes wrong, and scale from 100 devices to 100,000 without architectural changes.
A firmware update that bricks a device is never a firmware bug. It is an architecture failure. The way you design OTA updates for IoT devices determines whether a failed update is recoverable or permanent. The firmware may have had a defect, but a correctly designed OTA system catches that defect before it becomes permanent.
Three design gaps account for nearly every production OTA failure:
No rollback mechanism. The most common failure mode: a new firmware image is written directly over the active firmware. If the new image has a boot failure, the device cannot start. There is no fallback. Recovery requires physical access. A dual-partition architecture prevents this entirely — the new image goes into a staging partition while the active firmware keeps running, and the device only switches to the new image after verifying it boots successfully.
No image verification before writing. Without a check, an image corrupted by a partial download or network error is written to flash as if it were good. A correctly designed system rejects that image before it touches flash. Full stop. Verifying the hash or signature before writing takes milliseconds. Skipping it converts a transient network error into a bricked device.
No staged rollout. Pushing a new firmware to 2,000 devices simultaneously means a defect in the new firmware affects 2,000 devices simultaneously. A staged rollout deploys to 1% of the fleet first, monitors for errors, and promotes to the full fleet only after a defined observation window passes. The defect affects 20 devices, not 2,000.
Get all three right and OTA updates for IoT devices become the safety net they are supposed to be. Get any one wrong and OTA becomes the most expensive failure mode in your product.
The dual-partition architecture is not optional for production OTA updates on IoT devices. Every production OTA update system for IoT devices is built on it or eventually migrates to it after a bricking incident.
How it works:
The device flash memory contains two application partitions (an active partition and a staging partition) plus a bootloader that runs before either. During normal operation, the device boots from the active partition. When an update arrives, the new firmware image writes to the staging partition while the active firmware keeps running. On the next boot, the bootloader verifies the staging image and, on success, marks it as the new active partition.
If the staging image fails verification, fails to boot, or crashes within a defined observation window, the bootloader marks it as invalid and returns to the previous active partition. Automatically. No network required.
| Partition | Role | Written by |
|---|---|---|
| Bootloader | Verifies images, selects partition to boot | Factory (never updated over OTA) |
| Active partition | Currently running firmware | Bootloader on promotion from staging |
| Staging partition | Incoming OTA image | OTA update process |
| Settings / NVS | Device configuration, calibration, certificates | Application firmware |
The bootloader partition must never be updatable over OTA. A bootloader that gets replaced over the air can be bricked by a bad update with no recovery path. The bootloader is the recovery mechanism — it must stay intact.
MCUboot is the standard open-source bootloader for embedded OTA update systems. It supports the dual-partition (swap) model, cryptographic image signing, and rollback marking. ESP-IDF's native OTA system implements the same dual-partition concept with native support on ESP32 family chips. Both are production-proven across millions of deployed devices.
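A layout like the one in the table can be sanity-checked in the build pipeline before it is ever flashed. The sketch below is a minimal model, not a real partition tool: the `Partition` type, the `check_layout` helper, and every offset and size in `LAYOUT` are hypothetical values for a 1 MiB flash part.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    name: str
    offset: int         # start address in flash, in bytes
    size: int           # length in bytes
    ota_writable: bool  # may the OTA update process write here?

    @property
    def end(self) -> int:
        return self.offset + self.size


# Hypothetical 1 MiB flash map; offsets and sizes are illustrative only.
LAYOUT = [
    Partition("bootloader", 0x00000, 0x10000, ota_writable=False),
    Partition("active",     0x10000, 0x70000, ota_writable=False),
    Partition("staging",    0x80000, 0x70000, ota_writable=True),
    Partition("settings",   0xF0000, 0x10000, ota_writable=False),
]


def check_layout(parts, flash_size):
    """Return a list of problems found in the layout (empty if none)."""
    problems = []
    ordered = sorted(parts, key=lambda p: p.offset)
    # Adjacent partitions (by start address) must not overlap.
    for prev, cur in zip(ordered, ordered[1:]):
        if cur.offset < prev.end:
            problems.append(f"{prev.name} overlaps {cur.name}")
    if max(p.end for p in parts) > flash_size:
        problems.append("layout extends past end of flash")
    by_name = {p.name: p for p in parts}
    # The staging slot has to hold any image the active slot can hold.
    if by_name["staging"].size < by_name["active"].size:
        problems.append("staging is smaller than active")
    if by_name["bootloader"].ota_writable:
        problems.append("bootloader must not be OTA-writable")
    return problems
```

Calling `check_layout(LAYOUT, 0x100000)` on the map above returns an empty list. Running a check like this in CI is one way to catch a layout mistake while it is still cheap to fix.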
Image verification is the line between a recoverable OTA failure and a bricked device. An image that arrives with corrupted bytes from a network error, a truncated download, or a man-in-the-middle substitution should be rejected entirely. The device recovers. The OTA system logs the failure. A new attempt delivers the correct image.
Two-layer verification:
Layer 1: Integrity check. A SHA-256 hash of the complete firmware image is calculated server-side and included in the update manifest. The device recalculates the hash after download and compares. A mismatch means the image is corrupted or incomplete. Reject it. Do not write anything to flash.
Layer 2: Authenticity check. A cryptographic signature from the firmware signing key confirms the image was produced by a trusted source. The device holds the public key. The private key stays with the signing step of your build pipeline, not on the OTA distribution server. An image that passes the hash check but fails the signature check was not produced by your build pipeline. Reject it. This is the layer that prevents a compromised server or network path from delivering malicious firmware to your devices.
Both checks run before any bytes touch the staging partition. Both are mandatory. This is not a performance concern. SHA-256 verification of a 512KB firmware image takes under 100ms on a Cortex-M4 running at 64MHz. There is no justification for skipping either check.
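The two layers can be sketched as a single gate that an image must pass before anything is written. This is a model in Python rather than firmware code. The `verify_image` function and its return strings are made up for illustration, and `verify_signature` is a hypothetical callback standing in for whatever ECDSA or RSA verification your platform provides with the device's public key. Signing the digest rather than the raw image is also an assumption of this sketch.

```python
import hashlib
import hmac
from typing import Callable


def verify_image(image: bytes,
                 expected_sha256_hex: str,
                 signature: bytes,
                 verify_signature: Callable[[bytes, bytes], bool]) -> str:
    """Return "ok", "bad_hash", or "bad_signature"."""
    digest = hashlib.sha256(image).digest()

    # Layer 1: integrity. The expected hash comes from the update manifest.
    try:
        expected = bytes.fromhex(expected_sha256_hex)
    except ValueError:
        return "bad_hash"  # manifest hash is not valid hex
    if not hmac.compare_digest(digest, expected):
        return "bad_hash"

    # Layer 2: authenticity. The callback (hypothetical) checks the signature
    # over the digest using the public key stored on the device.
    if not verify_signature(digest, signature):
        return "bad_signature"

    return "ok"
```

Only an `"ok"` result should let the update proceed. Either failure is logged and the image is discarded.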
Interrupted download recovery:
Network connections drop. A device mid-download that loses connectivity should resume from where it left off, not restart from zero. Implement chunked download. Each chunk gets acknowledged. The position is persisted to non-volatile storage. On reconnection, the download resumes from the last acknowledged chunk.
A device that restarts the full download on every connection interruption will never complete a large update over a poor cellular connection. In practice, some devices in low-connectivity deployments never update at all because every attempt is interrupted before completion.
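The resume logic is small enough to model in a few lines. In this sketch, `fetch_chunk` is a hypothetical transport call (an HTTP Range request is one common way to implement it), `nvs` is a plain dict standing in for non-volatile storage, and `buffer` is a preallocated bytearray standing in for wherever the partial image is held. The 4096-byte chunk size is illustrative.

```python
CHUNK_SIZE = 4096  # illustrative; pick a size that suits your link and flash


def download_with_resume(fetch_chunk, total_size, nvs, buffer):
    """Fetch an image in chunks, saving progress after each one.

    fetch_chunk(offset, length) -> bytes may raise ConnectionError.
    Returns True once the whole image has been received, False if the
    connection dropped and the caller should retry later.
    """
    offset = nvs.get("ota_offset", 0)  # resume from the last saved position
    while offset < total_size:
        length = min(CHUNK_SIZE, total_size - offset)
        try:
            chunk = fetch_chunk(offset, length)
        except ConnectionError:
            return False  # progress up to `offset` is already persisted
        if len(chunk) != length:
            return False  # short read: treat it like a dropped connection
        buffer[offset:offset + length] = chunk
        offset += length
        nvs["ota_offset"] = offset  # persist only after the chunk is stored
    return True
```

The ordering matters: the offset is saved after the chunk is stored, so a power loss between the two steps costs one repeated chunk rather than a gap in the image.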
The update path gets the engineering attention. The rollback mechanism gets the production incidents. That is the wrong balance. Both deserve equal design effort because rollback is what separates a recoverable failure from an unrecoverable one.
Boot count and health check pattern:
On boot from a newly promoted firmware image, the bootloader sets a boot counter. The application firmware must confirm health within a defined window, typically 60 to 300 seconds, by writing a "confirmed" flag to non-volatile storage. If the application does not confirm within the window (because it crashed, hung, or failed to connect to the network), the bootloader treats the new image as invalid on the next boot and returns to the previous active partition.
This catches the most dangerous class of OTA failure: firmware that boots and appears to start correctly, but fails before reaching a stable operating state. A watchdog reset within the confirmation window triggers rollback automatically.
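The pattern is easier to reason about as a tiny state machine. The model below is a sketch under assumptions, not how any particular bootloader stores its state: the slot names `"A"` and `"B"`, the `BootState` fields, and the single-trial-boot limit are all hypothetical. The confirmation window itself is assumed to be enforced by a watchdog that resets the device if `app_confirm` has not run in time.

```python
from dataclasses import dataclass
from typing import Optional

MAX_TRIAL_BOOTS = 1  # illustrative: a new image gets one boot to prove itself


@dataclass
class BootState:
    confirmed_slot: str = "A"         # last image that confirmed health
    trial_slot: Optional[str] = None  # newly promoted, not yet confirmed
    trial_boots: int = 0              # boot counter for the trial image


def bootloader_select(state: BootState) -> str:
    """Runs at every reset and returns the slot to boot."""
    if state.trial_slot is None:
        return state.confirmed_slot
    if state.trial_boots >= MAX_TRIAL_BOOTS:
        # The trial image had its chance and never confirmed: roll back.
        state.trial_slot = None
        state.trial_boots = 0
        return state.confirmed_slot
    state.trial_boots += 1
    return state.trial_slot


def app_confirm(state: BootState, running_slot: str) -> None:
    """Called by the application once its health checks pass."""
    if running_slot == state.trial_slot:
        state.confirmed_slot = running_slot
        state.trial_slot = None
        state.trial_boots = 0
```

With `trial_slot` set to `"B"`, the first call to `bootloader_select` returns `"B"`. If a reset arrives before `app_confirm` runs, the next call returns `"A"` and clears the trial. No network is involved at any point.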
What the rollback window must cover:
Full boot sequence including peripheral initialisation
Network connection and initial cloud handshake
First successful sensor read or application cycle
Confirmation write to non-volatile storage
The window must be long enough to cover all of these under worst-case conditions: slow network, cold peripheral startup, high cloud latency. Too short triggers false rollbacks. Too long delays detection of broken firmware.
Rollback version protection:
A rollback mechanism that allows downgrading to any previous firmware version is a security risk. If a previous version had a known vulnerability, an attacker who can influence the OTA process can force a rollback to the vulnerable version. Implement a minimum version floor in the bootloader: a firmware image with a version number below the floor is rejected, even if it is signed correctly.
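A version floor check is a few lines, but the comparison has to be numeric. The sketch below assumes versions are plain `MAJOR.MINOR.PATCH` strings. The `accept_image` helper and its `signature_ok` parameter are hypothetical names for illustration.

```python
def parse_version(text: str) -> tuple:
    """Turn "1.10.0" into (1, 10, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def accept_image(candidate: str, floor: str, signature_ok: bool) -> bool:
    """Reject unsigned images and anything older than the version floor."""
    if not signature_ok:
        return False
    # A valid signature is not enough: old, vulnerable builds are signed too.
    return parse_version(candidate) >= parse_version(floor)
```

Comparing tuples of integers avoids the classic string-comparison trap where `"1.10.0"` sorts below `"1.9.3"`.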
On one industrial IoT project, a firmware update with a boot failure caused 340 devices to roll back to the previous firmware version automatically overnight. Zero truck rolls were required. The defect was fixed the following day and the update was re-pushed. Total user impact: a 24-hour delay on a non-critical firmware feature. Without rollback, the impact would have been 340 devices offline indefinitely.
Never push a firmware update to your full fleet simultaneously. Not once. Not ever. The practice of staged rollout is the difference between a defect affecting 20 devices and a defect affecting 2,000.
How staged rollout works:
The OTA server maintains device cohorts: groups defined by device ID, firmware version, geography, or customer segment. A new firmware release is pushed to the smallest cohort first, typically 0.5 to 2% of the fleet. The OTA system monitors that cohort for a defined observation window of 24 to 72 hours, tracking boot success rate, crash rate, session stability, and any application-level error signals.
If the cohort passes the observation window without anomalies, the release promotes to the next cohort: typically 5 to 10% of the fleet. The process repeats until the full fleet is updated.
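One simple way to define cohorts by device ID is to hash the ID into a stable bucket, so a device always lands in the same cohort without the backend storing an assignment. This is an assumption of the sketch, not the only option. The cohort names and the cut-offs (1% canary, 10% cumulative for early access) are illustrative values inside the ranges above.

```python
import hashlib

# Illustrative cut-offs in basis points (1/100 of a percent) of the fleet.
CANARY_BP = 100         # first 1% of buckets
EARLY_ACCESS_BP = 1000  # first 10% of buckets, cumulative with canary


def assign_cohort(device_id: str) -> str:
    """Map a device ID to a cohort deterministically."""
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 10_000
    if bucket < CANARY_BP:
        return "canary"
    if bucket < EARLY_ACCESS_BP:
        return "early_access"
    return "general"
```

Geography or customer segment can be layered on top, for example by excluding high-risk sites from the canary cohort.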
If any cohort shows an anomaly above a threshold (crash rate above 2%, boot success below 98%, or specific error codes above baseline), the release is halted. Affected devices roll back automatically. The bootloader handles it. The remaining fleet waits. The defect gets diagnosed and fixed first.
| Cohort | Fleet percentage | Observation window | Promotion condition |
|---|---|---|---|
| Canary | 0.5–2% | 24–48 hours | Boot success >98%, crash rate <2% |
| Early access | 5–10% | 24–72 hours | Same as canary |
| General availability | 100% | Monitoring continues | No rollback signals |
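The promotion rule in the table reduces to a small decision function on the backend. The sketch below uses the thresholds from the text. The `CohortStats` fields and the `decide` function are hypothetical, and a real system would also look at specific error codes against a baseline.

```python
from dataclasses import dataclass

MIN_BOOT_SUCCESS = 0.98  # from the thresholds above
MAX_CRASH_RATE = 0.02


@dataclass
class CohortStats:
    devices_updated: int  # devices in the cohort that received the release
    boots_confirmed: int  # devices that confirmed a healthy boot
    crashed: int          # devices that reported a crash in the window


def decide(stats: CohortStats, window_elapsed: bool) -> str:
    """Return "halt", "wait", or "promote" for one cohort."""
    if stats.devices_updated == 0:
        return "wait"  # nothing to judge yet
    boot_success = stats.boots_confirmed / stats.devices_updated
    crash_rate = stats.crashed / stats.devices_updated
    # An anomaly halts the release immediately, even mid-window.
    if boot_success < MIN_BOOT_SUCCESS or crash_rate > MAX_CRASH_RATE:
        return "halt"
    healthy = boot_success > MIN_BOOT_SUCCESS and crash_rate < MAX_CRASH_RATE
    return "promote" if healthy and window_elapsed else "wait"
```

A halt needs no waiting, but a promotion always waits for the full observation window. A cohort that looks healthy after one hour has not yet been through a 72-hour memory leak.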
Device-side requirements for staged rollout:
The device must report post-update health signals to the OTA server so the staged rollout system can make promotion decisions. At minimum: boot success confirmation, firmware version, device uptime, and any application-level error counters. Without telemetry, the staged rollout system is operating blind.
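A minimal health report carrying those signals might look like the following. The field names and the JSON encoding are assumptions for illustration. A constrained device might use CBOR or a binary format instead.

```python
import json


def build_health_report(device_id, firmware_version, uptime_s,
                        boot_confirmed, error_counters):
    """Serialise the minimum post-update health signals as compact JSON."""
    report = {
        "device_id": device_id,
        "firmware_version": firmware_version,
        "uptime_s": int(uptime_s),
        "boot_confirmed": bool(boot_confirmed),
        "errors": dict(error_counters),  # application-level error counters
    }
    return json.dumps(report, sort_keys=True, separators=(",", ":"))
```

For example, `build_health_report("dev-00042", "1.4.2", 3600, True, {"sensor_timeout": 0})` produces one compact JSON object that the backend can aggregate per cohort.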
Run through this before the first production firmware update campaign.
Architecture:
Dual-partition layout implemented with bootloader, active, and staging partitions
Bootloader partition write-protected and not updatable over OTA
Settings and calibration data in a separate partition not affected by firmware updates
Image verification:
SHA-256 hash check implemented before writing to staging partition
Cryptographic signature verification implemented on the device, using the public half of an offline signing key
Signing key pair generated offline and stored securely — private key never on the device or server
Chunked download with resume-from-interruption implemented
Rollback:
Boot count mechanism implemented in bootloader
Application health confirmation implemented within defined window
Minimum version floor configured in bootloader to prevent downgrade attacks
Rollback tested explicitly — not assumed to work from code review
Staged rollout:
Device cohort system implemented in OTA backend
Post-update health telemetry implemented on device side
Observation window and promotion thresholds defined before first release
Rollback halt mechanism tested on OTA backend
Security:
OTA download transport uses HTTPS with server certificate pinning
Firmware image signing key is different from the secure boot signing key
Update manifest includes expected version, hash, and signature, not just a URL
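The last checklist item can be enforced on the device before a download even starts. The sketch below validates a manifest that has already been parsed into a dict. The field names (`version`, `url`, `sha256`, `signature`) and the hex-string encoding are assumptions, not a standard manifest format.

```python
import string

# Hypothetical manifest schema: field name -> expected Python type.
REQUIRED_FIELDS = {"version": str, "url": str, "sha256": str, "signature": str}


def validate_manifest(manifest: dict) -> list:
    """Return a list of problems with the manifest (empty if acceptable)."""
    errors = []
    for field, kind in REQUIRED_FIELDS.items():
        if not isinstance(manifest.get(field), kind):
            errors.append(f"missing or invalid {field}")
    sha = manifest.get("sha256")
    if isinstance(sha, str):
        is_hex = all(c in string.hexdigits for c in sha)
        if len(sha) != 64 or not is_hex:
            errors.append("sha256 must be 64 hex characters")
    url = manifest.get("url")
    if isinstance(url, str) and not url.startswith("https://"):
        errors.append("url must use https")
    return errors
```

A manifest that is only a URL fails three of these checks, which is the point: the device should know what it expects to receive before it fetches a single byte.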
A dual-partition OTA architecture with cryptographic image verification and automatic rollback on boot failure is the safest approach. The new image writes to a staging partition while the active firmware keeps running. The device only switches to the new firmware after confirming it boots and operates correctly within a defined window. If anything goes wrong, the bootloader returns to the previous firmware automatically. Staged rollout to a small device cohort before the full fleet adds a second layer of safety because a defect affects a small percentage of devices rather than the entire fleet.
With a correctly designed OTA system, a failed mid-download is recoverable. The device should use a chunked download protocol where each chunk is acknowledged individually and the download position is saved to non-volatile storage. On reconnection, the download resumes from the last confirmed chunk. The staging partition is only considered complete after the full image is downloaded and the integrity hash passes. An incomplete or corrupted image never gets written to the staging partition.
Three mechanisms together prevent bricking. First, verify the firmware image hash and cryptographic signature before writing anything to flash. A corrupted or tampered image gets rejected before it can cause damage. Second, use a dual-partition architecture so the new image writes to a staging partition rather than replacing the active firmware directly. Third, implement a boot confirmation window so the bootloader rolls back to the previous firmware automatically if the new firmware fails to prove it is healthy within a defined time after boot.
MCUboot is an open-source bootloader designed specifically for embedded OTA update systems. It implements the dual-partition swap model, cryptographic image signing with RSA or ECDSA, rollback protection with version floors, and hardware security integration on supported platforms. It runs on Zephyr, FreeRTOS, and bare metal systems across ARM Cortex-M, RISC-V, and Xtensa processors. For most IoT products that do not need a custom bootloader for a specific reason, MCUboot is the right starting point because it handles the security and rollback logic that is easy to get wrong when implementing from scratch.
The OTA backend needs a device registry that tracks the current firmware version, the last update timestamp, and the update status of every device in the fleet. Each device reports its firmware version and health status at regular intervals, so the backend knows which devices are running which version at any given time. Staged rollout systems use this registry to define cohorts, track cohort health, and make promotion decisions. Devices that have been offline or unreachable for an extended period need a catch-up update path when they reconnect, which should deliver only the latest stable firmware rather than every intermediate version.
Before. The OTA partition layout, bootloader selection, flash memory map, and non-volatile storage allocation for the settings partition must all be decided before firmware development starts. Changing them later requires re-flashing every device in the field. The signing key generation and build pipeline signing step must be in place before the first production firmware is compiled. OTA is not a feature that can be cleanly added to a firmware architecture designed without it; it is a constraint that shapes the architecture from the beginning.