
Build for the Field, Not the Lab: Shipping OTA Updates to NB-IoT Devices

Carlos Prados

Fleet Health Is the Product

The industry sells IoT as if data were the point. It isn’t. Data is the receipt. Fleet health is the product.

A thousand sensors in the desert, a gas meter in every flat of a rural town, a cold-chain tracker strapped to a pallet crossing a border — none of that is worth anything if the fleet degrades between my laptop and the real world. Every device that silently goes dark, every firmware bug that bakes in without a way out, every slot of flash with a bricked image — each of those is a direct cut into whatever value the product was supposed to deliver.

This is why, in any serious IoT project, the single most important capability you ship is the ability to ship again. Remote firmware updates are not a feature — they are the lifeline that decides whether the fleet stays in service for ten years, or whether you end up flying a technician to a mountain hut to reflash a 4 MB SoC.

So when I sat down to build a new OTA system for constrained fleets, the first design principle was simple: build for the field, not for the lab.


The Reality of NB-IoT

Most OTA stacks you find in the wild were designed for well-fed Linux servers. Drag them onto a battery-powered NB-IoT radio waking once per hour and the assumptions collapse. Constrained radios are their own planet:

| Reality of NB-IoT | Design consequence |
| --- | --- |
| Downlink ~20 kbps, capped at tens of KB per day | bsdiff + zstd deltas; verify signatures before downloading |
| MTU 1 280 bytes, packet loss common | CoAP Block2 transfer at 512-byte blocks, HTTP Range resume |
| Device wakes for seconds a day | Asynchronous delta generation on the server, short retry windows |
| Battery budget measured in years | Heartbeat-only agent, idle between cycles, no open TCP sockets |
| Operator may be 1 000 km away, truck roll takes days | A/B slots with atomic rollback, watchdog with boot-count safety net |

Every design decision in the project traces back to one of those rows. If it doesn’t, it shouldn’t be in the code.
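The small-MTU, lossy-link row implies resumable transfers. Over HTTP, resuming a partial delta is a one-header affair; the sketch below is illustrative (function names are mine, not the repo's) and assumes the server honours byte ranges:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

// rangeHeader builds the Range header value for resuming at "offset".
func rangeHeader(offset int64) string {
	return fmt.Sprintf("bytes=%d-", offset)
}

// resumeDownload continues a partial delta download from however many
// bytes are already on disk. Sketch only; the real agent wires this
// into its transport layer and retry windows.
func resumeDownload(url, partPath string) error {
	offset := int64(0)
	if info, err := os.Stat(partPath); err == nil {
		offset = info.Size() // bytes already downloaded
	}

	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return err
	}
	if offset > 0 {
		req.Header.Set("Range", rangeHeader(offset))
	}

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	f, err := os.OpenFile(partPath, os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		return err
	}
	defer f.Close()

	// 206 means the server honoured the Range; 200 means start over.
	if resp.StatusCode == http.StatusPartialContent {
		if _, err := f.Seek(offset, io.SeekStart); err != nil {
			return err
		}
	} else if err := f.Truncate(0); err != nil {
		return err
	}
	_, err = io.Copy(f, resp.Body)
	return err
}
```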


When Keystone Doesn’t Fit

A couple of months ago I wrote about Keystone, a Go-based edge orchestrator I built to replace AWS Greengrass. Keystone runs multiple components, manages processes and containers, and idles at around 23 MB of RAM on a gateway-class device.

That works on a Raspberry Pi. It does not work on an NB-IoT module whose total RAM is 64 MB and whose job is to run one binary that reads a sensor, talks to a backend, and goes back to sleep.

Different scope, not a replacement. When the device hosts a complete edge runtime, use Keystone. When the device is a single binary and all you need is to swap that binary safely, you want an agent footprint closer to 5 MB than 25 MB, no container runtime, no component graph, no orchestration overhead. For those devices, a tight OTA-only agent is not just sufficient — it is obligatory. Anything heavier and you are eating into the exact RAM and flash that the device needs to do its actual job.

That is where ota-updater comes in.


Build for the Field, Not for the Lab

Most of the hard decisions in this project are not about what the system can do when the network is up and the battery is fresh. They are about what happens when everything that can go wrong, goes wrong, and the device is still expected to serve.

A non-exhaustive list of things the agent must survive without human intervention:

  • A delta download that cuts off at 83 %. The network comes back up. Resume from the 83 % mark, not from byte zero.
  • A new binary that panics on startup before its first heartbeat. Boot-count exceeds the budget. Permanent rollback to the last known-good slot. Failure reported upstream.
  • A corrupt delta the server shouldn’t have served. Signature verification happens before the download, so not a single downlink byte is spent on it.
  • A power cut halfway through writing the inactive slot. Atomic rename plus fsync(dir) guarantees neither the old binary nor the half-written new one is corrupted on boot.
  • A configuration mistake that points a thousand devices at a broken version. Canary the rollout, watch updater_heartbeats_total{result="fail"}, roll back the target before the whole fleet trips.

In NB-IoT, every byte is expensive and every reboot is a liability. Design follows.


Architecture at a Glance

Two binaries, two transports, one signed payload:

┌──────────────────────────┐              ┌──────────────────────────┐
│       update-server      │              │        edge-agent        │
│   (cloud or on-prem)     │              │      (on the device)     │
│                          │              │                          │
│  • signed manifests      │   HTTP/CoAP  │  • heartbeat cycle       │
│  • bsdiff+zstd deltas    │ ───────────▶ │  • signature verify      │
│  • LRU hot cache         │   JSON/CBOR  │  • delta download        │
│  • fsnotify target       │              │  • A/B slot swap         │
│  • Prometheus /metrics   │              │  • watchdog + rollback   │
└──────────────────────────┘              └──────────────────────────┘

The server is a stateless HTTP + CoAP endpoint in front of a directory of binaries. It computes deltas on demand, caches them in bounded memory, and signs every manifest with Ed25519. The agent is a single Go binary — or, as we’ll see, a Go library — that heartbeats, verifies, downloads, patches, and self-replaces. No message broker, no job queue, no database; nothing that can be down when a device wakes up at 3 AM looking for orders.


Protect the Downlink, Not Just the Binary

The usual OTA signature scheme looks like this: sign the hash of the final target binary, ship a delta to the device, let the device apply the delta, then verify the result matches the signed hash. It works, but it pays the cost of the delta download before finding out the delta was tampered with.

On a radio where the monthly data budget is measured in hundreds of kilobytes, that is indefensible.

The scheme I ended up with (documented exhaustively in docs/signing.md) signs the pair targetHash || deltaHash:

// ManifestSigningPayload builds the exact bytes that go under Ed25519.
// Any change in either hash breaks the signature.
func ManifestSigningPayload(targetHash, deltaHash []byte) []byte {
    buf := make([]byte, 0, len(targetHash)+len(deltaHash))
    buf = append(buf, targetHash...)
    buf = append(buf, deltaHash...)
    return buf
}

Cost: one signature per (from, to) pair — and with Ed25519, signing is cheap.

Benefit: the agent verifies the signature against the exact delta bytes it is about to download, not against the result it will have after spending battery and downlink. A tampered delta is rejected with zero bytes transferred. A corrupt server response is rejected with zero bytes transferred. Only after the patch succeeds and the reconstructed binary matches targetHash does the device commit the swap.

It is a small design choice with an outsized impact on cost-per-update in a fleet of thousands.


A/B Slots and In-Place Self-Update

A/B slot systems are the boring, correct answer to “what if the new binary is broken”. This project uses two slots and an atomic symlink:

/var/lib/ota-agent/slots/
  ├── A/edge-app          # previous version
  ├── B/edge-app          # new version, being written
  └── current -> A        # atomic symlink; swap is a rename

Write the new binary into the inactive slot, fsync, flip the symlink via atomic rename, then transfer the running process to the new binary. The last step is the one most implementations get clumsy about: they fork a new process, lose the PID, and let the supervisor pick up the pieces.

I went the other way. The agent’s default RestartStrategy is syscall.Exec, which invokes execve(2) on the new binary from inside the current process. The kernel replaces the process image in place — same PID, same PPID, same cgroup, same open file descriptors, same terminal — but running different code. To systemd, the service never restarted. To Docker, PID 1 never changed. To an interactive shell running the agent by hand, the process kept writing logs to the same terminal from the same PID.

Wrap that in a watchdog with N=3 heartbeat retries within a configurable window, a persistent boot counter that triggers permanent rollback after two failed boots (MaxBoots=2), and all on-disk writes guarded by fsync(file) + rename + fsync(dir) — and you have an update process that survives power cuts, bad builds, and transient NB-IoT weather. The details are in the README; the point is that none of it is incidental.


Embeddable by Design

The most unusual decision in this project is that the agent is not primarily a binary. It is a Go library.

A real fleet rarely runs “just the updater” on a device. It runs your real workload — a telemetry client, a gateway, a payment app — and the updater is a passenger. Shipping two separate binaries means two processes to supervise, two sets of logs, two health models, and twice the complexity. So the agent lives in pkg/agent, with public types, an injectable logger, pluggable hooks (HealthChecker, RestartStrategy, HWInfoFunc) and no globals.

Embedding it into your own binary is a handful of lines:

updater := agent.NewUpdater(agent.UpdaterConfig{
    ServerURL:   "https://updates.example.com",
    DeviceID:    "sensor-0042",
    PublicKey:   pubKey,
    SlotManager: agent.NewSlotManager("/var/lib/myapp/slots"),
    BootCounter: agent.NewBootCounter("/var/lib/myapp/slots/.boot_count"),
    Watchdog:    watchdog,
    Primary:     httpClient,
    Logger:      slog.Default(),
})

go updater.Run(ctx)

Your application keeps doing its job. The updater heartbeats in the background, downloads and verifies deltas, and — when the time comes — syscall.Execs your own binary into its next version. Same PID, same sockets, same everything except the code.
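As a hedged example of what a health hook might look like — the actual HealthChecker signature lives in pkg/agent and may well differ — the idea is to fold application-level state into the updater's rollback decision:

```go
package main

import "errors"

// HealthChecker mirrors the kind of hook the agent accepts: after an
// update, the watchdog calls it and triggers rollback if it keeps
// failing within the retry window. The type here is an assumption for
// illustration, not the library's declared interface.
type HealthChecker func() error

// appHealth turns an application-level probe (is the backend
// reachable?) into a checker the updater's watchdog can consult.
func appHealth(backendOK func() bool) HealthChecker {
	return func() error {
		if !backendOK() {
			return errors.New("backend unreachable")
		}
		return nil
	}
}
```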


Conclusion: Fleet Health Is the Product

The first time you watch a device in a different country update itself over a 20 kbps radio, reboot into a new version, and check itself healthy — all without human intervention — is when you understand what this code is actually for. Not a feature. Not a convenience. The difference between a fleet that keeps delivering value for a decade and one that slowly rots into support tickets.

Build for the field, not the lab. Fleet health is the product.

The code lives at carlosprados/ota-updater. Ed25519-signed delta patches, HTTP + CoAP transports, A/B slots, watchdog, rollback, atomic writes, Prometheus metrics, pprof, and a full step-by-step demo (with a companion Bruno API collection) for anyone who wants to feel the thing swap under their own hands before trusting it with their own fleet. Pull requests welcome — field reports even more so.