
9 posts tagged with "telemetry"


Binary vs JSON Payloads for Industrial MQTT Telemetry: Bandwidth, Encoding Strategies, and When Each Wins [2026]

· 14 min read

Every IIoT platform faces the same fundamental design decision for machine telemetry: do you encode data as human-readable JSON, or pack it into a compact binary format?

The answer affects bandwidth consumption, edge buffer capacity, parsing performance, debugging experience, and how well your system degrades under constrained connectivity. Despite what vendor marketing suggests, neither format universally wins. The engineering tradeoffs are real, and the right choice depends on your deployment constraints.

This article breaks down both approaches with the depth that plant engineers and IIoT architects need to make an informed decision.

MQTT QoS Levels for Industrial Telemetry: Choosing the Right Delivery Guarantee [2026]

· 11 min read

When an edge gateway publishes a temperature reading from a plastics extruder running at 230°C, does it matter if that message arrives exactly once, at least once, or possibly not at all? The answer depends on what you're doing with the data — and getting it wrong can mean either lost production insights or a network drowning in redundant traffic.

MQTT's Quality of Service (QoS) levels are one of the most misunderstood aspects of industrial IoT deployments. Most engineers default to QoS 1 for everything, which is rarely optimal. This guide breaks down each level with real industrial scenarios, bandwidth math, and patterns that actually work on factory floors where cellular links drop and PLCs generate thousands of data points per second.

Paged Ring Buffers for Industrial MQTT: How to Never Lose a Data Point [2026]

· 10 min read

Here's the scenario every IIoT engineer dreads: your edge gateway is collecting temperature, pressure, and vibration data from 200 tags across 15 PLCs. The cellular modem on the factory roof drops its connection — maybe for 30 seconds during a handover, maybe for 4 hours because a backhoe hit a fiber line. When connectivity returns, what happens to the data?

If your answer is "it's gone," you have a buffer management problem. And fixing it properly requires understanding paged ring buffers — the unsung hero of reliable industrial telemetry.

Why Naive Buffering Fails

The simplest approach — queue MQTT messages in memory and retry on reconnect — has three fatal flaws:

  1. Memory exhaustion: A gateway reading 200 tags at 1-second intervals generates ~12,000 readings per minute. At ~100 bytes per JSON reading, that's 1.2 MB/minute. A 4-hour outage accumulates ~288 MB. Your 256 MB embedded gateway just died.

  2. No delivery confirmation: MQTT QoS 1 guarantees "at least once" delivery, but the Mosquitto client library's in-flight message queue is finite. If you publish 50,000 messages into a disconnected client, most will be silently dropped by the client library's internal buffer long before the broker sees them.

  3. Thundering herd on reconnect: When connectivity returns, dumping 288 MB of queued messages simultaneously will choke the cellular uplink (typically 1–5 Mbps), cause broker-side backpressure, and likely trigger another disconnect.

The Paged Ring Buffer Architecture

The solution is a fixed-size, page-based circular buffer that sits between the data collection layer and the MQTT client. Here's how it works:

Memory Layout

The buffer is allocated as a single contiguous block — typically 2 MB on an embedded gateway. This block is divided into equal-sized pages, where each page can hold one complete MQTT payload.

┌─────────────────────────────────────────────────┐
│               2 MB Buffer Memory                │
├────────┬────────┬────────┬────────┬────────┬────┤
│ Page 0 │ Page 1 │ Page 2 │ Page 3 │ Page 4 │ ...│
│  4 KB  │  4 KB  │  4 KB  │  4 KB  │  4 KB  │    │
└────────┴────────┴────────┴────────┴────────┴────┘

With a 4 KB page size and 2 MB total buffer, you get approximately 500 pages. Each page holds multiple MQTT messages packed sequentially.
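The memory layout above can be captured in a couple of C structures. This is a minimal sketch with illustrative names (`page_t`, `ring_buffer_t`, the `*_p` pointers), not any vendor's actual implementation:

```c
#include <stdint.h>
#include <stddef.h>

#define PAGE_SIZE   4096u                      /* 4 KB per page          */
#define BUFFER_SIZE (2u * 1024u * 1024u)       /* 2 MB total             */
#define NUM_PAGES   (BUFFER_SIZE / PAGE_SIZE)  /* 512 pages              */

/* Each page tracks its own read/write positions and links into
 * either the free list or the used (FIFO) queue. */
typedef struct page {
    uint8_t     *start_p;   /* first byte of this page's slice    */
    uint8_t     *write_p;   /* next byte to write                 */
    uint8_t     *read_p;    /* next byte to deliver               */
    struct page *next;      /* free-list / used-queue link        */
} page_t;

typedef struct {
    uint8_t  mem[BUFFER_SIZE];  /* single contiguous block            */
    page_t   pages[NUM_PAGES];  /* one descriptor per page            */
    page_t  *free_list;         /* singly-linked free pages           */
    page_t  *used_head;         /* FIFO of full pages (send from here)*/
    page_t  *used_tail;
    page_t  *work;              /* the one page currently filling     */
} ring_buffer_t;
```

Because the pool is one `mem` block carved into fixed slices, allocation never touches `malloc` after startup — page churn is just pointer swaps between the three lists.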

Page States

Every page exists in exactly one of three states:

  • Free: Available for new data. Part of a singly-linked free list.
  • Work: Currently being filled with incoming data. Only one work page exists at a time.
  • Used: Full of data, waiting to be transmitted. Part of a singly-linked FIFO queue.
Free Pages → [P5] → [P6] → [P7] → null
Work Page  → [P3]   (currently filling)
Used Pages → [P0] → [P1] → [P2] → null
               ↑ sending       waiting →

The Write Path

When a batch of PLC tag values arrives from the data collection layer:

  1. Check the work page: If there's no current work page, pop one from the free list. If the free list is empty, steal the oldest used page (overflow — we're losing old data to make room for new data, which is the correct trade-off for operational monitoring).

  2. Calculate fit: Each message is packed as: [4-byte message ID] [4-byte message size] [message payload]. Check if the current work page has enough remaining space for this overhead plus the payload.

  3. If it fits: Write the message ID (initially zero — will be filled by the MQTT client), the size, and the payload. Advance the write pointer.

  4. If it doesn't fit: Move the current work page to the tail of the used queue. Pop a new page from the free list (or steal from used queue). Write into the new page.

Page Internal Layout:
┌───────────┬───────────┬───────────┬───────────┬───────────┬───────────┐
│ msg_id_1  │ msg_sz_1  │ payload_1 │ msg_id_2  │ msg_sz_2  │ payload_2 │
│ (4 bytes) │ (4 bytes) │ (N bytes) │ (4 bytes) │ (4 bytes) │ (M bytes) │
└───────────┴───────────┴───────────┴───────────┴───────────┴───────────┘
                                                      ↑ write_p (current position)
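The fit check and packing in the steps above reduce to a few lines of C. A sketch under the layout shown, using an offset-based `page_t` with hypothetical names (`page_write`, `write_off`):

```c
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE  4096u
#define MSG_HEADER 8u   /* 4-byte message ID + 4-byte message size */

typedef struct {
    uint8_t  data[PAGE_SIZE];
    uint32_t write_off;   /* next write offset within the page */
} page_t;

/* Returns 1 if the message was packed into this page, 0 if it does
 * not fit (caller moves the page to the used queue and retries on a
 * fresh work page). */
static int page_write(page_t *p, const uint8_t *payload, uint32_t size)
{
    uint32_t id0 = 0;  /* placeholder; filled in at publish time */

    if (p->write_off + MSG_HEADER + size > PAGE_SIZE)
        return 0;                                   /* rotate page */

    memcpy(p->data + p->write_off,     &id0,  4);   /* msg_id = 0  */
    memcpy(p->data + p->write_off + 4, &size, 4);   /* msg_size    */
    memcpy(p->data + p->write_off + 8, payload, size);
    p->write_off += MSG_HEADER + size;
    return 1;
}
```

The zeroed message ID is the hook for the acknowledgment path: it is overwritten with the packet ID returned by the MQTT client at publish time.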

The Send Path

The MQTT send logic runs after every write operation and follows strict rules:

  1. Check prerequisites: Connection must be up (connected == 1) AND no packet currently in-flight (packet_sent == 0). If either fails, do nothing — the data is safely buffered.

  2. Select the send source: If there are used pages, send from the first one in the FIFO. If no used pages exist but the work page has data, promote the work page to used and send from it.

  3. Read the next message from the current page's read pointer: extract the size, get the data pointer, and call mosquitto_publish() with QoS 1.

  4. Mark packet as in-flight: Set packet_sent = 1. This is critical — only one message can be in-flight at a time. This prevents the thundering herd problem and ensures ordered delivery.

  5. Wait for acknowledgment: The MQTT client library calls the publish callback when the broker confirms receipt (PUBACK for QoS 1). Only then do we advance the read pointer and send the next message.
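The prerequisite gate in steps 1 and 4 is the heart of the send path. A sketch with a function-pointer stand-in for the QoS 1 `mosquitto_publish()` call (`publish_fn`, `send_state_t`, and `stub_publish` are illustrative, not real library API):

```c
#include <stdint.h>

/* Stand-in for a QoS 1 publish; reports the packet ID it assigned
 * through *mid so the buffer can match the later PUBACK. */
typedef int (*publish_fn)(const uint8_t *data, uint32_t len, int *mid);

typedef struct {
    int connected;     /* 1 when the MQTT session is up             */
    int packet_sent;   /* 1 while exactly one message is in flight  */
    int inflight_mid;  /* packet ID we are waiting to see ACKed     */
} send_state_t;

/* Returns 1 if a message was handed to the client, 0 otherwise. */
static int try_send_next(send_state_t *s, const uint8_t *msg,
                         uint32_t len, publish_fn publish)
{
    if (!s->connected || s->packet_sent)
        return 0;                              /* gate closed: buffer holds */
    if (publish(msg, len, &s->inflight_mid) != 0)
        return 0;                              /* publish failed; stay idle */
    s->packet_sent = 1;                        /* mark in-flight            */
    return 1;
}

/* Example stub standing in for the MQTT client: always "succeeds"
 * and assigns packet ID 7. */
static int stub_publish(const uint8_t *d, uint32_t len, int *mid)
{
    (void)d; (void)len;
    *mid = 7;
    return 0;
}
```

Calling `try_send_next()` a second time while `packet_sent` is set does nothing — that single flag is what serializes delivery to one message in flight.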

The Acknowledgment Path

When the Mosquitto library fires the on_publish callback with a packet ID:

  1. Verify the ID matches the in-flight message on the current used page
  2. Advance the read pointer past the delivered message (skip message ID + size + payload bytes)
  3. Check if page is fully delivered: If read_p >= write_p, move the page back to the free list
  4. Clear the in-flight flag: Set packet_sent = 0
  5. Immediately attempt to send the next message — this creates a natural flow control where messages are delivered as fast as the broker can acknowledge them
Delivery Flow:

                  publish()
  [Used Page] ───────────────────→ [MQTT Broker]
       ↑                                 │
       │              PUBACK             │
       └─────────────────────────────────┘
         advance read_p, try next
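The acknowledgment steps can be sketched against an offset-based page state. The names (`page_state_t`, `on_puback`, `head_msg_len`) are illustrative; the 8-byte step matches the [4-byte id][4-byte size] header described earlier:

```c
#include <stdint.h>

typedef struct {
    uint32_t read_off;      /* next undelivered byte                */
    uint32_t write_off;     /* end of valid data in the page        */
    uint32_t head_msg_len;  /* payload size of the in-flight message*/
    int      inflight_mid;  /* packet ID we expect in the PUBACK    */
    int      packet_sent;
} page_state_t;

/* Returns 1 if the page is now fully delivered (caller recycles it
 * to the free list), 0 if more messages remain, -1 if the PUBACK's
 * packet ID did not match the in-flight message. */
static int on_puback(page_state_t *p, int mid)
{
    if (mid != p->inflight_mid)
        return -1;                           /* step 1: verify ID          */
    p->read_off += 8 + p->head_msg_len;      /* step 2: skip id+size+data  */
    p->packet_sent = 0;                      /* step 4: clear in-flight    */
    return p->read_off >= p->write_off;      /* step 3: page fully drained? */
}
```

Step 5 — immediately attempting the next send — happens in the caller right after this returns, which is what chains acknowledgment-paced delivery.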

Thread Safety: The Mutex Dance

In a real gateway, data collection and MQTT delivery run on different threads. The PLC polling loop writes data every second, while the Mosquitto client library fires callbacks from its own network thread. Every buffer operation — add, send, acknowledge, connect, disconnect — must be wrapped in a mutex:

// Data collection thread:
mutex_lock(buffer)
add_data(payload)
try_send_next() // opportunistic send
mutex_unlock(buffer)

// MQTT callback thread:
mutex_lock(buffer)
mark_delivered(packet_id)
try_send_next() // chain next send
mutex_unlock(buffer)

The key insight is that try_send_next() is called from both threads — after every write (in case we're connected and idle) and after every acknowledgment (to chain the next message). This ensures maximum throughput without busy-waiting.

Handling Disconnects Gracefully

When the MQTT connection drops, two things happen:

  1. The disconnect callback fires: Set connected = 0 and packet_sent = 0. The in-flight message is NOT lost — it's still in the page at the current read pointer. When connectivity returns, it will be re-sent.

  2. Data keeps flowing in: The PLC polling loop doesn't stop. New data continues to fill pages. The used queue grows. If it fills all available pages, new pages will steal from the oldest used pages — but this only happens under extreme sustained outages.

When the connection re-establishes:

  1. The connect callback fires: Set connected = 1 and trigger try_send_next()
  2. Buffered data starts flowing: Messages are delivered in FIFO order, one at a time, with acknowledgment pacing

This means the broker receives data in chronological order, with timestamps embedded in each batch. Analytics systems downstream can seamlessly handle the gap — they see a burst of historical data followed by real-time data, all correctly timestamped.

The Cloud Watchdog: Detecting Silent Failures

There's a subtle failure mode: the MQTT connection appears healthy (no disconnect callback), but data isn't actually being delivered. This can happen with certain TLS middlebox issues, stale TCP connections that haven't timed out, or Azure IoT Hub token expirations.

The solution is a delivery watchdog:

  1. Track the timestamp of the last successful packet delivery
  2. On a periodic check (every 120 seconds), compare the current time against the last delivery timestamp
  3. If no data has been delivered in 120 seconds AND the connection claims to be up, force a reconnection:
    • Reset the MQTT configuration timestamp (triggers config reload)
    • Clear the watchdog timer
    • The main loop will detect the stale configuration and restart the MQTT client
if (now - last_delivery_time > 120s) AND (connected) {
    log("No data delivered in 120s — forcing MQTT reconnect")
    force_mqtt_restart()
}

This catches the "zombie connection" problem that plagues many IIoT deployments — the gateway thinks it's sending, but nothing is actually arriving at the cloud.
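The watchdog predicate itself is trivial to express in C. A sketch, assuming `last_delivery` is refreshed from the PUBACK callback and that some hook like the article's `force_mqtt_restart()` exists in the main loop:

```c
#include <time.h>

#define WATCHDOG_SECS 120  /* matches the 120-second check above */

/* Returns nonzero when the connection claims to be up but nothing
 * has been delivered within the watchdog window. */
static int watchdog_should_restart(time_t now, time_t last_delivery,
                                   int connected)
{
    return connected && (now - last_delivery) > WATCHDOG_SECS;
}
```

Note the `connected` guard: if the disconnect callback already fired, the normal reconnect path is in charge and the watchdog stays out of the way.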

Binary vs. JSON: The Bandwidth Trade-off

The paged buffer doesn't care about the payload format — it stores raw bytes. But the choice between JSON and binary encoding has massive implications for buffer utilization:

JSON payload for one tag reading:

{"id":42,"values":[23.7],"ts":1709337600}

~45 bytes per reading.

Binary payload for the same reading:

Tag ID:     2 bytes (uint16)
Status:     1 byte
Value Cnt:  1 byte
Value Sz:   1 byte
Value:      4 bytes (float32)
─────────────────────────────
Total:      9 bytes per reading

That's a 5x reduction. With batching (multiple readings per batch header), the per-reading overhead drops further because the timestamp and device identity are shared across a group of values.
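An encoder for this 9-byte reading is a handful of byte shifts. A sketch assuming big-endian (MSB-first) ordering, which matches the batch wire format this article describes; `encode_reading` is an illustrative name:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Packs one tag reading into the 9-byte layout above:
 * [tag_id:2][status:1][count:1][size:1][float32 value:4], MSB first. */
static size_t encode_reading(uint8_t *out, uint16_t tag_id,
                             uint8_t status, float value)
{
    uint32_t bits;
    memcpy(&bits, &value, 4);           /* raw IEEE-754 bits of the float */

    out[0] = (uint8_t)(tag_id >> 8);
    out[1] = (uint8_t)tag_id;
    out[2] = status;
    out[3] = 1;                         /* value count                    */
    out[4] = 4;                         /* value size: float32            */
    out[5] = (uint8_t)(bits >> 24);
    out[6] = (uint8_t)(bits >> 16);
    out[7] = (uint8_t)(bits >> 8);
    out[8] = (uint8_t)bits;
    return 9;
}
```

The `memcpy` through a `uint32_t` avoids the undefined behavior of type-punning the float with a pointer cast.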

On a cellular connection billing per megabyte, this isn't academic — it's the difference between $15/month and $75/month per gateway. On satellite connections (Iridium, Starlink maritime), it can be $50 vs. $250.

Binary Batch Wire Format

A binary batch on the wire follows this structure:

[0xF7]               — 1 byte, magic/version marker
[num_groups]         — 4 bytes, big-endian uint32
For each group:
  [timestamp]        — 4 bytes, big-endian time_t
  [device_type]      — 2 bytes, big-endian uint16
  [serial_number]    — 4 bytes, big-endian uint32
  [num_values]       — 4 bytes, big-endian uint32
  For each value:
    [tag_id]         — 2 bytes, big-endian uint16
    [status]         — 1 byte (0 = OK, else error code)
    If status == 0:
      [values_count] — 1 byte
      [value_size]   — 1 byte (1, 2, or 4)
      [values...]    — values_count × value_size bytes
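The per-group header is equally mechanical to emit. A sketch with the field order shown above (a 4-byte truncation of `time_t` assumed, as in the format; helper names are illustrative):

```c
#include <stdint.h>
#include <stddef.h>

static size_t put_be16(uint8_t *out, uint16_t v)
{
    out[0] = (uint8_t)(v >> 8); out[1] = (uint8_t)v;
    return 2;
}

static size_t put_be32(uint8_t *out, uint32_t v)
{
    out[0] = (uint8_t)(v >> 24); out[1] = (uint8_t)(v >> 16);
    out[2] = (uint8_t)(v >> 8);  out[3] = (uint8_t)v;
    return 4;
}

/* Writes [timestamp][device_type][serial_number][num_values];
 * returns the header length (14 bytes). */
static size_t encode_group_header(uint8_t *out, uint32_t ts,
                                  uint16_t dev_type, uint32_t serial,
                                  uint32_t num_values)
{
    size_t n = 0;
    n += put_be32(out + n, ts);
    n += put_be16(out + n, dev_type);
    n += put_be32(out + n, serial);
    n += put_be32(out + n, num_values);
    return n;
}
```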

A batch of 50 tag readings fits in ~600 bytes binary versus ~3,000 bytes JSON. Over a 4-hour outage with 200 tags at 60-second intervals (48,000 readings), that's the difference between buffering roughly 0.6 MB (binary) and 2.9 MB (JSON) — comfortably inside a 2 MB gateway buffer versus overflowing it.

Sizing Your Buffer: The Math

For a given deployment, calculate your buffer needs:

Tags: 200
Read interval: 60 seconds
Binary payload per reading: ~9 bytes
Readings per minute: 200
Bytes per minute: 200 × 9 = 1,800 bytes
With batch overhead (~15 bytes per group): ~1,815 bytes/min

Buffer size: 2 MB = 2,097,152 bytes
Retention: 2,097,152 / 1,815 = ~1,155 minutes = ~19.2 hours

So a 2 MB buffer can hold approximately 19 hours of data for 200 tags at 60-second intervals using binary encoding. With JSON, that drops to ~3.8 hours. Size your buffer accordingly.

What machineCDN Does Differently

machineCDN's edge gateway implements this paged ring buffer architecture natively. Every gateway shipped includes:

  • Fixed 2 MB paged buffer with configurable page sizes matching the MQTT broker's maximum packet size
  • Automatic binary encoding for all telemetry — 5x bandwidth reduction over JSON
  • Single-message flow control with QoS 1 acknowledgment tracking — no thundering herd on reconnect
  • 120-second delivery watchdog that detects zombie connections and forces reconnect
  • Graceful overflow handling — when buffer fills, oldest data is recycled (not newest), preserving the most recent operational state

For plant engineers, this means deploying a gateway on a cellular connection and knowing that a connectivity outage — whether 30 seconds or 12 hours — won't result in lost data. The buffer holds, the watchdog monitors, and data flows in order when the link comes back.

Key Takeaways

  1. Never use unbounded queues for industrial telemetry buffering — use fixed-size paged buffers that degrade gracefully under memory pressure
  2. One message in-flight at a time prevents the thundering herd problem and ensures ordered delivery
  3. Always track delivery acknowledgments — don't just publish and forget; verify the broker received each packet before advancing
  4. Implement a delivery watchdog — silent MQTT failures are harder to detect than disconnects
  5. Use binary encoding — 5x bandwidth reduction means 5x longer buffer retention on the same memory
  6. Size for your worst outage — calculate how much buffer you need based on tag count, interval, and the longest connectivity gap you expect
  7. Thread safety is non-negotiable — data collection and MQTT delivery run concurrently; every buffer operation needs mutex protection

The paged ring buffer isn't exotic computer science — it's a practical engineering pattern that's been battle-tested in thousands of industrial deployments. The difference between a prototype IIoT system and a production one often comes down to exactly this kind of infrastructure.

Batched vs. Immediate Telemetry Delivery: When to Use Each in Industrial Monitoring [2026]

· 11 min read

Every industrial IoT edge gateway faces a fundamental architectural decision for every data point it collects: ship it now, or hold it and ship a batch later?

Get this wrong and you either drown your MQTT broker in tiny messages or you miss a critical alarm because it was sitting in a buffer when the compressor caught fire. This guide covers the engineering behind both approaches, the real-world trade-offs, and a framework for deciding which to use where.

Sparkplug B Specification Deep Dive: Birth Certificates, Death Certificates, and Why Your IIoT MQTT Deployment Needs It [2026]

· 14 min read
MachineCDN Team
Industrial IoT Experts

MQTT is the de facto transport layer for industrial IoT. Every edge gateway, every cloud platform, and every IIoT architecture diagram draws that same line: device → MQTT broker → cloud. But here's the uncomfortable truth that anyone who's deployed MQTT in a real factory knows: raw MQTT tells you nothing about the data inside those payloads.

MQTT is a transport protocol. It delivers bytes. It doesn't define what a "temperature reading" looks like, how to discover which devices are online, or what happens when a device reboots at 3 AM. That's where Sparkplug B comes in — and understanding it deeply is the difference between a demo and a production deployment.

Store-and-Forward Buffer Design for Reliable Industrial MQTT Telemetry [2026]

· 12 min read

Your edge gateway just collected 200 data points from six machines. The MQTT connection to the cloud dropped 47 seconds ago. What happens to that data?

In consumer IoT, the answer is usually "it gets dropped." In industrial IoT, that answer gets you fired. A single missed alarm delivery can mean a $50,000 chiller compressor failure. A gap in temperature logging can invalidate an entire production batch for FDA compliance.

The solution is a store-and-forward buffer — a memory structure that sits between your data collection layer and your MQTT transport, holding telemetry data during disconnections and draining it the moment connectivity returns. It sounds simple. The engineering details are anything but.

This article walks through the design of a production-grade store-and-forward buffer for resource-constrained edge gateways running on embedded Linux.

Store-and-forward buffer architecture for MQTT telemetry

Why MQTT QoS Isn't Enough

The first objection is always: "MQTT already has QoS 1 and QoS 2 — doesn't the broker handle retransmission?"

Technically yes, but only for messages that have already been handed to the MQTT client library. The problem is what happens before the publish call:

  1. The TCP connection is down. mosquitto_publish() returns MOSQ_ERR_NO_CONN. Your data is gone unless you stored it somewhere.
  2. The MQTT library's internal buffer is full. Most MQTT client libraries have a finite send queue. When it fills, new publishes get rejected.
  3. The gateway rebooted. Any data in memory is lost. Only data written to persistent storage survives.

QoS handles message delivery within an established session. Store-and-forward handles data persistence across disconnections, reconnections, and reboots.

The Page-Based Buffer Architecture

A production buffer uses a paged memory pool — a contiguous block of memory divided into fixed-size pages that cycle through three states:

┌─────────────────────────────────────────────────────┐
│                  Buffer Memory Pool                 │
│                                                     │
│  ┌──────┐  ┌──────┐  ┌──────┐  ┌──────┐  ┌──────┐   │
│  │Page 0│  │Page 1│  │Page 2│  │Page 3│  │Page 4│   │
│  │ FREE │  │ USED │  │ USED │  │ WORK │  │ FREE │   │
│  └──────┘  └──────┘  └──────┘  └──────┘  └──────┘   │
│                                                     │
│  FREE = empty, available for writing                │
│  WORK = currently being filled with incoming data   │
│  USED = full, queued for delivery to MQTT broker    │
└─────────────────────────────────────────────────────┘

Page States

  • FREE pages form a linked list of available pages. When the buffer needs a new work page, it pulls from the free list.
  • WORK page is the single page currently accepting incoming data. New telemetry batches get appended here. There is always at most one work page.
  • USED pages form an ordered queue of pages waiting to be delivered. The buffer sends data from the head of the used queue, one message at a time.

Page Structure

Each page contains multiple messages, packed sequentially:

┌─────────────────────────────────────────────┐
│                   Page N                    │
│                                             │
│  ┌──────────┬──────────┬──────────────────┐ │
│  │ msg_id   │ msg_size │ message_data     │ │
│  │ (4 bytes)│ (4 bytes)│ (variable)       │ │
│  ├──────────┼──────────┼──────────────────┤ │
│  │ msg_id   │ msg_size │ message_data     │ │
│  │ (4 bytes)│ (4 bytes)│ (variable)       │ │
│  ├──────────┼──────────┼──────────────────┤ │
│  │          ... more messages ...         │ │
│  └──────────┴──────────┴──────────────────┘ │
│                                             │
│  write_p ──→ next write position            │
│  read_p  ──→ next read position (delivery)  │
│                                             │
└─────────────────────────────────────────────┘

The msg_id field is critical — it gets filled in by the MQTT library's publish() call, which returns a packet ID. When the broker acknowledges delivery (via the PUBACK callback in QoS 1), the buffer matches the acknowledged ID against the head of the delivery queue.

Memory Sizing

The minimum viable buffer needs at least three pages:

  • One page being filled (WORK)
  • One page being transmitted (USED, head of queue)
  • One page available for the next batch (FREE)

In practice, you want more headroom. The formula:

buffer_size = page_size × (desired_holdover_time / batch_interval)

Example (assuming each finalized batch fills one page):
- Page size: 32 KB
- Batch interval: 30 seconds
- Desired holdover: 10 minutes
- Pages needed: 600 s / 30 s = 20
- Buffer size: 20 × 32 KB = 640 KB

On a typical embedded Linux gateway with 256MB–512MB RAM, dedicating 1–4 MB to the telemetry buffer is reasonable.
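The sizing arithmetic is worth encoding once so it can be reused at configuration time. A sketch under the same one-batch-per-page assumption as the example above (function names are illustrative):

```c
/* Pages needed to hold `holdover_s` seconds of batches arriving
 * every `batch_interval_s` seconds, one batch per page. */
static unsigned pages_needed(unsigned holdover_s, unsigned batch_interval_s)
{
    return holdover_s / batch_interval_s;
}

/* Total buffer allocation in bytes. */
static unsigned buffer_bytes(unsigned page_size, unsigned holdover_s,
                             unsigned batch_interval_s)
{
    return page_size * pages_needed(holdover_s, batch_interval_s);
}
```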

The Write Path: Accepting Incoming Data

When the data collection layer finishes a polling cycle and has a batch of tag values ready to deliver, it calls into the buffer:

Step 1: Check the Work Page

If no work page exists, allocate one from the free list. If the free list is empty, steal the oldest used page — this is the overflow strategy (more on this below).

Step 2: Size Check

Before writing, verify that the message (plus its 8-byte header) fits in the remaining space on the work page:

remaining = page_size - (write_p - start_p)
needed    = 4 (msg_id) + 4 (msg_size) + payload_size

if needed > remaining:
    move work_page to used_pages queue
    allocate a new work page
    retry

Step 3: Write the Message

1. Write 4 zero bytes at write_p    (placeholder for msg_id)
2. Write message size as uint32 (4 bytes)
3. Write message payload (N bytes)
4. Advance write_p by 8 + N

The msg_id is initially zero because we don't know it yet — it gets assigned when the message is actually published to MQTT.

Step 4: Trigger Delivery

After every write, the buffer checks if it can send data. If the connection is up and no message is currently awaiting acknowledgment, it initiates delivery of the next queued message.

The Read Path: Delivering to MQTT

Delivery follows a strict one-message-at-a-time discipline. The buffer maintains a packet_sent flag:

if connected == false:  return
if packet_sent == true: return  (waiting for PUBACK)

message = used_pages[0].read_p
result  = mqtt_publish(message.data, message.size, &message.msg_id)

if result == success:
    packet_sent = true
else:
    packet_sent = false  (retry on next opportunity)

Why One at a Time?

Sending multiple messages without waiting for acknowledgment is tempting — it would be faster. But it creates a delivery ordering problem. If messages 1, 2, and 3 are sent simultaneously and message 2's PUBACK arrives first, you don't know whether messages 1 and 3 were delivered. With one-at-a-time, the delivery order is guaranteed to match the insertion order.

For higher throughput, some implementations pipeline 2–3 messages and track a small window of in-flight packet IDs. But for industrial telemetry where data integrity matters more than latency, sequential delivery is the safer choice.

The Delivery Confirmation Callback

When the MQTT library's on_publish callback fires with a packet ID:

1. Lock the buffer mutex
2. Check that the packet_id matches used_pages[0].read_p.msg_id
3. Advance read_p past the delivered message
4. If read_p >= write_p:
   - Page completely delivered
   - Move page from used_pages to free_pages
   - Reset the page's write_p and read_p
5. Set packet_sent = false
6. Attempt to send the next message
7. Unlock mutex

This is where the msg_id field in the page pays off — it's the correlation key between "we published this" and "the broker confirmed this."

Overflow Handling: When Memory Runs Out

On a constrained device, the buffer will eventually fill up during an extended outage. The question is: what do you sacrifice?

Strategy 1: Drop Newest (Ring Buffer)

When the free list is empty, reject new writes. The data collection layer simply loses the current batch. This preserves historical data but creates gaps at the end of the outage.

Strategy 2: Drop Oldest (FIFO Eviction)

When the free list is empty, steal the oldest used page — the one at the head of the delivery queue. This preserves the most recent data but creates gaps at the beginning of the outage.
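The eviction itself is just a pop from the head of the used FIFO when the free list comes up empty. A minimal sketch with illustrative names (`acquire_page`, `node_t`), stripped down to the list mechanics:

```c
#include <stddef.h>

typedef struct node { struct node *next; } node_t;

/* Pop the head of a singly-linked list; returns NULL if empty. */
static node_t *pop_head(node_t **head)
{
    node_t *n = *head;
    if (n) *head = n->next;
    return n;
}

/* Prefer a free page; otherwise steal (evict) the oldest used page.
 * *overflowed is set when data was sacrificed so the caller can log
 * an overflow warning. */
static node_t *acquire_page(node_t **free_list, node_t **used_head,
                            int *overflowed)
{
    node_t *p = pop_head(free_list);
    *overflowed = 0;
    if (!p) {
        p = pop_head(used_head);   /* oldest data is sacrificed */
        *overflowed = (p != NULL);
    }
    return p;
}
```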

Which to Choose?

For industrial monitoring, drop-oldest is almost always correct. The reasoning:

  • During a long outage, the most recent data is more actionable than data from 20 minutes ago.
  • When connectivity returns, operators want to see current machine state, not historical state from the beginning of the outage.
  • Historical data from the outage period can often be reconstructed from PLC internal logs after the fact.

A production implementation logs a warning when it evicts a page:

Buffer: Overflow warning! Extracted USED page (#7)

This warning should be forwarded to the platform's monitoring layer so operators know data was lost.

Thread Safety

The buffer is accessed from two threads:

  1. The polling thread — calls buffer_add_data() after each collection cycle
  2. The MQTT callback thread — calls buffer_process_data_delivered() when PUBACKs arrive

A mutex protects all buffer operations:

// Pseudocode
void buffer_add_data(buffer, data, size) {
    lock(buffer->mutex)
    write_data_to_work_page(buffer, data, size)
    try_send_next_message(buffer)
    unlock(buffer->mutex)
}

void buffer_on_puback(buffer, packet_id) {
    lock(buffer->mutex)
    advance_read_pointer(buffer, packet_id)
    try_send_next_message(buffer)
    unlock(buffer->mutex)
}

The key insight: try_send_next_message() is called from both code paths. After adding data, the buffer checks if it can immediately begin delivery. After confirming delivery, it checks if there's more data to send. This creates a self-draining pipeline that doesn't need a separate timer or polling loop.

Connection State Management

The buffer tracks connectivity through two callbacks:

On Connect

buffer->connected = true
try_send_next_message(buffer) // Start draining the queue

On Disconnect

buffer->connected = false
buffer->packet_sent = false // Reset in-flight tracking

The packet_sent = false on disconnect is critical. If a message was in flight when the connection dropped, we have no way of knowing whether the broker received it. Setting packet_sent = false means the message will be re-sent on reconnection. This may result in duplicate delivery — which is fine. Industrial telemetry systems should be idempotent anyway (a repeated temperature reading at timestamp T is the same as the original).

Batch Finalization: When to Flush

Data arrives at the buffer through a batch layer that groups multiple tag values before serialization. The batch finalizes (and writes to the buffer) on two conditions:

1. Size Limit

When the accumulated batch exceeds a configured maximum size (e.g., 32 KB for JSON, or when the binary payload reaches 90% of the maximum), the batch is serialized and written to the buffer immediately:

if current_batch_size > max_batch_size:
    finalize_and_write_to_buffer(batch)
    reset_batch()

2. Time Limit

When the time since the batch started collecting exceeds a configured timeout (e.g., 30 seconds), the batch is finalized regardless of size:

elapsed = now - batch_start_time
if elapsed > max_batch_time:
    finalize_and_write_to_buffer(batch)
    reset_batch()

The time-based trigger is checked at the end of each tag group within a polling cycle, not on a separate timer. This avoids adding another thread and ensures the batch is finalized at a natural boundary in the data stream.

Binary vs. JSON Serialization

Production edge systems typically support two serialization formats:

JSON Format

{
  "groups": [
    {
      "ts": 1709341200,
      "device_type": 1018,
      "serial_number": 12345,
      "values": [
        {"id": 1, "values": [452]},
        {"id": 2, "values": [38]},
        {"id": 162, "error": -5}
      ]
    }
  ]
}

JSON is human-readable and easy to debug but verbose. A batch of 25 tag values in JSON might be 800 bytes.

Binary Format

0xF7                  Command byte
[4B] num_groups       Number of timestamp groups
For each group:
  [4B] timestamp      Unix timestamp
  [2B] dev_type       Device type ID
  [4B] serial         Device serial number
  [4B] num_values     Number of values in group
  For each value:
    [2B] tag_id       Tag identifier
    [1B] status       0x00 = OK, other = error
    [1B] count        Array size
    [1B] elem_sz      Element size (1, 2, or 4 bytes)
    [N×S bytes]       Packed values (MSB first)

The same 25 tag values in binary format might be 180 bytes — a 4.4× reduction. On cellular connections where bandwidth is metered per megabyte, this matters enormously.

The format choice is configured per device. Many deployments use binary for production and JSON for commissioning/debugging.

Monitoring the Buffer

A healthy buffer should have these characteristics:

  • Pages cycling regularly — pages move from FREE → WORK → USED → FREE in a steady rhythm
  • No overflow warnings — if you see "extracted USED page" in the logs, the buffer is undersized or the connection is too unreliable
  • Delivery timestamps advancing — track the timestamp of the last confirmed delivery. If it stops advancing while data is being collected, something is wrong with the MQTT connection

The edge daemon should publish buffer health as part of its periodic status message:

{
  "buffer": {
    "total_pages": 20,
    "free_pages": 14,
    "used_pages": 5,
    "work_pages": 1,
    "last_delivery_ts": 1709341200,
    "overflow_count": 0
  }
}

How machineCDN Implements Store-and-Forward

machineCDN's edge gateway implements the full page-based buffer architecture described in this article. The buffer sits between the batch serialization layer and the MQTT transport, providing:

  • Automatic page management — the gateway sizes the buffer based on available memory and configured batch parameters
  • Drop-oldest overflow — during extended outages, the most recent data is always preserved
  • Dual-format support — JSON for commissioning, binary for production deployments, configurable per device
  • Connection-aware delivery — the buffer begins draining immediately when the MQTT connection comes back up, with sequential delivery confirmation via QoS 1 PUBACKs

For multi-machine deployments on cellular gateways, the binary format combined with batch-and-forward typically reduces bandwidth consumption by 70–80% compared to per-tag JSON publishing — which translates directly to lower cellular data costs.

Key Takeaways

  1. MQTT QoS doesn't replace store-and-forward. QoS handles delivery within a session. Store-and-forward handles persistence across disconnections.

  2. Use a paged memory pool. Fixed-size pages with three states (FREE/WORK/USED) give you predictable memory usage and simple overflow handling.

  3. One message at a time for delivery integrity. Sequential delivery with PUBACK confirmation guarantees ordering and makes the system easy to reason about.

  4. Drop oldest on overflow. In industrial monitoring, recent data is more valuable than historical data from the beginning of an outage.

  5. Finalize batches on both size and time. Size limits prevent memory bloat; time limits prevent stale data sitting in an incomplete batch.

  6. Thread safety is non-negotiable. The polling thread and MQTT callback thread both touch the buffer. A mutex with minimal critical sections keeps things safe without impacting throughput.

The store-and-forward buffer is the unsung hero of reliable industrial telemetry. It's not glamorous, it doesn't show up in marketing slides, but it's the component that determines whether your IIoT platform loses data at 2 AM on a Saturday when the cell tower goes down — or quietly holds everything until the connection comes back and delivers it all without anyone ever knowing there was a problem.

Binary Payload Encoding for Industrial MQTT: Cutting Bandwidth by 10x on Constrained Networks [2026]

· 13 min read

Binary Payload Encoding

JSON is killing your cellular data budget.

When your edge gateway publishes a single temperature reading as {"tag_id": 42, "value": 23.45, "type": "float", "status": 0, "ts": 1709312400}, that's 72 bytes of text to convey 11 bytes of actual information: a 2-byte tag ID, a 4-byte float, a 1-byte status code, and a 4-byte timestamp (which is shared across all tags in the same poll cycle anyway).

At 200 tags polled every 5 seconds, JSON payloads consume roughly 100 KB/minute — over 4 GB/month. On a $15/month cellular plan with a 1 GB cap, you've blown your data budget by day 8.

Binary encoding solves this. By designing a compact wire format purpose-built for industrial telemetry, you can reduce per-tag overhead from ~70 bytes to ~7 bytes — a 10x reduction that makes cellular and satellite IIoT deployments economically viable.

This article covers the engineering of binary payload formats for industrial MQTT, from byte-level encoding decisions to the buffering and delivery systems that ensure data integrity.

Why JSON Falls Short for Industrial Telemetry

JSON became the default payload format for MQTT in the IIoT world because it's human-readable, self-describing, and every platform can parse it. These are real advantages during development and debugging. But they come at a cost that compounds brutally at scale.

The Overhead Tax

Let's dissect a typical JSON telemetry message:

{
  "device_type": 1010,
  "serial": 1106550353,
  "ts": 1709312400,
  "tags": [
    {"id": 1, "status": 0, "type": "uint16", "values": [4200]},
    {"id": 2, "status": 0, "type": "float", "values": [23.45]},
    {"id": 3, "status": 0, "type": "bool", "values": [1]}
  ]
}

This payload is approximately 250 bytes. The actual data content:

  • Device type: 2 bytes
  • Serial number: 4 bytes
  • Timestamp: 4 bytes
  • 3 tag values: 2 + 4 + 1 = 7 bytes
  • 3 tag IDs: 6 bytes
  • 3 status codes: 3 bytes

Total useful data: 26 bytes. The other 224 bytes are structural overhead — curly braces, square brackets, quotation marks, colons, commas, key names, and redundant type strings.

That's a total-to-useful ratio of 9.6x: for every byte of machine data, you're transmitting nearly nine bytes of JSON syntax.

CPU Cost on Embedded Gateways

JSON serialization isn't free on embedded hardware. Constructing JSON objects, converting numbers to strings, escaping special characters, and computing string lengths all consume CPU cycles that could be spent polling more tags or running edge analytics.

On an ARM Cortex-A7 gateway (common in industrial routers), JSON serialization of a 200-tag batch takes 2–5ms. The equivalent binary encoding takes 200–500μs — an order of magnitude faster. When you're polling Modbus every second and need to leave CPU headroom for other tasks, this matters.

Designing a Binary Telemetry Format

A practical binary format for industrial MQTT must balance compactness with extensibility. Here's a proven structure used in production industrial gateways.

Message Structure

┌─────────────────────────────────────────┐
│ Header                                  │
│ ├─ Timestamp     (4 bytes, uint32)      │
│ ├─ Device Type   (2 bytes, uint16)      │
│ └─ Serial Number (4 bytes, uint32)      │
├─────────────────────────────────────────┤
│ Tag Group                               │
│ ├─ Tag Count (2 bytes, uint16)          │
│ ├─ Tag Record 1                         │
│ │  ├─ Tag ID      (2 bytes, uint16)     │
│ │  ├─ Status      (1 byte, uint8)       │
│ │  ├─ Type        (1 byte, uint8)       │
│ │  ├─ Value Count (1 byte, uint8)       │
│ │  └─ Values      (variable)            │
│ ├─ Tag Record 2                         │
│ │  └─ ...                               │
│ └─ Tag Record N                         │
└─────────────────────────────────────────┘

Type Encoding

Use a single byte to encode the value type, which also determines the byte width of each value:

Type Code   Type      Bytes per Value
0x01        bool      1
0x02        int32     4
0x03        uint32    4
0x04        float32   4
0x05        int16     2
0x06        uint16    2
0x07        int8      1
0x08        uint8     1

This type system covers every data type you'll encounter in Modbus and EtherNet/IP PLCs. The decoder uses the type code to determine exactly how many bytes to read for each value — no parsing ambiguity, no delimiter scanning.

Size Comparison

For the same 3-tag example above:

Binary encoding:

  • Header: 10 bytes (timestamp + device type + serial)
  • Tag count: 2 bytes
  • Tag 1 (uint16): 2 + 1 + 1 + 1 + 2 = 7 bytes
  • Tag 2 (float32): 2 + 1 + 1 + 1 + 4 = 9 bytes
  • Tag 3 (bool): 2 + 1 + 1 + 1 + 1 = 6 bytes

Total: 34 bytes vs. 250 bytes for JSON. That's a 7.3x reduction.

The savings compound as tag count increases. At 100 tags (a typical mid-size PLC), a JSON batch runs 6–8 KB; the binary equivalent is 700–900 bytes. At 200 tags, JSON hits 12–16 KB while binary stays under 2 KB.
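The byte math above can be verified with a short encoder sketch. This is an illustration in Python, not the gateway's actual implementation: the field layout follows the structure diagram, and little-endian byte order is an assumption since the article doesn't specify one.

```python
import struct

# Type codes from the table above, mapped to struct format characters
# (little-endian assumed for illustration).
TYPE_FMT = {0x01: "B", 0x02: "i", 0x03: "I", 0x04: "f",
            0x05: "h", 0x06: "H", 0x07: "b", 0x08: "B"}

def encode_batch(timestamp, device_type, serial, tags):
    """tags: iterable of (tag_id, status, type_code, values)."""
    out = struct.pack("<IHI", timestamp, device_type, serial)  # 10-byte header
    out += struct.pack("<H", len(tags))                        # tag count
    for tag_id, status, tcode, values in tags:
        # Per-record header: id (2) + status (1) + type (1) + value count (1)
        out += struct.pack("<HBBB", tag_id, status, tcode, len(values))
        for v in values:
            out += struct.pack("<" + TYPE_FMT[tcode], v)
    return out
```

Encoding the 3-tag example (uint16, float32, bool) produces exactly the 34 bytes computed above.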

Data Grouping: Batches and Groups

Individual tag values shouldn't be published as individual MQTT messages. The MQTT protocol itself adds overhead: a PUBLISH packet includes a fixed header (2 bytes minimum), topic string (20–50 bytes for a typical industrial topic), and packet identifier (2 bytes for QoS 1). Publishing 200 individual messages means 200× this overhead.

Timestamp-Grouped Batches

The most effective grouping strategy collects all tag values from a single poll cycle into one batch, sharing a single timestamp:

[Batch Start: timestamp=1709312400]
Tag 1: id=1, status=0, type=uint16, value=4200
Tag 2: id=2, status=0, type=float, value=23.45
Tag 3: id=3, status=0, type=bool, value=1
...
[Batch End]

The timestamp in the batch header applies to all contained tags. This eliminates per-tag timestamp overhead — a savings of 4 bytes per tag, or 800 bytes across 200 tags.

Batch Size Limits

MQTT brokers and clients enforce maximum message sizes. Azure IoT Hub limits device-to-cloud messages to 256 KB. AWS IoT Core allows 128 KB. A default Mosquitto install accepts messages up to the MQTT protocol maximum of roughly 256 MB, and should be configured lower for production use.

More importantly, your edge gateway's memory and processing constraints impose practical limits. A 4 KB batch size works well for most deployments:

  • Large enough to hold 200+ tags in binary format
  • Small enough to fit in constrained gateway memory
  • Fast enough to serialize without impacting the poll loop

When a batch exceeds the configured size, close it and start a new one. The cloud decoder handles multiple batches with the same timestamp gracefully.

Change-of-Value Filtering Before Batching

Apply change-of-value (COV) filtering before adding values to the batch, not after. If a tag's value hasn't changed since the last report and COV is enabled for that tag, skip it entirely. This reduces batch sizes further during steady-state operation — when 80% of tags are unchanged, your binary batch shrinks proportionally.

However, implement a periodic full-refresh: every hour (or configurable interval), reset all COV baselines and include every tag in the next batch. This ensures the cloud always has a complete snapshot, even if individual change events were lost during a brief disconnection.
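The filter plus periodic refresh can be sketched as a small class. Names and the one-hour default are illustrative, not from the article's implementation:

```python
import time

class CovFilter:
    """Change-of-value filter with a periodic full refresh.
    When the refresh interval elapses, all baselines are cleared so
    every tag is included in the next batch."""

    def __init__(self, refresh_interval=3600):
        self.last_sent = {}                  # tag_id -> last transmitted value
        self.refresh_interval = refresh_interval
        self.last_refresh = time.monotonic()

    def should_send(self, tag_id, value, now=None):
        now = time.monotonic() if now is None else now
        if now - self.last_refresh >= self.refresh_interval:
            self.last_sent.clear()           # reset all COV baselines
            self.last_refresh = now
        if self.last_sent.get(tag_id) == value:
            return False                     # unchanged: skip this tag
        self.last_sent[tag_id] = value
        return True
```

During steady-state operation most calls return False, shrinking the batch; after each refresh every tag is resent once, restoring a complete snapshot in the cloud.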

The Page Buffer: Store-and-Forward in Fixed Memory

Binary encoding solves the bandwidth problem. But you still need to handle MQTT disconnections without losing data. The page-based ring buffer is the industrial standard for store-and-forward in embedded systems.

Architecture

Pre-allocate a contiguous memory region at startup and divide it into fixed-size pages:

┌────────────────────────────────────────────────┐
│ Buffer Memory (e.g., 512 KB)                   │
│                                                │
│ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐  │
│ │Page 0│ │Page 1│ │Page 2│ │Page 3│ │Page 4│  │
│ └──────┘ └──────┘ └──────┘ └──────┘ └──────┘  │
└────────────────────────────────────────────────┘

Pages cycle through three states:

  1. Free — empty, available for writing
  2. Work — currently being written to by the Modbus polling thread
  3. Used — full, waiting for MQTT delivery

Page Layout

Each page contains multiple messages, packed sequentially:

┌─────────────────────────────────────┐
│ Page Header (struct, ~16 bytes)     │
├─────────────────────────────────────┤
│ Message 1:                          │
│ ├─ Message ID (4 bytes)             │
│ ├─ Message Size (4 bytes)           │
│ └─ Message Body (variable)          │
├─────────────────────────────────────┤
│ Message 2:                          │
│ ├─ Message ID (4 bytes)             │
│ ├─ Message Size (4 bytes)           │
│ └─ Message Body (variable)          │
├─────────────────────────────────────┤
│ ... (more messages)                 │
├─────────────────────────────────────┤
│ Free space                          │
└─────────────────────────────────────┘

The 4-byte message ID field is filled by the MQTT library when the message is published (at QoS 1). The gateway uses this ID to match publish acknowledgments to specific messages.

Write Path

  1. Check if the current work page has enough space for the new message (message size + 8 bytes for ID and size fields).
  2. If yes: write the message, advance the write pointer.
  3. If no: move the work page to the "used" queue, grab a free page as the new work page, and write there.
  4. If no free pages exist: grab the oldest used page (overflow condition). Log a warning — you're losing the oldest buffered data, but preserving the newest.

This overflow strategy is deliberately biased toward fresh data. In industrial monitoring, a temperature reading from 5 minutes ago is far more valuable than one from 3 days ago that was buffered during an outage.
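The four steps above can be sketched as a minimal page pool. This is a simplified Python illustration (class and field names are hypothetical; the real daemon works on a pre-allocated C buffer): each record gets an 8-byte header, a 4-byte message-ID slot filled later at publish time plus a 4-byte size field, and a record is assumed to fit in one page.

```python
from collections import deque

class PagePool:
    """FREE/WORK/USED write path with drop-oldest overflow."""

    def __init__(self, num_pages=4, page_size=4096):
        self.page_size = page_size
        self.free = deque(bytearray() for _ in range(num_pages))
        self.work = self.free.popleft()      # current WORK page
        self.used = deque()                  # full pages awaiting delivery
        self.overflows = 0

    def write(self, message: bytes):
        # 4-byte ID slot (zeroed, filled by the MQTT library) + 4-byte size.
        record = b"\x00" * 4 + len(message).to_bytes(4, "big") + message
        if len(self.work) + len(record) > self.page_size:
            self.used.append(self.work)      # WORK -> USED
            if self.free:
                self.work = self.free.popleft()
            else:
                # Overflow: reclaim the oldest USED page, losing its data.
                self.work = self.used.popleft()
                self.work.clear()
                self.overflows += 1
        self.work += record
```

With two 100-byte pages and 40-byte messages, the fifth write triggers exactly one overflow, discarding the oldest buffered page while the newest data survives.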

Delivery Path

  1. Take the first page from the "used" queue.
  2. Read the next undelivered message (tracked by a per-page read pointer).
  3. Publish via MQTT at QoS 1.
  4. Wait for PUBACK — don't advance the read pointer until the broker confirms receipt.
  5. On PUBACK: advance the read pointer. If the page is fully delivered, move it back to "free."
  6. On disconnect: stop sending, keep writing. The buffer absorbs the outage.

The wait-for-PUBACK step is critical. Without it, you're fire-and-forgetting into a potentially disconnected socket, and data silently disappears.

Thread Safety

The write path (Modbus polling thread) and delivery path (MQTT thread) operate concurrently on the same buffer. A mutex protects all page state transitions:

  • Moving pages between free/work/used queues
  • Checking available space
  • Advancing read/write pointers
  • Processing delivery acknowledgments

Keep the critical section as small as possible — lock, update pointers, unlock. Never hold the mutex during a Modbus read or MQTT publish; those operations can block for seconds.

Delivery Tracking and Watchdogs

In production, "the MQTT connection is up" doesn't mean data is flowing. The connection can be technically alive (TCP socket open, keepalives passing) while messages silently fail to publish or acknowledge.

Delivery Timestamp Tracking

Track the timestamp of the last successfully delivered message (confirmed by PUBACK). If this timestamp falls more than N minutes behind the current time, something is wrong:

  • The broker may be rejecting messages (payload too large, topic permission denied)
  • The network may be passing keepalives but dropping data packets
  • The MQTT library may be stuck in an internal error state

When the delivery watchdog fires, tear down the entire MQTT connection and reinitialize. It's a heavy-handed recovery, but it's reliable. In industrial systems, a clean restart beats a subtle degradation every time.
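The watchdog condition itself is a one-liner. A sketch, with an assumed 5-minute window and illustrative names:

```python
import time

def watchdog_should_recycle(last_puback_ts, has_pending_data,
                            timeout_s=300, now=None):
    """True when delivery has stalled: data is queued for transmission
    but no PUBACK has arrived within the timeout window."""
    now = time.time() if now is None else now
    return has_pending_data and (now - last_puback_ts) > timeout_s
```

Note the has_pending_data guard: a quiet gateway with nothing to send should never trip the watchdog, no matter how long ago the last acknowledgment arrived.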

Status Telemetry

The gateway should periodically publish its own status message containing:

  • Daemon uptime — how long since last restart
  • System uptime — how long since last boot
  • Buffer state — pages free/used/work, current fill level
  • PLC link state — is the Modbus connection healthy
  • Firmware version — for remote fleet management
  • Token expiration — time remaining on the MQTT auth token

This status message can use JSON even if data messages use binary — it's infrequent (every 30–60 seconds) and readability matters more than compactness for diagnostics.

Bandwidth Math: Real-World Numbers

Let's calculate the actual savings for a typical deployment:

Scenario: 150 tags, polled every 5 seconds, 50% change rate with COV enabled, cellular connection.

JSON Format

  • Average tag JSON: ~60 bytes
  • Tags per poll (with 50% COV): 75
  • Batch overhead: ~50 bytes
  • Total per poll: 75 × 60 + 50 = 4,550 bytes
  • Per minute (12 polls): 54.6 KB
  • Per day: 78.6 MB
  • Per month: 2.36 GB

Binary Format

  • Average tag binary: ~7 bytes
  • Header per batch: 12 bytes
  • Total per poll: 75 × 7 + 12 = 537 bytes
  • Per minute (12 polls): 6.4 KB
  • Per day: 9.3 MB
  • Per month: 279 MB

Savings: 88% reduction — from 2.36 GB to 279 MB. On a $20/month cellular plan with 500 MB included, JSON doesn't fit. Binary does, with headroom.
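The arithmetic above is easy to reproduce. A quick sketch (payload bytes only, deliberately ignoring MQTT and TLS overhead, which the next paragraph addresses):

```python
def monthly_bytes(tags_per_poll, bytes_per_tag, batch_overhead,
                  poll_interval_s, days=30):
    """Rough monthly telemetry volume for a fixed polling schedule."""
    per_poll = tags_per_poll * bytes_per_tag + batch_overhead
    polls = days * 24 * 3600 // poll_interval_s
    return per_poll * polls

json_month = monthly_bytes(75, 60, 50, 5)    # 2,358,720,000 bytes ≈ 2.36 GB
binary_month = monthly_bytes(75, 7, 12, 5)   # 278,380,800 bytes ≈ 279 MB
```

The ratio works out to an 88% reduction, matching the figures above.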

Add MQTT overhead (topic strings, packet headers) and TLS overhead (~40 bytes per record), and real-world savings are slightly less dramatic but still consistently in the 8–10x range.

Decoding on the Cloud Side

Binary encoding shifts complexity from the edge to the cloud. The decoder must:

  1. Parse the header to extract timestamp, device type, and serial number.
  2. Iterate tag records using the type code to determine value byte widths.
  3. Reconstruct typed values — particularly IEEE 754 floats from their 4-byte binary representation.
  4. Handle partial messages — if a batch was truncated due to buffer overflow, the decoder must fail gracefully on the last incomplete record without losing the valid records before it.

Most cloud platforms (Azure IoT Hub, AWS IoT Core) support custom message decoders that transform binary payloads to JSON for downstream processing. Write the decoder once, and the rest of your analytics pipeline sees standard JSON.
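A minimal decoder covering all four responsibilities might look like the following sketch. It assumes the wire format described earlier (10-byte header, 2-byte tag count, 5-byte record headers) and little-endian byte order, which the article leaves unspecified:

```python
import struct

# Type codes mapped to (struct format char, byte width per value).
TYPES = {0x01: ("B", 1), 0x02: ("i", 4), 0x03: ("I", 4), 0x04: ("f", 4),
         0x05: ("h", 2), 0x06: ("H", 2), 0x07: ("b", 1), 0x08: ("B", 1)}

def decode_batch(payload):
    """Returns (header, records); stops cleanly at the first truncated record
    so valid records before a truncation point are preserved."""
    ts, dev_type, serial = struct.unpack_from("<IHI", payload, 0)
    (count,) = struct.unpack_from("<H", payload, 10)
    records, off = [], 12
    for _ in range(count):
        if off + 5 > len(payload):
            break                              # truncated record header
        tag_id, status, tcode, n = struct.unpack_from("<HBBB", payload, off)
        off += 5
        fmt, width = TYPES[tcode]
        if off + n * width > len(payload):
            break                              # incomplete value data
        values = list(struct.unpack_from("<" + fmt * n, payload, off))
        off += n * width
        records.append((tag_id, status, values))
    return {"ts": ts, "device_type": dev_type, "serial": serial}, records
```

The struct module's unpack of the "f" format reconstructs IEEE 754 float32 values natively, which covers responsibility 3 without manual bit manipulation.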

How machineCDN Implements Binary Telemetry

machineCDN's edge daemon uses binary encoding by default for all data telemetry. The implementation includes:

  • Compact binary batching with shared timestamps per group, reducing per-tag overhead to 5–9 bytes depending on data type.
  • Page-based ring buffer with pre-allocated memory, zero runtime allocation, and deliberate overflow behavior that preserves fresh data.
  • Per-message PUBACK tracking with delivery watchdog and automatic connection recycling.
  • Parallel JSON status messages for gateway diagnostics, published on a separate topic at lower frequency.
  • Automatic format negotiation — the cloud ingestion layer detects binary vs. JSON based on the first byte of the payload and routes to the appropriate decoder.

The result: machineCDN gateways routinely operate on 500 MB/month cellular plans, monitoring 200+ tags at 5-second intervals, with full store-and-forward resilience during connectivity outages.

When to Use Binary vs. JSON

Binary encoding isn't always the right choice. Use this decision framework:

Criterion       Use Binary                      Use JSON
Network         Cellular, satellite, metered    Ethernet, WiFi, unmetered
Tag count       > 50                            < 20
Poll interval   < 10 seconds                    > 60 seconds
Gateway CPU     Constrained (< 500 MHz)         Capable (> 1 GHz)
Debug needs     Production, stable              Development, changing
Downstream      Custom decoder available        Generic tooling needed

For most production industrial deployments — where gateways connect hundreds of tags over cellular and reliability trumps developer convenience — binary encoding is the clear winner. Save JSON for your status messages and the debugging serial port.

Getting Started

If you're designing a binary telemetry format for your own gateway:

  1. Start with the type system. Define your type codes and byte widths. Match them to your PLC's native data types.
  2. Design the header. Include version, device identity, and a shared timestamp. Add a format version byte so you can evolve the format without breaking old decoders.
  3. Build the buffer first. Get store-and-forward working before optimizing the encoding. Data integrity matters more than data compactness.
  4. Write the decoder alongside the encoder. Test with known values. Verify float encoding especially — IEEE 754 byte ordering bugs are silent and devastating.
  5. Measure real bandwidth. Deploy both JSON and binary formats on the same gateway for a week and compare actual data consumption. The numbers will sell the approach to stakeholders who question the added complexity.

Binary encoding is a solved problem in industrial telemetry. The patterns are well-established, the savings are dramatic, and the complexity cost is paid once at design time and amortized across every byte your fleet ever transmits.

Binary Telemetry Encoding for IIoT: Why JSON Is Killing Your Bandwidth [2026]

· 11 min read

If you're sending PLC tag values as JSON from edge gateways to the cloud, you're wasting 80–90% of your bandwidth. On a cellular-connected factory floor with dozens of machines, that's the difference between a $50/month data plan and a $500/month one — and the difference between sub-second telemetry and multi-second lag.

This guide breaks down binary telemetry encoding: how to pack industrial data efficiently at the edge, preserve type fidelity across the wire, and design batch grouping strategies that survive unreliable networks.

Binary telemetry encoding for IIoT edge devices

Reliable Telemetry Delivery in IIoT: Page Buffers, Batch Finalization, and Disconnection Recovery [2026]

· 13 min read

Your edge gateway reads 200 tags from a PLC every second. The MQTT connection to your cloud broker drops for 3 minutes because someone bumped the cellular antenna. What happens to the 36,000 data points collected during the outage?

If your answer is "they're gone," you have a toy system, not an industrial one.

Reliable telemetry delivery is the hardest unsolved problem in most IIoT architectures. Everyone focuses on the protocol layer — Modbus reads, EtherNet/IP connections, OPC-UA subscriptions — but the real engineering is in what happens between reading a value and confirming it reached the cloud. This article breaks down the buffer architecture that makes zero-data-loss telemetry possible on resource-constrained edge hardware.

Reliable telemetry delivery buffer architecture

The Problem: Three Asynchronous Timelines

In any edge-to-cloud telemetry system, you're managing three independent timelines:

  1. PLC read cycle — Tags are read at fixed intervals (1s, 60s, etc.). This never stops. The PLC doesn't care if your cloud connection is down.

  2. Batch collection — Raw tag values are grouped into batches by timestamp and device. Batches accumulate until they hit a size limit or a timeout.

  3. MQTT delivery — Batches are published to the broker. The broker acknowledges receipt. At QoS 1, the MQTT library handles retransmission, but only if you give it data in the right form.

These three timelines run independently. The PLC read loop runs on a tight 1-second cycle. Batch finalization might happen every 30–60 seconds. MQTT delivery depends on network availability. If any one of these stalls, the others must keep running without data loss.

This is fundamentally a producer-consumer problem with a twist: the consumer (MQTT) can disappear for minutes at a time, and the producer (PLC reads) cannot slow down.

The Batch Layer: Grouping Values for Efficient Transport

Raw tag values are tiny — a temperature reading is 4 bytes, a boolean is 1 byte. Sending each value as an individual MQTT message would be absurdly wasteful. Instead, values are collected into batches — structured payloads that contain multiple timestamped readings from one or more devices.

Batch Structure

A batch is organized as a series of groups, where each group represents one polling cycle (one timestamp, one device):

Batch
├── Group 0: { timestamp: 1709284800, device_type: 5000, serial: 12345 }
│   ├── Value: { id: 2, values: [72.4] }   // Delivery Temp
│   ├── Value: { id: 3, values: [68.1] }   // Mold Temp
│   └── Value: { id: 5, values: [12.6] }   // Flow Value
├── Group 1: { timestamp: 1709284860, device_type: 5000, serial: 12345 }
│   ├── Value: { id: 2, values: [72.8] }
│   ├── Value: { id: 3, values: [68.3] }
│   └── Value: { id: 5, values: [12.4] }
└── ...

Dual-Format Encoding: JSON vs Binary

Production edge daemons typically support two encoding formats for batches, and the choice has massive implications for bandwidth:

JSON format:

{
  "groups": [
    {
      "ts": 1709284800,
      "device_type": 5000,
      "serial_number": 12345,
      "values": [
        {"id": 2, "values": [72.4]},
        {"id": 3, "values": [68.1]}
      ]
    }
  ]
}

Binary format (same data):

Header:   F7            (1 byte  - magic)
Groups:   00 00 00 01   (4 bytes - group count)
Group 0:  65 E1 9D C0   (4 bytes - timestamp: 1709284800)
          13 88         (2 bytes - device type: 5000)
          00 00 30 39   (4 bytes - serial number: 12345)
          00 00 00 02   (4 bytes - value count)
Value 0:  00 02         (2 bytes - tag id)
          00            (1 byte  - status: OK)
          01            (1 byte  - values count)
          04            (1 byte  - element size: 4 bytes)
          42 90 CC CD   (4 bytes - float 72.4)
Value 1:  00 03         (2 bytes - tag id)
          00            (1 byte  - status: OK)
          01            (1 byte  - values count)
          04            (1 byte  - element size: 4 bytes)
          42 88 33 33   (4 bytes - float 68.1)
The JSON version of this payload: ~120 bytes. The binary version: ~38 bytes. That's a 3.2x reduction — and on a metered cellular connection at $0.01/MB, that savings compounds quickly when you're transmitting every 30 seconds 24/7.

The binary format uses a simple TLV-like structure: magic byte, group count (big-endian uint32), then for each group: timestamp (uint32), device type (uint16), serial number (uint32), value count (uint32), then for each value: tag ID (uint16), status byte, value count, element size, and raw value bytes. No field names, no delimiters, no escaping — just packed binary data.
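A decoder for this TLV layout fits in a few lines. A Python sketch for illustration (the article implies big-endian integers and IEEE 754 floats; the 4-byte-element-means-float shortcut is an assumption made here for brevity):

```python
import struct

def decode_tlv(payload):
    """Walk the magic byte, group count, and packed group/value records."""
    assert payload[0] == 0xF7, "bad magic byte"
    (group_count,) = struct.unpack_from(">I", payload, 1)
    off, groups = 5, []
    for _ in range(group_count):
        # Group header: timestamp, device type, serial, value count (14 bytes).
        ts, dev_type, serial, n_values = struct.unpack_from(">IHII", payload, off)
        off += 14
        values = []
        for _ in range(n_values):
            tag_id, status, n, elem_size = struct.unpack_from(">HBBB", payload, off)
            off += 5
            raw = payload[off:off + n * elem_size]
            off += n * elem_size
            if elem_size == 4:
                # Assumption: 4-byte elements in this example are float32.
                values.append((tag_id, status, list(struct.unpack(">" + "f" * n, raw))))
            else:
                values.append((tag_id, status, raw))
        groups.append({"ts": ts, "device_type": dev_type,
                       "serial": serial, "values": values})
    return groups
```

Because every field has a fixed, type-determined width, the decoder is a straight linear walk over the payload: no delimiter scanning, no escaping, no ambiguity.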

Batch Finalization Triggers

A batch should be finalized (sealed and queued for delivery) when either condition is met:

  1. Size limit exceeded — When the accumulated batch size exceeds a configured maximum (e.g., 500KB for JSON, or when the binary buffer is 90%+ full). The 90% threshold for binary avoids the edge case where the next value would overflow the buffer.

  2. Collection timeout expired — When elapsed time since the batch started exceeds a configured maximum (e.g., 60 seconds). This ensures data flows even during quiet periods with few value changes.

if (elapsed_seconds > max_collection_time) → finalize
if (batch_size > max_batch_size) → finalize

Both checks happen after every group is closed (after every polling cycle). This means finalization granularity is tied to your polling interval — if you poll every 1 second and your batch timeout is 60 seconds, each batch will contain roughly 60 groups.

The "Do Not Batch" Exception

Some values are too important to wait for batch finalization. Equipment alarms, pump state changes, emergency stops — these need to reach the cloud immediately. These tags are flagged as "do not batch" in the configuration.

When a do-not-batch tag changes value, it bypasses the normal batch pipeline entirely. A mini-batch is created on the spot — containing just that single value — and pushed directly to the outgoing buffer. This ensures sub-second cloud visibility for critical state changes, while bulk telemetry still benefits from batch efficiency.

Tag: "Pump Status"      interval: 1s     do_not_batch: true
Tag: "Heater Status"    interval: 1s     do_not_batch: true
Tag: "Delivery Temp"    interval: 60s    do_not_batch: false   ← normal batching

The Buffer Layer: Surviving Disconnections

This is where most IIoT implementations fail. The batch layer produces data. The MQTT layer consumes it. But what sits between them? If it's just an in-memory queue, you'll lose everything on disconnect.

Page-Based Ring Buffer Architecture

The production-grade answer is a page-based ring buffer — a fixed-size memory region divided into equal-sized pages that cycle through three states:

States:
  FREE → Available for writing
  WORK → Currently being filled with batch data
  USED → Filled, waiting for MQTT delivery

Lifecycle:
  FREE → WORK                             (when first data is added)
  WORK → USED                             (when page is full or batch is finalized)
  USED → transmit → delivery ACK → FREE   (recycled)

Here's how it works:

Memory layout: At startup, a contiguous block of memory is allocated (e.g., 2MB). This block is divided into pages of a configured size (matching the MQTT max packet size, typically matching the batch size). Each page has a small header tracking its state and a data area.

┌──────────────────────────────────────────────┐
│ [Page 0: USED] [Page 1: USED] [Page 2: WORK] │
│ [Page 3: FREE] [Page 4: FREE] [Page 5: FREE] │
│ [Page 6: FREE]  ...  [Page N: FREE]          │
└──────────────────────────────────────────────┘

Writing data: When a batch is finalized, its serialized bytes are written to the current WORK page. Each message gets a small header: a 4-byte message ID slot (filled later by the MQTT library) and a 4-byte size field. If the current page can't fit the next message, it transitions to USED and a fresh FREE page becomes the new WORK page.

Overflow handling: When all FREE pages are exhausted, the buffer reclaims the oldest USED page — the one that's been waiting for delivery the longest. This means you lose old data rather than new data, which is the right trade-off: the most recent readings are the most valuable. An overflow warning is logged so operators know the buffer is under pressure.

Delivery: When the MQTT connection is active, the buffer walks through USED pages and publishes their contents. Each publish gets a packet ID from the MQTT library. When the broker ACKs the packet (via the PUBACK callback for QoS 1), the corresponding page is recycled to FREE.

Disconnection recovery: When the MQTT connection drops:

  1. The disconnect callback fires
  2. The buffer marks itself as disconnected
  3. Data continues accumulating in pages (WORK → USED)
  4. When reconnected, the buffer immediately starts draining USED pages

No data is lost unless the buffer physically overflows. With 2MB of buffer and 500KB page size, you get 4 pages of headroom — enough to survive several minutes of disconnection at typical telemetry rates.

Thread Safety

The PLC read loop and the MQTT event loop run on different threads. The buffer must be thread-safe. Every buffer operation acquires a mutex:

  • buffer_add_data() — called from the PLC read thread after batch finalization
  • buffer_process_data_delivered() — called from the MQTT callback thread on PUBACK
  • buffer_process_connect() / buffer_process_disconnect() — called from MQTT lifecycle callbacks

Without proper locking, you'll see corrupted pages, double-free crashes, and mysterious data loss under load. This is non-negotiable.

Sizing the Buffer

Buffer sizing depends on three variables:

  1. Data rate: How many bytes per second does your polling loop produce?
  2. Expected outage duration: How long do you need to survive without MQTT?
  3. Available memory: Edge devices (especially industrial routers) have limited RAM

Example calculation:

  • 200 tags, average 6 bytes each (including binary overhead) = 1,200 bytes/group
  • Polling every 1 second = 1,200 bytes/second = 72KB/minute
  • Target: survive 30-minute outage = 2.16MB buffer
  • With 500KB pages = 5 pages minimum (round up for safety)

In practice, 2–4MB covers most scenarios. On a 32MB industrial router, that's well within budget.
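The calculation above generalizes to a two-line helper (function names are illustrative):

```python
import math

def buffer_bytes_needed(tags, bytes_per_tag, poll_interval_s, outage_minutes):
    """Buffer size required to absorb an outage of the given length."""
    rate = tags * bytes_per_tag / poll_interval_s   # bytes per second produced
    return int(rate * outage_minutes * 60)

def pages_needed(buffer_bytes, page_size):
    """Round up to whole pages; add one or two extra for safety margin."""
    return math.ceil(buffer_bytes / page_size)
```

Plugging in the example figures (200 tags, 6 bytes each, 1-second polling, 30-minute outage) gives 2.16 MB, or 5 pages at 500 KB each, matching the numbers above.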

The MQTT Layer: QoS, Reconnection, and Watchdogs

QoS 1: At-Least-Once Delivery

For industrial telemetry, QoS 1 is the right choice:

  • QoS 0 (fire and forget): No delivery guarantee. Unacceptable for production data.
  • QoS 1 (at least once): Broker ACKs every message. Duplicates possible but data loss prevented. Good trade-off.
  • QoS 2 (exactly once): Eliminates duplicates but doubles the handshake overhead. Rarely worth it for telemetry.

The page buffer's recycling logic depends on QoS 1: pages are only freed when the PUBACK arrives. If the ACK never comes (connection drops mid-transmission), the page stays in USED state and will be retransmitted after reconnection.

Connection Watchdog

MQTT connections can enter a zombie state — the TCP socket is open, the MQTT loop is running, but no data is actually flowing. This happens when network routing changes, firewalls silently drop the connection, or the broker becomes unresponsive.

The fix: a watchdog timer that monitors delivery acknowledgments. If no PUBACK has been received within a timeout window (e.g., 120 seconds) and data has been queued for transmission, force a reconnect:

if (now - last_delivered_packet_time > 120s) {
    if (has_pending_data) {
        // Force MQTT reconnection
        reset_mqtt_client();
    }
}

This catches the edge case where the MQTT library thinks it's connected but the network is actually dead. Without this watchdog, your edge daemon could silently accumulate hours of undelivered data in the buffer, eventually overflowing and losing it all.

Asynchronous Connection

MQTT connection establishment (DNS resolution, TLS handshake, CONNACK) can take several seconds, especially over cellular links. This must not block the PLC read loop. The connection should happen on a separate thread:

  1. Main thread detects connection is needed
  2. Connection thread starts connect_async()
  3. Main thread continues reading PLCs
  4. On successful connect, the callback fires and buffer delivery begins

If the connection thread is still working when a new connection attempt is needed, skip it — don't queue multiple connection attempts or you'll thrash the network stack.

TLS for Production

Any MQTT connection leaving your plant network must use TLS. Period. Industrial telemetry data — temperatures, pressures, equipment states, alarm conditions — is operationally sensitive. On the wire without encryption, anyone on the network path can see (and potentially modify) your readings.

For cloud brokers like Azure IoT Hub, TLS is mandatory. The edge daemon should:

  • Load the CA certificate from a PEM file
  • Use MQTT v3.1.1 protocol (widely supported, well-tested)
  • Monitor the SAS token expiration timestamp and alert before it expires
  • Automatically reinitialize the MQTT client when the certificate or connection string changes (file modification detected via stat())

Daemon Status Reporting

A well-designed edge daemon reports its own health back through the same MQTT channel it uses for telemetry. A periodic status message should include:

  • System uptime and daemon uptime — detect restarts
  • PLC link state — is the PLC connection healthy?
  • Buffer state — how full is the outgoing buffer?
  • MQTT state — connected/disconnected, last ACK time
  • SAS token expiration — days until credentials expire
  • Software version — for remote fleet management

An extended status format can include per-tag state: last read time, last delivery time, current value, and error count. This is invaluable for remote troubleshooting — you can see from the cloud exactly which tags are stale and why.

Value Comparison and Change Detection

Not all values need to be sent every polling cycle. A temperature that's been 72.4°F for the last hour doesn't need to be transmitted 3,600 times. Change detection — comparing the current value to the last sent value — can dramatically reduce bandwidth.

The implementation: each tag stores its last transmitted value. After reading, compare:

if (tag.compare_enabled && tag.has_been_read_once) {
    if (current_value == tag.last_value) {
        skip_this_value();  // Don't add to batch
    }
}

Important caveats:

  • Not all tags should use comparison. Continuous process variables (temperatures, flows) should always send, even if unchanged — the recipient needs the full time series to calculate trends and detect flatlines (a stuck sensor reads the same value forever, which is itself a fault condition).
  • Discrete state tags (booleans, enums) are ideal for comparison — they change rarely and each change is significant.
  • Floating-point comparison should use an epsilon threshold, not exact equality, to avoid sending noise from ADC jitter.
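The epsilon caveat is worth making concrete. A sketch of a comparison helper (the 0.05 threshold is an illustrative default; in practice it should be configured per tag to match sensor resolution):

```python
def value_changed(current, last, epsilon=0.05):
    """Epsilon-based change test for floats; exact match for discrete values.
    Returns True when the value should be transmitted."""
    if isinstance(current, float) or isinstance(last, float):
        return abs(current - last) > epsilon   # ignore ADC jitter below epsilon
    return current != last                     # booleans/enums: exact compare
```

With exact equality, a temperature jittering between 72.41 and 72.43 would generate a "change" on nearly every poll; the epsilon test suppresses that noise while still passing real moves.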

Putting It All Together: The Main Loop

The complete edge daemon main loop ties all these layers together:

1. Parse configuration (device addresses, tag lists, MQTT credentials)
2. Allocate memory (PLC config pool + output buffer)
3. Format output buffer into pages
4. Start MQTT connection thread
5. Detect PLC device (probe address, determine type/protocol)
6. Load device-specific tag configuration

MAIN LOOP (runs every 1 second):
  a. Check for config file changes → restart if changed
  b. Read PLC tags (coalesced Modbus/EtherNet/IP)
  c. Add values to batch (with comparison filtering)
  d. Check batch finalization triggers (size/timeout)
  e. Process incoming commands (config updates, force reads)
  f. Check MQTT connection watchdog
  g. Sleep 1 second

Every component — polling, batching, buffering, delivery — operates within this single loop iteration, keeping the system deterministic and debuggable.

How machineCDN Implements This

The machineCDN edge runtime implements this full stack natively on resource-constrained industrial routers. The page-based ring buffer runs in pre-allocated memory (no dynamic allocation after startup), the MQTT layer handles Azure IoT Hub and local broker configurations interchangeably, and the batch layer supports both JSON and binary encoding selectable per-device.

On a Teltonika RUT9xx router with 256MB RAM, the daemon typically uses under 4MB total — including 2MB of buffer space that can store 20+ minutes of telemetry during a connectivity outage. Tags are automatically sorted, coalesced, and dispatched with zero configuration beyond listing the tag names and addresses.

The result: edge gateways that have been running continuously for years in production environments, surviving cellular dropouts, network reconfigurations, and even firmware updates without losing a single data point.

Conclusion

Reliable telemetry delivery isn't about the protocol — it's about the pipeline. Modbus reads are the easy part. The hard engineering is in the layers between: batching values efficiently, buffering them through disconnections, and confirming delivery before recycling memory.

The key design principles:

  1. Never block the read loop — PLC polling is sacred
  2. Buffer with finite, pre-allocated memory — dynamic allocation on embedded systems is asking for trouble
  3. Reclaim oldest data first — in overflow, recent values matter more
  4. Acknowledge before recycling — a page stays USED until the broker confirms receipt
  5. Watch for zombie connections — a connected socket doesn't mean data is flowing

Get these right, and your edge infrastructure becomes invisible — which is exactly what production IIoT should be.