
OPC-UA Information Modeling and Subscriptions: A Deep Dive for IIoT Engineers [2026]

· 12 min read

If you've spent time wiring Modbus registers to cloud platforms, you know the pain: flat address spaces, no built-in semantics, and endless spreadsheets mapping register 40004 to "Mold Temperature Zone 2." OPC-UA was designed to solve exactly this problem — but its information modeling layer is far richer (and more complex) than most engineers realize when they first encounter it.

This guide goes deep on how OPC-UA structures industrial data, how subscriptions efficiently deliver changes to clients, and how security policies protect the entire stack. Whether you're evaluating OPC-UA for a greenfield deployment or bridging it into an existing Modbus/EtherNet-IP environment, this is the practical knowledge you need.

Paged Ring Buffers for Industrial MQTT: How to Never Lose a Data Point [2026]

· 10 min read

Here's the scenario every IIoT engineer dreads: your edge gateway is collecting temperature, pressure, and vibration data from 200 tags across 15 PLCs. The cellular modem on the factory roof drops its connection — maybe for 30 seconds during a handover, maybe for 4 hours because a backhoe hit a fiber line. When connectivity returns, what happens to the data?

If your answer is "it's gone," you have a buffer management problem. And fixing it properly requires understanding paged ring buffers — the unsung hero of reliable industrial telemetry.

Why Naive Buffering Fails

The simplest approach — queue MQTT messages in memory and retry on reconnect — has three fatal flaws:

  1. Memory exhaustion: A gateway reading 200 tags at 1-second intervals generates ~12,000 readings per minute. At ~100 bytes per JSON reading, that's 1.2 MB/minute. A 4-hour outage accumulates ~288 MB. Your 256 MB embedded gateway just died.

  2. No delivery confirmation: MQTT QoS 1 guarantees "at least once" delivery, but the Mosquitto client library's in-flight message queue is finite. If you publish 50,000 messages into a disconnected client, most will be silently dropped by the client library's internal buffer long before the broker sees them.

  3. Thundering herd on reconnect: When connectivity returns, dumping 288 MB of queued messages simultaneously will choke the cellular uplink (typically 1–5 Mbps), cause broker-side backpressure, and likely trigger another disconnect.

The Paged Ring Buffer Architecture

The solution is a fixed-size, page-based circular buffer that sits between the data collection layer and the MQTT client. Here's how it works:

Memory Layout

The buffer is allocated as a single contiguous block — typically 2 MB on an embedded gateway. This block is divided into equal-sized pages, each large enough to hold at least one complete MQTT payload.

┌─────────────────────────────────────────────────┐
│               2 MB Buffer Memory                │
├────────┬────────┬────────┬────────┬────────┬────┤
│ Page 0 │ Page 1 │ Page 2 │ Page 3 │ Page 4 │ ...│
│  4 KB  │  4 KB  │  4 KB  │  4 KB  │  4 KB  │    │
└────────┴────────┴────────┴────────┴────────┴────┘

With a 4 KB page size and 2 MB total buffer, you get approximately 500 pages. Each page holds multiple MQTT messages packed sequentially.

Page States

Every page exists in exactly one of three states:

  • Free: Available for new data. Part of a singly-linked free list.
  • Work: Currently being filled with incoming data. Only one work page exists at a time.
  • Used: Full of data, waiting to be transmitted. Part of a singly-linked FIFO queue.
Free Pages → [P5] → [P6] → [P7] → null
Work Page  → [P3] (currently filling)
Used Pages → [P0] → [P1] → [P2] → null
              ↑ sending   ↑ waiting
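
A minimal C sketch of the data structures behind these three states might look like this (the names page_t, ring_buffer_t, and the exact field layout are illustrative, not any particular product's implementation):

#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE   4096           /* one page, sized to fit at least one full MQTT payload */
#define PAGE_COUNT  512            /* 512 x 4 KB = 2 MB total buffer */

typedef struct page {
    struct page *next;             /* link to the next page in whichever list this page is on */
    uint8_t     *start_p;          /* first byte of this page's data area */
    uint8_t     *write_p;          /* next byte to write (producer side) */
    uint8_t     *read_p;           /* next byte to transmit (consumer side) */
    uint8_t      data[PAGE_SIZE];  /* packed messages: [id][size][payload] ... */
} page_t;

typedef struct {
    pthread_mutex_t lock;          /* single mutex protecting every list operation */
    page_t *free_head;             /* singly-linked free list */
    page_t *used_head, *used_tail; /* FIFO of full pages waiting for transmission */
    page_t *work;                  /* the one page currently being filled */
    int     connected;             /* MQTT connection state */
    int     packet_sent;           /* 1 while exactly one publish is in flight */
    page_t  pages[PAGE_COUNT];     /* the whole buffer, allocated once at startup */
} ring_buffer_t;

At startup, every page is pushed onto the free list; the work page and the used queue start empty.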

The Write Path

When a batch of PLC tag values arrives from the data collection layer:

  1. Check the work page: If there's no current work page, pop one from the free list. If the free list is empty, steal the oldest used page (overflow — we're losing old data to make room for new data, which is the correct trade-off for operational monitoring).

  2. Calculate fit: Each message is packed as: [4-byte message ID] [4-byte message size] [message payload]. Check if the current work page has enough remaining space for this overhead plus the payload.

  3. If it fits: Write the message ID (initially zero — will be filled by the MQTT client), the size, and the payload. Advance the write pointer.

  4. If it doesn't fit: Move the current work page to the tail of the used queue. Pop a new page from the free list (or steal from used queue). Write into the new page.

Page Internal Layout:
┌───────────┬───────────┬───────────┬───────────┬───────────┬───────────┐
│ msg_id_1  │ msg_sz_1  │ payload_1 │ msg_id_2  │ msg_sz_2  │ payload_2 │
│ (4 bytes) │ (4 bytes) │ (N bytes) │ (4 bytes) │ (4 bytes) │ (M bytes) │
└───────────┴───────────┴───────────┴───────────┴───────────┴───────────┘
                                                             ↑ write_p (current position)
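
Continuing the illustrative structs from the Page States section, the write path could be sketched roughly as follows (buffer_add and get_writable_page are hypothetical helper names; the caller is assumed to hold the buffer mutex):

/* Uses page_t / ring_buffer_t from the earlier sketch. Caller holds buf->lock. */
#include <string.h>

static page_t *get_writable_page(ring_buffer_t *buf)
{
    if (buf->work)
        return buf->work;
    if (buf->free_head) {                         /* normal case: pop a free page */
        buf->work = buf->free_head;
        buf->free_head = buf->free_head->next;
    } else if (buf->used_head) {                  /* overflow: steal the oldest used page */
        buf->work = buf->used_head;
        buf->used_head = buf->used_head->next;
        if (!buf->used_head)
            buf->used_tail = NULL;
    } else {
        return NULL;                              /* nothing available (can't happen with >1 page) */
    }
    buf->work->next = NULL;
    buf->work->start_p = buf->work->write_p = buf->work->read_p = buf->work->data;
    return buf->work;
}

/* Append one message framed as [4-byte id][4-byte size][payload].
 * Assumes size + 8 <= PAGE_SIZE. */
static void buffer_add(ring_buffer_t *buf, const void *payload, uint32_t size)
{
    const uint32_t overhead = 2 * sizeof(uint32_t);
    page_t *p = get_writable_page(buf);
    if (!p)
        return;

    if ((size_t)(p->data + PAGE_SIZE - p->write_p) < overhead + size) {
        /* Doesn't fit: move the work page to the tail of the used FIFO ... */
        if (buf->used_tail) buf->used_tail->next = p; else buf->used_head = p;
        buf->used_tail = p;
        buf->work = NULL;
        p = get_writable_page(buf);               /* ... and start a fresh page */
    }

    uint32_t zero_id = 0;                         /* message id is filled in at publish time */
    memcpy(p->write_p, &zero_id, sizeof zero_id);  p->write_p += sizeof zero_id;
    memcpy(p->write_p, &size,    sizeof size);     p->write_p += sizeof size;
    memcpy(p->write_p, payload,  size);            p->write_p += size;
}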

The Send Path

The MQTT send logic runs after every write operation and follows strict rules:

  1. Check prerequisites: Connection must be up (connected == 1) AND no packet currently in-flight (packet_sent == 0). If either fails, do nothing — the data is safely buffered.

  2. Select the send source: If there are used pages, send from the first one in the FIFO. If no used pages exist but the work page has data, promote the work page to used and send from it.

  3. Read the next message from the current page's read pointer: extract the size, get the data pointer, and call mosquitto_publish() with QoS 1.

  4. Mark packet as in-flight: Set packet_sent = 1. This is critical — only one message can be in-flight at a time. This prevents the thundering herd problem and ensures ordered delivery.

  5. Wait for acknowledgment: The MQTT client library calls the publish callback when the broker confirms receipt (PUBACK for QoS 1). Only then do we advance the read pointer and send the next message.
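
A rough sketch of that send logic with the Mosquitto C client (try_send_next is a hypothetical name; the mosquitto handle and topic are assumed to be configured elsewhere, and the caller holds the buffer mutex):

/* Uses ring_buffer_t / page_t from the earlier sketches. Caller holds buf->lock. */
#include <mosquitto.h>
#include <stdbool.h>
#include <string.h>

static void try_send_next(ring_buffer_t *buf, struct mosquitto *mosq, const char *topic)
{
    if (!buf->connected || buf->packet_sent)
        return;                                   /* rule 1: must be connected and idle */

    page_t *p = buf->used_head;
    if (!p) {                                     /* rule 2: promote the work page if it has data */
        if (buf->work && buf->work->write_p > buf->work->read_p) {
            p = buf->used_head = buf->used_tail = buf->work;
            buf->work = NULL;
        } else {
            return;                               /* nothing to send */
        }
    }

    uint32_t size;                                /* rule 3: next framed message at read_p */
    memcpy(&size, p->read_p + sizeof(uint32_t), sizeof size);
    const void *payload = p->read_p + 2 * sizeof(uint32_t);

    int mid = 0;
    if (mosquitto_publish(mosq, &mid, topic, (int)size, payload, 1, false) == MOSQ_ERR_SUCCESS) {
        memcpy(p->read_p, &mid, sizeof(uint32_t)); /* remember which packet id covers this message */
        buf->packet_sent = 1;                      /* rule 4: only one message in flight at a time */
    }
    /* rule 5: the publish callback advances read_p once the PUBACK arrives */
}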

The Acknowledgment Path

When the Mosquitto library fires the on_publish callback with a packet ID:

  1. Verify the ID matches the in-flight message on the current used page
  2. Advance the read pointer past the delivered message (skip message ID + size + payload bytes)
  3. Check if page is fully delivered: If read_p >= write_p, move the page back to the free list
  4. Clear the in-flight flag: Set packet_sent = 0
  5. Immediately attempt to send the next message — this creates a natural flow control where messages are delivered as fast as the broker can acknowledge them
Delivery Flow:

                 publish()
  [Used Page] ──────────────────→ [MQTT Broker]
       ↑                                │
       │              PUBACK            │
       └────────────────────────────────┘
        advance read_p, try next
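
The acknowledgment side could be sketched like this (on_publish follows the Mosquitto callback signature; TELEMETRY_TOPIC is a placeholder, and registration via mosquitto_publish_callback_set and mosquitto_user_data_set is assumed to happen at startup):

/* Uses ring_buffer_t / page_t and try_send_next() from the earlier sketches. */
#include <mosquitto.h>
#include <pthread.h>
#include <string.h>

#define TELEMETRY_TOPIC "factory/telemetry"              /* placeholder topic name */

static void on_publish(struct mosquitto *mosq, void *userdata, int mid)
{
    ring_buffer_t *buf = userdata;
    pthread_mutex_lock(&buf->lock);

    page_t *p = buf->used_head;
    if (p && buf->packet_sent) {
        uint32_t sent_id, size;
        memcpy(&sent_id, p->read_p, sizeof sent_id);
        memcpy(&size, p->read_p + sizeof(uint32_t), sizeof size);

        if ((int)sent_id == mid) {                       /* 1: id matches the in-flight message */
            p->read_p += 2 * sizeof(uint32_t) + size;    /* 2: skip id + size + payload */
            if (p->read_p >= p->write_p) {               /* 3: page fully delivered */
                buf->used_head = p->next;
                if (!buf->used_head)
                    buf->used_tail = NULL;
                p->next = buf->free_head;                /*    hand the page back to the free list */
                buf->free_head = p;
            }
            buf->packet_sent = 0;                        /* 4: clear the in-flight flag */
            try_send_next(buf, mosq, TELEMETRY_TOPIC);   /* 5: immediately chain the next message */
        }
    }
    pthread_mutex_unlock(&buf->lock);
}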

Thread Safety: The Mutex Dance

In a real gateway, data collection and MQTT delivery run on different threads. The PLC polling loop writes data every second, while the Mosquitto client library fires callbacks from its own network thread. Every buffer operation — add, send, acknowledge, connect, disconnect — must be wrapped in a mutex:

// Data collection thread:
mutex_lock(buffer)
add_data(payload)
try_send_next() // opportunistic send
mutex_unlock(buffer)

// MQTT callback thread:
mutex_lock(buffer)
mark_delivered(packet_id)
try_send_next() // chain next send
mutex_unlock(buffer)

The key insight is that try_send_next() is called from both threads — after every write (in case we're connected and idle) and after every acknowledgment (to chain the next message). This ensures maximum throughput without busy-waiting.

Handling Disconnects Gracefully

When the MQTT connection drops, two things happen:

  1. The disconnect callback fires: Set connected = 0 and packet_sent = 0. The in-flight message is NOT lost — it's still in the page at the current read pointer. When connectivity returns, it will be re-sent.

  2. Data keeps flowing in: The PLC polling loop doesn't stop. New data continues to fill pages. The used queue grows. If it fills all available pages, new pages will steal from the oldest used pages — but this only happens under extreme sustained outages.

When the connection re-establishes:

  1. The connect callback fires: Set connected = 1 and trigger try_send_next()
  2. Buffered data starts flowing: Messages are delivered in FIFO order, one at a time, with acknowledgment pacing

This means the broker receives data in chronological order, with timestamps embedded in each batch. Analytics systems downstream can seamlessly handle the gap — they see a burst of historical data followed by real-time data, all correctly timestamped.
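
In Mosquitto terms, the connection-state handling above might look like the following sketch (the callbacks would be registered with mosquitto_connect_callback_set and mosquitto_disconnect_callback_set; error handling is trimmed):

/* Uses ring_buffer_t and try_send_next() / TELEMETRY_TOPIC from the earlier sketches. */
#include <mosquitto.h>
#include <pthread.h>

static void on_connect(struct mosquitto *mosq, void *userdata, int rc)
{
    ring_buffer_t *buf = userdata;
    if (rc != 0)
        return;                                   /* connection refused; retry handled elsewhere */
    pthread_mutex_lock(&buf->lock);
    buf->connected = 1;
    try_send_next(buf, mosq, TELEMETRY_TOPIC);    /* start draining buffered pages in FIFO order */
    pthread_mutex_unlock(&buf->lock);
}

static void on_disconnect(struct mosquitto *mosq, void *userdata, int rc)
{
    ring_buffer_t *buf = userdata;
    (void)mosq; (void)rc;
    pthread_mutex_lock(&buf->lock);
    buf->connected = 0;
    buf->packet_sent = 0;      /* the in-flight message stays at read_p and is re-sent later */
    pthread_mutex_unlock(&buf->lock);
}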

The Cloud Watchdog: Detecting Silent Failures

There's a subtle failure mode: the MQTT connection appears healthy (no disconnect callback), but data isn't actually being delivered. This can happen with certain TLS middlebox issues, stale TCP connections that haven't timed out, or Azure IoT Hub token expirations.

The solution is a delivery watchdog:

  1. Track the timestamp of the last successful packet delivery
  2. On a periodic check (every 120 seconds), compare the current time against the last delivery timestamp
  3. If no data has been delivered in 120 seconds AND the connection claims to be up, force a reconnection:
    • Reset the MQTT configuration timestamp (triggers config reload)
    • Clear the watchdog timer
    • The main loop will detect the stale configuration and restart the MQTT client
if (now - last_delivery_time > 120s) AND (connected) {
    log("No data delivered in 120s — forcing MQTT reconnect")
    force_mqtt_restart()
}

This catches the "zombie connection" problem that plagues many IIoT deployments — the gateway thinks it's sending, but nothing is actually arriving at the cloud.

Binary vs. JSON: The Bandwidth Trade-off

The paged buffer doesn't care about the payload format — it stores raw bytes. But the choice between JSON and binary encoding has massive implications for buffer utilization:

JSON payload for one tag reading:

{"id":42,"values":[23.7],"ts":1709337600}

~45 bytes per reading.

Binary payload for the same reading:

Tag ID:     2 bytes (uint16)
Status:     1 byte
Value Cnt:  1 byte
Value Sz:   1 byte
Value:      4 bytes (float32)
─────────────────────────────
Total:      9 bytes per reading

That's a 5x reduction. With batching (multiple readings per batch header), the per-reading overhead drops further because the timestamp and device identity are shared across a group of values.

On a cellular connection billing per megabyte, this isn't academic — it's the difference between $15/month and $75/month per gateway. On satellite connections (Iridium, Starlink maritime), it can be $50 vs. $250.

Binary Batch Wire Format

A binary batch on the wire follows this structure:

[0xF7]                — 1 byte, magic/version marker
[num_groups]          — 4 bytes, big-endian uint32
For each group:
    [timestamp]       — 4 bytes, big-endian time_t
    [device_type]     — 2 bytes, big-endian uint16
    [serial_number]   — 4 bytes, big-endian uint32
    [num_values]      — 4 bytes, big-endian uint32
    For each value:
        [tag_id]          — 2 bytes, big-endian uint16
        [status]          — 1 byte (0 = OK, else error code)
        If status == 0:
            [values_count]    — 1 byte
            [value_size]      — 1 byte (1, 2, or 4)
            [values...]       — values_count × value_size bytes

A batch of 50 tag readings fits in ~600 bytes binary versus ~3,000 bytes JSON. Over a 4-hour outage with 200 tags at 60-second intervals (roughly 48,000 readings), that's the difference between buffering ~0.6 MB (binary) and ~2.9 MB (JSON): comfortably inside a 2 MB buffer in one case, overflowing it in the other.
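
As an illustration of this wire format, a single-value batch could be packed like the sketch below (a hedged example of the layout described above, not a reference encoder; put_u16 and put_u32 are local helpers):

#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Big-endian write helpers. */
static size_t put_u16(uint8_t *p, uint16_t v) { p[0] = (uint8_t)(v >> 8); p[1] = (uint8_t)v; return 2; }
static size_t put_u32(uint8_t *p, uint32_t v) { p[0] = (uint8_t)(v >> 24); p[1] = (uint8_t)(v >> 16); p[2] = (uint8_t)(v >> 8); p[3] = (uint8_t)v; return 4; }

/* Pack one group containing one OK float32 value, following the layout above. */
static size_t pack_single_reading(uint8_t *out, time_t ts, uint16_t device_type,
                                  uint32_t serial, uint16_t tag_id, float value)
{
    size_t n = 0;
    out[n++] = 0xF7;                          /* magic/version marker */
    n += put_u32(out + n, 1);                 /* num_groups = 1 */
    n += put_u32(out + n, (uint32_t)ts);      /* group timestamp */
    n += put_u16(out + n, device_type);
    n += put_u32(out + n, serial);
    n += put_u32(out + n, 1);                 /* num_values = 1 */
    n += put_u16(out + n, tag_id);
    out[n++] = 0;                             /* status = 0 (OK) */
    out[n++] = 1;                             /* values_count = 1 */
    out[n++] = 4;                             /* value_size = 4 (float32) */
    uint32_t raw;                             /* raw float bits, written big-endian here */
    memcpy(&raw, &value, sizeof raw);
    n += put_u32(out + n, raw);
    return n;                                 /* 28 bytes for this single-value batch */
}

Packing more values per group amortizes the ~15-byte group header, which is how the per-reading cost approaches the ~9 bytes used in the sizing math below.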

Sizing Your Buffer: The Math

For a given deployment, calculate your buffer needs:

Tags: 200
Read interval: 60 seconds
Binary payload per reading: ~9 bytes
Readings per minute: 200
Bytes per minute: 200 × 9 = 1,800 bytes
With batch overhead (~15 bytes per group): ~1,815 bytes/min

Buffer size: 2 MB = 2,097,152 bytes
Retention: 2,097,152 / 1,815 = ~1,155 minutes = ~19.2 hours

So a 2 MB buffer can hold approximately 19 hours of data for 200 tags at 60-second intervals using binary encoding. With JSON, that drops to ~3.8 hours. Size your buffer accordingly.
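
The same arithmetic as a quick helper, if you want to script the sizing for different tag counts and intervals (a trivial sketch; adjust bytes-per-reading for your encoding):

#include <stdio.h>

/* Estimated buffer retention in hours for a steady polling load. */
static double retention_hours(double buffer_bytes, int tags, double interval_s,
                              double bytes_per_reading, double group_overhead_per_min)
{
    double readings_per_min = tags * (60.0 / interval_s);
    double bytes_per_min = readings_per_min * bytes_per_reading + group_overhead_per_min;
    return buffer_bytes / bytes_per_min / 60.0;
}

int main(void)
{
    /* 2 MB buffer, 200 tags, 60 s interval, ~9 bytes binary, ~15 bytes/min group overhead */
    printf("%.1f hours\n", retention_hours(2.0 * 1024 * 1024, 200, 60, 9, 15));
    return 0;                                 /* prints roughly 19.3 hours, matching the math above */
}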

What machineCDN Does Differently

machineCDN's edge gateway implements this paged ring buffer architecture natively. Every gateway shipped includes:

  • Fixed 2 MB paged buffer with configurable page sizes matching the MQTT broker's maximum packet size
  • Automatic binary encoding for all telemetry — 5x bandwidth reduction over JSON
  • Single-message flow control with QoS 1 acknowledgment tracking — no thundering herd on reconnect
  • 120-second delivery watchdog that detects zombie connections and forces reconnect
  • Graceful overflow handling — when buffer fills, oldest data is recycled (not newest), preserving the most recent operational state

For plant engineers, this means deploying a gateway on a cellular connection and knowing that a connectivity outage — whether 30 seconds or 12 hours — won't result in lost data. The buffer holds, the watchdog monitors, and data flows in order when the link comes back.

Key Takeaways

  1. Never use unbounded queues for industrial telemetry buffering — use fixed-size paged buffers that degrade gracefully under memory pressure
  2. One message in-flight at a time prevents the thundering herd problem and ensures ordered delivery
  3. Always track delivery acknowledgments — don't just publish and forget; verify the broker received each packet before advancing
  4. Implement a delivery watchdog — silent MQTT failures are harder to detect than disconnects
  5. Use binary encoding — 5x bandwidth reduction means 5x longer buffer retention on the same memory
  6. Size for your worst outage — calculate how much buffer you need based on tag count, interval, and the longest connectivity gap you expect
  7. Thread safety is non-negotiable — data collection and MQTT delivery run concurrently; every buffer operation needs mutex protection

The paged ring buffer isn't exotic computer science — it's a practical engineering pattern that's been battle-tested in thousands of industrial deployments. The difference between a prototype IIoT system and a production one often comes down to exactly this kind of infrastructure.

Planned Production Time vs Actual: How IIoT Closes the Capacity Gap in Manufacturing

· 10 min read
MachineCDN Team
Industrial IoT Experts

Every production manager has been asked the same question by their VP of Operations: "How much more capacity do we have?" And every production manager has given the same answer with varying degrees of confidence: "We think we have about 15-20% more capacity, but it depends."

It depends on downtime. It depends on changeovers. It depends on which products are running. It depends on whether the Tuesday night shift actually gets 7.5 hours of production out of their 8-hour shift or whether they lose 90 minutes to startup, cleanup, and that recurring alarm on Press 4.

The gap between planned production time and actual productive time is the single largest source of hidden capacity in manufacturing. According to a study by the Aberdeen Group, the average manufacturer operates at 65-72% capacity utilization — meaning 28-35% of available production time is consumed by downtime, changeovers, slow cycles, and other losses that are rarely measured accurately.

IIoT platforms close this gap by measuring exactly what happens during every minute of planned production time. Not what is supposed to happen. Not what operators report happened. What actually happened, based on real-time machine data.

Prescriptive Maintenance for Manufacturing: Beyond Prediction — What to Do When Your AI Tells You Something's Wrong

· 9 min read
MachineCDN Team
Industrial IoT Experts

Predictive maintenance tells you that something is going to fail. Prescriptive maintenance tells you what to do about it. That distinction sounds subtle, but in practice it's the difference between a maintenance team that gets alerts they don't know how to act on, and one that receives specific, actionable guidance that prevents failures with minimal disruption.

Securing Industrial MQTT and OT Networks: TLS, Certificates, and Zero-Trust for the Factory Floor [2026]

· 13 min read

The edge gateway sitting on your factory floor is talking to the cloud. It's reading temperature, pressure, and flow data from PLCs over Modbus, packaging it into MQTT messages, and publishing to a broker that might be Azure IoT Hub, AWS IoT Core, or a self-hosted Mosquitto instance. The question isn't whether that data path is valuable — it's whether anyone else is listening.

Industrial MQTT security isn't a theoretical exercise. A compromised edge gateway can inject false telemetry (making operators think everything is fine when it isn't), intercept production data (exposing process parameters to competitors), or pivot into the OT network to reach PLCs directly. This guide covers the practical measures that actually protect these systems.

How to Implement Shift Handover Reporting with IIoT Data: Eliminate Information Gaps Between Shifts

· 10 min read
MachineCDN Team
Industrial IoT Experts

Every manufacturing plant knows the problem: Shift A finishes at 3 PM, Shift B walks in at 3:15 PM, and somewhere in those 15 minutes, critical information evaporates. The injection molder on Line 4 has been running hot for the last two hours. The conveyor on Line 7 threw a fault code at 2:45 PM but was cleared manually. The quality team rejected a batch at 1 PM and nobody documented why.

This isn't a people problem — it's a systems problem. And IIoT data solves it completely.

Thread-Safe Telemetry Pipelines: Building Concurrent IIoT Edge Gateways That Don't Lose Data [2026]

· 17 min read

An edge gateway on a factory floor isn't a REST API handling one request at a time. It's a real-time system juggling multiple competing demands simultaneously: polling a PLC for tag values every second, buffering data locally when the cloud connection drops, transmitting batched telemetry over MQTT, processing incoming configuration commands from the cloud, and monitoring its own health — all at once, on hardware with the computing power of a ten-year-old smartphone.

Get the concurrency wrong, and you don't get a 500 error in your logs. You get silent data loss, corrupted telemetry batches, or — worst case — a watchdog reboot loop that takes your monitoring offline during a critical production run.

This guide covers the architecture patterns that make industrial edge gateways reliable under real-world conditions: concurrent PLC polling, thread-safe buffering, MQTT delivery guarantees, and the store-and-forward patterns that keep data flowing when the network doesn't.

Thread-safe edge gateway architecture with concurrent data pipelines

The Concurrency Challenge in Industrial Edge Gateways

A typical edge gateway has at least three threads running concurrently:

  1. The polling thread — reads tags from PLCs at configured intervals (1-second to 60-second cycles)
  2. The MQTT network thread — manages the broker connection, handles publish/subscribe, reconnection
  3. The main control thread — processes incoming commands, monitors watchdog timers, manages configuration

These threads all share one critical resource: the outgoing data buffer. The polling thread writes telemetry into the buffer. The MQTT thread reads from the buffer and transmits data. When the connection drops, the buffer must hold data without the polling thread stalling. When the connection recovers, the buffer must drain in order without losing or duplicating messages.

This is a classic producer-consumer problem, but with industrial constraints that make textbook solutions insufficient.

Why Standard Queues Fall Short

Your first instinct might be to use a thread-safe queue — a ConcurrentLinkedQueue in Java, a queue.Queue in Python, or a lock-free ring buffer. These work fine for web applications, but industrial edge gateways have constraints that break standard queue implementations:

1. Memory Is Fixed and Finite

Edge gateways run on embedded hardware with 64 MB to 512 MB of RAM — no swap space, no dynamic allocation after startup. An unbounded queue will eventually exhaust memory during a long network outage. A fixed-size queue forces you to choose: block the producer (stalling PLC polling) or drop the oldest data.

2. Network Outages Last Hours, Not Seconds

In a factory, network outages aren't transient blips. A fiber cut, a misconfigured switch, or a power surge on the network infrastructure can take connectivity down for hours. Your buffer needs to hold potentially thousands of telemetry batches — not just a few dozen.

3. Delivery Confirmation Is Asynchronous

MQTT QoS 1 guarantees at-least-once delivery, but the PUBACK confirmation comes back asynchronously — possibly hundreds of milliseconds after the PUBLISH. During that window, you can't release the buffer space (the message might need retransmission), and you can't stall the producer (PLC data keeps flowing).

4. Data Must Survive Process Restarts

If the edge gateway daemon restarts (due to a configuration update, a watchdog trigger, or a power cycle), buffered-but-undelivered data must be recoverable. Purely in-memory queues lose everything.

The Paged Ring Buffer Pattern

The pattern that works in production is a paged ring buffer — a fixed-size memory region divided into pages, with explicit state tracking for each page. Here's how it works:

Memory Layout

At startup, the gateway allocates a single contiguous memory block and divides it into equal-sized pages:

┌─────────┬─────────┬─────────┬─────────┬─────────┐
│ Page 0 │ Page 1 │ Page 2 │ Page 3 │ Page 4 │
│ FREE │ FREE │ FREE │ FREE │ FREE │
└─────────┴─────────┴─────────┴─────────┴─────────┘

Each page has its own header tracking:

  • A page number (for logging and debugging)
  • A start_p pointer (beginning of writable space)
  • A write_p pointer (current write position)
  • A read_p pointer (current read position for transmission)
  • A next pointer (linking to the next page in whatever list it's in)

Three Page Lists

Pages move between three linked lists:

  1. Free pages — available for the producer to write into
  2. Used pages — full of data, queued for transmission
  3. Work page — the single page currently being written to
Producer (Polling Thread)                    Consumer (MQTT Thread)
     │
     ▼
┌──────────┐                                 ┌──────────┐
│Work Page │──────────── When full ─────────►│Used Pages│──► MQTT Publish
│(writing) │                                 │(queued)  │
└──────────┘                                 └──────────┘
     ▲                                            │
     │              When delivered                │
┌──────────┐                                      │
│Free Pages│◄─────────────────────────────────────┘
│(empty)   │
└──────────┘

The Producer Path

When the polling thread has a new batch of tag values to store:

  1. Check the work page — if there's no current work page, grab one from the free list
  2. Calculate space — check if the new data fits in the remaining space on the work page
  3. If it fits — write the data (with a size header) and advance write_p
  4. If it doesn't fit — move the work page to the used list, grab a new page (from free, or steal the oldest from used if free is empty), and write there
  5. After writing — check if there's data ready to transmit and kick the consumer

The critical detail: if the free list is empty, the producer steals the oldest used page. This means during extended outages, the buffer wraps around and overwrites the oldest data — exactly the behavior you want. Recent data is more valuable than stale data in industrial monitoring.

The Consumer Path

When the MQTT connection is active and there's data to send:

  1. Check the used page list — if empty, check if the work page has unsent data and promote it
  2. Read the next message from the first used page's read_p position
  3. Publish via MQTT with QoS 1
  4. Set a "packet sent" flag — this prevents sending the next message until the current one is acknowledged
  5. Wait for PUBACK — when the broker confirms receipt, advance read_p
  6. If read_p reaches write_p — the page is fully delivered; move it back to the free list
  7. Repeat — grab the next message from the next used page

The Mutex Strategy

The entire buffer is protected by a single mutex. This might seem like a bottleneck, but in practice:

  • Write operations (adding data) take microseconds
  • Read operations (preparing to transmit) take microseconds
  • The actual MQTT transmission happens outside the mutex — only the buffer state management is locked

The mutex is held for a few microseconds at a time, never during network I/O. This keeps the polling thread from ever blocking on network latency.

Polling Thread:                MQTT Thread:
  lock(mutex)                    lock(mutex)
  write data to page             read data from page
  check if page full             mark as sent
  maybe promote page             unlock(mutex)
  trigger send check             ─── MQTT publish ───
  unlock(mutex)                  (outside mutex!)
                                 lock(mutex)
                                 process PUBACK
                                 maybe free page
                                 unlock(mutex)

Message Framing Inside Pages

Each page holds multiple messages packed sequentially. Each message has a simple header:

┌──────────────┬──────────────┬─────────────────────┐
│  Message ID  │ Message Size │    Message Body     │
│  (4 bytes)   │  (4 bytes)   │     (variable)      │
└──────────────┴──────────────┴─────────────────────┘

The Message ID field is initially zero. When the MQTT library publishes the message, it fills in the packet ID assigned by the broker. This is how the consumer tracks which specific message was acknowledged — when the PUBACK callback fires with a packet ID, it can match it to the message at read_p and advance.

This framing makes the buffer self-describing. During recovery after a restart, the gateway can scan page contents by reading size headers sequentially.
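
If the pages live in persistent or memory-mapped storage, a recovery scan could look like this sketch (it assumes unused page space is zero-filled, so a zero size header marks the end of valid data; that is an assumption, not something the framing itself guarantees):

#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 4096

/* Walk the [id][size][payload] framing from the start of a page and return
 * how many bytes hold valid messages, i.e. the restored write offset. */
static size_t scan_page(const uint8_t *page_data)
{
    const size_t hdr = 2 * sizeof(uint32_t);
    size_t off = 0;

    while (off + hdr <= PAGE_SIZE) {
        uint32_t size;
        memcpy(&size, page_data + off + sizeof(uint32_t), sizeof size);
        if (size == 0 || off + hdr + size > PAGE_SIZE)
            break;                          /* end of valid data, or a torn/partial write */
        off += hdr + size;                  /* skip id + size + payload */
    }
    return off;                             /* write_p would be restored to page start + off */
}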

Handling Disconnections Gracefully

When the MQTT connection drops, the consumer thread must handle it without corrupting the buffer:

Connection Lost:
1. Set connected = 0
2. Clear "packet sent" flag
3. Do NOT touch any page pointers

That's it. The producer keeps writing — it doesn't know or care about the connection state. The buffer absorbs data normally.

When the connection recovers:

Connection Restored:
1. Set connected = 1
2. Trigger send check (under mutex)
3. Consumer picks up where it left off

The key insight: the "packet sent" flag prevents double-sending. If a PUBLISH was in flight when the connection dropped, the PUBACK never arrived. The flag remains set, but the disconnection handler clears it. When the connection recovers, the consumer re-reads the same message from read_p (which was never advanced) and re-publishes it. The broker either receives it for the first time or receives a duplicate; QoS 1 is at-least-once, so duplicates are possible by design and downstream consumers must tolerate the occasional repeated message.

Binary vs. JSON Batch Encoding

The telemetry data written into the buffer can be encoded in two formats, and the choice affects both bandwidth and reliability.

JSON Format

Each batch is a JSON object containing groups of timestamped values:

{
  "groups": [
    {
      "ts": 1709424000,
      "device_type": 1017,
      "serial_number": 123456,
      "values": [
        {"id": 80, "values": [725]},
        {"id": 81, "values": [680]},
        {"id": 82, "values": [285]}
      ]
    }
  ]
}

Pros: Human-readable, easy to debug, parseable by any language. Cons: 5-8× larger than binary, float precision loss (decimal representation), size estimation is rough.

Binary Format

A compact binary encoding with a header byte (0xF7), followed by big-endian packed groups:

F7                      ← Header
00 00 00 01             ← Number of groups (1)
65 E3 CC 80             ← Timestamp (Unix epoch 1709424000)
03 F9                   ← Device type (1017)
00 01 E2 40             ← Serial number (123456)
00 00 00 03             ← Number of values (3)
00 50 00 01 02 02 D5    ← Tag 80: status=0, 1 value, 2 bytes, 725
00 51 00 01 02 02 A8    ← Tag 81: status=0, 1 value, 2 bytes, 680
00 52 00 01 02 01 1D    ← Tag 82: status=0, 1 value, 2 bytes, 285

Pros: 5-8× smaller, perfect float fidelity (raw bytes preserved), exact size calculation. Cons: Requires matching decoder on the cloud side, harder to debug without tools.

For gateways communicating over cellular connections — common in remote facilities like water treatment plants, oil wells, or distributed renewable energy sites — binary encoding is essentially mandatory. A gateway polling 100 tags every 10 seconds generates about 260 MB/month in JSON versus 35 MB/month in binary. At typical IoT cellular rates ($0.50-$2.00/MB), that's the difference between $130/month and $17/month per gateway.

The MQTT Watchdog Pattern

MQTT connections can enter a zombie state — technically connected according to the TCP stack, but the broker has stopped responding. This is especially common behind industrial firewalls and NAT devices with aggressive connection timeout policies.

The Problem

The MQTT library reports the connection as alive. The gateway publishes messages. No PUBACK comes back — ever. The buffer fills up because the consumer thinks each message is "in flight" (the packet_sent flag is set). Eventually the buffer wraps and data loss begins.

The Solution: Last-Delivered Timestamp

Track the timestamp of the last successful PUBACK. If more than N seconds have passed since the last acknowledged delivery, and there are messages waiting to be sent, the connection is stale:

monitor_watchdog():
    if connected AND packet_sent:
        elapsed = now - last_delivered_packet_timestamp
        if elapsed > WATCHDOG_THRESHOLD:
            // Force disconnect and reconnect
            force_disconnect()
            // Disconnection handler clears packet_sent
            // Reconnection handler will re-deliver from read_p

A typical threshold is 60 seconds for LAN connections and 120 seconds for cellular. This catches zombie connections that the TCP stack and MQTT keep-alive miss.

Reconnection with Backoff

When the watchdog (or a genuine disconnection) triggers a reconnect, use a dedicated thread for the connection attempt. The connect_async call can block for the TCP timeout duration (potentially 30+ seconds), and you don't want that blocking the main loop or the polling thread.

A semaphore controls the reconnection thread:

Main Thread:                      Reconnection Thread:
  Detects need to                   (blocked on semaphore)
  reconnect                                 │
  Posts semaphore  ──────►          Wakes up
                                    Calls connect_async()
                                    (may block 30s)
                                    Success or failure
  Waits for "done" ◄──────          Posts "done" semaphore
  Checks result

The reconnect delay should be fixed and short (5 seconds is typical) for industrial applications, not exponential backoff. In a factory, the network outage either resolves quickly (a transient) or it's a hard failure that needs human intervention. Exponential backoff just delays reconnection after the network recovers.
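
One way to sketch that handoff with POSIX semaphores and mosquitto_connect_async (the worker function, globals, broker address, and fixed 5-second delay are all illustrative; sem_init and pthread_create are assumed to run at startup):

#include <mosquitto.h>
#include <pthread.h>
#include <semaphore.h>
#include <unistd.h>

static sem_t reconnect_request;      /* posted by the main loop when a reconnect is needed */
static sem_t reconnect_done;         /* posted by the worker when the attempt has finished */
static volatile int reconnect_rc;    /* result of the most recent attempt */

/* Dedicated thread: blocks until asked, then makes the (possibly slow) connect call. */
static void *reconnect_worker(void *arg)
{
    struct mosquitto *mosq = arg;
    for (;;) {
        sem_wait(&reconnect_request);
        /* may block for the TCP timeout without stalling polling or the main loop */
        reconnect_rc = mosquitto_connect_async(mosq, "broker.example.com", 8883, 60);
        sem_post(&reconnect_done);
    }
    return NULL;
}

/* Main-loop side: fixed, short delay rather than exponential backoff. */
static void request_reconnect(void)
{
    sleep(5);                        /* fixed 5-second delay before each attempt */
    sem_post(&reconnect_request);
    sem_wait(&reconnect_done);
    /* inspect reconnect_rc; if the attempt failed, the next pass simply tries again */
}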

Batching Strategy: Size vs. Time

Telemetry batches should be finalized and queued for transmission based on whichever threshold hits first: size or time.

Size-Based Finalization

When the accumulated batch data exceeds a configured maximum (typically 400-500 KB for JSON, 50-100 KB for binary), finalize and queue it. This keeps any single MQTT message within the broker's maximum packet size and avoids tying up a slow uplink with one oversized publish.

Time-Based Finalization

When the batch has been collecting data for more than a configured timeout (typically 30-60 seconds), finalize it regardless of size. This ensures that even slowly-changing tags get transmitted within a bounded time window.
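
A sketch of that finalize-on-size-or-time check (the thresholds and the batch_t fields are illustrative):

#include <stddef.h>
#include <stdint.h>
#include <time.h>

#define BATCH_MAX_BYTES (64 * 1024)    /* illustrative size threshold for binary batches */
#define BATCH_MAX_AGE_S 30             /* illustrative time threshold */

typedef struct {
    uint8_t *data;                     /* encoded batch being accumulated */
    size_t   used;                     /* bytes accumulated so far */
    time_t   started_at;               /* when the first group was added */
} batch_t;

/* Finalize when either threshold is hit, whichever comes first. */
static int batch_should_finalize(const batch_t *b, time_t now)
{
    if (b->used == 0)
        return 0;                                  /* nothing collected yet */
    if (b->used >= BATCH_MAX_BYTES)
        return 1;                                  /* size threshold reached */
    if (now - b->started_at >= BATCH_MAX_AGE_S)
        return 1;                                  /* time threshold reached */
    return 0;
}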

The Interaction Between Batching and Buffering

Batching and buffering are separate concerns that interact:

PLC Tags ──► Batch (collecting) ──► Buffer Page (queued) ──► MQTT (transmitted)

Tag reads accumulate in the batch structure. When a batch finalizes, the encoded batch goes into the ring buffer. Pages are then transmitted one at a time with PUBACK confirmation.

A batch contains one or more "groups" — each group is a set of tag values read at the same timestamp. Multiple polling cycles might go into a single batch before it's finalized by size or time. The finalized batch then goes into the ring buffer as a single message.

Dependent Tag Reads and Atomic Groups

In many PLC configurations, certain tags are only meaningful when read together. For example:

  • Alarm word tags — a uint16 register where each bit represents a different alarm. You read the alarm word, then extract the individual bits. If the alarm word changes, you need to read and deliver the extracted bits atomically with the parent.

  • Machine state transitions — when a "blender running" tag changes from 0 to 1, you might need to immediately read all associated process values (RPM, temperatures, pressures) to capture the startup snapshot.

The architecture handles this through dependent tag chains:

Parent Tag (alarm_word, interval=1s, compare=true)
  ├── Calculated Tag (alarm_bit_0, shift=0, mask=0x01)
  ├── Calculated Tag (alarm_bit_1, shift=1, mask=0x01)
  ├── Dependent Tag (motor_speed, read_on_change=true)
  └── Dependent Tag (temperature, read_on_change=true)

When the parent tag changes, the polling thread:

  1. Finalizes the current batch
  2. Recursively reads all dependent tags (forced read, ignoring intervals)
  3. Starts a new batch group with the same timestamp

This ensures that the dependent values are timestamped identically with the trigger event and delivered together.
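
As a sketch, extracting calculated bits from an alarm word and forcing the dependent reads might look like this (tag_t, read_tag, and add_to_batch are hypothetical names standing in for the gateway's own structures):

#include <stdint.h>

typedef struct tag {
    uint16_t     id;
    int          shift, mask;          /* for calculated tags derived from a parent word */
    int          read_on_change;       /* dependent tags are force-read when the parent changes */
    struct tag  *children;             /* calculated + dependent tags chained to this one */
    struct tag  *next;
} tag_t;

/* Assumed to exist elsewhere in the gateway. */
extern int  read_tag(tag_t *t, double *value);
extern void add_to_batch(tag_t *t, double value, uint32_t ts);

/* Parent changed: emit its calculated bits and force-read its dependents,
 * all stamped with the same timestamp as the trigger event. */
static void handle_parent_change(tag_t *parent, uint16_t word, uint32_t ts)
{
    for (tag_t *c = parent->children; c; c = c->next) {
        if (c->read_on_change) {       /* dependent tag: forced read, normal interval ignored */
            double v;
            if (read_tag(c, &v) == 0)
                add_to_batch(c, v, ts);
        } else {                       /* calculated tag: derived from the parent word */
            add_to_batch(c, (double)((word >> c->shift) & c->mask), ts);
        }
    }
}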

Hourly Full-Read Reset

Change-of-value (COV) filtering dramatically reduces bandwidth, but it introduces a subtle failure mode: if a value changes during a transient read error, the gateway might never know it changed.

Here's the scenario:

  1. At 10:00:00, tag value = 72.5 → transmitted
  2. At 10:00:01, PLC returns an error for that tag → not transmitted
  3. At 10:00:02, tag value = 73.0 → compared against last successful read (72.5), change detected, transmitted
  4. Even if the error at 10:00:01 masked a real change to 73.0, the change is still caught: the 10:00:02 reading is compared against the last successful value (72.5), not against the failed read.

The real problem is when:

  1. At 10:00:00, tag value = 72.5 → transmitted
  2. The PLC program changes the tag to 73.0 and then back to 72.5 between polling cycles
  3. The gateway never sees 73.0 — it polls at 10:00:00 and 10:00:01 and gets 72.5 both times

For most industrial applications, this sub-second transient is irrelevant. But to guard against drift — where small rounding differences accumulate between the gateway's cached value and the PLC's actual value — a full reset is performed every hour:

Every hour boundary (when the system clock's hour changes):
1. Clear the "read once" flag on every tag
2. Clear all last-known values
3. Force read and transmit every tag regardless of COV

This guarantees that the cloud platform has a complete snapshot of every tag value at least once per hour, even for tags that haven't changed.

Putting It All Together: The Polling Loop

Here's the complete polling loop architecture that ties all these patterns together:

main_polling_loop():
    FOREVER:
        current_time = monotonic_clock()

        FOR each configured device:
            // Hourly reset check
            if hour(current_time) != hour(last_poll_time):
                reset_all_tags(device)

            // Start a new batch group
            start_group(device.batch, unix_timestamp())

            FOR each tag in device.tags:
                // Check if this tag needs reading now
                if not tag.read_once OR elapsed(tag.last_read) >= tag.interval:

                    value, status = read_tag(device, tag)

                    if status == LINK_ERROR:
                        set_link_state(device, DOWN)
                        break   // Stop reading this device

                    set_link_state(device, UP)

                    // COV check
                    if tag.compare AND tag.read_once:
                        if value == tag.last_value AND status == tag.last_status:
                            continue   // No change, skip

                    // Deliver value
                    if tag.do_not_batch:
                        deliver_immediately(device, tag, value)
                    else:
                        add_to_batch(device.batch, tag, value)

                    // Check dependent tags
                    if value_changed AND tag.has_dependents:
                        finalize_batch()
                        read_dependents(device, tag)
                        start_new_group()

                    // Update tracking
                    tag.last_value = value
                    tag.last_status = status
                    tag.read_once = true
                    tag.last_read = current_time

            // Finalize batch group
            stop_group(device.batch, output_buffer)
            // ↑ This checks size/time thresholds and may
            //   queue the batch into the ring buffer

        sleep(polling_interval)

Performance Characteristics

On a typical industrial edge gateway (ARM Cortex-A9, 512 MB RAM, Linux):

Operation                         Time         Notes
Mutex lock/unlock                 ~1 µs        Per buffer operation
Modbus TCP read (10 registers)    5-15 ms      Network dependent
Modbus RTU read (10 registers)    20-50 ms     Baud rate dependent (9600-115200)
EtherNet/IP tag read              2-8 ms       CIP overhead
JSON batch encoding               0.5-2 ms     100 tags
Binary batch encoding             0.1-0.5 ms   100 tags
MQTT publish (QoS 1)              1-5 ms       LAN broker
Buffer page write                 5-20 µs      memcpy only
The bottleneck is always the PLC protocol reads, not the buffer or transmission logic. A gateway polling 200 Modbus TCP tags can complete a full cycle in under 200 ms, leaving plenty of headroom for a 1-second polling interval.

For Modbus RTU (serial), the bottleneck shifts to the baud rate. At 9600 baud, a single register read takes ~15 ms including response. Polling 50 registers individually would take 750 ms — too close to a 1-second interval. This is why contiguous register grouping matters: reading 50 consecutive registers in a single request takes about 50 ms, a 15× improvement.
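
With libmodbus, the grouping idea can be sketched as follows (a simplification: addresses are assumed pre-sorted, and real gateways also split groups that cross non-readable gaps in the register map):

#include <modbus.h>
#include <stdint.h>

/* Read a sorted list of holding-register addresses, merging contiguous runs
 * into single requests. One request for 50 consecutive registers replaces 50
 * round-trips -- the ~15x improvement described above for RTU links. */
static int read_grouped(modbus_t *ctx, const int *addrs, int count, uint16_t *out)
{
    int i = 0;
    while (i < count) {
        int start = i;
        /* extend the run while addresses stay consecutive, capped at the
           Modbus limit of 125 registers per read request */
        while (i + 1 < count && addrs[i + 1] == addrs[i] + 1 && (i + 1 - start) < 125)
            i++;

        int nb = i - start + 1;
        if (modbus_read_registers(ctx, addrs[start], nb, out + start) == -1)
            return -1;                 /* propagate link errors to the caller */
        i++;
    }
    return 0;
}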

How machineCDN Implements These Patterns

machineCDN's edge gateway uses exactly these patterns — paged ring buffers with mutex-protected page management, QoS 1 MQTT with PUBACK-based buffer advancement, and both binary and JSON encoding depending on the deployment's bandwidth constraints.

The platform's gateway daemon runs on Linux-based edge hardware (including cellular routers like the Teltonika RUT series) and handles simultaneous Modbus RTU, Modbus TCP, and EtherNet/IP connections to mixed-vendor equipment. The buffer is sized during commissioning based on the expected outage duration — a 64 KB buffer holds roughly 4 hours of data at typical polling rates; a 512 KB buffer extends that to over 24 hours.

The result: plants running machineCDN don't lose telemetry during network outages. When connectivity recovers, the buffered data drains automatically and fills in the gaps in trending charts and analytics — no manual intervention, no missing data points.

Key Takeaways

  1. Use paged ring buffers, not unbounded queues — fixed memory, graceful overflow (oldest data dropped first)
  2. Protect buffer operations with a mutex, but never hold it during network I/O — microsecond lock durations keep producers and consumers non-blocking
  3. Track PUBACK per-message to prevent double-sending and enable reliable buffer advancement
  4. Implement an MQTT watchdog using last-delivery timestamps to catch zombie connections
  5. Batch by size OR time (whichever hits first) to balance bandwidth and latency
  6. Reset all tags hourly to guarantee complete snapshots and prevent drift
  7. Binary encoding saves 5-8× bandwidth with zero precision loss — essential for cellular-connected gateways
  8. Group contiguous Modbus registers into single requests — 15× faster than individual reads on RTU

Building a reliable IIoT edge gateway is fundamentally a systems programming challenge. The protocols, the buffering, the concurrency — each one is manageable alone, but getting them all right together, on constrained hardware, with zero tolerance for data loss, is what separates toy prototypes from production infrastructure.


See machineCDN's store-and-forward buffering in action with real factory data. Request a demo to explore the platform.

Total Cost of Ownership for IIoT Platforms: The Complete Guide to What You'll Actually Spend

· 9 min read
MachineCDN Team
Industrial IoT Experts

When manufacturing leaders evaluate Industrial IoT platforms, the conversation usually starts with license cost. "What's it per device? Per user? Per data point?" These are the wrong first questions. License fees typically represent only 20-40% of your total IIoT investment over three years. The rest hides in implementation, infrastructure, engineering time, change management, and ongoing operations.

How to Track Machine Utilization and Idle Time with IIoT: Stop Guessing, Start Measuring

· 9 min read
MachineCDN Team
Industrial IoT Experts

Ask any plant manager what their machine utilization is, and they'll give you a number. Ask how they calculated it, and you'll usually hear some version of "operator logs" or "we estimate about 75%."

The actual number is almost always lower. And the gap between perceived utilization and real utilization is where your capacity — and your margin — is hiding.

IIoT changes this from a guessing game to a measurement exercise. Here's how to implement real machine utilization and idle time tracking using PLC-level data.

Time-Sensitive Networking (TSN) for Industrial Ethernet: Why Deterministic Communication Is the Future of IIoT [2026]

· 11 min read

If you've spent any time on a factory floor, you know the fundamental tension: control traffic needs hard real-time guarantees (microsecond-level determinism), while monitoring and analytics traffic just needs "fast enough." For decades, the industry solved this by running separate networks — a PROFINET or EtherNet/IP fieldbus for control, and standard Ethernet for everything else.

Time-Sensitive Networking (TSN) eliminates that compromise. It brings deterministic, bounded-latency communication to standard IEEE 802.3 Ethernet — meaning your motion control packets and your IIoT telemetry can share the same physical wire without interfering with each other.

This isn't theoretical. TSN-capable switches are shipping from Cisco, Belden, Moxa, and Siemens. OPC-UA Pub/Sub over TSN is in production pilots. And if you're designing an IIoT architecture today, understanding TSN isn't optional — it's the foundation of where industrial networking is going.

The Problem TSN Solves

Standard Ethernet is "best effort." When you plug a switch into a network, frames are forwarded based on MAC address tables, and if two frames need the same port at the same time, one waits. That waiting — buffering, queueing, potential frame drops — is completely acceptable for web traffic. It's catastrophic for servo drives.

Consider a typical plastics manufacturing cell. An injection molding machine has:

  • Motion control loop running at 1ms cycle time (servo drives, hydraulic valves)
  • Process monitoring polling barrel temperatures every 2-5 seconds
  • Quality inspection sending 10MB camera images to an edge server
  • IIoT telemetry batching 500 tag values to MQTT every 30 seconds
  • MES integration exchanging production orders and counts

Before TSN, this required at minimum two separate networks — often three. The motion controller ran on a dedicated real-time fieldbus (PROFINET IRT, EtherCAT, or SERCOS III). Process monitoring lived on standard Ethernet. And the camera system had its own GigE network to avoid flooding the process network.

TSN says: one network, one wire, zero compromises.

The TSN Standards Stack

TSN isn't a single protocol — it's a family of IEEE 802.1 standards that work together. Understanding which ones matter for industrial deployments is critical.

IEEE 802.1AS: Time Synchronization

Everything in TSN starts with a shared clock. 802.1AS (generalized Precision Time Protocol, or gPTP) synchronizes all devices on the network to a common time reference with sub-microsecond accuracy.

Key differences from standard PTP (IEEE 1588):

Feature                   IEEE 1588 PTP          IEEE 802.1AS gPTP
Scope                     Any IP network         Layer 2 only
Best Master Clock         Complex negotiation    Simplified selection
Peer delay measurement    Optional               Mandatory
Transport                 UDP (L3) or L2         L2 only
Typical accuracy          1-10 μs                < 1 μs

For plant engineers, the practical implication is this: every TSN bridge (switch) participates in time synchronization. There's no "transparent clock" mode where a switch just passes PTP packets through. Every hop actively measures its own residence time and adjusts timestamps accordingly.

This gives you a synchronized time base across the entire network — which is what makes scheduled traffic possible.

IEEE 802.1Qbv: Time-Aware Shaper (TAS)

This is the core of TSN determinism. 802.1Qbv introduces the concept of time gates on each egress port of a switch. Every port has up to 8 priority queues (matching 802.1Q priority code points), and each queue has a gate that opens and closes on a precise schedule.

The schedule repeats on a fixed cycle — say, every 1ms. During the first 100μs, only the highest-priority queue (motion control) is open. During the next 300μs, process data queues open. The remaining 600μs is available for best-effort traffic (IIoT telemetry, file transfers, web browsing).

Time Cycle (1ms example):
├── 0-100μs: Gate 7 OPEN (motion control only)
├── 100-400μs: Gate 5-6 OPEN (process monitoring, alarms)
├── 400-1000μs: Gates 0-4 OPEN (IIoT, MES, IT traffic)
└── Cycle repeats...
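
To make that concrete, here is one way to represent the gate control list for this cycle in C (an illustrative data structure, not a vendor configuration format; real switches are programmed through their own tools or, on Linux bridges, the taprio qdisc):

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* One 802.1Qbv gate control entry: which priority queues are open, and for how long. */
typedef struct {
    uint8_t  gate_mask;       /* bit n set means traffic-class/priority queue n is open */
    uint32_t interval_ns;     /* how long this entry stays active */
} gate_entry_t;

/* The 1 ms example cycle shown above. */
static const gate_entry_t schedule[] = {
    { 0x80, 100000 },         /* 0-100 us:    only priority 7 (motion control) */
    { 0x60, 300000 },         /* 100-400 us:  priorities 5-6 (process data, alarms) */
    { 0x1F, 600000 },         /* 400-1000 us: priorities 0-4 (IIoT, MES, IT traffic) */
};

int main(void)
{
    uint32_t cycle_ns = 0;
    for (size_t i = 0; i < sizeof schedule / sizeof schedule[0]; i++)
        cycle_ns += schedule[i].interval_ns;
    printf("cycle time: %u ns\n", cycle_ns);   /* 1,000,000 ns = 1 ms, then the cycle repeats */
    return 0;
}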

The beauty of this approach is mathematical: if a motion control frame fits within its dedicated time slot, it's physically impossible for lower-priority traffic to delay it. No amount of IIoT telemetry bursts, camera image transfers, or IT traffic can interfere.

Practical consideration: TAS schedules must be configured consistently across all switches in the path. A motion control packet traversing 5 switches needs all 5 to have synchronized, compatible gate schedules. This is where centralized network configuration (via 802.1Qcc) becomes essential.

IEEE 802.1Qbu/802.3br: Frame Preemption

Even with scheduled gates, there's a problem: what if a low-priority frame is already being transmitted when the high-priority gate opens? On a 100Mbps link, a maximum-size Ethernet frame (1518 bytes) takes ~120μs to transmit. That's an unacceptable delay for a 1ms control loop.

Frame preemption solves this. It allows a switch to pause ("preempt") a low-priority frame mid-transmission, send the high-priority frame, then resume the preempted frame from where it left off.

The preempted frame is split into fragments, each with its own CRC for integrity checking. The receiving end reassembles them transparently. From the application's perspective, no frames are lost — the low-priority frame just arrives a bit later.

Why this matters in practice: Without preemption, you'd need to reserve guard bands — empty time slots before each high-priority window to ensure no large frame is in flight. Guard bands waste bandwidth. On a 100Mbps link with 1ms cycles, a 120μs guard band wastes 12% of available bandwidth. Preemption eliminates that waste entirely.

IEEE 802.1Qcc: Stream Reservation and Configuration

In a real plant, you don't manually configure gate schedules on every switch. 802.1Qcc defines a Centralized Network Configuration (CNC) model where a controller:

  1. Discovers the network topology
  2. Receives stream requirements from talkers (e.g., "I need to send 64 bytes every 1ms with max 50μs latency")
  3. Computes gate schedules across all switches in the path
  4. Programs the schedules into each switch

This is conceptually similar to how SDN (Software Defined Networking) works in data centers, adapted for the specific needs of industrial real-time traffic.

Current reality: CNC tooling is still maturing. As of early 2026, most TSN deployments use vendor-specific configuration tools (Siemens TIA Portal for PROFINET over TSN, Rockwell's Studio 5000 for EtherNet/IP over TSN). Full, vendor-agnostic CNC is coming but isn't plug-and-play yet.

IEEE 802.1CB: Frame Replication and Elimination

For safety-critical applications (emergency stops, protective relay controls), TSN supports seamless redundancy through 802.1CB. A talker sends duplicate frames along two independent paths through the network. Each receiving bridge eliminates the duplicate, passing only one copy to the application.

If one path fails, the other delivers the frame with zero switchover time. There's no spanning tree reconvergence, no RSTP timeout — the redundant frame was already there.

This gives you "zero recovery time" redundancy that's comparable to PRP (Parallel Redundancy Protocol) or HSR (High-availability Seamless Redundancy), but integrated into the TSN framework.

TSN vs. Existing Industrial Protocols

PROFINET IRT

PROFINET IRT (Isochronous Real-Time) achieves similar determinism to TSN, but it does so with proprietary hardware. IRT requires special ASICs in every switch and end device. Standard Ethernet switches don't work.

TSN-based PROFINET ("PROFINET over TSN") is Siemens' path forward. It preserves the PROFINET application layer while moving the real-time mechanism to TSN. The payoff: you can mix PROFINET devices with OPC-UA publishers, MQTT clients, and standard IT equipment on the same network.

EtherCAT

EtherCAT achieves extraordinary performance (sub-microsecond synchronization) by processing Ethernet frames "on the fly" — each slave modifies the frame as it passes through. This requires daisy-chain topology and dedicated EtherCAT hardware.

TSN can't match EtherCAT's raw performance in a daisy chain. But TSN supports standard star topologies with off-the-shelf switches, which is far more practical for plant-wide networks. The trend: EtherCAT for servo-level control within a machine, TSN for the plant-level network connecting machines.

Mitsubishi's CC-Link IE TSN was one of the first industrial protocols to adopt TSN natively. It demonstrates the model: keep the application-layer protocol (CC-Link IE Field), replace the real-time Ethernet mechanism with standard TSN. This lets CC-Link IE coexist with other TSN traffic on the same network.

Practical Architecture: TSN in a Manufacturing Plant

Here's how a TSN-based IIoT architecture looks in practice:

┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│ Servo Drives │     │ PLC / Motion │     │ Edge Gateway │
│  (TSN NIC)   │─────│  Controller  │─────│ (machineCDN) │
└──────────────┘     └──────────────┘     └──────┬───────┘
                            │                    │
                     ┌──────┴───────┐            │
                     │  TSN Switch  │            │
                     │  (802.1Qbv)  │────────────┘
                     └──────┬───────┘
                            │
             ┌──────────────┼──────────────┐
             │              │              │
        ┌────┴────┐    ┌────┴────┐    ┌────┴─────┐
        │  HMI /  │    │ Vision  │    │ IT/Cloud │
        │  SCADA  │    │ System  │    │ Traffic  │
        └─────────┘    └─────────┘    └──────────┘

The TSN switch runs 802.1Qbv with a gate schedule that guarantees:

  • Priority 7: Motion control frames — guaranteed 100μs slots at 1ms intervals
  • Priority 5-6: Process monitoring, alarms — 300μs slots
  • Priority 3-4: MES, HMI, SCADA — allocated bandwidth in best-effort window
  • Priority 0-2: IIoT telemetry, file transfers — fills remaining bandwidth

The edge gateway collecting IIoT telemetry operates in the best-effort tier. It polls PLC tags over EtherNet/IP or Modbus TCP, batches the data, and publishes to MQTT — all without any risk of interfering with the control loops sharing the same wire.

Platforms like machineCDN that bridge industrial protocols to cloud already handle the data collection side — Modbus register grouping, EtherNet/IP tag reads, change-of-value filtering. TSN just means that data collection traffic coexists safely with control traffic, eliminating the need for separate networks.

Performance Benchmarks

Real-world TSN deployments show consistent results:

Metric                                 Typical Performance
Time sync accuracy                     200-800 ns across 10 hops
Minimum guaranteed cycle               31.25 μs (with preemption)
Maximum jitter (scheduled traffic)     < 1 μs
Maximum hops for < 10μs latency        5-7 (at 1Gbps)
Bandwidth efficiency                   85-95% (vs 70-80% without preemption)
Frame preemption overhead              ~20 bytes per fragment (minimal)

Compare this to standard Ethernet QoS (802.1p priority queues without TAS): priority queuing gives you statistical priority, not deterministic guarantees. Under heavy load, even high-priority frames can experience hundreds of microseconds of jitter.

Common Pitfalls

1. Not All "TSN-Capable" Switches Are Equal

Some switches support 802.1AS (time sync) but not 802.1Qbv (scheduled traffic). Others support Qbv but not frame preemption. Check the specific IEEE profiles supported, not just the TSN marketing label.

The IEC/IEEE 60802 TSN Profile for Industrial Automation defines the mandatory feature set for industrial use. Look for compliance with this profile.

2. End-Device TSN Support Is Still Emerging

A TSN switch is only half the equation. For guaranteed determinism, the end device (PLC, drive, sensor) needs a TSN-capable Ethernet controller that can transmit frames at precisely scheduled times. Many current PLCs use standard Ethernet NICs — they benefit from TSN's traffic isolation but can't achieve sub-microsecond transmission timing.

3. Configuration Complexity

TSN gate schedules are powerful but complex. A misconfigured schedule can:

  • Create "dead time" where no queue is open (wasted bandwidth)
  • Allow large best-effort frames to overflow into scheduled slots
  • Cause frame drops if the schedule doesn't account for inter-frame gaps

Start simple: define two traffic classes (real-time and best-effort) before attempting multi-level scheduling.

4. Cabling and Distance

TSN doesn't change Ethernet's physical limitations. Standard Cat 5e/6 runs up to 100m per segment. For plant-wide TSN, you'll need fiber between buildings and proper cable management. Time synchronization accuracy degrades with asymmetric cable lengths — use equal-length cables for links between TSN bridges.

Getting Started

If you're designing a new IIoT deployment or modernizing an existing plant network:

  1. Audit your traffic classes. Map every communication flow to a priority level. Most plants have 3-4 distinct classes: hard real-time control, soft real-time monitoring, IT/business, and bulk transfers.

  2. Start with TSN-capable spine switches. Even if your end devices aren't TSN-ready, deploying TSN switches at the aggregation layer gives you traffic isolation today and a deterministic upgrade path for tomorrow.

  3. Deploy IIoT data collection at the appropriate priority. Edge gateways that poll PLCs and publish to MQTT typically operate fine at priority 3-4. They don't need deterministic guarantees — they need reliable throughput. TSN ensures that throughput is available even when control traffic is present.

  4. Plan for centralized configuration. As your TSN deployment grows beyond a single machine cell, manual switch configuration becomes untenable. Invest in network management tools that support 802.1Qcc configuration.

The Convergence Thesis

TSN's real impact isn't about making Ethernet faster — it's about eliminating the network boundaries between IT and OT.

Today, most factories have 3-5 separate network segments with firewalls, protocol converters, and data diodes between them. Each segment has its own switches, cables, management tools, and maintenance burden.

TSN collapses these into a single converged network where control traffic and IT traffic coexist with mathematical guarantees. That means:

  • Lower infrastructure cost (one network instead of three)
  • Simpler troubleshooting (one set of diagnostic tools)
  • Direct IIoT access to real-time data (no protocol conversion needed)
  • Unified security policy (one network to secure, one set of ACLs)

For plant engineers deploying IIoT platforms, TSN means the data you need is already on the same network — no bridging, no gateways, no proprietary converters. You connect your edge device, configure the right traffic priority, and start collecting data from machines that were previously on isolated control networks.

The deterministic network is coming. The question is whether your infrastructure will be ready for it.