
Protocol Bridging: Translating Modbus to MQTT at the Industrial Edge [2026]

· 15 min read

Protocol Bridging Architecture

Every plant floor speaks Modbus. Every cloud platform speaks MQTT. The 20 inches of Ethernet cable between them is where industrial IoT projects succeed or fail.

Protocol bridging — the act of reading data from one industrial protocol and publishing it via another — sounds trivial on paper. Poll a register, format a JSON payload, publish to a topic. Three lines of pseudocode. But the engineers who've actually deployed these bridges at scale know the truth: the hard problems aren't in the translation. They're in the timing, the buffering, the failure modes, and the dozens of edge cases that only surface when a PLC reboots at 2 AM while your MQTT broker is mid-failover.

This guide covers the real engineering of Modbus-to-MQTT bridges — from register-level data mapping to store-and-forward architectures that survive weeks of disconnection.

Why Bridging Is Harder Than It Looks

Modbus and MQTT are fundamentally different communication paradigms. Understanding these differences is critical to building a bridge that doesn't collapse under production conditions.

Modbus is synchronous and polled. The master (your gateway) initiates every transaction. It sends a request frame, waits for a response, processes the data, and moves on. There's no concept of subscriptions, push notifications, or asynchronous updates. If you want a value, you ask for it. Every. Single. Time.

MQTT is asynchronous and event-driven. Publishers send messages whenever they have data. Subscribers receive messages whenever they arrive. The broker decouples producers from consumers. There's no concept of polling — data flows when it's ready.

Bridging these two paradigms means your gateway must act as a Modbus master on one side (issuing timed read requests) and an MQTT client on the other (publishing messages asynchronously). The gateway is the only component that speaks both languages, and it bears the full burden of timing, error handling, and data integrity.

The Timing Mismatch

Modbus RTU on RS-485 at 9600 baud takes roughly 20ms per single-register transaction (request frame + inter-frame delay + response frame + turnaround time). Reading 100 registers individually would take 2 seconds — an eternity if you need sub-second update rates.
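The per-transaction figure can be sanity-checked with a back-of-the-envelope estimate. The sketch below assumes 11 bits per character (a common RTU framing), the standard 3.5-character silent interval after each frame, and zero processing time in the slave; the function name is illustrative only.

```python
def rtu_transaction_ms(registers: int, baud: int = 9600, bits_per_char: int = 11) -> float:
    """Estimate wire time for one Read Holding Registers (FC 03) exchange."""
    char_ms = 1000.0 * bits_per_char / baud
    request_chars = 8                    # address + FC + start(2) + quantity(2) + CRC(2)
    response_chars = 5 + 2 * registers   # address + FC + byte count + data + CRC(2)
    silent_chars = 2 * 3.5               # 3.5-character gap after each frame
    return (request_chars + response_chars + silent_chars) * char_ms


single = rtu_transaction_ms(1)
block = rtu_transaction_ms(100)
print(f"1 register:        {single:.1f} ms")
print(f"100 one at a time: {100 * single / 1000:.2f} s")
print(f"100 in one block:  {block:.0f} ms")
```

Under these assumptions a single-register exchange costs about 25 ms on the wire, the same order as the rough figure above, while a single 100-register block costs about 250 ms. Real devices add their own response latency on top.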

Modbus TCP eliminates the serial timing constraints but introduces TCP socket management, connection timeouts, and the possibility of the PLC's TCP stack running out of connections (many PLCs accept only a handful of simultaneous TCP connections, often 4–8).

MQTT, meanwhile, can handle thousands of messages per second. The bottleneck is never the MQTT side — it's always the Modbus side. Your bridge architecture must respect the slower protocol's constraints while maximizing throughput.

Register Mapping: The Foundation

The first engineering decision is how to map Modbus registers to MQTT topics and payloads. There are three common approaches, each with trade-offs.

Approach 1: One Register, One Message

Topic: plant/line3/plc1/holding/40001
Payload: {"value": 1847, "ts": 1709312400, "type": "uint16"}

Pros: Simple, granular, easy to subscribe to individual data points. Cons: Scales poorly. 200 registers means 200 MQTT publishes per poll cycle. At a 1-second poll rate, that's 200 messages/second: sustainable for the broker, but wasteful in bandwidth and processing overhead on constrained gateways.

Approach 2: Batched JSON Messages

Topic: plant/line3/plc1/batch
Payload: {
"ts": 1709312400,
"device_type": 1010,
"tags": [
{"id": 1, "value": 1847, "type": "uint16"},
{"id": 2, "value": 23.45, "type": "float"},
{"id": 3, "value": true, "type": "bool"}
]
}

Pros: Drastically fewer MQTT messages. One publish carries an entire poll cycle's worth of data. Cons: JSON encoding adds CPU overhead on embedded gateways. Payload size can grow large if you have hundreds of tags.

Approach 3: Binary-Encoded Batches

Instead of JSON, encode tag values in a compact binary format: a header with timestamp and device metadata, followed by packed tag records (tag ID + status + type + value). The value of a single 16-bit register occupies 2 bytes in binary, plus a few bytes of per-record header, versus roughly 30 bytes for the equivalent JSON object.

Pros: Minimum bandwidth. Critical for cellular-connected gateways where data costs money per megabyte. Cons: Requires matching decoders on the cloud side. Harder to debug.

The right approach depends on your constraints. For Ethernet-connected gateways with ample bandwidth, batched JSON is the sweet spot. For cellular or satellite links, binary encoding can reduce data costs by 10–15x.

Contiguous Register Coalescing

The single most impactful optimization in any Modbus-to-MQTT bridge is contiguous register coalescing: instead of reading registers one at a time, group adjacent registers into a single Modbus read request.

Consider a tag list where you need registers at addresses 40100, 40101, 40102, 40103, and 40110. A naive implementation makes 5 read requests. A smart bridge recognizes that 40100–40103 are contiguous and reads them in one Read Holding Registers (function code 03) call with a quantity of 4. That's 2 transactions instead of 5.

The coalescing logic must respect several constraints:

  1. Same function code. You can't coalesce a coil read (FC 01) with a holding register read (FC 03). The bridge must group tags by their Modbus register type — coils (0xxxxx), discrete inputs (1xxxxx), input registers (3xxxxx), and holding registers (4xxxxx) — and coalesce within each group.

  2. Maximum register count per transaction. The Modbus specification limits a single read to 125 registers (for 16-bit registers) or 2000 coils. In practice, keeping blocks under 50 registers reduces the risk of timeout errors on slower PLCs.

  3. Addressing gaps. If registers 40100 and 40150 both need reading, coalescing them into a single 51-register read wastes 49 registers worth of response data. Set a maximum gap threshold (e.g., 10 registers) — if the gap exceeds it, split into separate transactions.

  4. Same polling interval. Tags polled every second shouldn't be grouped with tags polled every 60 seconds. Coalescing must respect per-tag timing configuration.

// Pseudocode: Coalescing algorithm
sort tags by address ascending
group_head = first_tag
group_count = 1
group_registers = first_tag.elem_count

for each subsequent tag:
    if tag.function_code == group_head.function_code
       AND tag.address == group_head.address + group_registers
       AND group_registers + tag.elem_count <= MAX_BLOCK_SIZE
       AND tag.interval == group_head.interval:
        // extend current group
        group_registers += tag.elem_count
        group_count += 1
    else:
        // read current group, start new one
        read_modbus_block(group_head, group_count, group_registers)
        group_head = tag
        group_count = 1
        group_registers = tag.elem_count

// read the final group, which the loop never flushes
read_modbus_block(group_head, group_count, group_registers)

In production deployments, contiguous coalescing routinely reduces Modbus transaction counts by 5–10x, which directly translates to faster poll cycles and fresher data.
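As a rough illustration, the four constraints listed earlier, including the gap threshold, might be combined as in the sketch below. The `Tag` fields, the two constants, and the tuple layout of a read block are assumptions made for this example, not part of any particular gateway.

```python
from dataclasses import dataclass

MAX_BLOCK_REGISTERS = 50   # conservative cap; the protocol limit is 125
MAX_GAP_REGISTERS = 10     # read through unused gaps up to this size


@dataclass(frozen=True)
class Tag:
    function_code: int
    address: int      # starting register address
    count: int        # registers occupied (2 for a 32-bit value)
    interval_s: int   # polling interval in seconds


def coalesce(tags):
    """Return read blocks as (function_code, start_address, quantity, tags)."""
    blocks = []
    ordered = sorted(tags, key=lambda t: (t.function_code, t.interval_s, t.address))
    for tag in ordered:
        if blocks:
            fc, start, qty, members = blocks[-1]
            gap = tag.address - (start + qty)
            new_qty = max(qty, tag.address + tag.count - start)
            if (tag.function_code == fc
                    and tag.interval_s == members[0].interval_s
                    and gap <= MAX_GAP_REGISTERS
                    and new_qty <= MAX_BLOCK_REGISTERS):
                # Extend the current block, reading through any small gap.
                blocks[-1] = (fc, start, new_qty, members + [tag])
                continue
        blocks.append((tag.function_code, tag.address, tag.count, [tag]))
    return blocks


tags = [Tag(3, a, 1, 1) for a in (40100, 40101, 40102, 40103, 40110, 40150)]
for fc, start, qty, members in coalesce(tags):
    print(f"FC {fc:02d}: start={start} quantity={qty} tags={len(members)}")
```

With a gap threshold of 10, the tags at 40100–40103 and 40110 end up in a single 11-register read, while 40150 gets its own transaction. Setting `MAX_GAP_REGISTERS` to 0 falls back to strictly contiguous grouping.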

Data Type Handling: Where the Devils Live

Modbus registers are 16-bit words. Everything else — 32-bit integers, IEEE 754 floats, booleans packed into bit fields — is a convention imposed by the PLC programmer. Your bridge must handle all of these correctly.

32-Bit Values Across Two Registers

A 32-bit float or integer spans two consecutive 16-bit Modbus registers. The critical question: which register contains the high word?

There's no standard. Some PLCs use big-endian word order (high word first, often called "ABCD" byte order). Others use little-endian word order (low word first, "CDAB"). Some also swap the bytes within each word ("BADC"), and a few are fully little-endian ("DCBA"). You must know your PLC's convention, or your 23.45°C temperature reading decodes to a meaningless number.

For IEEE 754 floats specifically, the conversion from two 16-bit registers to a float is:

// Big-endian word order (ABCD)
float_value = ieee754_decode(register[n] << 16 | register[n+1])

// Little-endian word order (CDAB)
float_value = ieee754_decode(register[n+1] << 16 | register[n])

Production bridges must support configurable byte/word ordering on a per-tag basis, because it's common to have PLCs from different manufacturers on the same network.
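A per-tag word-order setting could be implemented along the following lines. This is a minimal sketch using Python's standard `struct` module; the function name and the use of the four-letter labels as configuration values are assumptions for illustration.

```python
import struct


def decode_float32(reg_a: int, reg_b: int, order: str = "ABCD") -> float:
    """Decode two 16-bit registers, given in the order they were read, as a float."""
    a, b, c, d = struct.pack(">HH", reg_a, reg_b)   # the four bytes as received
    layouts = {
        "ABCD": bytes([a, b, c, d]),   # big-endian, high word first
        "CDAB": bytes([c, d, a, b]),   # low word first (word-swapped)
        "BADC": bytes([b, a, d, c]),   # bytes swapped within each word
        "DCBA": bytes([d, c, b, a]),   # fully little-endian
    }
    return struct.unpack(">f", layouts[order])[0]


# 23.45 is 0x41BB999A as an IEEE 754 single-precision value.
print(decode_float32(0x41BB, 0x999A, "ABCD"))
print(decode_float32(0x999A, 0x41BB, "CDAB"))
print(decode_float32(0x999A, 0x41BB, "ABCD"))   # wrong order on purpose
```

The first two calls recover the temperature; the third, which applies the wrong convention to the same registers, yields a tiny negative number that has nothing to do with the real reading.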

Boolean Extraction From Status Words

PLCs frequently pack multiple boolean states into a single 16-bit register — machine running, alarm active, door open, etc. Extracting individual bits requires configurable shift-and-mask operations:

bit_value = (register_value >> shift_count) & mask

Where shift_count identifies the bit position (0–15) and mask is typically 0x01 for a single bit. The bridge's tag configuration should support this as a first-class feature, not a post-processing hack.
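The same operation in runnable form, as a small sketch. The bit assignments in the example status word (running, alarm, door) are hypothetical.

```python
def extract_bits(register_value: int, shift_count: int, mask: int = 0x01) -> int:
    """Shift the register right, then mask off the bits of interest."""
    return (register_value >> shift_count) & mask


status = 0b0000_0000_0000_0101   # hypothetical word: bit 0 = running, bit 2 = door open

running = extract_bits(status, 0)
alarm = extract_bits(status, 1)
door_open = extract_bits(status, 2)
print(running, alarm, door_open)

# A wider mask pulls out a multi-bit field, e.g. a 4-bit mode code in bits 8-11.
mode = extract_bits(0x1234, 8, 0x0F)
print(mode)
```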

Type Safety Across the Bridge

When values cross from Modbus to MQTT, type information must be preserved. A uint16 register value of 65535 means something very different from a signed int16 value of -1 — even though the raw bits are identical. Your MQTT payload must carry the type alongside the value, whether in JSON field names or binary format headers.
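To make the uint16 versus int16 distinction concrete, here is a minimal sketch of applying the configured type before building the JSON record. The `to_payload` helper and its field names mirror the batched JSON example earlier but are otherwise assumptions.

```python
def as_int16(raw: int) -> int:
    """Reinterpret a raw 16-bit register as a two's-complement signed value."""
    raw &= 0xFFFF
    return raw - 0x10000 if raw & 0x8000 else raw


def to_payload(tag_id: int, raw: int, type_name: str) -> dict:
    """Build a tag record that carries the type alongside the decoded value."""
    value = as_int16(raw) if type_name == "int16" else raw & 0xFFFF
    return {"id": tag_id, "value": value, "type": type_name}


print(to_payload(1, 0xFFFF, "uint16"))
print(to_payload(1, 0xFFFF, "int16"))
```

The same raw register produces 65535 in one record and -1 in the other, which is why the type has to travel with the value.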

Connection Resilience: The Store-and-Forward Pattern

The Modbus side of a protocol bridge is local — wired directly to PLCs over Ethernet or RS-485. It rarely fails. The MQTT side connects to a remote broker over a WAN link that will fail. Cellular drops out. VPN tunnels collapse. Cloud brokers restart for maintenance.

A production bridge must implement store-and-forward: continue reading from Modbus during MQTT outages, buffer the data locally, and drain the buffer when connectivity returns.

Page-Based Ring Buffers

The most robust buffering approach for embedded gateways uses a page-based ring buffer in pre-allocated memory:

  1. Format a fixed memory region into equal-sized pages at startup.
  2. Write incoming Modbus data to the current "work page." When a page fills, move it to the "used" queue.
  3. Send pages from the "used" queue to MQTT, one message at a time. Wait for the MQTT publish acknowledgment (at QoS 1) before advancing the read pointer.
  4. Recycle fully-delivered pages back to the "free" list.

If the MQTT connection drops:

  • Stop sending, but keep writing to new pages.
  • If all pages fill up (true buffer overflow), start overwriting the oldest used page. You lose the oldest data, but never the newest.

This design has several properties that matter for industrial deployments:

  • No dynamic memory allocation. The entire buffer is pre-allocated. No malloc, no fragmentation, no out-of-memory crashes at 3 AM.
  • Bounded memory usage. You know exactly how much RAM the buffer consumes. Critical on gateways with 64–256 MB.
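  • Delivery guarantees. Each page tracks its own read pointer. If the connection drops mid-delivery, the unacknowledged page is re-sent after reconnect (at-least-once semantics). Surviving a gateway crash or reboot additionally requires the buffer to live in persistent storage rather than RAM.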

How Long Can You Buffer?

Quick math: A gateway reading 100 tags every 5 seconds generates roughly 2 KB of batched JSON per poll cycle. That's 24 KB/minute, 1.4 MB/hour, 34 MB/day. A 256 MB buffer holds 7+ days of data. In binary format, that extends to 50+ days.

For most industrial applications, 24–48 hours of buffering is sufficient to survive maintenance windows, network outages, and firmware upgrades.

MQTT Connection Management

The MQTT side of the bridge deserves careful engineering. Industrial connections aren't like web applications — they run for months without restart, traverse multiple NATs and firewalls, and must recover automatically from every failure mode.

Async Connection With Threaded Reconnect

Never block the Modbus polling loop waiting for an MQTT connection. The correct architecture uses a separate thread for MQTT connection management:

  1. The main thread polls Modbus on a tight timer and writes data to the buffer.
  2. A connection thread handles MQTT connect/reconnect attempts asynchronously.
  3. The buffer drains automatically when the MQTT connection becomes available.

This separation ensures that a 30-second MQTT connection timeout doesn't stall your 1-second Modbus poll cycle. Data keeps flowing into the buffer regardless of MQTT state.

Reconnect Strategy

Use a fixed reconnect delay (5 seconds works well for most deployments) rather than exponential backoff. Industrial MQTT connections are long-lived — the overhead of a 5-second retry is negligible compared to the cost of missing data during a 60-second exponential backoff.

However, protect against connection storms: if the broker is down for an extended period, ensure reconnect attempts don't overwhelm the gateway's CPU or the broker's TCP listener.

TLS Certificate Management

Production MQTT bridges almost always use TLS (port 8883 rather than 1883). The bridge must handle:

  • Certificate expiration. Monitor the TLS certificate file's modification timestamp. If the cert file changes on disk, tear down the current MQTT connection and reinitialize with the new certificate. Don't wait for the existing connection to fail — proactively reconnect.
  • SAS token rotation. When using Azure IoT Hub or similar services with time-limited tokens, parse the token's expiration timestamp and reconnect before it expires.
  • CA certificate bundles. Embedded gateways often ship with minimal CA stores. Ensure your IoT hub's root CA is explicitly included in the gateway's certificate chain.
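For the SAS token case, the expiry check can be as simple as the sketch below. It assumes the Azure-style token layout, in which the expiry is carried in an `se` field as Unix seconds; the token shown is a made-up placeholder, and the 300-second safety margin is an arbitrary choice.

```python
from urllib.parse import parse_qs


def sas_expiry(token: str) -> int:
    """Return the `se` (expiry, Unix seconds) field of a SAS token."""
    _, _, fields = token.partition(" ")   # drop the "SharedAccessSignature" prefix
    return int(parse_qs(fields)["se"][0])


def should_reconnect(token: str, now: float, margin_s: int = 300) -> bool:
    """True once the token is within margin_s seconds of expiring."""
    return now >= sas_expiry(token) - margin_s


# Placeholder token for illustration only, not a real credential.
token = ("SharedAccessSignature "
         "sr=example-hub.azure-devices.net%2Fdevices%2Fgw1&sig=abc%3D&se=1709316000")
print(sas_expiry(token))
print(should_reconnect(token, now=1709316000 - 60))
```

In a running gateway, `now` would come from `time.time()` and the check would sit in the same periodic loop that watches the certificate file.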

Change-of-Value vs. Periodic Reporting

Not all tags need the same reporting strategy. A bridge should support both:

Periodic reporting publishes every tag value at a fixed interval, regardless of whether the value changed. Simple, predictable, but wasteful for slowly-changing values like ambient temperature or firmware version.

Change-of-value (COV) reporting compares each newly read value against the previous value and only publishes when a change is detected. This dramatically reduces MQTT traffic for boolean states (machine on/off), setpoints, and alarm registers that change infrequently.

The implementation stores the last-read value for each tag and performs a comparison before deciding whether to publish:

if tag.compare_enabled:
    if new_value != tag.last_value:
        publish(tag, new_value)
        tag.last_value = new_value
else:
    publish(tag, new_value)  # always publish

A hybrid approach works best: use COV for digital signals and alarm words, periodic for analog measurements like temperature and pressure. Some tags (critical alarms, safety interlocks) should always be published immediately — bypassing both the normal comparison logic and the batching system — to minimize latency.

Calculated and Dependent Tags

Real-world PLCs don't always expose data in the format you need. A bridge should support calculated tags — values derived from raw register data through mathematical or bitwise operations.

Common patterns include:

  • Bit extraction from status words. A 16-bit register contains 16 individual boolean states. The bridge extracts each bit as a separate tag using shift-and-mask operations.
  • Scaling and offset. Raw register value 4000 represents 400.0°F when divided by 10. The bridge applies a linear transformation (value × k1 / k2) to produce engineering units.
  • Dependent tag chains. When a parent tag's value changes, the bridge automatically reads and publishes a set of dependent tags. Example: when the "recipe number" register changes, immediately read all recipe parameter registers.

These calculations must happen at the edge, inside the bridge, before data is published to MQTT. Pushing raw register values to the cloud and calculating there wastes bandwidth and adds latency.

Link State Monitoring

A bridge should publish its own health status alongside machine data. The most critical metric is link state: whether the gateway can actually communicate with the PLC.

When a Modbus read fails with a connection error (timeout, connection reset, connection refused, or broken pipe), the bridge should:

  1. Set the link state to "down" and publish immediately (not batched).
  2. Close the existing Modbus connection and attempt reconnection.
  3. Continue publishing link-down status at intervals so the cloud system knows the gateway is alive but the PLC is unreachable.
  4. When reconnection succeeds, set link state to "up" and force-read all tags to re-establish baseline values.

This link state telemetry is invaluable for distinguishing between "the machine is off" and "the network cable is unplugged" — two very different problems that look identical without gateway-level diagnostics.

How machineCDN Handles Protocol Bridging

machineCDN's edge gateway was built from the ground up for exactly this problem. The gateway daemon handles Modbus RTU (serial), Modbus TCP, and EtherNet/IP on the device side, and publishes all data over MQTT with TLS to the cloud.

Key architectural decisions in the machineCDN gateway:

  • Pre-allocated page buffer with configurable page sizes for zero-allocation runtime operation.
  • Automatic contiguous register coalescing that respects function code boundaries, tag intervals, and register limits.
  • Per-tag COV comparison with an option to bypass batching for latency-critical values.
  • Calculated tag chains for bit extraction and dependent tag reads.
  • Hourly full refresh — every 60 minutes, the gateway resets all COV baselines and publishes every tag value, ensuring the cloud always has a complete snapshot even if individual change events were missed.
  • Async MQTT reconnection with certificate hot-reloading and SAS token expiration monitoring.

The result is a bridge that reliably moves data from plant-floor PLCs to cloud dashboards with sub-second latency during normal operation and zero data loss during outages lasting hours or days.

Deployment Checklist

Before deploying a Modbus-to-MQTT bridge in production:

  • Map every register — document address, data type, byte order, scaling factor, and engineering units
  • Set appropriate poll intervals — 1s for process-critical, 5–60s for environmental, 300s+ for configuration data
  • Size the buffer — calculate daily data volume and ensure the buffer can hold 24+ hours
  • Test byte ordering — verify float and 32-bit integer decoding against known PLC values before trusting the data
  • Configure COV vs periodic — boolean and alarm tags = COV, analog = periodic
  • Enable TLS — never run MQTT unencrypted on production networks
  • Monitor link state — alert on PLC disconnections, not just missing data
  • Test failover — unplug the WAN cable for 4 hours and verify data drains correctly when it reconnects

Protocol bridging isn't glamorous work. It's plumbing. But it's the plumbing that determines whether your IIoT deployment delivers reliable data or expensive noise. Get the bridge right, and everything downstream — analytics, dashboards, predictive maintenance — just works.

Best Real-Time Manufacturing Dashboard Software 2026: See Your Factory in Real Time

· 9 min read
MachineCDN Team
Industrial IoT Experts

A manufacturing dashboard isn't useful if it shows you what happened yesterday. By the time you're reading yesterday's production report, the scrap is already in the bin, the machine has been down for 8 hours, and your best customer's order is late.

Real-time manufacturing dashboards change the equation. They show you what's happening right now — which machines are running, which are idle, which are alarming, and how your shift is tracking against plan. The difference between a 5-second data refresh and a next-day report is the difference between catching a problem and cleaning up after one.

Here's what the best real-time dashboard platforms deliver in 2026, and how to pick the right one for your operation.

How to Reduce Energy Costs in Manufacturing with IIoT: A Practical Guide to Cutting 15-30% Off Your Power Bill

· 10 min read
MachineCDN Team
Industrial IoT Experts

Energy is the expense that hides in plain sight. Most manufacturers know their monthly utility bill, but few can answer these questions:

  • Which machines consume the most energy per part produced?
  • How much energy does your factory waste during idle time and changeovers?
  • What's the actual cost of running Machine #7 versus Machine #12 for the same product?
  • How does your energy consumption compare between shifts, operators, or products?

Without machine-level energy visibility, you're paying a number you can't optimize. And that number is getting bigger — U.S. industrial electricity rates have risen 22% since 2020, and the trend isn't reversing.

This guide shows you how to use IIoT monitoring to find and eliminate energy waste in manufacturing operations — with a realistic target of 15–30% reduction in energy costs within the first year.

Reliable Telemetry Delivery in IIoT: Page Buffers, Batch Finalization, and Disconnection Recovery [2026]

· 13 min read

Your edge gateway reads 200 tags from a PLC every second. The MQTT connection to your cloud broker drops for 3 minutes because someone bumped the cellular antenna. What happens to the 36,000 data points collected during the outage?

If your answer is "they're gone," you have a toy system, not an industrial one.

Reliable telemetry delivery is the hardest unsolved problem in most IIoT architectures. Everyone focuses on the protocol layer — Modbus reads, EtherNet/IP connections, OPC-UA subscriptions — but the real engineering is in what happens between reading a value and confirming it reached the cloud. This article breaks down the buffer architecture that makes zero-data-loss telemetry possible on resource-constrained edge hardware.

Reliable telemetry delivery buffer architecture

The Problem: Three Asynchronous Timelines

In any edge-to-cloud telemetry system, you're managing three independent timelines:

  1. PLC read cycle — Tags are read at fixed intervals (1s, 60s, etc.). This never stops. The PLC doesn't care if your cloud connection is down.

  2. Batch collection — Raw tag values are grouped into batches by timestamp and device. Batches accumulate until they hit a size limit or a timeout.

  3. MQTT delivery — Batches are published to the broker. The broker acknowledges receipt. At QoS 1, the MQTT library handles retransmission, but only if you give it data in the right form.

These three timelines run independently. The PLC read loop runs on a tight 1-second cycle. Batch finalization might happen every 30–60 seconds. MQTT delivery depends on network availability. If any one of these stalls, the others must keep running without data loss.

This is fundamentally a producer-consumer problem with a twist: the consumer (MQTT) can disappear for minutes at a time, and the producer (PLC reads) cannot slow down.

The Batch Layer: Grouping Values for Efficient Transport

Raw tag values are tiny — a temperature reading is 4 bytes, a boolean is 1 byte. Sending each value as an individual MQTT message would be absurdly wasteful. Instead, values are collected into batches — structured payloads that contain multiple timestamped readings from one or more devices.

Batch Structure

A batch is organized as a series of groups, where each group represents one polling cycle (one timestamp, one device):

Batch
├── Group 0: { timestamp: 1709284800, device_type: 5000, serial: 12345 }
│ ├── Value: { id: 2, values: [72.4] } // Delivery Temp
│ ├── Value: { id: 3, values: [68.1] } // Mold Temp
│ └── Value: { id: 5, values: [12.6] } // Flow Value
├── Group 1: { timestamp: 1709284860, device_type: 5000, serial: 12345 }
│ ├── Value: { id: 2, values: [72.8] }
│ ├── Value: { id: 3, values: [68.3] }
│ └── Value: { id: 5, values: [12.4] }
└── ...

Dual-Format Encoding: JSON vs Binary

Production edge daemons typically support two encoding formats for batches, and the choice has massive implications for bandwidth:

JSON format:

{
"groups": [
{
"ts": 1709284800,
"device_type": 5000,
"serial_number": 12345,
"values": [
{"id": 2, "values": [72.4]},
{"id": 3, "values": [68.1]}
]
}
]
}

Binary format (same data):

Header:   F7             (1 byte  - magic)
Groups:   00 00 00 01    (4 bytes - group count)
Group 0:  65 E1 9D C0    (4 bytes - timestamp: 1709284800)
          13 88          (2 bytes - device type: 5000)
          00 00 30 39    (4 bytes - serial number: 12345)
          00 00 00 02    (4 bytes - value count)
Value 0:  00 02          (2 bytes - tag id)
          00             (1 byte  - status: OK)
          01             (1 byte  - values count)
          04             (1 byte  - element size: 4 bytes)
          42 90 CC CD    (4 bytes - float 72.4)
Value 1:  00 03          (2 bytes - tag id)
          00             (1 byte  - status: OK)
          01             (1 byte  - values count)
          04             (1 byte  - element size: 4 bytes)
          42 88 33 33    (4 bytes - float 68.1)

The JSON version of this payload: about 130 bytes when minified. The binary version: 37 bytes. That's roughly a 3.5x reduction, and on a metered cellular connection at $0.01/MB, that savings compounds quickly when you're transmitting every 30 seconds 24/7.

The binary format uses a simple TLV-like structure: magic byte, group count (big-endian uint32), then for each group: timestamp (uint32), device type (uint16), serial number (uint32), value count (uint32), then for each value: tag ID (uint16), status byte, value count, element size, and raw value bytes. No field names, no delimiters, no escaping — just packed binary data.
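The layout described above maps directly onto Python's `struct` module. The following is a sketch of an encoder for that layout, assuming big-endian fields throughout, a status byte of 0 for OK, and 4-byte float elements only; the function name and dictionary keys are chosen to match the JSON example and are otherwise illustrative.

```python
import struct

MAGIC = 0xF7


def encode_batch(groups) -> bytes:
    """Pack groups of float values into the magic/group/value layout."""
    out = bytearray(struct.pack(">BI", MAGIC, len(groups)))
    for g in groups:
        out += struct.pack(">IHII", g["ts"], g["device_type"],
                           g["serial_number"], len(g["values"]))
        for v in g["values"]:
            # tag id, status (0 = OK), number of elements, element size in bytes
            out += struct.pack(">HBBB", v["id"], 0, len(v["values"]), 4)
            for x in v["values"]:
                out += struct.pack(">f", x)
    return bytes(out)


payload = encode_batch([{
    "ts": 1709284800,
    "device_type": 5000,
    "serial_number": 12345,
    "values": [{"id": 2, "values": [72.4]}, {"id": 3, "values": [68.1]}],
}])
print(len(payload), payload.hex(" "))
```

A real encoder would also need to handle other element sizes (1-byte booleans, 2-byte integers) and non-OK status codes, which this sketch leaves out.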

Batch Finalization Triggers

A batch should be finalized (sealed and queued for delivery) when either condition is met:

  1. Size limit exceeded — When the accumulated batch size exceeds a configured maximum (e.g., 500KB for JSON, or when the binary buffer is 90%+ full). The 90% threshold for binary avoids the edge case where the next value would overflow the buffer.

  2. Collection timeout expired — When elapsed time since the batch started exceeds a configured maximum (e.g., 60 seconds). This ensures data flows even during quiet periods with few value changes.

if (elapsed_seconds > max_collection_time) → finalize
if (batch_size > max_batch_size) → finalize

Both checks happen after every group is closed (after every polling cycle). This means finalization granularity is tied to your polling interval — if you poll every 1 second and your batch timeout is 60 seconds, each batch will contain roughly 60 groups.
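One way to express the two triggers is sketched below: both checks run each time a group is closed, and a sealed batch is handed back when either fires. The class name, the injectable clock, and the decision to return the sealed groups as a list are assumptions made for this example.

```python
import time


class Batch:
    def __init__(self, max_bytes: int, max_age_s: float, clock=time.monotonic):
        self.max_bytes = max_bytes
        self.max_age_s = max_age_s
        self.clock = clock
        self.groups = []
        self.size = 0
        self.started = None

    def close_group(self, encoded_group: bytes):
        """Add one finished polling cycle; return the sealed batch if a trigger fired."""
        if self.started is None:
            self.started = self.clock()   # batch age counts from its first group
        self.groups.append(encoded_group)
        self.size += len(encoded_group)
        too_old = self.clock() - self.started > self.max_age_s
        too_big = self.size > self.max_bytes
        if too_old or too_big:
            sealed = self.groups
            self.groups, self.size, self.started = [], 0, None
            return sealed
        return None


batch = Batch(max_bytes=500_000, max_age_s=60)
sealed = batch.close_group(b"\x00" * 1200)   # one 1,200-byte group: nothing sealed yet
print(sealed)
```

Passing the clock in as a parameter keeps the timeout path testable without sleeping.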

The "Do Not Batch" Exception

Some values are too important to wait for batch finalization. Equipment alarms, pump state changes, emergency stops — these need to reach the cloud immediately. These tags are flagged as "do not batch" in the configuration.

When a do-not-batch tag changes value, it bypasses the normal batch pipeline entirely. A mini-batch is created on the spot — containing just that single value — and pushed directly to the outgoing buffer. This ensures sub-second cloud visibility for critical state changes, while bulk telemetry still benefits from batch efficiency.

Tag: "Pump Status"    interval: 1s   do_not_batch: true
Tag: "Heater Status"  interval: 1s   do_not_batch: true
Tag: "Delivery Temp"  interval: 60s  do_not_batch: false  ← normal batching

The Buffer Layer: Surviving Disconnections

This is where most IIoT implementations fail. The batch layer produces data. The MQTT layer consumes it. But what sits between them? If it's nothing, or just an unbounded in-memory queue with no delivery tracking, you'll either drop data on disconnect or run out of memory during a long outage.

Page-Based Ring Buffer Architecture

The production-grade answer is a page-based ring buffer — a fixed-size memory region divided into equal-sized pages that cycle through three states:

States:
FREE → Available for writing
WORK → Currently being filled with batch data
USED → Filled, waiting for MQTT delivery

Lifecycle:
FREE → WORK (when first data is added)
WORK → USED (when page is full or batch is finalized)
USED → transmit → delivery ACK → FREE (recycled)

Here's how it works:

Memory layout: At startup, a contiguous block of memory is allocated (e.g., 2MB). This block is divided into pages of a configured size, typically matched to the maximum batch size, which in turn must fit within the broker's maximum MQTT packet size. Each page has a small header tracking its state and a data area.

┌──────────────────────────────────────────────┐
│ [Page 0: USED] [Page 1: USED] [Page 2: WORK]│
│ [Page 3: FREE] [Page 4: FREE] [Page 5: FREE]│
│ [Page 6: FREE] ... [Page N: FREE] │
└──────────────────────────────────────────────┘

Writing data: When a batch is finalized, its serialized bytes are written to the current WORK page. Each message gets a small header: a 4-byte message ID slot (filled later by the MQTT library) and a 4-byte size field. If the current page can't fit the next message, it transitions to USED and a fresh FREE page becomes the new WORK page.

Overflow handling: When all FREE pages are exhausted, the buffer reclaims the oldest USED page — the one that's been waiting for delivery the longest. This means you lose old data rather than new data, which is the right trade-off: the most recent readings are the most valuable. An overflow warning is logged so operators know the buffer is under pressure.

Delivery: When the MQTT connection is active, the buffer walks through USED pages and publishes their contents. Each publish gets a packet ID from the MQTT library. When the broker ACKs the packet (via the PUBACK callback for QoS 1), the corresponding page is recycled to FREE.

Disconnection recovery: When the MQTT connection drops:

  1. The disconnect callback fires
  2. The buffer marks itself as disconnected
  3. Data continues accumulating in pages (WORK → USED)
  4. When reconnected, the buffer immediately starts draining USED pages

No data is lost unless the buffer physically overflows. With 2MB of buffer and 500KB page size, you get 4 pages of headroom — enough to survive several minutes of disconnection at typical telemetry rates.
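The page lifecycle can be modeled in a few lines. The sketch below is a simplified, single-threaded model: pages are Python lists rather than slices of one pre-allocated block, one acknowledgment frees a whole page, and locking is omitted. All names are illustrative.

```python
from collections import deque


class PageBuffer:
    """Pages move FREE -> WORK -> USED -> FREE; overflow reclaims the oldest USED page."""

    def __init__(self, page_count: int, page_size: int):
        self.page_size = page_size
        self.free = deque([] for _ in range(page_count))
        self.used = deque()
        self.work = None
        self.work_bytes = 0
        self.overflows = 0

    def _new_work_page(self):
        if self.free:
            page = self.free.popleft()          # FREE -> WORK
        else:
            page = self.used.popleft()          # overflow: drop oldest undelivered page
            page.clear()
            self.overflows += 1
        self.work, self.work_bytes = page, 0

    def add(self, message: bytes):
        if len(message) > self.page_size:
            raise ValueError("message larger than a page")
        if self.work is not None and self.work_bytes + len(message) > self.page_size:
            self.used.append(self.work)         # WORK -> USED
            self.work = None
        if self.work is None:
            self._new_work_page()
        self.work.append(message)
        self.work_bytes += len(message)

    def flush(self):
        """Seal a non-empty WORK page so it becomes eligible for delivery."""
        if self.work:
            self.used.append(self.work)
            self.work = None

    def next_to_send(self):
        return self.used[0] if self.used else None

    def on_puback(self):
        """Recycle the page returned by next_to_send() once the broker has ACKed it."""
        page = self.used.popleft()
        page.clear()
        self.free.append(page)                  # USED -> FREE


buf = PageBuffer(page_count=3, page_size=10)
for i in range(5):
    buf.add(bytes([i]) * 10)                    # each message fills a whole page
print(buf.overflows, len(buf.used), len(buf.free))
```

Five page-sized messages into three pages force two overflows, and the two oldest messages are the ones discarded. A production version would also have to avoid reclaiming a page whose publish is still in flight.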

Thread Safety

The PLC read loop and the MQTT event loop run on different threads. The buffer must be thread-safe. Every buffer operation acquires a mutex:

  • buffer_add_data() — called from the PLC read thread after batch finalization
  • buffer_process_data_delivered() — called from the MQTT callback thread on PUBACK
  • buffer_process_connect() / buffer_process_disconnect() — called from MQTT lifecycle callbacks

Without proper locking, you'll see corrupted pages, double-free crashes, and mysterious data loss under load. This is non-negotiable.

Sizing the Buffer

Buffer sizing depends on three variables:

  1. Data rate: How many bytes per second does your polling loop produce?
  2. Expected outage duration: How long do you need to survive without MQTT?
  3. Available memory: Edge devices (especially industrial routers) have limited RAM

Example calculation:

  • 200 tags, average 6 bytes each (including binary overhead) = 1,200 bytes/group
  • Polling every 1 second = 1,200 bytes/second = 72KB/minute
  • Target: survive 30-minute outage = 2.16MB buffer
  • With 500KB pages = 5 pages minimum (round up for safety)

In practice, 2–4MB covers most scenarios. On a 32MB industrial router, that's well within budget.
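The example calculation generalizes to a small helper, sketched here under the assumption that "500KB" means 500,000 bytes and that the data rate is constant for the whole outage. The function name and parameters are illustrative.

```python
import math


def buffer_pages(tags: int, bytes_per_tag: int, poll_interval_s: float,
                 outage_s: float, page_bytes: int) -> tuple[int, int]:
    """Return (bytes_needed, pages_needed) to ride out an outage of outage_s seconds."""
    bytes_per_second = tags * bytes_per_tag / poll_interval_s
    bytes_needed = math.ceil(bytes_per_second * outage_s)
    return bytes_needed, math.ceil(bytes_needed / page_bytes)


needed, pages = buffer_pages(tags=200, bytes_per_tag=6, poll_interval_s=1.0,
                             outage_s=30 * 60, page_bytes=500_000)
print(needed, pages)
```

Rounding up to whole pages matters because a page that is currently being written cannot be delivered yet, so a little extra headroom is usually worth the memory.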

The MQTT Layer: QoS, Reconnection, and Watchdogs

QoS 1: At-Least-Once Delivery

For industrial telemetry, QoS 1 is the right choice:

  • QoS 0 (fire and forget): No delivery guarantee. Unacceptable for production data.
  • QoS 1 (at least once): Broker ACKs every message. Duplicates possible but data loss prevented. Good trade-off.
  • QoS 2 (exactly once): Eliminates duplicates but doubles the handshake overhead. Rarely worth it for telemetry.

The page buffer's recycling logic depends on QoS 1: pages are only freed when the PUBACK arrives. If the ACK never comes (connection drops mid-transmission), the page stays in USED state and will be retransmitted after reconnection.

Connection Watchdog

MQTT connections can enter a zombie state — the TCP socket is open, the MQTT loop is running, but no data is actually flowing. This happens when network routing changes, firewalls silently drop the connection, or the broker becomes unresponsive.

The fix: a watchdog timer that monitors delivery acknowledgments. If no PUBACK has been received within a timeout window (e.g., 120 seconds) and data has been queued for transmission, force a reconnect:

if (now - last_delivered_packet_time > 120s) {
    if (has_pending_data) {
        // Force MQTT reconnection
        reset_mqtt_client();
    }
}

This catches the edge case where the MQTT library thinks it's connected but the network is actually dead. Without this watchdog, your edge daemon could silently accumulate hours of undelivered data in the buffer, eventually overflowing and discarding the oldest pages.

Asynchronous Connection

MQTT connection establishment (DNS resolution, TLS handshake, CONNACK) can take several seconds, especially over cellular links. This must not block the PLC read loop. The connection should happen on a separate thread:

  1. Main thread detects connection is needed
  2. Connection thread starts connect_async()
  3. Main thread continues reading PLCs
  4. On successful connect, the callback fires and buffer delivery begins

If the connection thread is still working when a new connection attempt is needed, skip it — don't queue multiple connection attempts or you'll thrash the network stack.
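The skip-if-busy rule can be sketched with a plain thread guard. This is an assumption-laden illustration: `AsyncConnector` and `connect_fn` are hypothetical names, and `connect_fn` stands in for whatever blocking connect call the MQTT library provides.

```python
import threading

class AsyncConnector:
    def __init__(self, connect_fn):
        self._connect_fn = connect_fn   # blocking: DNS, TLS handshake, CONNACK
        self._lock = threading.Lock()
        self._thread = None

    def request_connect(self):
        """Start a connection attempt unless one is already running.

        Returns True if a new attempt was started, False if it was skipped.
        """
        with self._lock:
            if self._thread is not None and self._thread.is_alive():
                return False   # an attempt is in progress, do not queue another
            self._thread = threading.Thread(target=self._connect_fn, daemon=True)
            self._thread.start()
            return True
```

The main loop can call `request_connect()` on every iteration while disconnected; the call returns immediately either way, so PLC polling is never delayed.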

TLS for Production

Any MQTT connection leaving your plant network must use TLS. Period. Industrial telemetry data — temperatures, pressures, equipment states, alarm conditions — is operationally sensitive. On the wire without encryption, anyone on the network path can see (and potentially modify) your readings.

For cloud brokers like Azure IoT Hub, TLS is mandatory. The edge daemon should:

  • Load the CA certificate from a PEM file
  • Use MQTT v3.1.1 protocol (widely supported, well-tested)
  • Monitor the SAS token expiration timestamp and alert before it expires
  • Automatically reinitialize the MQTT client when the certificate or connection string changes (file modification detected via stat())
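The file-change check in the last bullet can be sketched with `os.stat`. The `FileWatcher` class is a hypothetical helper, and comparing modification time plus size is one reasonable heuristic, not necessarily what any particular daemon does.

```python
import os

class FileWatcher:
    """Detects changes to a certificate or connection-string file via stat()."""

    def __init__(self, path):
        self.path = path
        self._last = self._signature()

    def _signature(self):
        try:
            st = os.stat(self.path)
        except FileNotFoundError:
            return None   # a missing file is a distinct state
        return (st.st_mtime_ns, st.st_size)

    def changed(self):
        """Return True once for each observed change, then False until the next."""
        current = self._signature()
        if current != self._last:
            self._last = current
            return True
        return False
```

Polled once per main-loop iteration, a `True` result would be the trigger to tear down and reinitialize the MQTT client with the new credentials.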

Daemon Status Reporting

A well-designed edge daemon reports its own health back through the same MQTT channel it uses for telemetry. A periodic status message should include:

  • System uptime and daemon uptime — detect restarts
  • PLC link state — is the PLC connection healthy?
  • Buffer state — how full is the outgoing buffer?
  • MQTT state — connected/disconnected, last ACK time
  • SAS token expiration — days until credentials expire
  • Software version — for remote fleet management

An extended status format can include per-tag state: last read time, last delivery time, current value, and error count. This is invaluable for remote troubleshooting — you can see from the cloud exactly which tags are stale and why.

Value Comparison and Change Detection

Not all values need to be sent every polling cycle. A temperature that's been 72.4°F for the last hour doesn't need to be transmitted 3,600 times. Change detection — comparing the current value to the last sent value — can dramatically reduce bandwidth.

The implementation: each tag stores its last transmitted value. After reading, compare:

if (tag.compare_enabled && tag.has_been_read_once) {
    if (current_value == tag.last_value) {
        skip_this_value(); // Don't add to batch
    }
}

Important caveats:

  • Not all tags should use comparison. Continuous process variables (temperatures, flows) should always send, even if unchanged — the recipient needs the full time series to calculate trends and detect flatlines (a stuck sensor reads the same value forever, which is itself a fault condition).
  • Discrete state tags (booleans, enums) are ideal for comparison — they change rarely and each change is significant.
  • Floating-point comparison should use an epsilon threshold, not exact equality, to avoid sending noise from ADC jitter.
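The epsilon rule from the last caveat can be sketched as follows. The function name and the `deadband` parameter are illustrative assumptions; the right threshold depends on the sensor's noise floor.

```python
def should_send(current, last_sent, deadband=None):
    """Decide whether a freshly read value belongs in the next batch.

    last_sent is None until the tag has been transmitted once.
    deadband is None for discrete tags (exact comparison) or a positive
    threshold for analog tags.
    """
    if last_sent is None:
        return True                      # first read is always sent
    if deadband is None:
        return current != last_sent      # booleans, enums, integers
    return abs(current - last_sent) > deadband

print(should_send(72.43, 72.4, deadband=0.1))  # False: within the noise band
print(should_send(72.6, 72.4, deadband=0.1))   # True: a real change
```

Comparing against the last sent value, rather than the last read value, matters: a slow ramp made of many sub-threshold steps still gets reported once the accumulated change exceeds the deadband.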

Putting It All Together: The Main Loop

The complete edge daemon main loop ties all these layers together:

1. Parse configuration (device addresses, tag lists, MQTT credentials)
2. Allocate memory (PLC config pool + output buffer)
3. Format output buffer into pages
4. Start MQTT connection thread
5. Detect PLC device (probe address, determine type/protocol)
6. Load device-specific tag configuration

MAIN LOOP (runs every 1 second):
a. Check for config file changes → restart if changed
b. Read PLC tags (coalesced Modbus/EtherNet/IP)
c. Add values to batch (with comparison filtering)
d. Check batch finalization triggers (size/timeout)
e. Process incoming commands (config updates, force reads)
f. Check MQTT connection watchdog
g. Sleep 1 second

Polling, batching, buffering, and delivery bookkeeping all run within this single loop iteration (only the MQTT connection attempt runs on its own thread), which keeps the system deterministic and debuggable.

How machineCDN Implements This

The machineCDN edge runtime implements this full stack natively on resource-constrained industrial routers. The page-based ring buffer runs in pre-allocated memory (no dynamic allocation after startup), the MQTT layer handles Azure IoT Hub and local broker configurations interchangeably, and the batch layer supports both JSON and binary encoding selectable per-device.

On a Teltonika RUT9xx router with 256MB RAM, the daemon typically uses under 4MB total — including 2MB of buffer space that can store 20+ minutes of telemetry during a connectivity outage. Tags are automatically sorted, coalesced, and dispatched with zero configuration beyond listing the tag names and addresses.

The result: edge gateways that have been running continuously for years in production environments, surviving cellular dropouts, network reconfigurations, and even firmware updates without losing a single data point.

Conclusion

Reliable telemetry delivery isn't about the protocol — it's about the pipeline. Modbus reads are the easy part. The hard engineering is in the layers between: batching values efficiently, buffering them through disconnections, and confirming delivery before recycling memory.

The key design principles:

  1. Never block the read loop — PLC polling is sacred
  2. Buffer with finite, pre-allocated memory — dynamic allocation on embedded systems is asking for trouble
  3. Reclaim oldest data first — in overflow, recent values matter more
  4. Acknowledge before recycling — a page stays USED until the broker confirms receipt
  5. Watch for zombie connections — a connected socket doesn't mean data is flowing

Get these right, and your edge infrastructure becomes invisible — which is exactly what production IIoT should be.

Best Remote Machine Monitoring Software 2026: Monitor Your Factory Floor From Anywhere

· 10 min read
MachineCDN Team
Industrial IoT Experts

The maintenance manager who physically walks the plant floor checking gauges is going extinct. Not because the role is less important — but because manufacturers have figured out that real-time remote access to machine data catches problems faster, costs less, and scales better than clipboard rounds ever could.

Remote machine monitoring software connects your factory equipment to the cloud, giving you live dashboards, instant alerts, and full production analytics on your laptop, tablet, or phone — whether you're on the plant floor, in a corner office, or at home on a Sunday night when the alarm goes off.

Here's what the best remote monitoring platforms deliver in 2026, and how to evaluate them for your operation.

The ROI of Real-Time Machine Data: Why a 5-Minute Data Lag Costs You Thousands

· 11 min read
MachineCDN Team
Industrial IoT Experts

There's a question that every manufacturing executive should be asking but few actually do: How much does data latency cost us?

Not "do we have data?" — most factories have some form of production reporting. The question is about when that data reaches the people who can act on it. The difference between a 2-second alert and a 5-minute update isn't a technical detail. It's the difference between catching a $200 problem and cleaning up a $20,000 disaster.

This article quantifies that difference. We'll look at what happens at each level of data latency — real-time, near-real-time, delayed, and batch — and show exactly where the money goes when your data lags behind your machines.

Top 8 Samsara Alternatives for Industrial IoT and Manufacturing in 2026

· 9 min read
MachineCDN Team
Industrial IoT Experts

Samsara has built a massive business around fleet management, GPS tracking, and operational visibility. But if you're a manufacturing engineer or plant manager evaluating Samsara for factory floor monitoring, you've probably noticed something: Samsara was built for trucks and field operations, not for PLCs and production lines.

Serial Number Tracking in Manufacturing: How IIoT Enables Complete Product Traceability

· 9 min read
MachineCDN Team
Industrial IoT Experts

When a customer calls with a warranty claim, a regulator demands a recall, or a quality issue surfaces in the field, the first question is always the same: "Which production run made this unit?"

If your serial number tracking is a spreadsheet updated by hand at the end of each shift, you're guessing. If your MES captures serial numbers but not the machine conditions during production, you have traceability in name but not in substance. True product traceability means connecting every serial number to the complete story of how that unit was manufactured — which machine, which operating conditions, which materials, which operator, which shift.

IIoT platforms that capture real-time production data from PLCs are uniquely positioned to deliver this level of traceability, because the PLC already knows when each unit was produced and under what conditions. The question is whether your serialization system is connected to that data.

PTC ThingWorx Pricing in 2026: What Does ThingWorx Actually Cost?

· 7 min read
MachineCDN Team
Industrial IoT Experts

PTC ThingWorx is one of the most recognized names in industrial IoT, consistently appearing in Gartner and Forrester reports. But behind that brand recognition sits one of the most complex — and expensive — pricing structures in the IIoT market. If you're evaluating ThingWorx for your manufacturing operations, this guide will help you understand what you're actually signing up for.

Time Synchronization in Industrial IoT: Why Milliseconds Matter on the Factory Floor [2026]

· 10 min read

Time synchronization across industrial IoT devices

When a batch blender reports a weight deviation at 14:32:07.341 and the downstream alarm system logs a fault at 14:32:07.892, the 551-millisecond gap tells an engineer something meaningful — the weight spike preceded the alarm by half a second, pointing to a feed hopper issue rather than a sensor failure.

But if those timestamps came from devices with unsynchronized clocks, the entire root cause analysis falls apart. The weight deviation might have actually occurred after the alarm. Every causal inference becomes unreliable.

Time synchronization isn't a nice-to-have in industrial IoT — it's the foundation that makes every other data point trustworthy.

The Time Problem in Manufacturing

A typical factory floor has dozens of time sources that disagree with each other:

  • PLCs running internal clocks that drift 1–5 seconds per day
  • Edge gateways syncing to NTP servers over cellular connections with variable latency
  • SCADA historians timestamping on receipt rather than at the source
  • Cloud platforms operating in UTC while operators think in local time
  • Batch systems logging in the timezone of the plant that configured them

The result: a single production event might carry three different timestamps depending on which system you query. Multiply that across 50 machines in 4 plants across 3 time zones, and your "single source of truth" becomes a contradictory mess.

Why Traditional IT Time Sync Falls Short

In enterprise IT, NTP (Network Time Protocol) synchronizes servers to within a few milliseconds and everyone moves on. Factory floors are different:

  1. Air-gapped networks: Many OT networks have no direct internet access for NTP
  2. Deterministic requirements: Process control needs microsecond precision that standard NTP can't guarantee
  3. Legacy devices: PLCs from 2005 might not support NTP at all
  4. Timezone complexity: A single machine might have components configured in UTC, local time, and "plant time" (an arbitrary reference the original integrator chose)
  5. Daylight saving transitions: A one-hour clock jump during a 24-hour production run creates data gaps or overlaps

Protocol Options: NTP vs. PTP vs. GPS

NTP (Network Time Protocol)

Accuracy: 1–50ms over LAN, 10–100ms over WAN

NTP is the workhorse for most IIoT deployments. It's universally supported, works over standard IP networks, and provides millisecond-level accuracy that's sufficient for 90% of manufacturing use cases.

Best practice for edge gateways:

# /etc/ntp.conf for an edge gateway
server 0.pool.ntp.org iburst
server 1.pool.ntp.org iburst

# Local fallback — GPS or local stratum-1
server 192.168.1.1 prefer

# Drift file to compensate for oscillator aging
driftfile /var/lib/ntp/ntp.drift

# Restrict to prevent the gateway from being used as a time source
restrict default nomodify notrap nopeer noquery

The iburst flag is critical for edge gateways that might lose connectivity. When the NTP client reconnects, iburst sends 8 rapid packets instead of waiting for the normal 64-second polling interval, reducing convergence time from minutes to seconds.

Key limitation: NTP assumes symmetric network delay. On cellular connections where upload latency (50–200ms) differs from download latency (30–80ms), NTP's accuracy degrades to ±50ms or worse.
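The effect of asymmetry follows directly from the standard NTP offset formula, which averages the two one-way measurements. The simulation below is a sketch: the `simulate` helper and its parameters are hypothetical, and the delays are taken from the ranges quoted above.

```python
def ntp_offset(t1, t2, t3, t4):
    """Standard NTP offset estimate.

    t1: client send, t2: server receive, t3: server send, t4: client receive.
    """
    return ((t2 - t1) + (t3 - t4)) / 2

def simulate(true_offset_ms, uplink_ms, downlink_ms, server_processing_ms=0):
    """Return the offset NTP would estimate for the given one-way delays."""
    t1 = 0                                        # client clock
    t2 = t1 + uplink_ms + true_offset_ms          # server clock
    t3 = t2 + server_processing_ms                # server clock
    t4 = (t3 - true_offset_ms) + downlink_ms      # back on the client clock
    return ntp_offset(t1, t2, t3, t4)

print(simulate(true_offset_ms=0, uplink_ms=40, downlink_ms=40))   # 0.0
print(simulate(true_offset_ms=0, uplink_ms=200, downlink_ms=30))  # 85.0
```

With symmetric delays the estimate is exact. With a 200ms uplink and a 30ms downlink, a perfectly synchronized clock appears to be 85ms off, because the error is always half the difference between the two path delays.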

PTP (Precision Time Protocol / IEEE 1588)

Accuracy: sub-microsecond with hardware timestamping

PTP is the gold standard for applications where sub-millisecond accuracy matters — motion control, coordinated robotics, or synchronized sampling across multiple sensors.

However, PTP requires:

  • Network switches that support PTP (transparent or boundary clock mode)
  • Hardware timestamping NICs on endpoints
  • Careful network design to minimize asymmetric paths

For most discrete manufacturing (batch blending, extrusion, drying), PTP is overkill. The extra infrastructure cost rarely justifies the precision gain over well-configured NTP.

GPS-Disciplined Clocks

Accuracy: 50–100 nanoseconds

A GPS receiver with a clear sky view provides the most accurate time reference independent of network infrastructure. Some edge gateways include GPS modules that serve dual purposes — location tracking for mobile assets and time synchronization for the local network.

Practical deployment:

# chronyd configuration with GPS PPS
refclock PPS /dev/pps0 lock NMEA refid GPS
refclock SHM 0 poll 3 refid NMEA noselect

GPS-disciplined clocks work exceptionally well as local stratum-1 NTP servers. The server itself holds sub-microsecond accuracy, and devices that sync to it over NTP get LAN-grade accuracy (typically around a millisecond on a quiet network) without depending on internet connectivity.

Timestamp Handling at the Edge

The edge gateway sits between PLCs that think in register values and cloud platforms that expect ISO 8601 timestamps. Getting this translation right is where most deployments stumble.

Strategy 1: Gateway-Stamped Timestamps

The simplest approach — the edge gateway applies its own timestamp when it reads data from the PLC.

Pros:

  • Consistent time source across all devices
  • Works with any PLC, regardless of clock capabilities
  • Single NTP configuration to maintain

Cons:

  • Introduces polling latency as timestamp error (if you poll every 5 seconds, your timestamp could be up to 5 seconds late)
  • Loses sub-poll precision for fast-changing values
  • Multiple devices behind one gateway share the gateway's clock accuracy

When to use: Slow-moving process variables (temperatures, pressures, levels) where 1–5 second accuracy is sufficient.

Strategy 2: PLC-Sourced Timestamps

Some PLCs (Siemens S7-1500, Allen-Bradley CompactLogix) can include timestamps in their responses. The gateway reads both the value and the PLC's timestamp.

Pros:

  • Microsecond precision at the source
  • No polling latency error
  • Accurate even with irregular polling intervals

Cons:

  • Requires PLC clock synchronization (the PLC's internal clock must be accurate)
  • Not all PLCs support timestamped reads
  • Different PLC brands use different epoch formats (some use 1970, others 1984, others 2000)

When to use: High-speed processes (injection molding cycles, press operations) where sub-second event correlation matters.
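The epoch mismatch noted in the cons list is usually handled with a per-model conversion table. The sketch below assumes the PLC reports whole seconds since its epoch and that its clock is set to UTC; both are assumptions to verify against the vendor documentation, and the function name is hypothetical.

```python
from datetime import datetime, timezone

# Epoch start for each convention mentioned above, expressed in UTC.
EPOCHS = {
    1970: datetime(1970, 1, 1, tzinfo=timezone.utc),
    1984: datetime(1984, 1, 1, tzinfo=timezone.utc),
    2000: datetime(2000, 1, 1, tzinfo=timezone.utc),
}

def to_unix_ms(raw_seconds, epoch_year):
    """Convert seconds since a PLC-specific epoch to Unix milliseconds."""
    offset_s = int(EPOCHS[epoch_year].timestamp())
    return (raw_seconds + offset_s) * 1000

print(to_unix_ms(0, 2000))  # 946684800000
```

Getting the epoch wrong produces timestamps that are off by 14 or 30 years, which is at least easy to spot; a PLC clock set to local time instead of UTC produces a much subtler error of a few hours.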

Strategy 3: Hybrid Approach

The most robust strategy combines both:

  1. Gateway records its own timestamp at read time
  2. If the PLC provides a source timestamp, both are stored
  3. The cloud platform calculates and monitors the delta between gateway and PLC clocks
  4. If delta exceeds a threshold (e.g., 500ms), an alert fires for clock drift investigation

A sample carrying both timestamps might look like this:

{
"device_id": "SN-4821",
"tag": "hopper_weight",
"value": 247.3,
"gateway_ts": 1709312547341,
"source_ts": 1709312547298,
"delta_ms": 43
}

This hybrid approach lets you detect clock drift before it corrupts your analytics — and provides both timestamps for forensic analysis.
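The delta check can be sketched in a few lines. The function names are hypothetical, the field names follow the sample payload, and the 500ms threshold is the example value from the list above; whether the check runs at the gateway or in the cloud is a deployment choice.

```python
DRIFT_ALERT_MS = 500  # example threshold, tune per application

def build_sample(device_id, tag, value, gateway_ts_ms, source_ts_ms=None):
    """Assemble a sample; source_ts and delta_ms appear only if the PLC gave a timestamp."""
    sample = {
        "device_id": device_id,
        "tag": tag,
        "value": value,
        "gateway_ts": gateway_ts_ms,
    }
    if source_ts_ms is not None:
        sample["source_ts"] = source_ts_ms
        sample["delta_ms"] = gateway_ts_ms - source_ts_ms
    return sample

def drift_alert(sample, threshold_ms=DRIFT_ALERT_MS):
    """True when gateway and PLC clocks disagree by more than the threshold."""
    delta = sample.get("delta_ms")
    return delta is not None and abs(delta) > threshold_ms

s = build_sample("SN-4821", "hopper_weight", 247.3, 1709312547341, 1709312547298)
print(s["delta_ms"], drift_alert(s))  # 43 False
```

A small, stable delta mostly reflects polling and transport latency. What deserves an alert is a delta that is large or that keeps growing, since that points to one of the two clocks drifting.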

Timezone Management Across Multi-Site Deployments

Time synchronization is about getting clocks accurate. Timezone management is about interpreting those accurate clocks correctly. They're separate problems that compound when combined poorly.

The UTC-Everywhere Approach

Store everything in UTC. Convert on display.

This is the correct strategy, but implementing it correctly requires discipline:

  1. Edge gateways transmit Unix timestamps (seconds or milliseconds since epoch) — inherently UTC
  2. Databases store timestamps as UTC integers or timestamptz columns
  3. APIs return UTC with explicit timezone indicators
  4. Dashboards convert to the user's configured timezone on render

The failure mode: someone hard-codes a timezone offset in the edge gateway configuration. When daylight saving time changes, every historical query returns data shifted by one hour for half the year.

Per-Device Timezone Assignment

In multi-plant deployments, each device needs a timezone association — not for data storage (which remains UTC), but for:

  • Shift calculations: "First shift" means 6:00 AM in the plant's local time
  • OEE windows: Planned production time is defined in local time
  • Downtime classification: Non-production hours (nights, weekends) depend on the plant's calendar
  • Report generation: Daily summaries should align with the plant's operating day, not UTC midnight

The timezone should be associated with the location, not the device. When a device is moved between plants, it inherits the new location's timezone automatically.
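One way to model this is a two-step lookup: device to location, then location to timezone. The dictionaries, identifiers, and function name below are hypothetical placeholders for whatever asset registry the platform uses.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Timezones belong to locations; devices only point at a location.
LOCATION_TZ = {
    "plant-chicago": "America/Chicago",
    "plant-munich": "Europe/Berlin",
}
DEVICE_LOCATION = {"SN-4821": "plant-chicago"}

def local_production_date(device_id, unix_ts):
    """Map a UTC Unix timestamp to the plant-local calendar date."""
    tz = ZoneInfo(LOCATION_TZ[DEVICE_LOCATION[device_id]])
    utc_dt = datetime.fromtimestamp(unix_ts, tz=timezone.utc)
    return utc_dt.astimezone(tz).date()

ts = datetime(2026, 3, 9, 3, 0, tzinfo=timezone.utc).timestamp()
print(local_production_date("SN-4821", ts))  # 2026-03-08 (still evening in Chicago)
```

Moving the device is then a single update to `DEVICE_LOCATION`; the stored UTC timestamps never change, only the local interpretation applied to them.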

Handling Daylight Saving Transitions

The spring-forward transition creates a one-hour gap. The fall-back transition creates a one-hour overlap. Both wreak havoc on:

  • OEE availability calculations: A 23-hour day in spring inflates availability; a 25-hour day in fall deflates it
  • Production counters: Shift-based counting might miss or double-count an hour
  • Alarm timestamps: An alarm at 2:30 AM during fall-back is ambiguous — which 2:30 AM?

Mitigation:

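# Always use timezone-aware datetime libraries
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

plant_tz = ZoneInfo("America/Chicago")
utc_ts = datetime(2026, 3, 8, 8, 0, 0, tzinfo=timezone.utc)
local_time = utc_ts.astimezone(plant_tz)  # 03:00 CDT, just after spring-forward

# For OEE calculations, use calendar day boundaries in local time
day_start = datetime(2026, 3, 8, 0, 0, 0, tzinfo=plant_tz)
day_end = datetime(2026, 3, 9, 0, 0, 0, tzinfo=plant_tz)
# Convert to UTC before subtracting: when both operands share the same tzinfo,
# Python ignores the UTC offset and would always report 24 hours
elapsed = day_end.astimezone(timezone.utc) - day_start.astimezone(timezone.utc)
planned_hours = elapsed.total_seconds() / 3600  # 23.0 on this spring-forward day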

Clock Drift Detection and Compensation

Even with NTP, clocks drift. Industrial environments make it worse — temperature extremes, vibration, and aging oscillators all degrade crystal accuracy.

Monitoring Drift Systematically

Every edge gateway should report its NTP offset as telemetry alongside process data:

| Metric | Acceptable Range | Warning | Critical |
|---|---|---|---|
| NTP offset | ±10ms | ±100ms | ±500ms |
| NTP jitter | <5ms | <50ms | <200ms |
| NTP stratum | 2–3 | 4–5 | 6+ |
| Last sync | <300s ago | <3600s ago | >3600s ago |

When an edge gateway goes offline (cellular outage, power cycle), its clock immediately starts drifting. A typical crystal oscillator drifts 20–100 ppm. At the 100 ppm upper bound, that translates to:

  • 1 minute offline: ±6ms drift (negligible)
  • 1 hour offline: ±360ms drift (starting to matter)
  • 1 day offline: ±8.6 seconds drift (data alignment problems)
  • 1 week offline: ±60 seconds drift (shift calculations break)

Compensating for Known Drift

If a gateway was offline for a known period and its drift rate is characterized:

corrected_ts = raw_ts - (drift_rate_ppm × elapsed_seconds × 1e-6)

Some industrial time-series databases support retroactive timestamp correction — ingesting data with provisional timestamps and correcting them when the clock re-synchronizes. This is far better than discarding data from offline periods.
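The correction formula translates directly into code. This sketch assumes timestamps in seconds, a constant drift rate over the offline period, and the sign convention that a positive ppm value means the gateway clock runs fast; the function names are illustrative.

```python
def drift_seconds(drift_rate_ppm, elapsed_seconds):
    """Accumulated clock error after running unsynchronized for elapsed_seconds."""
    return drift_rate_ppm * elapsed_seconds * 1e-6

def correct_timestamp(raw_ts, drift_rate_ppm, elapsed_seconds):
    """Apply corrected_ts = raw_ts - (drift_rate_ppm * elapsed_seconds * 1e-6)."""
    return raw_ts - drift_seconds(drift_rate_ppm, elapsed_seconds)

# Worst-case drift at 100 ppm after one hour and after one day offline
print(round(drift_seconds(100, 3600), 3))    # 0.36
print(round(drift_seconds(100, 86400), 3))   # 8.64
```

For backfilled data, `elapsed_seconds` should be measured from the last successful sync to each sample, so samples recorded late in the outage receive a larger correction than early ones.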

Practical Implementation Checklist

For any new IIoT deployment, this checklist prevents the most common time-related failures:

  1. Configure NTP on every edge gateway with at least 2 upstream servers and a local fallback
  2. Set drift file paths so NTP can learn the oscillator's characteristics over time
  3. Store all timestamps as UTC — no exceptions, no "plant time" columns
  4. Associate timezones with locations, not devices
  5. Log NTP status (offset, jitter, stratum) as system telemetry
  6. Alert on drift exceeding application-specific thresholds
  7. Test DST transitions before they happen — simulate spring-forward and fall-back in staging
  8. Document epoch formats for every PLC model in the fleet (1970 vs. 2000 vs. relative)
  9. Use monotonic clocks for duration calculations (uptime, cycle time) — wall clocks are for event ordering
  10. Plan for offline operation — characterize drift rates and implement correction on reconnect
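Item 9 deserves a concrete illustration, since mixing the two clock types is an easy mistake. In this minimal sketch using Python's standard `time` module, `timed` is a hypothetical helper: durations come from the monotonic clock, while the event timestamp comes from the wall clock.

```python
import time

def timed(fn):
    """Run fn and return (result, elapsed_seconds) using the monotonic clock."""
    start = time.monotonic()
    result = fn()
    elapsed = time.monotonic() - start   # unaffected by NTP steps or DST changes
    return result, elapsed

# Wall clock: used only to say *when* something happened (Unix ms, UTC).
event_ts_ms = time.time_ns() // 1_000_000

result, elapsed = timed(lambda: sum(range(1000)))
print(result, elapsed >= 0)  # 499500 True
```

If NTP steps the wall clock backward mid-cycle, a duration computed from `time.time()` can come out negative; the monotonic clock cannot go backward, so cycle times and uptime stay sane.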

How machineCDN Handles Time at Scale

machineCDN's platform processes telemetry from edge gateways deployed across multiple plants and timezones. Every data point carries a UTC timestamp applied at the gateway level, and timezone interpretation happens at the application layer based on each device's location assignment.

This means OEE calculations, shift-based analytics, planned production schedules, and alarm histories are all timezone-aware without any timezone information embedded in the raw data stream. When a machine is reassigned to a different plant, its historical data remains correct in UTC — only the display context changes.

The result: engineers in Houston, São Paulo, and Munich can all look at the same machine's data and see it rendered in their local context, while the underlying data remains a single, unambiguous source of truth.


Time synchronization is the invisible infrastructure that makes everything else in IIoT reliable. Get it wrong, and you're building analytics on a foundation of sand. Get it right, and every alarm, every OEE calculation, and every root cause analysis becomes trustworthy.