
32 posts tagged with "industrial-protocols"


Modbus RTU Serial Link Tuning: Baud Rate, Parity, and Timeout Optimization for Reliable PLC Communication [2026]

· 11 min read

Modbus RTU Serial Communication

Modbus TCP gets all the attention in modern IIoT deployments, but Modbus RTU over RS-485 remains the workhorse of industrial communication. Millions of devices — temperature controllers, VFDs, power meters, PLCs, and process instruments — speak Modbus RTU natively. When you're building an edge gateway that bridges these devices to the cloud, getting the serial link parameters right is the difference between rock-solid telemetry and a frustrating stream of timeouts and CRC errors.

This guide covers the six critical parameters that govern Modbus RTU serial communication, the real-world trade-offs behind each one, and the tuning strategies that separate production-grade deployments from lab prototypes.

The Six Parameters That Control Everything

Every Modbus RTU serial connection is defined by six parameters that must match between the master (your gateway) and every slave device on the bus:

1. Baud Rate

The baud rate determines how many bits per second travel across the RS-485 bus. Common values:

| Baud Rate | Bits/sec | Typical Use Case |
| --- | --- | --- |
| 9600 | 9,600 | Legacy devices, long cable runs (>500m) |
| 19200 | 19,200 | Standard industrial default |
| 38400 | 38,400 | Mid-range, common in newer PLCs |
| 57600 | 57,600 | Higher-speed applications |
| 115200 | 115,200 | Short runs, high-frequency polling |

The real-world constraint: Every device on the RS-485 bus must use the same baud rate. If you have a mix of legacy power meters at 9600 baud and newer VFDs at 38400, you need separate RS-485 segments — each with its own serial port on the gateway.

Practical recommendation: Start at 19200 baud. It's universally supported, tolerant of cable lengths up to 1000m, and fast enough for most 1-second polling cycles. Only go higher if your polling budget demands it and your cable runs are short.

2. Parity Bit

Parity provides basic error detection at the frame level:

  • Even parity (E): Most common in industrial settings. The parity bit is set so the total number of 1-bits (data + parity) is even.
  • Odd parity (O): Less common, but some older devices default to it.
  • No parity (N): Removes the parity bit entirely. When using no parity, you typically add a second stop bit to maintain frame timing.

The Modbus RTU specification recommends even parity with one stop bit as the default (8E1). However, many devices ship configured for 8N1 (no parity, one stop bit) or 8N2 (no parity, two stop bits).

Why it matters: A parity mismatch doesn't generate an error message — the slave device simply ignores the malformed frame. You'll see ETIMEDOUT errors on the master side, and the failure mode looks identical to a wiring problem or wrong slave address. This is the single most common misconfiguration in Modbus RTU deployments.

3. Data Bits

Almost universally 8 bits in modern Modbus RTU. Some ancient devices use 7-bit ASCII mode, but if you encounter one in 2026, it's time for a hardware upgrade. Don't waste time debugging 7-bit configurations — the Modbus RTU specification mandates 8 data bits.

4. Stop Bits

Stop bits mark the end of each byte frame:

  • 1 stop bit: Standard when using parity (8E1 or 8O1)
  • 2 stop bits: Standard when NOT using parity (8N2)

The total frame length should be 11 bits: 1 start + 8 data + 1 parity + 1 stop, OR 1 start + 8 data + 0 parity + 2 stop. This 11-bit frame length matters because the Modbus RTU inter-frame gap (the "silent interval" between messages) is defined as 3.5 character times — and a "character time" is 11 bits at the configured baud rate.

5. Byte Timeout

This is where things get interesting — and where most tuning guides fall short.

The byte timeout (also called "inter-character timeout" or "character timeout") defines how long the master waits between individual bytes within a single response frame. If the gap between any two consecutive bytes exceeds this timeout, the master treats it as a frame boundary.

The Modbus RTU specification says: The inter-character gap must not exceed 1.5 character times. At 9600 baud with 11-bit frames, one character time is 11/9600 = 1.146ms, so the maximum inter-character gap is 1.5 × 1.146ms ≈ 1.72ms.

What actually happens in practice:

At 9600 baud:  1 char = 1.146ms → byte timeout ≈ 2ms (safe margin)
At 19200 baud: 1 char = 0.573ms → byte timeout ≈ 1ms
At 38400 baud: 1 char = 0.286ms → byte timeout ≈ 500μs
At 115200 baud: 1 char = 0.095ms → byte timeout ≈ 200μs

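The catch: At baud rates above 19200, the computed byte timeout drops well below 1ms. Most operating systems can't guarantee sub-millisecond timer resolution, especially embedded Linux builds such as OpenWrt. The Modbus serial line specification itself recommends fixed values above 19200 baud (750μs for the inter-character timeout and 1.750ms for the inter-frame delay), and many Modbus RTU implementations clamp the byte timeout to a minimum of around 500μs regardless of baud rate.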

Practical recommendation: Set byte timeout to max(500μs, 2 × character_time). On embedded gateways running Linux, use 1000μs (1ms) as a safe floor.
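The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch, assuming 11-bit characters and the max(500μs, 2 × character_time) rule from the recommendation; the function names are illustrative and not taken from any particular Modbus library.

```python
def char_time_us(baud_rate: int, bits_per_char: int = 11) -> float:
    """Time to transmit one character on the wire, in microseconds."""
    return bits_per_char * 1_000_000 / baud_rate


def byte_timeout_us(baud_rate: int, floor_us: float = 500.0) -> float:
    """Byte timeout using the rule above: max(floor, 2 x character time)."""
    return max(floor_us, 2 * char_time_us(baud_rate))


if __name__ == "__main__":
    for baud in (9600, 19200, 38400, 115200):
        print(f"{baud:>6} baud: char = {char_time_us(baud):7.1f} us, "
              f"byte timeout = {byte_timeout_us(baud):7.1f} us")
```

On an embedded Linux gateway you would pass floor_us=1000.0 to apply the 1ms floor suggested above.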

6. Response Timeout

The response timeout defines how long the master waits after sending a request before declaring the slave unresponsive. This is the most impactful tuning parameter for overall system performance.

Factors that affect response time:

  1. Request transmission time: At 19200 baud, an 8-byte request takes ~4.6ms
  2. Slave processing time: Varies from 1ms (simple register read) to 50ms+ (complex calculations or EEPROM access)
  3. Response transmission time: At 19200 baud, a 40-register response (~85 bytes) takes ~48.7ms
  4. RS-485 transceiver turnaround: 10-50μs

The math for worst case:

Total = request_tx + slave_processing + turnaround + response_tx
= 4.6ms + 50ms + 0.05ms + 48.7ms
≈ 103ms

Practical recommendation: Set response timeout to 200-500ms for most applications. Some slow devices (energy meters doing power quality calculations) may need 1000ms. If you're polling temperature sensors that update once per second, there's no penalty to a generous 500ms timeout.
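To rerun the worst-case math for a different baud rate or register count, a small helper is enough. This sketch assumes an 8-byte read request, 11-bit characters, and a read response of 5 framing bytes plus 2 bytes per register; the 50ms processing time and 0.05ms turnaround are the example figures from above, not measured values.

```python
def tx_time_ms(num_bytes: int, baud_rate: int, bits_per_char: int = 11) -> float:
    """Wire time for num_bytes characters, in milliseconds."""
    return num_bytes * bits_per_char * 1000 / baud_rate


def read_response_bytes(register_count: int) -> int:
    """Address (1) + function (1) + byte count (1) + data (2 per register) + CRC (2)."""
    return 5 + 2 * register_count


def worst_case_ms(baud_rate: int, register_count: int,
                  slave_processing_ms: float, turnaround_ms: float = 0.05) -> float:
    """Request time + slave processing + transceiver turnaround + response time."""
    request = tx_time_ms(8, baud_rate)
    response = tx_time_ms(read_response_bytes(register_count), baud_rate)
    return request + slave_processing_ms + turnaround_ms + response


if __name__ == "__main__":
    print(f"{worst_case_ms(19200, 40, 50.0):.1f} ms")
```

A response timeout of two to five times this figure leaves room for scheduling jitter on the gateway.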

The retry question: When a timeout occurs, should you retry immediately? In production edge gateways, the answer is nuanced:

  • Retry 2-3 times before marking the device as offline
  • Insert a 50ms pause between retries to let the bus settle
  • Flush the serial buffer between retries to clear any partial frames
  • Track consecutive timeouts — if a device fails 3 reads in a row, something has changed (wiring, device failure, address conflict)
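The retry policy in the list above can be sketched independently of any Modbus library. Here read_fn and flush_fn are hypothetical callables supplied by the caller (for example, wrappers around your client's read and buffer-flush calls), and the sketch assumes a timeout surfaces as Python's built-in TimeoutError, which may not match the exception your library raises.

```python
import time
from typing import Callable, Optional, Sequence


def read_with_retries(
    read_fn: Callable[[], Sequence[int]],
    flush_fn: Callable[[], None],
    retries: int = 3,
    retry_delay_s: float = 0.05,
) -> Optional[Sequence[int]]:
    """Try a read up to `retries` times; return None if every attempt times out."""
    for attempt in range(retries):
        try:
            return read_fn()
        except TimeoutError:
            flush_fn()  # drop any partial frame left in the receive buffer
            if attempt < retries - 1:
                time.sleep(retry_delay_s)  # let the bus settle before retrying
    return None
```

A None result is the signal to count a consecutive failure for that device and, after enough of them, mark it offline.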

RS-485 Bus Topology: The Physical Foundation

No amount of parameter tuning can compensate for bad RS-485 wiring. The critical rules:

Daisy-chain only. RS-485 is a multi-drop bus, NOT a star topology. Every device must be daisy-chained along a single trunk (A to A, B to B) from the first device to the last, with any stubs kept as short as possible.

[120Ω] Gateway ---[A/B]--- Device1 ---[A/B]--- Device2 ---[A/B]--- Device3 [120Ω]

Termination resistors: Place a 120Ω resistor across A and B at both ends of the bus — at the gateway and at the last device. Without termination, reflections on long cable runs cause bit errors that look like random CRC failures.

Cable length limits:

| Baud Rate | Max Cable Length |
| --- | --- |
| 9600 | 1200m (3900 ft) |
| 19200 | 1000m (3300 ft) |
| 38400 | 700m (2300 ft) |
| 115200 | 300m (1000 ft) |

Maximum devices per segment: The RS-485 spec allows 32 unit loads per segment. Modern 1/4 unit-load transceivers can push this to 128, but in practice, keep it under 32 devices to maintain signal integrity.

Slave Address Configuration

Every device on a Modbus RTU bus needs a unique slave address (1-247). Address 0 is broadcast-only (no response expected), and addresses 248-255 are reserved.

The slave address is configured at the device level — typically through DIP switches, front-panel menus, or configuration software. The gateway needs to match.

Common mistake: Address 1 is the factory default for almost every Modbus device. If you connect two new devices without changing addresses, neither will respond reliably — they'll both try to drive the bus simultaneously, corrupting each other's responses.

Contiguous Register Grouping: The Bandwidth Multiplier

The most impactful optimization for Modbus RTU polling performance isn't a link parameter — it's how you structure your read requests.

Modbus function codes 03 (Read Holding Registers) and 04 (Read Input Registers) can read up to 125 contiguous registers in a single request. Instead of issuing separate requests for each tag, a well-designed gateway groups tags by:

  1. Function code — holding registers and input registers require different commands
  2. Address contiguity — registers must be adjacent (no gaps)
  3. Polling interval — tags polled at the same rate should be read together
  4. Maximum PDU size — cap at 50-100 registers per request to avoid overwhelming slow devices

Example: If you need registers 40001, 40002, 40003, 40010, 40011:

  • Naive approach: 5 separate read requests (5 × 12ms round-trip = 60ms)
  • Grouped approach: Read 40001-40003 in one request, 40010-40011 in another (2 × 12ms = 24ms)

The gap between 40003 and 40010 breaks contiguity, so two requests are optimal under a strict no-gaps rule. But reading 40001-40011 as one block (11 registers) might be acceptable: the extra 6 "wasted" registers (40004-40009) add only ~7ms of transmission time at 19200 baud, which is less than the overhead of an additional request-response cycle.

Production-grade gateways build contiguous read groups at startup and maintain them throughout the polling cycle, only splitting groups when address gaps exceed a configurable threshold.
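A minimal sketch of that grouping step, assuming all tags passed in share the same function code and polling interval. The function name, the max_gap threshold parameter, and the (start, count) return shape are illustrative choices, not a specific gateway's API.

```python
from typing import Iterable, List, Tuple


def group_registers(
    addresses: Iterable[int], max_gap: int = 0, max_count: int = 125
) -> List[Tuple[int, int]]:
    """Return (start_address, register_count) read blocks covering all addresses.

    max_gap is how many unneeded registers may sit between two wanted
    registers before a new block is started.
    """
    blocks: List[Tuple[int, int]] = []
    for addr in sorted(set(addresses)):
        if blocks:
            start, count = blocks[-1]
            gap = addr - (start + count)   # unneeded registers in between
            new_count = addr - start + 1
            if gap <= max_gap and new_count <= max_count:
                blocks[-1] = (start, new_count)
                continue
        blocks.append((addr, 1))
    return blocks


if __name__ == "__main__":
    tags = [40001, 40002, 40003, 40010, 40011]
    print(group_registers(tags))             # strict contiguity
    print(group_registers(tags, max_gap=6))  # tolerate the 6-register gap
```

Lowering max_count to 50-100 applies the per-request cap suggested for slow devices.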

Error Handling: Beyond the CRC

Modbus RTU includes a 16-bit CRC at the end of every frame. If the CRC doesn't match, the frame is discarded. But there are several failure modes beyond CRC errors:

ETIMEDOUT — No response from slave:

  • Wrong slave address
  • Baud rate or parity mismatch
  • Wiring fault (A/B swapped, broken connection)
  • Device powered off or in bootloader mode
  • Bus contention (two devices with same address)

ECONNRESET — Connection reset:

  • On RS-485, this usually means the serial port driver detected a framing error
  • Often caused by electrical noise or ground loops
  • Add ferrite cores to cables near VFDs and motor drives

EBADF — Bad file descriptor:

  • The serial port was disconnected (USB-to-RS485 adapter unplugged)
  • Requires closing and reopening the serial connection

EPIPE — Broken pipe:

  • Rare on physical serial ports, more common on virtual serial ports or serial-over-TCP bridges
  • Indicates the underlying transport has failed

For each of these errors, a robust gateway should:

  1. Close the Modbus connection and flush all buffers
  2. Wait a brief cooldown (100-500ms) before attempting reconnection
  3. Update link state to notify the cloud that the device is offline
  4. Log the error type for remote diagnostics
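For reference, the CRC check mentioned at the start of this section is small enough to write out in full. This is the standard CRC-16/MODBUS algorithm (initial value 0xFFFF, reflected polynomial 0xA001, low byte transmitted first) in a bitwise form; the helper names are illustrative, and production code would typically use a lookup table or the library's built-in check.

```python
def crc16_modbus(data: bytes) -> int:
    """Compute the CRC-16/MODBUS checksum of data."""
    crc = 0xFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ 0xA001
            else:
                crc >>= 1
    return crc


def append_crc(frame: bytes) -> bytes:
    """Append the CRC to a frame, low byte first, as sent on the wire."""
    return frame + crc16_modbus(frame).to_bytes(2, "little")


def frame_is_valid(frame: bytes) -> bool:
    """Check the trailing two CRC bytes of a received frame."""
    if len(frame) < 4:
        return False
    return crc16_modbus(frame[:-2]) == int.from_bytes(frame[-2:], "little")
```

A frame that fails frame_is_valid should be discarded and counted separately from timeouts, since a rising CRC error rate usually points at wiring or noise rather than configuration.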

Tuning for Specific Device Types

Different industrial devices have different Modbus RTU characteristics:

Temperature Controllers (TCUs)

  • Typically use function code 03 (holding registers)
  • Response times: 5-15ms
  • Recommended timeout: 200ms
  • Polling interval: 5-60 seconds (temperatures change slowly)

Variable Frequency Drives (VFDs)

  • Often have non-contiguous register maps
  • Response times: 10-30ms
  • Recommended timeout: 300ms
  • Polling interval: 1-5 seconds for speed/current, 60s for configuration

Power/Energy Meters

  • Large register blocks (power quality data)
  • Response times: 20-100ms (some meters buffer internally)
  • Recommended timeout: 500-1000ms
  • Polling interval: 1-15 seconds depending on resolution needed

Central Chillers (Multi-Circuit)

  • Dozens of input registers per compressor circuit
  • Deep register maps spanning 700+ addresses
  • Naturally contiguous register layouts within each circuit
  • Alarm bit registers should be polled at 1-second intervals with change detection
  • Process values (temperatures, pressures) can use 60-second intervals

Putting It All Together

A production Modbus RTU configuration for a typical industrial gateway looks like this:

# Example configuration (generic format)
serial_port: /dev/ttyUSB0
baud_rate: 19200
parity: even # 'E' — matches Modbus spec default
data_bits: 8
stop_bits: 1
slave_address: 1
byte_timeout_ms: 1
response_timeout_ms: 300

# Polling strategy
max_registers_per_read: 50
retry_count: 3
retry_delay_ms: 50
flush_between_retries: true

How machineCDN Handles Modbus RTU

machineCDN's edge gateway handles both Modbus TCP and Modbus RTU through a unified data acquisition layer. The gateway auto-configures serial link parameters from the device profile, groups contiguous registers for optimal bus utilization, implements intelligent retry logic with bus-level error recovery, and bridges Modbus RTU data to cloud-bound MQTT with store-and-forward buffering. Whether your devices speak Modbus RTU at 9600 baud or EtherNet/IP over Gigabit Ethernet, the data arrives in the same normalized format — ready for analytics, alerting, and operational dashboards.


Need to connect legacy Modbus RTU devices to a modern IIoT platform? machineCDN bridges serial protocols to the cloud without replacing your existing equipment. Talk to us about your multi-protocol deployment.

MQTT Topic Architecture for Multi-Site Manufacturing: Designing Scalable Namespaces That Don't Collapse at 10,000 Devices [2026]

· 14 min read
MachineCDN Team
Industrial IoT Experts

Every MQTT tutorial starts the same way: sensor/temperature. Clean, simple, obvious. Then you ship to production and discover that topic architecture is to MQTT what database schema is to SQL — get it wrong early and you'll spend the next two years paying for it.

Manufacturing environments are particularly brutal to bad topic design. A single plant might have 200 machines, each with 30–100 tags, across 8 production lines, reporting to 4 different consuming systems (historian, SCADA, analytics, alerting). Multiply by 5 plants across 3 countries, and your MQTT broker is routing messages across a topic tree with 50,000+ leaf nodes. The topic hierarchy you chose in month one determines whether this scales gracefully or becomes an operational nightmare.

Sparkplug B Specification Deep Dive: Birth Certificates, Death Certificates, and Why Your IIoT MQTT Deployment Needs It [2026]

· 14 min read
MachineCDN Team
Industrial IoT Experts

MQTT is the de facto transport layer for industrial IoT. Every edge gateway, every cloud platform, and every IIoT architecture diagram draws that same line: device → MQTT broker → cloud. But here's the uncomfortable truth that anyone who's deployed MQTT in a real factory knows: raw MQTT tells you nothing about the data inside those payloads.

MQTT is a transport protocol. It delivers bytes. It doesn't define what a "temperature reading" looks like, how to discover which devices are online, or what happens when a device reboots at 3 AM. That's where Sparkplug B comes in — and understanding it deeply is the difference between a demo and a production deployment.

Binary Payload Encoding for Industrial MQTT: Cutting Bandwidth by 10x on Constrained Networks [2026]

· 13 min read

Binary Payload Encoding

JSON is killing your cellular data budget.

When your edge gateway publishes a single temperature reading as {"tag_id": 42, "value": 23.45, "type": "float", "status": 0, "ts": 1709312400}, that's 78 bytes of text to convey 11 bytes of actual information: a 2-byte tag ID, a 4-byte float, a 1-byte status code, and a 4-byte timestamp (which is shared across all tags in the same poll cycle anyway).

At 200 tags polled every 5 seconds, JSON payloads consume roughly 100 KB/minute — over 4 GB/month. On a $15/month cellular plan with a 1 GB cap, you've blown your data budget by day 8.

Binary encoding solves this. By designing a compact wire format purpose-built for industrial telemetry, you can reduce per-tag overhead from ~70 bytes to ~7 bytes — a 10x reduction that makes cellular and satellite IIoT deployments economically viable.

This article covers the engineering of binary payload formats for industrial MQTT, from byte-level encoding decisions to the buffering and delivery systems that ensure data integrity.

Why JSON Falls Short for Industrial Telemetry

JSON became the default payload format for MQTT in the IIoT world because it's human-readable, self-describing, and every platform can parse it. These are real advantages during development and debugging. But they come at a cost that compounds brutally at scale.

The Overhead Tax

Let's dissect a typical JSON telemetry message:

{
  "device_type": 1010,
  "serial": 1106550353,
  "ts": 1709312400,
  "tags": [
    {"id": 1, "status": 0, "type": "uint16", "values": [4200]},
    {"id": 2, "status": 0, "type": "float", "values": [23.45]},
    {"id": 3, "status": 0, "type": "bool", "values": [1]}
  ]
}

This payload is approximately 250 bytes. The actual data content:

  • Device type: 2 bytes
  • Serial number: 4 bytes
  • Timestamp: 4 bytes
  • 3 tag values: 2 + 4 + 1 = 7 bytes
  • 3 tag IDs: 6 bytes
  • 3 status codes: 3 bytes

Total useful data: 26 bytes. The other 224 bytes are structural overhead — curly braces, square brackets, quotation marks, colons, commas, key names, and redundant type strings.

That's an overhead ratio of 9.6x: the payload is nearly 10 times the size of the machine data it carries.

CPU Cost on Embedded Gateways

JSON serialization isn't free on embedded hardware. Constructing JSON objects, converting numbers to strings, escaping special characters, and computing string lengths all consume CPU cycles that could be spent polling more tags or running edge analytics.

On an ARM Cortex-A7 gateway (common in industrial routers), JSON serialization of a 200-tag batch takes 2–5ms. The equivalent binary encoding takes 200–500μs — an order of magnitude faster. When you're polling Modbus every second and need to leave CPU headroom for other tasks, this matters.

Designing a Binary Telemetry Format

A practical binary format for industrial MQTT must balance compactness with extensibility. Here's a proven structure used in production industrial gateways.

Message Structure

┌─────────────────────────────────────────┐
│ Header                                  │
│ ├─ Timestamp (4 bytes, uint32)          │
│ ├─ Device Type (2 bytes, uint16)        │
│ └─ Serial Number (4 bytes, uint32)      │
├─────────────────────────────────────────┤
│ Tag Group                               │
│ ├─ Tag Count (2 bytes, uint16)          │
│ ├─ Tag Record 1                         │
│ │  ├─ Tag ID (2 bytes, uint16)          │
│ │  ├─ Status (1 byte, uint8)            │
│ │  ├─ Type (1 byte, uint8)              │
│ │  ├─ Value Count (1 byte, uint8)       │
│ │  └─ Values (variable)                 │
│ ├─ Tag Record 2                         │
│ │  └─ ...                               │
│ └─ Tag Record N                         │
└─────────────────────────────────────────┘

Type Encoding

Use a single byte to encode the value type, which also determines the byte width of each value:

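| Type Code | Type | Bytes per Value |
| --- | --- | --- |
| 0x01 | bool | 1 |
| 0x02 | int32 | 4 |
| 0x03 | uint32 | 4 |
| 0x04 | float32 | 4 |
| 0x05 | int16 | 2 |
| 0x06 | uint16 | 2 |
| 0x07 | int8 | 1 |
| 0x08 | uint8 | 1 |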

This type system covers the scalar data types you'll most commonly encounter in Modbus and EtherNet/IP PLCs; 64-bit values or strings would need additional type codes. The decoder uses the type code to determine exactly how many bytes to read for each value, so there is no parsing ambiguity and no delimiter scanning.

Size Comparison

For the same 3-tag example above:

Binary encoding:

  • Header: 10 bytes (timestamp + device type + serial)
  • Tag count: 2 bytes
  • Tag 1 (uint16): 2 + 1 + 1 + 1 + 2 = 7 bytes
  • Tag 2 (float32): 2 + 1 + 1 + 1 + 4 = 9 bytes
  • Tag 3 (bool): 2 + 1 + 1 + 1 + 1 = 6 bytes

Total: 34 bytes vs. 250 bytes for JSON. That's a 7.3x reduction.

The savings compound as tag count increases. At 100 tags (a typical mid-size PLC), a JSON batch runs 6–8 KB; the binary equivalent is 700–900 bytes. At 200 tags, JSON hits 12–16 KB while binary stays under 2 KB.
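The 34-byte figure can be checked with Python's struct module. This is a sketch of an encoder for the layout described above, not any vendor's actual implementation: the little-endian byte order and the encode_batch name are assumptions, since the format description does not specify either.

```python
import struct

# Type codes from the table above, mapped to struct format characters.
TYPE_FORMATS = {
    0x01: "?",  # bool
    0x02: "i",  # int32
    0x03: "I",  # uint32
    0x04: "f",  # float32
    0x05: "h",  # int16
    0x06: "H",  # uint16
    0x07: "b",  # int8
    0x08: "B",  # uint8
}


def encode_batch(timestamp: int, device_type: int, serial: int, tags: list) -> bytes:
    """Encode a batch. tags is a list of (tag_id, status, type_code, values)."""
    # Header: timestamp (uint32), device type (uint16), serial (uint32).
    out = bytearray(struct.pack("<IHI", timestamp, device_type, serial))
    out += struct.pack("<H", len(tags))  # tag count
    for tag_id, status, type_code, values in tags:
        out += struct.pack("<HBBB", tag_id, status, type_code, len(values))
        out += struct.pack(f"<{len(values)}{TYPE_FORMATS[type_code]}", *values)
    return bytes(out)


if __name__ == "__main__":
    payload = encode_batch(1709312400, 1010, 1106550353, [
        (1, 0, 0x06, [4200]),
        (2, 0, 0x04, [23.45]),
        (3, 0, 0x01, [True]),
    ])
    print(len(payload), payload.hex())
```

The "<" prefix matters: it disables struct's native alignment padding, so the header packs to exactly 10 bytes.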

Data Grouping: Batches and Groups

Individual tag values shouldn't be published as individual MQTT messages. The MQTT protocol itself adds overhead: a PUBLISH packet includes a fixed header (2 bytes minimum), topic string (20–50 bytes for a typical industrial topic), and packet identifier (2 bytes for QoS 1). Publishing 200 individual messages means 200× this overhead.

Timestamp-Grouped Batches

The most effective grouping strategy collects all tag values from a single poll cycle into one batch, sharing a single timestamp:

[Batch Start: timestamp=1709312400]
Tag 1: id=1, status=0, type=uint16, value=4200
Tag 2: id=2, status=0, type=float, value=23.45
Tag 3: id=3, status=0, type=bool, value=1
...
[Batch End]

The timestamp in the batch header applies to all contained tags. This eliminates per-tag timestamp overhead — a savings of 4 bytes per tag, or 800 bytes across 200 tags.

Batch Size Limits

MQTT brokers and clients have maximum message size limits. Azure IoT Hub limits messages to 256 KB. AWS IoT Core allows 128 KB. Most on-premise Mosquitto deployments default to 256 MB but should be configured lower for production use.

More importantly, your edge gateway's memory and processing constraints impose practical limits. A 4 KB batch size works well for most deployments:

  • Large enough to hold 200+ tags in binary format
  • Small enough to fit in constrained gateway memory
  • Fast enough to serialize without impacting the poll loop

When a batch exceeds the configured size, close it and start a new one. The cloud decoder handles multiple batches with the same timestamp gracefully.

Change-of-Value Filtering Before Batching

Apply change-of-value (COV) filtering before adding values to the batch, not after. If a tag's value hasn't changed since the last report and COV is enabled for that tag, skip it entirely. This reduces batch sizes further during steady-state operation — when 80% of tags are unchanged, your binary batch shrinks proportionally.

However, implement a periodic full-refresh: every hour (or configurable interval), reset all COV baselines and include every tag in the next batch. This ensures the cloud always has a complete snapshot, even if individual change events were lost during a brief disconnection.
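One way to combine COV filtering with the periodic full refresh is a small stateful filter that sits between the poll loop and the batch encoder. This is a sketch under simple assumptions: exact-equality comparison (no deadband), a single refresh interval for all tags, and a hypothetical CovFilter class name. The injectable clock exists only to make the behavior testable.

```python
import time


class CovFilter:
    """Pass through only changed values, with a periodic full refresh."""

    def __init__(self, refresh_interval_s: float = 3600.0, clock=time.monotonic):
        self._last = {}  # tag_id -> last reported value
        self._refresh_interval_s = refresh_interval_s
        self._clock = clock
        self._last_refresh = clock()

    def filter(self, readings: dict) -> dict:
        """Return the subset of {tag_id: value} that belongs in the next batch."""
        now = self._clock()
        if now - self._last_refresh >= self._refresh_interval_s:
            self._last.clear()  # reset baselines: next batch is a full snapshot
            self._last_refresh = now
        changed = {
            tag: value
            for tag, value in readings.items()
            if tag not in self._last or self._last[tag] != value
        }
        self._last.update(changed)
        return changed
```

For analog values you would likely replace the equality test with a per-tag deadband so that sensor noise does not defeat the filter.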

The Page Buffer: Store-and-Forward in Fixed Memory

Binary encoding solves the bandwidth problem. But you still need to handle MQTT disconnections without losing data. The page-based ring buffer is the industrial standard for store-and-forward in embedded systems.

Architecture

Pre-allocate a contiguous memory region at startup and divide it into fixed-size pages:

┌──────────────────────────────────────────────┐
│ Buffer Memory (e.g., 512 KB)                 │
│                                              │
│ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ │
│ │Page 0│ │Page 1│ │Page 2│ │Page 3│ │Page 4│ │
│ │      │ │      │ │      │ │      │ │      │ │
│ └──────┘ └──────┘ └──────┘ └──────┘ └──────┘ │
└──────────────────────────────────────────────┘

Pages cycle through three states:

  1. Free — empty, available for writing
  2. Work — currently being written to by the Modbus polling thread
  3. Used — full, waiting for MQTT delivery

Page Layout

Each page contains multiple messages, packed sequentially:

┌─────────────────────────────────────┐
│ Page Header (struct, ~16 bytes)     │
├─────────────────────────────────────┤
│ Message 1:                          │
│ ├─ Message ID (4 bytes)             │
│ ├─ Message Size (4 bytes)           │
│ └─ Message Body (variable)          │
├─────────────────────────────────────┤
│ Message 2:                          │
│ ├─ Message ID (4 bytes)             │
│ ├─ Message Size (4 bytes)           │
│ └─ Message Body (variable)          │
├─────────────────────────────────────┤
│ ... (more messages)                 │
├─────────────────────────────────────┤
│ Free space                          │
└─────────────────────────────────────┘

The 4-byte message ID field is filled by the MQTT library when the message is published (at QoS 1). The gateway uses this ID to match publish acknowledgments to specific messages.

Write Path

  1. Check if the current work page has enough space for the new message (message size + 8 bytes for ID and size fields).
  2. If yes: write the message, advance the write pointer.
  3. If no: move the work page to the "used" queue, grab a free page as the new work page, and write there.
  4. If no free pages exist: grab the oldest used page (overflow condition). Log a warning — you're losing the oldest buffered data, but preserving the newest.

This overflow strategy is deliberately biased toward fresh data. In industrial monitoring, a temperature reading from 5 minutes ago is far more valuable than one from 3 days ago that was buffered during an outage.

Delivery Path

  1. Take the first page from the "used" queue.
  2. Read the next undelivered message (tracked by a per-page read pointer).
  3. Publish via MQTT at QoS 1.
  4. Wait for PUBACK — don't advance the read pointer until the broker confirms receipt.
  5. On PUBACK: advance the read pointer. If the page is fully delivered, move it back to "free."
  6. On disconnect: stop sending, keep writing. The buffer absorbs the outage.

The wait-for-PUBACK step is critical. Without it, you're fire-and-forgetting into a potentially disconnected socket, and data silently disappears.

Thread Safety

The write path (Modbus polling thread) and delivery path (MQTT thread) operate concurrently on the same buffer. A mutex protects all page state transitions:

  • Moving pages between free/work/used queues
  • Checking available space
  • Advancing read/write pointers
  • Processing delivery acknowledgments

Keep the critical section as small as possible — lock, update pointers, unlock. Never hold the mutex during a Modbus read or MQTT publish; those operations can block for seconds.
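The write path, delivery path, and locking rules above fit in a compact sketch. This version is deliberately simplified and makes several assumptions: pages are Python lists rather than slices of one pre-allocated region, the class and method names are hypothetical, and the case where an overflow recycles a page with a publish still in flight is not handled.

```python
import threading
from collections import deque

MSG_OVERHEAD = 8  # 4-byte message ID + 4-byte size field per message


class Page:
    def __init__(self):
        self.messages = []   # message bodies in write order
        self.used_bytes = 0
        self.read_index = 0  # next undelivered message

    def reset(self):
        self.messages.clear()
        self.used_bytes = 0
        self.read_index = 0


class PageBuffer:
    def __init__(self, page_count: int, page_size: int):
        self._lock = threading.Lock()
        self._page_size = page_size
        self._free = deque(Page() for _ in range(page_count - 1))
        self._work = Page()
        self._used = deque()
        self.dropped_pages = 0

    def write(self, body: bytes) -> None:
        """Write path, called from the polling thread."""
        needed = len(body) + MSG_OVERHEAD
        if needed > self._page_size:
            raise ValueError("message larger than a page")
        with self._lock:
            if self._work.used_bytes + needed > self._page_size:
                self._used.append(self._work)
                if self._free:
                    self._work = self._free.popleft()
                else:
                    # Overflow: recycle the oldest used page, losing its data.
                    self._work = self._used.popleft()
                    self._work.reset()
                    self.dropped_pages += 1
            self._work.messages.append(body)
            self._work.used_bytes += needed

    def peek(self):
        """Delivery path: next undelivered message, or None if nothing is queued."""
        with self._lock:
            if not self._used:
                return None
            page = self._used[0]
            return page.messages[page.read_index]

    def ack(self) -> None:
        """Call on PUBACK: advance the read pointer, free the page once drained."""
        with self._lock:
            if not self._used:
                return
            page = self._used[0]
            page.read_index += 1
            if page.read_index >= len(page.messages):
                self._used.popleft()
                page.reset()
                self._free.append(page)
```

The MQTT thread calls peek(), publishes outside the lock, and calls ack() only from the PUBACK callback, so the mutex is never held across a blocking network call.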

Delivery Tracking and Watchdogs

In production, "the MQTT connection is up" doesn't mean data is flowing. The connection can be technically alive (TCP socket open, keepalives passing) while messages silently fail to publish or acknowledge.

Delivery Timestamp Tracking

Track the timestamp of the last successfully delivered message (confirmed by PUBACK). If this timestamp falls more than N minutes behind the current time, something is wrong:

  • The broker may be rejecting messages (payload too large, topic permission denied)
  • The network may be passing keepalives but dropping data packets
  • The MQTT library may be stuck in an internal error state

When the delivery watchdog fires, tear down the entire MQTT connection and reinitialize. It's a heavy-handed recovery, but it's reliable. In industrial systems, a clean restart beats a subtle degradation every time.
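A delivery watchdog needs very little state. The sketch below is illustrative: the DeliveryWatchdog name and the has_pending argument are assumptions added here, the latter so the watchdog stays quiet when COV filtering legitimately leaves nothing to send.

```python
import time


class DeliveryWatchdog:
    """Detects a connection that is up but no longer acknowledging data."""

    def __init__(self, max_silence_s: float = 300.0, clock=time.monotonic):
        self._max_silence_s = max_silence_s
        self._clock = clock
        self._last_ack = clock()

    def on_puback(self) -> None:
        """Record a confirmed delivery."""
        self._last_ack = self._clock()

    def expired(self, has_pending: bool) -> bool:
        """True when data is waiting but nothing has been acknowledged recently."""
        silence = self._clock() - self._last_ack
        return has_pending and silence > self._max_silence_s
```

The main loop checks expired() once per poll cycle and, when it returns True, tears down and reinitializes the MQTT client before calling on_puback() to restart the timer.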

Status Telemetry

The gateway should periodically publish its own status message containing:

  • Daemon uptime — how long since last restart
  • System uptime — how long since last boot
  • Buffer state — pages free/used/work, current fill level
  • PLC link state — is the Modbus connection healthy
  • Firmware version — for remote fleet management
  • Token expiration — time remaining on the MQTT auth token

This status message can use JSON even if data messages use binary — it's infrequent (every 30–60 seconds) and readability matters more than compactness for diagnostics.

Bandwidth Math: Real-World Numbers

Let's calculate the actual savings for a typical deployment:

Scenario: 150 tags, polled every 5 seconds, 50% change rate with COV enabled, cellular connection.

JSON Format

  • Average tag JSON: ~60 bytes
  • Tags per poll (with 50% COV): 75
  • Batch overhead: ~50 bytes
  • Total per poll: 75 × 60 + 50 = 4,550 bytes
  • Per minute (12 polls): 54.6 KB
  • Per day: 78.6 MB
  • Per month: 2.36 GB

Binary Format

  • Average tag binary: ~7 bytes
  • Header per batch: 12 bytes
  • Total per poll: 75 × 7 + 12 = 537 bytes
  • Per minute (12 polls): 6.4 KB
  • Per day: 9.3 MB
  • Per month: 279 MB

Savings: 88% reduction — from 2.36 GB to 279 MB. On a $20/month cellular plan with 500 MB included, JSON doesn't fit. Binary does, with headroom.

Add MQTT overhead (topic strings, packet headers) and TLS overhead (~40 bytes per record), which cost the same in both formats, and the real-world ratio comes out somewhat below the roughly 8.5x computed here.
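To rerun this comparison with your own tag counts and poll rates, the calculation reduces to one function. This sketch uses the same simplifications as the scenario above (a 30-day month, payload bytes only, decimal units); the function name is illustrative.

```python
def monthly_bytes(tags_per_poll: int, bytes_per_tag: int, batch_overhead: int,
                  poll_interval_s: float, days: int = 30) -> float:
    """Payload bytes per month, excluding MQTT and TLS framing."""
    per_poll = tags_per_poll * bytes_per_tag + batch_overhead
    polls = days * 24 * 3600 / poll_interval_s
    return per_poll * polls


if __name__ == "__main__":
    json_month = monthly_bytes(75, 60, 50, 5)
    binary_month = monthly_bytes(75, 7, 12, 5)
    print(f"JSON:   {json_month / 1e9:.2f} GB/month")
    print(f"Binary: {binary_month / 1e6:.0f} MB/month")
    print(f"Reduction: {1 - binary_month / json_month:.0%}")
```

The binary figure lands a megabyte or so under the number above only because the walkthrough rounds the per-day value before multiplying by 30.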

Decoding on the Cloud Side

Binary encoding shifts complexity from the edge to the cloud. The decoder must:

  1. Parse the header to extract timestamp, device type, and serial number.
  2. Iterate tag records using the type code to determine value byte widths.
  3. Reconstruct typed values — particularly IEEE 754 floats from their 4-byte binary representation.
  4. Handle partial messages — if a batch was truncated due to buffer overflow, the decoder must fail gracefully on the last incomplete record without losing the valid records before it.

Most cloud platforms (Azure IoT Hub, AWS IoT Core) support custom message decoders that transform binary payloads to JSON for downstream processing. Write the decoder once, and the rest of your analytics pipeline sees standard JSON.
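A decoder for the message structure described earlier might look like the following. As with any sketch of this format, little-endian byte order is an assumption, and the decode_batch name and output dictionary shape are illustrative. It stops at the first incomplete or unrecognized record and keeps everything decoded before it.

```python
import struct

# Type codes mapped to struct format characters (same table as the encoder side).
TYPE_FORMATS = {0x01: "?", 0x02: "i", 0x03: "I", 0x04: "f",
                0x05: "h", 0x06: "H", 0x07: "b", 0x08: "B"}


def decode_batch(payload: bytes) -> dict:
    """Decode header and tag records; tolerate a truncated final record."""
    timestamp, device_type, serial = struct.unpack_from("<IHI", payload, 0)
    (tag_count,) = struct.unpack_from("<H", payload, 10)
    offset = 12
    tags = []
    truncated = False
    for _ in range(tag_count):
        try:
            tag_id, status, type_code, n = struct.unpack_from("<HBBB", payload, offset)
            fmt = f"<{n}{TYPE_FORMATS[type_code]}"
            values = struct.unpack_from(fmt, payload, offset + 5)
        except (struct.error, KeyError):
            truncated = True  # incomplete or unknown record: keep what we have
            break
        tags.append({"id": tag_id, "status": status,
                     "type": type_code, "values": list(values)})
        offset += 5 + struct.calcsize(fmt)
    return {"timestamp": timestamp, "device_type": device_type,
            "serial": serial, "tags": tags, "truncated": truncated}
```

Reporting the truncated flag alongside the valid records lets the ingestion layer count partial batches without discarding usable data.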

How machineCDN Implements Binary Telemetry

machineCDN's edge daemon uses binary encoding by default for all data telemetry. The implementation includes:

  • Compact binary batching with shared timestamps per group, reducing per-tag overhead to 5–9 bytes depending on data type.
  • Page-based ring buffer with pre-allocated memory, zero runtime allocation, and deliberate overflow behavior that preserves fresh data.
  • Per-message PUBACK tracking with delivery watchdog and automatic connection recycling.
  • Parallel JSON status messages for gateway diagnostics, published on a separate topic at lower frequency.
  • Automatic format negotiation — the cloud ingestion layer detects binary vs. JSON based on the first byte of the payload and routes to the appropriate decoder.

The result: machineCDN gateways routinely operate on 500 MB/month cellular plans, monitoring 200+ tags at 5-second intervals, with full store-and-forward resilience during connectivity outages.

When to Use Binary vs. JSON

Binary encoding isn't always the right choice. Use this decision framework:

| Criterion | Use Binary | Use JSON |
| --- | --- | --- |
| Network | Cellular, satellite, metered | Ethernet, WiFi, unmetered |
| Tag count | > 50 | < 20 |
| Poll interval | < 10 seconds | > 60 seconds |
| Gateway CPU | Constrained (< 500 MHz) | Capable (> 1 GHz) |
| Debug needs | Production, stable | Development, changing |
| Downstream | Custom decoder available | Generic tooling needed |

For most production industrial deployments — where gateways connect hundreds of tags over cellular and reliability trumps developer convenience — binary encoding is the clear winner. Save JSON for your status messages and the debugging serial port.

Getting Started

If you're designing a binary telemetry format for your own gateway:

  1. Start with the type system. Define your type codes and byte widths. Match them to your PLC's native data types.
  2. Design the header. Include device identity and a shared timestamp, plus a format version byte so you can evolve the format without breaking old decoders.
  3. Build the buffer first. Get store-and-forward working before optimizing the encoding. Data integrity matters more than data compactness.
  4. Write the decoder alongside the encoder. Test with known values. Verify float encoding especially — IEEE 754 byte ordering bugs are silent and devastating.
  5. Measure real bandwidth. Deploy both JSON and binary formats on the same gateway for a week and compare actual data consumption. The numbers will sell the approach to stakeholders who question the added complexity.

Binary encoding is a solved problem in industrial telemetry. The patterns are well-established, the savings are dramatic, and the complexity cost is paid once at design time and amortized across every byte your fleet ever transmits.

EtherNet/IP Implicit vs Explicit Messaging: What Plant Engineers Actually Need to Know [2026]

· 11 min read

EtherNet/IP CIP Protocol Architecture

If you've ever tried to pull real-time data from an Allen-Bradley PLC over EtherNet/IP and found yourself staring at timeouts, missed packets, or inexplicable latency spikes — you've probably run into the implicit vs. explicit messaging divide without realizing it.

EtherNet/IP is one of the most widely deployed industrial Ethernet protocols, yet the nuances of its messaging model trip up even experienced automation engineers. This guide breaks down what actually matters when you're connecting PLCs to edge gateways, SCADA systems, or IIoT platforms like machineCDN.

CIP: The Protocol Inside the Protocol

EtherNet/IP is really just a transport wrapper around the Common Industrial Protocol (CIP). CIP is the application layer that defines how devices discover each other, exchange data, and manage connections. Understanding CIP is understanding EtherNet/IP — everything else is TCP/UDP plumbing.

CIP organizes everything into objects. Every device has a set of objects, each with attributes you can read or write. The key objects you'll encounter:

Object | Class ID | Purpose
Identity | 0x01 | Device name, serial number, vendor ID
Message Router | 0x02 | Routes CIP requests to the right object
Assembly | 0x04 | Groups data points into input/output assemblies
Connection Manager | 0x06 | Manages I/O and explicit connections
TCP/IP Interface | 0xF5 | Network configuration
Ethernet Link | 0xF6 | Link-layer statistics

When your edge gateway reads a tag like capacity_utilization from a Micro800 or CompactLogix PLC, it's ultimately reading an attribute from a CIP object — the protocol just hides this behind a friendlier tag-name interface.

Explicit Messaging: The Request-Response Model

Explicit messaging is CIP's "ask and receive" mode. Your client sends a request over TCP port 44818, the device processes it, and sends a response. It's conceptually identical to an HTTP GET — connected, reliable, and sequential.

How It Actually Works

  1. TCP handshake with the PLC on port 44818
  2. RegisterSession — establishes a CIP session, returns a session handle
  3. SendRRData (Send Request/Reply Data) — wraps your CIP service request
  4. Device processes the request and returns a response in the same TCP connection
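
Steps 2 and 3 can be made concrete at the byte level. The sketch below builds a RegisterSession request from the EtherNet/IP encapsulation header (24 bytes, little-endian) and extracts the session handle from a reply. The function names are illustrative rather than taken from any particular library, and the code deliberately does no socket I/O.

```python
import struct

REGISTER_SESSION = 0x0065  # EtherNet/IP encapsulation command code


def build_register_session(sender_context: bytes = b"\x00" * 8) -> bytes:
    """Build a RegisterSession request: 24-byte header plus 4-byte body."""
    body = struct.pack("<HH", 1, 0)  # protocol version 1, option flags 0
    header = struct.pack(
        "<HHII8sI",
        REGISTER_SESSION,  # command
        len(body),         # length of the data that follows the header
        0,                 # session handle: 0 until the device assigns one
        0,                 # status
        sender_context,    # opaque value echoed back by the device
        0,                 # options
    )
    return header + body


def parse_session_handle(reply: bytes) -> int:
    """Extract the session handle from a RegisterSession reply header."""
    command, _length, handle, status = struct.unpack_from("<HHII", reply)
    if command != REGISTER_SESSION or status != 0:
        raise ValueError(f"RegisterSession failed, status=0x{status:08X}")
    return handle
```

The returned handle is what every later SendRRData request on that TCP connection would carry in its header.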

For tag reads on Logix-family controllers, the path typically encodes:

  • Protocol type (e.g., ab-eip for Allen-Bradley EtherNet/IP)
  • Gateway IP — the PLC's network address
  • CPU type — Micro800, CompactLogix, ControlLogix, etc.
  • Tag name — the symbolic name of the data point
  • Element size and count — how many bytes per element, how many elements to read

A typical read might look like:

protocol=ab-eip
gateway=192.168.1.50
cpu=compactlogix
name=Temperature_Zone1
elem_size=4
elem_count=1

This tells the stack: "Connect to the CompactLogix at 192.168.1.50, find the tag named Temperature_Zone1, read one 4-byte (32-bit float) element."

Explicit Messaging Characteristics

  • Latency: 2-10ms per request on a quiet network, 20-50ms under load
  • Throughput: Sequential — you can't pipeline requests on a single connection
  • Best for: Configuration reads, diagnostics, infrequent data access
  • Max payload: 504 bytes per CIP service response (can be extended with Large Forward Open)
  • Reliability: TCP-based, guaranteed delivery

The Hidden Cost: Tag Creation Overhead

Here's something that catches people off guard. On Logix controllers, the first time you read a symbolic tag, the controller has to resolve the tag name to an internal address. This resolution can take 5-15ms. Subsequent reads on the same connection are faster because the tag handle is cached.

If your gateway creates and destroys connections frequently (say, on each poll cycle), you're paying this resolution cost every single time. A well-designed gateway keeps connections persistent and caches tag handles across read cycles. This alone can cut your effective read latency by 40-60%.

Implicit Messaging: The Real-Time Streaming Model

Implicit messaging is where EtherNet/IP earns its keep in real-time control. Instead of request-response, data flows continuously via UDP multicast or unicast without the overhead of individual requests.

The Connection Setup

Implicit connections are established through an explicit messaging sequence:

  1. Forward Open request (via TCP) — negotiates the connection parameters
  2. Both sides agree on:
    • RPI (Requested Packet Interval) — how often data is produced, in microseconds
    • Connection path — which assembly objects to bind
    • Transport type — Class 0 or Class 1 for I/O data (Class 1 adds sequence counting); Class 3 is the transport used for connected explicit messaging
    • Connection size — max bytes per packet
  3. Once established, data flows via UDP port 2222 at the agreed RPI

RPI: The Most Misunderstood Parameter

The Requested Packet Interval is essentially your sampling rate. Set it too fast and you'll flood the network with redundant data. Set it too slow and you'll miss transient events.

RPI Setting | Typical Use Case | Network Impact
2ms | Motion control, servo drives | ~500 packets/sec per connection
10ms | Fast discrete I/O | ~100 packets/sec per connection
50ms | Analog process values | ~20 packets/sec per connection
100-500ms | Monitoring, trending | Minimal
1000ms+ | Configuration data | Negligible

The golden rule: Your RPI should match your actual process dynamics, not your "just in case" anxiety. A temperature sensor that changes over minutes doesn't need a 10ms RPI — 500ms is plenty.

For IIoT monitoring scenarios, RPIs of 100ms to 1000ms are typically appropriate. You're tracking trends and detecting anomalies, not closing servo loops. Platforms like machineCDN are designed to ingest data at these intervals and apply server-side intelligence — the edge gateway doesn't need millisecond resolution to detect that a motor bearing temperature is trending upward.

Implicit Messaging Characteristics

  • Latency: Deterministic — data arrives every RPI interval (±jitter)
  • Throughput: Concurrent — hundreds of connections can stream simultaneously
  • Best for: Cyclic I/O data, real-time monitoring, control loops
  • Transport: UDP — no retransmission, but sequence numbers detect missed packets
  • Multicast: Multiple consumers can subscribe to the same producer

Scanner/Adapter Architecture

In EtherNet/IP, the device that initiates the implicit connection is the scanner (typically the PLC or an HMI), and the device that responds is the adapter (typically an I/O module, drive, or remote rack).

Why This Matters for Edge Gateways

When you connect an IIoT edge gateway to a PLC, the gateway typically acts as an explicit messaging client — it reaches out and reads tags on demand. It is not acting as a scanner or adapter in the implicit sense.

This is an important architectural distinction:

  • Scanner mode would require the gateway to manage Forward Open connections and consume I/O assemblies — complex, but gives you real-time streaming data
  • Explicit client mode is simpler — poll tags at your desired interval, get responses, publish to the cloud

Most IIoT gateways (including those powering machineCDN deployments) use explicit messaging with intelligent polling. Why? Because:

  1. Simplicity — No need to configure assembly objects on the PLC
  2. Flexibility — You can read any tag by name, not just pre-configured assemblies
  3. Non-intrusion — No modifications to the PLC program required
  4. Sufficient performance — For monitoring (not control), 1-60 second poll intervals are fine

When to Use Implicit Messaging for IIoT

There are cases where implicit messaging makes sense even for monitoring:

  • High tag counts — If you're reading 500+ tags from a single PLC, implicit is more efficient
  • Sub-second requirements — Process alarms that need under 100ms detection
  • Multicast scenarios — Multiple systems need the same data simultaneously
  • Deterministic timing — You need data produced at a fixed interval for SPC/SQC (UDP transport still does not guarantee that every packet arrives)

Data Types and Byte Ordering

EtherNet/IP inherits CIP's data type system. When reading tags, you need to know the data width:

CIP Type | Width | Notes
BOOL | 1 byte | Actually stored as uint8, 0 or 1
SINT | 1 byte | Signed 8-bit
INT | 2 bytes | Signed 16-bit
DINT | 4 bytes | Signed 32-bit
REAL | 4 bytes | IEEE 754 float
LINT | 8 bytes | Signed 64-bit (ControlLogix only)

Byte order is little-endian for CIP. This trips up engineers coming from Modbus (which is big-endian). If you're bridging between the two protocols, you'll need byte-swap logic at the translation layer.

For array reads, the element size matters for offset calculation. Reading element N of a 32-bit array means the data starts at byte offset N * 4. Getting this wrong produces garbage values that look plausible (they're the right data type, just from the wrong array position), which makes debugging painful.
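
The offset arithmetic and the little-endian decoding can be checked with a few lines of Python. This is a minimal sketch that assumes the gateway has already read the raw array bytes into a buffer; `read_real_element` is a hypothetical helper name.

```python
import struct


def read_real_element(buffer: bytes, index: int) -> float:
    """Decode element `index` of a packed array of 4-byte CIP REAL values."""
    elem_size = 4
    offset = index * elem_size
    if index < 0 or offset + elem_size > len(buffer):
        raise IndexError(f"element {index} is outside the {len(buffer)}-byte buffer")
    # "<f" is a little-endian IEEE 754 single, matching CIP byte order.
    return struct.unpack_from("<f", buffer, offset)[0]


raw = struct.pack("<3f", 20.5, 42.5, 100.0)  # stand-in for a 3-element REAL array
print(read_real_element(raw, 1))  # 42.5
```

The explicit bounds check matters because an offset that is wrong but still inside the buffer returns a plausible float instead of failing.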

Connection Timeouts and Keepalive

One of the most common production issues with EtherNet/IP is connection timeout cascades. Here's how they happen:

  1. Network blip causes one packet to be delayed
  2. PLC times out the connection (default: 4x the RPI)
  3. Gateway has to re-register the session and re-read tags
  4. During re-establishment, tag handles are lost — all tag names need re-resolution
  5. While reconnecting, data gaps appear in your historian

Mitigation Strategies

  • Set realistic timeout multipliers. The CIP connection timeout multiplier is an encoded field that selects between 4x and 512x the RPI. For monitoring, use generous timeouts (e.g., 10-30 seconds) rather than tight ones.
  • Implement exponential backoff on reconnection. Hammering a PLC with connection requests during a network event makes things worse.
  • Cache tag handles and attempt to reuse them after reconnection. Some PLCs allow this; others invalidate all handles on session reset.
  • Use a connection watchdog — if no data arrives for N intervals, proactively reconnect rather than waiting for the timeout to expire.
  • Monitor connection statistics at the Ethernet Link object (Class 0xF6) — rising error counters often predict connection failures before they happen.
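
The exponential backoff in the second bullet could be sketched as a small generator. The base delay, cap, and attempt count below are illustrative assumptions, not values recommended by any controller vendor.

```python
import random


def backoff_delays(base_s=1.0, cap_s=60.0, attempts=6, jitter=0.0, rng=random.random):
    """Yield reconnect delays that double on each attempt, capped at cap_s."""
    for attempt in range(attempts):
        delay = min(cap_s, base_s * (2 ** attempt))
        # Optional jitter spreads out reconnects when several gateways
        # lose the same PLC at the same moment.
        yield delay * (1.0 + jitter * rng())


print(list(backoff_delays()))  # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
```

A reconnect loop would sleep for each yielded delay before the next attempt and restart the generator after a successful connection.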

Practical Performance Benchmarks

Based on real-world deployments across plastics manufacturing, HVAC, and process control:

Scenario | Tags | Poll Interval | Avg Latency | CPU Load on PLC
Single gateway, 50 tags | 50 | 1 sec | 3-5ms/tag | Under 1%
Single gateway, 200 tags | 200 | 5 sec | 5-8ms/tag | 2-3%
Three gateways, 500 tags total | 500 | 10 sec | 8-15ms/tag | 5-8%
One gateway, 50 tags, aggressive | 50 | 100ms | 2-4ms/tag | 3-5%

Key insight: PLC CPU impact scales with request frequency, not tag count. Reading 200 tags in one optimized request every 5 seconds has less impact than reading 10 tags every 100ms.

Tag Grouping Optimization

When reading multiple tags, group them by:

  1. Data type and element count — Same-type tags can sometimes be read more efficiently
  2. Program scope — Tags in the same program/task on the PLC share routing paths
  3. Read interval — Don't poll slow-changing configuration values at the same rate as process variables

A well-optimized gateway might use three polling groups:

  • Fast (1-5 sec): Machine state booleans, alarm bits, running status — values that trigger immediate action
  • Medium (30-60 sec): Process variables — temperatures, pressures, flow rates, RPMs
  • Slow (5-60 min): Configuration and identity — firmware version, serial number, device type

This tiered approach reduces network traffic by 60-80% compared to polling everything at the fastest interval.

Common Pitfalls

1. Forgetting About CPU Type

The CIP service path differs by controller family. A request formatted for CompactLogix won't work on a Micro800, even though both speak EtherNet/IP. Always verify the CPU type during gateway configuration.

2. Array Index Confusion

Some PLCs use zero-based array indexing, others use one-based. If you request MyArray[0] and get an error, try [1]. Better yet, test with known values during commissioning.

3. String Tags

CIP string tags have a length prefix followed by character data. The total allocation might be 82 bytes (2-byte length + 80 characters), but only the first length characters are valid. Reading the raw bytes without parsing the length field gives you garbage padding at the end.
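
A minimal parsing sketch for the layout described above (2-byte little-endian length followed by character data) is shown here. The width of the length field varies by controller family, so treat the `<H` format as an assumption to confirm during commissioning; `parse_string_tag` is a hypothetical helper name.

```python
import struct


def parse_string_tag(raw: bytes) -> str:
    """Decode a length-prefixed string: 2-byte little-endian length, then data."""
    (length,) = struct.unpack_from("<H", raw, 0)
    data = raw[2:]
    if length > len(data):
        raise ValueError(f"length field {length} exceeds {len(data)} data bytes")
    # Only the first `length` bytes are valid; everything after is padding.
    return data[:length].decode("ascii", errors="replace")


raw = struct.pack("<H", 5) + b"READY".ljust(80, b"\x00")  # 82-byte allocation
print(parse_string_tag(raw))  # READY
```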

4. Assuming All Controllers Support Symbolic Access

Older SLC 500 and PLC-5 controllers use file-based addressing (e.g., N7:0, F8:3), not symbolic tag names. Your gateway needs to handle both addressing modes.

5. Ignoring Forward Open Limits

Every PLC has a maximum number of concurrent CIP connections (typically 32-128 for CompactLogix, more for ControlLogix). If your gateway, HMI, SCADA, historian, and three other systems all connect simultaneously, you can hit this limit — and the symptom is intermittent connection refusals.

Choosing Your Messaging Strategy

Factor | Use Explicit | Use Implicit
Tag count | Under 200 per PLC | Over 200 per PLC
Update rate needed | Over 500ms | Under 500ms
PLC modification allowed | No | Yes (assembly config)
Multiple consumers | No | Yes (multicast)
Deterministic timing required | No | Yes
Gateway complexity budget | Low | High
IIoT monitoring use case | ✅ Almost always | Rarely needed

For the vast majority of IIoT monitoring and predictive maintenance scenarios — the use cases machineCDN was built for — explicit messaging with smart polling is the right choice. It's simpler to deploy, doesn't require PLC program changes, and delivers the data fidelity you need for trend analysis and anomaly detection.

What's Next

EtherNet/IP continues to evolve. The Time-Sensitive Networking (TSN) extensions coming in the next revision will blur the line between implicit and explicit messaging by providing deterministic delivery guarantees at the Ethernet layer itself. This will make EtherNet/IP competitive with PROFINET IRT for hard real-time applications — but for monitoring and IIoT, the fundamentals covered here will remain relevant for years to come.


machineCDN connects to EtherNet/IP controllers natively, handling tag resolution, connection management, and data batching so your team can focus on process insights rather than protocol plumbing. Learn more →

EtherNet/IP and CIP: How Industrial Controllers Actually Communicate [2026 Guide]

· 12 min read

If you've spent time on a plant floor wiring up Allen-Bradley PLCs, you've used EtherNet/IP — whether you realized you were speaking CIP or not. But most engineers treat the protocol like a black box: plug in the cable, configure the scanner, pray the I/O updates arrive on time.

This guide breaks open how EtherNet/IP actually works at the protocol level — the CIP object model, the difference between implicit and explicit messaging, how tag-based addressing resolves data paths, and the real-world timing constraints that catch teams off guard during commissioning.

Modbus Polling Optimization: Register Grouping, Retry Logic, and Multi-Device Scheduling [2026 Guide]

· 15 min read

Modbus Register Polling Optimization

Modbus dates back to 1979 and is still the most commonly deployed industrial protocol on the planet. It runs in power plants, water treatment facilities, HVAC systems, plastics factories, and pharmaceutical clean rooms. Its simplicity is its superpower, and its trap.

Because Modbus is conceptually simple (read some registers, write some registers), engineers tend to implement polling in the most straightforward way possible: loop through tags, read each one, repeat. This works fine for 10 tags on one device. It falls apart spectacularly at 200 tags across eight devices on a congested RS-485 bus.

This guide covers the polling optimization techniques that separate hobbyist implementations from production-grade edge gateways — the kind that power platforms like machineCDN across thousands of connected machines.

The Four Function Codes That Matter

Before optimizing anything, you need to understand how Modbus maps register addresses to function codes. This mapping is the foundation of every optimization strategy.

Address Range | Function Code | Read Type | Register Type
0 – 65,535 | FC 01 | Read Coils | Discrete Output (1-bit)
100,000 – 165,535 | FC 02 | Read Discrete Inputs | Discrete Input (1-bit)
300,000 – 365,535 | FC 04 | Read Input Registers | Analog Input (16-bit)
400,000 – 465,535 | FC 03 | Read Holding Registers | Analog Output (16-bit)

The critical insight: You cannot mix function codes in a single Modbus request. A read of holding registers (FC 03) and a read of input registers (FC 04) are always two separate transactions, even if the registers are numerically adjacent when you strip the prefix.

This means your first optimization step is grouping tags by function code. A tag list with 50 holding registers and 10 input registers requires at minimum 2 requests, not 1 — no matter how clever your batching.

Address Decoding in Practice

Many Modbus implementations use the address prefix convention to encode both the register type and the function code:

  • Address 404000 → Function Code 3, register 4000
  • Address 304000 → Function Code 4, register 4000
  • Address 4000 → Function Code 1, coil 4000
  • Address 104000 → Function Code 2, discrete input 4000

The register address sent on the wire is the address modulo the range base. So 404000 becomes register 4000 in the actual Modbus PDU. Getting this decoding wrong is the #1 cause of "I can read the same register in my Modbus scanner tool but not in my gateway" issues.
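
The prefix convention above can be expressed as a small lookup. This is a sketch of the six-digit convention used in this article only; `decode_address` is a hypothetical helper, and some tools use a five-digit, one-based variant that it does not handle.

```python
def decode_address(address: int) -> tuple[int, int]:
    """Map a prefixed address to (function_code, register_on_the_wire)."""
    ranges = [
        (400_000, 3),  # holding registers
        (300_000, 4),  # input registers
        (100_000, 2),  # discrete inputs
        (0, 1),        # coils
    ]
    for base, function_code in ranges:
        if base <= address <= base + 65_535:
            return function_code, address - base
    raise ValueError(f"address {address} is outside every known range")


print(decode_address(404000))  # (3, 4000)
print(decode_address(104000))  # (2, 4000)
```

Raising on an unknown range, instead of silently defaulting to coils, surfaces a mistyped address at configuration time.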

Contiguous Register Grouping

The single most impactful optimization in Modbus polling is contiguous register grouping — combining multiple sequential register reads into a single bulk read.

Why It Matters: The Overhead Math

Every Modbus transaction has fixed overhead:

Component | RTU (Serial) | TCP
Request frame | 8 bytes | 12 bytes (MBAP header + PDU)
Response frame header | 5 bytes | 9 bytes
Turnaround delay | 3.5 char times | ~1ms
Response data | 2 × N bytes (N registers) | 2 × N bytes (N registers)
Inter-frame gap | 3.5 char times | N/A
Total overhead per request | ~50ms at 9600 baud | ~2-5ms

For RTU at 9600 baud, each individual register read (request + response + delays) takes roughly 50ms. Reading 50 registers individually = 2.5 seconds. Reading them as one bulk request of 50 contiguous registers = ~120ms. That's a 20x improvement.
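
The arithmetic behind these figures can be reproduced with a rough timing model. The sketch below assumes 11 bits per character (start, 8 data, parity, stop) and treats device turnaround as a caller-supplied parameter, so its results land near the rounded numbers above without matching them exactly. Both function names are illustrative.

```python
def frame_ms(num_chars: float, baud: int, bits_per_char: int = 11) -> float:
    """Time on the wire for num_chars characters, in milliseconds."""
    return num_chars * bits_per_char * 1000.0 / baud


def rtu_read_ms(registers: int, baud: int, device_delay_ms: float = 0.0) -> float:
    """Estimate one FC 03/04 read: request, response, and two silent gaps."""
    request = frame_ms(8, baud)                   # addr + FC + start + count + CRC
    response = frame_ms(5 + 2 * registers, baud)  # header and CRC + 2 bytes per register
    silence = 2 * frame_ms(3.5, baud)             # 3.5-character gap after each frame
    return request + response + silence + device_delay_ms


single = rtu_read_ms(1, 9600, device_delay_ms=25.0)
bulk = rtu_read_ms(50, 9600, device_delay_ms=25.0)
print(f"50 single reads: {50 * single:.0f} ms, one bulk read: {bulk:.0f} ms")
```

The 25ms device delay is an assumed placeholder; measuring the real turnaround of each device is what makes the model useful.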

Grouping Algorithm

The practical algorithm for contiguous grouping:

  1. Sort tags by function code, then by register address within each group
  2. Walk the sorted list and identify contiguous runs (where addr[n+1] <= addr[n] + elem_count[n])
  3. Enforce a maximum group size — the Modbus spec allows up to 125 registers (250 bytes) per FC 03/04 read, but practical implementations should cap at 50-100 to stay within device buffer limits
  4. Handle gaps intelligently — if two tags are separated by 3 unused registers, it's cheaper to read the gap (3 extra registers × 2 bytes = 6 bytes) than to issue a separate request (50ms+ overhead)
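
The four steps above could be sketched as follows. This is one possible shape under stated assumptions: the `Tag` dataclass and `group_tags` function are hypothetical names, addresses are the on-the-wire register numbers, and `max_gap` is the number of unused registers the caller is willing to read through.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tag:
    name: str
    function_code: int
    address: int         # register address as sent on the wire
    elem_count: int = 1  # registers occupied (2 for a 32-bit float)


def group_tags(tags, max_gap=0, max_registers=100):
    """Return a list of (function_code, start, count, [tags]) read requests."""
    groups = []
    ordered = sorted(tags, key=lambda t: (t.function_code, t.address))
    for tag in ordered:
        tag_end = tag.address + tag.elem_count  # exclusive end
        if groups:
            fc, start, count, members = groups[-1]
            group_end = start + count
            new_end = max(tag_end, group_end)
            same_fc = fc == tag.function_code
            close_enough = tag.address - group_end <= max_gap
            fits = new_end - start <= max_registers
            if same_fc and close_enough and fits:
                # Extend the current request to cover this tag (and any gap).
                groups[-1] = (fc, start, new_end - start, members + [tag])
                continue
        groups.append((tag.function_code, tag.address, tag.elem_count, [tag]))
    return groups
```

With `max_gap=0` only strictly contiguous or overlapping tags merge; raising it trades a few wasted registers for fewer transactions.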

Gap Tolerance: The Break-Even Point

When should you read through a gap versus splitting into two requests?

For Modbus TCP, the overhead of a separate request is ~2-5ms. Each extra register costs ~0.02ms. Break-even: ~100-250 register gap — almost always worth reading through.

For Modbus RTU at 9600 baud, the overhead is ~50ms. Each register costs ~2ms. Break-even: ~25 registers — read through anything smaller, split anything larger.

For Modbus RTU at 19200 baud, overhead drops to ~25ms, each register ~1ms. Break-even: ~25 registers — similar ratio holds.

Practical recommendation: Set your gap tolerance to 20 registers for RTU and 100 registers for TCP. You'll read a few bytes of irrelevant data but dramatically reduce transaction count.

Multi-Register Data Types

Many industrial values span multiple consecutive registers:

Data Type | Registers | Bytes | Common Use
INT16 / UINT16 | 1 | 2 | Discrete values, status codes
INT32 / UINT32 | 2 | 4 | Counters, accumulated values
FLOAT32 (IEEE 754) | 2 | 4 | Temperatures, pressures, flows
FLOAT64 | 4 | 8 | High-precision measurements

A 32-bit float at register 4002 occupies registers 4002 and 4003. Your grouping algorithm must account for elem_count — reading only register 4002 gives you half a float, which decodes to a nonsensical value.

Byte Ordering Nightmares

This is where Modbus gets genuinely painful. The Modbus spec defines big-endian byte order within each 16-bit register, but says nothing about how multi-register values should be assembled. Different manufacturers use different conventions:

Byte Order | Register Order | Name | Who Uses It
Big-endian | High word first | AB CD | Most European PLCs, Siemens
Big-endian | Low word first | CD AB | Some Asian manufacturers
Little-endian | High word first | BA DC | Rare
Little-endian | Low word first | DC BA | Some legacy equipment

A temperature reading of 42.5°C stored as IEEE 754 float 0x422A0000:

  • AB CD: Register 4002 = 0x422A, Register 4003 = 0x0000 → ✅ 42.5
  • CD AB: Register 4002 = 0x0000, Register 4003 = 0x422A → ✅ 42.5 (if you swap)
  • BA DC: Register 4002 = 0x2A42, Register 4003 = 0x0000 → ❌ Garbage without byte-swap

The only reliable approach: During commissioning, write a known value (e.g., 100.0 = 0x42C80000) to a test register and verify that your gateway decodes it correctly. Document the byte order per device — it will save you hours later.
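
A commissioning check like this is easy to script. The sketch below decodes a register pair under each of the four orderings in the table; `decode_float32` and the order labels are illustrative names, and the registers are passed in the order they were read.

```python
import struct


def decode_float32(reg_a: int, reg_b: int, order: str = "ABCD") -> float:
    """Decode two 16-bit registers (in read order) as an IEEE 754 float."""
    raw = struct.pack(">HH", reg_a, reg_b)  # the four bytes exactly as received
    a, b, c, d = raw
    layouts = {
        "ABCD": bytes([a, b, c, d]),  # big-endian bytes, high word first
        "CDAB": bytes([c, d, a, b]),  # big-endian bytes, low word first
        "BADC": bytes([b, a, d, c]),  # bytes swapped within each word
        "DCBA": bytes([d, c, b, a]),  # fully reversed
    }
    return struct.unpack(">f", layouts[order])[0]


print(decode_float32(0x422A, 0x0000, "ABCD"))  # 42.5
print(decode_float32(0x0000, 0x422A, "CDAB"))  # 42.5
print(decode_float32(0x2A42, 0x0000, "BADC"))  # 42.5
```

Trying all four orders against a known test value and keeping the one that decodes correctly is a quick way to fill in the per-device byte-order record.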

Production-grade platforms like machineCDN handle byte ordering at the device configuration level, so each connected machine can have its own byte-order profile without requiring custom parsing logic.

Intelligent Retry Logic

Network errors happen. Serial bus collisions happen. PLCs get busy and respond late. Your retry strategy determines whether a transient error becomes a data gap or a transparent recovery.

The Naive Approach (Don't Do This)

for each tag:
    result = read_register(tag)
    if failed:
        retry 3 times immediately

Problems:

  • Hammers a struggling device with back-to-back requests
  • Blocks all other reads while retrying
  • Doesn't distinguish between transient errors (timeout) and permanent errors (wrong address)

A Better Approach: Error-Classified Retry

Different errors deserve different responses:

Error | Severity | Action
Timeout (ETIMEDOUT) | Transient | Retry with backoff, reconnect if persistent
Connection reset (ECONNRESET) | Connection | Close connection, reconnect, resume
Connection refused (ECONNREFUSED) | Infrastructure | Back off significantly (device may be rebooting)
Broken pipe (EPIPE) | Connection | Reconnect immediately
Bad file descriptor (EBADF) | Internal | Recreate context from scratch
Illegal data address (Modbus exception 02) | Permanent | Don't retry — tag is misconfigured
Device busy (Modbus exception 06) | Transient | Retry after delay

Retry Count and Timing

For batch reads (contiguous groups), a reasonable strategy:

  1. First attempt: Read the full register group
  2. On failure: Wait 50ms (RTU) or 10ms (TCP), retry the same group
  3. Second failure: Wait 100ms, retry once more
  4. Third failure: Log the error, mark the group as failed for this cycle, move to next group
  5. After a connection-level error (timeout, reset, refused):
    • Close the Modbus context
    • Set device state to "disconnected"
    • On next poll cycle, attempt reconnection
    • If reconnection succeeds, flush any stale data in the serial buffer before resuming reads

Critical detail for RTU: After a timeout or error, always flush the serial buffer before retrying. Stale bytes from a partial response can corrupt the next transaction's framing, causing a cascade of CRC errors.
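
Put together, the classification table and the retry schedule might look like the sketch below. It is written against an injected `read_group` callable so it stays independent of any Modbus library; `ModbusException`, the status strings, and the `flush` hook are all hypothetical names introduced for illustration.

```python
import errno
import time

TRANSIENT = {errno.ETIMEDOUT}
RECONNECT = {errno.ECONNRESET, errno.EPIPE, errno.ECONNREFUSED, errno.EBADF}


class ModbusException(Exception):
    """Stand-in for a library's exception-response error (hypothetical)."""

    def __init__(self, code):
        super().__init__(f"Modbus exception {code:02d}")
        self.code = code


def read_with_retry(read_group, group, delays_s=(0.05, 0.1),
                    sleep=time.sleep, flush=lambda: None):
    """Return (values, status): 'ok', 'failed', 'permanent' or 'reconnect'."""
    for attempt in range(len(delays_s) + 1):
        try:
            return read_group(group), "ok"
        except ModbusException as exc:
            if exc.code != 6:  # only "device busy" is worth retrying
                return None, "permanent"
        except OSError as exc:
            if exc.errno in RECONNECT:
                return None, "reconnect"  # caller closes and reopens the context
            if exc.errno not in TRANSIENT:
                raise
        if attempt < len(delays_s):
            sleep(delays_s[attempt])
            flush()  # drop stale bytes before the next request
    return None, "failed"
```

The default delays are the RTU values from the list above (50ms, then 100ms); a TCP caller would pass shorter ones.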

Inter-Request Delay

Modbus RTU requires a 3.5 character-time silence between frames. At 9600 baud, this is approximately 4ms. At 19200 baud, it's 2ms.

Many implementations add a fixed 50ms delay between requests as a safety margin. This works but is wasteful — on a 100-tag system, you're spending 5 seconds just on inter-request delays.

Better approach: Use a 5ms delay at 9600 baud and a 2ms delay at 19200 baud. Monitor CRC error rates — if they increase, lengthen the delay. Some older devices need more silence time than the spec requires.
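
The silence figures quoted above follow directly from the character time. A small helper, assuming 11 bits per character, makes the calculation explicit; it also applies the fixed 1.75ms value that the Modbus serial line specification recommends above 19200 baud. The function name is illustrative.

```python
def silent_interval_ms(baud: int, bits_per_char: int = 11) -> float:
    """Minimum inter-frame silence (3.5 character times) in milliseconds."""
    if baud > 19200:
        return 1.75  # fixed value recommended for higher baud rates
    return 3.5 * bits_per_char * 1000.0 / baud


for baud in (9600, 19200, 115200):
    print(baud, round(silent_interval_ms(baud), 2))
```

Any configured inter-request delay should sit at or above this value, with extra margin for devices that need more silence than the spec requires.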

Multi-Device Polling Scheduling

When your edge gateway talks to multiple Modbus devices (common in manufacturing — one PLC per machine line, plus temperature controllers, VFDs, and meters), polling strategy becomes a scheduling problem.

Round-Robin: Simple but Wasteful

The naive approach:

while true:
    for each device:
        for each tag_group in device:
            read(tag_group)
    sleep(poll_interval)

Problem: If you have 8 devices with different priorities, the critical machine's data is delayed by reads to 7 other devices.

Priority-Based with Interval Tiers

A better model uses per-tag read intervals:

Tier | Interval | Typical Tags | Purpose
Critical | 1-5 sec | Machine running/stopped, alarm bits, emergency states | Immediate operational awareness
Process | 30-60 sec | Temperatures, pressures, RPMs, flow rates, power consumption | Trend analysis, anomaly detection
Diagnostic | 5-60 min | Firmware version, serial numbers, configuration values, cumulative counters | Asset management

Implementation: Maintain a last_read_timestamp per tag (or per tag group). On each poll loop, only read groups where now - last_read > interval.
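
A minimal version of that bookkeeping is sketched below. `PollGroup` and `due_groups` are hypothetical names, the intervals are examples taken from the tier table, and the comparison uses `>=` so a group becomes due exactly when its interval has elapsed.

```python
import time
from dataclasses import dataclass


@dataclass
class PollGroup:
    name: str
    interval_s: float
    last_read: float = float("-inf")  # never read yet, so due immediately


def due_groups(groups, now=None):
    """Return the groups whose interval has elapsed and stamp them as read."""
    now = time.monotonic() if now is None else now
    due = [g for g in groups if now - g.last_read >= g.interval_s]
    for g in due:
        g.last_read = now
    return due


groups = [
    PollGroup("critical", 5),
    PollGroup("process", 30),
    PollGroup("diagnostic", 3600),
]
print([g.name for g in due_groups(groups, now=0)])  # all three on the first pass
print([g.name for g in due_groups(groups, now=5)])  # ['critical']
```

Using a monotonic clock rather than wall time keeps the schedule stable when NTP adjusts the gateway's clock.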

This dramatically reduces bus traffic. In a typical plastics manufacturing scenario with 8 machines:

  • Without tiers: 400 register reads every 5 seconds = 80 reads/sec
  • With tiers: 80 critical + 40 process + 2 diagnostic = ~14 reads/sec average

That's a 5.7x reduction in bus utilization.

Change-Based Transmission

For data going to the cloud, there's another optimization layer: compare-on-change. Many industrial values don't change between reads — a setpoint stays at 350°F for hours, a machine status stays "running" for the entire shift.

The strategy:

  1. Read the register at its configured interval (always — you need to know the current value)
  2. Compare the new value against the last transmitted value
  3. Only transmit to the cloud if:
    • The value has changed, OR
    • A maximum time-without-update has elapsed (heartbeat)

For boolean tags (machine running, alarm active), compare every read and transmit immediately on change — these are the signals that matter most for operational response.

For analog tags (temperature, pressure), you can add a deadband: only transmit if the value has changed by more than X% or Y absolute units. A temperature reading that bounces between 349.8°F and 350.2°F doesn't need to generate 60 cloud messages per hour.
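
The compare-on-change rules, including the deadband and the heartbeat, fit in a small per-tag filter. This sketch uses an absolute deadband only; `ChangeFilter` is a hypothetical class name and the default heartbeat is an arbitrary example.

```python
class ChangeFilter:
    """Decide whether a new reading should be transmitted to the cloud."""

    def __init__(self, deadband=0.0, heartbeat_s=300.0):
        self.deadband = deadband
        self.heartbeat_s = heartbeat_s
        self.last_value = None  # last value actually transmitted
        self.last_sent = None   # time of the last transmission

    def should_send(self, value, now):
        first = self.last_value is None
        changed = not first and abs(value - self.last_value) > self.deadband
        stale = not first and now - self.last_sent >= self.heartbeat_s
        if first or changed or stale:
            self.last_value, self.last_sent = value, now
            return True
        return False


temp = ChangeFilter(deadband=0.5, heartbeat_s=60)
print(temp.should_send(350.0, now=0))   # True (first reading)
print(temp.should_send(350.2, now=10))  # False (inside the deadband)
print(temp.should_send(351.0, now=20))  # True (moved by more than 0.5)
```

Comparing against the last transmitted value, not the last read value, prevents a slow drift from slipping through in steps that are each smaller than the deadband. With the default deadband of 0 the same class works for boolean tags.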

machineCDN's edge agent implements this compare-and-transmit pattern natively, batching changed values into optimized payloads that minimize both bandwidth and cloud ingestion costs.

RTU vs TCP: Polling Strategy Differences

Modbus RTU (Serial RS-485)

  • Half-duplex: Only one device can transmit at a time
  • Single bus: All devices share the same wire pair
  • Addressing: Slave address 1-247 (broadcast at 0)
  • Speed: Typically 9600-115200 baud
  • Critical constraint: Bus contention — you MUST wait for a complete response (or timeout) before addressing another device
  • CRC: 16-bit CRC appended to every frame

RTU polling tips:

  • Set timeouts based on maximum expected response size. For 125 registers at 9600 baud: (125 * 2 bytes * 10 bits/byte) / 9600 = ~260ms plus overhead ≈ 300-500ms timeout
  • Never set the timeout below the theoretical transmission time — you'll get phantom timeouts
  • If one device on the bus goes unresponsive, its timeout blocks ALL other devices. Aggressive timeout + retry is better than patient timeout + no retry.

Modbus TCP

  • Full-duplex: Request and response can overlap (on different connections)
  • Multi-connection: Each device gets its own TCP socket
  • No contention: Parallel reads to different devices
  • Speed: 100Mbps+ network bandwidth (practically unlimited for Modbus payloads)
  • Transaction ID: The MBAP header includes a transaction ID for matching responses to requests

TCP polling tips:

  • Use persistent connections — TCP handshake + Modbus connection setup adds 10-50ms per connection. Reconnect only on error.
  • You CAN poll multiple TCP devices simultaneously using non-blocking sockets or threads. This is a massive advantage over RTU.
  • Set TCP keepalive on the socket — some industrial firewalls and managed switches close idle connections after 60 seconds.
  • The Modbus/TCP unit identifier field is usually ignored (set to 0xFF) for direct device connections, but matters if you're going through a TCP-to-RTU gateway.

Bandwidth Optimization: Binary vs JSON Batching

Once you've read your registers, the data needs to get to the cloud. The payload format matters enormously at scale.

JSON Format (Human-Readable)

{
  "groups": [{
    "ts": 1709330400,
    "device_type": 5000,
    "serial_number": 1106336053,
    "values": [
      {"id": 1, "v": 4.4, "st": 0},
      {"id": 2, "v": 162.5, "st": 0},
      {"id": 3, "v": 158.3, "st": 0}
    ]
  }]
}

For 3 values: ~200 bytes.

Binary Format (Machine-Optimized)

A well-designed binary format encodes the same data in:

  • 4 bytes: timestamp (uint32)
  • 2 bytes: device type (uint16)
  • 4 bytes: serial number (uint32)
  • Per value: 2 bytes (tag ID) + 1 byte (status) + 4 bytes (value) = 7 bytes

For 3 values: 31 bytes — an 84% reduction.
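
The byte count can be confirmed with Python's `struct` module. The field widths below follow the list above; the big-endian byte order, the float32 value type, and the function names are assumptions made for this sketch, not a description of any specific product's wire format.

```python
import struct

HEADER = struct.Struct(">IHI")  # timestamp, device type, serial number (10 bytes)
VALUE = struct.Struct(">HBf")   # tag ID, status, float32 value (7 bytes)


def encode_group(ts, device_type, serial_number, values):
    """Pack one group; `values` is an iterable of (tag_id, status, value)."""
    out = bytearray(HEADER.pack(ts, device_type, serial_number))
    for tag_id, status, value in values:
        out += VALUE.pack(tag_id, status, value)
    return bytes(out)


def decode_group(payload):
    """Inverse of encode_group, useful for round-trip tests."""
    ts, device_type, serial = HEADER.unpack_from(payload, 0)
    offsets = range(HEADER.size, len(payload), VALUE.size)
    values = [VALUE.unpack_from(payload, off) for off in offsets]
    return ts, device_type, serial, values


payload = encode_group(1709330400, 5000, 1106336053,
                       [(1, 0, 4.4), (2, 0, 162.5), (3, 0, 158.3)])
print(len(payload))  # 31
```

Values such as 4.4 come back from `decode_group` rounded to float32 precision, which is why round-trip tests should compare with a tolerance or use exactly representable values like 162.5.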

Over cellular connections (common for remote industrial sites), this difference is enormous. A machine reporting 50 values every 60 seconds:

  • JSON: ~3.3 KB/min → 4.8 MB/day → 144 MB/month
  • Binary: ~0.5 KB/min → 0.7 MB/day → 21 MB/month

That's the difference between a $10/month cellular plan and a $50/month plan — multiplied by every connected machine.

The batching approach also matters. Instead of transmitting each read result immediately, accumulate values into a batch and transmit when either:

  • The batch reaches a size threshold (e.g., 4KB)
  • A time threshold expires (e.g., 30 seconds since batch started)

This amortizes the MQTT/HTTP overhead across many data points and enables efficient compression.
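
A size-or-age batcher could be sketched like this. The class name and default thresholds are illustrative, and for brevity the age check only runs when a record is added; a production version would also flush from a timer so a quiet period cannot hold a batch open indefinitely.

```python
import time


class Batcher:
    """Accumulate encoded records until a size or age threshold is reached."""

    def __init__(self, max_bytes=4096, max_age_s=30.0, clock=time.monotonic):
        self.max_bytes, self.max_age_s, self.clock = max_bytes, max_age_s, clock
        self.items, self.size, self.started = [], 0, None

    def add(self, payload: bytes):
        """Add one record; return the finished batch, or None if still filling."""
        if self.started is None:
            self.started = self.clock()  # the batch starts with its first record
        self.items.append(payload)
        self.size += len(payload)
        too_big = self.size >= self.max_bytes
        too_old = self.clock() - self.started >= self.max_age_s
        return self.flush() if too_big or too_old else None

    def flush(self):
        """Return the current batch and reset the accumulator."""
        batch, self.items, self.size, self.started = self.items, [], 0, None
        return batch
```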

Store-and-Forward: Surviving Connectivity Gaps

Industrial environments have unreliable connectivity — cellular modems reboot, VPN tunnels flap, WiFi access points go down during shift changes. Your polling shouldn't stop when the cloud connection drops.

A robust edge gateway implements a local buffer:

  1. Poll and batch as normal, regardless of cloud connectivity
  2. When connected: Transmit batches immediately
  3. When disconnected: Store batches in a ring buffer (sized to the available memory)
  4. When reconnected: Drain the buffer in chronological order before transmitting live data

Buffer Sizing

The buffer should be sized to survive your typical outage duration:

Data Rate | 1-Hour Buffer | 8-Hour Buffer | 24-Hour Buffer
1 KB/min | 60 KB | 480 KB | 1.4 MB
10 KB/min | 600 KB | 4.8 MB | 14.4 MB
100 KB/min | 6 MB | 48 MB | 144 MB

For embedded gateways with 256MB RAM, a 2MB ring buffer comfortably handles 8-24 hours of typical industrial data at modest polling rates. The key design decision is what happens when the buffer fills: either stop accepting new data (gap in the newest data) or overwrite the oldest data (gap in the oldest data). For most IIoT use cases, overwriting oldest is preferred, because recent data is more actionable than historical data.
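
The overwrite-oldest policy maps directly onto a bounded `collections.deque`. The sketch below counts batches rather than bytes to stay short, which is a simplification; `StoreAndForward` and the `publish` callback (assumed to return True on success) are hypothetical names.

```python
from collections import deque


class StoreAndForward:
    """Bounded buffer that drops the oldest batch when full."""

    def __init__(self, max_batches: int):
        # deque(maxlen=...) silently discards the oldest entry when full.
        self.buffer = deque(maxlen=max_batches)

    def store(self, batch: bytes) -> None:
        self.buffer.append(batch)

    def drain(self, publish) -> int:
        """Send buffered batches oldest-first; stop at the first failure."""
        sent = 0
        while self.buffer:
            if not publish(self.buffer[0]):
                break  # still offline; keep the batch for the next attempt
            self.buffer.popleft()
            sent += 1
        return sent
```

A batch is only removed after `publish` reports success, so a connection that drops mid-drain loses nothing that was still in the buffer.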

Putting It All Together: A Production Polling Architecture

Here's what a production-grade Modbus polling pipeline looks like:

┌─────────────────────────────────────────────────────┐
│  POLL SCHEDULER                                     │
│  Per-tag intervals → Priority queue → Due tags      │
└──────────────┬──────────────────────────────────────┘
               │
               ▼
┌─────────────────────────────────────────────────────┐
│  REGISTER GROUPER                                   │
│  Sort by FC → Find contiguous runs → Build groups   │
└──────────────┬──────────────────────────────────────┘
               │
               ▼
┌─────────────────────────────────────────────────────┐
│  MODBUS READER                                      │
│  Execute reads → Retry on error → Reconnect logic   │
└──────────────┬──────────────────────────────────────┘
               │
               ▼
┌─────────────────────────────────────────────────────┐
│  VALUE PROCESSOR                                    │
│  Byte-swap → Type conversion → Scaling → Compare    │
└──────────────┬──────────────────────────────────────┘
               │
               ▼
┌─────────────────────────────────────────────────────┐
│  BATCH ENCODER                                      │
│  Group by timestamp → Encode (JSON/binary) → Size   │
└──────────────┬──────────────────────────────────────┘
               │
               ▼
┌─────────────────────────────────────────────────────┐
│  STORE-AND-FORWARD BUFFER                           │
│  Ring buffer → Page management → Drain on connect   │
└──────────────┬──────────────────────────────────────┘
               │
               ▼
┌─────────────────────────────────────────────────────┐
│  MQTT PUBLISHER                                     │
│  QoS 1 → Async delivery → ACK tracking              │
└─────────────────────────────────────────────────────┘

This is the architecture that machineCDN's edge agent implements. Each layer is independently testable, and failures at any layer don't crash the pipeline — they produce graceful degradation (data gaps in the cloud, not system crashes on the factory floor).

Benchmarks: Before and After Optimization

Real numbers from a plastics manufacturing deployment with 8 Modbus RTU devices on a 19200 baud RS-485 bus, 320 total tags:

| Metric | Unoptimized | Optimized | Improvement |
| --- | --- | --- | --- |
| Full poll cycle time | 48 sec | 6.2 sec | 7.7x faster |
| Requests per cycle | 320 | 42 | 87% fewer |
| Bus utilization | 94% | 12% | Room for growth |
| Data gaps per day | 15-20 | 0-1 | Near-zero |
| Cloud bandwidth (daily) | 180 MB | 28 MB | 84% reduction |
| Avg tag staleness | 48 sec | 6 sec | 8x fresher |

The unoptimized system needed 48 seconds to complete a single poll cycle, with the bus close to saturation. The optimized system polls roughly every 6 seconds with headroom to add more devices.

Final Recommendations

  1. Always group contiguous registers — this is non-negotiable for production systems
  2. Use tiered polling intervals — not every tag needs the same update rate
  3. Implement error-classified retry — don't retry permanent errors, do retry transient ones
  4. Use binary encoding for cellular — JSON is fine for LAN-connected gateways
  5. Size your store-and-forward buffer for your realistic outage window
  6. Flush the serial buffer after errors — this prevents CRC cascades on RTU
  7. Document byte ordering per device — test with known values during commissioning
  8. Monitor bus utilization — stay below 30% to leave headroom for retries and growth

Modbus isn't going away. But the difference between a naive implementation and an optimized one is the difference between a system that barely works and one that scales to hundreds of machines without breaking a sweat.


machineCDN's edge agent handles Modbus RTU and TCP with optimized register grouping, binary batching, and store-and-forward buffering out of the box. Connect your PLCs in minutes, not weeks. Get started →

Modbus RTU vs Modbus TCP: A Practical Comparison for Plant Engineers [2026]


If you've spent time on the factory floor, you've touched Modbus. It's the lingua franca of industrial automation — older than most engineers who use it, yet still embedded in nearly every PLC, VFD, sensor, and temperature controller shipping today.

But "Modbus" isn't one protocol. It's two very different beasts that happen to share a register model. Understanding when to use Modbus RTU over RS-485 serial versus Modbus TCP over Ethernet isn't academic — it directly impacts your polling throughput, your wiring costs, your alarm response times, and whether your edge gateway can actually keep up with your machines.

This guide breaks down both protocols at the wire level, compares real-world performance, and gives you a decision framework for your next deployment.

The Frame: What Actually Goes On The Wire

Modbus RTU Frame Structure

Modbus RTU (Remote Terminal Unit) sends binary data over a serial connection — typically RS-485, sometimes RS-232 for point-to-point. The frame is compact:

[Slave Address: 1 byte] [Function Code: 1 byte] [Data: N bytes] [CRC-16: 2 bytes]

A typical "read holding registers" request looks like this on the wire:

01 03 00 00 00 0A C5 CD
│ │ │ │ └── CRC-16 (little-endian)
│ │ │ └──────── Quantity: 10 registers
│ │ └────────────── Starting address: 0x0000
│ └───────────────── Function code: 0x03 (Read Holding Registers)
└──────────────────── Slave address: 1

That's 8 bytes for the request. The response carries 20 bytes of register data plus 5 bytes of overhead — 25 bytes total. Clean. Efficient. Zero wasted bandwidth.
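The frame above can be reproduced with a short sketch. The CRC routine is the standard bitwise Modbus CRC-16 (reflected polynomial 0xA001, initial value 0xFFFF); the function names are chosen for this example and do not come from any specific library.

```python
import struct


def crc16_modbus(data: bytes) -> int:
    """Bitwise Modbus CRC-16: init 0xFFFF, reflected polynomial 0xA001."""
    crc = 0xFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ 0xA001
            else:
                crc >>= 1
    return crc


def build_read_holding_request(slave: int, start: int, quantity: int) -> bytes:
    """Build an RTU 'read holding registers' (0x03) request frame."""
    # Address, function code, start and quantity are big-endian on the wire
    body = struct.pack(">BBHH", slave, 0x03, start, quantity)
    # The CRC is appended low byte first
    return body + struct.pack("<H", crc16_modbus(body))


frame = build_read_holding_request(slave=1, start=0, quantity=10)
print(frame.hex(" "))
```

Note the mixed byte order: the address and quantity fields are big-endian, while the CRC goes out low byte first.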

The silent interval problem: RTU uses timing to delimit frames. A gap of at least 3.5 character times (approximately 3.6ms at 9600 baud, assuming 10-bit characters) signals the end of one frame. This means:

  • A pause longer than 1.5 character times inside a frame makes the receiver discard it as incomplete
  • Multitasking operating systems (Linux, Windows) can introduce jitter that corrupts framing
  • At 9600 baud, one character takes ~1.04ms, so the inter-frame gap is ~3.6ms
  • At 19200 baud, the gap shrinks to ~1.8ms, which tightens the timing requirements
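The timing figures above follow from simple arithmetic, sketched below. The helper names are hypothetical, and the default of 10 bits per character (1 start + 8 data + 1 stop) is an assumption that matches the numbers in the text; with a parity bit or a second stop bit the character is 11 bits and the gaps grow by 10%.

```python
def char_time_ms(baud: int, bits_per_char: int = 10) -> float:
    """Time to transmit one character, in milliseconds."""
    return bits_per_char * 1000.0 / baud


def inter_frame_gap_ms(baud: int, bits_per_char: int = 10) -> float:
    """Minimum silent interval between frames: 3.5 character times."""
    return 3.5 * char_time_ms(baud, bits_per_char)


for baud in (9600, 19200):
    print(baud, round(char_time_ms(baud), 2), round(inter_frame_gap_ms(baud), 2))
```

Above 19200 baud the Modbus serial specification recommends a fixed 1.75ms inter-frame delay rather than a scaled one, so this formula should not be extrapolated to 115200.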

Modbus TCP Frame Structure

Modbus TCP wraps the same function codes in a TCP/IP packet with an MBAP (Modbus Application Protocol) header:

[Transaction ID: 2 bytes] [Protocol ID: 2 bytes] [Length: 2 bytes] [Unit ID: 1 byte] [Function Code: 1 byte] [Data: N bytes]

The same read request becomes:

00 01 00 00 00 06 01 03 00 00 00 0A
│ │ │ │ │ │ └── Quantity: 10 registers
│ │ │ │ │ └──────── Starting address: 0x0000
│ │ │ │ └─────────── Function code: 0x03
│ │ │ └────────────── Unit ID: 1
│ │ └──────────────────── Remaining bytes: 6
│ └────────────────────────── Protocol ID: 0x0000 (Modbus)
└──────────────────────────────── Transaction ID: 0x0001

Key difference: No CRC. TCP handles error detection at the transport layer. The Transaction ID is huge — it lets you pipeline multiple requests without waiting for responses, something RTU physically cannot do.
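For comparison with the RTU frame, the TCP request above can be assembled with `struct`. This is a sketch with a hypothetical function name; the MBAP length field counts the Unit ID plus the PDU that follows it.

```python
import struct


def build_tcp_read_holding_request(transaction_id: int, unit_id: int,
                                   start: int, quantity: int) -> bytes:
    """Build a Modbus TCP 'read holding registers' (0x03) request."""
    # PDU: function code + starting address + quantity, all big-endian
    pdu = struct.pack(">BHH", 0x03, start, quantity)
    # MBAP header: transaction ID, protocol ID (0 = Modbus), length, unit ID
    mbap = struct.pack(">HHHB", transaction_id, 0x0000, len(pdu) + 1, unit_id)
    return mbap + pdu


request = build_tcp_read_holding_request(transaction_id=1, unit_id=1,
                                         start=0, quantity=10)
print(request.hex(" "))
```

Incrementing `transaction_id` per request is what lets a client match out-of-order responses to the requests that caused them.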

Serial Configuration: Getting the Basics Right

When you configure a Modbus RTU connection, you're setting up a serial port. The classic configuration that works with most PLCs:

| Parameter | Typical Value | Notes |
| --- | --- | --- |
| Baud Rate | 9600 | Some devices support 19200, 38400, even 115200 |
| Data Bits | 8 | Almost universally 8 |
| Parity | None | Some devices default to Even; check documentation |
| Stop Bits | 1 | Use 2 when parity is None (per Modbus spec, though 1 works for most devices) |
| Byte Timeout | 4ms | Time between individual bytes within a frame |
| Response Timeout | 100ms | Maximum wait for slave response |

The byte timeout and response timeout are where most deployment issues hide. Set the byte timeout too low on a noisy RS-485 bus and you'll get fragmented frames. Set the response timeout too high and your polling cycle slows to a crawl when a device goes offline.

Real-world rule: On a clean RS-485 bus with less than 100 meters of cable, 4ms byte timeout and 100ms response timeout works reliably. Add 20ms to the response timeout for every 100 meters of additional cable, and double both values if you're running near VFDs or welding equipment.

Modbus TCP: Port 502 and What Lives Behind It

Modbus TCP devices listen on port 502 by default. When you configure a gateway to talk to a PLC over TCP, you're specifying:

  • IP address of the PLC or protocol converter
  • TCP port (502 is standard)
  • Unit ID (equivalent to the slave address — matters when a single IP serves multiple logical devices)

The connection lifecycle matters more than most engineers realize:

  1. TCP handshake: ~1ms on a local network, but can spike to 50ms+ through managed switches with port security
  2. Keep-alive: Modbus TCP doesn't define keep-alive. Some PLCs will drop idle connections after 30-60 seconds
  3. Connection pooling: A well-designed gateway maintains persistent connections rather than reconnecting per poll cycle

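The Unit ID trap: When you have a Modbus TCP-to-RTU bridge (common when retrofitting serial devices onto Ethernet), the Unit ID maps to the RTU slave address on the serial side. Unit ID 0 is the serial broadcast address, so many bridges forward it to every device, which can cause chaos on a shared RS-485 bus. Unit ID 255 (0xFF) is conventionally used to address the TCP device itself, and bridges differ in how they treat it. Always set the Unit ID explicitly to the target slave address.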

Performance: Real Numbers, Not Spec Sheet Fantasy

Here's what actually matters — how fast can you poll data?

Modbus RTU at 9600 Baud

Reading 10 holding registers from a single device:

  • Request frame: 8 bytes → 8.3ms
  • Slave processing time: 2-10ms (PLC-dependent)
  • Response frame: 25 bytes → 26ms
  • Inter-frame gap: 3.6ms × 2 = 7.2ms
  • Total per device: ~45-55ms

With 10 devices on an RS-485 bus, one complete poll cycle takes 450-550ms. That's roughly 2 polls per second — acceptable for temperature monitoring, too slow for motion control.

Bumping to 19200 baud cuts transmission time in half, getting you to ~30ms per device or about 3.3 polls per second across 10 devices.
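The per-device estimate can be checked with a small calculation. The function name is hypothetical and the model is deliberately simple: wire time for both frames, two inter-frame gaps, and slave processing time, assuming 10-bit characters.

```python
def rtu_transaction_ms(baud: int, request_bytes: int, response_bytes: int,
                       processing_ms: float, bits_per_char: int = 10) -> float:
    """Estimate one request/response exchange on an RTU bus, in milliseconds."""
    char_ms = bits_per_char * 1000.0 / baud
    wire_ms = (request_bytes + response_bytes) * char_ms
    gaps_ms = 2 * 3.5 * char_ms  # one silent interval after each frame
    return wire_ms + gaps_ms + processing_ms


for baud in (9600, 19200):
    best = rtu_transaction_ms(baud, 8, 25, processing_ms=2)
    worst = rtu_transaction_ms(baud, 8, 25, processing_ms=10)
    print(f"{baud} baud: {best:.1f}-{worst:.1f} ms per device")
```

The computed range at 9600 baud comes out slightly below the rounded 45-55ms quoted above, which is expected since the model ignores driver latency and OS scheduling jitter.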

Modbus TCP on 100Mbps Ethernet

The same 10-register read over TCP:

  • Request frame: 12 bytes (+ TCP overhead) → under 1ms
  • Slave processing time: 2-10ms
  • Response frame: 29 bytes → under 1ms
  • TCP ACK overhead: ~0.5ms
  • Total per device: ~5-15ms

But here's where TCP shines: pipelining. With the Transaction ID, you can fire 10 requests without waiting for responses. A well-optimized gateway can poll 10 devices in 15-25ms total, or roughly 40-65 complete poll cycles per second.

The Contiguous Register Advantage

Whether RTU or TCP, reading contiguous registers in a single request is dramatically faster than individual reads. Each request pays a fixed cost in framing, turnaround, and slave processing, while each extra register adds only 2 bytes to the response. One request for 50 contiguous registers therefore costs a small fraction of 50 single-register requests.

If your PLC stores related data in registers 40001-40050, read them all in one Function Code 03 request. If the data is scattered across registers 40001, 40200, 40500, and 41000, you need four separate requests — four times the overhead.
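A rough byte count makes the overhead argument concrete. This sketch counts only RTU bytes on the wire for FC 03 reads (8-byte request, 5 bytes of response overhead, 2 bytes per register) and ignores gaps and processing time, which would widen the difference further. The function name is made up for the example.

```python
def rtu_read_bytes(register_count: int) -> int:
    """Total request + response bytes for one FC 03 read on RTU."""
    request = 8                          # addr + FC + start + quantity + CRC
    response = 5 + 2 * register_count    # addr + FC + byte count + CRC, plus data
    return request + response


single_reads = 50 * rtu_read_bytes(1)   # 50 separate one-register requests
one_batch = rtu_read_bytes(50)          # one request for 50 contiguous registers
print(single_reads, one_batch)
```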

Smart IIoT platforms like machineCDN optimize this automatically, grouping contiguous register reads into batch requests that minimize round-trips to the PLC.

Function Codes: The Ones That Actually Matter

The Modbus spec defines 20+ function codes, but in practice you'll use five:

| Code | Name | Use Case |
| --- | --- | --- |
| 0x01 | Read Coils | Digital outputs (on/off states) |
| 0x02 | Read Discrete Inputs | Digital inputs (sensor contacts) |
| 0x03 | Read Holding Registers | The workhorse: analog values, setpoints, configuration |
| 0x04 | Read Input Registers | Read-only process values (some PLCs put sensor data here) |
| 0x06 | Write Single Register | Sending commands or setpoints to the PLC |

The register type confusion: Modbus defines four data spaces — coils (1-bit R/W), discrete inputs (1-bit RO), holding registers (16-bit R/W), and input registers (16-bit RO). Different PLC manufacturers map data differently. A temperature reading might be in holding register 40001 on one brand and input register 30001 on another. Always check the PLC's register map.

Error Handling: Where Deployments Break

RTU Error Detection

RTU uses CRC-16 (reflected polynomial 0xA001, initial value 0xFFFF). If a single bit flips during transmission, which is common on electrically noisy factory floors, the CRC fails and the receiver discards the frame. The master then retries, burning another 45ms+.

Common RTU error scenarios:

  • No response (timeout): Device is offline, wrong slave address, or cable broken. The master waits for the full response timeout before moving on.
  • CRC mismatch: Electrical noise. Check cable shielding, termination resistors (120Ω at each end of the RS-485 bus), and distance from high-power equipment.
  • Exception response: The slave responds with function code + 0x80, indicating an error (illegal address, illegal data value, slave device failure). This is actually good — it means the device is alive and communicating.
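Telling an exception response apart from a normal one is a matter of checking the high bit of the function code. The sketch below assumes the input is the PDU only (function code plus data, with the slave address and CRC already stripped); the function name and the choice to cover only the first four exception codes are decisions made for this example.

```python
EXCEPTION_CODES = {
    0x01: "Illegal function",
    0x02: "Illegal data address",
    0x03: "Illegal data value",
    0x04: "Slave device failure",
}


def parse_exception(pdu: bytes):
    """Return (original_function_code, description) for an exception PDU, else None."""
    if len(pdu) >= 2 and pdu[0] & 0x80:
        original_fc = pdu[0] & 0x7F
        description = EXCEPTION_CODES.get(pdu[1], f"Unknown exception 0x{pdu[1]:02X}")
        return original_fc, description
    return None


print(parse_exception(bytes([0x83, 0x02])))              # exception to a 0x03 read
print(parse_exception(bytes([0x03, 0x02, 0x00, 0x01])))  # normal response
```

Exception responses of this kind are permanent errors for that request, so retrying the identical read is usually pointless; timeouts and CRC failures are the ones worth retrying.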

TCP Error Handling

TCP's built-in retry and checksum mechanisms handle bit errors transparently. Your Modbus TCP errors are typically:

  • Connection refused: Device is down or port 502 is blocked
  • Connection timeout: Network issue, VLAN misconfiguration, or firewall
  • No response on established connection: PLC is overloaded or has crashed — the TCP connection stays open but the application layer is dead

The zombie connection problem: A PLC might crash while the TCP connection remains technically open (no FIN packet sent). Your gateway keeps sending requests into the void, timing out on each one. Implement application-level heartbeats — if you don't get a valid Modbus response within 3 consecutive poll cycles, tear down the connection and reconnect.
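The "3 consecutive poll cycles" rule can be expressed as a small counter. This is a sketch with a hypothetical class name; the actual teardown and reconnect calls depend on the Modbus library in use and are left to the caller.

```python
class ConnectionWatchdog:
    """Signals a reconnect after N consecutive poll cycles without a valid response."""

    def __init__(self, max_failures: int = 3):
        self.max_failures = max_failures
        self.failures = 0

    def record(self, got_valid_response: bool) -> bool:
        """Record one poll cycle; return True when the caller should reconnect."""
        if got_valid_response:
            self.failures = 0
            return False
        self.failures += 1
        if self.failures >= self.max_failures:
            self.failures = 0  # start counting afresh on the new connection
            return True
        return False


watchdog = ConnectionWatchdog()
outcomes = [watchdog.record(ok) for ok in [False, False, True, False, False, False]]
# Only the final cycle (the third consecutive failure) requests a reconnect.
```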

Wiring and Physical Layer Considerations

RS-485 for RTU

  • Max cable length: 1,200 meters (4,000 feet) at 9600 baud
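  • Max devices per bus: 32 unit loads (standard transceivers), or up to 256 devices electrically with 1/8-unit-load transceivers; Modbus addressing itself allows 247 slave addresses (1-247)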
  • Topology: Multi-drop bus (daisy-chain, NOT star)
  • Termination: 120Ω resistors at both ends of the bus
  • Cable: Shielded twisted pair (STP), 24 AWG minimum

The star topology trap: RS-485 is designed for daisy-chain (bus) topology. Running cables from a central hub to each device in a star pattern creates reflections that corrupt signals. If your plant layout forces star wiring, use an RS-485 hub/repeater at the center.

Ethernet for TCP

  • Max cable length: 100 meters per segment (Cat5e/Cat6)
  • Devices: Limited only by switch capacity and IP addressing
  • Topology: Star (standard Ethernet)
  • Switches: Use industrial-rated managed switches. Consumer switches will die in a factory environment within months.

When to Use Each Protocol

Choose Modbus RTU when:

  • Connecting to legacy devices that only have serial ports
  • Cable runs are long (200m+) and Ethernet infrastructure doesn't exist
  • You need simplicity — two wires, no switches, no IP configuration
  • Budget is tight and the device count is low (under 10 per bus)
  • Temperature controllers, VFDs, and simple sensors with RS-485 ports

Choose Modbus TCP when:

  • You need high poll rates (>5 Hz per device)
  • Connecting 10+ devices at one location
  • Ethernet infrastructure already exists
  • You want to pipeline requests for maximum throughput
  • Remote access or cloud connectivity is needed (TCP routes through firewalls more easily)
  • The PLC supports it (most modern PLCs do)

The hybrid reality: Most IIoT deployments end up with both. A Modbus TCP-capable PLC talks to the edge gateway over Ethernet while older serial devices connect through an RS-485 port on the same gateway. Platforms like machineCDN handle this natively — the edge gateway manages both protocol stacks and normalizes the data into a unified model before it leaves the plant floor.

Configuration Pitfalls That Will Waste Your Time

  1. Baud rate mismatch: Every device on an RTU bus must use the same baud rate. One device at 19200 on a 9600 bus will generate garbage that confuses everything.

  2. Duplicate slave addresses: Two devices with the same address on the same RS-485 bus will both try to respond simultaneously, corrupting each other's frames.

  3. Polling too fast: If your poll interval is shorter than the total round-trip time for all devices, requests will pile up and timeouts cascade. Calculate your minimum cycle time before setting the poll interval.

  4. Byte ordering (endianness): A 32-bit float spanning two 16-bit Modbus registers can be arranged as big-endian (AB CD), word-swapped (CD AB), byte-swapped (BA DC), or little-endian (DC BA). The spec doesn't mandate an order. Each manufacturer chooses their own. Test with known values before assuming.

  5. Register addressing: Some documentation uses 0-based addressing (register 0 = first register), others use 1-based (register 1 = first register), and some use Modbus convention addressing (40001 = first holding register). Off-by-one errors here will give you data from the wrong register — and the values might look plausible enough to be dangerous.
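For the byte-ordering pitfall, a commissioning helper that tries all four layouts against a known value is often the quickest test. The sketch below is illustrative only: the function name and the four-letter order labels are conventions chosen for this example, where `A` is the most significant byte of the IEEE 754 float and the label gives the order in which the device puts the bytes on the wire.

```python
import struct


def decode_float32(registers, order: str) -> float:
    """Decode two 16-bit registers (in the order read) into a 32-bit float."""
    w0, w1, w2, w3 = struct.pack(">HH", *registers)  # bytes as they arrived
    rearranged = {
        "ABCD": (w0, w1, w2, w3),  # big-endian
        "CDAB": (w2, w3, w0, w1),  # word-swapped
        "BADC": (w1, w0, w3, w2),  # byte-swapped
        "DCBA": (w3, w2, w1, w0),  # little-endian
    }[order]
    return struct.unpack(">f", bytes(rearranged))[0]


# 72.5 is 0x42910000 as an IEEE 754 single-precision float
for order, regs in [("ABCD", (0x4291, 0x0000)), ("CDAB", (0x0000, 0x4291)),
                    ("BADC", (0x9142, 0x0000)), ("DCBA", (0x0000, 0x9142))]:
    print(order, decode_float32(regs, order))
```

During commissioning, writing a known setpoint on the device and decoding the raw registers with each order shows at a glance which layout the manufacturer chose.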

Scaling Factors and Unit Conversion

Many PLCs expose numbers as integers, typically 16-bit signed or unsigned register values. A temperature of 72.5°F might be stored as:

  • 7250 with an implicit scale factor of ÷100
  • 725 with a scale factor of ÷10
  • 73 rounded to the nearest integer

The register map documentation should specify the scale factor, but many don't. When you see register values like 7250, 1472, or 2840, you need to figure out the engineering units.

Unit conversions are common in multi-vendor environments:

  • Fahrenheit to Celsius: (F - 32) × 5/9
  • Weight (lbs to kg): lbs ÷ 2.205
  • Pressure (PSI to kPa): PSI ÷ 0.145
  • Length (feet to meters): ft ÷ 3.281

A robust IIoT platform handles these conversions at the edge, storing normalized SI values in the cloud regardless of what the PLC natively reports.
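Scaling and conversion at the edge might look like the following sketch. The function names are hypothetical, and the divisor must come from the device's register map; nothing in the raw value itself reveals it.

```python
def scale_raw(raw: int, divisor: float, signed: bool = True) -> float:
    """Convert a raw 16-bit register value to engineering units."""
    if signed and raw >= 0x8000:
        raw -= 0x10000  # reinterpret as two's-complement int16
    return raw / divisor


def fahrenheit_to_celsius(deg_f: float) -> float:
    return (deg_f - 32) * 5 / 9


raw_register = 7250                       # as read from the device
deg_f = scale_raw(raw_register, 100)      # implicit scale factor of ÷100
deg_c = fahrenheit_to_celsius(deg_f)
print(deg_f, deg_c)
```

The signed flag matters: a register holding 0xFFF6 means -10 on a signed device and 65526 on an unsigned one, and only the register map says which.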

Conclusion: The Protocol Doesn't Matter as Much as the Architecture

Modbus RTU and Modbus TCP are both viable for modern IIoT deployments. The protocol choice is a physical-layer decision — what ports does the equipment have, how far away is it, and how fast do you need data?

The real challenge is what happens after the data leaves the register: normalizing values from heterogeneous equipment, handling connectivity loss gracefully, batching telemetry for efficient cloud delivery, and turning raw register data into actionable insights.

Whether your machines speak RTU over serial or TCP over Ethernet, the goal is the same — get reliable, normalized data off the plant floor and into the hands of engineers who can act on it.


machineCDN connects to both Modbus RTU and Modbus TCP devices through its edge gateway, handling protocol translation, data normalization, and store-and-forward buffering automatically. Learn how it works →

MQTT Broker Architecture for Industrial Deployments: Clustering, Persistence, and High Availability [2026]


Every IIoT tutorial makes MQTT look simple: connect, subscribe, publish. Three calls and you're streaming telemetry. What those tutorials don't tell you is what happens when your broker goes down at 2 AM, your edge gateway's cellular connection drops for 40 minutes, or your plant generates 50,000 messages per second and you need every single one to reach the historian.

Industrial MQTT isn't a protocol problem. It's an architecture problem. The protocol itself is elegant and well-specified. The hard part is designing the broker infrastructure — clustering, persistence, session management, and failover — so that zero messages are lost when (not if) something fails.

This article is for engineers who've gotten past "hello world" and need to build MQTT infrastructure that meets manufacturing reliability requirements. We'll cover the internal mechanics that matter, the failure modes you'll actually hit, and the architecture patterns that work at scale.

How MQTT Brokers Actually Handle Messages

Before discussing architecture, let's nail down what the broker is actually doing internally. This understanding is critical for sizing, troubleshooting, and making sensible design choices.

The Session State Machine

When a client connects with CleanSession=false (MQTT 3.1.1) or CleanStart=false with a non-zero SessionExpiryInterval (MQTT 5.0), the broker creates a persistent session bound to the client ID. This session maintains:

  • The set of subscriptions (topic filters + QoS levels)
  • QoS 1 and QoS 2 messages queued while the client is offline
  • In-flight QoS 2 message state (PUBLISH received, PUBREC sent, waiting for PUBREL)
  • The packet identifier namespace

This is the mechanism that makes MQTT suitable for unreliable networks — and it's the mechanism that will eat your broker's memory and disk if you don't manage it carefully.

Message Flow at QoS 1

Most industrial deployments use QoS 1 (at least once delivery). Here's what actually happens inside the broker:

  1. Publisher sends PUBLISH with QoS 1 and a packet identifier
  2. Broker receives the message and must:
    • Match the topic against all active subscription filters
    • For each matching subscription, enqueue the message
    • For connected subscribers with matching QoS, deliver immediately
    • For disconnected subscribers with persistent sessions, store in the session queue
    • Persist the message to disk (if persistence is enabled) before acknowledging
  3. Broker sends PUBACK to the publisher — only after all storage operations complete
  4. For each connected subscriber, broker sends PUBLISH and waits for PUBACK
  5. If PUBACK isn't received, broker retransmits on reconnection

The critical detail: step 3 is the durability guarantee. If the broker crashes between receiving the PUBLISH and sending the PUBACK, the publisher will retransmit. If the broker crashes after PUBACK but before delivering to all subscribers, the message must survive the crash — which means it must be on disk.

QoS 2: The Four-Phase Handshake

QoS 2 (exactly once) uses a four-message handshake: PUBLISH → PUBREC → PUBREL → PUBCOMP. The broker must maintain state for each in-flight QoS 2 transaction. In industrial settings, this is occasionally used for critical state changes (machine start/stop commands, recipe downloads) where duplicate delivery would cause real damage.

The operational cost: each QoS 2 message takes four packets and two round trips per hop, versus a single packet for QoS 0 and one round trip for QoS 1, and the broker must maintain per-message transaction state. For high-frequency telemetry, this is almost never worth the overhead. QoS 1 with application-level deduplication (using message timestamps or sequence numbers) is the standard industrial approach.

Broker Persistence: What Gets Stored and Where

In-Memory vs Disk-Backed

A broker with no persistence is a broker that loses messages on restart. Period. For development and testing, in-memory operation is fine. For production industrial deployments, you need disk-backed persistence.

What needs to be persisted:

| Data | Purpose | Storage Impact |
| --- | --- | --- |
| Retained messages | Last-known-good value per topic | Grows with topic count |
| Session state | Offline subscriber queues | Grows with offline duration × message rate |
| Inflight messages | QoS 1/2 messages awaiting acknowledgment | Usually small, bounded by max_inflight |
| Will messages | Last-will-and-testament per client | One per connected client |

The session queue is where most storage problems originate. Consider: an edge gateway publishes 100 tags at 1-second intervals. Each message is ~200 bytes. If the cloud subscriber goes offline for 1 hour, that's 360,000 messages × 200 bytes = ~72 MB queued for that single client. Now multiply by 50 gateways across a plant.

Practical Queue Management

Every production broker deployment needs queue limits:

  • Maximum queue depth — Cap the number of messages per session queue. When the queue is full, either drop the oldest message (most common for telemetry) or reject new publishes (appropriate for control messages).
  • Maximum queue size in bytes — A secondary safeguard when message sizes vary.
  • Message expiry — MQTT 5.0 supports per-message expiry intervals. For telemetry data, 1-hour expiry is typical — a temperature reading from 3 hours ago has no operational value.
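To make the interaction between queue depth and message expiry concrete, here is a toy model of a per-session queue. It is a sketch for reasoning about the policy, not how any particular broker implements it; the class name, the drop-oldest choice, and the explicit `now` parameter are assumptions made for the example.

```python
from collections import deque


class SessionQueue:
    """Per-session offline queue with a depth cap (drop-oldest) and message expiry."""

    def __init__(self, max_depth: int, expiry_seconds: float):
        self.max_depth = max_depth
        self.expiry_seconds = expiry_seconds
        self._queue = deque()

    def enqueue(self, payload, now: float) -> None:
        if len(self._queue) >= self.max_depth:
            self._queue.popleft()  # telemetry policy: discard the oldest message
        self._queue.append((now, payload))

    def drain(self, now: float) -> list:
        """Return unexpired payloads oldest-first and empty the queue."""
        fresh = [p for (queued_at, p) in self._queue
                 if now - queued_at <= self.expiry_seconds]
        self._queue.clear()
        return fresh


queue = SessionQueue(max_depth=3, expiry_seconds=3600)
for t, payload in [(0, "a"), (10, "b"), (20, "c"), (30, "d")]:
    queue.enqueue(payload, now=t)
delivered = queue.drain(now=3615)
# "a" was dropped by the depth cap, "b" expired, "c" and "d" are delivered.
```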

A well-configured broker with 4 GB of RAM can handle approximately:

  • 100,000 active sessions
  • 500,000 subscriptions
  • 10,000 messages/second throughput
  • 50 MB of retained messages

These are ballpark figures that vary enormously with message size, topic tree depth, and subscription overlap. Always benchmark with your actual traffic profile.

Clustering: Why and How

A single broker is a single point of failure. For industrial deployments where telemetry loss means blind spots in production monitoring, you need broker clustering.

Active-Active vs Active-Passive

Active-passive (warm standby): One broker handles all traffic. A secondary broker synchronizes state and takes over on failure. Failover time: typically 5-30 seconds depending on detection mechanism.

Active-active (load sharing): Multiple brokers share the client load. Messages published to any broker are replicated to subscribers on other brokers. This provides both high availability and horizontal scalability.

The Shared Subscription Problem

In a clustered setup, if three subscribers share a subscription (e.g., three historian instances splitting the ingest load), each message should be delivered to exactly one of them, not all three. MQTT 5.0's shared subscriptions ($share/group/topic) handle this by distributing messages among group members; round-robin is common, but the strategy is broker-specific.

Without shared subscriptions, each historian instance receives every message, tripling your write load. This is one of the strongest arguments for MQTT 5.0 over 3.1.1 in industrial architectures.

Message Ordering Guarantees

MQTT guarantees message ordering per publisher, per topic, per QoS level. In a clustered broker, maintaining this guarantee across brokers requires careful replication design. Most broker clusters provide:

  • Strong ordering for messages within a single broker node
  • Eventual ordering for messages replicated across nodes (typically < 100ms delay)

For industrial telemetry where timestamps are embedded in the payload, eventual ordering is almost always acceptable. For control messages where sequencing matters, route the publisher and subscriber to the same broker node.

Designing the Edge-to-Cloud Pipeline

The most common industrial MQTT architecture has three layers:

Layer 1: Edge Broker (On-Premises)

Runs on the edge gateway or a local server within the plant network. Responsibilities:

  • Local subscribers — HMI panels, local alarm engines, historian
  • Store-and-forward buffer — Queues messages when cloud connectivity is lost
  • Protocol translation — Accepts data from Modbus/EtherNet/IP collectors and publishes to MQTT
  • Data reduction — Filters unchanged values, aggregates high-frequency data

The edge broker must run on reliable storage (SSD, not SD card) because it's your buffer against network outages. Size the storage for your worst-case outage duration:

Storage needed = (messages/sec) × (avg message size) × (max outage seconds)

Example: 500 msg/s × 200 bytes × 3600 sec = 360 MB per hour of outage
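The sizing formula is easy to wrap in a helper for what-if calculations. The function name is hypothetical, results are in bytes (decimal megabytes in the printout), and the estimate covers payload only; broker-side per-message overhead would come on top.

```python
def buffer_bytes(msgs_per_sec: float, avg_msg_bytes: int, outage_seconds: float) -> float:
    """Payload bytes that accumulate during an outage of the given length."""
    return msgs_per_sec * avg_msg_bytes * outage_seconds


one_hour = buffer_bytes(500, 200, 3600)
eight_hours = buffer_bytes(500, 200, 8 * 3600)
print(one_hour / 1e6, "MB per hour;", eight_hours / 1e9, "GB for an 8-hour outage")
```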

Layer 2: Bridge to Cloud

The edge broker bridges selected topics to a cloud-hosted broker or IoT hub. Key configuration decisions:

  • Bridge QoS — Use QoS 1 for the bridge connection. QoS 0 means any TCP reset loses messages in transit. QoS 2 adds overhead with minimal benefit since telemetry is naturally idempotent.
  • Topic remapping — Prefix bridged topics with a plant/location identifier. A local topic machines/chiller-01/temperature becomes plant-detroit/machines/chiller-01/temperature in the cloud.
  • Bandwidth throttling — Limit the bridge's publish rate to avoid saturating the WAN link. If local collection runs at 500 msg/s but your link can sustain 200 msg/s, the edge broker must buffer or aggregate the difference.

Layer 3: Cloud Broker Cluster

Receives bridged data from all plants. Serves cloud-hosted consumers: analytics pipelines, dashboards, ML training jobs. This layer typically uses a managed service (Azure IoT Hub, AWS IoT Core, HiveMQ Cloud) or a self-hosted cluster.

Key sizing for cloud brokers:

  • Concurrent connections — One per edge gateway, plus cloud consumers
  • Message throughput — Sum of all edge bridge rates
  • Retention — Typically short (minutes to hours). Long-term storage is the historian's job.

Connection Management: The Details That Bite You

Keep-Alive and Half-Open Connections

MQTT's keep-alive mechanism is your primary tool for detecting dead connections. When a client sets keepAlive=60, it must send a PINGREQ within 60 seconds if no other packets are sent. The broker will close the connection if it sees no packet from the client within 1.5× the keep-alive interval (90 seconds in this case).

In industrial environments, be aware of:

  • NAT timeouts — Many firewalls and NAT devices close idle TCP connections after 30-120 seconds. Set keep-alive below your NAT timeout.
  • Cellular networks — 4G/5G connections can silently disconnect. A keep-alive of 30 seconds is aggressive but appropriate for cellular gateways.
  • Half-open connections — The TCP connection is dead but neither side has detected it. Until keep-alive expires, the broker maintains the session and queues messages that will never be delivered. This is why aggressive keep-alive matters.

Last Will and Testament for Device Health

Configure every edge gateway with a Last Will and Testament (LWT):

Topic: devices/{device-id}/status
Payload: {"status": "offline", "timestamp": 1709251200}
QoS: 1
Retain: true

After every successful connection, publish a retained "online" message to the same topic. Now any subscriber can check device status by reading the retained message on the status topic. If the device disconnects uncleanly (network failure, power loss), the broker publishes the LWT automatically.

This pattern provides a real-time device health map across your entire fleet without any polling or heartbeat logic in your application.

Authentication and Authorization at Scale

Certificate-Based Authentication

For fleets of 100+ edge gateways, username/password authentication becomes an operational burden. Certificate-based TLS client authentication scales better:

  • Issue each gateway a unique X.509 certificate from your PKI
  • Configure the broker to extract the client identity from the certificate's Common Name (CN) or Subject Alternative Name (SAN)
  • Revoke compromised devices by updating the Certificate Revocation List (CRL) — no password rotation needed

Topic-Level Authorization

Not every device should publish to every topic. A well-designed ACL (Access Control List) restricts:

  • Each gateway can only publish to plants/{plant-id}/devices/{device-id}/#
  • Each gateway can only subscribe to plants/{plant-id}/devices/{device-id}/commands/#
  • Cloud services can subscribe to plants/+/devices/+/# (wildcard across all plants)
  • No device can subscribe to another device's command topics

This contains the blast radius of a compromised device. It can only pollute its own data stream, not inject false data into other devices' telemetry.
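The ACL rules above rest on MQTT topic-filter matching, which can be sketched in plain Python to test a policy before loading it into a broker. The function names and the example plant and device IDs are hypothetical, and this simplified matcher ignores the special handling of topics that begin with `$`.

```python
def topic_matches(topic_filter: str, topic: str) -> bool:
    """Simplified MQTT matching: '+' matches one level, '#' matches the rest."""
    filter_levels = topic_filter.split("/")
    topic_levels = topic.split("/")
    for i, level in enumerate(filter_levels):
        if level == "#":
            return True  # multi-level wildcard, also matches the parent level
        if i >= len(topic_levels):
            return False
        if level != "+" and level != topic_levels[i]:
            return False
    return len(filter_levels) == len(topic_levels)


def gateway_may_publish(plant_id: str, device_id: str, topic: str) -> bool:
    """A gateway may publish only beneath its own device subtree."""
    return topic_matches(f"plants/{plant_id}/devices/{device_id}/#", topic)


own = gateway_may_publish("detroit", "gw-01", "plants/detroit/devices/gw-01/telemetry/temp")
other = gateway_may_publish("detroit", "gw-01", "plants/detroit/devices/gw-02/telemetry/temp")
cloud = topic_matches("plants/+/devices/+/#", "plants/detroit/devices/gw-02/status")
print(own, other, cloud)
```

Running the policy through a matcher like this in a unit test is a cheap way to catch an overly broad wildcard before it reaches production.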

Monitoring Your Broker: The Metrics That Matter

$SYS Topics

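Most MQTT brokers expose internal metrics via $SYS/ topics. The $SYS tree is not standardized, so check your broker's documentation for the exact topic names. Typical examples: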

  • $SYS/broker/messages/received — Total messages received (track rate, not absolute)
  • $SYS/broker/clients/connected — Current connected client count
  • $SYS/broker/subscriptions/count — Active subscription count
  • $SYS/broker/retained messages/count — Retained message count (Mosquitto spells this topic with a space; other brokers differ)
  • $SYS/broker/heap/current — Memory usage

Operational Alerts

Set alerts for:

  • Connected client count drops > 10% in 5 minutes → possible network issue
  • Message rate drops > 50% vs rolling average → possible edge gateway failure
  • Heap usage > 80% of available → approaching memory limit, check session queue sizes
  • Subscription count anomaly → possible subscription leak (client reconnecting without cleaning up)

Where machineCDN Fits

All of this broker infrastructure complexity is why IIoT platforms exist. machineCDN's edge software handles the protocol collection layer (Modbus, EtherNet/IP, and more), implements the store-and-forward buffering that keeps data safe during connectivity gaps, and manages the secure delivery pipeline to cloud infrastructure. The goal is to let plant engineers focus on what the data means rather than how to transport it reliably.

Whether you build your own MQTT infrastructure or use a managed platform, the principles in this article apply. Understand your persistence requirements, size your queues for realistic outage durations, and test failover before you need it in production. The protocol is simple. The architecture is where the engineering happens.

Quick Reference: Broker Sizing Calculator

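| Plant Size | Edge Gateways | Tags/Gateway | Msgs/sec (total) | Min Broker RAM | Storage (1hr buffer) |
| --- | --- | --- | --- | --- | --- |
| Small | 10 | 50 | 500 | 1 GB | 360 MB |
| Medium | 50 | 100 | 5,000 | 4 GB | 3.6 GB |
| Large | 200 | 200 | 40,000 | 16 GB | 28.8 GB |
| Enterprise | 500+ | 500 | 250,000 | 64 GB+ | 180 GB+ |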

These assume 200-byte average message size, QoS 1, and 1-second publishing intervals per tag. Your mileage will vary — always benchmark with representative traffic.

OPC-UA Pub/Sub vs Client/Server: Choosing the Right Pattern for Your Plant Floor [2026]


If you've spent any time connecting PLCs to cloud dashboards, you've run into OPC-UA. The protocol dominates industrial interoperability conversations — and for good reason. Its information model, security architecture, and cross-vendor compatibility make it the lingua franca of modern manufacturing IT.

But here's what trips up most engineers: OPC-UA isn't a single communication pattern. It's two fundamentally different paradigms sharing one information model. Client/server has been the workhorse since OPC-UA's inception. Pub/sub, ratified in Part 14 of the specification, is the newer pattern designed for one-to-many data distribution. Picking the wrong one can mean the difference between a system that scales to 500 machines and one that falls over at 50.

Let's break down when you need each, how they actually behave on the wire, and where the real-world performance boundaries lie.

The Client/Server Model: What You Already Know (and What You Don't)

OPC-UA client/server follows a familiar request-response paradigm. A client establishes a secure channel to a server, opens a session, creates one or more subscriptions, and receives notifications when monitored item values change.

How Subscriptions Actually Work

This is where many engineers have an incomplete mental model. A subscription isn't a simple "tell me when X changes." It's a multi-layered construct:

  1. Monitored Items — Each tag you want to observe becomes a monitored item with its own sampling interval (how often the server checks the underlying data source) and queue size (how many values to buffer between publish cycles).

  2. Publishing Interval — The subscription itself has a publishing interval that determines how frequently the server packages up change notifications and sends them to the client. This is independent of the sampling interval.

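  3. Keep-alive — If no notifications are ready for a configurable number of consecutive publishing intervals (the max keep-alive count), the server sends an empty keep-alive message so the client knows the subscription is still alive. In the other direction, if the client stops sending publish requests for longer than the subscription's lifetime, the server deletes the subscription.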

The key insight: sampling and publishing are decoupled. You might sample a temperature sensor at 100ms but publish batched notifications only once per second. Provided the monitored item's queue is deep enough to hold the ten samples taken in between, this reduces network traffic without losing fidelity at the source.
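Decoupling only preserves every sample if the monitored item's queue can hold everything sampled during one publishing interval. A minimal sketch of that arithmetic follows; the function name is hypothetical, and real deployments usually add headroom for timing jitter.

```python
import math


def min_queue_size(sampling_interval_ms: float, publishing_interval_ms: float) -> int:
    """Smallest queue that holds every sample taken in one publishing interval."""
    if sampling_interval_ms <= 0 or publishing_interval_ms <= 0:
        raise ValueError("intervals must be positive")
    # At least one slot is always needed, even when sampling is slower than publishing.
    return max(1, math.ceil(publishing_interval_ms / sampling_interval_ms))


print(min_queue_size(100, 1000))   # 100ms sampling, 1s publishing
print(min_queue_size(300, 1000))   # intervals that do not divide evenly
```

With a smaller queue, the server has to discard samples between publish cycles, so the client sees fewer values than were actually sampled.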

Real-World Performance Characteristics

In practice, a single OPC-UA server can typically sustain:

  • 50-200 concurrent client sessions (depending on hardware)
  • 5,000-50,000 monitored items across all sessions
  • Publishing intervals down to ~50ms before CPU becomes the bottleneck

Secure channel negotiation adds 200-800ms per connection, depending on security policy.

The bottleneck isn't usually bandwidth — it's the server's CPU. Every subscription requires the server to maintain state, evaluate sampling queues, and serialize notification messages for each connected client independently. This is the fan-out problem.

When Client/Server Breaks Down

Consider a plant with 200 machines, each exposing 100 tags. A central historian, a real-time dashboard, an analytics engine, and an alarm system all need access. That's 4 clients × 200 servers × 100 tags: 800 sessions and 80,000 monitored items plant-wide.

Every server must maintain four independent subscription contexts. Every data change gets serialized and transmitted four times — once per client. The server doesn't know or care that all four clients want the same data. It can't share work between them.

At moderate scale, this works fine. At plant-wide scale with hundreds of devices and dozens of consumers, you're asking each embedded OPC-UA server on a PLC to handle work that grows linearly with the number of consumers. That's the architectural tension pub/sub was designed to resolve.
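The linear growth is easy to see by counting the state each server has to hold. The sketch below uses a hypothetical helper and assumes, as the scenario above implies, that every consumer subscribes to every tag on every server.

```python
def client_server_load(servers: int, tags_per_server: int, consumers: int) -> dict:
    """Count sessions and monitored items under pure client/server fan-out."""
    return {
        "sessions_per_server": consumers,
        "monitored_items_per_server": tags_per_server * consumers,
        "sessions_plant_wide": servers * consumers,
        "monitored_items_plant_wide": servers * tags_per_server * consumers,
    }


# The 200-machine plant with four consumers.
for key, value in client_server_load(200, 100, 4).items():
    print(f"{key}: {value:,}")

# Per-server load when the consumer count grows to 12.
print(client_server_load(200, 100, 12)["monitored_items_per_server"])
```

Tripling the consumers triples the per-server monitored items, even though the underlying 100 tags have not changed.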

The Pub/Sub Model: How It Actually Differs

OPC-UA Pub/Sub fundamentally changes the relationship between data producers and consumers. Instead of maintaining per-client connections, a publisher emits data to a transport (typically UDP multicast or an MQTT broker) and subscribers independently consume from that transport.

The Wire Format: UADP vs JSON

Pub/sub messages can be encoded in two ways:

UADP (UA Datagram Protocol) — A compact binary encoding optimized for bandwidth-constrained networks. A typical dataset message with 50 variables fits in ~400 bytes. Headers contain security metadata, sequence numbers, and writer group identifiers. This is the format you want for real-time control loops.

JSON encoding — Human-readable, easier to debug, but 3-5x larger on the wire. Useful when messages need to traverse IT infrastructure (firewalls, API gateways, log aggregators) where binary inspection is impractical.
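To get a feel for the size gap, you can compare a packed binary payload with the same values serialized as JSON. This is only a rough stand-in: it is not real UADP framing (there are no headers or security fields), and the `Var0`…`Var49` field names and readings are made up. Real messages add metadata on both sides, so the ratio you measure here will differ from production traffic.

```python
import json
import struct

# 50 made-up float readings standing in for one dataset.
values = [20.0 + 0.1 * i for i in range(50)]

# Binary stand-in: 50 little-endian 64-bit floats, no headers.
binary_payload = struct.pack("<50d", *values)

# JSON stand-in: one named field per variable (hypothetical names).
json_payload = json.dumps(
    {f"Var{i}": v for i, v in enumerate(values)}
).encode("utf-8")

print(len(binary_payload), "bytes binary")
print(len(json_payload), "bytes JSON")
print(f"ratio: {len(json_payload) / len(binary_payload):.1f}x")
```

Most of the JSON overhead comes from repeating field names and printing numbers as text, which is also what makes it easy to inspect in logs.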

Publisher Configuration

A publisher organizes its output into a hierarchy:

Publisher
└── WriterGroup (publishing interval, transport settings)
    └── DataSetWriter (maps to a PublishedDataSet)
        └── PublishedDataSet (the actual variables)

Each WriterGroup controls the publishing cadence and encoding. A single publisher might have one WriterGroup at 100ms for critical process variables and another at 10 seconds for auxiliary measurements.

DataSetWriters bind the data model to the transport. They define which variables go into which messages and how they're sequenced.
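One way to picture the hierarchy is as nested data structures. The sketch below models it with plain dataclasses. The class and field names mirror the terms above but are illustrative only; this is not the configuration API of any particular OPC-UA stack, and the publisher ID and variable names are invented.

```python
from dataclasses import dataclass, field


@dataclass
class PublishedDataSet:
    name: str
    variables: list[str]


@dataclass
class DataSetWriter:
    writer_id: int
    dataset: PublishedDataSet


@dataclass
class WriterGroup:
    name: str
    publishing_interval_ms: int
    writers: list[DataSetWriter] = field(default_factory=list)


@dataclass
class Publisher:
    publisher_id: str
    groups: list[WriterGroup] = field(default_factory=list)


# Two cadences: 100ms for critical values, 10s for auxiliary ones.
process = PublishedDataSet("process", ["Pressure", "Temperature"])
auxiliary = PublishedDataSet("auxiliary", ["OilLevel"])
publisher = Publisher("press-line-1", [
    WriterGroup("critical", 100, [DataSetWriter(1, process)]),
    WriterGroup("slow", 10_000, [DataSetWriter(2, auxiliary)]),
])

for group in publisher.groups:
    for writer in group.writers:
        print(group.name, group.publishing_interval_ms, writer.dataset.variables)
```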

Subscriber Discovery

One of pub/sub's elegant features is publisher-subscriber decoupling. A subscriber doesn't need to know the publisher's address. It subscribes to a multicast group or MQTT topic and discovers available datasets from the messages themselves. DataSet metadata (field names, types, engineering units) can be embedded in the message or discovered via a separate metadata channel.

In practice, this means you can add a new analytics consumer to a running plant network without touching a single PLC configuration. The publisher doesn't even know the new subscriber exists.

Head-to-Head: The Numbers That Matter

| Dimension | Client/Server | Pub/Sub (UADP/UDP) | Pub/Sub (JSON/MQTT) |
|---|---|---|---|
| Latency (typical) | 5-50 ms | 1-5 ms | 10-100 ms |
| Connection setup | 200-800 ms | None (connectionless) | Broker-dependent |
| Bandwidth per 100 tags | ~2-4 KB/s | ~0.5-1 KB/s | ~3-8 KB/s |
| Max consumers per dataset | ~50 practical | Unlimited (multicast) | Broker-limited |
| Security | Session-level encryption | Message-level signing/encryption | TLS + message-level |
| Firewall traversal | Easy (single TCP) | Hard (multicast) | Easy (TCP to broker) |
| Deterministic timing | No | Yes (with TSN) | No |

The Latency Story

Client/server latency is bounded by the publishing interval plus network round-trip plus serialization overhead. The server must evaluate all monitored items in the subscription, package the notification, encrypt it, and transmit it — for each client independently.

Pub/sub with UADP over UDP can achieve sub-millisecond delivery when combined with Time-Sensitive Networking (TSN). The publisher serializes the dataset once, and the network fabric handles delivery to all subscribers simultaneously. There's no per-subscriber work on the publisher side.

Security Trade-offs

Client/server has the more mature security story. Each session negotiates its own secure channel with certificate-based authentication, message signing, and encryption. The server knows exactly who's connected and can enforce fine-grained access control.

Pub/sub security is message-based. Publishers sign and optionally encrypt messages using security keys distributed through a Security Key Server (SKS). Subscribers must obtain the appropriate keys to decrypt and verify messages. This works, but key distribution and rotation add operational complexity that client/server doesn't have.

Practical Architecture Patterns

Pattern 1: Client/Server for Configuration, Pub/Sub for Telemetry

The most common hybrid approach uses client/server for interactive operations — reading configuration parameters, writing setpoints, browsing the address space, acknowledging alarms — while pub/sub handles the high-frequency telemetry stream.

This plays to each model's strengths. Configuration operations are infrequent, require acknowledgment, and benefit from the request/response guarantee. Telemetry is high-volume, one-directional, and needs to scale to many consumers.

Pattern 2: Edge Aggregation with Pub/Sub Fan-out

Deploy an edge gateway that connects to PLCs via client/server (or native protocols like Modbus or EtherNet/IP), normalizes the data, and re-publishes it via OPC-UA pub/sub. The gateway absorbs the per-device connection complexity while providing a clean, scalable distribution layer.

This is exactly the pattern that platforms like machineCDN implement — the edge software handles the messy reality of multi-protocol PLC communication while providing a unified data stream that any number of consumers can tap into.

Pattern 3: MQTT Broker as Pub/Sub Transport

If your plant network can't support UDP multicast (many can't, due to switch configurations or security policies), use an MQTT broker as the pub/sub transport. The publisher sends OPC-UA pub/sub messages to MQTT topics, usually JSON-encoded for easy inspection, although UADP payloads can also be carried over MQTT when size matters. Subscribers consume from those topics.

You lose the latency advantage of raw UDP, but you gain:

  • Standard IT infrastructure compatibility
  • Last-value delivery to late-joining subscribers (retained messages)
  • Existing monitoring and management tools
  • Firewall-friendly TCP connections

The overhead is measurable — expect 10-50ms additional latency per hop through the broker — but for most monitoring and analytics use cases, this is perfectly acceptable.
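As a rough illustration of what a publisher hands to the broker, the sketch below builds a simplified JSON network message and a topic name. The field names follow a common reading of the Part 14 JSON mapping and should be checked against the specification. The topic layout, publisher ID, and helper name are assumptions, and the actual publish call is left to whichever MQTT client library you use.

```python
import json
import uuid
from datetime import datetime, timezone


def build_network_message(publisher_id: str, writer_id: int,
                          sequence: int, values: dict) -> dict:
    """Assemble a simplified JSON network message with one dataset message."""
    return {
        "MessageId": str(uuid.uuid4()),
        "MessageType": "ua-data",
        "PublisherId": publisher_id,
        "Messages": [{
            "DataSetWriterId": writer_id,
            "SequenceNumber": sequence,
            "Timestamp": datetime.now(timezone.utc).isoformat(),
            "Payload": values,
        }],
    }


topic = "plant1/line3/press-07/telemetry"  # hypothetical topic layout
message = build_network_message("press-07", 1, 42, {"Pressure": 101.3})
payload = json.dumps(message).encode("utf-8")

# topic and payload are what you would pass to your MQTT client's publish call.
print(topic, len(payload), "bytes")
```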

Migration Strategy: Moving from Pure Client/Server

If you're running a pure client/server architecture today and hitting scale limits, don't rip and replace. Migrate incrementally:

  1. Identify high-fan-out datasets — Which datasets have 3+ consumers? Those are your first pub/sub candidates.

  2. Deploy an edge pub/sub gateway — Stand up a gateway that subscribes to your existing OPC-UA servers (via client/server) and republishes via pub/sub. Existing consumers continue to work unchanged.

  3. Migrate consumers one at a time — Move each consumer from direct server connections to the pub/sub stream. Monitor for data quality and latency differences.

  4. Push pub/sub to the source — Once proven, configure PLCs and servers that support native pub/sub to publish directly, eliminating the gateway hop for those devices.
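Step 1 can be automated if you have an inventory of which consumer connects to which data source. The sketch below assumes that inventory is a list of `(consumer, dataset)` pairs; the names and the `high_fan_out` helper are hypothetical, and the default threshold matches the "3+ consumers" rule of thumb.

```python
from collections import defaultdict


def high_fan_out(connections, threshold=3):
    """Return datasets with at least `threshold` distinct consumers."""
    consumers = defaultdict(set)
    for consumer, dataset in connections:
        consumers[dataset].add(consumer)
    return {
        dataset: sorted(names)
        for dataset, names in consumers.items()
        if len(names) >= threshold
    }


# Made-up inventory of existing client/server connections.
connections = [
    ("historian", "press-07"), ("dashboard", "press-07"),
    ("analytics", "press-07"), ("alarms", "press-07"),
    ("historian", "chiller-02"), ("dashboard", "chiller-02"),
]

for dataset, names in high_fan_out(connections).items():
    print(f"{dataset}: {len(names)} consumers -> pub/sub candidate")
```

In this made-up inventory only `press-07` crosses the threshold, so it would be the first dataset to move behind the gateway.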

When to Use Which: The Decision Matrix

Choose Client/Server when:

  • You need request/response semantics (writes, method calls)
  • Consumer count is small and stable (< 10 per server)
  • You need to browse and discover the address space interactively
  • Security audit requirements demand per-session access control
  • Your network doesn't support multicast and you'd rather not operate a broker

Choose Pub/Sub when:

  • You have many consumers for the same dataset
  • You need deterministic, low-latency delivery (especially with TSN)
  • Publishers are resource-constrained (embedded PLCs)
  • You're distributing data across network boundaries (IT/OT convergence)
  • You want to decouple publisher lifecycle from consumer lifecycle

Choose both when:

  • You're building a plant-wide platform (this is most real deployments)
  • Configuration and telemetry have different reliability requirements
  • You need to scale consumers independently of device count

The Future: TSN + Pub/Sub

The convergence of OPC-UA Pub/Sub with IEEE 802.1 Time-Sensitive Networking is arguably the most significant development in industrial networking since Ethernet hit the plant floor. TSN provides guaranteed bandwidth allocation, bounded latency, and time synchronization at the network switch level. Combined with UADP encoding, this enables OPC-UA to replace proprietary fieldbus protocols in deterministic control applications.

We're not there yet for most brownfield deployments. TSN-capable switches are expensive, and PLC vendor support is still rolling out. But for greenfield installations making architecture decisions today, TSN-ready pub/sub infrastructure is worth designing for.

Getting Started

If you're evaluating OPC-UA patterns for your plant:

  1. Audit your current fan-out — Count how many consumers connect to each data source. If any source serves 5+ consumers, pub/sub will reduce its load.

  2. Test your network for multicast — Many industrial Ethernet switches support multicast, but it may not be configured. Work with your network team to test IGMP snooping and multicast routing.

  3. Start with MQTT transport — If multicast isn't viable, MQTT-based pub/sub is the lowest-friction path. You can always migrate to UADP/UDP later.

  4. Consider an edge platform — Platforms like machineCDN handle the protocol translation and data normalization layer, letting you focus on the analytics and business logic rather than wrestling with transport plumbing.

The choice between client/server and pub/sub isn't either/or. It's understanding which pattern serves which data flow — and designing your architecture accordingly.