5 posts tagged with "industrial-automation"

EtherNet/IP Device Auto-Discovery: How Edge Gateways Identify PLCs on the Plant Floor [2026]

March 3, 2026 · 9 min read

Walk onto any modern plant floor and you'll find a patchwork of controllers — Allen-Bradley Micro800 series running EtherNet/IP, Modbus TCP devices from half a dozen vendors, maybe a legacy RTU on a serial port somewhere. The edge gateway sitting in that control cabinet needs to figure out what it's talking to, what protocol to use, and how to pull the right data — ideally without a technician manually configuring every register.

This is the device auto-discovery problem, and solving it well is the difference between a two-hour commissioning and a two-day one.

The Discovery Sequence: Try EtherNet/IP First, Fall Back to Modbus

The most reliable approach follows a dual-protocol detection pattern. When an edge gateway powers up and finds a PLC at a known IP address, it shouldn't assume which protocol that device speaks. Instead, it runs a detection sequence:

Step 1: Attempt EtherNet/IP (CIP) Connection

EtherNet/IP uses the Common Industrial Protocol (CIP) over TCP port 44818. The gateway attempts to create a connection to a known tag — typically a device_type identifier that the PLC firmware exposes as a readable tag.

Protocol: ab-eip
Gateway: 192.168.1.100
CPU: micro800
Tag: device_type
Element Size: 2 bytes (uint16)
Element Count: 1
Timeout: 2000ms

If this connection succeeds and returns a non-zero value, the gateway knows it's talking to an EtherNet/IP device and can proceed to read the serial number components.

Step 2: If EtherNet/IP fails, try Modbus TCP

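If the CIP connection returns an error (typically error code -32; the connection parameters above follow libplctag's attribute names, and in libplctag -32 is PLCTAG_ERR_TIMEOUT, meaning nothing answered at the CIP layer), the gateway falls back to Modbus TCP on port 502.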

For Modbus detection, the gateway reads input register 800 using function code 4. In the prefixed address convention covered later in this article that register is written as 300800; the address actually sent on the wire is 800 (0x0320). This register holds the device type identifier by convention in many industrial equipment families.

Protocol: Modbus TCP
Port: 502
Function Code: 4 (Read Input Registers)
Start Address: 800
Register Count: 1
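
For reference, the detection read above corresponds to a 12-byte Modbus TCP request. The following is a minimal sketch that builds that frame with Python's struct module; the function name, the transaction ID, and the unit ID of 1 are illustrative assumptions rather than values taken from any particular gateway.

```python
import struct

def build_read_input_registers(transaction_id: int, unit_id: int,
                               start: int, count: int) -> bytes:
    """Build a Modbus TCP request for function code 4 (Read Input Registers)."""
    function_code = 4
    # PDU: function code, starting address, register count (all big-endian).
    pdu = struct.pack(">BHH", function_code, start, count)
    # MBAP header: transaction id, protocol id (always 0), length, unit id.
    # The length field counts the unit id byte plus the PDU.
    mbap = struct.pack(">HHHB", transaction_id, 0, len(pdu) + 1, unit_id)
    return mbap + pdu

# Assumed transaction id and unit id; address 800 and count 1 come from the text.
frame = build_read_input_registers(transaction_id=1, unit_id=1, start=800, count=1)
print(frame.hex())  # 000100000006010403200001
```

Note that the register address appears on the wire as 0x0320, with no 3xxxxx prefix; the function code byte carries the register type.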

Step 3: Extract Serial Number

Once the device type is known, the gateway reads serial number components. Here's where things get vendor-specific. Different PLC families store their serial numbers in completely different register locations:

Device Type	Protocol	Month Register	Year Register	Unit Register
Micro800 PLC	EtherNet/IP	Tag: `serial_number_month`	Tag: `serial_number_year`	Tag: `serial_number_unit`
GP Chiller (1017)	Modbus TCP	Input Reg 22	Input Reg 23	Input Reg 24
HE Chiller (1018)	Modbus TCP	Holding Reg 520	Holding Reg 510	Holding Reg 500
TS5 TCU (1021)	Modbus TCP	Holding Reg 1039	Holding Reg 1038	Holding Reg 1040

Notice the inconsistency — even within the same protocol, each device family stores its serial number in different registers, uses different function codes (input registers vs. holding registers), and sometimes the year/month/unit ordering isn't sequential in memory. This is real-world industrial automation, not a textbook.

Serial Number Encoding: Packing Identity into 32 Bits

Once you have the three components (year, month, unit number), they're packed into a single 32-bit serial number for efficient transport:

Byte 3 (bits 31-24): Year  (0x00-0xFF)
Byte 2 (bits 23-16): Month (0x00-0xFF)  
Bytes 1-0 (bits 15-0): Unit Number (0x0000-0xFFFF)

This encoding allows up to 65,535 units per month per year — more than sufficient for any production line. A serial number of 0x18031A2B decodes to: year 0x18 (24), month 0x03 (March), unit 0x1A2B (6699).
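
The packing can be sketched in a few lines of Python. The function names here are hypothetical; the bit layout follows the byte map above.

```python
def pack_serial(year: int, month: int, unit: int) -> int:
    """Pack year, month and unit number into one 32-bit serial number."""
    if not (0 <= year <= 0xFF and 0 <= month <= 0xFF and 0 <= unit <= 0xFFFF):
        raise ValueError("serial component out of range")
    return (year << 24) | (month << 16) | unit

def unpack_serial(serial: int) -> tuple[int, int, int]:
    """Split a 32-bit serial number back into (year, month, unit)."""
    return (serial >> 24) & 0xFF, (serial >> 16) & 0xFF, serial & 0xFFFF

print(hex(pack_serial(0x18, 0x03, 0x1A2B)))  # 0x18031a2b
print(unpack_serial(0x18031A2B))             # (24, 3, 6699)
```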

Validation Matters

A serial number where the year byte is zero is invalid — it almost certainly means the PLC hasn't been properly commissioned or the register read returned garbage data. Your gateway should reject these and report a "bad serial number" status rather than silently accepting a device with identity 0x00000000.

The Configuration Lookup Pattern

Once the gateway knows the device type (e.g., type 1018 = HE Central Chiller), it needs to load the right tag configuration. The proven pattern is a directory scan:

Maintain a directory of JSON configuration files (one per device type)
On detection, scan the directory and match the device_type field in each JSON
Load the matched configuration, which defines all tags, their data types, read intervals, and batching behavior

{
  "device_type": 1018,
  "version": "2.4.1",
  "name": "HE Central Chiller",
  "protocol": "modbus-tcp",
  "plctags": [
    {
      "name": "supply_temp",
      "id": 1,
      "type": "float",
      "addr": 400100,
      "ecount": 2,
      "interval": 5,
      "compare": true
    },
    {
      "name": "compressor_status",
      "id": 2, 
      "type": "uint16",
      "addr": 400200,
      "interval": 1,
      "compare": true,
      "do_not_batch": true
    }
  ]
}
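
A minimal sketch of the directory scan, assuming one JSON file per device type with a top-level device_type field as in the sample above. The function name and the choice to skip malformed files silently are assumptions made for illustration.

```python
import json
from pathlib import Path
from typing import Optional

def find_device_config(config_dir: str, device_type: int) -> Optional[dict]:
    """Return the first JSON config whose device_type matches, or None."""
    for path in sorted(Path(config_dir).glob("*.json")):
        try:
            config = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            continue  # skip unreadable or malformed files
        if isinstance(config, dict) and config.get("device_type") == device_type:
            return config
    return None
```

A gateway would call something like find_device_config("/etc/gateway/devices", 1018) after detection; the directory path is a placeholder, not a real default.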

Key design decisions in this configuration:

compare: true means only transmit when the value changes — critical for reducing bandwidth on cellular connections
do_not_batch: true means send immediately rather than accumulating in a batch — used for status changes and alarms that need real-time delivery
interval defines the polling period in seconds: in the sample above the supply temperature is read every 5 seconds, while the compressor on/off status is read every second
ecount: 2 for floats means reading two consecutive 16-bit Modbus registers and combining them into an IEEE 754 float
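
Combining two registers into a float can be sketched as below. The word order is an assumption: this version treats the first register as the high word, but many devices send the low word first, so the device manual should be checked before relying on it.

```python
import struct

def registers_to_float(first: int, second: int, high_word_first: bool = True) -> float:
    """Combine two 16-bit Modbus registers into an IEEE 754 single-precision float."""
    high, low = (first, second) if high_word_first else (second, first)
    # Pack the two words big-endian, then reinterpret the 4 bytes as a float.
    return struct.unpack(">f", struct.pack(">HH", high, low))[0]

print(registers_to_float(0x41C8, 0x0000))                         # 25.0
print(registers_to_float(0x0000, 0x41C8, high_word_first=False))  # 25.0
```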

Handling Modbus Address Conventions

One of the trickiest aspects of Modbus auto-discovery is the address-to-function-code mapping. Different vendors use different conventions, but the most common maps addresses to function codes like this:

Address Range	Function Code	Register Type
0–65535	FC 1	Coils (read/write bits)
100000–165535	FC 2	Discrete Inputs (read-only bits)
300000–365535	FC 4	Input Registers (read-only 16-bit)
400000–465535	FC 3	Holding Registers (read/write 16-bit)

When you see a configured address of 400100, the gateway strips the prefix: the actual Modbus register address sent on the wire is 100, using function code 3.
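
The prefix stripping can be sketched as a small lookup. This assumes the six-digit convention in the table above, where the wire address is a 16-bit value (0 to 65535) and the prefix selects the function code; the function name is hypothetical.

```python
def split_address(address: int) -> tuple[int, int]:
    """Return (function_code, wire_address) for a prefixed Modbus address."""
    ranges = [
        (0, 1),        # coils
        (100000, 2),   # discrete inputs
        (300000, 4),   # input registers
        (400000, 3),   # holding registers
    ]
    for base, function_code in ranges:
        if base <= address <= base + 65535:
            return function_code, address - base
    raise ValueError(f"address {address} is outside the known ranges")

print(split_address(400100))  # (3, 100)
print(split_address(300800))  # (4, 800)
```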

Register Grouping Optimization

Smart gateways don't read one register at a time. They scan the sorted tag list and identify contiguous address ranges that share the same function code and polling interval. These get combined into a single Modbus read request:

Tags at addresses: 400100, 400101, 400102, 400103, 400104
→ Single request: FC3, start=100, count=5

But grouping has limits. The Modbus specification caps a single register read at 125 registers, and in practice exceeding ~50 registers per request risks timeouts, especially on Modbus RTU over slow serial links. And you can't group across function code boundaries: a tag at address 300050 (FC4) and one at 400050 (FC3) must be separate requests, even though both map to register offset 50.
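
A minimal sketch of the grouping pass, under two simplifying assumptions: every tag occupies exactly one register, and tags arrive as (function_code, wire_address, interval) tuples. Neither the tuple layout nor the function name comes from a specific gateway.

```python
def group_reads(tags, max_count=50):
    """Merge single-register tags into contiguous read requests.

    Returns a list of (function_code, start, count, interval) tuples.
    """
    groups = []
    # Sort so tags sharing a function code and interval sit next to each other.
    ordered = sorted({(fc, interval, addr) for fc, addr, interval in tags})
    for fc, interval, addr in ordered:
        if groups:
            g_fc, g_start, g_count, g_interval = groups[-1]
            contiguous = addr == g_start + g_count
            same_kind = (fc, interval) == (g_fc, g_interval)
            if same_kind and contiguous and g_count < max_count:
                groups[-1] = (g_fc, g_start, g_count + 1, g_interval)
                continue
        groups.append((fc, addr, 1, interval))
    return groups

tags = [(3, a, 5) for a in range(100, 105)] + [(4, 50, 5), (3, 50, 5)]
print(group_reads(tags))
# [(3, 50, 1, 5), (3, 100, 5, 5), (4, 50, 1, 5)]
```

Supporting multi-register tags such as floats would mean tracking each tag's register count when testing for contiguity, which this sketch leaves out.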

Multi-Protocol Detection: The Real-World Sequence

In practice, a gateway on a plant floor often needs to detect multiple devices simultaneously — a PLC on EtherNet/IP and a temperature control unit on Modbus RTU via RS-485. The detection sequence runs in parallel:

EtherNet/IP detection happens over the plant's Ethernet network — standard TCP/IP, fast, usually succeeds or fails within 2 seconds
Modbus TCP detection uses the same Ethernet interface but different port (502) — also fast
Modbus RTU detection happens over a serial port (/dev/ttyUSB0 or similar) — much slower, constrained by baud rate (typically 9600–115200), with byte timeouts around 50ms and response timeouts of 400ms

The serial link parameters are critical and often misconfigured:

Port: /dev/ttyUSB0
Baud Rate: 9600
Parity: None ('N')
Data Bits: 8
Stop Bits: 1
Slave Address: 1
Byte Timeout: 50ms
Response Timeout: 400ms

Getting the parity wrong is the #1 commissioning mistake with Modbus RTU. If the slave expects Even parity and the master sends None, every frame will be rejected silently — no error message, just timeouts.

Connection Resilience: The Watchdog Pattern

Discovery isn't a one-time event. Industrial connections drop — cables get unplugged during maintenance, PLCs get rebooted, network switches lose power. A robust gateway implements a multi-layer resilience strategy:

Link State Tracking: Every successful read sets the link state to "up." Any read error (timeout, connection reset, broken pipe, bad file descriptor) sets it to "down" and triggers a reconnection sequence.

Connection Error Counting: For EtherNet/IP, if you get three consecutive error-32 responses (no CIP route), stop hammering the network and wait for the next polling cycle. For Modbus, error codes like ETIMEDOUT, ECONNRESET, ECONNREFUSED, or EPIPE trigger a modbus_close() followed by reconnection on the next cycle.

Modbus Flush on Error: After a failed Modbus read, always flush the serial/TCP buffer before the next attempt. Stale response bytes from a partial read can corrupt subsequent responses.

Configuration Hot-Reload: The gateway watches its configuration files with stat(). If a file's modification time changes, it triggers a full re-initialization — destroy existing PLC tag handles, reload the JSON configuration, and re-establish all connections. This allows field engineers to update tag configurations without restarting the gateway service.
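
The hot-reload check can be sketched with os.stat(). The class name is hypothetical, and the re-initialization itself (destroying tag handles, reloading JSON, reconnecting) is left to the caller.

```python
import os

class ConfigWatcher:
    """Report when a configuration file's modification time changes."""

    def __init__(self, path: str):
        self.path = path
        self._last_mtime = os.stat(path).st_mtime_ns

    def changed(self) -> bool:
        """Return True once per modification; call this from the main loop."""
        mtime = os.stat(self.path).st_mtime_ns
        if mtime != self._last_mtime:
            self._last_mtime = mtime
            return True
        return False
```

Comparing for inequality rather than "newer than" also catches a file restored from a backup with an older timestamp.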

What machineCDN Brings to the Table

machineCDN's edge infrastructure handles this entire discovery and connection management lifecycle automatically. When you deploy a machineCDN gateway on the plant floor:

It auto-detects PLCs across EtherNet/IP and Modbus TCP/RTU simultaneously
It loads the correct device configuration from its library of supported equipment types
It manages connection resilience with automatic reconnection and buffer management
It optimizes Modbus reads by grouping contiguous registers and minimizing request count
Tag data flows through a batched delivery pipeline to the cloud, with store-and-forward buffering during connectivity gaps

For plant engineers, this means going from "cable plugged in" to "live data flowing" in minutes rather than days of manual register mapping.

Key Takeaways

Always try EtherNet/IP first — it's faster and provides richer device identity information than Modbus
Don't hardcode serial number locations — they vary wildly across equipment families, even from the same vendor
Validate serial numbers before accepting a device — zero year values indicate bad reads
Group Modbus reads by contiguous address and function code, but cap at 50 registers per request
Implement connection watchdogs — industrial networks are unreliable; your gateway must recover automatically
Flush after errors — stale buffer bytes from partial Modbus reads are the silent killer of data integrity

The device discovery problem isn't glamorous, but getting it right is what separates an IIoT platform that works in the lab from one that survives on a real plant floor.

Cloud Connection Watchdogs for IIoT Edge Gateways: Designing Self-Healing MQTT Pipelines [2026]

March 2, 2026 · 12 min read

The edge gateway powering your factory floor monitoring has exactly one job that matters: get data from PLCs to the cloud. Everything else — protocol translation, tag mapping, batch encoding — is just preparation for that moment when bits leave the gateway and travel to your cloud backend.

And that's exactly where things break. MQTT connections go stale. TLS certificates expire silently. Cloud endpoints restart for maintenance. Cellular modems drop carrier. The gateway's connection looks alive — the TCP socket is open, the MQTT client reports "connected" — but nothing is actually getting delivered.

This is the silent failure problem, and it kills more IIoT deployments than any protocol misconfiguration ever will. This guide covers how to design watchdog systems that detect, diagnose, and automatically recover from every flavor of connectivity failure.

Why MQTT Connections Fail Silently

To understand why watchdogs are necessary, you need to understand what MQTT's keep-alive mechanism does and — more importantly — what it doesn't do.

MQTT keep-alive is a bi-directional ping. The client sends a PINGREQ, the broker responds with PINGRESP. If the broker doesn't hear from the client within 1.5× the keep-alive interval, it considers the client dead and closes the network connection. If the client doesn't get a PINGRESP, it knows the connection is lost.

Sounds robust, right? Here's where it falls apart:

The Half-Open Connection Problem

TCP connections can enter a "half-open" state where one side thinks the connection is alive, but the other side has already dropped it. This happens when a NAT gateway times out the session, a cellular modem roams to a new tower, or a firewall silently drops the route. The MQTT client's operating system still shows the socket as ESTABLISHED. The keep-alive PINGREQ gets queued in the kernel's send buffer — and sits there, never actually reaching the wire.

The Zombie Session Problem

The gateway reconnects after an outage and gets a new TCP session, but the broker still has the old session's resources allocated. Depending on the clean session flag and broker implementation, you might end up with duplicate subscriptions, missed messages on the command channel, or a broker that refuses the new connection because the old client ID is still "active."

The Token Expiration Problem

Cloud IoT platforms (Azure IoT Hub, AWS IoT Core, Google Cloud IoT) use SAS tokens or JWT tokens for authentication. These tokens have expiration timestamps. When a token expires, the MQTT connection stays open until the next reconnection attempt — which then fails with an authentication error. If your reconnection logic doesn't refresh the token before retrying, you'll loop forever: connect → auth failure → reconnect → auth failure.

The Backpressure Problem

The MQTT client library reports "connected," publishes succeed (they return a message ID), but the broker is under load and takes 30 seconds to acknowledge the publish. Your QoS 1 messages pile up in the client's outbound queue. Eventually the client's memory is exhausted, publishes start failing, but the connection is technically alive.

Designing a Proper Watchdog

A production-grade edge watchdog doesn't just check "am I connected?" It monitors three independent health signals:

Signal 1: Connection State

Track the MQTT on_connect and on_disconnect callbacks. Maintain a state machine:

States:
  DISCONNECTED → CONNECTING → CONNECTED → DISCONNECTING → DISCONNECTED
  
Transitions:
  DISCONNECTED + config_available → CONNECTING (initiate async connect)
  CONNECTING + on_connect(status=0) → CONNECTED
  CONNECTING + on_connect(status≠0) → DISCONNECTED (log error, wait backoff)
  CONNECTED + on_disconnect → DISCONNECTING → DISCONNECTED
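
The transition list above maps naturally onto a lookup table. In this sketch the event names are made up for illustration, and unknown events leave the state unchanged, which is a design assumption rather than something the list specifies.

```python
from enum import Enum, auto

class State(Enum):
    DISCONNECTED = auto()
    CONNECTING = auto()
    CONNECTED = auto()
    DISCONNECTING = auto()

# (current state, event) -> next state, mirroring the transition list.
TRANSITIONS = {
    (State.DISCONNECTED, "config_available"): State.CONNECTING,
    (State.CONNECTING, "connect_ok"): State.CONNECTED,
    (State.CONNECTING, "connect_failed"): State.DISCONNECTED,
    (State.CONNECTED, "disconnect"): State.DISCONNECTING,
    (State.DISCONNECTING, "closed"): State.DISCONNECTED,
}

def next_state(state: State, event: str) -> State:
    """Return the next state, ignoring events that are invalid in this state."""
    return TRANSITIONS.get((state, event), state)
```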

The key detail: initiate MQTT connections asynchronously in a dedicated thread. A blocking mqtt_connect() call in the main data collection loop will halt PLC reads during the TCP handshake, which on a cellular link with a 2-second RTT means at least 2 seconds of missed data, and more once TLS negotiation is added. Use a semaphore or signal to coordinate: the connection thread posts "I'm ready" when it finishes, and the main loop picks it up on the next cycle.

Signal 2: Delivery Confirmation

This is the critical signal that catches silent failures. Track the timestamp of the last successfully delivered message (acknowledged by the broker, not just sent by the client).

For QoS 1: the on_publish callback fires when the broker acknowledges receipt with a PUBACK. Record this timestamp every time it fires.

Last Delivery Tracking:
  on_publish(packet_id) → last_delivery_timestamp = now()

Watchdog Check (every main loop cycle):
  if (now() - last_delivery_timestamp > WATCHDOG_TIMEOUT):
    trigger_reconnection()
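
A minimal Python version of this check, assuming the MQTT client library calls on_publish once the broker's PUBACK arrives. The class name is hypothetical, and the injectable clock exists only so the logic can be tested without waiting.

```python
import time

class DeliveryWatchdog:
    """Track the last acknowledged publish and flag a stalled pipeline."""

    def __init__(self, timeout_s: float, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock
        self._last_delivery = clock()

    def on_publish(self, packet_id: int) -> None:
        # Wire this to the client library's publish-acknowledged callback.
        self._last_delivery = self._clock()

    def expired(self) -> bool:
        """True when no delivery has been confirmed within the timeout."""
        return self._clock() - self._last_delivery > self.timeout_s
```

Using a monotonic clock keeps the check immune to wall-clock jumps, for example when NTP corrects the gateway's time after boot.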

What's the right watchdog timeout? It depends on your data rate:

Data Rate	Suggested Timeout	Rationale
Every 1s	30–60s	30–60 missed deliveries before alert
Every 5s	60–120s	12–24 missed deliveries
Every 30s	120–300s	4–10 missed deliveries

The timeout should be significantly longer than your maximum expected inter-delivery interval. If your batch timeout is 30 seconds, a 120-second watchdog timeout gives you 4 batch cycles of tolerance before concluding something is wrong.

Signal 3: Token/Certificate Validity

Before attempting reconnection, check the authentication material:

Token Check:
  if (token_expiration_timestamp ≠ 0):
    if (current_time > token_expiration_timestamp):
      log("WARNING: Cloud auth token may be expired")
    else:
      log("Token valid until {expiration_time}")

If your deployment uses SAS tokens with expiration timestamps, parse the se= (signature expiry) parameter from the connection string at startup. Log a warning when the token is approaching expiry. Some platforms provide token refresh mechanisms; others require a redeployment. Either way, knowing the token is expired before the first reconnection attempt saves you from debugging phantom connection failures at 3 AM.
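
Extracting the expiry can be sketched as follows, assuming an Azure-style token of the form "SharedAccessSignature sr=...&sig=...&se=<unix seconds>". The function names are hypothetical, and the token in the example is made up.

```python
from typing import Optional
from urllib.parse import parse_qs

def sas_token_expiry(token: str) -> Optional[int]:
    """Return the se (expiry, Unix seconds) field of a SAS token, or None."""
    # Drop the leading "SharedAccessSignature " label if present.
    _, _, params = token.partition(" ")
    values = parse_qs(params or token).get("se")
    if values and values[0].isdigit():
        return int(values[0])
    return None

def token_expired(token: str, now: float) -> bool:
    """True only when an expiry is present and already in the past."""
    expiry = sas_token_expiry(token)
    return expiry is not None and now > expiry

example = ("SharedAccessSignature sr=example-hub.azure-devices.net%2Fdevices%2Fgw-01"
           "&sig=abc123&se=1767225600")
print(sas_token_expiry(example))  # 1767225600
```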

Buffer-Aware Recovery: Don't Lose Data During Outages

The watchdog triggers a reconnection. But what happens to the data that was collected while the connection was down?

This is where most IIoT platforms quietly drop data. The naïve approach: if the MQTT publish call fails, discard the message and move on. This means any network outage, no matter how brief, creates a permanent gap in your historical data.

A proper store-and-forward buffer works like this:

Page-Based Buffer Architecture

Instead of a simple FIFO queue, divide a fixed memory region into pages. Each page holds multiple messages packed sequentially. Three page lists manage the lifecycle:

Free Pages: Empty, available for new data
Work Page: Currently being filled with new messages
Used Pages: Full pages waiting for delivery

Data Flow:
  PLC Read → Batch Encoder → Work Page (append)
  Work Page Full → Move to Used Pages queue
  
  MQTT Connected:
    Used Pages front → Send first message → Wait for PUBACK
    PUBACK received → Advance read pointer
    Page fully delivered → Move to Free Pages
    
  MQTT Disconnected:
    Used Pages continue accumulating
    Work Page continues filling
    If Free Pages exhausted → Reclaim oldest Used Page (overflow warning)

Why Pages, Not Individual Messages

Individual message queuing has per-message overhead that becomes significant at high data rates: pointer storage, allocation/deallocation, fragmentation. A page-based buffer pre-allocates a contiguous memory block (typically 1–2 MB on embedded edge hardware) and manages it as fixed-size pages. No dynamic allocation after startup. No fragmentation. Predictable memory footprint.

The overflow behavior is also better. When the buffer is full and the connection is still down, you sacrifice the oldest complete page — losing, say, 60 seconds of data from 10 minutes ago rather than randomly dropping individual messages from different time periods. The resulting data gap is clean and contiguous, which is much easier for downstream analytics to handle than scattered missing points.
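
The page lifecycle can be modelled in a short sketch. This version only illustrates how pages move between the free, work, and used lists and how overflow reclaims the oldest page; it omits message framing, per-message PUBACK tracking, locking, and true fixed-size preallocation, and all names are hypothetical.

```python
from collections import deque

class PageBuffer:
    """Toy model of a page-based store-and-forward buffer."""

    def __init__(self, page_count: int, page_size: int):
        if page_count < 2:
            raise ValueError("need at least two pages")
        self.page_size = page_size
        self.free = deque(bytearray() for _ in range(page_count - 1))
        self.work = bytearray()   # page currently being filled
        self.used = deque()       # full pages awaiting delivery, oldest first
        self.overflows = 0

    def append(self, message: bytes) -> None:
        if len(message) > self.page_size:
            raise ValueError("message larger than a page")
        if len(self.work) + len(message) > self.page_size:
            self._rotate_work_page()
        self.work += message

    def _rotate_work_page(self) -> None:
        self.used.append(self.work)
        if self.free:
            self.work = self.free.popleft()
        else:
            # No free pages left: sacrifice the oldest undelivered page.
            self.work = self.used.popleft()
            self.work.clear()
            self.overflows += 1

    def pop_delivered_page(self) -> bytes:
        """Call once the oldest used page is fully acknowledged."""
        page = self.used.popleft()
        data = bytes(page)
        page.clear()
        self.free.append(page)
        return data
```

Counting overflows gives the status report a direct measure of how often data was sacrificed.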

Disconnect Recovery Sequence

When the MQTT on_disconnect callback fires:

Mark connection as down immediately — the buffer stops trying to send
Reset "packet in flight" flag — the pending PUBACK will never arrive
Continue accepting data from PLC reads into the buffer
Do NOT flush or clear the buffer — all unsent data stays queued

When on_connect fires after reconnection:

Mark connection as up
Begin draining Used Pages from the front of the queue
Send first queued message, wait for PUBACK, then send next
Simultaneously accept new data into the Work Page

This "catch-up" phase is important to handle correctly. New real-time data is still flowing into the buffer while old data is being drained. The buffer must handle concurrent writes (from the PLC reading thread) and reads (for MQTT delivery) safely. Mutex protection on the page list operations is essential.

Async Connection Threads: The Pattern That Saves You

Network operations block. DNS resolution blocks. TCP handshakes block. TLS negotiation blocks. On a cellular connection with packet loss, a single connection attempt can take 5–30 seconds.

If your edge gateway has a single thread doing both PLC reads and MQTT connections, that's 5–30 seconds of missed PLC data every time the connection drops. For an injection molding machine with a 15-second cycle, you could miss an entire shot.

The solution is a dedicated connection thread:

Main Thread:
  loop:
    read_plc_tags()
    encode_and_buffer()
    dispatch_command_queue()
    check_watchdog()
    if watchdog_triggered:
      post_job_to_connection_thread()
    sleep(1s)

Connection Thread:
  loop:
    wait_for_job()  // blocks on semaphore
    destroy_old_connection()
    create_new_mqtt_client()
    configure_tls()
    set_callbacks()
    mqtt_connect_async(host, port)
    signal_job_complete()  // post semaphore

Two semaphores coordinate this:

Job semaphore: Main thread posts to trigger reconnection, connection thread waits on it
Completion semaphore: Connection thread posts when done, main thread checks (non-blocking) before posting next job

Critical detail: check that the connection thread isn't already running before posting a new job. If the main thread fires the watchdog timeout every 120 seconds but the last reconnection attempt is still in progress (stuck in a TLS handshake that has outlasted the timeout), you'll get overlapping connection attempts that corrupt the MQTT client state.
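
The two-semaphore handshake can be sketched with Python's threading module. The connect callable stands in for the destroy, recreate, and connect steps and is an assumption, as is the class name. The completion semaphore starts at 1 so the first job can be posted.

```python
import threading

class ReconnectWorker:
    """Run reconnection attempts on a dedicated thread, one at a time."""

    def __init__(self, connect):
        self._connect = connect                 # hypothetical blocking reconnect routine
        self._job = threading.Semaphore(0)      # main thread posts, worker waits
        self._done = threading.Semaphore(1)     # worker posts when idle again
        threading.Thread(target=self._run, daemon=True).start()

    def request_reconnect(self) -> bool:
        """Post a job unless one is still in flight. Never blocks the caller."""
        if not self._done.acquire(blocking=False):
            return False                        # previous attempt still running
        self._job.release()
        return True

    def _run(self) -> None:
        while True:
            self._job.acquire()                 # wait for a job
            try:
                self._connect()
            finally:
                self._done.release()            # signal completion
```

The main loop calls request_reconnect() when the watchdog fires; a False return means the previous attempt is still running and the request is simply skipped.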

Reconnection Backoff Strategy

When the cloud endpoint is genuinely down (maintenance window, region outage), aggressive reconnection attempts waste cellular data and CPU cycles. But when it's a transient network glitch, you want to reconnect immediately.

The right approach combines fixed-interval reconnect with watchdog escalation:

Reconnect Timing:
  Attempt 1: Immediate (transient glitch)
  Attempt 2: 5 seconds
  Attempt 3 and later: 5 seconds (the interval stays constant at 5s)
  
  Watchdog escalation:
    if no successful delivery in 120 seconds despite "connected" state:
      force full reconnection (destroy + recreate client)

Why not exponential backoff? In industrial settings, the most common failure mode is a brief network interruption — a cell tower handoff, a router reboot, a firewall session timeout. These resolve in 5–15 seconds. Exponential backoff would delay your reconnection to 30s, 60s, 120s, 240s... meaning you could be offline for 4+ minutes after a 2-second glitch. Constant 5-second retry with watchdog escalation provides faster recovery for the common case while still preventing connection storms during genuine outages.
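
The arithmetic behind that comparison, as a small sketch. The 30-second starting point for the exponential schedule is taken from the example figures above; the function names are hypothetical.

```python
def constant_schedule(attempts: int, retry_s: float = 5.0) -> list[float]:
    """Delay before each attempt: immediate first, then a fixed interval."""
    return [0.0 if n == 0 else retry_s for n in range(attempts)]

def exponential_schedule(attempts: int, first_s: float = 30.0) -> list[float]:
    """Delay before each attempt when the wait doubles every time."""
    return [first_s * 2 ** n for n in range(attempts)]

print(constant_schedule(4))          # [0.0, 5.0, 5.0, 5.0]
print(sum(constant_schedule(4)))     # 15.0
print(exponential_schedule(4))       # [30.0, 60.0, 120.0, 240.0]
print(sum(exponential_schedule(4)))  # 450.0
```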

Device Status Broadcasting

Your edge gateway should periodically broadcast its own health status via MQTT. This serves two purposes: it validates the delivery pipeline end-to-end, and it gives the cloud platform visibility into the gateway fleet's health.

A well-designed status message includes:

System uptime (OS level — how long since last reboot)
Daemon uptime (application level — how long since last restart)
Connected device inventory (PLC types, serial numbers, link states)
Token expiration timestamp (proactive alerting for credential rotation)
Buffer utilization (how close to overflow)
Software version + build hash (for fleet management and OTA targeting)
Per-device tag counts and last-read timestamps (stale data detection)

Send a compact status on every connection establishment, and a detailed status periodically (every 5–10 minutes). The compact status acts as a "birth certificate" — the cloud platform immediately knows which gateway just came online and what equipment it's managing.

Real-World Failure Scenarios and How the Watchdog Handles Them

Scenario 1: Cellular Modem Roaming

Symptom: TCP connection goes half-open. MQTT client thinks it's connected. Publishes queue up in OS buffer.
Detection: Watchdog timeout fires: no PUBACK received in 120 seconds despite continuous publishes.
Recovery: Force reconnection. Buffer holds all unsent data. Reconnect on new cell tower, drain buffer.
Data loss: Zero (buffer sized for 2-minute outage).

Scenario 2: Cloud Platform Maintenance Window

Symptom: MQTT broker goes offline. Client receives disconnection callback.
Detection: Immediate: on_disconnect fires.
Recovery: 5-second reconnect attempts. Buffer accumulates data. Connection succeeds when maintenance ends.
Data loss: Zero if maintenance window is shorter than buffer capacity (typically 10–30 minutes at normal data rates).

Scenario 3: SAS Token Expiration

Symptom: Connection drops. Reconnection attempts fail with authentication error.
Detection: Watchdog notices repeated connection failures. Token timestamp check confirms expiration.
Recovery: Log critical alert. Wait for token refresh (manual or automated). Reconnect with new token.
Data loss: Depends on token refresh time. Buffer provides bridge.

Scenario 4: PLC Goes Offline

Symptom: Tag reads start returning errors. Gateway loses link state to PLC.
Detection: Link state monitoring fires immediately. Error delivered to cloud as a priority (unbatched) event.
Recovery: Gateway continues attempting PLC reads. When PLC comes back, link state restored, reads resume.
MQTT impact: None. The cloud connection is independent of PLC connections. Both failures are handled by separate watchdog systems.

Monitoring Your Watchdog (Yes, You Need to Watch the Watcher)

The watchdog itself needs observability:

Log every watchdog trigger with reason (no PUBACK, connection timeout, token expiry)
Count reconnection attempts per hour — a spike indicates infrastructure instability
Track buffer high-water marks — if the buffer repeatedly approaches capacity, your connectivity is too unreliable for the data rate
Alert on repeated authentication failures — this is almost always a credential rotation issue

Platforms like machineCDN build this entire watchdog system into the edge agent — monitoring cloud connections, managing store-and-forward buffers, handling reconnection with awareness of both the MQTT transport state and the buffer delivery state. The result is a self-healing data pipeline where network outages create brief delays in cloud delivery but never cause data loss.

Implementation Checklist

Before deploying your edge gateway to production, verify:

The unsexy truth about industrial IoT reliability is that it's not about the protocol choice or the cloud platform. It's about what happens in the 120 seconds after your connection drops. Get the watchdog right, and a 10-minute network outage is invisible to your operators. Get it wrong, and a 2-second glitch creates a permanent hole in your production data.

Build the self-healing pipeline. Your 3 AM self will thank you.

Modbus Address Conventions and Function Codes: The Practical Guide Every IIoT Engineer Needs [2026]

March 2, 2026 · 11 min read

If you've ever stared at a PLC register map wondering why address 300001 means something completely different from 400001, or why your edge gateway reads all zeros from a register that should contain temperature data — this guide is for you.

Modbus has been the lingua franca of industrial automation for nearly five decades. Its longevity comes from simplicity, but that simplicity hides a handful of conventions that trip up even experienced engineers. The addressing scheme and its relationship to function codes is the single most important concept to nail before you write a single line of polling logic.

Let's break it apart.

OPC-UA Subscriptions and Monitored Items: Engineering Low-Latency Data Pipelines for Manufacturing [2026]

March 2, 2026 · 10 min read

If you've worked with industrial protocols long enough, you know there are exactly two categories of data delivery: polling (you ask, the device answers) and subscriptions (the device tells you when something changes). OPC-UA's subscription model is one of the most sophisticated data delivery mechanisms in industrial automation — and one of the most frequently misconfigured.

This guide covers how OPC-UA subscriptions actually work at the wire level, how to configure monitored items for different manufacturing scenarios, and the real-world performance tradeoffs that separate a responsive factory dashboard from one that lags behind reality by minutes.

How OPC-UA Subscriptions Differ from Polling

In a traditional Modbus or EtherNet/IP setup, the client polls registers on a fixed interval — every 1 second, every 5 seconds, whatever the configuration says. This is simple and predictable, but it has fundamental limitations:

Wasted bandwidth: If a temperature value hasn't changed in 30 minutes, you're still reading it every second
Missed transients: If a pressure spike occurs between poll cycles, you'll never see it
Scaling problems: With 500 tags across 20 PLCs, fixed-interval polling creates predictable network congestion waves

OPC-UA subscriptions flip this model. Instead of the client pulling data, the server monitors values internally and notifies the client only when something meaningful changes. The key word is "meaningful" — and that's where the engineering gets interesting.

The Three Layers of OPC-UA Subscriptions

An OPC-UA subscription isn't a single thing. It's three nested concepts that work together:

1. The Subscription Object

A subscription is a container that defines the publishing interval — how often the server checks its monitored items and bundles any pending notifications into a single message. Think of it as the heartbeat of the data pipeline.

Publishing Interval: 500ms
Max Keep-Alive Count: 10
Max Notifications Per Publish: 0 (unlimited)
Priority: 100

The publishing interval is NOT the sampling interval. This is a critical distinction. The publishing interval only controls how often notifications are bundled and sent to the client. A 500ms publishing interval with a 100ms sampling interval means values are sampled 5 times per publish cycle.

2. Monitored Items

Each variable you want to track becomes a monitored item within a subscription. This is where the real configuration lives:

Sampling Interval: How often the server reads the underlying data source (PLC register, sensor, calculated value)
Queue Size: How many value changes to buffer between publish cycles
Discard Policy: When the queue overflows, do you keep the oldest or newest values?
Filter: What constitutes a "change" worth reporting?

3. Filters (Deadbands)

Filters determine when a monitored item's value has changed "enough" to warrant a notification. There are two types:

Absolute Deadband: Value must change by at least X units (e.g., temperature must change by 0.5°F)
Percent Deadband: Value must change by X% of its engineering range

Without a deadband filter, you'll get notifications for every single floating-point fluctuation — including ADC noise that makes a temperature reading bounce between 72.001°F and 72.003°F. That's not useful data. That's noise masquerading as signal.
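
The deadband test itself is a one-line comparison against the last reported value. This sketch uses a strict greater-than comparison, as the OPC-UA specification does for deadband filters; the class and function names are hypothetical.

```python
class DeadbandFilter:
    """Absolute deadband: report a value only if it moved far enough."""

    def __init__(self, deadband: float):
        self.deadband = deadband
        self.last_reported = None

    def offer(self, value: float) -> bool:
        """Return True if this sample should generate a notification."""
        if self.last_reported is None or abs(value - self.last_reported) > self.deadband:
            self.last_reported = value
            return True
        return False

def percent_to_absolute(percent: float, eu_low: float, eu_high: float) -> float:
    """Convert a percent deadband into units using the engineering range."""
    return (percent / 100.0) * (eu_high - eu_low)

f = DeadbandFilter(0.5)
print([f.offer(v) for v in [72.001, 72.003, 72.4, 72.6, 72.7]])
# [True, False, False, True, False]
```

Note that the comparison is against the last reported value, not the previous sample, so slow drift still triggers a notification once it accumulates past the deadband.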

Practical Configuration Patterns

Pattern 1: Critical Alarms (Boolean State Changes)

For alarm bits — compressor faults, pressure switch trips, flow switch states — you want immediate notification with zero tolerance for missed events.

Subscription:
  Publishing Interval: 250ms

Monitored Item (alarm_active):
  Sampling Interval: 100ms
  Queue Size: 10
  Discard Policy: DiscardOldest
  Filter: None (report every change)

Why a queue size of 10? Because boolean alarm bits can toggle rapidly during fault conditions. A compressor might fault, reset, and fault again within a single publish cycle. Without a queue, you'd only see the final state. With a queue, you see the full sequence — which is critical for root cause analysis.
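
The effect of queue size can be simulated with a bounded deque, which drops its oldest entry when full and so behaves like a DiscardOldest queue. This is a toy model of one publish cycle, not server code, and the function name is hypothetical.

```python
from collections import deque

def publish_cycle(changes, queue_size: int) -> list:
    """Return the values a client would receive for one publish cycle."""
    queue = deque(maxlen=queue_size)  # a full deque discards its oldest item
    for value in changes:
        queue.append(value)
    return list(queue)

# Fault, reset, fault again within a single publish cycle.
toggles = [True, False, True]
print(publish_cycle(toggles, queue_size=10))  # [True, False, True]
print(publish_cycle(toggles, queue_size=1))   # [True]
```

With a queue size of 1 the client sees only the final state, which hides the intermediate reset.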

Pattern 2: Process Temperatures (Slow-Moving Analog)

Chiller outlet temperature, barrel zone temps, coolant temperatures — these change gradually and generate enormous amounts of redundant data without deadbanding.

Subscription:
  Publishing Interval: 1000ms

Monitored Item (chiller_outlet_temp):
  Sampling Interval: 500ms
  Queue Size: 5
  Discard Policy: DiscardOldest
  Filter: AbsoluteDeadband(0.5)  // °F

A 0.5°F deadband means you won't get notifications from ADC noise, but you will catch meaningful process drift. At a 500ms sampling rate, the server checks the value twice per publish cycle, ensuring you don't miss a rapid temperature swing even with the coarser publishing interval.

Pattern 3: High-Frequency Production Counters

Cycle counts, part counts, shot counters — these increment continuously during production and need efficient handling.

Subscription:
  Publishing Interval: 5000ms

Monitored Item (cycle_count):
  Sampling Interval: 1000ms
  Queue Size: 1
  Discard Policy: DiscardOldest
  Filter: None

Queue size of 1 is intentional here. You only care about the latest count value — intermediate values are meaningless because the counter only goes up. A 5-second publishing interval means you update dashboards at a reasonable rate without flooding the network with every single increment.

Pattern 4: Energy Metering (Cumulative Registers)

Power consumption registers accumulate continuously. The challenge is capturing the delta accurately without drowning in data.

Subscription:
  Publishing Interval: 60000ms (1 minute)

Monitored Item (energy_kwh):
  Sampling Interval: 10000ms
  Queue Size: 1
  Discard Policy: DiscardOldest
  Filter: PercentDeadband(1.0)  // 1% of range

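For energy data, minute-level resolution is typically sufficient for cost allocation and ESG reporting. The percent deadband prevents notifications from meter jitter while still capturing real consumption changes. Keep in mind that a percent deadband depends on the node exposing an EURange; if your meter's cumulative register doesn't define one, use an absolute deadband in kWh instead.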

Queue Management: The Hidden Performance Killer

Here's what most OPC-UA deployments get wrong: they set queue sizes too small and wonder why their historical data has gaps.

Consider what happens during a network hiccup. The subscription's publish cycle fires, but the client is temporarily unreachable. The server holds notifications in the subscription's retransmission queue for a configurable number of keep-alive cycles. But the monitored item queue is independent — it continues filling with new samples.

If your monitored item queue size is 1 and the network is down for 10 seconds at a 100ms sampling rate, the server takes 100 samples and keeps only the most recent one. When the connection recovers, you get exactly one value. The history is gone.

Rule of thumb: Set the queue size to at least (expected_max_outage_seconds × 1000) / sampling_interval_ms for any tag where you can't afford data gaps.

For a process that needs 30-second outage tolerance at 500ms sampling:

Queue Size = (30 × 1000) / 500 = 60

That's 60 entries per monitored item. Multiply by your tag count and you'll understand why OPC-UA server memory sizing matters.
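The rule of thumb is easy to turn into a sizing helper. A minimal sketch, assuming you round up to a whole entry; the function names are hypothetical and the tag count in the usage line is only an example.

```python
import math


def required_queue_size(outage_seconds: float, sampling_interval_ms: float) -> int:
    """Queue entries needed to ride out an outage without discarding samples."""
    return max(1, math.ceil(outage_seconds * 1000 / sampling_interval_ms))


def total_queue_entries(tag_count: int, outage_seconds: float,
                        sampling_interval_ms: float) -> int:
    """Entries the server must hold across all tags that share these settings."""
    return tag_count * required_queue_size(outage_seconds, sampling_interval_ms)


print(required_queue_size(30, 500))        # the 30-second, 500ms example above
print(total_queue_entries(500, 30, 500))   # same settings applied to 500 tags
```

Multiplying the total by the per-entry size on your particular server gives a rough memory budget; that per-entry size is implementation specific, so it is left out here.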

Sampling Interval vs. Publishing Interval: Getting the Ratio Right

The relationship between sampling interval and publishing interval determines your system's behavior:

Ratio	Behavior	Use Case
Sampling = Publishing	Sample once, publish once	Simple monitoring, low bandwidth
Sampling < Publishing	Multiple samples per publish, deadband filtering effective	Process control, drift detection
Sampling << Publishing	High-resolution capture, batched delivery	Vibration, power quality

Anti-pattern: Setting sampling interval to 0 (fastest possible). This tells the server to sample at its maximum rate, which on some implementations means every scan cycle of the underlying PLC. A Siemens S7-1500 scanning at 1ms will generate 1,000 samples per second per tag. With 200 tags, that's 200,000 data points per second — most of which are identical to the previous value.

Better approach: Match the sampling interval to the physical process dynamics. A barrel heater zone that takes 30 seconds to change 1°F doesn't need 10ms sampling. A pneumatic valve that opens in 50ms does.
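The arithmetic behind the anti-pattern is worth keeping handy when reviewing a tag list. A small sketch (the helper name is made up) that computes the raw sample rate a given configuration asks the server to sustain:

```python
def samples_per_second(tag_count: int, sampling_interval_ms: float) -> float:
    """Raw samples per second the server must take for a group of tags."""
    return tag_count * 1000 / sampling_interval_ms


print(samples_per_second(200, 1))      # 200 tags sampled every 1ms scan
print(samples_per_second(200, 500))    # the same tags at 500ms
```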

Subscription Diagnostics and Health Monitoring

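OPC-UA provides built-in diagnostics that most deployments ignore. The names below are shorthand; the standard SubscriptionDiagnosticsDataType uses names such as NotificationsCount, RepublishRequestCount, TransferRequestCount, and MonitoringQueueOverflowCount, and servers vary in which counters they expose: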

Subscription-Level Counters

NotificationCount: Total notifications sent since subscription creation
PublishRequestCount: How many publish requests the client has outstanding
RepublishCount: How many times the server had to retransmit (indicates network issues)
TransferredCount: Subscriptions transferred between sessions (cluster failover)

Monitored Item Counters

SamplingCount: How many times the item was sampled
QueueOverflowCount: How many values were discarded due to full queues — this is your canary
FilteredCount: How many samples were suppressed by deadband filters

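If QueueOverflowCount is climbing, your queue is too small for the sampling rate and publish interval combination. If FilteredCount stays near SamplingCount while you know the process is moving, your deadband is probably too aggressive and is suppressing real data. On a genuinely stable process, a high filtered ratio is exactly what you want.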
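A periodic health check over these counters might look like the sketch below. It assumes you have already read the counters into a plain object by whatever means your server offers; the dataclass, field names, warning strings, and the 0.99 ratio limit are all illustrative choices, not part of any standard.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ItemCounters:
    """Snapshot of per-item counters (hypothetical container, names mirror the text)."""
    sampling_count: int
    queue_overflow_count: int
    filtered_count: int


def assess(prev: ItemCounters, curr: ItemCounters,
           filtered_ratio_limit: float = 0.99) -> List[str]:
    """Compare two snapshots and return warnings for the interval between them."""
    warnings: List[str] = []
    if curr.queue_overflow_count > prev.queue_overflow_count:
        warnings.append("queue_overflow")            # values were discarded
    sampled = curr.sampling_count - prev.sampling_count
    filtered = curr.filtered_count - prev.filtered_count
    if sampled > 0 and filtered / sampled >= filtered_ratio_limit:
        warnings.append("deadband_may_be_too_wide")  # almost nothing got through
    return warnings


before = ItemCounters(sampling_count=1000, queue_overflow_count=0, filtered_count=500)
after = ItemCounters(sampling_count=2000, queue_overflow_count=3, filtered_count=1495)
print(assess(before, after))
```

Working on deltas between snapshots, rather than lifetime totals, keeps an old incident from masking or exaggerating what is happening now.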

How This Compares to Change-Based Polling in Other Protocols

OPC-UA subscriptions aren't the only way to get change-driven data from PLCs. In practice, many IIoT platforms — including machineCDN — implement intelligent change detection at the edge, regardless of the underlying protocol.

The pattern works like this: the edge gateway reads register values on a schedule, compares them to the previously read values, and only transmits data upstream when a meaningful change occurs. Critical state changes (alarms, link state transitions) bypass batching entirely and are sent immediately. Analog values are batched on configurable intervals and compared using value-based thresholds.

This approach brings subscription-like efficiency to protocols that don't natively support it (Modbus, older EtherNet/IP devices). The tradeoff is latency — you're still polling, so maximum detection latency equals your polling interval. But for processes where sub-second change detection isn't required, it's remarkably effective and dramatically reduces cloud ingestion costs.
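The edge-side pattern described above can be sketched as a single comparison pass per poll. This is a generic illustration, not machineCDN's implementation; the function and parameter names are hypothetical. One design assumption is labeled in the code: values are compared against the last transmitted value rather than the previous poll, so slow drift is not lost.

```python
from typing import Dict, Set, Tuple


def detect_changes(last_sent: Dict[str, float], current: Dict[str, float],
                   thresholds: Dict[str, float],
                   critical: Set[str]) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Split one poll result into (send immediately, add to next batch)."""
    immediate: Dict[str, float] = {}
    batched: Dict[str, float] = {}
    for tag, value in current.items():
        old = last_sent.get(tag)
        if tag in critical:
            # Alarm and link-state tags: any change bypasses batching.
            if old is None or value != old:
                immediate[tag] = value
        else:
            # Analog tags: report only when the change exceeds the tag's threshold.
            # Assumption: baseline is the last value sent upstream, not the last poll.
            if old is None or abs(value - old) > thresholds.get(tag, 0.0):
                batched[tag] = value
    return immediate, batched


last_sent = {"alarm_active": False, "chiller_outlet_temp": 72.0}
poll = {"alarm_active": True, "chiller_outlet_temp": 72.2}
now, later = detect_changes(last_sent, poll,
                            thresholds={"chiller_outlet_temp": 0.5},
                            critical={"alarm_active"})
print(now, later)
last_sent.update(now)    # remember what was actually transmitted
last_sent.update(later)
```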

Real-World Performance Numbers

From production deployments across plastics, packaging, and discrete manufacturing:

Configuration	Tags	Bandwidth	Update Latency
Fixed 1s polling, no filtering	500	2.1 Mbps	1s
OPC-UA subscriptions, 500ms publish, deadband	500	180 Kbps	250ms–500ms
Edge change detection + batching	500	95 Kbps	1s–5s (configurable)
OPC-UA subs + edge batching combined	500	45 Kbps	500ms–5s (priority dependent)

The bandwidth savings from proper subscription configuration are typically 10–20x compared to naive polling. Combined with edge-side batching for cloud delivery, you can achieve 40–50x reduction — which matters enormously on cellular connections at remote facilities.
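The reduction factors follow directly from the table. A quick check, taking the fixed 1s polling row as the baseline and reading 2.1 Mbps as 2,100 kbps (the variable names are just for this example):

```python
BASELINE_KBPS = 2100  # fixed 1s polling, no filtering

configs_kbps = {
    "OPC-UA subscriptions + deadband": 180,
    "Edge change detection + batching": 95,
    "OPC-UA subs + edge batching": 45,
}


def reduction_factor(baseline_kbps: float, kbps: float) -> float:
    """How many times less bandwidth a configuration uses than the baseline."""
    return baseline_kbps / kbps


for name, kbps in configs_kbps.items():
    print(f"{name}: {reduction_factor(BASELINE_KBPS, kbps):.1f}x")
```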

Common Pitfalls

1. Ignoring the Revised Sampling Interval

When you request a sampling interval, the server may revise it to a supported value. Always check the response — if you asked for 100ms and the server gave you 1000ms, your entire timing assumption is wrong.
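One way to make this check routine is to compare the requested and revised values every time a monitored item is created. The sketch below assumes you already have both numbers from the create response; the function name and the 10% tolerance are arbitrary choices for illustration.

```python
from typing import Optional


def check_revised_interval(requested_ms: float, revised_ms: float,
                           tolerance: float = 0.10) -> Optional[str]:
    """Return a warning when the server's revised interval strays from the request."""
    if requested_ms <= 0:
        # A request of 0 or below leaves the choice to the server, so nothing to compare.
        return None
    deviation = abs(revised_ms - requested_ms) / requested_ms
    if deviation > tolerance:
        return f"requested {requested_ms:g} ms, server revised to {revised_ms:g} ms"
    return None


print(check_revised_interval(100, 1000))   # warning text
print(check_revised_interval(100, 100))    # None
```

If a warning fires, revisit any queue sizes and deadbands that were derived from the requested interval.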

2. Too Many Subscriptions

Each subscription has overhead: keep-alive traffic, retransmission buffers, and a dedicated publish thread on some implementations. Don't create one subscription per tag — group tags by priority class and use 3–5 subscriptions total.

3. Forgetting Lifetime Count

The subscription's lifetime count determines how many publish cycles can pass without a successful client response before the server kills the subscription. On unreliable networks, set this high enough to survive outages without losing your subscription state.
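Lifetime count is expressed in publishing intervals, so the outage you want to survive converts to a count by simple division. A minimal sketch with a hypothetical helper name; it also applies the OPC-UA requirement that the lifetime count be at least three times the keep-alive count.

```python
import math


def lifetime_count_for_outage(outage_seconds: float, publishing_interval_ms: float,
                              keepalive_count: int) -> int:
    """Smallest lifetime count that covers the outage and respects the 3x keep-alive floor."""
    needed = math.ceil(outage_seconds * 1000 / publishing_interval_ms)
    return max(needed, 3 * keepalive_count)


# Survive a 60-second outage on a 1000ms publishing interval, keep-alive count of 10.
print(lifetime_count_for_outage(60, 1000, 10))
```

As with sampling intervals, the server may revise the value you request, so read back what it actually granted.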

4. Not Monitoring Queue Overflows

If you're not checking QueueOverflowCount, you have no idea whether you're losing data. This is especially insidious because everything looks fine on your dashboard — you just have invisible gaps in your history.

Wrapping Up

OPC-UA subscriptions are among the most capable data delivery mechanisms in industrial automation today, but capability without proper configuration is just complexity. The fundamentals come down to:

Match sampling intervals to process dynamics, not to what feels fast enough
Use deadbands aggressively on analog values — noise isn't data
Size queues for your worst-case outage, not your average case
Monitor the diagnostics — OPC-UA tells you when things are wrong, if you're listening

For manufacturing environments where protocols like Modbus and EtherNet/IP dominate the device layer, an edge platform like machineCDN provides change-based detection and intelligent batching that delivers subscription-like efficiency regardless of the underlying protocol — bridging the gap between legacy equipment and modern analytics pipelines.

The protocol layer is just plumbing. What matters is getting the right data, at the right time, to the right system — without burying your network or your cloud budget under a mountain of redundant samples.

PLC Alarm Word Decoding: How to Extract Bit-Level Alarm States for IIoT Monitoring [2026]

March 2, 2026 · 12 min read

Most plant engineers understand alarms at the HMI level — a red indicator lights up, a buzzer sounds, someone walks over to the machine. But when you connect PLCs to an IIoT platform for remote monitoring, you hit a fundamental data representation problem: PLCs don't store alarms as individual boolean values. They pack them into 16-bit registers called alarm words.

A single uint16 register can encode 16 different alarm conditions. A chiller with 10 refrigeration circuits might have 30+ alarm word registers — encoding hundreds of individual alarm states. If your IIoT platform doesn't understand this encoding, you'll either miss critical alarms or drown in meaningless raw register values.

This guide explains how alarm word decoding works at the edge, why it matters for reliable remote monitoring, and how to implement it without flooding your cloud platform with unnecessary data.
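As a first taste of what decoding means in practice, here is a minimal sketch that unpacks a 16-bit alarm word into named booleans and reports which bits changed between two reads. The bit-to-name mapping is invented for the example; real mappings come from the equipment vendor's register documentation.

```python
from typing import Dict, List, Tuple

# Hypothetical mapping for one alarm word; bit 0 is the least significant bit.
ALARM_BITS: Dict[int, str] = {
    0: "compressor_fault",
    1: "high_pressure_trip",
    2: "flow_switch_open",
}


def decode_alarm_word(word: int, bit_names: Dict[int, str]) -> Dict[str, bool]:
    """Expand a uint16 register into one boolean per named bit."""
    return {name: bool((word >> bit) & 1) for bit, name in bit_names.items()}


def changed_bits(prev: int, curr: int) -> List[Tuple[int, bool]]:
    """Return (bit index, new state) for every bit that differs between two reads."""
    diff = (prev ^ curr) & 0xFFFF  # XOR leaves a 1 wherever the two words disagree
    return [(bit, bool((curr >> bit) & 1)) for bit in range(16) if (diff >> bit) & 1]


print(decode_alarm_word(0b0101, ALARM_BITS))
print(changed_bits(0b0001, 0b0101))
```

Sending only the changed bits upstream, instead of the raw register on every poll, is what keeps alarm traffic small.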
