187 posts tagged with "Industrial IoT"

Industrial Internet of Things insights and best practices

Total Cost of Ownership for IIoT Platforms: The Complete Guide to What You'll Actually Spend

· 9 min read
MachineCDN Team
Industrial IoT Experts

When manufacturing leaders evaluate Industrial IoT platforms, the conversation usually starts with license cost. "What's it per device? Per user? Per data point?" These are the wrong first questions. License fees typically represent only 20-40% of your total IIoT investment over three years. The rest hides in implementation, infrastructure, engineering time, change management, and ongoing operations.

Time-Sensitive Networking (TSN) for Industrial Ethernet: Why Deterministic Communication Is the Future of IIoT [2026]

· 11 min read

If you've spent any time on a factory floor, you know the fundamental tension: control traffic needs hard real-time guarantees (microsecond-level determinism), while monitoring and analytics traffic just needs "fast enough." For decades, the industry solved this by running separate networks — a PROFINET or EtherNet/IP fieldbus for control, and standard Ethernet for everything else.

Time-Sensitive Networking (TSN) eliminates that compromise. It brings deterministic, bounded-latency communication to standard IEEE 802.3 Ethernet — meaning your motion control packets and your IIoT telemetry can share the same physical wire without interfering with each other.

This isn't theoretical. TSN-capable switches are shipping from Cisco, Belden, Moxa, and Siemens. OPC-UA Pub/Sub over TSN is in production pilots. And if you're designing an IIoT architecture today, understanding TSN isn't optional — it's the foundation of where industrial networking is going.

The Problem TSN Solves

Standard Ethernet is "best effort." When you plug a switch into a network, frames are forwarded based on MAC address tables, and if two frames need the same port at the same time, one waits. That waiting — buffering, queueing, potential frame drops — is completely acceptable for web traffic. It's catastrophic for servo drives.

Consider a typical plastics manufacturing cell. An injection molding machine has:

  • Motion control loop running at 1ms cycle time (servo drives, hydraulic valves)
  • Process monitoring polling barrel temperatures every 2-5 seconds
  • Quality inspection sending 10MB camera images to an edge server
  • IIoT telemetry batching 500 tag values to MQTT every 30 seconds
  • MES integration exchanging production orders and counts

Before TSN, this required at minimum two separate networks — often three. The motion controller ran on a dedicated real-time fieldbus (PROFINET IRT, EtherCAT, or SERCOS III). Process monitoring lived on standard Ethernet. And the camera system had its own GigE network to avoid flooding the process network.

TSN says: one network, one wire, zero compromises.

The TSN Standards Stack

TSN isn't a single protocol — it's a family of IEEE 802.1 standards that work together. Understanding which ones matter for industrial deployments is critical.

IEEE 802.1AS: Time Synchronization

Everything in TSN starts with a shared clock. 802.1AS (generalized Precision Time Protocol, or gPTP) synchronizes all devices on the network to a common time reference with sub-microsecond accuracy.

Key differences from standard PTP (IEEE 1588):

| Feature | IEEE 1588 PTP | IEEE 802.1AS gPTP |
| --- | --- | --- |
| Scope | Any IP network | Layer 2 only |
| Best Master Clock | Complex negotiation | Simplified selection |
| Peer delay measurement | Optional | Mandatory |
| Transport | UDP (L3) or L2 | L2 only |
| Typical accuracy | 1-10 μs | < 1 μs |

For plant engineers, the practical implication is this: every TSN bridge (switch) participates in time synchronization. There's no "transparent clock" mode where a switch just passes PTP packets through. Every hop actively measures its own residence time and adjusts timestamps accordingly.

This gives you a synchronized time base across the entire network — which is what makes scheduled traffic possible.

IEEE 802.1Qbv: Time-Aware Shaper (TAS)

This is the core of TSN determinism. 802.1Qbv introduces the concept of time gates on each egress port of a switch. Every port has up to 8 priority queues (matching 802.1Q priority code points), and each queue has a gate that opens and closes on a precise schedule.

The schedule repeats on a fixed cycle — say, every 1ms. During the first 100μs, only the highest-priority queue (motion control) is open. During the next 300μs, process data queues open. The remaining 600μs is available for best-effort traffic (IIoT telemetry, file transfers, web browsing).

Time Cycle (1ms example):
├── 0-100μs: Gate 7 OPEN (motion control only)
├── 100-400μs: Gate 5-6 OPEN (process monitoring, alarms)
├── 400-1000μs: Gates 0-4 OPEN (IIoT, MES, IT traffic)
└── Cycle repeats...
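
The gate logic above can be modeled in a few lines. This is an illustrative sketch, not a switch API — schedules on real hardware are programmed through vendor tools or a CNC; the constants mirror the example cycle:

```python
# Illustrative model of an 802.1Qbv gate control list (1 ms cycle).
# Values mirror the example schedule above; real switches are
# configured through vendor tools or 802.1Qcc, not code like this.
CYCLE_NS = 1_000_000  # 1 ms cycle, in nanoseconds

GATE_CONTROL_LIST = [
    (100_000, {7}),              # 0-100 us: motion control only
    (300_000, {5, 6}),           # 100-400 us: process data, alarms
    (600_000, {0, 1, 2, 3, 4}),  # 400-1000 us: best-effort traffic
]

def open_gates(t_ns: int) -> set:
    """Return the set of queues whose gates are open at time t_ns."""
    offset = t_ns % CYCLE_NS
    for duration_ns, gates in GATE_CONTROL_LIST:
        if offset < duration_ns:
            return gates
        offset -= duration_ns
    return set()  # unreachable if entries sum to the cycle time
```

A frame queued on priority 7 at t = 50 μs into the cycle transmits immediately; the same frame queued at t = 500 μs waits for the next cycle's slot — that waiting is bounded and known in advance, which is the whole point.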

The beauty of this approach is mathematical: if a motion control frame fits within its dedicated time slot, it's physically impossible for lower-priority traffic to delay it. No amount of IIoT telemetry bursts, camera image transfers, or IT traffic can interfere.

Practical consideration: TAS schedules must be configured consistently across all switches in the path. A motion control packet traversing 5 switches needs all 5 to have synchronized, compatible gate schedules. This is where centralized network configuration (via 802.1Qcc) becomes essential.

IEEE 802.1Qbu/802.3br: Frame Preemption

Even with scheduled gates, there's a problem: what if a low-priority frame is already being transmitted when the high-priority gate opens? On a 100Mbps link, a maximum-size Ethernet frame (1518 bytes) takes ~120μs to transmit. That's an unacceptable delay for a 1ms control loop.

Frame preemption solves this. It allows a switch to pause ("preempt") a low-priority frame mid-transmission, send the high-priority frame, then resume the preempted frame from where it left off.

The preempted frame is split into fragments, each with its own CRC for integrity checking. The receiving end reassembles them transparently. From the application's perspective, no frames are lost — the low-priority frame just arrives a bit later.

Why this matters in practice: Without preemption, you'd need to reserve guard bands — empty time slots before each high-priority window to ensure no large frame is in flight. Guard bands waste bandwidth. On a 100Mbps link with 1ms cycles, a 120μs guard band wastes 12% of available bandwidth. Preemption eliminates that waste entirely.
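
The 12% figure is worth checking with the underlying arithmetic (helper name is ours; preamble and inter-frame gap are ignored for this rough estimate):

```python
def frame_time_us(frame_bytes: int, link_mbps: int) -> float:
    """Wire time for one frame; Mbps is conveniently bits-per-microsecond.
    Preamble and inter-frame gap are ignored for a rough estimate."""
    return frame_bytes * 8 / link_mbps

# Max-size frame on 100 Mbps: ~121 us of wire time
guard_band_us = frame_time_us(1518, 100)

# One guard band per 1 ms cycle wastes ~12% of the link
waste_fraction = guard_band_us / 1000
```

At gigabit the same frame takes ~12 μs, so the guard-band penalty shrinks but never disappears; preemption removes it at any link speed.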

IEEE 802.1Qcc: Stream Reservation and Configuration

In a real plant, you don't manually configure gate schedules on every switch. 802.1Qcc defines a Centralized Network Configuration (CNC) model where a controller:

  1. Discovers the network topology
  2. Receives stream requirements from talkers (e.g., "I need to send 64 bytes every 1ms with max 50μs latency")
  3. Computes gate schedules across all switches in the path
  4. Programs the schedules into each switch

This is conceptually similar to how SDN (Software Defined Networking) works in data centers, adapted for the specific needs of industrial real-time traffic.
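
As a loose illustration of step 3, a CNC-style scheduler packs each stream's wire time into a shared cycle. Real CNCs solve a far harder per-hop, multi-path problem; the function below is a toy model with hypothetical names:

```python
# Toy sketch of the CNC scheduling step: allot each stream a slot in
# a shared cycle. Real 802.1Qcc controllers compute per-hop schedules
# across the whole topology; names and inputs here are illustrative.
def compute_slots(streams, cycle_us=1000.0):
    """streams: list of (name, bytes_per_cycle, link_mbps).
    Returns [(name, start_us, end_us)] slots packed back-to-back."""
    t = 0.0
    slots = []
    for name, nbytes, mbps in streams:
        duration = nbytes * 8 / mbps  # wire time in microseconds
        slots.append((name, t, t + duration))
        t += duration
    if t > cycle_us:
        raise ValueError("streams exceed the cycle time")
    return slots
```

A 64-byte motion frame on a 100 Mbps link needs about 5 μs of the 1 ms cycle; the CNC would reserve a compatible window on every switch in the path.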

Current reality: CNC tooling is still maturing. As of early 2026, most TSN deployments use vendor-specific configuration tools (Siemens TIA Portal for PROFINET over TSN, Rockwell's Studio 5000 for EtherNet/IP over TSN). Full, vendor-agnostic CNC is coming but isn't plug-and-play yet.

IEEE 802.1CB: Frame Replication and Elimination

For safety-critical applications (emergency stops, protective relay controls), TSN supports seamless redundancy through 802.1CB. A talker sends duplicate frames along two independent paths through the network. Each receiving bridge eliminates the duplicate, passing only one copy to the application.

If one path fails, the other delivers the frame with zero switchover time. There's no spanning tree reconvergence, no RSTP timeout — the redundant frame was already there.

This gives you "zero recovery time" redundancy that's comparable to PRP (Parallel Redundancy Protocol) or HSR (High-availability Seamless Redundancy), but integrated into the TSN framework.
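
The elimination half can be sketched as sequence-number deduplication. This is a simplified model — the standard's sequence recovery function also handles history windows and resets — but it shows the core idea:

```python
class DuplicateEliminator:
    """Simplified 802.1CB-style elimination: the first copy of each
    sequence number passes; the replica from the other path is dropped.
    Real FRER recovery functions also handle window resets and aging."""
    def __init__(self, history: int = 64):
        self.seen = set()
        self.history = history

    def accept(self, seq: int) -> bool:
        if seq in self.seen:
            return False       # replica already delivered: eliminate
        self.seen.add(seq)
        if len(self.seen) > self.history:
            self.seen.discard(min(self.seen))  # age out oldest entry
        return True
```

If path A delivers sequence 1 first, path B's copy is silently dropped; if path A then fails, path B's sequence 2 passes untouched — zero switchover time, exactly as described above.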

TSN vs. Existing Industrial Protocols

PROFINET IRT

PROFINET IRT (Isochronous Real-Time) achieves similar determinism to TSN, but it does so with proprietary hardware. IRT requires special ASICs in every switch and end device. Standard Ethernet switches don't work.

TSN-based PROFINET ("PROFINET over TSN") is Siemens' path forward. It preserves the PROFINET application layer while moving the real-time mechanism to TSN. The payoff: you can mix PROFINET devices with OPC-UA publishers, MQTT clients, and standard IT equipment on the same network.

EtherCAT

EtherCAT achieves extraordinary performance (sub-microsecond synchronization) by processing Ethernet frames "on the fly" — each slave modifies the frame as it passes through. This requires daisy-chain topology and dedicated EtherCAT hardware.

TSN can't match EtherCAT's raw performance in a daisy chain. But TSN supports standard star topologies with off-the-shelf switches, which is far more practical for plant-wide networks. The trend: EtherCAT for servo-level control within a machine, TSN for the plant-level network connecting machines.

CC-Link IE TSN

Mitsubishi's CC-Link IE TSN was one of the first industrial protocols to adopt TSN natively. It demonstrates the model: keep the application-layer protocol (CC-Link IE Field), replace the real-time Ethernet mechanism with standard TSN. This lets CC-Link IE coexist with other TSN traffic on the same network.

Practical Architecture: TSN in a Manufacturing Plant

Here's how a TSN-based IIoT architecture looks in practice:

┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│ Servo Drives │     │ PLC / Motion │     │ Edge Gateway │
│  (TSN NIC)   │─────│  Controller  │─────│ (machineCDN) │
└──────────────┘     └──────┬───────┘     └──────┬───────┘
                            │                    │
                     ┌──────┴───────┐            │
                     │  TSN Switch  │            │
                     │  (802.1Qbv)  │────────────┘
                     └──────┬───────┘
                            │
               ┌────────────┼────────────┐
               │            │            │
          ┌────┴────┐  ┌────┴────┐  ┌────┴─────┐
          │  HMI /  │  │ Vision  │  │ IT/Cloud │
          │  SCADA  │  │ System  │  │ Traffic  │
          └─────────┘  └─────────┘  └──────────┘

The TSN switch runs 802.1Qbv with a gate schedule that guarantees:

  • Priority 7: Motion control frames — guaranteed 100μs slots at 1ms intervals
  • Priority 5-6: Process monitoring, alarms — 300μs slots
  • Priority 3-4: MES, HMI, SCADA — allocated bandwidth in best-effort window
  • Priority 0-2: IIoT telemetry, file transfers — fills remaining bandwidth

The edge gateway collecting IIoT telemetry operates in the best-effort tier. It polls PLC tags over EtherNet/IP or Modbus TCP, batches the data, and publishes to MQTT — all without any risk of interfering with the control loops sharing the same wire.

Platforms like machineCDN that bridge industrial protocols to cloud already handle the data collection side — Modbus register grouping, EtherNet/IP tag reads, change-of-value filtering. TSN just means that data collection traffic coexists safely with control traffic, eliminating the need for separate networks.

Performance Benchmarks

Real-world TSN deployments show consistent results:

| Metric | Typical Performance |
| --- | --- |
| Time sync accuracy | 200-800 ns across 10 hops |
| Minimum guaranteed cycle | 31.25 μs (with preemption) |
| Maximum jitter (scheduled traffic) | < 1 μs |
| Maximum hops for < 10 μs latency | 5-7 (at 1 Gbps) |
| Bandwidth efficiency | 85-95% (vs 70-80% without preemption) |
| Frame preemption overhead | ~20 bytes per fragment (minimal) |

Compare this to standard Ethernet QoS (802.1p priority queues without TAS): priority queuing gives you statistical priority, not deterministic guarantees. Under heavy load, even high-priority frames can experience hundreds of microseconds of jitter.

Common Pitfalls

1. Not All "TSN-Capable" Switches Are Equal

Some switches support 802.1AS (time sync) but not 802.1Qbv (scheduled traffic). Others support Qbv but not frame preemption. Check the specific IEEE profiles supported, not just the TSN marketing label.

The IEC/IEEE 60802 TSN Profile for Industrial Automation defines the mandatory feature set for industrial use. Look for compliance with this profile.

2. End-Device TSN Support Is Still Emerging

A TSN switch is only half the equation. For guaranteed determinism, the end device (PLC, drive, sensor) needs a TSN-capable Ethernet controller that can transmit frames at precisely scheduled times. Many current PLCs use standard Ethernet NICs — they benefit from TSN's traffic isolation but can't achieve sub-microsecond transmission timing.

3. Configuration Complexity

TSN gate schedules are powerful but complex. A misconfigured schedule can:

  • Create "dead time" where no queue is open (wasted bandwidth)
  • Allow large best-effort frames to overflow into scheduled slots
  • Cause frame drops if the schedule doesn't account for inter-frame gaps

Start simple: define two traffic classes (real-time and best-effort) before attempting multi-level scheduling.

4. Cabling and Distance

TSN doesn't change Ethernet's physical limitations. Standard Cat 5e/6 runs up to 100m per segment. For plant-wide TSN, you'll need fiber between buildings and proper cable management. Time synchronization accuracy degrades with asymmetric cable lengths — use equal-length cables for links between TSN bridges.

Getting Started

If you're designing a new IIoT deployment or modernizing an existing plant network:

  1. Audit your traffic classes. Map every communication flow to a priority level. Most plants have 3-4 distinct classes: hard real-time control, soft real-time monitoring, IT/business, and bulk transfers.

  2. Start with TSN-capable spine switches. Even if your end devices aren't TSN-ready, deploying TSN switches at the aggregation layer gives you traffic isolation today and a deterministic upgrade path for tomorrow.

  3. Deploy IIoT data collection at the appropriate priority. Edge gateways that poll PLCs and publish to MQTT typically operate fine at priority 3-4. They don't need deterministic guarantees — they need reliable throughput. TSN ensures that throughput is available even when control traffic is present.

  4. Plan for centralized configuration. As your TSN deployment grows beyond a single machine cell, manual switch configuration becomes untenable. Invest in network management tools that support 802.1Qcc configuration.

The Convergence Thesis

TSN's real impact isn't about making Ethernet faster — it's about eliminating the network boundaries between IT and OT.

Today, most factories have 3-5 separate network segments with firewalls, protocol converters, and data diodes between them. Each segment has its own switches, cables, management tools, and maintenance burden.

TSN collapses these into a single converged network where control traffic and IT traffic coexist with mathematical guarantees. That means:

  • Lower infrastructure cost (one network instead of three)
  • Simpler troubleshooting (one set of diagnostic tools)
  • Direct IIoT access to real-time data (no protocol conversion needed)
  • Unified security policy (one network to secure, one set of ACLs)

For plant engineers deploying IIoT platforms, TSN means the data you need is already on the same network — no bridging, no gateways, no proprietary converters. You connect your edge device, configure the right traffic priority, and start collecting data from machines that were previously on isolated control networks.

The deterministic network is coming. The question is whether your infrastructure will be ready for it.

Tulip Pricing in 2026: What Does Tulip Actually Cost for Manufacturing?

· 9 min read
MachineCDN Team
Industrial IoT Experts

Tulip has carved out a distinctive position in the manufacturing software market. Positioned as a "frontline operations platform," Tulip lets manufacturing engineers build custom apps for their factory floor — quality checks, work instructions, machine monitoring dashboards — using a no-code builder. It is a compelling pitch, especially for plants with unique processes that off-the-shelf software cannot handle.

But how much does Tulip actually cost? If you have tried to find Tulip pricing on their website, you have already discovered the answer: it is not there. Like most enterprise manufacturing platforms, Tulip uses a sales-driven pricing model with custom quotes. This guide breaks down what we know about Tulip's pricing structure, what drives costs up, and how it compares to alternatives.

5G Private Networks for Manufacturing: What They Mean for Industrial IoT in 2026

· 9 min read
MachineCDN Team
Industrial IoT Experts

Every major IIoT conference in 2025 and 2026 has had at least one vendor breathlessly promoting 5G private networks as the future of manufacturing connectivity. "Ultra-reliable low-latency communication! Network slicing! Massive machine-type communication! One million devices per square kilometer!"

The hype is real. But so is the technology — when applied to the right use cases. The problem is that most manufacturers don't need a 5G private network. They need reliable, low-latency connectivity to their PLCs. And for the vast majority of factory IIoT deployments, existing cellular (4G LTE) and industrial Ethernet already deliver that.

Let's separate the genuine use cases from the marketing noise.

Autonomous Maintenance in the IIoT Era: How Operators Become Your First Line of Defense

· 9 min read
MachineCDN Team
Industrial IoT Experts

Autonomous Maintenance (AM) — the TPM pillar where operators take ownership of basic equipment care — has been practiced in manufacturing for decades. The idea is sound: operators who run machines every day are best positioned to detect early signs of degradation. They hear subtle changes in sound, feel unusual vibrations, and notice when something doesn't look right.

The problem is execution. In most plants, autonomous maintenance means laminated checklists, clipboards, and handwritten logs that sit in a binder until audit time. Operators dutifully check boxes ("Lubrication points — OK") without the tools to quantify what "OK" actually means. Is the bearing temperature 65°C (fine) or 85°C (about to fail)? The clipboard doesn't say.

IIoT is transforming autonomous maintenance from a human-only discipline into a data-augmented system where operators combine their physical presence and intuition with real-time machine data. The result: better detection, faster response, and maintenance culture that actually sticks.

Batched vs. Immediate Telemetry Delivery: When to Use Each in Industrial Monitoring [2026]

· 11 min read

Every industrial IoT edge gateway faces a fundamental architectural decision for every data point it collects: ship it now, or hold it and ship a batch later?

Get this wrong and you either drown your MQTT broker in tiny messages or you miss a critical alarm because it was sitting in a buffer when the compressor caught fire. This guide covers the engineering behind both approaches, the real-world trade-offs, and a framework for deciding which to use where.

Calculated Tags in Industrial IoT: Deriving Boolean Alarms from Raw PLC Registers [2026]

· 9 min read

If you've ever tried to monitor 32 individual alarm conditions from a PLC, you've probably discovered an uncomfortable truth: polling each one as a separate tag creates a nightmarish amount of bus traffic. The solution — calculated tags — is one of the most powerful yet underexplained patterns in industrial data acquisition.

This guide breaks down exactly how calculated tags work, why they matter for alarm systems, and how to implement them efficiently at the edge.

Cloud Connection Watchdogs for IIoT Edge Gateways: Designing Self-Healing MQTT Pipelines [2026]

· 12 min read

The edge gateway powering your factory floor monitoring has exactly one job that matters: get data from PLCs to the cloud. Everything else — protocol translation, tag mapping, batch encoding — is just preparation for that moment when bits leave the gateway and travel to your cloud backend.

And that's exactly where things break. MQTT connections go stale. TLS certificates expire silently. Cloud endpoints restart for maintenance. Cellular modems drop carrier. The gateway's connection looks alive — the TCP socket is open, the MQTT client reports "connected" — but nothing is actually getting delivered.

This is the silent failure problem, and it kills more IIoT deployments than any protocol misconfiguration ever will. This guide covers how to design watchdog systems that detect, diagnose, and automatically recover from every flavor of connectivity failure.

Why MQTT Connections Fail Silently

To understand why watchdogs are necessary, you need to understand what MQTT's keep-alive mechanism does and — more importantly — what it doesn't do.

MQTT keep-alive is a bi-directional ping. The client sends a PINGREQ, the broker responds with PINGRESP. If the broker doesn't hear from the client within 1.5× the keep-alive interval, it considers the client dead and closes the session. If the client doesn't get a PINGRESP, it knows the connection is lost.

Sounds robust, right? Here's where it falls apart:

The Half-Open Connection Problem

TCP connections can enter a "half-open" state where one side thinks the connection is alive, but the other side has already dropped it. This happens when a NAT gateway times out the session, a cellular modem roams to a new tower, or a firewall silently drops the route. The MQTT client's operating system still shows the socket as ESTABLISHED. The keep-alive PINGREQ gets queued in the kernel's send buffer — and sits there, never actually reaching the wire.

The Zombie Session Problem

The gateway reconnects after an outage and gets a new TCP session, but the broker still has the old session's resources allocated. Depending on the clean session flag and broker implementation, you might end up with duplicate subscriptions, missed messages on the command channel, or a broker that refuses the new connection because the old client ID is still "active."

The Token Expiration Problem

Cloud IoT platforms such as Azure IoT Hub and AWS IoT Core use SAS or JWT tokens for authentication. These tokens have expiration timestamps. When a token expires, the MQTT connection stays open until the next reconnection attempt — which then fails with an authentication error. If your reconnection logic doesn't refresh the token before retrying, you'll loop forever: connect → auth failure → reconnect → auth failure.

The Backpressure Problem

The MQTT client library reports "connected," publishes succeed (they return a message ID), but the broker is under load and takes 30 seconds to acknowledge the publish. Your QoS 1 messages pile up in the client's outbound queue. Eventually the client's memory is exhausted, publishes start failing, but the connection is technically alive.

Designing a Proper Watchdog

A production-grade edge watchdog doesn't just check "am I connected?" It monitors three independent health signals:

Signal 1: Connection State

Track the MQTT on_connect and on_disconnect callbacks. Maintain a state machine:

States:
DISCONNECTED → CONNECTING → CONNECTED → DISCONNECTING → DISCONNECTED

Transitions:
DISCONNECTED + config_available → CONNECTING (initiate async connect)
CONNECTING + on_connect(status=0) → CONNECTED
CONNECTING + on_connect(status≠0) → DISCONNECTED (log error, wait backoff)
CONNECTED + on_disconnect → DISCONNECTING → DISCONNECTED

The key detail: initiate MQTT connections asynchronously in a dedicated thread. A blocking mqtt_connect() call in the main data collection loop will halt PLC reads during the TCP handshake — which on a cellular link with 2-second RTT means 2 seconds of missed data. Use a semaphore or signal to coordinate: the connection thread posts "I'm ready" when it finishes, and the main loop picks it up on the next cycle.
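
A minimal tracker for this state machine might look like the following. The callback names follow the paho-mqtt convention, but nothing here depends on a particular client library:

```python
import enum
import threading

class Conn(enum.Enum):
    DISCONNECTED = 0
    CONNECTING = 1
    CONNECTED = 2

class ConnectionTracker:
    """Tracks MQTT connection state from library callbacks. A sketch:
    callback names mirror paho-mqtt, but the tracker is generic."""
    def __init__(self):
        self.state = Conn.DISCONNECTED
        self.lock = threading.Lock()  # callbacks fire on the network thread

    def connect_started(self):
        with self.lock:
            self.state = Conn.CONNECTING

    def on_connect(self, rc: int):
        with self.lock:
            # rc == 0 signals success in most MQTT client libraries
            self.state = Conn.CONNECTED if rc == 0 else Conn.DISCONNECTED

    def on_disconnect(self):
        with self.lock:
            self.state = Conn.DISCONNECTED
```

The lock matters: connect/disconnect callbacks fire on the MQTT library's network thread, while the main loop reads the state from its own.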

Signal 2: Delivery Confirmation

This is the critical signal that catches silent failures. Track the timestamp of the last successfully delivered message (acknowledged by the broker, not just sent by the client).

For QoS 1: the on_publish callback fires when the broker acknowledges receipt with a PUBACK. Record this timestamp every time it fires.

Last Delivery Tracking:
on_publish(packet_id) → last_delivery_timestamp = now()

Watchdog Check (every main loop cycle):
  if (now() - last_delivery_timestamp > WATCHDOG_TIMEOUT):
      trigger_reconnection()

What's the right watchdog timeout? It depends on your data rate:

| Data Rate | Suggested Timeout | Rationale |
| --- | --- | --- |
| Every 1s | 30–60s | 30 missed deliveries before alert |
| Every 5s | 60–120s | 12–24 missed deliveries |
| Every 30s | 120–300s | 4–10 missed deliveries |

The timeout should be significantly longer than your maximum expected inter-delivery interval. If your batch timeout is 30 seconds, a 120-second watchdog timeout gives you 4 batch cycles of tolerance before concluding something is wrong.
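
A delivery watchdog along these lines is only a few lines of code (names are ours; wire `on_publish_acked` to your client's PUBACK callback, not to the publish call):

```python
import time

class DeliveryWatchdog:
    """Fires when no broker-acknowledged delivery has happened within
    the timeout, even if the client still reports 'connected'. A
    sketch: the clock is injectable so it can be tested."""
    def __init__(self, timeout_s: float, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_delivery = clock()

    def on_publish_acked(self):
        # Call from the PUBACK callback, never at publish time --
        # publish time is exactly what a silent failure fakes.
        self.last_delivery = self.clock()

    def expired(self) -> bool:
        return self.clock() - self.last_delivery > self.timeout_s
```

The main loop checks `expired()` once per cycle and triggers a full reconnection when it returns true.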

Signal 3: Token/Certificate Validity

Before attempting reconnection, check the authentication material:

Token Check:
  if (token_expiration_timestamp ≠ 0):
      if (current_time > token_expiration_timestamp):
          log("WARNING: Cloud auth token may be expired")
      else:
          log("Token valid until {expiration_time}")

If your deployment uses SAS tokens with expiration timestamps, parse the se= (signature expiry) parameter from the connection string at startup. Log a warning when the token is approaching expiry. Some platforms provide token refresh mechanisms; others require a redeployment. Either way, knowing the token is expired before the first reconnection attempt saves you from debugging phantom connection failures at 3 AM.
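
For Azure-style SAS strings, the expiry can be parsed with the standard library. A sketch, assuming the documented `SharedAccessSignature sr=...&sig=...&se=...` shape (helper names are ours):

```python
import time
from urllib.parse import parse_qs

def sas_expiry(sas_token: str) -> int:
    """Extract the 'se' (signature expiry, Unix seconds) field from a
    SharedAccessSignature string. Returns 0 if the field is absent."""
    _, _, query = sas_token.partition(" ")
    fields = parse_qs(query)
    return int(fields.get("se", ["0"])[0])

def check_token(sas_token: str, warn_margin_s: int = 3600) -> str:
    exp = sas_expiry(sas_token)
    now = int(time.time())
    if exp == 0:
        return "no expiry field"
    if now > exp:
        return "expired"
    if exp - now < warn_margin_s:
        return "expiring soon"
    return "valid"
```

Run this once at startup and again before every reconnection attempt; a logged "expired" verdict turns a 3 AM mystery into a one-line diagnosis.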

Buffer-Aware Recovery: Don't Lose Data During Outages

The watchdog triggers a reconnection. But what happens to the data that was collected while the connection was down?

This is where most IIoT platforms quietly drop data. The naïve approach: if the MQTT publish call fails, discard the message and move on. This means any network outage, no matter how brief, creates a permanent gap in your historical data.

A proper store-and-forward buffer works like this:

Page-Based Buffer Architecture

Instead of a simple FIFO queue, divide a fixed memory region into pages. Each page holds multiple messages packed sequentially. Three page lists manage the lifecycle:

  • Free Pages: Empty, available for new data
  • Work Page: Currently being filled with new messages
  • Used Pages: Full pages waiting for delivery

Data Flow:
PLC Read → Batch Encoder → Work Page (append)
Work Page Full → Move to Used Pages queue

MQTT Connected:
Used Pages front → Send first message → Wait for PUBACK
PUBACK received → Advance read pointer
Page fully delivered → Move to Free Pages

MQTT Disconnected:
Used Pages continue accumulating
Work Page continues filling
If Free Pages exhausted → Reclaim oldest Used Page (overflow warning)

Why Pages, Not Individual Messages

Individual message queuing has per-message overhead that becomes significant at high data rates: pointer storage, allocation/deallocation, fragmentation. A page-based buffer pre-allocates a contiguous memory block (typically 1–2 MB on embedded edge hardware) and manages it as fixed-size pages. No dynamic allocation after startup. No fragmentation. Predictable memory footprint.

The overflow behavior is also better. When the buffer is full and the connection is still down, you sacrifice the oldest complete page — losing, say, 60 seconds of data from 10 minutes ago rather than randomly dropping individual messages from different time periods. The resulting data gap is clean and contiguous, which is much easier for downstream analytics to handle than scattered missing points.
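
A compact sketch of the page lifecycle described above, with oldest-page reclaim on overflow (page sizes and names are illustrative; a real implementation packs bytes into a pre-allocated block, not Python lists):

```python
from collections import deque

class PageBuffer:
    """Store-and-forward buffer with fixed-count pages. New messages
    append to the work page; full pages queue for delivery; when the
    buffer is full, the oldest used page is sacrificed whole."""
    def __init__(self, page_size: int = 4, page_count: int = 8):
        self.page_size = page_size
        self.page_count = page_count
        self.work = []            # page currently being filled
        self.used = deque()       # full pages awaiting delivery
        self.dropped_pages = 0    # overflow counter for diagnostics

    def append(self, msg) -> None:
        self.work.append(msg)
        if len(self.work) == self.page_size:
            if len(self.used) >= self.page_count - 1:
                self.used.popleft()       # reclaim oldest full page
                self.dropped_pages += 1
            self.used.append(self.work)
            self.work = []

    def next_page(self):
        """Oldest undelivered page, or None. Send it, then ack_page()."""
        return self.used[0] if self.used else None

    def ack_page(self) -> None:
        self.used.popleft()               # delivered: page is free again
```

Note the delivery contract: a page leaves the queue only after `ack_page()`, i.e. after the broker's PUBACK — a disconnect mid-send leaves the page at the front, ready for the catch-up phase.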

Disconnect Recovery Sequence

When the MQTT on_disconnect callback fires:

  1. Mark connection as down immediately — the buffer stops trying to send
  2. Reset "packet in flight" flag — the pending PUBACK will never arrive
  3. Continue accepting data from PLC reads into the buffer
  4. Do NOT flush or clear the buffer — all unsent data stays queued

When on_connect fires after reconnection:

  1. Mark connection as up
  2. Begin draining Used Pages from the front of the queue
  3. Send first queued message, wait for PUBACK, then send next
  4. Simultaneously accept new data into the Work Page

This "catch-up" phase is important to handle correctly. New real-time data is still flowing into the buffer while old data is being drained. The buffer must handle concurrent writes (from the PLC reading thread) and reads (for MQTT delivery) safely. Mutex protection on the page list operations is essential.

Async Connection Threads: The Pattern That Saves You

Network operations block. DNS resolution blocks. TCP handshakes block. TLS negotiation blocks. On a cellular connection with packet loss, a single connection attempt can take 5–30 seconds.

If your edge gateway has a single thread doing both PLC reads and MQTT connections, that's 5–30 seconds of missed PLC data every time the connection drops. For an injection molding machine with a 15-second cycle, you could miss an entire shot.

The solution is a dedicated connection thread:

Main Thread:
  loop:
      read_plc_tags()
      encode_and_buffer()
      dispatch_command_queue()
      check_watchdog()
      if watchdog_triggered:
          post_job_to_connection_thread()
      sleep(1s)

Connection Thread:
  loop:
      wait_for_job()                  // blocks on semaphore
      destroy_old_connection()
      create_new_mqtt_client()
      configure_tls()
      set_callbacks()
      mqtt_connect_async(host, port)
      signal_job_complete()           // post semaphore

Two semaphores coordinate this:

  • Job semaphore: Main thread posts to trigger reconnection, connection thread waits on it
  • Completion semaphore: Connection thread posts when done, main thread checks (non-blocking) before posting next job

Critical detail: check that the connection thread isn't already running before posting a new job. If the main thread fires the watchdog timeout every 120 seconds but the last reconnection attempt is still in progress (stuck in a 90-second TLS handshake), you'll get overlapping connection attempts that corrupt the MQTT client state.
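
The two-semaphore handshake, including the busy check, can be sketched with Python threading (the blocking connect step is passed in as a function; a real gateway would destroy and recreate its MQTT client there):

```python
import threading

class ConnectionWorker:
    """Dedicated reconnection thread. The main loop posts a job; the
    worker runs the (blocking) connect and marks itself idle again.
    request_reconnect() refuses to post while an attempt is in flight,
    preventing overlapping connection attempts."""
    def __init__(self, connect_fn):
        self.connect_fn = connect_fn          # blocking connect routine
        self.job = threading.Semaphore(0)     # job semaphore
        self.done = threading.Semaphore(1)    # completion semaphore: idle
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def request_reconnect(self) -> bool:
        if not self.done.acquire(blocking=False):
            return False                      # previous attempt running
        self.job.release()                    # wake the worker
        return True

    def _run(self):
        while True:
            self.job.acquire()                # wait for a job
            try:
                self.connect_fn()             # destroy + recreate client
            finally:
                self.done.release()           # mark idle again
```

The non-blocking `done.acquire` is the "critical detail" from above: a watchdog that fires while a 90-second TLS handshake is still in flight simply gets `False` back and tries again next cycle.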

Reconnection Backoff Strategy

When the cloud endpoint is genuinely down (maintenance window, region outage), aggressive reconnection attempts waste cellular data and CPU cycles. But when it's a transient network glitch, you want to reconnect immediately.

The right approach combines fixed-interval reconnect with watchdog escalation:

Reconnect Timing:
Attempt 1: Immediate (transient glitch)
Attempt 2: 5 seconds
Attempt 3: 5 seconds (cap at 5s for constant backoff)

Watchdog escalation:
if no successful delivery in 120 seconds despite "connected" state:
force full reconnection (destroy + recreate client)

Why not exponential backoff? In industrial settings, the most common failure mode is a brief network interruption — a cell tower handoff, a router reboot, a firewall session timeout. These resolve in 5–15 seconds. Exponential backoff would delay your reconnection to 30s, 60s, 120s, 240s... meaning you could be offline for 4+ minutes after a 2-second glitch. Constant 5-second retry with watchdog escalation provides faster recovery for the common case while still preventing connection storms during genuine outages.
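
Constant-interval retry plus watchdog escalation fits in a small policy object (a sketch with hypothetical names; the caller decides what "escalate" means, typically destroying and recreating the client):

```python
import time

class ReconnectPolicy:
    """Constant 5 s retry between connection attempts, plus escalation:
    if nothing has been delivered for escalate_after_s despite a
    'connected' client, force a full teardown. Clock is injectable."""
    RETRY_S = 5.0

    def __init__(self, escalate_after_s: float = 120.0,
                 clock=time.monotonic):
        self.escalate_after_s = escalate_after_s
        self.clock = clock
        self.last_attempt = None
        self.last_delivery = clock()

    def ready_to_retry(self) -> bool:
        if self.last_attempt is None:
            return True                       # first attempt: immediate
        return self.clock() - self.last_attempt >= self.RETRY_S

    def record_attempt(self):
        self.last_attempt = self.clock()

    def record_delivery(self):
        self.last_delivery = self.clock()

    def should_escalate(self) -> bool:
        return self.clock() - self.last_delivery > self.escalate_after_s
```

A 2-second glitch recovers on the first or second 5-second retry; a stale-but-"connected" session trips `should_escalate()` after 120 seconds.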

Device Status Broadcasting

Your edge gateway should periodically broadcast its own health status via MQTT. This serves two purposes: it validates the delivery pipeline end-to-end, and it gives the cloud platform visibility into the gateway fleet's health.

A well-designed status message includes:

  • System uptime (OS level — how long since last reboot)
  • Daemon uptime (application level — how long since last restart)
  • Connected device inventory (PLC types, serial numbers, link states)
  • Token expiration timestamp (proactive alerting for credential rotation)
  • Buffer utilization (how close to overflow)
  • Software version + build hash (for fleet management and OTA targeting)
  • Per-device tag counts and last-read timestamps (stale data detection)

Send a compact status on every connection establishment, and a detailed status periodically (every 5–10 minutes). The compact status acts as a "birth certificate" — the cloud platform immediately knows which gateway just came online and what equipment it's managing.
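
A status payload along these lines, serialized as JSON (field names are illustrative, not a machineCDN schema):

```python
import json
import time

def build_status(detail: bool, state: dict) -> str:
    """Compact status on every connect, detailed status periodically.
    'state' carries whatever the gateway tracks; keys are examples."""
    msg = {
        "ts": int(time.time()),
        "gateway_id": state["gateway_id"],
        "sw_version": state["sw_version"],           # + build hash
        "daemon_uptime_s": state["daemon_uptime_s"],
    }
    if detail:
        msg.update({
            "system_uptime_s": state["system_uptime_s"],
            "token_expiry": state.get("token_expiry", 0),
            "buffer_used_pct": state["buffer_used_pct"],
            "devices": state["devices"],  # PLC inventory + link states
        })
    return json.dumps(msg)
```

The compact form is the "birth certificate"; the detailed form, published every 5–10 minutes, doubles as an end-to-end delivery test since it rides the same MQTT pipeline as the telemetry.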

Real-World Failure Scenarios and How the Watchdog Handles Them

Scenario 1: Cellular Modem Roaming

Symptom: The TCP connection goes half-open. The MQTT client believes it's still connected, and published messages queue up in the OS socket buffer.
Detection: The watchdog timeout fires — no PUBACK received in 120 seconds despite continuous publishes.
Recovery: Force a reconnection. The buffer holds all unsent data; the client reconnects via the new cell tower and drains the buffer.
Data loss: Zero (buffer sized for a 2-minute outage).

Scenario 2: Cloud Platform Maintenance Window

Symptom: The MQTT broker goes offline. The client receives a disconnection callback.
Detection: Immediate — on_disconnect fires.
Recovery: 5-second reconnect attempts. The buffer accumulates data until the connection succeeds when maintenance ends.
Data loss: Zero if the maintenance window is shorter than buffer capacity (typically 10–30 minutes at normal data rates).
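That "buffer capacity in minutes" figure falls out of simple arithmetic: capacity = message rate × target outage length. A sketch — the function name and units are mine, not from any agent:

```python
def buffer_capacity(msgs_per_second: float, outage_minutes: float,
                    avg_msg_bytes: int) -> tuple[int, int]:
    """Messages and bytes needed to ride out an outage of the given length."""
    messages = int(msgs_per_second * outage_minutes * 60)
    return messages, messages * avg_msg_bytes

# Example: 10 msg/s, 30-minute maintenance window, 512-byte messages
slots, size = buffer_capacity(10, 30, 512)  # 18,000 messages, ~9.2 MB
```

Sizing for the longest outage you intend to survive (plus headroom) is what turns "maintenance window" from a data-loss event into a delivery delay.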

Scenario 3: SAS Token Expiration

Symptom: The connection drops, and reconnection attempts fail with an authentication error.
Detection: The watchdog notices repeated connection failures; a token timestamp check confirms expiration.
Recovery: Log a critical alert, wait for token refresh (manual or automated), then reconnect with the new token.
Data loss: Depends on token refresh time; the buffer provides a bridge.
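If the gateway authenticates with Azure-style SAS tokens, the expiry is embedded in the token itself as the `se=` field (Unix seconds), so the agent can check it before burning reconnect attempts on a dead credential. A sketch — the 10-minute refresh margin is an arbitrary choice:

```python
from urllib.parse import parse_qs

def sas_expires_at(token: str) -> int:
    """Extract the `se` (expiry, Unix seconds) field from an Azure-style
    SAS token: 'SharedAccessSignature sr=...&sig=...&se=1700000000'."""
    _, _, query = token.partition("SharedAccessSignature ")
    return int(parse_qs(query)["se"][0])

def should_refresh(token: str, now: float, margin_s: int = 600) -> bool:
    """Refresh ahead of expiry so reconnects never race the deadline."""
    return now >= sas_expires_at(token) - margin_s
```

Checking `should_refresh()` inside the watchdog loop turns the authentication-failure scenario from a reactive alert into a proactive rotation.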

Scenario 4: PLC Goes Offline

Symptom: Tag reads start returning errors, and the gateway loses link state to the PLC.
Detection: Link state monitoring fires immediately; the error is delivered to the cloud as a priority (unbatched) event.
Recovery: The gateway keeps attempting PLC reads. When the PLC comes back, link state is restored and reads resume.
MQTT impact: None — the cloud connection is independent of PLC connections, and the two failure modes are handled by separate watchdog systems.

Monitoring Your Watchdog (Yes, You Need to Watch the Watcher)

The watchdog itself needs observability:

  1. Log every watchdog trigger with reason (no PUBACK, connection timeout, token expiry)
  2. Count reconnection attempts per hour — a spike indicates infrastructure instability
  3. Track buffer high-water marks — if the buffer repeatedly approaches capacity, your connectivity is too unreliable for the data rate
  4. Alert on repeated authentication failures — this is almost always a credential rotation issue
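The four monitoring points above can be backed by a small metrics object. A sketch with hypothetical names — in a real agent these would feed your logging and alerting pipeline:

```python
import collections

class WatchdogMetrics:
    """Observability for the watchdog itself (hypothetical sketch)."""

    def __init__(self):
        self.triggers = collections.Counter()   # reason -> count (1: log every trigger)
        self.reconnects = collections.deque()   # timestamps of attempts (2: rate)
        self.buffer_high_water = 0              # (3: high-water mark)
        self.auth_failures = 0                  # (4: credential-rotation signal)

    def record_trigger(self, reason: str):
        self.triggers[reason] += 1

    def record_reconnect(self, now: float):
        self.reconnects.append(now)

    def reconnects_last_hour(self, now: float) -> int:
        """A spike here indicates infrastructure instability."""
        while self.reconnects and now - self.reconnects[0] > 3600:
            self.reconnects.popleft()
        return len(self.reconnects)

    def record_buffer_depth(self, depth: int):
        self.buffer_high_water = max(self.buffer_high_water, depth)
```

Alert thresholds (e.g. more than N reconnects per hour, high-water mark above 80% of capacity) are deployment-specific tuning, not fixed numbers.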

Platforms like machineCDN build this entire watchdog system into the edge agent — monitoring cloud connections, managing store-and-forward buffers, handling reconnection with awareness of both the MQTT transport state and the buffer delivery state. The result is a self-healing data pipeline where network outages create brief delays in cloud delivery but never cause data loss.

Implementation Checklist

Before deploying your edge gateway to production, verify:

  • Watchdog timer runs independently of MQTT callback threads
  • Connection establishment is fully asynchronous (dedicated thread)
  • Buffer survives connection loss (no flush on disconnect)
  • Buffer overflow discards oldest data, not newest
  • Token/certificate expiration is checked before reconnection
  • Reconnection doesn't overlap with in-progress connection attempts
  • Device status is broadcast on every successful reconnection
  • Buffer drain and new data accept can operate concurrently
  • All watchdog events are logged with timestamps for post-mortem analysis
  • PLC read loop continues uninterrupted during reconnection
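Two of the checklist items — "overflow discards oldest data" and "drain and accept can operate concurrently" — can be sketched with a bounded deque, since Python's `deque(maxlen=...)` evicts from the opposite end on overflow. This is an illustrative in-memory version; a production buffer would be disk-backed to survive process restarts:

```python
from collections import deque

class StoreAndForwardBuffer:
    """Bounded buffer that drops the OLDEST entry on overflow, so the most
    recent process data survives a long outage (hypothetical sketch)."""

    def __init__(self, capacity: int):
        self._q = deque(maxlen=capacity)  # appending when full evicts the oldest

    def append(self, msg):
        self._q.append(msg)

    def drain(self, publish, batch: int = 100):
        """Send up to `batch` queued messages; stop (keeping data) on failure.

        `publish` is assumed to return True only once delivery is confirmed
        (e.g. PUBACK received), so a message is never dropped before it lands.
        """
        sent = 0
        while self._q and sent < batch:
            msg = self._q[0]          # peek, don't pop, until delivery confirms
            if not publish(msg):
                break                 # leave msg queued for the next attempt
            self._q.popleft()
            sent += 1
        return sent
```

Because `deque.append` and `deque.popleft` are independently thread-safe in CPython, the PLC read loop can keep appending while a drain thread works through the backlog — the concurrent-drain property the checklist calls for.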

The unsexy truth about industrial IoT reliability is that it's not about the protocol choice or the cloud platform. It's about what happens in the 120 seconds after your connection drops. Get the watchdog right, and a 10-minute network outage is invisible to your operators. Get it wrong, and a 2-second glitch creates a permanent hole in your production data.

Build the self-healing pipeline. Your 3 AM self will thank you.

How to Build Custom Machine Reports for Manufacturing: A Guide to Data-Driven Production Analysis

· 8 min read
MachineCDN Team
Industrial IoT Experts

Standard canned reports answer the questions your vendor thought to ask. Custom reports answer the questions that actually keep you up at night. When a plant manager needs to know why Machine 14's cycle times drifted 8% last Tuesday between 2pm and 4pm, no pre-built dashboard can help. Here's how modern IIoT platforms enable manufacturing engineers to build custom machine reports — and why this capability separates serious platforms from expensive dashboards.

EtherNet/IP and CIP: A Practical Guide for Plant Engineers [2026]

· 11 min read

If you've ever connected to an Allen-Bradley Micro800 or CompactLogix PLC, you've used EtherNet/IP — whether you knew it or not. It's one of the most widely deployed industrial Ethernet protocols in North America, and for good reason: it runs on standard Ethernet hardware, supports TCP/IP natively, and handles everything from high-speed I/O updates to configuration and diagnostics over a single cable.

But EtherNet/IP is more than just "Modbus over Ethernet." Its underlying protocol — the Common Industrial Protocol (CIP) — is a sophisticated object-oriented messaging framework that fundamentally changes how edge devices, gateways, and cloud platforms interact with PLCs.

This guide covers what plant engineers and IIoT architects actually need to know.