Skip to main content

125 posts tagged with "Manufacturing"

Smart manufacturing and Industry 4.0

View All Tags

IIoT for Aerospace Manufacturing: Monitoring CNC Machining, Heat Treatment, and NDT Equipment in Real Time

· 10 min read
MachineCDN Team
Industrial IoT Experts

In aerospace manufacturing, a tolerance deviation of 0.001 inches on a turbine blade can ground a fleet. A heat treatment furnace that overshoots by 15°F for 3 minutes during a titanium solution treatment cycle creates a latent metallurgical defect that might not manifest for 10 years — when the part is at 35,000 feet.

This is the fundamental tension of aerospace manufacturing: the margins for error are measured in thousandths, the consequences of error are measured in lives, and the production pressure is measured in billions of dollars of backlogged orders.

Boeing and Airbus currently have a combined backlog of over 13,000 aircraft. Tier 1 suppliers like Spirit AeroSystems, Safran, and GE Aerospace are running at capacity. Every hour of unplanned downtime on a 5-axis CNC machining center or a vacuum heat treatment furnace ripples through a supply chain that's already stretched to its limits.

IIoT doesn't solve the backlog. But it solves the equipment reliability, process compliance, and quality traceability challenges that make aerospace manufacturing so demanding — and so expensive when things go wrong.

IIoT for Beverage Bottling Lines: Monitoring Fill Levels, Cap Torque, and Label Accuracy in Real Time

· 11 min read
MachineCDN Team
Industrial IoT Experts

A modern beverage bottling line runs at 600-1,200 bottles per minute. At that speed, a fill level variance of 2ml goes undetected for 30 seconds and you've just sent 450 bottles downstream with the wrong volume — triggering quality holds, potential recalls, and guaranteed retail chargebacks.

Cap torque drifts by 5 in-lbs? You won't know until the line produces 2,000 units with loose caps and consumers find flat soda on shelves. Label misalignment? Your brand manager sees it on Instagram before QA catches it on the floor.

Beverage bottling is one of the highest-speed, lowest-margin manufacturing environments in the world. The difference between a profitable line and a money-losing one often comes down to catching micro-deviations in real time — before they compound into batch rejections.

This is exactly where Industrial IoT transforms operations. Not by replacing your filling machines, but by adding a continuous data layer that catches what human inspection can't at 800 bottles per minute.

Why Most Manufacturing AI Projects Stall After the Pilot Phase (And the 5 Fixes That Actually Work)

· 11 min read
MachineCDN Team
Industrial IoT Experts

The pilot worked beautifully. Your AI model predicted bearing failures on Line 3 with 94% accuracy. The CEO saw the demo. The board heard about "digital transformation." Budget was approved for a plant-wide rollout.

That was eighteen months ago. The model still runs on Line 3. Maintenance still uses clipboards everywhere else. The data scientist who built the pilot left for a fintech startup. And nobody can explain why a model that worked perfectly on one line won't work on the other seven.

If this sounds familiar, you're not alone. According to a McKinsey survey on AI in manufacturing, 87% of manufacturing AI projects never make it past the pilot phase. Not because the AI doesn't work — but because the organizational, data, and infrastructure challenges of scaling from one line to a full plant were never addressed.

The AI isn't the problem. The pilot model is the problem.

Modbus TCP Gateway Failover: Building Redundant PLC Communication for Manufacturing [2026]

· 14 min read

Modbus TCP gateway failover architecture

Modbus TCP remains the most widely deployed industrial protocol in manufacturing. Despite being a 1979 design extended to Ethernet in 1999, its simplicity — request/response over TCP, 16-bit registers, four function codes that cover 90% of use cases — makes it the lowest common denominator that virtually every PLC, VFD, and sensor hub supports.

But simplicity has a cost: Modbus TCP has zero built-in redundancy. No heartbeats. No automatic reconnection. No session recovery. When the TCP connection drops — and in a factory environment with electrical noise, cable vibrations, and switch reboots, it will drop — your data collection goes dark until someone manually restarts the gateway or the application logic handles recovery.

This guide covers the architecture patterns for building resilient Modbus TCP gateways that maintain data continuity through link failures, PLC reboots, and network partitions.

Understanding Why Modbus TCP Connections Fail

Before designing failover, you need to understand the failure modes. In a year of operating Modbus TCP gateways across manufacturing floors, you'll encounter all of these:

Failure Mode 1: TCP Connection Reset (ECONNRESET)

The PLC or an intermediate switch drops the TCP connection. Common causes:

  • PLC firmware update or watchdog reboot
  • Switch port flap (cable vibration, loose connector)
  • PLC connection limit exceeded (most support 6-16 simultaneous TCP connections)
  • Network switch spanning tree reconvergence (can take 30-50 seconds on older managed switches)

Detection time: Immediate — the next modbus_read_registers() call returns ECONNRESET.

Failure Mode 2: Connection Timeout (ETIMEDOUT)

The PLC stops responding but doesn't close the connection. The TCP socket remains open, but reads time out. Common causes:

  • PLC CPU overloaded (complex ladder logic consuming all scan cycles)
  • Network congestion (broadcast storms, misconfigured VLANs)
  • IP conflict (another device grabbed the PLC's address)
  • PLC in STOP mode (program halted, communication stack still partially active)

Detection time: Your configured response timeout (typically 500ms-2s) per read operation. For a 100-tag poll cycle, a full timeout can mean 50-200 seconds of dead time before you confirm the link is down.

Failure Mode 3: Connection Refused (ECONNREFUSED)

The PLC's TCP stack is active but Modbus is not. Common causes:

  • PLC in bootloader mode after firmware flash
  • Modbus TCP server disabled in PLC configuration
  • Firewall rule change on managed switch blocking port 502

Detection time: Immediate on the next connection attempt.

Failure Mode 4: Silent Failure (EPIPE/EBADF)

The connection appears open from the gateway's perspective, but the PLC has already closed it. The first write or read on a stale socket triggers EPIPE or EBADF. This happens when:

  • PLC reboots cleanly but the gateway missed the FIN packet (common with UDP-accelerated switches)
  • OS socket cleanup runs asynchronously

Detection time: Only on the next read/write attempt — could be seconds to minutes if polling intervals are long.

The Connection Recovery State Machine

A resilient Modbus TCP gateway implements a state machine with five states:

                    ┌─────────────┐
│ CONNECTING │
│ (backoff) │
└──────┬──────┘
│ modbus_connect() success
┌──────▼──────┐
┌─────│ CONNECTED │─────┐
│ │ (polling) │ │
│ └──────┬──────┘ │
│ │ │
timeout/error link_state=1 read error
│ │ │
┌────────▼───┐ ┌─────▼─────┐ ┌──▼──────────┐
│ RECONNECT │ │ READING │ │ LINK_DOWN │
│ (flush + │ │ (normal) │ │ (notify + │
│ close) │ │ │ │ reconnect) │
└────────┬───┘ └───────────┘ └──┬──────────┘
│ │
└────────────────────────┘
close + backoff

Key Implementation Details

1. Always close before reconnecting. A stale Modbus context will leak file descriptors and eventually exhaust the OS socket table. When any error occurs in the ETIMEDOUT/ECONNRESET/EPIPE/EBADF family, the correct sequence is:

modbus_flush(context)   → drain pending data
modbus_close(context) → close the TCP socket
sleep(backoff_ms) → prevent reconnection storms
modbus_connect(context) → establish new connection

Never call modbus_connect() on a context that hasn't been closed first. The libmodbus library doesn't handle this gracefully — you'll get zombie sockets.

2. Implement exponential backoff with a ceiling. After a connection failure, don't retry immediately — the PLC may be rebooting and needs time. A practical backoff schedule:

AttemptDelayCumulative Time
11 second1s
22 seconds3s
34 seconds7s
48 seconds15s
5+10 seconds (ceiling)25s+

The 10-second ceiling is important — you don't want the backoff growing to minutes. PLC reboots typically complete in 15-45 seconds. A 10-second retry interval means you'll reconnect within one retry cycle after the PLC comes back.

3. Flush serial buffers for Modbus RTU. If your gateway also supports Modbus RTU (serial), always call modbus_flush() before reading after a reconnection. Serial buffers can contain stale response fragments from before the disconnection, and these will corrupt the first read's response parsing.

4. Track link state as a first-class data point. Don't just log connection status — deliver it to the cloud alongside your tag data. A special "link state" tag (boolean: 0 = disconnected, 1 = connected) transmitted immediately (not batched) gives operators real-time visibility into gateway health. When the link transitions from 1→0, send a notification. When it transitions from 0→1, force-read all tags to establish current values.

Register Grouping: Minimizing Round Trips

Modbus TCP's request/response model means each read operation incurs a full TCP round trip (~0.5-5ms on a local network, 50-200ms over cellular). Reading 100 individual registers one at a time takes 100 round trips — potentially 500ms on a good day.

The optimization is contiguous register grouping — instead of reading registers one at a time, read blocks of contiguous registers in a single request.

The Grouping Algorithm

Given a sorted list of register addresses to read, the gateway walks through them and groups contiguous registers that meet three criteria:

  1. Same function code — you can't mix input registers (FC 4, 3xxxxx) with holding registers (FC 3, 4xxxxx) in one request
  2. Contiguous addresses — register N+1 immediately follows register N (with appropriate gaps filled)
  3. Same polling interval — don't group a 1-second alarm tag with a 60-second temperature tag
  4. Maximum register count ≤ 50 — while Modbus allows up to 125 registers per read, keeping requests under 50 registers (~100 bytes) prevents fragmentation issues on constrained networks and limits the blast radius of a single failed read

Example: Optimized vs Naive Polling

Consider a chiller with 10 compressor circuits, each reporting 16 process variables:

Naive approach: 160 individual reads = 160 round trips

Read register 300003 → 1 register  (CQT1 Condenser Inlet Temp)
Read register 300004 → 1 register (CQT1 Approach Temp)
Read register 300005 → 1 register (CQT1 Chill In Temp)
...
Read register 300016 → 1 register (CQT1 Superheat Temp)

Grouped approach: Registers 300003-300018 are contiguous, same function code (FC 4), same interval (60s)

Read registers 300003 → 16 registers (all CQT1 process data in ONE request)
Read registers 300350 → 16 registers (all CQT2 process data in ONE request)
...

Result: 160 round trips → 10 round trips. On a 2ms RTT network, that's 320ms → 20ms.

Handling Non-Contiguous Gaps

Real PLC register maps aren't perfectly contiguous. The chiller above has CQT1 data at registers 300003-300018 and CQT2 data starting at 300350 — a gap of 332 registers. Don't try to read 300003-300695 in one request to "fill the gap" — you'll read hundreds of irrelevant registers and waste bandwidth.

Instead, break at non-contiguous boundaries:

Group 1: 300003-300018  (16 registers, CQT1 process data)
Group 2: 300022-300023 (2 registers, CQT1 alarm bits)
Group 3: 300038-300043 (6 registers, CQT1 expansion + version)
Group 4: 300193-300194 (2 registers, CQT1 status words)
Group 5: 300260-300278 (19 registers, CQT2-10 alarm bits)
Group 6: 300350-300366 (17 registers, CQT2-3 temperatures)
...

The 50ms Inter-Read Delay

Between consecutive Modbus read requests, insert a 50ms delay. This sounds counterintuitive — why slow down? — but it serves two purposes:

  1. PLC scan cycle breathing room. Many PLCs process Modbus requests in their communication interrupt, which competes with the main scan cycle. Rapid-fire requests can extend the scan cycle, triggering watchdog timeouts on safety-critical programs.

  2. TCP congestion avoidance. On constrained networks (especially cellular gateways), bursting 50 reads in 100ms can overflow buffers. The 50ms spacing distributes the load evenly.

Dual-Path Failover Architecture

For mission-critical data collection (pharmaceutical batch records, automotive quality traceability), a single gateway represents a single point of failure. The dual-path architecture uses two independent gateways polling the same PLC:

Architecture

                    ┌──────────┐
│ PLC │
│ (Modbus) │
└──┬───┬──┘
│ │
Port 502 │ │ Port 502
│ │
┌────────▼┐ ┌▼────────┐
│Gateway A│ │Gateway B│
│(Primary)│ │(Standby)│
└────┬────┘ └────┬────┘
│ │
▼ ▼
┌────────────────────┐
│ MQTT Broker │
│ (cloud/edge) │
└────────────────────┘

Active/Standby vs Active/Active

Active/Standby: Gateway A polls the PLC. Gateway B monitors A's heartbeat (via MQTT LWT or a shared health topic). If A goes silent for >30 seconds, B starts polling. When A recovers, it checks B's status and either resumes as primary or remains standby.

  • Pro: Only one gateway reads from the PLC, respecting the PLC's connection limit
  • Con: 30-second failover gap

Active/Active: Both gateways poll the PLC simultaneously. The cloud platform deduplicates data based on timestamps and device serial numbers. If one gateway fails, the other's data is already flowing.

  • Pro: Zero-downtime failover, no coordination needed
  • Con: Doubles PLC connection count and network traffic. Most PLCs support this (6-16 connections), but verify.

Recommendation: Active/Active with cloud-side deduplication. The PLC connection overhead is negligible compared to the operational cost of a 30-second data gap. Cloud-side deduplication is trivial — tag ID + timestamp + device serial number provides a natural composite key.

Store-and-Forward: Surviving Cloud Disconnections

Gateway-to-PLC failover handles half the problem. The other half is cloud connectivity — cellular links drop, VPN tunnels restart, and MQTT brokers undergo maintenance. During these outages, the gateway must buffer data locally and forward it when connectivity returns.

The Paged Ring Buffer

A production-grade store-and-forward buffer uses a paged ring buffer — pre-allocated memory divided into fixed-size pages, with separate write and read pointers:

┌──────────┐
│ Page 0 │ ← read_pointer (next to transmit)
│ [data] │
├──────────┤
│ Page 1 │
│ [data] │
├──────────┤
│ Page 2 │ ← write_pointer (next to fill)
│ [empty] │
├──────────┤
│ Page 3 │
│ [empty] │
└──────────┘

When the MQTT connection is healthy:

  1. Tag data is written to the current work page
  2. When the page fills, it moves to the "used" queue
  3. The buffer transmits the oldest used page to MQTT (QoS 1 for delivery confirmation)
  4. On publish acknowledgment, the page moves to the "free" queue

When the MQTT connection drops:

  1. Tag data continues writing to pages (the PLC doesn't stop producing data)
  2. Used pages accumulate in the queue
  3. If the queue fills, the oldest used page is recycled as a work page — accepting data loss of the oldest data to preserve the newest

This design guarantees:

  • Constant memory usage — no dynamic allocation on an embedded device
  • Graceful degradation — oldest data is sacrificed first
  • Thread safety — mutex-protected page transitions prevent race conditions between the reading thread (PLC poller) and writing thread (MQTT publisher)

Sizing the Buffer

Buffer size depends on your data rate and expected maximum outage duration:

buffer_size = data_rate_bytes_per_second × max_outage_seconds × 1.2 (overhead)

For a typical deployment:

  • 100 tags × 4 bytes/value = 400 bytes per poll cycle
  • 1 poll per second = 400 bytes/second
  • Binary encoding with batch overhead: ~500 bytes/second
  • Target 4 hours of offline buffering: 500 × 14,400 = 7.2MB

With 512KB pages, that's ~14 pages. Allocate 16 pages (minimum 3 needed for operation: one writing, one transmitting, one free) for an 8MB buffer.

Binary vs JSON Encoding for Buffered Data

JSON is wasteful for buffered data. The same 100-tag reading:

  • JSON: {"groups":[{"ts":1709500800,"device_type":1018,"serial_number":23456,"values":[{"id":1,"values":[245]},{"id":2,"values":[312]},...]}]} → ~2KB
  • Binary: Header (0xF7 + group count + timestamp + device info) + packed tag values → ~500 bytes

Binary encoding uses a compact format:

[0xF7] [num_groups:4] [timestamp:4] [device_type:2] [serial_num:4] 
[num_values:4] [tag_id:2] [status:1] [value_count:1] [value_size:1] [values...]

Over a cellular connection billing at $5/GB, the 4× bandwidth savings of binary encoding pays for itself within days on a busy gateway.

Alarm Tag Priority: Batched vs Immediate Delivery

Not all tags are created equal. A temperature reading that's 0.1°C different from the last poll can wait for the next batch. An alarm bit that just flipped from 0 to 1 cannot.

The gateway should support two delivery modes per tag:

Batched Delivery (Default)

Tags are accumulated in the batch buffer and delivered on the batch timeout (typically 5-30 seconds) or batch size limit (typically 10-500KB). This is efficient for process variables that change slowly.

Configuration:

{
"name": "Tank Temperature",
"id": 1,
"addr": 300202,
"type": "int16",
"interval": 60,
"compare": false
}

Immediate Delivery (do_not_batch)

Tags bypass the batch buffer entirely. When the value changes, a single-value batch is created, serialized, and pushed to the output buffer immediately. This is essential for:

  • Alarm words — operators need sub-second alarm notification
  • Machine state transitions — running/stopped/faulted changes trigger downstream actions
  • Safety interlocks — any safety-relevant state change must be delivered without batching delay

Configuration:

{
"name": "CQT 1 Alarm Bits 1",
"id": 163,
"addr": 300022,
"type": "uint16",
"interval": 1,
"compare": true,
"do_not_batch": true
}

The compare: true flag is critical for immediate-delivery tags — without it, the gateway would transmit on every read cycle (every 1 second), flooding the network. With comparison enabled, the gateway only transmits when the alarm word actually changes — zero bandwidth during normal operation, instant delivery when an alarm fires.

Calculated Tags: Extracting Bit-Level Alarms from PLC Words

Many PLCs pack multiple alarm states into a single 16-bit register. Bit 0 might indicate "high temperature," bit 1 "low flow," bit 2 "compressor fault," etc. Rather than requiring the cloud platform to perform bitwise decoding, a production gateway extracts individual bits and delivers them as separate boolean tags.

The extraction uses shift-and-mask arithmetic:

alarm_word = 0xA5 = 10100101 in binary

bit_0 = (alarm_word >> 0) & 0x01 = 1 → "High Temperature" = TRUE
bit_1 = (alarm_word >> 1) & 0x01 = 0 → "Low Flow" = FALSE
bit_2 = (alarm_word >> 2) & 0x01 = 1 → "Compressor Fault" = TRUE
...

These calculated tags are defined as children of the parent alarm word. When the parent tag changes value (detected by the compare flag), all child calculated tags are re-evaluated and delivered. If the parent doesn't change, no child processing occurs — zero CPU overhead during steady state.

This architecture keeps the PLC configuration simple (one alarm word per circuit) while giving cloud consumers individual, addressable alarm signals.

Putting It All Together: A Production Gateway Checklist

Before deploying a Modbus TCP gateway to production, verify:

  • Connection recovery handles all five error codes (ETIMEDOUT, ECONNRESET, ECONNREFUSED, EPIPE, EBADF)
  • Exponential backoff with 10-second ceiling prevents reconnection storms
  • Link state is delivered as a first-class tag (not just logged)
  • Register grouping batches contiguous same-function-code registers (max 50 per read)
  • 50ms inter-read delay protects PLC scan cycle integrity
  • Store-and-forward buffer sized for target offline duration
  • Binary encoding used for buffered data (not JSON)
  • Alarm tags configured with compare: true and immediate delivery
  • Calculated tags extract individual bits from alarm words
  • Force-read on reconnection ensures fresh values after any link recovery
  • Hourly full re-read resets all "read once" flags to catch any drift

machineCDN and Modbus TCP

machineCDN's edge gateway implements these patterns natively — connection state management, contiguous register grouping, binary batch encoding, paged ring buffers, and calculated alarm tags — so that plant engineers can focus on which tags to monitor rather than how to keep the data flowing. The gateway's JSON-based tag configuration maps directly to the PLC's register map, and the dual-format delivery system (binary for efficiency, JSON for interoperability) adapts to whatever network path is available.

For manufacturing teams running Modbus TCP equipment — from chillers and dryers to injection molding machines and conveying systems — getting the gateway layer right is the difference between a monitoring system that works in the lab and one that survives a year on the factory floor.


Building a Modbus TCP monitoring system? machineCDN handles protocol translation, buffering, and cloud delivery for manufacturing equipment — so your data keeps flowing even when your network doesn't.

How to Monitor Hydraulic Press Systems with IIoT: A Practical Guide for Maintenance Engineers

· 10 min read
MachineCDN Team
Industrial IoT Experts

A hydraulic press failure doesn't give you a gentle warning. One day the press is forming 800-ton stampings at 12 cycles per minute. The next day, a seal blows, hydraulic fluid sprays across the floor, production stops, and you're looking at $50,000 in emergency repairs, lost production, and hazmat cleanup.

The tragedy is that every hydraulic press failure tells the same story in hindsight: the pressure was drifting for weeks, the oil temperature was climbing for months, and the pump vibration had been elevated since the last oil change. The data was there — nobody was watching it.

IIoT transforms hydraulic press maintenance from reactive firefighting to predictive precision. By continuously monitoring the parameters that precede failure, you can schedule repairs during planned downtime and eliminate the catastrophic failures that shut down your stamping, forming, and molding operations.

This guide covers the specific monitoring points, threshold values, and implementation approach for hydraulic press systems in manufacturing — based on what actually predicts failures, not what vendors think you should monitor.

How to Build a Machine Health Scoring System for Manufacturing: From Raw Sensor Data to Actionable Scores

· 9 min read
MachineCDN Team
Industrial IoT Experts

A maintenance manager walks into the daily production meeting. The plant manager asks: "How are our machines doing?"

The honest answer — "Well, the hydraulic pump on Press 4 is showing elevated vibration in the 3× RPM harmonic, suggesting possible misalignment, and the spindle motor on CNC-7 has been drawing 12% more current than baseline, which could indicate bearing degradation, and..." — puts the room to sleep by sentence two.

What the plant manager actually wants is a number. A score. A simple indicator that says: this machine is healthy, this one needs attention, this one is going to break.

That's what a machine health scoring system provides. Here's how to build one that's practical, accurate, and actually used.

The Business Case for Cellular IIoT Connectivity: Why Smart Manufacturers Are Bypassing Plant Networks

· 10 min read
MachineCDN Team
Industrial IoT Experts

Every IIoT deployment hits the same wall within the first week: IT. The factory floor needs to send machine data to the cloud. The IT department needs to approve network access, configure firewall rules, set up VLANs, conduct security reviews, and integrate the new traffic into their existing network architecture. What should be a two-day deployment becomes a three-month project — not because the technology is complex, but because the organizational process around network access was designed to prevent exactly the kind of connectivity that IIoT requires.

Cellular IIoT connectivity eliminates this wall entirely. Instead of routing machine data through the plant network, cellular-connected edge devices use their own mobile data connection to send data directly to the cloud. No IT involvement. No network configuration. No security review. No firewall rules. The machine data never touches the plant network at all.

This is not a workaround or a compromise. For a growing number of manufacturers, cellular connectivity is the architecturally superior approach to IIoT deployment — faster to deploy, more secure in practice, and cheaper when you account for the true cost of IT-dependent deployments.

Edge Gateway Lifecycle Architecture: From Boot to Steady-State Telemetry in Industrial IoT [2026]

· 14 min read

Most IIoT content treats the edge gateway as a black box: PLC data goes in, cloud data comes out. That's fine for a sales deck. It's useless for the engineer who needs to understand why their gateway loses data during a network flap, or why configuration changes require a full restart, or why it takes 90 seconds after boot before the first telemetry packet reaches the cloud.

This article breaks down the complete lifecycle of a production industrial edge gateway — from the moment it powers on to steady-state telemetry delivery, including every decision point, failure mode, and recovery mechanism in between. These patterns are drawn from real-world gateways running on resource-constrained hardware (64MB RAM, MIPS processors) in plastics manufacturing plants, monitoring TCUs, chillers, blenders, and dryers 24/7.

Phase 1: Boot and Configuration Load

When a gateway boots (or restarts after a configuration change), the first task is loading its configuration. In production deployments, there are typically two configuration layers:

The Daemon Configuration

This is the central configuration that defines what equipment to talk to:

{
"plc": {
"ip": "192.168.5.5",
"modbus_tcp_port": 502
},
"serial_device": {
"port": "/dev/rs232",
"baud": 9600,
"parity": "none",
"data_bits": 8,
"stop_bits": 1,
"byte_timeout_ms": 4,
"response_timeout_ms": 100
},
"batch_size": 4000,
"batch_timeout_sec": 60,
"startup_delay_sec": 30
}

The startup delay is a critical design choice. When a gateway boots simultaneously with the PLCs it monitors (common after a power outage), the PLCs may need 10-30 seconds to initialize their communication stacks. If the gateway immediately tries to connect, it fails, marks the PLC as unreachable, and enters a slow retry loop. A 30-second startup delay avoids this race condition.

The serial link parameters (baud, parity, data bits, stop bits) must match the PLC exactly. A mismatch here produces zero error feedback — you just get silence. The byte timeout (time between consecutive bytes) and response timeout (time to wait for a complete response) are tuned per equipment type. TCUs with slower processors may need 100ms+ response timeouts; modern PLCs respond in 10-20ms.

The Device Configuration Files

Each equipment type gets its own configuration file that defines which registers to read, what data types to expect, and how often to poll. These files are loaded dynamically based on the device type detected during the discovery phase.

A real device configuration for a batch blender might define 40+ tags, each with:

  • A unique tag ID (1-32767)
  • The Modbus register address or EtherNet/IP tag name
  • Data type (bool, int8, uint8, int16, uint16, int32, uint32, float)
  • Element count (1 for scalars, 2+ for arrays or multi-register values)
  • Poll interval in seconds
  • Whether to compare with previous value (change-based delivery)
  • Whether to send immediately or batch with other values

Hot-reload capability is essential for production systems. The gateway should monitor configuration file timestamps and automatically detect changes. When a configuration file is modified (pushed via MQTT from the cloud, or copied via SSH during maintenance), the gateway reloads it without requiring a full restart. This means configuration updates can be deployed remotely to gateways in the field without disrupting data collection.

Phase 2: Device Detection

After configuration loads successfully, the gateway enters the device detection phase. This is where protocol-level intelligence matters.

Multi-Protocol Discovery

A well-designed gateway doesn't assume which protocol the PLC speaks. Instead, it tries multiple protocols in order of preference:

Step 1: Try EtherNet/IP

The gateway sends a CIP (Common Industrial Protocol) request to the configured IP address, attempting to read a device_type tag. EtherNet/IP uses the ab-eip protocol with a micro800 CPU profile (for Allen-Bradley Micro8xx series). If the PLC responds with a valid device type, the gateway knows this is an EtherNet/IP device.

Connection path: protocol=ab-eip, gateway=192.168.5.5, cpu=micro800
Target tag: device_type (uint16)
Timeout: 2000ms

Step 2: Fall back to Modbus TCP

If EtherNet/IP fails (error code -32 = "no connection"), the gateway tries Modbus TCP on port 502. It reads input register 800 (address 300800) which, by convention, stores the device type identifier.

Function code: 4 (Read Input Registers)
Register: 800
Count: 1
Expected: uint16 device type code

Step 3: Serial detection for Modbus RTU

If TCP protocols fail, the gateway probes the serial port for Modbus RTU devices. RTU detection is trickier because there's no auto-discovery mechanism — you must know the slave address. Production gateways typically configure a default address (slave ID 1) and attempt a read.

Serial Number Extraction

After identifying the device type, the gateway reads the equipment's serial number. This is critical for fleet management — each physical machine needs a unique identifier for cloud-side tracking.

Different equipment types store serial numbers in different registers:

Equipment TypeProtocolMonth RegisterYear RegisterUnit Register
Portable ChillerModbus TCPInput 22Input 23Input 24
Central ChillerModbus TCPHolding 520Holding 510Holding 500
TCUModbus RTUEtherNet/IPEtherNet/IPEtherNet/IP
Batch BlenderEtherNet/IPCIP tagCIP tagCIP tag

The serial number is packed into a 32-bit value:

Byte 3: Year  (0x40=2010, 0x41=2011, ...)
Byte 2: Month (0x00=Jan, 0x01=Feb, ...)
Bytes 0-1: Unit number (sequential)

Example: 0x002A0050 = January 2010, unit #80

Fallback serial generation: If the PLC doesn't have a programmed serial number (common with newly installed equipment), the gateway generates one using the router's serial number as a seed, with a prefix byte distinguishing PLCs (0x7F) from TCUs (0x7E). This ensures every device in the fleet has a unique identifier even before the serial number is programmed.

Configuration Loading by Device Type

Once the device type is known, the gateway searches for a matching configuration file. If type 1010 is detected, it loads the batch blender configuration. If type 5000, it loads the TCU configuration. If no matching configuration exists, the gateway logs an error and continues monitoring other ports.

This pattern — detect → identify → configure — means a single gateway binary handles dozens of equipment types. Adding support for a new machine is a configuration file change, not a firmware update.

With devices detected and configured, the gateway establishes its cloud connection via MQTT.

Connection Architecture

Production IIoT gateways use MQTT 3.1.1 over TLS (port 8883) for cloud connectivity. The connection setup involves:

  1. Certificate verification — the gateway validates the cloud broker's certificate against a CA root cert stored locally
  2. SAS token authentication — using a device-specific Shared Access Signature that encodes the hostname, device ID, and expiration timestamp
  3. Topic subscription — after connecting, the gateway subscribes to its command topic for receiving configuration updates and control commands from the cloud
Publish topic:  devices/{deviceId}/messages/events/
Subscribe topic: devices/{deviceId}/messages/devicebound/#
QoS: 1 (at least once delivery)

QoS 1 is the standard choice for industrial telemetry — it guarantees message delivery while avoiding the overhead and complexity of QoS 2 (exactly once). Since the data pipeline is designed to handle duplicates (via timestamp deduplication at the cloud layer), QoS 1 provides the right balance of reliability and performance.

The Async Connection Thread

MQTT connection can take 5-30 seconds depending on network conditions, DNS resolution, and TLS handshake time. A naive implementation blocks the main loop during connection, which means no PLC data is read during this time.

The solution: run mosquitto_connect_async() in a separate thread. The main loop continues reading PLC tags and buffering data while the MQTT connection establishes in the background. Once the connection callback fires, buffered data starts flowing to the cloud.

This is implemented using a semaphore-based producer-consumer pattern:

  1. Main thread prepares connection parameters and posts to a semaphore
  2. Connection thread wakes up, calls connect_async(), and signals completion
  3. Main thread checks semaphore state before attempting reconnection (prevents double-connect)

Connection Watchdog

Network connections fail. Cell modems lose signal. Cloud brokers restart. A production gateway needs a watchdog that detects stale connections and forces reconnection.

The watchdog pattern:

Every 120 seconds:
1. Check: have we received ANY confirmation from the broker?
(delivery ACK, PUBACK, SUBACK — anything)
2. If yes → connection is healthy, reset watchdog timer
3. If no → connection is stale. Destroy MQTT client and reinitiate.

The 120-second timeout is tuned for cellular networks where intermittent connectivity is expected. On wired Ethernet, you could reduce this to 30-60 seconds. The key insight: don't just check "is the TCP socket open?" — check "has the broker confirmed any data delivery recently?" A half-open socket can persist for hours without either side knowing.

Phase 4: Steady-State Tag Reading

Once PLC connections and MQTT are established, the gateway enters its main polling loop. This is where it spends 99.9% of its runtime.

The Main Loop (1-second resolution)

The core loop runs every second and performs three operations:

  1. Configuration check — detect if any configuration file has been modified (via file stat monitoring)
  2. Tag read cycle — iterate through all configured tags and read those whose polling interval has elapsed
  3. Command processing — check the incoming command queue for cloud-side instructions (config updates, manual reads, interval changes)

Interval-Based Polling

Each tag has a polling interval in seconds. The gateway maintains a monotonic clock timestamp of the last read for each tag. On each loop iteration:

for each tag in device.tags:
elapsed = now - tag.last_read_time
if elapsed >= tag.interval_sec:
read_tag(tag)
tag.last_read_time = now

Typical intervals by data category:

Data TypeIntervalRationale
Temperatures, pressures60sSlow-changing process values
Alarm states (booleans)1sImmediate awareness needed
Machine state (running/idle)1sOEE calculation accuracy
Batch counts1sProduction tracking
Version, serial number3600sStatic values, verify hourly

Compare Mode: Change-Based Delivery

For many tags, sending the same value every second is wasteful. If a chiller alarm bit is false for 8 hours straight, that's 28,800 redundant messages.

Compare mode solves this: the gateway stores the last-read value and only delivers to the cloud when the value changes. This is configured per tag:

{
"name": "Compressor Fault Alarm",
"type": "bool",
"interval": 1,
"compare": true,
"do_not_batch": true
}

This tag is read every second, but only transmitted when it changes. The do_not_batch flag means changes are sent immediately rather than waiting for the next batch finalization — critical for alarm states where latency matters.

Hourly Full Refresh

There's a subtle problem with pure change-based delivery: if a value changes while the MQTT connection is down, the cloud never learns about the transition. And if a value stays constant for days, the cloud has no heartbeat confirming the sensor is still alive.

The solution: every hour (on the hour change), the gateway resets all "read once" flags, forcing a complete re-read and re-delivery of all tags. This guarantees the cloud has fresh values at least hourly, regardless of change activity.

Phase 5: Data Batching and Delivery

Raw tag values don't get sent individually (except high-priority alarms). Instead, they're collected into batches for efficient delivery.

Binary Encoding

Production gateways use binary encoding rather than JSON to minimize bandwidth. The binary format packs values tightly:

Header:        1 byte  (0xF7 = tag values)
Group count: 4 bytes (number of timestamp groups)

Per group:
Timestamp: 4 bytes
Device type: 2 bytes
Serial num: 4 bytes
Value count: 4 bytes

Per value:
Tag ID: 2 bytes
Status: 1 byte (0x00=OK, else error code)
Array size: 1 byte (if status=OK)
Elem size: 1 byte (1, 2, or 4 bytes per element)
Data: size × count bytes

A batch containing 20 float values uses about 200 bytes in binary vs. ~2,000 bytes in JSON — a 10× bandwidth reduction that matters on cellular connections billed per megabyte.

Batch Finalization Triggers

A batch is finalized (sent to MQTT) when either:

  1. Size threshold — the batch reaches the configured maximum size (default: 4,000 bytes)
  2. Time threshold — the batch has been collecting for longer than batch_timeout_sec (default: 60 seconds)

This ensures data reaches the cloud within 60 seconds even during low-activity periods, while maximizing batch efficiency during high-activity periods (like a blender running a batch cycle that triggers many dependent tag reads).

The Paged Ring Buffer

Between the batching layer and the MQTT publish layer sits a paged ring buffer. This is the gateway's resilience layer against network outages.

The buffer divides available memory into fixed-size pages. Each page holds one or more complete MQTT messages. The buffer operates as a queue:

  • Write side: Finalized batches are written to the current work page. When a page fills up, it moves to the "used" queue.
  • Read side: When MQTT is connected, the gateway publishes the oldest used page. Upon receiving a PUBACK (delivery confirmation), the page moves to the "free" pool.
  • Overflow: If all pages are used (network down too long), the gateway overwrites the oldest used page — losing the oldest data to preserve the newest.

This design means the gateway can buffer 15-60 minutes of telemetry data during a network outage (depending on available memory and data density), then drain the buffer once connectivity restores.

Disconnect Recovery

When the MQTT connection drops:

  1. The buffer's "connected" flag is cleared
  2. All pending publish operations are halted
  3. Incoming PLC data continues to be read, batched, and buffered
  4. The MQTT async thread begins reconnection
  5. On reconnection, the buffer's "connected" flag is set, and data delivery resumes from the oldest undelivered page

This means zero data loss during short outages (up to the buffer capacity), and newest-data-preserved during long outages (the overflow policy drops oldest data first).

Phase 6: Remote Configuration and Control

A production gateway accepts commands from the cloud over its MQTT subscription topic. This enables remote management without SSH access.

Supported Command Types

CommandDirectionDescription
daemon_configCloud → DeviceUpdate central configuration (IP addresses, serial params)
device_configCloud → DeviceUpdate device-specific tag configuration
get_statusCloud → DeviceRequest current daemon/PLC/TCU status report
get_status_extCloud → DeviceRequest extended status with last tag values
read_now_plcCloud → DeviceForce immediate read of a specific tag
tag_updateCloud → DeviceChange a tag's polling interval remotely

Remote Interval Adjustment

This is a powerful production feature: the cloud can remotely change how often specific tags are polled. During a quality investigation, an engineer might temporarily increase temperature polling from 60s to 5s to capture rapid transients. After the investigation, they reset to 60s via another command.

The gateway applies interval changes immediately and persists them to the configuration file, so they survive a restart. The modified_intervals flag in status reports tells the cloud that intervals have been manually adjusted.

Designing for Constrained Hardware

These gateways often run on embedded Linux routers with severely constrained resources:

  • RAM: 64-128MB (of which 30-40MB is available after OS)
  • CPU: MIPS or ARM, 500-800 MHz, single core
  • Storage: 16-32MB flash (no disk)
  • Network: Cellular (LTE Cat 4/Cat M1) or Ethernet

Design constraints this imposes:

  1. Fixed memory allocation — allocate all buffers at startup, never malloc() during runtime. A memory fragmentation crash at 3 AM in a factory with no IT staff is unrecoverable.

  2. No floating-point unit — older MIPS processors do software float emulation. Keep float operations to a minimum; do heavy math in the cloud.

  3. Flash wear — don't write configuration changes to flash more than necessary. Batch writes, use write-ahead logging if needed.

  4. Watchdog timer — use the hardware watchdog timer. If the main loop hangs, the hardware reboots the gateway automatically.

How machineCDN Implements These Patterns

machineCDN's ACS (Auxiliary Communication System) gateway embodies all of these lifecycle patterns in a production-hardened implementation that's been running on thousands of plastics manufacturing machines for years.

The gateway runs on Teltonika RUT9XX industrial cellular routers, providing cellular connectivity for machines in facilities without available Ethernet. It supports EtherNet/IP and Modbus (both TCP and RTU) simultaneously, auto-detecting device types at boot and loading the appropriate configuration from a library of pre-built equipment profiles.

For manufacturers deploying machineCDN, the complexity described in this article — protocol detection, configuration management, MQTT buffering, recovery — is entirely handled by the platform. The result is that plant engineers get reliable, continuous telemetry from their equipment without needing to understand (or debug) the edge gateway's internal lifecycle.


Understanding how edge gateways actually work — not just what they do, but how they manage their lifecycle — is essential for building reliable IIoT infrastructure. The patterns described here (startup sequencing, multi-protocol detection, buffered delivery, watchdog recovery) separate toy deployments from production systems that run for years without intervention.

The Hidden Cost of Manual Data Collection on the Factory Floor: Why Clipboards Are Your Most Expensive Tool

· 9 min read
MachineCDN Team
Industrial IoT Experts

Walk through any manufacturing plant in 2026 and you'll still see them: clipboards. Stacks of paper forms. Operators writing down temperatures, pressures, cycle counts, and quality measurements every hour. Data that gets entered into a spreadsheet the next day — if it gets entered at all.

This ritual persists because it feels free. The forms cost pennies. The operators are already there. What's the harm in a few minutes per hour with a clipboard?

The harm is enormous. And it's invisible precisely because nobody tracks the cost of tracking.

How to Implement Multi-Zone Machine Monitoring: Organizing Your Factory Floor for Maximum Visibility

· 10 min read
MachineCDN Team
Industrial IoT Experts

Most factory floors are not organized the way IIoT platforms expect them to be. Machines are clustered by process, scattered across buildings, or arranged by historical accident — the CNC mill is next to the paint booth because that is where the power drop was when the building was renovated in 2003. When you deploy an IIoT monitoring platform, the way you organize machines into zones and locations determines whether your dashboards show actionable insight or meaningless noise.

Multi-zone machine monitoring is the practice of organizing your monitored equipment into logical groupings — by location, process area, product line, or function — so that your monitoring data tells a story your team can act on. This guide walks through how to plan, implement, and optimize a zone-based monitoring structure for manufacturing plants of any size.