
Edge Computing Architecture for IIoT: Store-and-Forward, Batch Processing, and Bandwidth Optimization [2026]

· 14 min read
MachineCDN Team
Industrial IoT Experts

Here's an uncomfortable truth about industrial IoT: your cloud platform is only as reliable as the worst cellular connection on your factory floor.

And in manufacturing environments — where concrete walls, metal enclosures, and electrical noise are the norm — that connection can drop for minutes, hours, or days. If your edge architecture doesn't account for this, you're not building an IIoT system. You're building a fair-weather dashboard that goes dark exactly when you need it most.

This guide covers the architecture patterns that separate production-grade edge gateways from science projects: store-and-forward buffering, intelligent batch processing, binary serialization, and the MQTT reliability patterns that actually work when deployed on a $200 industrial router with 256MB of RAM.

The Edge Gateway: What It Actually Does

An IIoT edge gateway sits between the plant floor and the cloud. Physically, it's often a ruggedized ARM device — Teltonika RUT series, Advantech, or similar — running embedded Linux. Its job sounds simple:

  1. Read data from PLCs via industrial protocols (Modbus RTU/TCP, EtherNet/IP)
  2. Buffer and batch that data locally
  3. Deliver batches to the cloud over MQTT

In reality, each of those steps has failure modes that can corrupt data, overflow memory, or silently lose readings during the exact moments they matter most.

Resource Constraints Are Real

A typical edge gateway runs with:

  • CPU: ARM Cortex-A7, single or dual core
  • RAM: 128-512MB (shared with OS and other services)
  • Storage: 16-128MB flash (wear-limited)
  • Network: 4G LTE modem, often metered
  • Power: 12-48V DC, sometimes with UPS

This means your edge software needs to be written in C or C++ (not Python, not Node.js), allocate memory carefully (preferably up-front at startup), and treat every byte of bandwidth as expensive.

Store-and-Forward: The Core Pattern

Store-and-forward is the foundational pattern for edge reliability. The concept: when the cloud connection is down, buffer data locally. When it comes back, drain the buffer.

Simple in theory. The implementation details are where it gets interesting.

Page-Based Ring Buffers

The most robust store-and-forward implementation uses a page-based ring buffer — a fixed region of pre-allocated memory divided into equal-sized pages that cycle in round-robin fashion.

Here's the architecture:

Memory Layout (2MB buffer, 4KB page size = 512 pages)

┌──────────┐
│ Page 0   │ ← free_pages list
├──────────┤
│ Page 1   │ ← work_page (currently collecting data)
├──────────┤
│ Page 2   │ ← used_pages list (awaiting delivery)
├──────────┤
│ Page 3   │ ← used_pages list
├──────────┤
│ ...      │
├──────────┤
│ Page 511 │ ← free_pages list
└──────────┘

Three linked lists manage the pages:

  1. Free pages: Empty pages ready for new data
  2. Work page: The page currently being written to
  3. Used pages: Filled pages waiting for MQTT delivery

The write path:

  1. Incoming data goes to the work page
  2. When the work page is full, it moves to the used pages queue
  3. A new free page becomes the work page

The delivery path:

  1. When MQTT is connected, send the first used page
  2. Wait for the MQTT PUBACK (QoS 1 acknowledgment)
  3. On confirmation, move the delivered page back to free pages
  4. Send the next used page

Handling Buffer Overflow

When all pages are used and free pages are exhausted, you have two choices:

  1. Drop new data: Simple, but you lose the most recent readings
  2. Overwrite oldest data: Reclaim the oldest used page for new writes

Option 2 is almost always correct for IIoT. Recent data is more valuable than old data, and if the connection has been down long enough to fill a 2MB buffer, the old data is already stale.

Buffer overflow recovery:
1. No free pages available
2. Extract the OLDEST used page (head of used_pages list)
3. Use it as the new work page (overwriting old data)
4. Log a WARNING — this is important for diagnostics
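The write path, delivery path, and overflow reclaim above can be sketched in C. This is a minimal single-threaded illustration with invented names (`ring_t`, `ring_write`); locking, flash persistence, and the MQTT client itself are omitted, and each write is assumed to fit in one page.

```c
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE  4096
#define PAGE_COUNT 512            /* 2MB total, pre-allocated at startup */

typedef struct page {
    struct page *next;
    size_t used;                  /* bytes written so far */
    unsigned char data[PAGE_SIZE];
} page_t;

typedef struct {
    page_t pages[PAGE_COUNT];
    page_t *free_head;            /* empty pages */
    page_t *work;                 /* page currently being filled */
    page_t *used_head, *used_tail;/* FIFO of pages awaiting delivery */
} ring_t;

static void ring_init(ring_t *r) {
    memset(r, 0, sizeof *r);
    for (int i = PAGE_COUNT - 1; i >= 0; i--) {   /* build free list */
        r->pages[i].next = r->free_head;
        r->free_head = &r->pages[i];
    }
    r->work = r->free_head;                       /* first work page */
    r->free_head = r->free_head->next;
    r->work->next = NULL;
}

/* Write path: append data; rotate a full work page into used_pages. */
static void ring_write(ring_t *r, const void *buf, size_t len) {
    if (r->work->used + len > PAGE_SIZE) {
        r->work->next = NULL;                     /* append to used FIFO */
        if (r->used_tail) r->used_tail->next = r->work;
        else              r->used_head = r->work;
        r->used_tail = r->work;

        if (!r->free_head) {
            /* Overflow: reclaim the OLDEST used page, overwriting its
               data. Real code logs a WARNING here. */
            page_t *oldest = r->used_head;
            r->used_head = oldest->next;
            if (!r->used_head) r->used_tail = NULL;
            oldest->next = r->free_head;
            r->free_head = oldest;
        }
        r->work = r->free_head;                   /* fresh work page */
        r->free_head = r->free_head->next;
        r->work->next = NULL;
        r->work->used = 0;
    }
    memcpy(r->work->data + r->work->used, buf, len);
    r->work->used += len;
}

/* Delivery path: peek the oldest used page; on PUBACK, recycle it. */
static page_t *ring_peek(ring_t *r) { return r->used_head; }

static void ring_confirm(ring_t *r) {             /* call after PUBACK */
    page_t *p = r->used_head;
    r->used_head = p->next;
    if (!r->used_head) r->used_tail = NULL;
    p->used = 0;
    p->next = r->free_head;                       /* back to free list */
    r->free_head = p;
}
```

Because the page is recycled only in `ring_confirm`, an unacknowledged publish leaves it at the head of the used list, which is exactly the QoS 1 retry behavior described later.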

Thread Safety

In a real-time system, data collection (writing to the buffer) and MQTT delivery (reading from the buffer) happen on different threads. The buffer needs mutex protection, but the lock granularity matters:

  • Lock per operation (add data, send next, confirm delivery): Minimal contention
  • Never hold the lock during I/O: MQTT sends can take 100ms+; holding a lock that long blocks data collection

A well-designed buffer uses the mutex only to update the page list pointers — never during the actual MQTT publish call.
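The locking discipline can be shown with a hypothetical `deliverer_t` and a stand-in publish function: the mutex brackets only the pointer reads and updates, never the blocking publish call.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical delivery state: `pending` stands in for the head of
   the used_pages list. The mutex guards ONLY this pointer. */
typedef struct {
    pthread_mutex_t lock;
    const char *pending;
} deliverer_t;

typedef int (*publish_fn)(const char *payload);

/* Stand-in for a real MQTT client publish (which may block 100ms+). */
static int stub_publish(const char *payload) { (void)payload; return 1; }

/* Shape: lock → grab pointer → unlock → slow I/O → lock → update. */
static int deliver_next(deliverer_t *d, publish_fn publish) {
    pthread_mutex_lock(&d->lock);
    const char *payload = d->pending;   /* read shared state under lock */
    pthread_mutex_unlock(&d->lock);     /* release BEFORE the publish */

    if (!payload) return 0;
    int ok = publish(payload);          /* no lock held during I/O */

    if (ok) {
        pthread_mutex_lock(&d->lock);   /* re-acquire only to update */
        d->pending = NULL;
        pthread_mutex_unlock(&d->lock);
    }
    return ok;
}
```

With this shape, the data-collection thread can keep appending to the buffer even while a publish stalls on a slow cellular link.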

Batch Processing: Why and How

Sending every register value as an individual MQTT message is catastrophically inefficient. With 200 tags polled every second, that's 200 messages/second — each with MQTT headers, TLS overhead, and round-trip confirmation.

The Batch Collection Window

Industrial data naturally groups into collection windows — a set of readings taken at approximately the same time. A batch collects all tag readings within a window, then delivers them as a single payload.

Two triggers close a batch:

  1. Size limit: When the batch exceeds a configurable maximum (typically 4KB)
  2. Time limit: When the collection window expires (typically 60 seconds)

Batch lifecycle:

t=0s    Start new batch
t=1s    Add 15 tag values (PLC read cycle)
t=2s    Add 15 tag values
...
t=60s   Time limit hit → finalize and deliver batch
t=60s   Start new batch

OR

t=0s    Start new batch
t=1s    Add 200 tag values
t=2s    Add 200 tag values (batch size > 4KB)
t=2s    Size limit hit → finalize and deliver batch
t=2s    Start new batch
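The two close triggers reduce to a small predicate checked once per read cycle. Names and constants here are illustrative sketches of the limits described above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

#define BATCH_MAX_BYTES 4096   /* size limit (typical, configurable) */
#define BATCH_WINDOW_S  60     /* time limit (typical, configurable) */

typedef struct {
    size_t bytes;              /* serialized payload size so far */
    time_t started;            /* when this batch was opened */
} batch_t;

/* True when either trigger fires: finalize the batch, hand it to the
   store-and-forward buffer, then open a fresh one. */
static bool batch_should_close(const batch_t *b, time_t now) {
    return b->bytes > BATCH_MAX_BYTES ||
           (now - b->started) >= BATCH_WINDOW_S;
}
```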

Grouping by Timestamp

Within a batch, readings are organized into groups — each group representing a single poll cycle with a shared timestamp. This eliminates timestamp redundancy:

{
  "groups": [
    {
      "ts": 1709150400,
      "device_type": 1017,
      "serial_number": 2234567,
      "values": [
        {"id": 1, "values": [true]},
        {"id": 2, "values": [23.5]},
        {"id": 3, "values": [145, 148, 152]}
      ]
    },
    {
      "ts": 1709150401,
      "device_type": 1017,
      "serial_number": 2234567,
      "values": [
        {"id": 1, "values": [true]},
        {"id": 2, "values": [23.6]}
      ]
    }
  ]
}

Bypassing the Batch for Critical Events

Some data cannot wait for the batch window to close. Alarm transitions, run/stop changes, and link state changes need immediate delivery — they're sent as standalone messages that bypass the batch entirely.

The decision tree:

New tag value received:
├── Is do_not_batch flag set?
│   └── YES → Wrap in a single-value batch → deliver immediately
├── Has value changed AND has dependent tags?
│   └── YES → Flush current batch → read dependents → deliver all
└── NO → Add to current batch → continue

This ensures that when a compressor trips, the alarm reaches the cloud in less than 1 second — not 60 seconds later when the batch window closes.
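The decision tree maps directly onto a small routing function. The `tag_event_t` fields and route names below are illustrative, not a standard API.

```c
#include <stdbool.h>

/* Per-reading flags driving the routing decision. */
typedef struct {
    bool do_not_batch;     /* alarms, run/stop, link state */
    bool changed;          /* value differs from previous reading */
    bool has_dependents;   /* parent of a dependent-tag chain */
} tag_event_t;

typedef enum {
    ROUTE_IMMEDIATE,           /* single-value batch, deliver now */
    ROUTE_FLUSH_AND_READ_DEPS, /* flush batch, read dependents, deliver */
    ROUTE_BATCH                /* append to the current batch */
} route_t;

/* Direct transcription of the decision tree above. */
static route_t route_tag(const tag_event_t *e) {
    if (e->do_not_batch)                 return ROUTE_IMMEDIATE;
    if (e->changed && e->has_dependents) return ROUTE_FLUSH_AND_READ_DEPS;
    return ROUTE_BATCH;
}
```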

Binary vs. JSON Serialization

JSON is human-readable and easy to debug. It's also 3-10x larger than the same data in binary. On a metered cellular connection at $0.01/MB, that difference matters at scale.

JSON Format

Advantages:

  • Self-describing (field names included)
  • Easy to debug with tcpdump/Wireshark
  • Cloud-side parsing is trivial

Disadvantages:

  • High overhead: {"id":1,"values":[23.5]} = 27 bytes for a single float
  • String conversions for every value
  • No fixed schema — parsing is error-prone

Binary Format

A compact binary encoding uses a fixed structure:

Batch Header:
[1 byte] Magic marker (0xF7)
[4 bytes] Number of groups (big-endian uint32)

Group Header:
[4 bytes] Timestamp (unix epoch, big-endian)
[2 bytes] Device type (big-endian uint16)
[4 bytes] Serial number (big-endian uint32)
[4 bytes] Number of values (big-endian uint32)

Value Entry:
[2 bytes] Tag ID (big-endian uint16)
[1 byte] Status (0 = OK, >0 = error code)
[1 byte] Values count
[1 byte] Value size (1, 2, or 4 bytes)
[N bytes] Values data

The same single float value {"id":1,"values":[23.5]} now takes:

Tag ID:      00 01        (2 bytes)
Status:      00           (1 byte)
Count:       01           (1 byte)
Value size:  04           (1 byte)
Value:       41 BC 00 00  (4 bytes, IEEE 754)
Total:       9 bytes
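An encoder for one value entry in this layout can be sketched as follows. It assumes `float` is a 4-byte IEEE 754 single (true on the ARM and x86 targets discussed here) and writes every field big-endian by hand, so the output is independent of host endianness.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Encode one value entry: tag ID, status, count, value size, payload.
   Returns the number of bytes written. */
static size_t encode_value_entry(uint8_t *out, uint16_t tag_id,
                                 uint8_t status, float value) {
    size_t n = 0;
    out[n++] = (uint8_t)(tag_id >> 8);    /* tag ID, big-endian uint16 */
    out[n++] = (uint8_t)(tag_id & 0xFF);
    out[n++] = status;                    /* 0 = OK, >0 = error code */
    out[n++] = 1;                         /* values count */
    out[n++] = 4;                         /* value size: 4-byte float */

    uint32_t bits;
    memcpy(&bits, &value, sizeof bits);   /* IEEE 754 bit pattern */
    out[n++] = (uint8_t)(bits >> 24);     /* big-endian payload */
    out[n++] = (uint8_t)(bits >> 16);
    out[n++] = (uint8_t)(bits >> 8);
    out[n++] = (uint8_t)(bits & 0xFF);
    return n;                             /* 9 bytes, matching the text */
}
```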

That's 9 bytes vs. 27 bytes — a 3x reduction. For a batch of 200 values, the difference is roughly 1.8KB vs. 5.4KB per batch. Over a month of 60-second batches, that's:

  • Binary: ~78MB/month
  • JSON: ~234MB/month

At cellular data rates, that's a meaningful cost difference.

When to Use Which

| Scenario                                  | Format         | Reasoning                     |
|-------------------------------------------|----------------|-------------------------------|
| Development / debugging                   | JSON           | Human readable                |
| Cellular / satellite connection           | Binary         | Bandwidth critical            |
| WiFi / Ethernet backhaul                  | JSON or Binary | Either works                  |
| Resource-constrained gateway (<256MB RAM) | Binary         | Less memory for serialization |
| Multi-vendor cloud integration            | JSON           | Universal parsing             |

Most production deployments start with JSON during commissioning and switch to binary once the system is validated.

MQTT Reliability at the Edge

MQTT is the standard transport for IIoT edge-to-cloud communication. But the default configuration is wrong for industrial use.

QoS 1: The Right Choice for IIoT

MQTT defines three QoS levels:

  • QoS 0 (At most once): Fire and forget. Message may be lost.
  • QoS 1 (At least once): Guaranteed delivery with possible duplicates.
  • QoS 2 (Exactly once): No duplicates, but a four-packet handshake (twice the round trips of QoS 1).

For IIoT data, QoS 1 is the sweet spot. QoS 0 risks data loss during network blips. QoS 2's overhead is unjustifiable when your cloud platform can deduplicate by timestamp + device ID anyway.

The key behaviors with QoS 1:

Normal flow:
Gateway → PUBLISH → Broker
Gateway ← PUBACK ← Broker
Buffer: Mark page as delivered, move to free list

Network failure:
Gateway → PUBLISH → [lost]
[no PUBACK received]
Buffer: Page stays in used list
[network recovers]
Gateway → PUBLISH → Broker (re-sends)
Gateway ← PUBACK ← Broker
Buffer: Mark as delivered

The Async Connection Pattern

MQTT connections can block for 30+ seconds during DNS resolution or TLS handshake over slow cellular. Running this on the main thread freezes data collection.

The solution: async connection in a dedicated thread.

Main Thread               Async Connect Thread
     │                         │
     ├─ Signal connect ───────►│
     ├─ Continue reading       │
     ├─ Continue reading       ├── DNS resolve...
     ├─ Continue reading       ├── TCP connect...
     ├─ Continue reading       ├── TLS handshake...
     ├─ Continue reading       ├── MQTT CONNECT...
     ├─ On connect callback ◄──┤
     ├─ Start delivering       │
     │    buffered data        │

This pattern ensures that PLC reads happen on a strict 1-second cadence regardless of MQTT connection state.

The Watchdog Timer

What happens when MQTT appears connected but data isn't actually being delivered? This occurs more often than you'd think — the TCP connection is open, the broker isn't sending DISCONNECTs, but PUBACK confirmations have stopped.

A watchdog timer catches this:

Every 120 seconds:
    Check: has any packet been delivered since the last check?

    YES → reset the watchdog timer
    NO  → force MQTT disconnect → reconnect cycle

The 120-second window gives enough slack for normal QoS 1 retransmissions while catching dead connections before the buffer fills up.
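One way to sketch the watchdog is as a pure function of time and a cumulative delivery counter, called once per main-loop iteration. The `watchdog_t` name and shape are ours, not a specific product API.

```c
#include <stdbool.h>
#include <time.h>

#define WATCHDOG_PERIOD_S 120

typedef struct {
    time_t last_check;             /* when the window last closed */
    unsigned delivered_at_check;   /* delivery counter snapshot */
} watchdog_t;

/* Call once per main-loop iteration with the cumulative count of
   PUBACK-confirmed packets. Returns true when a full 120s window
   passed with zero progress, i.e. the connection should be torn
   down and rebuilt. */
static bool watchdog_tick(watchdog_t *w, time_t now, unsigned delivered) {
    if (now - w->last_check < WATCHDOG_PERIOD_S)
        return false;                          /* window still open */
    bool stale = (delivered == w->delivered_at_check);
    w->last_check = now;
    w->delivered_at_check = delivered;
    return stale;                              /* no PUBACKs: reconnect */
}
```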

SAS Token Expiration (Azure IoT Hub)

If you're connecting to Azure IoT Hub, SAS tokens have a finite lifetime. The gateway must:

  1. Parse the se= (expiry) timestamp from the connection string
  2. Compare against current time at each connection attempt
  3. Log a WARNING when the token is approaching expiry
  4. Refuse to connect (and alert) when the token has expired

Expired tokens cause silent connection failures — the broker accepts the TCP connection but drops the MQTT CONNECT. Without explicit expiry checking, this looks like a network problem.
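A sketch of the expiry check, assuming the token carries an `se=<unix-epoch>` field as Azure SAS tokens do. The `strstr` scan is deliberately naive (it would also match a field merely ending in "se="); a production version should split on `&` and compare field names exactly.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Extract the se= (expiry, unix epoch) field from a SAS token or
   connection string. Returns 0 when the field is absent. */
static time_t sas_expiry(const char *token) {
    const char *p = strstr(token, "se=");
    return p ? (time_t)strtoll(p + 3, NULL, 10) : 0;
}

/* True when the token is past expiry: the gateway should refuse to
   connect and raise an alert instead of retrying blindly. */
static bool sas_expired(const char *token, time_t now) {
    time_t se = sas_expiry(token);
    return se != 0 && now >= se;
}
```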

Multi-Protocol Device Handling

Real factories don't standardize on one protocol. A single gateway might connect to:

  • EtherNet/IP for Allen-Bradley PLCs (blenders, conveyors)
  • Modbus TCP for chillers and dryers
  • Modbus RTU (serial RS-485) for temperature control units

Each protocol requires different connection management:

EtherNet/IP (CIP over TCP)

  • Uses libplctag library with protocol=ab-eip connection strings
  • Tags addressed by name (e.g., capacity_utilization, energy_consumption)
  • 2-second read timeout per tag
  • Error code -32 indicates link failure → stop reading until reconnect

Modbus TCP

  • TCP socket to PLC's port 502
  • Function code determines register type (FC 01-04)
  • Registers are batched by contiguous address ranges
  • 50ms delay between batch reads
  • On ETIMEDOUT/ECONNRESET → close socket, reconnect

Modbus RTU

  • Serial port (RS-232/RS-485) at configured baud rate (typically 9600)
  • Same function codes as TCP, but with CRC framing
  • Slave address must be set before first read
  • Byte timeout (inter-character) and response timeout must be tuned per device
  • Serial buffer flush before read cycle to clear stale data

Auto-Detection

On startup, the gateway can attempt to auto-detect the PLC type:

  1. Try EtherNet/IP connection → read a "device_type" tag
  2. If that fails (error -32), try Modbus TCP → read input register 800
  3. Based on the device type value, load the correct tag configuration

This auto-detection allows the same gateway firmware to work with any supported machine type without manual configuration.

Configuration Management Over-the-Air

Edge gateways deployed in factories can't be physically accessed for configuration changes. The gateway needs to accept configuration updates via its MQTT subscription channel.

Hot-Reload Without Restart

The gateway monitors its configuration files for changes. When a file modification timestamp changes:

  1. Parse the new configuration
  2. Destroy existing PLC tag handles
  3. Rebuild tag lists with new parameters
  4. Resume reading

This file-watching approach also works for over-the-air updates: the cloud sends a new configuration JSON via MQTT, the gateway writes it to disk, and the main loop detects the change on the next iteration.

Configuration Versioning

Every device configuration includes a version hash. This serves two purposes:

  1. Change detection: The cloud can compare the reported version against the expected version
  2. Rollback: If a configuration update breaks communication, operators can push the previous version

Bandwidth Optimization Strategies

Beyond binary serialization, several strategies reduce bandwidth consumption:

1. Change-Only Transmission

Compare each new value against the previous reading. Only include changed values in the batch. For slowly-changing process values, this can reduce batch sizes by 50-80%.
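Change-only transmission is essentially a deadband compare against the last transmitted snapshot (not the last polled value). A sketch with illustrative names:

```c
#include <stddef.h>

/* Keep only readings that moved by more than `deadband` since the
   last TRANSMITTED value, updating the snapshot for kept readings.
   Writes the kept indices to out_idx and returns how many. */
static size_t filter_changed(const float *current, float *last_sent,
                             size_t n, float deadband, size_t *out_idx) {
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        float d = current[i] - last_sent[i];
        if (d < 0) d = -d;               /* absolute deviation */
        if (d > deadband) {
            last_sent[i] = current[i];   /* snapshot what we send */
            out_idx[kept++] = i;
        }
    }
    return kept;
}
```

Comparing against the last sent value rather than the last polled one prevents a slow drift from slipping under the threshold one poll at a time; the hourly forced read (below in the text) is the backstop for values that never cross it.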

2. Adaptive Poll Intervals

Not all tags need the same poll rate:

| Tag Type                               | Interval     | Rationale                           |
|----------------------------------------|--------------|-------------------------------------|
| Alarms, run/stop                       | 1 second     | Safety-critical, instant visibility |
| Process values (temp, pressure)        | 60 seconds   | Trends visible, not chatty          |
| Static info (serial numbers, firmware) | 3600 seconds | Rarely changes                      |

3. Dependent Tag Chains

Some tags only matter when a parent tag changes. For example, hopper-level details are only relevant when the blender run status changes. The gateway defines dependent tag lists that are only read when the parent value transitions:

Parent: blender_run_status (poll every 1s, compare mode)
└── Dependents (only read when parent changes):
    ├── hopper_1_level
    ├── hopper_2_level
    ├── hopper_3_level
    ├── feeder_1_rpm
    ├── feeder_2_rpm
    └── feeder_3_rpm

When the blender is running steadily, only the parent tag generates traffic. When it stops or starts, the full set of dependent tags is read and delivered immediately.
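The chain logic reduces to a transition check on the parent value. This sketch returns the number of dependent reads to issue and leaves the protocol-specific reads to the caller; `dep_chain_t` is an invented shape, not a standard API.

```c
#include <stddef.h>

/* One dependent-tag rule: a parent tag and the tags to burst-read
   whenever its value transitions. */
typedef struct {
    int parent_id;            /* e.g. blender_run_status */
    int last_value;           /* last polled parent value */
    const int *dependent_ids; /* tags to read on transition */
    size_t dependent_count;
} dep_chain_t;

/* Given the freshly polled parent value, return how many dependent
   reads the caller must issue this cycle: 0 in steady state, the
   full dependent set on a transition. */
static size_t chain_on_parent_value(dep_chain_t *c, int value) {
    if (value == c->last_value) return 0;   /* steady: no extra traffic */
    c->last_value = value;
    return c->dependent_count;              /* burst-read + deliver now */
}
```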

4. Hourly Forced Read

Despite change-only transmission, the gateway forces a full re-read of all tags every hour. This ensures:

  • Sensor drift is detected (slow changes below the comparison threshold)
  • "Stuck" values are confirmed as genuinely unchanged
  • Cloud time-series databases have periodic data points for interpolation
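The forced re-read is a simple deadline check the main loop consults before applying change-only filtering (constants and names illustrative):

```c
#include <stdbool.h>
#include <time.h>

#define FORCE_READ_PERIOD_S 3600   /* hourly full re-read */

/* Deadline check consulted each main-loop iteration. When it fires,
   the caller reads ALL tags and bypasses change-only filtering for
   that one cycle; the next deadline is armed automatically. */
static bool force_read_due(time_t *next_due, time_t now) {
    if (now < *next_due) return false;
    *next_due = now + FORCE_READ_PERIOD_S;
    return true;
}
```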

Putting It All Together: The Main Loop

A production edge gateway runs a surprisingly simple main loop:

1. Initialize: allocate buffers, start MQTT async thread
2. Load device configuration
3. Auto-detect PLC (EtherNet/IP or Modbus)
4. Enter main loop (runs forever):
   a. Check if configuration file changed → reload
   b. Read PLC tags (respecting poll intervals)
   c. Read TCU tags (if present)
   d. Process incoming MQTT commands (config updates, status requests)
   e. Check MQTT watchdog timer → reconnect if stale
   f. Sleep 1 second

Each iteration takes under 100ms of CPU time on an ARM processor. The 1-second sleep cadence ensures consistent timing while leaving over 90% of CPU available for the OS, MQTT threads, and other services.

How machineCDN Implements Edge Architecture

machineCDN's edge gateway implements all of these patterns in a production C daemon that runs on standard industrial routers. The system supports:

  • Multi-protocol: EtherNet/IP, Modbus TCP, and Modbus RTU from a single gateway
  • 12+ machine types: Chillers, blenders, dryers, granulators, conveying systems, and more — each with pre-built tag configurations
  • 2MB page-based ring buffer: Survives extended network outages without data loss
  • Binary batch encoding: 3x bandwidth reduction vs. JSON
  • QoS 1 MQTT with async connect: Data collection never pauses during reconnection
  • Over-the-air configuration: Tag lists, poll intervals, and scaling factors updated without physical access
  • Watchdog recovery: Auto-detect stale connections and force reconnect within 120 seconds

The result: plant managers get continuous machine telemetry over standard cellular connections, even in facilities where WiFi coverage is spotty and VPN tunnels are impractical.

Conclusion

Edge computing for IIoT isn't about running machine learning models on a router. It's about the engineering fundamentals: reliable data collection, efficient buffering, smart batching, and resilient cloud delivery.

The architecture patterns covered here — page-based ring buffers, batch collection windows, binary serialization, async MQTT connections, and dependent tag chains — aren't theoretical. They're the result of deploying hundreds of gateways across manufacturing facilities where a missed alarm or corrupted temperature reading has real operational consequences.

Build your edge layer to survive the worst-case network conditions, and everything upstream — analytics, alerts, dashboards — just works. Build it for the happy path, and you'll spend your career debugging intermittent data gaps that only happen on third shift.