Messaging System Technical Analysis & Architecture Report

This report analyzes the messaging system implemented in the multi-agent-mux-delegate-job skill. It covers the MQTT broker architecture, the event protocol, the job lifecycle, code internals, cross-system integration, and known limitations with production recommendations.


1. MQTT Broker Architecture: PoC vs. TLS Production

The messaging system is designed so that moving from a Proof of Concept (PoC) on a public broker to a secured, authenticated, and encrypted private production cluster is a configuration change only. All broker settings are resolved dynamically from the environment or overridden at the job level, so the deployment cut-over requires no code modifications.

1.1 PoC Architecture (Public Sandbox)

In the initial development/testing phase, the system defaults to the public broker hosted by HiveMQ:

  • Host/IP: broker.hivemq.com
  • Protocol/Port: Plaintext MQTT over TCP on port 1883.
  • Security & Auth: None. No username, password, TLS encryption, or access control list (ACL) constraints are applied.
  • QoS Level: QoS 1 (At Least Once) is requested for publishes and subscriptions, so the broker acknowledges each publish with a PUBACK at the MQTT protocol level.

Risks and Limitations of the PoC Setup:

  1. Zero Eavesdropping Protection: Because the broker is public and unencrypted, any internet user can subscribe to the root topic (python/mqtt/jobs/#) and read the exact prompt, agent sessions, and intermediate progress events.
  2. Event Spoofing & Injection: Anyone can publish messages to any job topic. An attacker could publish a malicious completed or error event, prematurely terminating a running subscriber or causing the delegator to execute unauthorized post-validation hooks.
  3. No Message Persistence: Public brokers do not guarantee queue persistence or durable sessions for disconnected clients. If a subscriber briefly drops offline, QoS 1 messages published during the disconnect window may be discarded.
  4. Rate Limiting & Reliability: Public sandboxes are subject to arbitrary rate limits, traffic spikes, and transient connection resets, leading to network-level timeouts.

1.2 Production Architecture (Secure Private Broker)

For production deployments, the system is designed to run on a private, self-hosted MQTT 5.0 broker such as Mosquitto or EMQX.

graph TD
    subgraph "Secure Corporate Network"
        Broker["Private MQTT Broker (Mosquitto/EMQX) <br> Ports: 8883 (TLS)"]
        
        subgraph "Hermes (Delegator/Orchestrator)"
            SubClient["job_subscriber.py <br> (Role: subscriber)"]
        end
        
        subgraph "Tmux Workspace (Agent Host)"
            PubClient["publish_event.py <br> (Role: publisher)"]
        end
        
        SubClient -- "Subscribe (QoS 1) <br> Auth: hermes <br> ACL: Read jobs/+/events" --> Broker
        PubClient -- "Publish (QoS 1 + Retain Terminal) <br> Auth: claude-worker <br> ACL: Write jobs/+/events" --> Broker
    end

Production Security & Hardening Controls:

  1. Transport Layer Security (TLS 1.3): Traffic is encrypted over port 8883 using certificates issued by a private Certificate Authority (CA). The orchestrator validates the broker certificate against MQTT_CA_CERTS (the CA bundle path). Mutual TLS (mTLS) is optionally supported via a client certificate and private key (MQTT_CERTFILE/MQTT_KEYFILE), which gives each client a cryptographic device identity.
  2. Strict Client Authentication: All clients must supply credentials (MQTT_USERNAME / MQTT_PASSWORD) to establish a connection. Anonymous logins are explicitly disabled (allow_anonymous false).
  3. Role-Based Topic Access Control Lists (ACLs):
    • Orchestrator/Hermes (Subscriber): Authenticates as user hermes with read-only access to all event streams:
      user hermes
      topic read python/mqtt/jobs/+/events
      
    • Agent/Worker (Publisher): Authenticates as user claude-worker with write-only access restricted to the job event sub-topics:
      user claude-worker
      topic write python/mqtt/jobs/+/events
      
      This prevents workers from eavesdropping on sister agents or intercepting commands on other jobs.
  4. Durable Message Queues & Session State:
    • The broker is configured with persistence true and a dedicated disk storage path.
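    • Subscribers connect with a persistent session (clean_session=False in MQTT 3.1.1; Clean Start disabled with a non-zero Session Expiry Interval in MQTT 5.0) so that the broker buffers QoS 1 messages during temporary network drops.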
  5. Retained Terminal Events: Terminal events (completed/error) are published with the retain=True flag. This allows a late-joining or recovering subscriber to instantly retrieve the final job status without waiting for active transmissions.
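
The ACL entries above rely on the single-level wildcard `+`, and the PoC risk list mentions the multi-level wildcard `#`. As a rough illustration of how such filters select topics, here is a minimal sketch of MQTT topic-filter matching. The function name `topic_matches` is hypothetical and not part of this codebase; the sketch ignores `$`-prefixed system topics and does not validate filter syntax.

```python
def topic_matches(topic_filter: str, topic: str) -> bool:
    """Return True if `topic` is selected by an MQTT filter using + and #."""
    filter_levels = topic_filter.split("/")
    topic_levels = topic.split("/")
    for index, level in enumerate(filter_levels):
        if level == "#":
            # Multi-level wildcard: matches the parent level and everything below.
            return True
        if index >= len(topic_levels):
            # The filter is longer than the topic.
            return False
        if level != "+" and level != topic_levels[index]:
            # A literal level must match exactly; + matches any single level.
            return False
    # Without a trailing #, both must have the same number of levels.
    return len(filter_levels) == len(topic_levels)


acl_filter = "python/mqtt/jobs/+/events"
print(topic_matches(acl_filter, "python/mqtt/jobs/918b0612/events"))
print(topic_matches(acl_filter, "python/mqtt/jobs/918b0612/commands"))
```

Under this matching rule, a worker restricted to the `+/events` filter cannot touch any sibling sub-topic of a job, while a client subscribed to `python/mqtt/jobs/#` sees every job's traffic.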

1.3 Production Mosquitto Configuration Reference

A hardened /etc/mosquitto/mosquitto.conf production configuration includes:

# Persistence settings
persistence true
persistence_location /var/lib/mosquitto/

# Authentication and Authorization
password_file /etc/mosquitto/auth/passwd
acl_file      /etc/mosquitto/auth/acl
allow_anonymous false

# Listener and TLS Configuration
listener 8883
cafile   /etc/mosquitto/certs/ca.crt
certfile /etc/mosquitto/certs/server.crt
keyfile  /etc/mosquitto/certs/server.key
tls_version tlsv1.3
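
On the client side, a matching TLS setup could look like the following sketch, which uses only the Python standard library. The helper name `build_tls_context` and its parameters are assumptions for illustration; the actual wiring in mqtt_common.py is not shown in this report.

```python
import ssl
from typing import Optional


def build_tls_context(ca_certs: Optional[str] = None,
                      certfile: Optional[str] = None,
                      keyfile: Optional[str] = None) -> ssl.SSLContext:
    """Build a client context that verifies the broker and requires TLS 1.3."""
    # create_default_context enables certificate and hostname verification.
    # With ca_certs=None it falls back to the system trust store.
    context = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=ca_certs)
    # Refuse anything older than TLS 1.3, mirroring `tls_version tlsv1.3`.
    context.minimum_version = ssl.TLSVersion.TLSv1_3
    if certfile:
        # Optional mTLS: present a client certificate and private key.
        context.load_cert_chain(certfile=certfile, keyfile=keyfile)
    return context


context = build_tls_context()
print(context.minimum_version.name, context.verify_mode.name)
```

In a deployment the three arguments would typically come from MQTT_CA_CERTS, MQTT_CERTFILE, and MQTT_KEYFILE. With paho-mqtt, a context like this can be handed to the client via `Client.tls_set_context()`.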

2. Event Protocol Specification

The event protocol defines a strict, single-direction JSON wire schema. It acts as the contract between the worker agent (the publisher) and the delegator/orchestrator (the subscriber).

2.1 Wire Schema (JSON UTF-8, schema_version = 1)

Every event payload must adhere to the following schema structure:

{
  "schema_version": 1,
  "seq": 2,
  "job_id": "918b0612",
  "event": "progress",
  "timestamp": "2026-06-20T14:48:58Z",
  "detail": "Section 1: MQTT Broker Architecture completed",
  "data": {
    "auth_token": "URL-safe-base64-random-token-here",
    "custom_metric": 42
  }
}

2.2 Wire Schema Field Dictionary

| Field | Type | Required | Description & Validation Rules |
|---|---|---|---|
| schema_version | Integer | Yes | Must be exactly 1. Subscribers discard payloads with mismatched version numbers to prevent parser crashes on schema drift. |
| seq | Integer | Yes | Monotonic sequence counter starting at 1 for the first publish. Incremented and stored in the job's registry file (last_seq) to survive agent pane crashes. |
| job_id | String | Yes | The 8-character hex string identifying the target job. Subscribers discard any messages whose job_id is unexpected or unrequested. |
| event | String | Yes | The event classification: started, progress, permission_required, completed, or error. |
| timestamp | String | Yes | ISO-8601 UTC timestamp with a trailing Z suffix. (Advisory only; never trusted for timeouts.) |
| detail | String | Yes | Generalized, safe text description. Strict rules prohibit absolute paths, workspace paths, passwords, or raw environment variables. |
| data | Object | Yes | Metadata dictionary. Used in production to pass auth_token or structured execution metrics. |
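
The field rules above can be expressed as a small validator. This is a sketch under stated assumptions, not the subscriber's actual code: `validate_event` is a hypothetical name, hex digits are accepted in either case, and the timestamp is checked with `datetime.fromisoformat` after swapping the `Z` suffix for `+00:00`. The `started` check reflects the rule that a started event carries seq 1.

```python
from datetime import datetime
from typing import Any, Dict, List

EVENT_TYPES = {"started", "progress", "permission_required", "completed", "error"}
HEX_DIGITS = set("0123456789abcdefABCDEF")


def _is_int(value: Any) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def _is_utc_timestamp(value: Any) -> bool:
    if not isinstance(value, str) or not value.endswith("Z"):
        return False
    try:
        datetime.fromisoformat(value[:-1] + "+00:00")
    except ValueError:
        return False
    return True


def validate_event(payload: Dict[str, Any]) -> List[str]:
    """Return a list of schema violations; an empty list means the payload is valid."""
    errors = []
    version = payload.get("schema_version")
    if not _is_int(version) or version != 1:
        errors.append("schema_version must be exactly 1")
    seq = payload.get("seq")
    if not _is_int(seq) or seq < 1:
        errors.append("seq must be an integer >= 1")
    job_id = payload.get("job_id")
    if not (isinstance(job_id, str) and len(job_id) == 8 and set(job_id) <= HEX_DIGITS):
        errors.append("job_id must be an 8-character hex string")
    event = payload.get("event")
    if not isinstance(event, str) or event not in EVENT_TYPES:
        errors.append("event must be one of the five known event types")
    elif event == "started" and seq != 1:
        errors.append("started must carry seq 1")
    if not _is_utc_timestamp(payload.get("timestamp")):
        errors.append("timestamp must be ISO-8601 UTC with a trailing Z")
    if not isinstance(payload.get("detail"), str):
        errors.append("detail must be a string")
    if not isinstance(payload.get("data"), dict):
        errors.append("data must be an object")
    return errors


sample = {
    "schema_version": 1, "seq": 2, "job_id": "918b0612", "event": "progress",
    "timestamp": "2026-06-20T14:48:58Z",
    "detail": "Section 1: MQTT Broker Architecture completed", "data": {},
}
print(validate_event(sample))  # prints []
```

Returning a list of messages rather than raising on the first problem makes it easier to log every reason a payload was discarded.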

2.3 Event Type Dictionary and Schemas

1. started

  • Emit Trigger: Emitted by the worker agent immediately upon boot inside the tmux session, indicating it has parsed the instructions and started execution.
  • Payload Constraints: seq must be 1. Status in registry is transitioned to running.
  • Example Detail: "Job 918b0612 started"

2. progress

  • Emit Trigger: Optional. Emitted at major check-points, long loops, or sub-task boundaries.
  • Payload Constraints: None.
  • Example Detail: "Section 1: MQTT Broker Architecture completed"

3. permission_required

  • Emit Trigger: Emitted when the agent needs human confirmation (e.g. to run a destructive command or read/write critical system files).
  • Payload Constraints: detail contains the resource/action requested.
  • Example Detail: "needs write permission to MESSAGING.md"

4. completed (Terminal)

  • Emit Trigger: Successful job completion. The agent has generated all expected artifacts and verified correctness.
  • Payload Constraints: Must be the final event. Published with retain=True.
  • Example Detail: "deep report written and committed to git"

5. error (Terminal)

  • Emit Trigger: Terminal failure. The agent encountered an unhandled exception, a syntax error, or a failed validation script.
  • Payload Constraints: Must be the final event. Published with retain=True.
  • Example Detail: "validation fail: missing files"

2.4 Integrity and Authentication Verification (HMAC-SHA256 Signatures)

To prevent unauthorized users from hijacking or spoofing events on public brokers:

  1. When a job is registered, a cryptographic token (auth_token) is generated (secrets.token_urlsafe(32)).
  2. The publisher reads this token and signs the JSON payload. Specifically, the publisher calculates an HMAC-SHA256 signature using the auth_token as the secret key over the serialized payload (with the hmac_sig field excluded).
  3. The signature is attached as data.hmac_sig on the wire.
  4. The subscriber (job_subscriber.py) reads the expected auth_token from the local registry and verifies the HMAC signature. Any message with a missing, invalid, or mismatched signature is discarded immediately with an "HMAC verify failed" log.
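  5. The plaintext auth_token is never transmitted on the wire, which prevents token interception. Because subscribers discard events without a valid signature, all publishers and subscribers must be updated simultaneously during deployment rollout; otherwise events from publishers that do not yet sign their payloads are dropped.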
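
The signing and verification steps above can be sketched with the standard library as follows. The canonical serialization (sorted keys, compact separators) and the hex encoding of the signature are assumptions made for this example; the report does not specify how publish_event.py serializes the payload before signing. The helper names are hypothetical.

```python
import copy
import hashlib
import hmac
import json
import secrets
from typing import Any, Dict


def _canonical_bytes(payload: Dict[str, Any]) -> bytes:
    """Serialize the payload deterministically with data.hmac_sig excluded."""
    unsigned = copy.deepcopy(payload)
    data = unsigned.get("data")
    data = dict(data) if isinstance(data, dict) else {}
    data.pop("hmac_sig", None)
    unsigned["data"] = data
    # Assumption: sorted keys and compact separators as the canonical form.
    return json.dumps(unsigned, sort_keys=True, separators=(",", ":")).encode("utf-8")


def _signature(payload: Dict[str, Any], auth_token: str) -> str:
    key = auth_token.encode("utf-8")
    return hmac.new(key, _canonical_bytes(payload), hashlib.sha256).hexdigest()


def sign_event(payload: Dict[str, Any], auth_token: str) -> Dict[str, Any]:
    """Return a copy of the payload with data.hmac_sig attached."""
    signed = copy.deepcopy(payload)
    if not isinstance(signed.get("data"), dict):
        signed["data"] = {}
    signed["data"]["hmac_sig"] = _signature(payload, auth_token)
    return signed


def verify_event(payload: Dict[str, Any], auth_token: str) -> bool:
    """Check data.hmac_sig against the token held in the local registry."""
    data = payload.get("data")
    if not isinstance(data, dict) or not isinstance(data.get("hmac_sig"), str):
        return False
    expected = _signature(payload, auth_token)
    # compare_digest avoids leaking the mismatch position through timing.
    return hmac.compare_digest(data["hmac_sig"].encode("utf-8"), expected.encode("utf-8"))


token = secrets.token_urlsafe(32)
event = {"schema_version": 1, "seq": 1, "job_id": "918b0612", "event": "started",
         "timestamp": "2026-06-20T14:48:58Z", "detail": "Job 918b0612 started", "data": {}}
signed_event = sign_event(event, token)
print(verify_event(signed_event, token))  # prints True
```

Whatever canonical form is chosen, publisher and subscriber must agree on it byte for byte, which is one reason both sides need to be rolled out together.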

3. Job Lifecycle & State Transitions

The lifecycle of a delegated job is a state machine that combines file-based registry claiming, asynchronous message subscription, and event publishing.

stateDiagram-v2
    [*] --> pending : register_job()
    pending --> running : pick_pending()
    running --> completed : publish_event(--event completed)
    running --> error : publish_event(--event error)
    running --> cancelled : update_status(..., cancelled)
    pending --> cancelled : update_status(..., cancelled)
    completed --> [*]
    error --> [*]
    cancelled --> [*]
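
The diagram above can be read as a transition table. The sketch below encodes it directly and rejects any move the diagram does not allow; `ALLOWED_TRANSITIONS` and `transition` are hypothetical names for illustration, not functions from registry.py.

```python
# Each state maps to the set of states it may move to, per the diagram.
ALLOWED_TRANSITIONS = {
    "pending": {"running", "cancelled"},
    "running": {"completed", "error", "cancelled"},
    "completed": set(),   # terminal
    "error": set(),       # terminal
    "cancelled": set(),   # terminal
}


def transition(current: str, new: str) -> str:
    """Return the new state, or raise ValueError if the move is not allowed."""
    if current not in ALLOWED_TRANSITIONS:
        raise ValueError(f"unknown state: {current!r}")
    if new not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current} -> {new}")
    return new


state = "pending"
state = transition(state, "running")
state = transition(state, "completed")
print(state)  # prints completed
```

Keeping the table in one place means a terminal state can never be left by accident, since its set of successors is empty.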

3.1 Step-by-Step Lifecycle Phase Details

Phase 1: Registration (register)

  • Trigger: A delegator triggers registry.py register (or the multi-agent-mux-delegate-job submit command).
  • Registry State: Flips from non-existent to pending inside .mam/jobs/<job_id>.json. A last_seq counter is initialized to 0.
  • Locking: Exclusive fcntl file lock acquired over .lock during write.
  • Durable Audit Log: Writes <logs>/<job_id>/meta.json, sets status to pending in status.json, and appends a registered event line to events.ndjson.

Phase 2: Claiming (pick_pending)

  • Trigger: An agent session starts up and calls registry.py pick --agent-session <session_label>.
  • Registry State: Oldest matching pending record is scanned. Status is atomically updated to running. updated_at is stamped.
  • Locking: Reads and writes occur inside the exclusive fcntl lock block.
  • Durable Audit Log: Status is synced to running in status.json and a status_changed event is appended to events.ndjson.

Phase 3: Listening (subscribe)

  • Trigger: The wrapper command launches job_subscriber.py --job <job_id> in the background before launching the agent.
  • Broker Connection: Connects to the resolved host, issues a QoS 1 subscription to python/mqtt/jobs/<job_id>/events, and blocks on an event queue.
  • Timeout Initialization: Dual timeouts (wall-clock budget and activity idle timer) are calculated and start ticking.

Phase 4: Execution & Progress Events (publish)

  • Trigger: The agent executes prompts within tmux and runs publish_event.py at boot and checkpoint stages.
  • Network Handshake: Publisher opens a fresh TCP/TLS socket to the broker, awaits CONNACK, publishes a single QoS 1 message, waits for PUBACK, and gracefully disconnects to avoid socket resource leaks.
  • State Updates: Updates last_seq monotonically, updates status to running (if not already), and mirrors the published payload into the local audit logs (events.ndjson).
  • Subscriber Capture: The subscriber captures the payload, verifies its HMAC signature (see Section 2.4), prints the formatted line to stdout, and resets its idle timer.

Phase 5: Terminal Finalization

  • Trigger: Agent publishes --event completed or --event error.
  • Registry Transition: State becomes completed or error.
  • Retained Messaging: The terminal event is published with retain=True on the broker.
  • Subscriber Exit: The subscriber processes the terminal event exactly once, terminates its background loops, and exits (code 0 for completed, 1 for error).

4. Code Internals Analysis

4.1 registry.py & lib.sh (Locking & Atomicity)

Two concurrency control schemes co-exist in this workspace to coordinate state modification:

  1. lib.sh::atomic_dump_yaml(): Used for workspace-wide tmux session inventory (agent-sessions.yaml).
    • Locking: Uses SQLite database transaction serialization via BEGIN IMMEDIATE on agent-sessions.db.
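    • Safe Mutation: The mutation source code is passed in the environment variable AGENT_SESSIONS_MUTATION and executed dynamically using exec(compile(..., 'exec'), globals()). Passing the code through the environment instead of interpolating it into a shell command line avoids shell-quoting and command-injection problems. exec() itself does not sandbox anything, so the variable must only ever carry trusted code.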
    • Atomicity: Updates the SQLite tables and then, if a session transitions to a finished state, writes to a temp file in the same directory using tempfile.mkstemp() and performs an os.replace() rename. POSIX guarantees the replacement is atomic, preventing half-written YAML reads. A .bak backup copy is also preserved.
  2. registry.py::register_job() / pick_pending() / _atomic_write_record(): Used for job-level metadata JSON files (<job_id>.json).
    • Locking: Wraps operations in a registry_lock(registry_dir) context manager, implementing an advisory exclusive lock on .lock via fcntl.flock.
    • Atomicity: In _atomic_write_record(), it uses tempfile.mkstemp inside the parent registry folder, serializes the updated job record to the temp file, flushes it, triggers a physical disk sync via os.fsync(fh.fileno()), and executes os.replace to replace the main JSON record file. The file permission is restricted to 0o600 immediately.
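
The lock-then-replace pattern described for the job registry can be sketched as below. This is a simplified re-implementation for illustration, not the code in registry.py: the function bodies, the temp-file naming, and the cleanup on failure are assumptions. It is POSIX-only because it relies on fcntl.

```python
import contextlib
import fcntl
import json
import os
import tempfile
from typing import Any, Dict, Iterator


@contextlib.contextmanager
def registry_lock(registry_dir: str) -> Iterator[None]:
    """Hold an exclusive advisory lock on <registry_dir>/.lock."""
    with open(os.path.join(registry_dir, ".lock"), "a") as lock_fh:
        fcntl.flock(lock_fh, fcntl.LOCK_EX)   # blocks until the lock is free
        try:
            yield
        finally:
            fcntl.flock(lock_fh, fcntl.LOCK_UN)


def atomic_write_record(registry_dir: str, job_id: str, record: Dict[str, Any]) -> str:
    """Write <job_id>.json so readers see either the old or the new file, never a partial one."""
    final_path = os.path.join(registry_dir, f"{job_id}.json")
    # The temp file must live in the same directory so os.replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=registry_dir, prefix=f".{job_id}.", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(record, fh)
            fh.flush()
            os.fsync(fh.fileno())             # force the bytes to disk before renaming
        os.chmod(tmp_path, 0o600)
        os.replace(tmp_path, final_path)      # atomic rename over the old record
    except BaseException:
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_path)
        raise
    return final_path


with tempfile.TemporaryDirectory() as demo_dir:
    with registry_lock(demo_dir):
        written = atomic_write_record(demo_dir, "918b0612", {"status": "pending", "last_seq": 0})
    print(os.path.basename(written))  # prints 918b0612.json
```

The lock serializes writers, while the rename protects readers that do not take the lock at all; the two mechanisms cover different failure modes.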

4.2 publish_event.py (Retries and Handshakes)

The publisher script enforces robust error handling when sending status updates:

  • Fresh Connection Pattern: Instead of maintaining a persistent socket connection (which is susceptible to socket timeouts or channel leaks), publish_event.py opens a fresh socket, completes the authentication/TLS handshake, publishes a single QoS 1 event, waits for PUBACK, and closes the connection.
  • Exponential Backoff: Wrapped in the with_retry() decorator from mqtt_common.py. In case of socket errors (OSError, TimeoutError, ConnectionError), it retries up to 3 times (configurable via --attempts) with backoff:

    $$\text{delay} = \min\left(\text{base\_delay} \times \text{factor}^{\text{attempt}-1},\ \text{max\_delay}\right)$$

    Default parameters: base_delay = 0.5s, factor = 2.0, max_delay = 8.0s.
  • ACK Handshake Deadlines:
    • CONNECT_ACK_TIMEOUT = 10s (stops hangs during broker downtime).
    • PUBLISH_ACK_TIMEOUT = 5s (guarantees QoS 1 message acknowledgment before marking as published).
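
To make the backoff schedule concrete, the sketch below computes the delay for each attempt and wraps a callable in a retry loop. `backoff_delay` and `retry_call` are hypothetical names (the real helper is the with_retry() decorator, whose source is not reproduced here), and treating `attempts` as the total number of tries is an assumption.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base_delay: float = 0.5,
                  factor: float = 2.0, max_delay: float = 8.0) -> float:
    """Delay in seconds before the retry that follows the given 1-based attempt."""
    return min(base_delay * factor ** (attempt - 1), max_delay)


def retry_call(func: Callable[[], T], attempts: int = 3,
               sleep: Callable[[float], None] = time.sleep) -> T:
    """Call func until it succeeds or `attempts` tries have failed."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except OSError:
            # TimeoutError and ConnectionError are subclasses of OSError.
            if attempt == attempts:
                raise
            sleep(backoff_delay(attempt))
    raise RuntimeError("attempts must be at least 1")


print([backoff_delay(n) for n in range(1, 7)])
# prints [0.5, 1.0, 2.0, 4.0, 8.0, 8.0]
```

With the defaults, the cap of 8.0 seconds is reached on the fifth attempt, so raising --attempts beyond that adds a fixed 8 seconds per extra retry.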

4.3 job_subscriber.py (Timers and Queue Semantics)

The subscriber acts as the central execution watchdog:

  • Queue Serialization: Uses a thread-safe queue.Queue internally. The Paho MQTT callback thread adds messages to the queue, and the main thread processes them sequentially. This separates network I/O from state machine validation.
  • State Machine Protection: To safeguard against QoS 1 duplicate delivery or out-of-order broker retries, the subscriber runs a terminal state machine. It records job completion in an internal terminal dictionary. Once a job is marked completed or error, any subsequent events for that job_id are ignored:
    if event in TERMINAL_EVENTS:
        if jid in terminal:
            logger.info("ignoring duplicate terminal %s for %s", event, jid)
            continue
        terminal[jid] = event
        pending.discard(jid)
    
  • Dual Timeout Semantics:
    1. Wall-Clock Timeout: Calculated relative to absolute startup time (wall_deadline = start + wall_timeout). It acts as a hard budget limit, guarding against an agent hanging indefinitely.
    2. Activity Idle Timeout: Measured as the difference between the current monotonic time and the last packet arrival time (idle_left = idle_timeout - (now - last_event)). If the agent fails to print logs or publish progress updates for the duration of the idle window, the subscriber aborts and exits with code 2.
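
The queue hand-off and the two timers can be combined into a single wait loop, sketched below for one job. `watch_job` is a hypothetical function, not the subscriber's real entry point. Exit codes 0, 1, and 2 follow the values given in this report for completed, error, and idle timeout; reusing 2 for a wall-clock timeout is an assumption, since that code is not stated.

```python
import queue
import time
from typing import Any, Callable, Dict

TERMINAL_EVENTS = {"completed", "error"}
EXIT_COMPLETED, EXIT_ERROR, EXIT_TIMEOUT = 0, 1, 2


def watch_job(events: "queue.Queue[Dict[str, Any]]", job_id: str,
              wall_timeout: float, idle_timeout: float,
              clock: Callable[[], float] = time.monotonic) -> int:
    """Consume queued events for one job until a terminal event or a timeout."""
    start = clock()
    wall_deadline = start + wall_timeout
    last_event = start
    while True:
        now = clock()
        wall_left = wall_deadline - now
        idle_left = idle_timeout - (now - last_event)
        wait = min(wall_left, idle_left)
        if wait <= 0:
            return EXIT_TIMEOUT
        try:
            # The MQTT callback thread would put() payloads; this thread blocks here.
            payload = events.get(timeout=wait)
        except queue.Empty:
            continue  # re-evaluate both deadlines at the top of the loop
        if payload.get("job_id") != job_id:
            continue  # unexpected job: ignore and do not reset the idle timer
        last_event = clock()
        event = payload.get("event")
        if event in TERMINAL_EVENTS:
            # Returning on the first terminal event means duplicates are never processed.
            return EXIT_COMPLETED if event == "completed" else EXIT_ERROR


demo_queue: "queue.Queue[Dict[str, Any]]" = queue.Queue()
demo_queue.put({"job_id": "918b0612", "event": "started"})
demo_queue.put({"job_id": "918b0612", "event": "completed"})
print(watch_job(demo_queue, "918b0612", wall_timeout=5.0, idle_timeout=1.0))  # prints 0
```

Using a monotonic clock for both timers keeps them immune to system clock adjustments, which is also why the payload timestamp is treated as advisory.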

4.4 mqtt_common.py (Logging & Config Resolution)

  • Log Routing Isolation: Configured via setup_logging(). The root logger is bound to sys.stderr, which reserves stdout exclusively for clean JSON-lines payloads so that downstream shell tools can consume the event feed in a pipeline (e.g., job_subscriber.py ... | jq).
  • Broker Config Resolution: Configured in broker_config_from_job(). Resolves credentials hierarchically:
    1. Defaults to environment configurations (e.g. MQTT_BROKER, MQTT_PORT, MQTT_TLS, MQTT_CA_CERTS).
    2. Overlays credentials specified inside the job record JSON block (broker.*). This allows the agent to fetch its dedicated target broker credentials on a per-job basis.
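
The two-layer resolution can be sketched as an environment baseline overlaid with the job record's broker block. `resolve_broker_config` is a hypothetical stand-in for broker_config_from_job(); the key names inside `broker` (host, port, tls, and so on) and the truthy parsing of MQTT_TLS are assumptions, while the environment variable names and the PoC defaults come from this report.

```python
import os
from typing import Any, Dict, Mapping, Optional


def resolve_broker_config(job_record: Mapping[str, Any],
                          env: Optional[Mapping[str, str]] = None) -> Dict[str, Any]:
    """Environment values form the baseline; job-level broker.* keys override them."""
    env = os.environ if env is None else env
    config: Dict[str, Any] = {
        "host": env.get("MQTT_BROKER", "broker.hivemq.com"),
        "port": int(env.get("MQTT_PORT", "1883")),
        # Assumption: common truthy spellings enable TLS.
        "tls": env.get("MQTT_TLS", "").lower() in {"1", "true", "yes"},
        "ca_certs": env.get("MQTT_CA_CERTS"),
        "username": env.get("MQTT_USERNAME"),
        "password": env.get("MQTT_PASSWORD"),
    }
    overrides = job_record.get("broker") or {}
    # Only explicit, non-null job-level values replace the baseline.
    config.update({key: value for key, value in overrides.items() if value is not None})
    return config


demo_env = {"MQTT_BROKER": "mqtt.internal.example", "MQTT_PORT": "8883", "MQTT_TLS": "true"}
demo_job = {"job_id": "918b0612", "broker": {"username": "claude-worker", "port": None}}
print(resolve_broker_config(demo_job, demo_env)["port"])  # prints 8883
```

Passing the environment as a parameter keeps the function easy to test and makes the precedence order explicit: job record first, environment second, built-in PoC defaults last.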

5. Cross-System Integration

The delegated messaging system is the control plane that connects the shell wrappers and monitoring loops across the orchestration stack.

graph LR
    User["User/Cron Client"] -->|submit| Wrap["multi-agent-mux-delegate-job (Bash)"]
    Wrap -->|registers| Reg["registry.py (Live Registry)"]
    Wrap -->|spawns background| Sub["job_subscriber.py"]
    Wrap -->|spawns tmux pane| Tmux["tmux Session (Agent Pane)"]
    
    Tmux -->|executes agent| Agent["Claude / Codex Agent"]
    Agent -->|publish_event.py| Broker["MQTT Broker"]
    Broker -->|delivers events| Sub
    Broker -->|delivers events| Mon["reconcile.sh (Monitor Loop)"]
    
    Mon -->|updates| Inv["agent-sessions.yaml <br> (lib.sh::atomic_dump_yaml)"]

5.1 Orchestration Wrappers (multi-agent-mux-*)

  1. multi-agent-mux-delegate-job (submit):
    • Registers a job, spawns job_subscriber.py in the background with its standard output captured to .mam/jobs/<job_id>.subscriber.out, and sleeps for 1 second before starting the agent.
    • Boots the agent pane in tmux:
      tmux new-session -d -s "$sess" -c "$WORKDIR" \
        "printf '%s' \"$instructions\" | $bin --dangerously-skip-permissions; echo; read"
      
    • Pre-seeds agent instruction headers via stdin to enforce that the agent runs publish_event.py for its transitions.
    • Blocks on wait $sub_pid, and finally prints the audit log directory.
  2. multi-agent-mux-monitor (reconcile.sh):
    • Wildcard Monitor Integration: Runs a unified background subscriber loop (reconcile.sh --subscribe) to capture progress, verify security tokens (HMAC) and sequences, write audit logs, and automatically clean up tmux sessions upon terminal events.
    • Reconciliation loop: Subscribes to the global job topic. On terminal events, it invokes lib.sh::atomic_dump_yaml to sync status drifts (e.g. setting tmux sessions to terminated in agent-sessions.yaml once the agent exits).
  3. multi-agent-mux-create / stop / resume:
    • Integrates the job lifecycle status into session metadata updates, so that standard tmux cleanup also updates the registry and audit logs.

6. Known Limitations & Recommendations

6.1 Limitations

  1. Single-Host File Locking: agent-sessions.yaml has been migrated to SQLite WAL to handle concurrent writes, but the job metadata in .mam/jobs/ still relies on advisory fcntl.flock locks, which may not be honored reliably across hosts on NFS.
  2. Bearer Token Leakage over Plaintext (Public Broker): The auth_token mechanism is a simple plaintext bearer comparison. If the transport layer is unencrypted (e.g., using broker.hivemq.com on port 1883), any eavesdropper on the network can steal the token and spoof legitimate events.
  3. Subscriber Network Drops Orphan the Agent: job_subscriber.py does not implement automatic reconnection. If the subscriber loses its connection to the broker, it exits, leaving the running tmux agent orphaned and without a validation/collection hook.
  4. Lack of Ordering Guarantees in QoS 1: QoS 1 guarantees at-least-once delivery, not strict ordering across publishes. Because each publish is retried independently with backoff, a late-delivered progress event could land after a terminal event, causing state inconsistencies.
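
One way a consumer could contain the ordering problem in item 4 is to track the highest seq seen per job and drop anything older, as in the sketch below. `SeqGuard` is a hypothetical class, not part of the codebase, and it trades completeness for consistency: a late event is discarded rather than reordered.

```python
TERMINAL_EVENTS = {"completed", "error"}


class SeqGuard:
    """Accept an event only if its seq advances and no terminal event was seen."""

    def __init__(self) -> None:
        self.last_seq = 0
        self.terminal = False

    def accept(self, seq: int, event: str) -> bool:
        if self.terminal or seq <= self.last_seq:
            # Duplicate, stale, or post-terminal delivery: ignore it.
            return False
        self.last_seq = seq
        if event in TERMINAL_EVENTS:
            self.terminal = True
        return True


guard = SeqGuard()
deliveries = [(1, "started"), (3, "completed"), (2, "progress"), (3, "completed")]
print([guard.accept(seq, event) for seq, event in deliveries])
# prints [True, True, False, False]
```

Here the progress event with seq 2 arrives after the terminal event with seq 3 and is dropped, as is the duplicate terminal delivery.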

6.2 Recommendations

  1. [Implemented] Migrate to SQLite WAL Backend: The agent-sessions.yaml locking mechanism in lib.sh has been upgraded to use a SQLite database (agent-sessions.db) configured with Write-Ahead Logging (PRAGMA journal_mode=WAL). The YAML file is now only updated as a finalized archive when a session reaches a terminal state (stopped, terminated, archived), eliminating flock contention during active session updates.
     Architecture Decision Note: agent-sessions.yaml is no longer a real-time view of currently running sessions. We have explicitly accepted the trade-off of giving up real-time text readability of running sessions in favor of robust concurrency without flock on the session inventory. Tooling and status checks must now query the SQLite DB to observe live running states. Note that SQLite WAL mode requires all processes to run on the same host and does not work over network filesystems, so agent-sessions.db must live on local disk; the migration removes the flock dependency but does not make the inventory NFS-safe.
  2. Implement Signature-Based Payload Verification: Rather than sending a plaintext token, utilize HMAC signatures. The delegator and worker share a secret key; the worker publishes a signature of the payload (e.g. HMAC-SHA256(secret_key, payload_bytes)). The subscriber validates the signature, preventing token interception.
  3. Enforce Mandatory Broker-Side TLS and ACLs: De-prioritize plaintext support. Enforce connection over port 8883 with verified TLS certificates. Implement client certificates (mTLS) for agent authentication.
  4. Build Auto-Reconnecting Subscriber Loops: Upgrade job_subscriber.py to handle disconnect callbacks. Keep the in-memory event queue alive across reconnects and let the client reconnect with exponential backoff, so that a dropped socket does not terminate the orchestration flow.

Glossary: Session States vs Job States

This project manages two distinct state domains that are often confused:

Session States (YAML — .mam/agent-sessions.yaml)

Managed by .agents/skills/lib.sh and the 6 multi-agent-mux-* skills. Valid values (see lib.sh valid-status set):

| State | Meaning | Set by |
|---|---|---|
| running | tmux session active, agent running | create, resume |
| stopped | deliberately stopped via --capture-id/--reason/--graceful; conversation preserved for resume | stop (STOP mode) |
| terminated | hard-killed via --mode hard; tmux session destroyed | stop (hard mode), monitor reconcile |
| archived | soft-stopped via --mode soft; tmux left alive, YAML-only update | stop (soft mode) |

Job States (Registry — .mam/jobs/<id>.json)

Managed by .agents/skills/multi-agent-mux-delegate-job/scripts/registry.py. Valid values:

| State | Meaning | Set by |
|---|---|---|
| pending | job registered, agent not yet started | registry.py register |
| running | agent picked up the job, publishing events | publish_event.py --event started |
| completed | terminal event: agent finished successfully | publish_event.py --event completed |
| error | terminal event: agent failed | publish_event.py --event error |
| cancelled | job cancelled by orchestrator | registry.py cancel |

Key distinction: Session states track the tmux container lifecycle (create→stop→resume). Job states track the delegated work lifecycle (submit→run→complete/error). A single session can host multiple sequential jobs; a job runs within exactly one session.