Messaging System Technical Analysis & Architecture Report
This report analyzes the messaging system implemented in the tmux-agent-orchestrate-delegate-job skill. It covers the MQTT broker architecture, event protocol, job lifecycle, codebase internals, cross-system integration, known limitations, and production recommendations.
1. MQTT Broker Architecture: PoC vs. TLS Production
The messaging system is designed with a clear, decoupled transition pathway from a Proof of Concept (PoC) public broker setup to a secured, authenticated, and encrypted private production cluster. All configurations are resolved dynamically from the environment or overridden at the job level, requiring zero code modifications during deployment cut-over.
1.1 PoC Architecture (Public Sandbox)
In the initial development/testing phase, the system defaults to the public broker hosted by HiveMQ:
- Host/IP: `broker.hivemq.com`
- Protocol/Port: Plaintext MQTT over TCP on port `1883`.
- Security & Auth: None. No username, password, TLS encryption, or access control list (ACL) constraints are applied.
- QoS Level: QoS 1 (At Least Once) is requested for publishes and subscriptions, so each message is acknowledged at the MQTT protocol level (PUBACK) rather than only at the TCP layer.
Risks and Limitations of the PoC Setup:
- Zero Eavesdropping Protection: Because the broker is public and unencrypted, any internet user can subscribe to the root topic (`python/mqtt/jobs/#`) and read the exact prompt, agent sessions, and intermediate progress events.
- Event Spoofing & Injection: Anyone can publish messages to any job topic. An attacker could publish a malicious `completed` or `error` event, prematurely terminating a running subscriber or causing the delegator to execute unauthorized post-validation hooks.
- No Message Persistence: Public brokers do not guarantee queue persistence or durable sessions for disconnected clients. If a subscriber briefly drops offline, QoS 1 messages published during the disconnect window may be discarded.
- Rate Limiting & Reliability: Public sandboxes are subject to arbitrary rate limits, traffic spikes, and transient connection resets, leading to network-level timeouts.
1.2 Production Architecture (Secure Private Broker)
For production deployments, the system is designed to run on a private, self-hosted MQTT 5.0 broker such as Mosquitto or EMQX.
graph TD
subgraph "Secure Corporate Network"
Broker["Private MQTT Broker (Mosquitto/EMQX) <br> Ports: 8883 (TLS)"]
subgraph "Hermes (Delegator/Orchestrator)"
SubClient["job_subscriber.py <br> (Role: subscriber)"]
end
subgraph "Tmux Workspace (Agent Host)"
PubClient["publish_event.py <br> (Role: publisher)"]
end
SubClient -- "Subscribe (QoS 1) <br> Auth: hermes <br> ACL: Read jobs/+/events" --> Broker
PubClient -- "Publish (QoS 1 + Retain Terminal) <br> Auth: claude-worker <br> ACL: Write jobs/+/events" --> Broker
end
Production Security & Hardening Controls:
- Transport Layer Security (TLS v1.3): Traffic is encrypted over port `8883` using a private Certification Authority (CA). The orchestrator validates the broker using `MQTT_CA_CERTS` (CA bundle path). Optionally, Mutual TLS (mTLS) is supported via client-side certificate keys (`MQTT_CERTFILE` / `MQTT_KEYFILE`) for cryptographic device identities.
- Strict Client Authentication: All clients must supply credentials (`MQTT_USERNAME` / `MQTT_PASSWORD`) to establish a connection. Anonymous logins are explicitly disabled (`allow_anonymous false`).
- Role-Based Topic Access Control Lists (ACLs):
  - Orchestrator/Hermes (Subscriber): Authenticates as user `hermes` with read-only access to all event streams, via the ACL entry `user hermes` followed by `topic read python/mqtt/jobs/+/events`.
  - Agent/Worker (Publisher): Authenticates as user `claude-worker` with write-only access restricted to the job event sub-topics, via the ACL entry `user claude-worker` followed by `topic write python/mqtt/jobs/+/events`. This prevents workers from eavesdropping on sister agents or intercepting commands on other jobs.
- Durable Message Queues & Session State:
  - The broker is configured with `persistence true` and a dedicated disk storage path.
  - Subscribers connect with persistent session flags to ensure the broker buffers QoS 1 messages during temporary network drops.
- Retained Terminal Events: Terminal events (`completed` / `error`) are published with the `retain=True` flag. This allows a late-joining or recovering subscriber to instantly retrieve the final job status without waiting for active transmissions.
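As an illustration of the TLS controls listed above, here is a minimal sketch of a client-side TLS context built with Python's standard `ssl` module. The helper name `build_tls_context` is hypothetical; the environment variable names come from the list above, but how the real scripts consume them is an assumption.

```python
import os
import ssl
from typing import Mapping, Optional


def build_tls_context(env: Optional[Mapping[str, str]] = None) -> ssl.SSLContext:
    """Build a client-side TLS 1.3 context from MQTT_* settings (illustrative only)."""
    env = os.environ if env is None else env

    # PROTOCOL_TLS_CLIENT turns on hostname checking and CERT_REQUIRED by default.
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_3

    ca_certs = env.get("MQTT_CA_CERTS")
    if ca_certs:
        # Trust the private CA bundle used to sign the broker certificate.
        ctx.load_verify_locations(cafile=ca_certs)

    certfile = env.get("MQTT_CERTFILE")
    keyfile = env.get("MQTT_KEYFILE")
    if certfile and keyfile:
        # Optional mTLS: present a client certificate to the broker.
        ctx.load_cert_chain(certfile=certfile, keyfile=keyfile)

    return ctx
```

With paho-mqtt, a context like this could be handed to the client through `tls_set_context()` before connecting to port `8883`.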
1.3 Production Mosquitto Configuration Reference
A hardened /etc/mosquitto/mosquitto.conf production configuration includes:
# Persistence settings
persistence true
persistence_location /var/lib/mosquitto/
# Authentication and Authorization
password_file /etc/mosquitto/auth/passwd
acl_file /etc/mosquitto/auth/acl
allow_anonymous false
# Listener and TLS Configuration
listener 8883
cafile /etc/mosquitto/certs/ca.crt
certfile /etc/mosquitto/certs/server.crt
keyfile /etc/mosquitto/certs/server.key
tls_version tlsv1.3
2. Event Protocol Specification
The event protocol defines a strict, single-direction JSON wire schema. It acts as the contract between the worker agent (the publisher) and the delegator/orchestrator (the subscriber).
2.1 Wire Schema (JSON UTF-8, schema_version = 1)
Every event payload must adhere to the following schema structure:
{
"schema_version": 1,
"seq": 2,
"job_id": "918b0612",
"event": "progress",
"timestamp": "2026-06-20T14:48:58Z",
"detail": "Section 1: MQTT Broker Architecture completed",
"data": {
"auth_token": "URL-safe-base64-random-token-here",
"custom_metric": 42
}
}
2.2 Wire Schema Field Dictionary
| Field | Type | Required | Description & Validation Rules |
|---|---|---|---|
| `schema_version` | Integer | Yes | Must be exactly 1. Subscribers discard payloads with mismatched version numbers to prevent parser crashes on schema drift. |
| `seq` | Integer | Yes | Monotonic sequence counter starting at 1 for the first publish. Incremented and stored in the job's registry file (last_seq) to survive agent pane crashes. |
| `job_id` | String | Yes | The 8-character hex string identifying the target job. Subscribers discard any messages whose job_id is unexpected or unrequested. |
| `event` | String | Yes | The event classification: started, progress, permission_required, completed, or error. |
| `timestamp` | String | Yes | ISO-8601 UTC timestamp with a trailing Z suffix. (Advisory only; never trusted for timeouts). |
| `detail` | String | Yes | Generalized, safe text description. Strict rules prohibit absolute paths, workspace paths, passwords, or raw environment variables. |
| `data` | Object | Yes | Metadata dictionary. Used in production to pass auth_token or structured execution metrics. |
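The field rules in this table can be sketched as a small validator. This is an illustrative sketch, not the subscriber's actual parser: the function name `validate_event`, the regular expressions, and the choice to accept both upper- and lower-case hex in `job_id` are assumptions.

```python
import re
from typing import Any, Dict, List

EVENT_TYPES = {"started", "progress", "permission_required", "completed", "error"}
# Assumption: 8 hex characters, either case.
JOB_ID_RE = re.compile(r"[0-9a-fA-F]{8}")
# Assumption: second or sub-second precision, always with a trailing Z.
TIMESTAMP_RE = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?Z")

REQUIRED_FIELDS = {
    "schema_version": int,
    "seq": int,
    "job_id": str,
    "event": str,
    "timestamp": str,
    "detail": str,
    "data": dict,
}


def validate_event(payload: Dict[str, Any]) -> List[str]:
    """Return a list of schema violations; an empty list means the payload is valid."""
    errors: List[str] = []
    for field, expected in REQUIRED_FIELDS.items():
        value = payload.get(field)
        # bool is a subclass of int, so reject it explicitly.
        if not isinstance(value, expected) or isinstance(value, bool):
            errors.append(f"{field}: missing or not {expected.__name__}")
    if errors:
        return errors

    if payload["schema_version"] != 1:
        errors.append("schema_version: must be exactly 1")
    if payload["seq"] < 1:
        errors.append("seq: must start at 1")
    if not JOB_ID_RE.fullmatch(payload["job_id"]):
        errors.append("job_id: must be an 8-character hex string")
    if payload["event"] not in EVENT_TYPES:
        errors.append("event: unknown event type")
    if not TIMESTAMP_RE.fullmatch(payload["timestamp"]):
        errors.append("timestamp: must be ISO-8601 UTC with a trailing Z")
    return errors
```

A subscriber following the table would drop any payload for which this returns a non-empty list, rather than attempting to repair it.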
2.3 Event Type Dictionary and Schemas
1. started
- Emit Trigger: Emitted by the worker agent immediately upon boot inside the tmux session, indicating it has parsed the instructions and started execution.
- Payload Constraints: `seq` must be `1`. Status in registry is transitioned to `running`.
- Example Detail: `"Job 918b0612 started"`
2. progress
- Emit Trigger: Optional. Emitted at major check-points, long loops, or sub-task boundaries.
- Payload Constraints: None.
- Example Detail: `"Section 1: MQTT Broker Architecture completed"`
3. permission_required
- Emit Trigger: Emitted when the agent needs human confirmation (e.g. to run a destructive command or read/write critical system files).
- Payload Constraints: `detail` contains the resource/action requested.
- Example Detail: `"needs write permission to MESSAGING.md"`
4. completed (Terminal)
- Emit Trigger: Successful job completion. The agent has generated all expected artifacts and verified correctness.
- Payload Constraints: Must be the final event. Published with `retain=True`.
- Example Detail: `"deep report written and committed to git"`
5. error (Terminal)
- Emit Trigger: Terminal failure. The agent encountered an unhandled exception, a syntax error, or a validation script failure.
- Payload Constraints: Must be the final event. Published with `retain=True`.
- Example Detail: `"validation fail: missing files"`
2.4 Integrity and Authentication Verification (HMAC-SHA256 Signatures)
To prevent unauthorized users from hijacking or spoofing events on public brokers:
- When a job is registered, a cryptographic token (`auth_token`) is generated (`secrets.token_urlsafe(32)`).
- The publisher reads this token and signs the JSON payload. Specifically, the publisher calculates an HMAC-SHA256 signature using the `auth_token` as the secret key over the serialized payload (with the `hmac_sig` field excluded).
- The signature is attached as `data.hmac_sig` on the wire.
- The subscriber (`job_subscriber.py`) reads the expected `auth_token` from the local registry and verifies the HMAC signature. Any message with a missing, invalid, or mismatched signature is discarded immediately with an "HMAC verify failed" log.
- The plaintext `auth_token` is never transmitted on the wire, which prevents token interception. Because subscribers discard unsigned events, all publishers and subscribers must be updated simultaneously during deployment rollout to prevent event drops.
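The signing and verification steps above can be sketched with the standard library. The canonical serialization (sorted keys, compact separators) and the helper names `sign_event` and `verify_event` are assumptions; the real scripts may serialize the payload differently, and both sides must agree on the exact byte representation for signatures to match.

```python
import hashlib
import hmac
import json
import secrets
from typing import Any, Dict


def _canonical_bytes(payload: Dict[str, Any]) -> bytes:
    """Serialize the payload with data.hmac_sig removed, using a stable key order."""
    data = {k: v for k, v in payload.get("data", {}).items() if k != "hmac_sig"}
    unsigned = {**payload, "data": data}
    return json.dumps(unsigned, sort_keys=True, separators=(",", ":")).encode("utf-8")


def _signature(payload: Dict[str, Any], auth_token: str) -> str:
    key = auth_token.encode("utf-8")
    return hmac.new(key, _canonical_bytes(payload), hashlib.sha256).hexdigest()


def sign_event(payload: Dict[str, Any], auth_token: str) -> Dict[str, Any]:
    """Return a copy of the payload with data.hmac_sig attached."""
    signed = dict(payload)
    signed["data"] = {**payload.get("data", {}), "hmac_sig": _signature(payload, auth_token)}
    return signed


def verify_event(payload: Dict[str, Any], auth_token: str) -> bool:
    """Check data.hmac_sig against a locally computed signature."""
    received = payload.get("data", {}).get("hmac_sig")
    if not isinstance(received, str):
        return False
    expected = _signature(payload, auth_token)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(received.encode("utf-8"), expected.encode("utf-8"))


if __name__ == "__main__":
    token = secrets.token_urlsafe(32)  # generated once at job registration
    event = {"schema_version": 1, "seq": 1, "job_id": "918b0612", "event": "started",
             "timestamp": "2026-06-20T14:48:58Z", "detail": "Job 918b0612 started", "data": {}}
    print(verify_event(sign_event(event, token), token))
```

Because the signature covers `seq` and `job_id`, a captured message cannot be relabeled for another job, although an identical replay of the same message would still verify.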
3. Job Lifecycle & State Transitions
The lifecycle of a delegated job progresses through a highly coordinated state machine, involving file-based registry claiming, asynchronous message subscription, and multi-faceted event publishing.
stateDiagram-v2
[*] --> pending : register_job()
pending --> running : pick_pending()
running --> completed : publish_event(--event completed)
running --> error : publish_event(--event error)
running --> cancelled : update_status(..., cancelled)
pending --> cancelled : update_status(..., cancelled)
completed --> [*]
error --> [*]
cancelled --> [*]
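The transitions in the diagram above can be captured as a lookup table. This is a sketch for illustration; the names `ALLOWED_TRANSITIONS`, `can_transition`, and `is_terminal` are hypothetical and not taken from `registry.py`.

```python
from typing import Dict, FrozenSet

# Each key is a job status; the value is the set of statuses it may move to.
ALLOWED_TRANSITIONS: Dict[str, FrozenSet[str]] = {
    "pending": frozenset({"running", "cancelled"}),
    "running": frozenset({"completed", "error", "cancelled"}),
    "completed": frozenset(),
    "error": frozenset(),
    "cancelled": frozenset(),
}


def can_transition(current: str, target: str) -> bool:
    """Return True when the diagram permits moving from current to target."""
    return target in ALLOWED_TRANSITIONS.get(current, frozenset())


def is_terminal(status: str) -> bool:
    """A status is terminal when it is known and has no outgoing transitions."""
    return status in ALLOWED_TRANSITIONS and not ALLOWED_TRANSITIONS[status]
```

Checking `can_transition()` before writing a status makes it harder for a late or duplicated event to move a finished job back to `running`.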
3.1 Step-by-Step Lifecycle Phase Details
Phase 1: Registration (register)
- Trigger: A delegator triggers `registry.py register` (or the `tmux-agent-orchestrate-delegate-job submit` command).
- Registry State: Flips from non-existent to `pending` inside `.hermes/jobs/<job_id>.json`. A `last_seq` counter is initialized to `0`.
- Locking: Exclusive fcntl file lock acquired over `.lock` during write.
- Durable Audit Log: Writes `<logs>/<job_id>/meta.json`, sets status to `pending` in `status.json`, and appends a `registered` event line to `events.ndjson`.
Phase 2: Claiming (pick_pending)
- Trigger: An agent session starts up and calls `registry.py pick --agent-session <session_label>`.
- Registry State: Oldest matching `pending` record is scanned. Status is atomically updated to `running`. `updated_at` is stamped.
- Locking: Reads and writes occur inside the exclusive fcntl lock block.
- Durable Audit Log: Status is synced to `running` in `status.json` and a `status_changed` event is appended to `events.ndjson`.
Phase 3: Listening (subscribe)
- Trigger: The wrapper command launches `job_subscriber.py --job <job_id>` in the background before launching the agent.
- Broker Connection: Connects to the resolved host, issues a QoS 1 subscription to `python/mqtt/jobs/<job_id>/events`, and blocks on an event queue.
- Timeout Initialization: Dual timeouts (wall-clock budget and activity idle timer) are calculated and start ticking.
Phase 4: Execution & Progress Events (publish)
- Trigger: The agent executes prompts within tmux and runs `publish_event.py` at boot and checkpoint stages.
- Network Handshake: Publisher opens a fresh TCP/TLS socket to the broker, awaits CONNACK, publishes a single QoS 1 message, waits for PUBACK, and gracefully disconnects to avoid socket resource leaks.
- State Updates: Updates `last_seq` monotonically, updates `status` to `running` (if not already), and mirrors the published payload into the local audit logs (`events.ndjson`).
- Subscriber Capture: The subscriber captures the payload, performs bearer token checks, prints the formatted line to stdout, and resets its idle timer.
Phase 5: Terminal Finalization
- Trigger: Agent publishes `--event completed` or `--event error`.
- Registry Transition: State becomes `completed` or `error`.
- Retained Messaging: The terminal event is published with `retain=True` on the broker.
- Subscriber Exit: The subscriber processes the terminal event exactly once, terminates its background loops, and exits (code `0` for completed, `1` for error).
4. Code Internals Analysis
4.1 registry.py & lib.sh (Locking & Atomicity)
Two concurrency control schemes co-exist in this workspace to coordinate state modification:
- `lib.sh::atomic_dump_yaml()`: Used for workspace-wide tmux session inventory (`agent-sessions.yaml`).
  - Locking: Uses advisory file locking via Python's `fcntl.flock(lock_fh, fcntl.LOCK_EX)` over a sidecar lock file `<yaml_path>.lock`.
  - Safe Mutation: The mutation source code is passed in an environment variable `AGENT_SESSIONS_MUTATION` and executed dynamically using `exec(compile(..., 'exec'), globals())`. Passing the code through the environment rather than interpolating it into a shell command line avoids shell-quoting and command-injection vectors; the snippet itself still runs with the full privileges of the Python process, so it must come from a trusted caller.
  - Atomicity: Writes to a temp file in the same directory using `tempfile.mkstemp()`, then performs an `os.replace()` rename. POSIX guarantees the replacement is atomic, preventing half-written YAML reads. A `.bak` backup copy is also preserved.
- `registry.py::register_job()` / `pick_pending()` / `_atomic_write_record()`: Used for job-level metadata JSON files (`<job_id>.json`).
  - Locking: Wraps operations in a `registry_lock(registry_dir)` context manager, implementing an advisory exclusive lock on `.lock` via `fcntl.flock`.
  - Atomicity: In `_atomic_write_record()`, it uses `tempfile.mkstemp` inside the parent registry folder, serializes the updated job record to the temp file, flushes it, triggers a physical disk sync via `os.fsync(fh.fileno())`, and executes `os.replace` to replace the main JSON record file. The file permission is restricted to `0o600` immediately.
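The lock-then-atomic-replace pattern described for the job registry can be sketched as follows. This is a minimal illustration under assumptions, not the code in `registry.py`: the function bodies, the `atomic_write_record` signature, and the `.tmp` suffix are all hypothetical, and `fcntl` restricts the sketch to Unix-like hosts.

```python
import contextlib
import fcntl
import json
import os
import tempfile
from typing import Any, Dict, Iterator


@contextlib.contextmanager
def registry_lock(registry_dir: str) -> Iterator[None]:
    """Hold an exclusive advisory lock on <registry_dir>/.lock while the block runs."""
    os.makedirs(registry_dir, exist_ok=True)
    with open(os.path.join(registry_dir, ".lock"), "a") as lock_fh:
        fcntl.flock(lock_fh, fcntl.LOCK_EX)
        try:
            yield
        finally:
            fcntl.flock(lock_fh, fcntl.LOCK_UN)


def atomic_write_record(registry_dir: str, job_id: str, record: Dict[str, Any]) -> str:
    """Write <job_id>.json so readers see either the old file or the new one, never a partial."""
    final_path = os.path.join(registry_dir, f"{job_id}.json")
    # The temp file must live in the same directory so os.replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=registry_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(record, fh)
            fh.flush()
            os.fsync(fh.fileno())  # force the bytes to disk before the rename
        os.chmod(tmp_path, 0o600)  # owner-only; applied before the file becomes visible
        os.replace(tmp_path, final_path)
    except BaseException:
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_path)
        raise
    return final_path
```

A caller would wrap the read-modify-write cycle in `with registry_lock(registry_dir):` so that two processes cannot claim the same pending job.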
4.2 publish_event.py (Retries and Handshakes)
The publisher script enforces robust error handling when sending status updates:
- Fresh Connection Pattern: Instead of maintaining a persistent socket connection (which is susceptible to socket timeouts or channel leaks), `publish_event.py` opens a fresh socket, completes the authentication/TLS handshake, publishes a single QoS 1 event, waits for `PUBACK`, and closes the connection.
- Exponential Backoff: Wrapped in the `with_retry()` decorator from `mqtt_common.py`. In case of socket errors (`OSError`, `TimeoutError`, `ConnectionError`), it retries up to 3 times (configurable via `--attempts`) with backoff:

  $$\text{delay} = \min(\text{base\_delay} \times \text{factor}^{\text{attempt}-1}, \text{max\_delay})$$

  Default parameters: `base_delay = 0.5s`, `factor = 2.0`, `max_delay = 8.0s`.
- ACK Handshake Deadlines:
  - `CONNECT_ACK_TIMEOUT = 10s` (stops hangs during broker downtime).
  - `PUBLISH_ACK_TIMEOUT = 5s` (guarantees QoS 1 message acknowledgment before marking as published).
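The backoff formula and retry behavior above can be sketched as follows. The decorator signature here is an assumption for illustration (the real `with_retry()` in `mqtt_common.py` may differ), and the injectable `sleep` parameter is a hypothetical addition that makes the sketch testable without waiting.

```python
import functools
import time
from typing import Any, Callable


def backoff_delay(attempt: int, base_delay: float = 0.5,
                  factor: float = 2.0, max_delay: float = 8.0) -> float:
    """delay = min(base_delay * factor ** (attempt - 1), max_delay), attempt starting at 1."""
    return min(base_delay * factor ** (attempt - 1), max_delay)


def with_retry(attempts: int = 3, sleep: Callable[[float], Any] = time.sleep,
               **backoff: float) -> Callable:
    """Retry the wrapped call on socket-level errors, sleeping between attempts."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                # TimeoutError and ConnectionError are both subclasses of OSError.
                except OSError:
                    if attempt == attempts:
                        raise  # out of attempts: surface the last error
                    sleep(backoff_delay(attempt, **backoff))
        return wrapper
    return decorator
```

With the default parameters, successive delays are 0.5 s, 1 s, 2 s, 4 s, and then 8 s for every later attempt, since the cap applies from the fifth attempt onward.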
4.3 job_subscriber.py (Timers and Queue Semantics)
The subscriber acts as the central execution watchdog:
- Queue Serialization: Uses a thread-safe `queue.Queue` internally. The Paho MQTT callback thread adds messages to the queue, and the main thread processes them sequentially. This separates network I/O from state machine validation.
- State Machine Protection: To safeguard against QoS 1 duplicate delivery or out-of-order broker retries, the subscriber runs a terminal state machine. It records job completion in an internal `terminal` dictionary. Once a job is marked `completed` or `error`, any subsequent events for that `job_id` are ignored:

      if event in TERMINAL_EVENTS:
          if jid in terminal:
              logger.info("ignoring duplicate terminal %s for %s", event, jid)
              continue
          terminal[jid] = event
          pending.discard(jid)

- Dual Timeout Semantics:
  - Wall-Clock Timeout: Calculated relative to absolute startup time (`wall_deadline = start + wall_timeout`). It acts as a hard budget limit, guarding against an agent hanging indefinitely.
  - Activity Idle Timeout: Measured as the difference between the current monotonic time and the last packet arrival time (`idle_left = idle_timeout - (now - last_event)`). If the agent fails to print logs or publish progress updates for the duration of the idle window, the subscriber aborts and exits with code 2.
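The dual timeout arithmetic above can be sketched as a helper that decides how long the main thread may block on the queue and which timer would fire first. The function names and the injectable `clock` parameter are hypothetical; only the two formulas come from the text.

```python
import queue
import time
from typing import Callable, Optional, Tuple


def next_wait(now: float, start: float, last_event: float,
              wall_timeout: float, idle_timeout: float) -> Tuple[float, str]:
    """Return (seconds to block, name of the timer that expires first)."""
    wall_left = (start + wall_timeout) - now          # wall_deadline - now
    idle_left = idle_timeout - (now - last_event)
    if wall_left <= idle_left:
        return max(wall_left, 0.0), "wall"
    return max(idle_left, 0.0), "idle"


def wait_for_event(events: "queue.Queue", start: float, last_event: float,
                   wall_timeout: float, idle_timeout: float,
                   clock: Callable[[], float] = time.monotonic
                   ) -> Tuple[Optional[dict], Optional[str]]:
    """Block on the queue until an event arrives or the nearer deadline passes."""
    seconds, kind = next_wait(clock(), start, last_event, wall_timeout, idle_timeout)
    try:
        return events.get(timeout=seconds), None
    except queue.Empty:
        return None, kind  # the caller decides the exit code for this timeout
```

Using a monotonic clock for both timers keeps the deadlines immune to system clock adjustments, which is consistent with the event `timestamp` field being advisory only.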
4.4 mqtt_common.py (Logging & Config Resolution)
- Log Routing Isolation: Configured via `setup_logging()`. The root logger is bound to `sys.stderr`. This preserves the standard output stream (stdout) exclusively for clean JSON-lines payloads, enabling downstream bash tools to pipeline event feeds cleanly (e.g., `job_subscriber.py ... | jq`).
- Broker Config Resolution: Configured in `broker_config_from_job()`. Resolves credentials hierarchically:
  - Defaults to environment configurations (e.g. `MQTT_BROKER`, `MQTT_PORT`, `MQTT_TLS`, `MQTT_CA_CERTS`).
  - Overlays credentials specified inside the job record JSON block (`broker.*`). This allows the agent to fetch its dedicated target broker credentials on a per-job basis.
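The two-level resolution described above can be sketched as an environment-derived default dictionary overlaid by the job record. The function name `resolve_broker_config`, the key names inside the `broker` block, and the truthy parsing of `MQTT_TLS` are assumptions; the fallback host and port are the PoC defaults named earlier in this report.

```python
import os
from typing import Any, Dict, Mapping, Optional


def resolve_broker_config(job_record: Mapping[str, Any],
                          env: Optional[Mapping[str, str]] = None) -> Dict[str, Any]:
    """Environment values form the base layer; the job's broker block overrides them."""
    env = os.environ if env is None else env
    config: Dict[str, Any] = {
        "host": env.get("MQTT_BROKER", "broker.hivemq.com"),
        "port": int(env.get("MQTT_PORT", "1883")),
        # Assumption: common truthy spellings enable TLS.
        "tls": env.get("MQTT_TLS", "").lower() in {"1", "true", "yes"},
        "ca_certs": env.get("MQTT_CA_CERTS"),
        "username": env.get("MQTT_USERNAME"),
        "password": env.get("MQTT_PASSWORD"),
    }
    overrides = job_record.get("broker") or {}
    # Skip None so an absent per-job value never erases an environment default.
    config.update({key: value for key, value in overrides.items() if value is not None})
    return config
```

Because the overlay happens at call time, a deployment can cut over from the public sandbox to a private broker by changing environment variables or job records only.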
5. Cross-System Integration
The delegated messaging system functions as a critical control backplane, binding shell wrappers and monitoring loops across the orchestration stack.
graph LR
User["User/Cron Client"] -->|submit| Wrap["tmux-agent-orchestrate-delegate-job (Bash)"]
Wrap -->|registers| Reg["registry.py (Live Registry)"]
Wrap -->|spawns background| Sub["job_subscriber.py"]
Wrap -->|spawns tmux pane| Tmux["tmux Session (Agent Pane)"]
Tmux -->|executes agent| Agent["Claude / Codex Agent"]
Agent -->|publish_event.py| Broker["MQTT Broker"]
Broker -->|delivers events| Sub
Broker -->|delivers events| Mon["reconcile.sh (Monitor Loop)"]
Mon -->|updates| Inv["agent-sessions.yaml <br> (lib.sh::atomic_dump_yaml)"]
5.1 Orchestration Wrappers (tmux-agent-orchestrate-*)
- `tmux-agent-orchestrate-delegate-job` (submit):
  - Registers a job, spawns `job_subscriber.py` to capture standard output streams to `.hermes/jobs/<job_id>.subscriber.out`, and sleeps for `1` second.
  - Boots the agent pane in tmux:

        tmux new-session -d -s "$sess" -c "$WORKDIR" \
          "printf '%s' \"$instructions\" | $bin --dangerously-skip-permissions; echo; read"

  - Pre-seeds agent instruction headers via stdin to enforce that the agent runs `publish_event.py` for its transitions.
  - Blocks on `wait $sub_pid`, and finally prints the audit log directory.
- `tmux-agent-orchestrate-monitor` (`reconcile.sh` & `watchdog.sh`):
  - Watchdog Integration: Starts a subscriber monitoring loop (`watchdog.sh`) to detect orphaned agent panes or locked workspaces.
  - Reconciliation loop: Subscribes to the global job topic. On terminal events, it invokes `lib.sh::atomic_dump_yaml` to sync status drifts (e.g. setting tmux sessions to `terminated` in `agent-sessions.yaml` once the agent exits).
- `tmux-agent-orchestrate-create` / `stop` / `resume`:
  - Integrates the job life status into session metadata updates, ensuring standard tmux cleanup triggers state updates in the registry and audit logs.
6. Known Limitations & Recommendations
6.1 Limitations
- Single-Host File Locking Vulnerability: The advisory locking system previously relied heavily on `fcntl.flock`. While `agent-sessions.yaml` has been migrated to SQLite WAL to solve concurrent writes, the job metadata in `.hermes/jobs/` still relies on `fcntl.flock`, which may behave non-atomically on NFS.
- Bearer Token Leakage over Plaintext (Public Broker): The `auth_token` mechanism is a simple plaintext bearer comparison. If the transport layer is unencrypted (e.g., using `broker.hivemq.com` on port `1883`), any eavesdropper on the network can steal the token and spoof legitimate events.
- Subscriber Network Drop Orphanage: `job_subscriber.py` does not implement automatic reconnection loops. If the subscriber loses connection to the broker, it exits, leaving the running tmux agent orphaned and without a validation/collection hook.
- Lack of Ordering Guarantees in QoS 1: QoS 1 guarantees delivery but not strict ordering. Under heavy backoff retries, a late-delivered progress event could land after a terminal event, causing state inconsistencies.
6.2 Recommendations
- [Implemented] Migrate to SQLite WAL Backend: The `agent-sessions.yaml` locking mechanism in `lib.sh` has been upgraded to use a SQLite database (`agent-sessions.db`) configured with Write-Ahead Logging (`PRAGMA journal_mode=WAL`). The YAML file is now only updated as a finalized archive when a session reaches a terminal state (`stopped`, `terminated`, `archived`), eliminating `flock` contention during active session updates.
  Architecture Decision Note: This means `agent-sessions.yaml` is no longer a real-time view of currently `running` sessions. We have explicitly accepted the trade-off of giving up real-time text readability of running sessions in favor of robust concurrency and solving NFS flock limits. Tooling and status checks must now query the SQLite DB to observe live `running` states.
- Implement Signature-Based Payload Verification: Rather than sending a plaintext token, utilize HMAC signatures. The delegator and worker share a secret key; the worker publishes a signature of the payload (e.g. `HMAC-SHA256(secret_key, payload_bytes)`). The subscriber validates the signature, preventing token interception.
- Enforce Mandatory Broker-Side TLS and ACLs: De-prioritize plaintext support. Enforce connection over port `8883` with verified TLS certificates. Implement client certificates (mTLS) for agent authentication.
- Build Auto-Reconnecting Subscriber Loops: Upgrade `job_subscriber.py` to handle disconnect callbacks. Maintain a persistent queue in memory and allow the client to reconnect with exponential backoff, preventing socket dropout from terminating the orchestration flow.
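For the SQLite WAL recommendation above, the following sketch shows how a tool might open the session database in WAL mode and query live `running` sessions. The table name, column names, and helper functions are assumptions made for illustration; the actual schema of `agent-sessions.db` is not described in this report.

```python
import sqlite3
from typing import List


def open_sessions_db(path: str) -> sqlite3.Connection:
    """Open (or create) the session database with Write-Ahead Logging enabled."""
    conn = sqlite3.connect(path, timeout=5.0)  # wait up to 5 s on a busy database
    conn.execute("PRAGMA journal_mode=WAL")
    # Hypothetical schema: one row per tmux session.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sessions ("
        "name TEXT PRIMARY KEY, status TEXT NOT NULL, updated_at TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def set_status(conn: sqlite3.Connection, name: str, status: str) -> None:
    """Insert or overwrite the status row for one session inside a transaction."""
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO sessions (name, status, updated_at) "
            "VALUES (?, ?, datetime('now'))",
            (name, status),
        )


def live_sessions(conn: sqlite3.Connection) -> List[str]:
    """Return the names of sessions currently marked running."""
    rows = conn.execute("SELECT name FROM sessions WHERE status = 'running' ORDER BY name")
    return [row[0] for row in rows]
```

WAL mode lets readers proceed while a single writer commits, which is the property that removes the lock contention seen with the shared YAML file.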
Glossary: Session States vs Job States
This project manages two distinct state domains that are often confused:
Session States (YAML — .hermes/agent-sessions.yaml)
Managed by .agents/skills/lib.sh and the 6 tmux-agent-orchestrate-* skills.
Valid values (see lib.sh valid-status set):
| State | Meaning | Set by |
|---|---|---|
| `running` | tmux session active, agent running | create, resume |
| `stopped` | deliberately stopped via --capture-id/--reason/--graceful; conversation preserved for resume | stop (STOP mode) |
| `terminated` | hard-killed via --mode hard; tmux session destroyed | stop (hard mode), monitor reconcile |
| `archived` | soft-stopped via --mode soft; tmux left alive, YAML-only update | stop (soft mode) |
Job States (Registry — .hermes/jobs/<id>.json)
Managed by .agents/skills/tmux-agent-orchestrate-delegate-job/scripts/registry.py.
Valid values:
| State | Meaning | Set by |
|---|---|---|
| `pending` | job registered, agent not yet started | registry.py register |
| `running` | agent picked up the job, publishing events | publish_event.py --event started |
| `completed` | terminal event: agent finished successfully | publish_event.py --event completed |
| `error` | terminal event: agent failed | publish_event.py --event error |
| `cancelled` | job cancelled by orchestrator | registry.py cancel |
Key distinction: Session states track the tmux container lifecycle (create→stop→resume). Job states track the delegated work lifecycle (submit→run→complete/error). A single session can host multiple sequential jobs; a job runs within exactly one session.