# Understand-Anything: Project & Architecture Analysis Report

This report presents a comprehensive architectural analysis and security verification of the `tmux_agent_orchestration` orchestration workspace. Using the static analysis principles inspired by the `Understand-Anything` pipeline, we map out the codebase structure, evaluate the integrity of the design, identify critical defects/inconsistencies between implementation and documentation, and provide concrete technical recommendations.

---

## 1. Architectural Visualization

The following diagram illustrates the interaction between the orchestrator (Hermes/PM), the worker agents running inside TMUX sessions, and the decentralized event backplane (MQTT).

```mermaid
sequenceDiagram
    autonumber
    actor User as User / PM
    participant Registry as Job Registry (.hermes/jobs/)
    participant DB as Session Registry (SQLite WAL & YAML)
    participant TMUX as Tmux Workspace (Worker Session)
    participant MQTT as MQTT Broker (HiveMQ / Private)
    participant Sub as Job Subscriber (job_subscriber.py)
    participant Mon as Reconcile Monitor (reconcile.sh)

    User->>Registry: Register Job (registry.py register)
    Registry-->>User: Return Job ID (JID)
    User->>Sub: Spawn background subscriber (job_subscriber.py --job JID)
    Sub->>MQTT: Subscribe to topic (python/mqtt/jobs/JID/events)
    User->>TMUX: Create session & execute agent (create_session.sh)
    TMUX->>DB: Add running session (atomic_dump_yaml)

    Note over TMUX: Agent Starts execution
    TMUX->>MQTT: Publish 'started' event (publish_event.py)
    MQTT->>Sub: Deliver event (QoS 1)
    Sub->>Sub: Verify HMAC Signature
    Sub->>Sub: Log to events.ndjson & print stdout

    Note over TMUX: Agent does work & publishes checkpoints
    TMUX->>MQTT: Publish 'progress' / 'permission_required'
    MQTT->>Sub: Deliver event (QoS 1)

    Note over TMUX: Agent finishes execution
    TMUX->>MQTT: Publish 'completed' or 'error' (retained)
    MQTT->>Sub: Deliver terminal event (QoS 1)
    Sub->>Sub: Transition to Terminal State & Exit

    Note over Mon: Reconcile loop runs periodically
    Mon->>MQTT: Listen for terminal events
    MQTT->>Mon: Deliver terminal events
    Mon->>DB: Mark session terminated, kill tmux (reconcile.sh)
```

---

## 2. Core Mechanism Deep Dive & Verification

### 2.1 MQTT Backplane & Event Protocol

* **Wire Format**: Encoded in UTF-8 JSON matching `schema_version = 1`. It features monotonic `seq` indexing, `job_id`, `event` type, `timestamp`, `detail` description, and a `data` block for metadata.
* **QoS and Retention**: Event publishing and subscribing enforce **QoS 1 (At Least Once)** delivery. Terminal events (`completed`/`error`) utilize `retain=True` on the broker. This ensures that late-joining subscribers immediately receive the terminal state without missing the final outcome.
* **Network Handshake Isolation**: `publish_event.py` uses a short-lived connection pattern (connect, publish QoS 1, wait for PUBACK, disconnect) with exponential backoff retries. This limits long-lived socket starvation and mitigates socket exhaustion under high session concurrency.

### 2.2 SQLite WAL Session Database

* **Database & WAL Mode**: Session metadata has been migrated from a single-point-of-contention YAML file to a SQLite database (`.hermes/agent-sessions.db`) operating in **WAL (Write-Ahead Logging)** mode.
* **Concurrency Control**: Concurrency is managed via `BEGIN IMMEDIATE` transactions in `atomic_dump_yaml()`, which blocks concurrent write attempts at the database level rather than relying on brittle file system locks.
* **YAML Synchronization**: To maintain compatibility, `agent-sessions.yaml` is updated atomically (using `tempfile.mkstemp` and `os.replace`) only when a session transitions to a terminal state (`stopped`, `terminated`, `archived`), leaving active write traffic isolated within the SQLite WAL database.
* **NFS Fallback**: If a network mount (NFS/CIFS/SSHFS) is detected, `lib.sh` automatically falls back to `PRAGMA journal_mode=DELETE` to prevent WAL serialization crashes, as NFS does not support shared-memory mapped files (`-shm`) required by WAL.

### 2.3 HMAC-SHA256 Signature Verification

* **Signature Generation**: The publisher serializes the payload (excluding `data.hmac_sig`) into a canonical JSON string (with sorted keys and no whitespace separators) and signs it using HMAC-SHA256 with the job's secret `auth_token`.
* **Signature Verification**: `job_subscriber.py` intercepts payloads and calls `verify_hmac()`, which calculates the expected signature and compares it with the received signature using the constant-time `hmac.compare_digest` to prevent timing attacks.

---

## 3. Discovered Flaws & Documentation Inconsistencies

We have identified several critical gaps between the architecture specifications and the actual codebase implementation:

### ⚠️ Flaw 1: Documentation Mismatch in `job-protocol.md` (Security Risk if Followed)

* **Description**: Section 4 of `job-protocol.md` states:
  > *`auth_token` (the bonus field) — each job record carries a per-job `auth_token` (`secrets.token_urlsafe(32)`). The publisher copies it into `data.auth_token`; the subscriber compares it against the registry's expected token and drops mismatches.*
* **Reality in Code**: If the publisher copied the plaintext token into `data.auth_token`, it would be transmitted in plaintext across the MQTT network, exposing the secret token to any eavesdropper (especially on the public PoC broker).
* **Correction**: The code correctly implements **HMAC-SHA256 signatures** via `data.hmac_sig` and **never transmits the raw `auth_token`**. The documentation in `job-protocol.md` is obsolete and contradicts the secure implementation.

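To make the difference between the documented and the implemented scheme concrete, the signing flow from Section 2.3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's code: `canonical_bytes` and `sign_payload` are hypothetical helper names, the `verify_hmac` shown here only mirrors the behaviour described in the report (its real signature is not shown in the source), and the canonical form is assumed to be `json.dumps` with `sort_keys=True` and compact separators. It also assumes `data`, when present, is a JSON object.

```python
import copy
import hashlib
import hmac
import json


def canonical_bytes(payload: dict) -> bytes:
    """Serialize the payload without data.hmac_sig: sorted keys, no whitespace."""
    body = copy.deepcopy(payload)
    data = body.get("data")
    if isinstance(data, dict):
        data.pop("hmac_sig", None)
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_payload(payload: dict, auth_token: str) -> dict:
    """Return a copy of the payload carrying data.hmac_sig (hypothetical helper)."""
    signed = copy.deepcopy(payload)
    # Ensure "data" exists before signing so signer and verifier canonicalize alike.
    signed.setdefault("data", {})
    digest = hmac.new(auth_token.encode("utf-8"), canonical_bytes(signed), hashlib.sha256)
    signed["data"]["hmac_sig"] = digest.hexdigest()
    return signed


def verify_hmac(payload: dict, auth_token: str) -> bool:
    """Recompute the signature and compare it in constant time."""
    data = payload.get("data")
    received = data.get("hmac_sig") if isinstance(data, dict) else None
    if not isinstance(received, str):
        return False
    digest = hmac.new(auth_token.encode("utf-8"), canonical_bytes(payload), hashlib.sha256)
    # Compare as bytes so a non-ASCII signature is rejected instead of raising.
    return hmac.compare_digest(digest.hexdigest().encode("ascii"), received.encode("utf-8"))
```

In this sketch the token is used only as the HMAC key, so the published payload contains `data.hmac_sig` but never `data.auth_token`, which is the property the outdated text in `job-protocol.md` would break.
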
### ⚠️ Flaw 2: Missing Automated `auth_token` Generation & CLI Support

* **Description**: Both `MESSAGING.md` and `registry.md` state that when a job is registered, a cryptographic token is automatically generated using `secrets.token_urlsafe(32)`.
* **Reality in Code**: In `registry.py`, `register_job()` accepts `auth_token: Optional[str] = None` and defaults it to `None`. No automatic token generation is implemented. Furthermore, the CLI registration parser (`registry.py register`) does not expose any `--auth-token` flag, nor does it generate one internally. As a result, **every job registered via the CLI is created with `auth_token = null`**, defaulting the system to the unauthenticated/unsecured PoC mode.

### ⚠️ Flaw 3: Replay Attack Vulnerability for Non-Terminal Events

* **Description**: `job_subscriber.py` enforces a terminal state machine to ignore duplicate `completed`/`error` events, but it does **not validate sequence numbers (`seq`) or timestamp freshness** for non-terminal events (`progress`, `permission_required`).
* **Exploitation Vector**: An attacker sniffing network traffic (easy on HiveMQ's plaintext broker) can capture a signed `permission_required` or `progress` event and replay it repeatedly. Since the HMAC signature remains valid, `job_subscriber.py` will accept the replayed message, write it to the audit log (`events.ndjson`), and output it to stdout, potentially triggering downstream actions or corrupting the audit trail.

### ⚠️ Flaw 4: NFS Locking Vulnerability in Job Registry

* **Description**: While the session registry was successfully migrated to SQLite to circumvent NFS locking issues, the Job Registry in `.hermes/jobs/` still relies on `fcntl.flock` over a shared `.lock` file to coordinate job claims (`pick_pending`).
* **Impact**: If the project registry is located on a network-mounted file system, concurrent calls to `pick_pending` from multiple hosts could result in lock failures, leading to duplicate claims (split-brain) or corruption of the `.json` files during write operations.

---

## 4. Technical Recommendations

To address these vulnerabilities and align the codebase with the target production security standards, we recommend the following changes:

### 1. Correct the Protocol Documentation

Update `job-protocol.md` to match the actual HMAC-SHA256 signature scheme, removing all references to transmitting the plaintext token in `data.auth_token`.

### 2. Implement Automated Token Generation in `registry.py`

Modify `register_job` to automatically generate a cryptographically secure token when running in production mode, and add the `--auth-token` argument to the CLI.

*Proposed change in `registry.py`*:

```python
# In registry.py:register_job
import secrets

# Generate token if not provided (production mode default)
if auth_token is None:
    # If broker is secure/private, generate a token by default
    if broker.get("tls") or broker.get("username"):
        auth_token = secrets.token_urlsafe(32)
```

### 3. Harden `job_subscriber.py` Against Replay Attacks

Implement monotonic sequence number tracking and timestamp freshness checks in `_Watcher.on_message`.

*Proposed change in `job_subscriber.py`*:

```python
# In _Watcher inside job_subscriber.py
# Imports and logger shown for completeness; the module presumably has them already.
import logging
import queue
from typing import Dict, Optional, Set

logger = logging.getLogger(__name__)


class _Watcher:
    def __init__(self, expected_job_ids: Set[str], expected_tokens: Dict[str, Optional[str]]):
        self.events = queue.Queue()
        self.expected = set(expected_job_ids)
        self.tokens = expected_tokens
        self.last_seq: Dict[str, int] = {}  # Track the highest accepted seq per job

    def on_message(self, _client, _userdata, msg) -> None:
        # ... (after json parse and schema check) ...
        jid = payload.get("job_id")
        seq = payload.get("seq", 0)

        # 1. Monotonic sequence check: drop anything at or below the last accepted seq
        if jid in self.last_seq and seq <= self.last_seq[jid]:
            logger.warning("drop replayed/duplicate event seq=%r for job %s", seq, jid)
            return

        # 2. Timestamp freshness check (e.g., 60s window)
        # (Optional but recommended for strict production environments)

        # ... (after HMAC verification succeeds) ...
        # Record seq only for verified events so forged messages cannot advance it
        self.last_seq[jid] = seq
        # ...
```

### 4. Migrate the Job Registry to the SQLite DB

To remove the dependency on `fcntl.flock` over a shared `.lock` file, merge the Job Registry data into the SQLite database. Define a `jobs` table with a schema similar to:

```sql
CREATE TABLE IF NOT EXISTS jobs (
    job_id TEXT PRIMARY KEY,
    status TEXT,
    agent_session TEXT,
    created_at TEXT,
    data JSON
);
```

Replace the file-based `fcntl.flock` in `registry.py` with SQL transactions (`BEGIN IMMEDIATE`), so that job claims are serialized and atomic at the database level. SQLite itself still depends on the file system's locking primitives, so a registry stored on a network mount should be covered by the same `journal_mode=DELETE` fallback described in Section 2.2 and validated on the target file system.

---

*Report compiled on 2026-06-21 by Antigravity Reviewer Agent.*