docs: commit analysis report and instruction documents

2026-06-21 10:07:38 +00:00
parent 5d555dbf6d
commit ec92b6c3fa
2 changed files with 230 additions and 0 deletions
@@ -0,0 +1,163 @@
# Understand-Anything: Project & Architecture Analysis Report
This report presents an architectural analysis and security verification of the `advanced_multi_agent` orchestration workspace. Applying static-analysis principles inspired by the `Understand-Anything` pipeline, we map the codebase structure, evaluate the integrity of the design, identify critical defects and inconsistencies between implementation and documentation, and give concrete technical recommendations.
---
## 1. Architectural Visualization
The following diagram illustrates the interaction between the orchestrator (Hermes/PM), the worker agents running inside TMUX sessions, and the decentralized event backplane (MQTT).
```mermaid
sequenceDiagram
autonumber
actor User as User / PM
participant Registry as Job Registry (.hermes/jobs/)
participant DB as Session Registry (SQLite WAL & YAML)
participant TMUX as Tmux Workspace (Worker Session)
participant MQTT as MQTT Broker (HiveMQ / Private)
participant Sub as Job Subscriber (job_subscriber.py)
participant Mon as Reconcile Monitor (reconcile.sh)
User->>Registry: Register Job (registry.py register)
Registry-->>User: Return Job ID (JID)
User->>Sub: Spawn background subscriber (job_subscriber.py --job JID)
Sub->>MQTT: Subscribe to topic (python/mqtt/jobs/JID/events)
User->>TMUX: Create session & execute agent (create_session.sh)
TMUX->>DB: Add running session (atomic_dump_yaml)
Note over TMUX: Agent Starts execution
TMUX->>MQTT: Publish 'started' event (publish_event.py)
MQTT->>Sub: Deliver event (QoS 1)
Sub->>Sub: Verify HMAC Signature
Sub->>Sub: Log to events.ndjson & print stdout
Note over TMUX: Agent does work & publishes checkpoints
TMUX->>MQTT: Publish 'progress' / 'permission_required'
MQTT->>Sub: Deliver event (QoS 1)
Note over TMUX: Agent finishes execution
TMUX->>MQTT: Publish 'completed' or 'error' (retained)
MQTT->>Sub: Deliver terminal event (QoS 1)
Sub->>Sub: Transition to Terminal State & Exit
Note over Mon: Reconcile loop runs periodically
Mon->>MQTT: Listen for terminal events
MQTT->>Mon: Deliver terminal events
Mon->>DB: Mark session terminated, kill tmux (reconcile.sh)
```
---
## 2. Core Mechanism Deep Dive & Verification
### 2.1 MQTT Backplane & Event Protocol
* **Wire Format**: Encoded in UTF-8 JSON matching `schema_version = 1`. It features monotonic `seq` indexing, `job_id`, `event` type, `timestamp`, `detail` description, and a `data` block for metadata.
* **QoS and Retention**: Event publishing and subscribing enforce **QoS 1 (At Least Once)** delivery. Terminal events (`completed`/`error`) utilize `retain=True` on the broker. This ensures that late-joining subscribers immediately receive the terminal state without missing the final outcome.
* **Network Handshake Isolation**: `publish_event.py` uses a short-lived connection pattern (connect, publish QoS 1, wait for PUBACK, disconnect) with exponential backoff retries. Because no connection outlives a single publish, workers do not hold idle sockets open, which reduces the risk of socket exhaustion under high session concurrency.
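The envelope and retry behaviour described above can be sketched as follows. This is a minimal illustration, not the code in `publish_event.py`: the function names, the epoch-seconds `timestamp` format, the retry parameters, and the `send` callable (standing in for one connect/publish/PUBACK/disconnect cycle) are all assumptions. Only the field names, the topic layout, and the retain rule come from this report.

```python
import json
import time
from typing import Any, Callable, Dict, Optional

SCHEMA_VERSION = 1
TERMINAL_EVENTS = {"completed", "error"}


def topic_for(job_id: str) -> str:
    """Topic layout as shown in the sequence diagram."""
    return f"python/mqtt/jobs/{job_id}/events"


def build_event(job_id: str, seq: int, event: str, detail: str = "",
                data: Optional[Dict[str, Any]] = None) -> bytes:
    """Encode one event envelope as UTF-8 JSON."""
    envelope = {
        "schema_version": SCHEMA_VERSION,
        "seq": seq,
        "job_id": job_id,
        "event": event,
        "timestamp": time.time(),  # assumption: epoch seconds
        "detail": detail,
        "data": data or {},
    }
    return json.dumps(envelope).encode("utf-8")


def should_retain(event: str) -> bool:
    """Only terminal events are published with retain=True."""
    return event in TERMINAL_EVENTS


def publish_with_backoff(send: Callable[[bytes], bool], payload: bytes,
                         attempts: int = 4, base_delay: float = 0.5,
                         sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call send() until it succeeds, doubling the delay after each failure.

    send() is a hypothetical callable that performs one short-lived
    connect / publish QoS 1 / wait for PUBACK / disconnect cycle and
    returns True once the broker has acknowledged the message.
    """
    for attempt in range(attempts):
        if send(payload):
            return True
        if attempt < attempts - 1:
            sleep(base_delay * (2 ** attempt))
    return False
```

Injecting `sleep` keeps the retry loop testable without real delays. A production publisher would likely also add jitter to the backoff.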
### 2.2 SQLite WAL Session Database
* **Database & WAL Mode**: Session metadata has been migrated from a single-point-of-contention YAML file to a SQLite database (`.hermes/agent-sessions.db`) operating in **WAL (Write-Ahead Logging)** mode.
* **Concurrency Control**: Concurrency is managed via `BEGIN IMMEDIATE` transactions in `atomic_dump_yaml()`. Each transaction takes the write lock up front, so competing writers are serialized at the database level rather than by brittle file system locks.
* **YAML Synchronization**: To maintain compatibility, `agent-sessions.yaml` is updated atomically (using `tempfile.mkstemp` and `os.replace`) only when a session transitions to a terminal state (`stopped`, `terminated`, `archived`), leaving active write traffic isolated within the SQLite WAL database.
* **NFS Fallback**: If a network mount (NFS/CIFS/SSHFS) is detected, `lib.sh` automatically falls back to `PRAGMA journal_mode=DELETE`. WAL mode depends on a shared-memory index file (`-shm`) that all connections must map, which does not work reliably across network filesystems.
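As a rough sketch of the pattern in this section, the snippet below opens a database in WAL mode (or `DELETE` mode when the caller reports a network mount), serializes writes with `BEGIN IMMEDIATE`, and swaps a text file into place atomically. The table layout, function names, and the `network_mount` flag are assumptions for illustration. The real `atomic_dump_yaml()` and the mount detection in `lib.sh` are not reproduced here.

```python
import os
import sqlite3
import tempfile


def open_session_db(path: str, network_mount: bool = False) -> sqlite3.Connection:
    """Open the session DB; isolation_level=None lets us issue BEGIN/COMMIT ourselves."""
    conn = sqlite3.connect(path, timeout=5.0, isolation_level=None)
    # Fall back to the rollback journal when WAL cannot be used safely.
    mode = "DELETE" if network_mount else "WAL"
    conn.execute(f"PRAGMA journal_mode={mode}")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sessions ("
        "name TEXT PRIMARY KEY, status TEXT NOT NULL)"  # hypothetical schema
    )
    return conn


def set_status(conn: sqlite3.Connection, name: str, status: str) -> None:
    """Write one session row under a write lock taken at transaction start."""
    conn.execute("BEGIN IMMEDIATE")
    try:
        conn.execute(
            "INSERT OR REPLACE INTO sessions(name, status) VALUES (?, ?)",
            (name, status),
        )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise


def atomic_write_text(path: str, text: str) -> None:
    """Write to a temp file in the same directory, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            fh.write(text)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp, path)  # atomic when source and target share a filesystem
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
```

Creating the temp file in the target's own directory matters: `os.replace` is only atomic when both paths are on the same filesystem.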
### 2.3 HMAC-SHA256 Signature Verification
* **Signature Generation**: The publisher serializes the payload (excluding `data.hmac_sig`) into a canonical JSON string (with sorted keys and no whitespace separators) and signs it using HMAC-SHA256 with the job's secret `auth_token`.
* **Signature Verification**: `job_subscriber.py` intercepts payloads and calls `verify_hmac()`, which calculates the expected signature and compares it with the received signature using the constant-time `hmac.compare_digest` to prevent timing attacks.
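The signing and verification steps can be sketched as below. This is an illustrative reconstruction under assumptions, not the project's code: the hex encoding of the signature and the helper names `canonical_bytes` and `sign_event` are hypothetical, while `verify_hmac`, `data.hmac_sig`, the sorted-key compact JSON, and `hmac.compare_digest` follow the description above.

```python
import copy
import hashlib
import hmac
import json
from typing import Any, Dict


def canonical_bytes(payload: Dict[str, Any]) -> bytes:
    """Serialize the payload without data.hmac_sig, sorted keys, no whitespace."""
    unsigned = copy.deepcopy(payload)
    if isinstance(unsigned.get("data"), dict):
        unsigned["data"].pop("hmac_sig", None)
    return json.dumps(unsigned, sort_keys=True, separators=(",", ":")).encode("utf-8")


def _digest(payload: Dict[str, Any], auth_token: str) -> str:
    # Assumption: the signature is carried as a hex string.
    return hmac.new(auth_token.encode("utf-8"), canonical_bytes(payload),
                    hashlib.sha256).hexdigest()


def sign_event(payload: Dict[str, Any], auth_token: str) -> Dict[str, Any]:
    """Return a signed copy; the caller's payload is left untouched."""
    signed = copy.deepcopy(payload)
    signed.setdefault("data", {})  # ensure signer and verifier see the same shape
    signed["data"]["hmac_sig"] = _digest(signed, auth_token)
    return signed


def verify_hmac(payload: Dict[str, Any], auth_token: str) -> bool:
    """Recompute the signature and compare it in constant time."""
    data = payload.get("data")
    received = data.get("hmac_sig") if isinstance(data, dict) else None
    if not isinstance(received, str):
        return False
    expected = _digest(payload, auth_token)
    return hmac.compare_digest(expected.encode("utf-8"), received.encode("utf-8"))
```

Both sides must canonicalize identically. Any difference in key order, separators, or the presence of an empty `data` block changes the bytes being signed and makes valid events fail verification.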
---
## 3. Discovered Flaws & Documentation Inconsistencies
We identified four critical gaps between the architecture specifications and the actual codebase implementation:
### ⚠️ Flaw 1: Documentation Mismatch in `job-protocol.md` (Security Risk if Followed)
* **Description**: Section 4 of `job-protocol.md` states:
> *`auth_token` (the bonus field) — each job record carries a per-job `auth_token` (`secrets.token_urlsafe(32)`). The publisher copies it into `data.auth_token`; the subscriber compares it against the registry's expected token and drops mismatches.*
* **Risk if Followed**: If the publisher copied the token into `data.auth_token`, the secret would cross the MQTT network in plaintext and be exposed to any eavesdropper (especially on the public PoC broker).
* **Reality in Code**: The code implements **HMAC-SHA256 signatures** via `data.hmac_sig` and **never transmits the raw `auth_token`**. The documentation in `job-protocol.md` is obsolete and contradicts the secure implementation.
### ⚠️ Flaw 2: Missing Automated `auth_token` Generation & CLI Support
* **Description**: Both `MESSAGING.md` and `registry.md` state that when a job is registered, a cryptographic token is automatically generated using `secrets.token_urlsafe(32)`.
* **Reality in Code**: In `registry.py`, `register_job()` accepts `auth_token: Optional[str] = None` and defaults it to `None`. No automatic token generation is implemented. Furthermore, the CLI registration parser (`registry.py register`) does not expose any `--auth-token` flag, nor does it generate one internally. As a result, **every job registered via the CLI is created with `auth_token = null`**, defaulting the system to the unauthenticated/unsecured PoC mode.
### ⚠️ Flaw 3: Replay Attack Vulnerability for Non-Terminal Events
* **Description**: `job_subscriber.py` enforces a terminal state machine to ignore duplicate `completed`/`error` events, but it does **not validate sequence numbers (`seq`) or timestamp freshness** for non-terminal events (`progress`, `permission_required`).
* **Exploitation Vector**: An attacker sniffing network traffic (easy when the public HiveMQ broker is reached over an unencrypted connection) can capture a signed `permission_required` or `progress` event and replay it repeatedly. Since the HMAC signature remains valid, `job_subscriber.py` will accept the replayed message, write it to the audit log (`events.ndjson`), and output it to stdout, potentially triggering downstream actions or corrupting the audit trail.
### ⚠️ Flaw 4: NFS Locking Vulnerability in Job Registry
* **Description**: While the session registry was successfully migrated to SQLite to circumvent NFS locking issues, the Job Registry in `.hermes/jobs/` still relies on `fcntl.flock` over a shared `.lock` file to coordinate job claims (`pick_pending`).
* **Impact**: If the project registry is located on a network-mounted file system, concurrent calls to `pick_pending` from multiple hosts could result in lock failures, leading to duplicate claims (split-brain) or corruption of the `<job_id>.json` files during write operations.
---
## 4. Technical Recommendations
To address these vulnerabilities and align the codebase with the target production security standards, we recommend the following changes:
### 1. Correct the Protocol Documentation
Update `job-protocol.md` to match the actual HMAC-SHA256 signature scheme, removing all references to transmitting the plaintext token in `data.auth_token`.
### 2. Implement Automated Token Generation in `registry.py`
Modify `register_job` to automatically generate a cryptographically secure token when running in production mode, and add the `--auth-token` argument to the CLI.
*Proposed change in `registry.py`*:
```python
# In registry.py:register_job
import secrets

# Generate a token if none was provided (production-mode default)
if auth_token is None:
    # If the broker is secure/private, generate a token by default
    if broker.get("tls") or broker.get("username"):
        auth_token = secrets.token_urlsafe(32)
```
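A self-contained version of this recommendation, including the CLI flag, might look like the sketch below. The helper names `resolve_auth_token` and `build_register_parser` are hypothetical, and the shape of the `broker` dict (keys `tls` and `username`) is assumed from the proposed change above rather than taken from `registry.py`.

```python
import argparse
import secrets
from typing import Any, Dict, Optional


def resolve_auth_token(auth_token: Optional[str],
                       broker: Dict[str, Any]) -> Optional[str]:
    """Keep an explicit token; otherwise generate one for secured brokers."""
    if auth_token is not None:
        return auth_token
    # Assumption: TLS or a username marks a private/production broker.
    if broker.get("tls") or broker.get("username"):
        return secrets.token_urlsafe(32)
    return None  # unauthenticated PoC mode


def build_register_parser() -> argparse.ArgumentParser:
    """Hypothetical parser for `registry.py register` with the new flag."""
    parser = argparse.ArgumentParser(prog="registry.py register")
    parser.add_argument(
        "--auth-token",
        default=None,
        help="per-job secret; generated automatically for secured brokers if omitted",
    )
    return parser
```

Passing a secret on the command line can expose it through the process list and shell history, so auto-generation is probably the safer default, with the flag reserved for testing.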
### 3. Harden `job_subscriber.py` Against Replay Attacks
Implement monotonic sequence number tracking and timestamp freshness checks in `_Watcher.on_message`.
*Proposed change in `job_subscriber.py`*:
```python
# In _Watcher inside job_subscriber.py
def __init__(self, expected_job_ids: Set[str], expected_tokens: Dict[str, Optional[str]]):
    self.events = queue.Queue()
    self.expected = set(expected_job_ids)
    self.tokens = expected_tokens
    self.last_seq: Dict[str, int] = {}  # Highest verified seq seen per job

def on_message(self, _client, _userdata, msg) -> None:
    # ... (after json parse and schema check) ...
    jid = payload.get("job_id")
    seq = payload.get("seq", 0)

    # 1. Monotonic sequence check (a non-integer seq is rejected outright,
    #    otherwise the comparison below could raise TypeError)
    if not isinstance(seq, int) or (jid in self.last_seq and seq <= self.last_seq[jid]):
        logger.warning("drop invalid/replayed/duplicate event seq=%r for job %s", seq, jid)
        return

    # 2. Timestamp freshness check (e.g., 60s window)
    #    (Optional but recommended for strict production environments)

    # ... (after HMAC verification succeeds) ...
    # Record seq only for verified events so forged messages cannot advance it
    self.last_seq[jid] = seq
    # ...
```
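The two checks can also be isolated into a small helper that is easy to unit test. The sketch below is one possible shape, not the project's implementation: the `ReplayGuard` class, its method names, the 60-second default window, and the assumption that `timestamp` is epoch seconds are all hypothetical.

```python
import time
from typing import Any, Callable, Dict


class ReplayGuard:
    """Tracks the highest accepted seq per job and rejects stale timestamps."""

    def __init__(self, max_age_s: float = 60.0,
                 clock: Callable[[], float] = time.time) -> None:
        self.max_age_s = max_age_s
        self.clock = clock
        self.last_seq: Dict[str, int] = {}

    def check(self, payload: Dict[str, Any]) -> bool:
        """Return True if the event looks fresh; does not record anything."""
        jid = payload.get("job_id")
        seq = payload.get("seq")
        ts = payload.get("timestamp")
        if not isinstance(jid, str):
            return False
        # bool is a subclass of int, so exclude it explicitly.
        if not isinstance(seq, int) or isinstance(seq, bool):
            return False
        if not isinstance(ts, (int, float)) or isinstance(ts, bool):
            return False
        if abs(self.clock() - ts) > self.max_age_s:
            return False
        return jid not in self.last_seq or seq > self.last_seq[jid]

    def commit(self, payload: Dict[str, Any]) -> None:
        """Record the seq; call only after HMAC verification has succeeded."""
        self.last_seq[payload["job_id"]] = payload["seq"]
```

Splitting `check` from `commit` preserves the ordering in the proposed change: a message that fails signature verification never advances the counter. One caveat worth weighing is that retained terminal events delivered to late-joining subscribers may be older than the freshness window, so they may need an exemption or a wider window.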
### 4. Migrate the Job Registry to the SQLite DB
To remove the dependency on `fcntl.flock` and its failure modes on network mounts, merge the Job Registry data into the SQLite database. Define a `jobs` table with a schema similar to:
```sql
CREATE TABLE IF NOT EXISTS jobs (
    job_id TEXT PRIMARY KEY,
    status TEXT,
    agent_session TEXT,
    created_at TEXT,
    data JSON
);
```
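Replace the file-based `fcntl.flock` in `registry.py` with SQL transactions (`BEGIN IMMEDIATE`) so that each job claim is a single atomic read-and-update. SQLite still relies on the filesystem's own file locking, so a database on a network mount needs the same journal-mode fallback as the session registry, and concurrent writers from multiple hosts should be tested before this is relied on.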
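A claim operation built on such a table could be sketched as follows. The function names, the `'pending'` and `'claimed'` status values, and the oldest-first ordering are assumptions for illustration. The existing `pick_pending` in `registry.py` may use different states and selection rules.

```python
import sqlite3
from typing import Optional


def open_jobs_db(path: str) -> sqlite3.Connection:
    """Open the DB in autocommit mode so transactions are controlled explicitly."""
    conn = sqlite3.connect(path, timeout=5.0, isolation_level=None)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        "job_id TEXT PRIMARY KEY, status TEXT, agent_session TEXT, "
        "created_at TEXT, data TEXT)"  # data holds a JSON-encoded string
    )
    return conn


def pick_pending(conn: sqlite3.Connection, agent_session: str) -> Optional[str]:
    """Claim the oldest pending job, or return None if there is none."""
    # The write lock is taken here, before the SELECT, so two claimers
    # cannot both read the same pending row.
    conn.execute("BEGIN IMMEDIATE")
    try:
        row = conn.execute(
            "SELECT job_id FROM jobs WHERE status = 'pending' "
            "ORDER BY created_at, job_id LIMIT 1"
        ).fetchone()
        if row is None:
            conn.execute("COMMIT")
            return None
        conn.execute(
            "UPDATE jobs SET status = 'claimed', agent_session = ? WHERE job_id = ?",
            (agent_session, row[0]),
        )
        conn.execute("COMMIT")
        return row[0]
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

A second caller that arrives while the transaction is open waits up to the connection `timeout` and then sees the row already marked `claimed`, which is what prevents duplicate claims.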
---
*Report compiled on 2026-06-21 by Antigravity Reviewer Agent.*