docs: commit analysis report and instruction documents

2026-06-21 10:07:38 +00:00
parent 5d555dbf6d
commit ec92b6c3fa
2 changed files with 230 additions and 0 deletions
@@ -0,0 +1,67 @@
 # CLAUDE.md
 Behavioral guidelines to reduce common LLM coding mistakes. Merge with project-specific instructions as needed.
 **Tradeoff:** These guidelines bias toward caution over speed. For trivial tasks, use judgment.
 ## 1. Think Before Coding
 **Don't assume. Don't hide confusion. Surface tradeoffs.**
 Before implementing:
 - State your assumptions explicitly. If uncertain, ask.
 - If multiple interpretations exist, present them - don't pick silently.
 - If a simpler approach exists, say so. Push back when warranted.
 - If something is unclear, stop. Name what's confusing. Ask.
 ## 2. Simplicity First
 **Minimum code that solves the problem. Nothing speculative.**
 - No features beyond what was asked.
 - No abstractions for single-use code.
 - No "flexibility" or "configurability" that wasn't requested.
 - No error handling for impossible scenarios.
 - If you write 200 lines and it could be 50, rewrite it.
 Ask yourself: "Would a senior engineer say this is overcomplicated?" If yes, simplify.
 ## 3. Surgical Changes
 **Touch only what you must. Clean up only your own mess.**
 When editing existing code:
 - Don't "improve" adjacent code, comments, or formatting.
 - Don't refactor things that aren't broken.
 - Match existing style, even if you'd do it differently.
 - If you notice unrelated dead code, mention it - don't delete it.
 When your changes create orphans:
 - Remove imports/variables/functions that YOUR changes made unused.
 - Don't remove pre-existing dead code unless asked.
 The test: Every changed line should trace directly to the user's request.
 ## 4. Goal-Driven Execution
 **Define success criteria. Loop until verified.**
 Transform tasks into verifiable goals:
 - "Add validation" → "Write tests for invalid inputs, then make them pass"
 - "Fix the bug" → "Write a test that reproduces it, then make it pass"
 - "Refactor X" → "Ensure tests pass before and after"
 For multi-step tasks, state a brief plan:
 ```
 1. [Step] → verify: [check]
 2. [Step] → verify: [check]
 3. [Step] → verify: [check]
 ```
 Strong success criteria let you loop independently. Weak criteria ("make it work") require constant clarification.
 ---
 **These guidelines are working if:** fewer unnecessary changes in diffs, fewer rewrites due to overcomplication, and clarifying questions come before implementation rather than after mistakes.
 Read AGENT.md first before working and follow the instructions for orchestration. 
@@ -0,0 +1,163 @@
 # Understand-Anything: Project & Architecture Analysis Report
 This report presents a comprehensive architectural analysis and security verification of the `advanced_multi_agent` orchestration workspace. Using the static analysis principles inspired by the `Understand-Anything` pipeline, we map out the codebase structure, evaluate the integrity of the design, identify critical defects/inconsistencies between implementation and documentation, and provide concrete technical recommendations.
 ---
 ## 1. Architectural Visualization
 The following diagram illustrates the interaction between the orchestrator (Hermes/PM), the worker agents running inside TMUX sessions, and the decentralized event backplane (MQTT).
 ```mermaid
 sequenceDiagram
    autonumber
    actor User as User / PM
    participant Registry as Job Registry (.hermes/jobs/)
    participant DB as Session Registry (SQLite WAL & YAML)
    participant TMUX as Tmux Workspace (Worker Session)
    participant MQTT as MQTT Broker (HiveMQ / Private)
    participant Sub as Job Subscriber (job_subscriber.py)
    participant Mon as Reconcile Monitor (reconcile.sh)
    User->>Registry: Register Job (registry.py register)
    Registry-->>User: Return Job ID (JID)
    User->>Sub: Spawn background subscriber (job_subscriber.py --job JID)
    Sub->>MQTT: Subscribe to topic (python/mqtt/jobs/JID/events)
    User->>TMUX: Create session & execute agent (create_session.sh)
    TMUX->>DB: Add running session (atomic_dump_yaml)
    Note over TMUX: Agent Starts execution
    TMUX->>MQTT: Publish 'started' event (publish_event.py)
    MQTT->>Sub: Deliver event (QoS 1)
    Sub->>Sub: Verify HMAC Signature
    Sub->>Sub: Log to events.ndjson & print stdout
    Note over TMUX: Agent does work & publishes checkpoints
    TMUX->>MQTT: Publish 'progress' / 'permission_required'
    MQTT->>Sub: Deliver event (QoS 1)
    Note over TMUX: Agent finishes execution
    TMUX->>MQTT: Publish 'completed' or 'error' (retained)
    MQTT->>Sub: Deliver terminal event (QoS 1)
    Sub->>Sub: Transition to Terminal State & Exit
    Note over Mon: Reconcile loop runs periodically
    Mon->>MQTT: Listen for terminal events
    MQTT->>Mon: Deliver terminal events
    Mon->>DB: Mark session terminated, kill tmux (reconcile.sh)
 ```
 ---
 ## 2. Core Mechanism Deep Dive & Verification
 ### 2.1 MQTT Backplane & Event Protocol
 * **Wire Format**: Encoded in UTF-8 JSON matching `schema_version = 1`. It features monotonic `seq` indexing, `job_id`, `event` type, `timestamp`, `detail` description, and a `data` block for metadata.
 * **QoS and Retention**: Event publishing and subscribing enforce **QoS 1 (At Least Once)** delivery. Terminal events (`completed`/`error`) utilize `retain=True` on the broker. This ensures that late-joining subscribers immediately receive the terminal state without missing the final outcome.
 * **Network Handshake Isolation**: `publish_event.py` uses a short-lived connection pattern (connect, publish QoS 1, wait for PUBACK, disconnect) with exponential backoff retries. This limits long-lived socket starvation and mitigates socket exhaustion under high session concurrency.
 ### 2.2 SQLite WAL Session Database
 * **Database & WAL Mode**: Session metadata has been migrated from a single-point-of-contention YAML file to a SQLite database (`.hermes/agent-sessions.db`) operating in **WAL (Write-Ahead Logging)** mode.
 * **Concurrency Control**: Concurrency is managed via `BEGIN IMMEDIATE` transactions in `atomic_dump_yaml()`, which blocks concurrent write attempts at the database level rather than relying on brittle file system locks.
 * **YAML Synchronization**: To maintain compatibility, `agent-sessions.yaml` is updated atomically (using `tempfile.mkstemp` and `os.replace`) only when a session transitions to a terminal state (`stopped`, `terminated`, `archived`), leaving active write traffic isolated within the SQLite WAL database.
 * **NFS Fallback**: If a network mount (NFS/CIFS/SSHFS) is detected, `lib.sh` automatically falls back to `PRAGMA journal_mode=DELETE` to prevent WAL serialization crashes, as NFS does not support shared-memory mapped files (`-shm`) required by WAL.
 ### 2.3 HMAC-SHA256 Signature Verification
 * **Signature Generation**: The publisher serializes the payload (excluding `data.hmac_sig`) into a canonical JSON string (with sorted keys and no whitespace separators) and signs it using HMAC-SHA256 with the job's secret `auth_token`.
 * **Signature Verification**: `job_subscriber.py` intercepts payloads and calls `verify_hmac()`, which calculates the expected signature and compares it with the received signature using the constant-time `hmac.compare_digest` to prevent timing attacks.
 ---
 ## 3. Discovered Flaws & Documentation Inconsistencies
 We have identified several critical gaps between the architecture specifications and the actual codebase implementation:
 ### ⚠️ Flaw 1: Documentation Mismatch in `job-protocol.md` (Security Risk if Followed)
 * **Description**: Section 4 of `job-protocol.md` states:
  > *`auth_token` (the bonus field) — each job record carries a per-job `auth_token` (`secrets.token_urlsafe(32)`). The publisher copies it into `data.auth_token`; the subscriber compares it against the registry's expected token and drops mismatches.*
 * **Reality in Code**: If the publisher copied the plaintext token into `data.auth_token`, it would be transmitted in plaintext across the MQTT network, exposing the secret token to any eavesdropper (especially on the public PoC broker). 
 * **Correction**: The code correctly implements **HMAC-SHA256 signatures** via `data.hmac_sig` and **never transmits the raw `auth_token`**. The documentation in `job-protocol.md` is obsolete and contradicts the secure implementation.
 ### ⚠️ Flaw 2: Missing Automated `auth_token` Generation & CLI Support
 * **Description**: Both `MESSAGING.md` and `registry.md` state that when a job is registered, a cryptographic token is automatically generated using `secrets.token_urlsafe(32)`.
 * **Reality in Code**: In `registry.py`, `register_job()` accepts `auth_token: Optional[str] = None` and defaults it to `None`. No automatic token generation is implemented. Furthermore, the CLI registration parser (`registry.py register`) does not expose any `--auth-token` flag, nor does it generate one internally. As a result, **every job registered via the CLI is created with `auth_token = null`**, defaulting the system to the unauthenticated/unsecured PoC mode.
 ### ⚠️ Flaw 3: Replay Attack Vulnerability for Non-Terminal Events
 * **Description**: `job_subscriber.py` enforces a terminal state machine to ignore duplicate `completed`/`error` events, but it does **not validate sequence numbers (`seq`) or timestamp freshness** for non-terminal events (`progress`, `permission_required`).
 * **Exploitation Vector**: An attacker sniffing network traffic (easy on HiveMQ's plaintext broker) can capture a signed `permission_required` or `progress` event and replay it repeatedly. Since the HMAC signature remains valid, `job_subscriber.py` will accept the replayed message, write it to the audit log (`events.ndjson`), and output it to stdout, potentially triggering downstream actions or corrupting the audit trail.
 ### ⚠️ Flaw 4: NFS locking Vulnerability in Job Registry
 * **Description**: While the session registry was successfully migrated to SQLite to circumvent NFS locking issues, the Job Registry in `.hermes/jobs/` still relies on `fcntl.flock` over a shared `.lock` file to coordinate job claims (`pick_pending`).
 * **Impact**: If the project registry is located on a network-mounted file system, concurrent calls to `pick_pending` from multiple hosts could result in lock failures, leading to duplicate claims (split-brain) or corruption of the `<job_id>.json` files during write operations.
 ---
 ## 4. Technical Recommendations
 To address these vulnerabilities and align the codebase with the target production security standards, we recommend the following changes:
 ### 1. Correct the Protocol Documentation
 Update `job-protocol.md` to match the actual HMAC-SHA256 signature scheme, removing all references to transmitting the plaintext token in `data.auth_token`.
 ### 2. Implement Automated Token Generation in `registry.py`
 Modify `register_job` to automatically generate a cryptographically secure token when running in production mode, and add the `--auth-token` argument to the CLI.
 *Proposed change in `registry.py`*:
 ```python
 # In registry.py:register_job
 import secrets
 # Generate token if not provided (production mode default)
 if auth_token is None:
    # If broker is secure/private, generate a token by default
    if broker.get("tls") or broker.get("username"):
        auth_token = secrets.token_urlsafe(32)
 ```
 ### 3. Harden `job_subscriber.py` Against Replay Attacks
 Implement monotonic sequence number tracking and timestamp freshness checks in `_Watcher.on_message`.
 *Proposed change in `job_subscriber.py`*:
 ```python
 # In _Watcher inside job_subscriber.py
 def __init__(self, expected_job_ids: Set[str], expected_tokens: Dict[str, Optional[str]]):
    self.events = queue.Queue()
    self.expected = set(expected_job_ids)
    self.tokens = expected_tokens
    self.last_seq: Dict[str, int] = {}  # Track sequence numbers per job
 def on_message(self, _client, _userdata, msg) -> None:
    # ... (after json parse and schema check) ...
    jid = payload.get("job_id")
    seq = payload.get("seq", 0)
    # 1. Monotonic Sequence Check
    if jid in self.last_seq and seq <= self.last_seq[jid]:
        logger.warning("drop replayed/duplicate event seq=%r for job %s", seq, jid)
        return
    # 2. Timestamp freshness check (e.g., 60s window)
    # (Optional but recommended for strict production environments)
    # ... (after HMAC verification succeeds) ...
    self.last_seq[jid] = seq
    # ...
 ```
 ### 4. Migrate the Job Registry to the SQLite DB
 To eliminate NFS locking issues completely, merge the Job Registry data into the SQLite database. Define a `jobs` table with a schema similar to:
 ```sql
 CREATE TABLE IF NOT EXISTS jobs (
    job_id TEXT PRIMARY KEY,
    status TEXT,
    agent_session TEXT,
    created_at TEXT,
    data JSON
 );
 ```
 Replace the file-based `fcntl.flock` in `registry.py` with SQL transactions (`BEGIN IMMEDIATE`), ensuring absolute atomicity and locking security regardless of the underlying filesystem type.
 ---
 *Report compiled on 2026-06-21 by Antigravity Reviewer Agent.*