docs: commit analysis report and instruction documents
This commit is contained in:
@@ -0,0 +1,67 @@
|
|||||||
|
# CLAUDE.md
|
||||||
|
|
||||||
|
Behavioral guidelines to reduce common LLM coding mistakes. Merge with project-specific instructions as needed.
|
||||||
|
|
||||||
|
**Tradeoff:** These guidelines bias toward caution over speed. For trivial tasks, use judgment.
|
||||||
|
|
||||||
|
## 1. Think Before Coding
|
||||||
|
|
||||||
|
**Don't assume. Don't hide confusion. Surface tradeoffs.**
|
||||||
|
|
||||||
|
Before implementing:
|
||||||
|
- State your assumptions explicitly. If uncertain, ask.
|
||||||
|
- If multiple interpretations exist, present them - don't pick silently.
|
||||||
|
- If a simpler approach exists, say so. Push back when warranted.
|
||||||
|
- If something is unclear, stop. Name what's confusing. Ask.
|
||||||
|
|
||||||
|
## 2. Simplicity First
|
||||||
|
|
||||||
|
**Minimum code that solves the problem. Nothing speculative.**
|
||||||
|
|
||||||
|
- No features beyond what was asked.
|
||||||
|
- No abstractions for single-use code.
|
||||||
|
- No "flexibility" or "configurability" that wasn't requested.
|
||||||
|
- No error handling for impossible scenarios.
|
||||||
|
- If you write 200 lines and it could be 50, rewrite it.
|
||||||
|
|
||||||
|
Ask yourself: "Would a senior engineer say this is overcomplicated?" If yes, simplify.
|
||||||
|
|
||||||
|
## 3. Surgical Changes
|
||||||
|
|
||||||
|
**Touch only what you must. Clean up only your own mess.**
|
||||||
|
|
||||||
|
When editing existing code:
|
||||||
|
- Don't "improve" adjacent code, comments, or formatting.
|
||||||
|
- Don't refactor things that aren't broken.
|
||||||
|
- Match existing style, even if you'd do it differently.
|
||||||
|
- If you notice unrelated dead code, mention it - don't delete it.
|
||||||
|
|
||||||
|
When your changes create orphans:
|
||||||
|
- Remove imports/variables/functions that YOUR changes made unused.
|
||||||
|
- Don't remove pre-existing dead code unless asked.
|
||||||
|
|
||||||
|
The test: Every changed line should trace directly to the user's request.
|
||||||
|
|
||||||
|
## 4. Goal-Driven Execution
|
||||||
|
|
||||||
|
**Define success criteria. Loop until verified.**
|
||||||
|
|
||||||
|
Transform tasks into verifiable goals:
|
||||||
|
- "Add validation" → "Write tests for invalid inputs, then make them pass"
|
||||||
|
- "Fix the bug" → "Write a test that reproduces it, then make it pass"
|
||||||
|
- "Refactor X" → "Ensure tests pass before and after"
|
||||||
|
|
||||||
|
For multi-step tasks, state a brief plan:
|
||||||
|
```
|
||||||
|
1. [Step] → verify: [check]
|
||||||
|
2. [Step] → verify: [check]
|
||||||
|
3. [Step] → verify: [check]
|
||||||
|
```
|
||||||
|
|
||||||
|
Strong success criteria let you loop independently. Weak criteria ("make it work") require constant clarification.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
**These guidelines are working if:** fewer unnecessary changes in diffs, fewer rewrites due to overcomplication, and clarifying questions come before implementation rather than after mistakes.
|
||||||
|
|
||||||
|
Read AGENT.md first before working and follow the instructions for orchestration.
|
||||||
@@ -0,0 +1,163 @@
|
|||||||
|
# Understand-Anything: Project & Architecture Analysis Report
|
||||||
|
|
||||||
|
This report presents a comprehensive architectural analysis and security verification of the `advanced_multi_agent` orchestration workspace. Using the static analysis principles inspired by the `Understand-Anything` pipeline, we map out the codebase structure, evaluate the integrity of the design, identify critical defects/inconsistencies between implementation and documentation, and provide concrete technical recommendations.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 1. Architectural Visualization
|
||||||
|
|
||||||
|
The following diagram illustrates the interaction between the orchestrator (Hermes/PM), the worker agents running inside TMUX sessions, and the decentralized event backplane (MQTT).
|
||||||
|
|
||||||
|
```mermaid
|
||||||
|
sequenceDiagram
|
||||||
|
autonumber
|
||||||
|
actor User as User / PM
|
||||||
|
participant Registry as Job Registry (.hermes/jobs/)
|
||||||
|
participant DB as Session Registry (SQLite WAL & YAML)
|
||||||
|
participant TMUX as Tmux Workspace (Worker Session)
|
||||||
|
participant MQTT as MQTT Broker (HiveMQ / Private)
|
||||||
|
participant Sub as Job Subscriber (job_subscriber.py)
|
||||||
|
participant Mon as Reconcile Monitor (reconcile.sh)
|
||||||
|
|
||||||
|
User->>Registry: Register Job (registry.py register)
|
||||||
|
Registry-->>User: Return Job ID (JID)
|
||||||
|
|
||||||
|
User->>Sub: Spawn background subscriber (job_subscriber.py --job JID)
|
||||||
|
Sub->>MQTT: Subscribe to topic (python/mqtt/jobs/JID/events)
|
||||||
|
|
||||||
|
User->>TMUX: Create session & execute agent (create_session.sh)
|
||||||
|
TMUX->>DB: Add running session (atomic_dump_yaml)
|
||||||
|
|
||||||
|
Note over TMUX: Agent Starts execution
|
||||||
|
TMUX->>MQTT: Publish 'started' event (publish_event.py)
|
||||||
|
MQTT->>Sub: Deliver event (QoS 1)
|
||||||
|
Sub->>Sub: Verify HMAC Signature
|
||||||
|
Sub->>Sub: Log to events.ndjson & print stdout
|
||||||
|
|
||||||
|
Note over TMUX: Agent does work & publishes checkpoints
|
||||||
|
TMUX->>MQTT: Publish 'progress' / 'permission_required'
|
||||||
|
MQTT->>Sub: Deliver event (QoS 1)
|
||||||
|
|
||||||
|
Note over TMUX: Agent finishes execution
|
||||||
|
TMUX->>MQTT: Publish 'completed' or 'error' (retained)
|
||||||
|
MQTT->>Sub: Deliver terminal event (QoS 1)
|
||||||
|
Sub->>Sub: Transition to Terminal State & Exit
|
||||||
|
|
||||||
|
Note over Mon: Reconcile loop runs periodically
|
||||||
|
Mon->>MQTT: Listen for terminal events
|
||||||
|
MQTT->>Mon: Deliver terminal events
|
||||||
|
Mon->>DB: Mark session terminated, kill tmux (reconcile.sh)
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 2. Core Mechanism Deep Dive & Verification
|
||||||
|
|
||||||
|
### 2.1 MQTT Backplane & Event Protocol
|
||||||
|
* **Wire Format**: Encoded in UTF-8 JSON matching `schema_version = 1`. It features monotonic `seq` indexing, `job_id`, `event` type, `timestamp`, `detail` description, and a `data` block for metadata.
|
||||||
|
* **QoS and Retention**: Event publishing and subscribing enforce **QoS 1 (At Least Once)** delivery. Terminal events (`completed`/`error`) utilize `retain=True` on the broker. This ensures that late-joining subscribers immediately receive the terminal state without missing the final outcome.
|
||||||
|
* **Network Handshake Isolation**: `publish_event.py` uses a short-lived connection pattern (connect, publish QoS 1, wait for PUBACK, disconnect) with exponential backoff retries. This limits long-lived socket starvation and mitigates socket exhaustion under high session concurrency.
|
||||||
|
|
||||||
|
### 2.2 SQLite WAL Session Database
|
||||||
|
* **Database & WAL Mode**: Session metadata has been migrated from a single-point-of-contention YAML file to a SQLite database (`.hermes/agent-sessions.db`) operating in **WAL (Write-Ahead Logging)** mode.
|
||||||
|
* **Concurrency Control**: Concurrency is managed via `BEGIN IMMEDIATE` transactions in `atomic_dump_yaml()`, which blocks concurrent write attempts at the database level rather than relying on brittle file system locks.
|
||||||
|
* **YAML Synchronization**: To maintain compatibility, `agent-sessions.yaml` is updated atomically (using `tempfile.mkstemp` and `os.replace`) only when a session transitions to a terminal state (`stopped`, `terminated`, `archived`), leaving active write traffic isolated within the SQLite WAL database.
|
||||||
|
* **NFS Fallback**: If a network mount (NFS/CIFS/SSHFS) is detected, `lib.sh` automatically falls back to `PRAGMA journal_mode=DELETE` to prevent WAL serialization crashes, as NFS does not support shared-memory mapped files (`-shm`) required by WAL.
|
||||||
|
|
||||||
|
### 2.3 HMAC-SHA256 Signature Verification
|
||||||
|
* **Signature Generation**: The publisher serializes the payload (excluding `data.hmac_sig`) into a canonical JSON string (with sorted keys and no whitespace separators) and signs it using HMAC-SHA256 with the job's secret `auth_token`.
|
||||||
|
* **Signature Verification**: `job_subscriber.py` intercepts payloads and calls `verify_hmac()`, which calculates the expected signature and compares it with the received signature using the constant-time `hmac.compare_digest` to prevent timing attacks.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 3. Discovered Flaws & Documentation Inconsistencies
|
||||||
|
|
||||||
|
We have identified several critical gaps between the architecture specifications and the actual codebase implementation:
|
||||||
|
|
||||||
|
### ⚠️ Flaw 1: Documentation Mismatch in `job-protocol.md` (Security Risk if Followed)
|
||||||
|
* **Description**: Section 4 of `job-protocol.md` states:
|
||||||
|
> *`auth_token` (the bonus field) — each job record carries a per-job `auth_token` (`secrets.token_urlsafe(32)`). The publisher copies it into `data.auth_token`; the subscriber compares it against the registry's expected token and drops mismatches.*
|
||||||
|
* **Reality in Code**: If the publisher copied the plaintext token into `data.auth_token`, it would be transmitted in plaintext across the MQTT network, exposing the secret token to any eavesdropper (especially on the public PoC broker).
|
||||||
|
* **Correction**: The code correctly implements **HMAC-SHA256 signatures** via `data.hmac_sig` and **never transmits the raw `auth_token`**. The documentation in `job-protocol.md` is obsolete and contradicts the secure implementation.
|
||||||
|
|
||||||
|
### ⚠️ Flaw 2: Missing Automated `auth_token` Generation & CLI Support
|
||||||
|
* **Description**: Both `MESSAGING.md` and `registry.md` state that when a job is registered, a cryptographic token is automatically generated using `secrets.token_urlsafe(32)`.
|
||||||
|
* **Reality in Code**: In `registry.py`, `register_job()` accepts `auth_token: Optional[str] = None` and defaults it to `None`. No automatic token generation is implemented. Furthermore, the CLI registration parser (`registry.py register`) does not expose any `--auth-token` flag, nor does it generate one internally. As a result, **every job registered via the CLI is created with `auth_token = null`**, defaulting the system to the unauthenticated/unsecured PoC mode.
|
||||||
|
|
||||||
|
### ⚠️ Flaw 3: Replay Attack Vulnerability for Non-Terminal Events
|
||||||
|
* **Description**: `job_subscriber.py` enforces a terminal state machine to ignore duplicate `completed`/`error` events, but it does **not validate sequence numbers (`seq`) or timestamp freshness** for non-terminal events (`progress`, `permission_required`).
|
||||||
|
* **Exploitation Vector**: An attacker sniffing network traffic (easy on HiveMQ's plaintext broker) can capture a signed `permission_required` or `progress` event and replay it repeatedly. Since the HMAC signature remains valid, `job_subscriber.py` will accept the replayed message, write it to the audit log (`events.ndjson`), and output it to stdout, potentially triggering downstream actions or corrupting the audit trail.
|
||||||
|
|
||||||
|
### ⚠️ Flaw 4: NFS locking Vulnerability in Job Registry
|
||||||
|
* **Description**: While the session registry was successfully migrated to SQLite to circumvent NFS locking issues, the Job Registry in `.hermes/jobs/` still relies on `fcntl.flock` over a shared `.lock` file to coordinate job claims (`pick_pending`).
|
||||||
|
* **Impact**: If the project registry is located on a network-mounted file system, concurrent calls to `pick_pending` from multiple hosts could result in lock failures, leading to duplicate claims (split-brain) or corruption of the `<job_id>.json` files during write operations.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 4. Technical Recommendations
|
||||||
|
|
||||||
|
To address these vulnerabilities and align the codebase with the target production security standards, we recommend the following changes:
|
||||||
|
|
||||||
|
### 1. Correct the Protocol Documentation
|
||||||
|
Update `job-protocol.md` to match the actual HMAC-SHA256 signature scheme, removing all references to transmitting the plaintext token in `data.auth_token`.
|
||||||
|
|
||||||
|
### 2. Implement Automated Token Generation in `registry.py`
|
||||||
|
Modify `register_job` to automatically generate a cryptographically secure token when running in production mode, and add the `--auth-token` argument to the CLI.
|
||||||
|
|
||||||
|
*Proposed change in `registry.py`*:
|
||||||
|
```python
|
||||||
|
# In registry.py:register_job
|
||||||
|
import secrets
|
||||||
|
|
||||||
|
# Generate token if not provided (production mode default)
|
||||||
|
if auth_token is None:
|
||||||
|
# If broker is secure/private, generate a token by default
|
||||||
|
if broker.get("tls") or broker.get("username"):
|
||||||
|
auth_token = secrets.token_urlsafe(32)
|
||||||
|
```
|
||||||
|
|
||||||
|
### 3. Harden `job_subscriber.py` Against Replay Attacks
|
||||||
|
Implement monotonic sequence number tracking and timestamp freshness checks in `_Watcher.on_message`.
|
||||||
|
|
||||||
|
*Proposed change in `job_subscriber.py`*:
|
||||||
|
```python
|
||||||
|
# In _Watcher inside job_subscriber.py
|
||||||
|
def __init__(self, expected_job_ids: Set[str], expected_tokens: Dict[str, Optional[str]]):
|
||||||
|
self.events = queue.Queue()
|
||||||
|
self.expected = set(expected_job_ids)
|
||||||
|
self.tokens = expected_tokens
|
||||||
|
self.last_seq: Dict[str, int] = {} # Track sequence numbers per job
|
||||||
|
|
||||||
|
def on_message(self, _client, _userdata, msg) -> None:
|
||||||
|
# ... (after json parse and schema check) ...
|
||||||
|
jid = payload.get("job_id")
|
||||||
|
seq = payload.get("seq", 0)
|
||||||
|
|
||||||
|
# 1. Monotonic Sequence Check
|
||||||
|
if jid in self.last_seq and seq <= self.last_seq[jid]:
|
||||||
|
logger.warning("drop replayed/duplicate event seq=%r for job %s", seq, jid)
|
||||||
|
return
|
||||||
|
|
||||||
|
# 2. Timestamp freshness check (e.g., 60s window)
|
||||||
|
# (Optional but recommended for strict production environments)
|
||||||
|
|
||||||
|
# ... (after HMAC verification succeeds) ...
|
||||||
|
self.last_seq[jid] = seq
|
||||||
|
# ...
|
||||||
|
```
|
||||||
|
|
||||||
|
### 4. Migrate the Job Registry to the SQLite DB
|
||||||
|
To eliminate NFS locking issues completely, merge the Job Registry data into the SQLite database. Define a `jobs` table with a schema similar to:
|
||||||
|
```sql
|
||||||
|
CREATE TABLE IF NOT EXISTS jobs (
|
||||||
|
job_id TEXT PRIMARY KEY,
|
||||||
|
status TEXT,
|
||||||
|
agent_session TEXT,
|
||||||
|
created_at TEXT,
|
||||||
|
data JSON
|
||||||
|
);
|
||||||
|
```
|
||||||
|
Replace the file-based `fcntl.flock` in `registry.py` with SQL transactions (`BEGIN IMMEDIATE`), ensuring absolute atomicity and locking security regardless of the underlying filesystem type.
|
||||||
|
|
||||||
|
---
|
||||||
|
*Report compiled on 2026-06-21 by Antigravity Reviewer Agent.*
|
||||||
Reference in New Issue
Block a user