Understand-Anything: Project & Architecture Analysis Report
This report presents an architectural analysis and security verification of the advanced_multi_agent orchestration workspace. Using static analysis principles inspired by the Understand-Anything pipeline, we map out the codebase structure, evaluate the integrity of the design, identify critical defects and inconsistencies between implementation and documentation, and provide concrete technical recommendations.
1. Architectural Visualization
The following diagram illustrates the interaction between the orchestrator (Hermes/PM), the worker agents running inside TMUX sessions, and the decentralized event backplane (MQTT).
sequenceDiagram
autonumber
actor User as User / PM
participant Registry as Job Registry (.hermes/jobs/)
participant DB as Session Registry (SQLite WAL & YAML)
participant TMUX as Tmux Workspace (Worker Session)
participant MQTT as MQTT Broker (HiveMQ / Private)
participant Sub as Job Subscriber (job_subscriber.py)
participant Mon as Reconcile Monitor (reconcile.sh)
User->>Registry: Register Job (registry.py register)
Registry-->>User: Return Job ID (JID)
User->>Sub: Spawn background subscriber (job_subscriber.py --job JID)
Sub->>MQTT: Subscribe to topic (python/mqtt/jobs/JID/events)
User->>TMUX: Create session & execute agent (create_session.sh)
TMUX->>DB: Add running session (atomic_dump_yaml)
Note over TMUX: Agent Starts execution
TMUX->>MQTT: Publish 'started' event (publish_event.py)
MQTT->>Sub: Deliver event (QoS 1)
Sub->>Sub: Verify HMAC Signature
Sub->>Sub: Log to events.ndjson & print stdout
Note over TMUX: Agent does work & publishes checkpoints
TMUX->>MQTT: Publish 'progress' / 'permission_required'
MQTT->>Sub: Deliver event (QoS 1)
Note over TMUX: Agent finishes execution
TMUX->>MQTT: Publish 'completed' or 'error' (retained)
MQTT->>Sub: Deliver terminal event (QoS 1)
Sub->>Sub: Transition to Terminal State & Exit
Note over Mon: Reconcile loop runs periodically
Mon->>MQTT: Listen for terminal events
MQTT->>Mon: Deliver terminal events
Mon->>DB: Mark session terminated, kill tmux (reconcile.sh)
2. Core Mechanism Deep Dive & Verification
2.1 MQTT Backplane & Event Protocol
- Wire Format: Encoded in UTF-8 JSON matching schema_version = 1. It features monotonic seq indexing, job_id, event type, timestamp, detail description, and a data block for metadata.
- QoS and Retention: Event publishing and subscribing enforce QoS 1 (At Least Once) delivery. Terminal events (completed/error) utilize retain=True on the broker. This ensures that late-joining subscribers immediately receive the terminal state without missing the final outcome.
- Network Handshake Isolation: publish_event.py uses a short-lived connection pattern (connect, publish QoS 1, wait for PUBACK, disconnect) with exponential backoff retries. This limits long-lived socket starvation and mitigates socket exhaustion under high session concurrency.
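The wire format and the short-lived publish pattern described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's publish_event.py: the JSON key names event, timestamp, and detail, the helper names build_event and publish_with_backoff, and the injected send callable (standing in for connect, publish at QoS 1, wait for PUBACK, disconnect) are all hypothetical.

```python
import json
import time
from typing import Any, Callable, Dict, Optional

TERMINAL_EVENTS = {"completed", "error"}


def build_event(job_id: str, seq: int, event: str, detail: str = "",
                data: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Assemble one event in the schema_version = 1 shape (key names assumed)."""
    return {
        "schema_version": 1,
        "seq": seq,                 # monotonic per-job index
        "job_id": job_id,
        "event": event,             # e.g. started, progress, completed, error
        "timestamp": time.time(),   # assumed to be epoch seconds
        "detail": detail,
        "data": dict(data or {}),
    }


def publish_with_backoff(send: Callable[[str, bytes, bool], None],
                         payload: Dict[str, Any],
                         retries: int = 4,
                         base_delay: float = 0.5,
                         sleep: Callable[[float], None] = time.sleep) -> int:
    """Call send(topic, body, retain) until it succeeds; return attempts used."""
    topic = f"python/mqtt/jobs/{payload['job_id']}/events"
    body = json.dumps(payload).encode("utf-8")
    # Only terminal events are retained, so late subscribers see the outcome.
    retain = payload["event"] in TERMINAL_EVENTS
    for attempt in range(retries):
        try:
            send(topic, body, retain)
            return attempt + 1
        except OSError:
            if attempt == retries - 1:
                raise
            # Exponential backoff: base_delay, 2x, 4x, ...
            sleep(base_delay * (2 ** attempt))
    raise ValueError("retries must be at least 1")
```

Injecting send and sleep keeps the retry logic testable without a broker; in a real publisher, send would wrap one connect/publish/disconnect cycle of the MQTT client.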
2.2 SQLite WAL Session Database
- Database & WAL Mode: Session metadata has been migrated from a single-point-of-contention YAML file to a SQLite database (.hermes/agent-sessions.db) operating in WAL (Write-Ahead Logging) mode.
- Concurrency Control: Concurrency is managed via BEGIN IMMEDIATE transactions in atomic_dump_yaml(), which blocks concurrent write attempts at the database level rather than relying on brittle file system locks.
- YAML Synchronization: To maintain compatibility, agent-sessions.yaml is updated atomically (using tempfile.mkstemp and os.replace) only when a session transitions to a terminal state (stopped, terminated, archived), leaving active write traffic isolated within the SQLite WAL database.
- NFS Fallback: If a network mount (NFS/CIFS/SSHFS) is detected, lib.sh automatically falls back to PRAGMA journal_mode=DELETE to prevent WAL serialization crashes, as NFS does not support shared-memory mapped files (-shm) required by WAL.
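A minimal sketch of the pattern above, using only the Python standard library. The sessions table, its columns, and the function names open_db, set_state, atomic_write_text, and record_state are assumptions made for illustration; the real atomic_dump_yaml() may be structured differently, and the flat "name: state" output stands in for the actual YAML layout.

```python
import os
import sqlite3
import tempfile

TERMINAL_STATES = {"stopped", "terminated", "archived"}


def open_db(path: str, on_network_mount: bool = False) -> sqlite3.Connection:
    """Open the session DB in WAL mode, or DELETE mode on network mounts."""
    # isolation_level=None disables implicit transactions, so BEGIN is explicit.
    conn = sqlite3.connect(path, timeout=5.0, isolation_level=None)
    mode = "DELETE" if on_network_mount else "WAL"
    conn.execute(f"PRAGMA journal_mode={mode}")
    conn.execute("CREATE TABLE IF NOT EXISTS sessions "
                 "(name TEXT PRIMARY KEY, state TEXT NOT NULL)")
    return conn


def set_state(conn: sqlite3.Connection, name: str, state: str) -> None:
    """Write one session row under a BEGIN IMMEDIATE write lock."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock up front
    try:
        conn.execute("INSERT OR REPLACE INTO sessions(name, state) VALUES (?, ?)",
                     (name, state))
        conn.execute("COMMIT")
    except BaseException:
        conn.execute("ROLLBACK")
        raise


def atomic_write_text(path: str, text: str) -> None:
    """Write to a temp file in the same directory, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            fh.write(text)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp, path)  # atomic on the same filesystem
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


def record_state(conn: sqlite3.Connection, name: str, state: str,
                 yaml_path: str) -> None:
    """Update SQLite always; refresh the YAML mirror only on terminal states."""
    set_state(conn, name, state)
    if state in TERMINAL_STATES:
        rows = conn.execute(
            "SELECT name, state FROM sessions ORDER BY name").fetchall()
        atomic_write_text(yaml_path, "".join(f"{n}: {s}\n" for n, s in rows))
```

Creating the temporary file in the target's own directory matters: os.replace is only atomic when source and destination are on the same filesystem.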
2.3 HMAC-SHA256 Signature Verification
- Signature Generation: The publisher serializes the payload (excluding data.hmac_sig) into a canonical JSON string (with sorted keys and no whitespace separators) and signs it using HMAC-SHA256 with the job's secret auth_token.
- Signature Verification: job_subscriber.py intercepts payloads and calls verify_hmac(), which calculates the expected signature and compares it with the received signature using the constant-time hmac.compare_digest to prevent timing attacks.
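The signing and verification steps can be sketched as below. This is a hedged illustration rather than the project's code: the hex encoding of the signature, the helper names canonical_bytes and sign, and the exact signature of verify_hmac are assumptions; only the HMAC-SHA256 scheme, the exclusion of data.hmac_sig, the sorted-key compact JSON, and the use of hmac.compare_digest come from the description above.

```python
import copy
import hashlib
import hmac
import json
from typing import Any, Dict


def canonical_bytes(payload: Dict[str, Any]) -> bytes:
    """Serialize the payload without data.hmac_sig, sorted keys, no whitespace."""
    clean = copy.deepcopy(payload)  # never mutate the caller's payload
    data = clean.get("data")
    if isinstance(data, dict):
        data.pop("hmac_sig", None)
    return json.dumps(clean, sort_keys=True,
                      separators=(",", ":")).encode("utf-8")


def sign(payload: Dict[str, Any], auth_token: str) -> str:
    """Return the hex HMAC-SHA256 of the canonical payload (hex is assumed)."""
    return hmac.new(auth_token.encode("utf-8"),
                    canonical_bytes(payload),
                    hashlib.sha256).hexdigest()


def verify_hmac(payload: Dict[str, Any], auth_token: str) -> bool:
    """Recompute the signature and compare it in constant time."""
    data = payload.get("data")
    received = data.get("hmac_sig") if isinstance(data, dict) else None
    if not isinstance(received, str):
        return False
    expected = sign(payload, auth_token)
    # Compare as bytes so non-ASCII input cannot raise inside compare_digest.
    return hmac.compare_digest(expected.encode("utf-8"),
                               received.encode("utf-8"))
```

A publisher would set payload["data"]["hmac_sig"] = sign(payload, token) just before sending; any later change to a signed field makes verify_hmac return False.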
3. Discovered Flaws & Documentation Inconsistencies
We have identified several critical gaps between the architecture specifications and the actual codebase implementation:
⚠️ Flaw 1: Documentation Mismatch in job-protocol.md (Security Risk if Followed)
- Description: Section 4 of job-protocol.md states: "auth_token (the bonus field) — each job record carries a per-job auth_token (secrets.token_urlsafe(32)). The publisher copies it into data.auth_token; the subscriber compares it against the registry's expected token and drops mismatches."
- Reality in Code: If the publisher copied the plaintext token into data.auth_token, it would be transmitted in plaintext across the MQTT network, exposing the secret token to any eavesdropper (especially on the public PoC broker).
- Correction: The code correctly implements HMAC-SHA256 signatures via data.hmac_sig and never transmits the raw auth_token. The documentation in job-protocol.md is obsolete and contradicts the secure implementation.
⚠️ Flaw 2: Missing Automated auth_token Generation & CLI Support
- Description: Both MESSAGING.md and registry.md state that when a job is registered, a cryptographic token is automatically generated using secrets.token_urlsafe(32).
- Reality in Code: In registry.py, register_job() accepts auth_token: Optional[str] = None and defaults it to None. No automatic token generation is implemented. Furthermore, the CLI registration parser (registry.py register) does not expose any --auth-token flag, nor does it generate one internally. As a result, every job registered via the CLI is created with auth_token = null, defaulting the system to the unauthenticated/unsecured PoC mode.
⚠️ Flaw 3: Replay Attack Vulnerability for Non-Terminal Events
- Description: job_subscriber.py enforces a terminal state machine to ignore duplicate completed/error events, but it does not validate sequence numbers (seq) or timestamp freshness for non-terminal events (progress, permission_required).
- Exploitation Vector: An attacker sniffing network traffic (easy on HiveMQ's plaintext broker) can capture a signed permission_required or progress event and replay it repeatedly. Since the HMAC signature remains valid, job_subscriber.py will accept the replayed message, write it to the audit log (events.ndjson), and output it to stdout, potentially triggering downstream actions or corrupting the audit trail.
⚠️ Flaw 4: NFS Locking Vulnerability in Job Registry
- Description: While the session registry was successfully migrated to SQLite to circumvent NFS locking issues, the Job Registry in .hermes/jobs/ still relies on fcntl.flock over a shared .lock file to coordinate job claims (pick_pending).
- Impact: If the project registry is located on a network-mounted file system, concurrent calls to pick_pending from multiple hosts could result in lock failures, leading to duplicate claims (split-brain) or corruption of the <job_id>.json files during write operations.
4. Technical Recommendations
To address these vulnerabilities and align the codebase with the target production security standards, we recommend the following changes:
1. Correct the Protocol Documentation
Update job-protocol.md to match the actual HMAC-SHA256 signature scheme, removing all references to transmitting the plaintext token in data.auth_token.
2. Implement Automated Token Generation in registry.py
Modify register_job to automatically generate a cryptographically secure token when running in production mode, and add the --auth-token argument to the CLI.
Proposed change in registry.py:
    # In registry.py:register_job
    import secrets

    # Generate token if not provided (production mode default)
    if auth_token is None:
        # If broker is secure/private, generate a token by default
        if broker.get("tls") or broker.get("username"):
            auth_token = secrets.token_urlsafe(32)
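To show how the token default and the CLI flag could fit together, here is a small self-contained sketch. The names resolve_auth_token and build_parser are hypothetical, and the parser covers only the register subcommand with the proposed --auth-token flag; the real registry.py parser presumably has more arguments that are omitted here.

```python
import argparse
import secrets
from typing import Any, Dict, Optional


def resolve_auth_token(broker: Dict[str, Any],
                       auth_token: Optional[str] = None) -> Optional[str]:
    """Keep an explicit token; otherwise generate one for secured brokers."""
    if auth_token is not None:
        return auth_token
    # A TLS or username-protected broker is treated as production mode.
    if broker.get("tls") or broker.get("username"):
        return secrets.token_urlsafe(32)
    # Public PoC broker: stay unauthenticated, matching current behavior.
    return None


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical minimal parser exposing the proposed --auth-token flag."""
    parser = argparse.ArgumentParser(prog="registry.py")
    sub = parser.add_subparsers(dest="command", required=True)
    reg = sub.add_parser("register", help="register a new job")
    reg.add_argument("--auth-token", default=None,
                     help="explicit per-job secret; generated when omitted "
                          "and the broker is secured")
    return parser
```

For example, build_parser().parse_args(["register", "--auth-token", "abc"]).auth_token yields "abc", which resolve_auth_token then returns unchanged regardless of the broker settings.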
3. Harden job_subscriber.py Against Replay Attacks
Implement monotonic sequence number tracking and timestamp freshness checks in _Watcher.on_message.
Proposed change in job_subscriber.py:
    # In _Watcher inside job_subscriber.py
    # (requires: import queue; from typing import Dict, Optional, Set)
    def __init__(self, expected_job_ids: Set[str], expected_tokens: Dict[str, Optional[str]]):
        self.events = queue.Queue()
        self.expected = set(expected_job_ids)
        self.tokens = expected_tokens
        self.last_seq: Dict[str, int] = {}  # Track sequence numbers per job

    def on_message(self, _client, _userdata, msg) -> None:
        # ... (after json parse and schema check) ...
        jid = payload.get("job_id")
        seq = payload.get("seq", 0)

        # 1. Monotonic Sequence Check
        if jid in self.last_seq and seq <= self.last_seq[jid]:
            logger.warning("drop replayed/duplicate event seq=%r for job %s", seq, jid)
            return

        # 2. Timestamp freshness check (e.g., 60s window)
        # (Optional but recommended for strict production environments)

        # ... (after HMAC verification succeeds) ...
        # Record seq only for authenticated events, so forged messages
        # cannot advance the counter.
        self.last_seq[jid] = seq
        # ...
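The two checks can also be isolated into a small, testable helper. The sketch below is one possible shape, not the project's implementation: the class name ReplayGuard, its check/commit split, and the assumption that timestamp is a numeric epoch-seconds value are all hypothetical. The intended call order is check, then HMAC verification, then commit.

```python
import time
from typing import Any, Callable, Dict


class ReplayGuard:
    """Reject events that are stale or not newer than the last accepted seq."""

    def __init__(self, max_age_s: float = 60.0,
                 clock: Callable[[], float] = time.time) -> None:
        self.max_age_s = max_age_s
        self.clock = clock
        self.last_seq: Dict[str, int] = {}

    def check(self, payload: Dict[str, Any]) -> bool:
        """Return True if the event is well-formed, fresh, and has a new seq."""
        jid = payload.get("job_id")
        seq = payload.get("seq")
        ts = payload.get("timestamp")  # assumed: epoch seconds
        if not isinstance(jid, str):
            return False
        # bool is a subclass of int, so exclude it explicitly.
        if not isinstance(seq, int) or isinstance(seq, bool):
            return False
        if not isinstance(ts, (int, float)) or isinstance(ts, bool):
            return False
        # Freshness window, applied in both directions to bound clock skew.
        if abs(self.clock() - ts) > self.max_age_s:
            return False
        last = self.last_seq.get(jid)
        return last is None or seq > last

    def commit(self, payload: Dict[str, Any]) -> None:
        """Record seq; call only after HMAC verification has succeeded."""
        self.last_seq[payload["job_id"]] = payload["seq"]
```

Because QoS 1 delivery can legitimately duplicate messages, dropping an event whose seq was already committed is harmless for honest publishers while closing the replay path. Both checks are only meaningful if seq and timestamp are covered by the HMAC signature.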
4. Migrate the Job Registry to the SQLite DB
To eliminate NFS locking issues completely, merge the Job Registry data into the SQLite database. Define a jobs table with a schema similar to:
    CREATE TABLE IF NOT EXISTS jobs (
        job_id TEXT PRIMARY KEY,
        status TEXT,
        agent_session TEXT,
        created_at TEXT,
        data JSON
    );
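Replace the file-based fcntl.flock in registry.py with SQL transactions (BEGIN IMMEDIATE), so that job claims are made atomic and serialized by SQLite rather than by a hand-maintained .lock file. On network mounts, SQLite still depends on the filesystem's own file locking, so the journal-mode fallback described in Section 2.2 continues to apply.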
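A claim operation on such a table might look like the following sketch. It assumes the jobs schema shown above, a connection opened with isolation_level=None so that transactions are explicit, and the status values 'pending' and 'claimed', which are illustrative and may not match the registry's actual vocabulary. The helper init_jobs is hypothetical; pick_pending reuses the name of the existing function but is not its real implementation.

```python
import sqlite3
from typing import Optional


def init_jobs(conn: sqlite3.Connection) -> None:
    """Create the jobs table using the schema proposed in this report."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        " job_id TEXT PRIMARY KEY,"
        " status TEXT,"
        " agent_session TEXT,"
        " created_at TEXT,"
        " data JSON)"
    )


def pick_pending(conn: sqlite3.Connection, agent_session: str) -> Optional[str]:
    """Claim the oldest pending job, or return None if there is none."""
    # BEGIN IMMEDIATE takes the write lock before reading, so two claimers
    # cannot both select the same pending row.
    conn.execute("BEGIN IMMEDIATE")
    try:
        row = conn.execute(
            "SELECT job_id FROM jobs WHERE status = 'pending' "
            "ORDER BY created_at, job_id LIMIT 1"
        ).fetchone()
        if row is None:
            conn.execute("COMMIT")
            return None
        conn.execute(
            "UPDATE jobs SET status = 'claimed', agent_session = ? "
            "WHERE job_id = ?",
            (agent_session, row[0]),
        )
        conn.execute("COMMIT")
        return row[0]
    except BaseException:
        conn.execute("ROLLBACK")
        raise
```

A caller would open the database with sqlite3.connect(path, timeout=5.0, isolation_level=None); the timeout makes a second claimer wait briefly for the write lock instead of failing immediately.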
Report compiled on 2026-06-21 by Antigravity Reviewer Agent.