365 lines
24 KiB
Markdown
365 lines
24 KiB
Markdown
# Messaging System Technical Analysis & Architecture Report
|
|
|
|
This report provides a comprehensive, deep-dive analysis of the messaging system implemented in the `multi-agent-mux-delegate-job` skill. It covers the MQTT broker architecture, event protocols, job lifecycles, codebase internals, cross-system integration, and a list of known limitations along with production recommendations.
|
|
|
|
---
|
|
|
|
## 1. MQTT Broker Architecture: PoC vs. TLS Production
|
|
|
|
The messaging system is designed with a clear, decoupled transition pathway from a Proof of Concept (PoC) public broker setup to a secured, authenticated, and encrypted private production cluster. All configurations are resolved dynamically from the environment or overridden at the job level, requiring zero code modifications during deployment cut-over.
|
|
|
|
### 1.1 PoC Architecture (Public Sandbox)
|
|
In the initial development/testing phase, the system defaults to the public broker hosted by HiveMQ:
|
|
* **Host/IP**: `broker.hivemq.com`
|
|
* **Protocol/Port**: Plaintext MQTT over TCP on port `1883`.
|
|
* **Security & Auth**: None. No username, password, TLS encryption, or access control list (ACL) constraints are applied.
|
|
* **QoS Level**: QoS 1 (At Least Once) is requested for publishes and subscriptions, ensuring acknowledgement at the network layer.
|
|
|
|
#### Risks and Limitations of the PoC Setup:
|
|
1. **Zero Eavesdropping Protection**: Because the broker is public and unencrypted, any internet user can subscribe to the root topic (`python/mqtt/jobs/#`) and read the exact prompt, agent sessions, and intermediate progress events.
|
|
2. **Event Spoofing & Injection**: Anyone can publish messages to any job topic. An attacker could publish a malicious `completed` or `error` event, prematurely terminating a running subscriber or causing the delegator to execute unauthorized post-validation hooks.
|
|
3. **No Message Persistence**: Public brokers do not guarantee queue persistence or durable sessions for disconnected clients. If a subscriber briefly drops offline, QoS 1 messages published during the disconnect window may be discarded.
|
|
4. **Rate Limiting & Reliability**: Public sandboxes are subject to arbitrary rate limits, traffic spikes, and transient connection resets, leading to network-level timeouts.
|
|
|
|
---
|
|
|
|
### 1.2 Production Architecture (Secure Private Broker)
|
|
For production deployments, the system is designed to run on a private, self-hosted MQTT 5.0 broker such as **Mosquitto** or **EMQX**.
|
|
|
|
```mermaid
|
|
graph TD
|
|
subgraph "Secure Corporate Network"
|
|
Broker["Private MQTT Broker (Mosquitto/EMQX) <br> Ports: 8883 (TLS)"]
|
|
|
|
subgraph "Hermes (Delegator/Orchestrator)"
|
|
SubClient["job_subscriber.py <br> (Role: subscriber)"]
|
|
end
|
|
|
|
subgraph "Tmux Workspace (Agent Host)"
|
|
PubClient["publish_event.py <br> (Role: publisher)"]
|
|
end
|
|
|
|
SubClient -- "Subscribe (QoS 1) <br> Auth: hermes <br> ACL: Read jobs/+/events" --> Broker
|
|
PubClient -- "Publish (QoS 1 + Retain Terminal) <br> Auth: claude-worker <br> ACL: Write jobs/+/events" --> Broker
|
|
end
|
|
```
|
|
|
|
#### Production Security & Hardening Controls:
|
|
1. **Transport Layer Security (TLS v1.3)**: Traffic is encrypted over port `8883` using a private Certification Authority (CA). The orchestrator validates the broker using `MQTT_CA_CERTS` (CA bundle path). Optionally, Mutual TLS (mTLS) is supported via client-side certificate keys (`MQTT_CERTFILE`/`MQTT_KEYFILE`) for cryptographic device identities.
|
|
2. **Strict Client Authentication**: All clients must supply credentials (`MQTT_USERNAME` / `MQTT_PASSWORD`) to establish a connection. Anonymous logins are explicitly disabled (`allow_anonymous false`).
|
|
3. **Role-Based Topic Access Control Lists (ACLs)**:
|
|
* **Orchestrator/Hermes (Subscriber)**: Authenticates as user `hermes` with read-only access to all event streams:
|
|
```conf
|
|
user hermes
|
|
topic read python/mqtt/jobs/+/events
|
|
```
|
|
* **Agent/Worker (Publisher)**: Authenticates as user `claude-worker` with write-only access restricted to the job event sub-topics:
|
|
```conf
|
|
user claude-worker
|
|
topic write python/mqtt/jobs/+/events
|
|
```
|
|
This prevents workers from eavesdropping on sister agents or intercepting commands on other jobs.
|
|
4. **Durable Message Queues & Session State**:
|
|
* The broker is configured with `persistence true` and a dedicated disk storage path.
|
|
* Subscribers connect with persistent session flags to ensure the broker buffers QoS 1 messages during temporary network drops.
|
|
5. **Retained Terminal Events**: Terminal events (`completed`/`error`) are published with the `retain=True` flag. This allows a late-joining or recovering subscriber to instantly retrieve the final job status without waiting for active transmissions.
|
|
|
|
---
|
|
|
|
### 1.3 Production Mosquitto Configuration Reference
|
|
A hardened `/etc/mosquitto/mosquitto.conf` production configuration includes:
|
|
```conf
|
|
# Persistence settings
|
|
persistence true
|
|
persistence_location /var/lib/mosquitto/
|
|
|
|
# Authentication and Authorization
|
|
password_file /etc/mosquitto/auth/passwd
|
|
acl_file /etc/mosquitto/auth/acl
|
|
allow_anonymous false
|
|
|
|
# Listener and TLS Configuration
|
|
listener 8883
|
|
cafile /etc/mosquitto/certs/ca.crt
|
|
certfile /etc/mosquitto/certs/server.crt
|
|
keyfile /etc/mosquitto/certs/server.key
|
|
tls_version tlsv1.3
|
|
```
|
|
|
|
---
|
|
|
|
## 2. Event Protocol Specification
|
|
|
|
The event protocol defines a strict, single-direction JSON wire schema. It acts as the contract between the worker agent (the publisher) and the delegator/orchestrator (the subscriber).
|
|
|
|
### 2.1 Wire Schema (JSON UTF-8, `schema_version = 1`)
|
|
Every event payload must adhere to the following schema structure:
|
|
|
|
```json
|
|
{
|
|
"schema_version": 1,
|
|
"seq": 2,
|
|
"job_id": "918b0612",
|
|
"event": "progress",
|
|
"timestamp": "2026-06-20T14:48:58Z",
|
|
"detail": "Section 1: MQTT Broker Architecture completed",
|
|
"data": {
|
|
"auth_token": "URL-safe-base64-random-token-here",
|
|
"custom_metric": 42
|
|
}
|
|
}
|
|
```
|
|
|
|
### 2.2 Wire Schema Field Dictionary
|
|
|
|
| Field | Type | Required | Description & Validation Rules |
|
|
| :--- | :--- | :--- | :--- |
|
|
| `schema_version` | Integer | **Yes** | Must be exactly `1`. Subscribers discard payloads with mismatched version numbers to prevent parser crashes on schema drift. |
|
|
| `seq` | Integer | **Yes** | Monotonic sequence counter starting at `1` for the first publish. Incremented and stored in the job's registry file (`last_seq`) to survive agent pane crashes. |
|
|
| `job_id` | String | **Yes** | The 8-character hex string identifying the target job. Subscribers discard any messages whose `job_id` is unexpected or unrequested. |
|
|
| `event` | String | **Yes** | The event classification: `started`, `progress`, `permission_required`, `completed`, or `error`. |
|
|
| `timestamp` | String | **Yes** | ISO-8601 UTC timestamp with a trailing `Z` suffix. (Advisory only; never trusted for timeouts). |
|
|
| `detail` | String | **Yes** | Generalized, safe text description. Strict rules prohibit absolute paths, workspace paths, passwords, or raw environment variables. |
|
|
| `data` | Object | **Yes** | Metadata dictionary. Used in production to pass `auth_token` or structured execution metrics. |
|
|
|
|
---
|
|
|
|
### 2.3 Event Type Dictionary and Schemas
|
|
|
|
#### 1. `started`
|
|
* **Emit Trigger**: Emitted by the worker agent immediately upon boot inside the tmux session, indicating it has parsed the instructions and started execution.
|
|
* **Payload Constraints**: `seq` must be `1`. Status in registry is transitioned to `running`.
|
|
* **Example Detail**: `"Job 918b0612 started"`
|
|
|
|
#### 2. `progress`
|
|
* **Emit Trigger**: Optional. Emitted at major check-points, long loops, or sub-task boundaries.
|
|
* **Payload Constraints**: None.
|
|
* **Example Detail**: `"Section 1: MQTT Broker Architecture completed"`
|
|
|
|
#### 3. `permission_required`
|
|
* **Emit Trigger**: Emitted when the agent needs human confirmation (e.g. to run a destructive command or read/write critical system files).
|
|
* **Payload Constraints**: `detail` contains the resource/action requested.
|
|
* **Example Detail**: `"needs write permission to MESSAGING.md"`
|
|
|
|
#### 4. `completed` (Terminal)
|
|
* **Emit Trigger**: Successful job completion. The agent has generated all expected artifacts and verified correctness.
|
|
* **Payload Constraints**: Must be the final event. Published with `retain=True`.
|
|
* **Example Detail**: `"deep report written and committed to git"`
|
|
|
|
#### 5. `error` (Terminal)
|
|
* **Emit Trigger**: Terminal failure. Agent encountered an unhandled exception, syntax error, or validation script fail.
|
|
* **Payload Constraints**: Must be the final event. Published with `retain=True`.
|
|
* **Example Detail**: `"validation fail: missing files"`
|
|
|
|
---
|
|
|
|
### 2.4 Integrity and Authentication Verification (HMAC-SHA256 Signatures)
|
|
To prevent unauthorized users from hijacking or spoofing events on public brokers:
|
|
1. When a job is registered, a cryptographic token (`auth_token`) is generated (`secrets.token_urlsafe(32)`).
|
|
2. The publisher reads this token and signs the JSON payload. Specifically, the publisher calculates an HMAC-SHA256 signature using the `auth_token` as the secret key over the serialized payload (with the `hmac_sig` field excluded).
|
|
3. The signature is attached as `data.hmac_sig` on the wire.
|
|
4. The subscriber (`job_subscriber.py`) reads the expected `auth_token` from the local registry and verifies the HMAC signature. Any message with a missing, invalid, or mismatched signature is discarded immediately with an "HMAC verify failed" log.
|
|
5. To prevent event drops, all publishers and subscribers must be updated simultaneously during deployment rollout, since the plaintext `auth_token` is never transmitted on the wire to prevent token interception.
|
|
|
|
---
|
|
|
|
## 3. Job Lifecycle & State Transitions
|
|
|
|
The lifecycle of a delegated job progresses through a highly coordinated state machine, involving file-based registry claiming, asynchronous message subscription, and multi-faceted event publishing.
|
|
|
|
```mermaid
|
|
stateDiagram-v2
|
|
[*] --> pending : register_job()
|
|
pending --> running : pick_pending()
|
|
running --> completed : publish_event(--event completed)
|
|
running --> error : publish_event(--event error)
|
|
running --> cancelled : update_status(..., cancelled)
|
|
pending --> cancelled : update_status(..., cancelled)
|
|
completed --> [*]
|
|
error --> [*]
|
|
cancelled --> [*]
|
|
```
|
|
|
|
### 3.1 Step-by-Step Lifecycle Phase Details
|
|
|
|
#### Phase 1: Registration (`register`)
|
|
* **Trigger**: A delegator triggers `registry.py register` (or the `multi-agent-mux-delegate-job submit` command).
|
|
* **Registry State**: Flips from non-existent to `pending` inside `.mam/jobs/<job_id>.json`. A `last_seq` counter is initialized to `0`.
|
|
* **Locking**: Exclusive fcntl file lock acquired over `.lock` during write.
|
|
* **Durable Audit Log**: Writes `<logs>/<job_id>/meta.json`, sets status to `pending` in `status.json`, and appends a `registered` event line to `events.ndjson`.
|
|
|
|
#### Phase 2: Claiming (`pick_pending`)
|
|
* **Trigger**: An agent session starts up and calls `registry.py pick --agent-session <session_label>`.
|
|
* **Registry State**: Oldest matching `pending` record is scanned. Status is atomically updated to `running`. `updated_at` is stamped.
|
|
* **Locking**: Reads and writes occur inside the exclusive fcntl lock block.
|
|
* **Durable Audit Log**: Status is synced to `running` in `status.json` and a `status_changed` event is appended to `events.ndjson`.
|
|
|
|
#### Phase 3: Listening (`subscribe`)
|
|
* **Trigger**: The wrapper command launches `job_subscriber.py --job <job_id>` in the background **before** launching the agent.
|
|
* **Broker Connection**: Connects to the resolved host, issues a QoS 1 subscription to `python/mqtt/jobs/<job_id>/events`, and blocks on an event queue.
|
|
* **Timeout Initialization**: Dual timeouts (wall-clock budget and activity idle timer) are calculated and start ticking.
|
|
|
|
#### Phase 4: Execution & Progress Events (`publish`)
|
|
* **Trigger**: The agent executes prompts within tmux and runs `publish_event.py` at boot and checkpoint stages.
|
|
* **Network Handshake**: Publisher opens a fresh TCP/TLS socket to the broker, awaits CONNACK, publishes a single QoS 1 message, waits for PUBACK, and gracefully disconnects to avoid socket resource leaks.
|
|
* **State Updates**: Updates `last_seq` monotonically, updates `status` to `running` (if not already), and mirrors the published payload into the local audit logs (`events.ndjson`).
|
|
* **Subscriber Capture**: The subscriber captures the payload, performs bearer token checks, prints the formatted line to stdout, and resets its idle timer.
|
|
|
|
#### Phase 5: Terminal Finalization
|
|
* **Trigger**: Agent publishes `--event completed` or `--event error`.
|
|
* **Registry Transition**: State becomes `completed` or `error`.
|
|
* **Retained Messaging**: The terminal event is published with `retain=True` on the broker.
|
|
* **Subscriber Exit**: The subscriber processes the terminal event exactly once, terminates its background loops, and exits (code `0` for completed, `1` for error).
|
|
|
|
---
|
|
|
|
## 4. Code Internals Analysis
|
|
|
|
### 4.1 `registry.py` & `lib.sh` (Locking & Atomicity)
|
|
Two concurrency control schemes co-exist in this workspace to coordinate state modification:
|
|
|
|
1. **`lib.sh::atomic_dump_yaml()`**: Used for workspace-wide tmux session inventory (`agent-sessions.yaml`).
|
|
* **Locking**: Uses SQLite database transaction serialization via `BEGIN IMMEDIATE` on `agent-sessions.db`.
|
|
* **Safe Mutation**: The mutation source code is passed in an environment variable `AGENT_SESSIONS_MUTATION` and executed dynamically using `exec(compile(..., 'exec'), globals())`. This isolates the execution and avoids command-injection vectors.
|
|
* **Atomicity**: Updates the SQLite tables and then, if a session transitions to a finished state, writes to a temp file in the same directory using `tempfile.mkstemp()` and performs an `os.replace()` rename. POSIX guarantees the replacement is atomic, preventing half-written YAML reads. A `.bak` backup copy is also preserved.
|
|
2. **`registry.py::register_job() / pick_pending() / _atomic_write_record()`**: Used for job-level metadata JSON files (`<job_id>.json`).
|
|
* **Locking**: Wraps operations in a `registry_lock(registry_dir)` context manager, implementing an advisory exclusive lock on `.lock` via `fcntl.flock`.
|
|
* **Atomicity**: In `_atomic_write_record()`, it uses `tempfile.mkstemp` inside the parent registry folder, serializes the updated job record to the temp file, flushes it, triggers a physical disk sync via `os.fsync(fh.fileno())`, and executes `os.replace` to replace the main JSON record file. The file permission is restricted to `0o600` immediately.
|
|
|
|
---
|
|
|
|
### 4.2 `publish_event.py` (Retries and Handshakes)
|
|
The publisher script enforces robust error handling when sending status updates:
|
|
* **Fresh Connection Pattern**: Instead of maintaining a persistent socket connection (which is susceptible to socket timeouts or channel leaks), `publish_event.py` opens a fresh socket, completes the authentication/TLS handshake, publishes a single QoS 1 event, waits for `PUBACK`, and closes the connection.
|
|
* **Exponential Backoff**: Wrapped in the `with_retry()` decorator from `mqtt_common.py`. In case of socket errors (`OSError`, `TimeoutError`, `ConnectionError`), it retries up to 3 times (configurable via `--attempts`) with backoff:
|
|
$$\text{delay} = \min(\text{base\_delay} \times \text{factor}^{\text{attempt}-1}, \text{max\_delay})$$
|
|
Default parameters: `base_delay = 0.5s`, `factor = 2.0`, `max_delay = 8.0s`.
|
|
* **ACK Handshake Deadlines**:
|
|
* `CONNECT_ACK_TIMEOUT = 10s` (stops hangs during broker downtime).
|
|
* `PUBLISH_ACK_TIMEOUT = 5s` (guarantees QoS 1 message acknowledgment before marking as published).
|
|
|
|
---
|
|
|
|
### 4.3 `job_subscriber.py` (Timers and Queue Semantics)
|
|
The subscriber acts as the central execution watchdog:
|
|
* **Queue Serialization**: Uses a thread-safe `queue.Queue` internally. The Paho MQTT callback thread adds messages to the queue, and the main thread processes them sequentially. This separates network I/O from state machine validation.
|
|
* **State Machine Protection**: To safeguard against QoS 1 duplicate delivery or out-of-order broker retries, the subscriber runs a terminal state machine. It records job completion in an internal `terminal` dictionary. Once a job is marked `completed` or `error`, any subsequent events for that `job_id` are ignored:
|
|
```python
|
|
if event in TERMINAL_EVENTS:
|
|
if jid in terminal:
|
|
logger.info("ignoring duplicate terminal %s for %s", event, jid)
|
|
continue
|
|
terminal[jid] = event
|
|
pending.discard(jid)
|
|
```
|
|
* **Dual Timeout Semantics**:
|
|
1. **Wall-Clock Timeout**: Calculated relative to absolute startup time (`wall_deadline = start + wall_timeout`). It acts as a hard budget limit, guarding against an agent hanging indefinitely.
|
|
2. **Activity Idle Timeout**: Measured as the difference between the current monotonic time and the last packet arrival time (`idle_left = idle_timeout - (now - last_event)`). If the agent fails to print logs or publish progress updates for the duration of the idle window, the subscriber aborts and exits with code 2.
|
|
|
|
---
|
|
|
|
### 4.4 `mqtt_common.py` (Logging & Config Resolution)
|
|
* **Log Routing isolation**: Configured via `setup_logging()`. The root logger is bound to `sys.stderr`. This preserves the standard output stream (`stdout`) exclusively for clean JSON-lines payloads, enabling downstream bash tools to pipeline event feeds cleanly (e.g., `job_subscriber.py ... | jq`).
|
|
* **Broker Config Resolution**: Configured in `broker_config_from_job()`. Resolves credentials hierarchically:
|
|
1. Defaults to environment configurations (e.g. `MQTT_BROKER`, `MQTT_PORT`, `MQTT_TLS`, `MQTT_CA_CERTS`).
|
|
2. Overlays credentials specified inside the job record JSON block (`broker.*`). This allows the agent to fetch its dedicated target broker credentials on a per-job basis.
|
|
|
|
---
|
|
|
|
## 5. Cross-System Integration
|
|
|
|
The delegated messaging system functions as a critical control backplane, binding shell wrappers and monitoring loops across the orchestration stack.
|
|
|
|
```mermaid
|
|
graph LR
|
|
User["User/Cron Client"] -->|submit| Wrap["multi-agent-mux-delegate-job (Bash)"]
|
|
Wrap -->|registers| Reg["registry.py (Live Registry)"]
|
|
Wrap -->|spawns background| Sub["job_subscriber.py"]
|
|
Wrap -->|spawns tmux pane| Tmux["tmux Session (Agent Pane)"]
|
|
|
|
Tmux -->|executes agent| Agent["Claude / Codex Agent"]
|
|
Agent -->|publish_event.py| Broker["MQTT Broker"]
|
|
Broker -->|delivers events| Sub
|
|
Broker -->|delivers events| Mon["reconcile.sh (Monitor Loop)"]
|
|
|
|
Mon -->|updates| Inv["agent-sessions.yaml <br> (lib.sh::atomic_dump_yaml)"]
|
|
```
|
|
|
|
### 5.1 Orchestration Wrappers (`multi-agent-mux-*`)
|
|
1. **`multi-agent-mux-delegate-job (submit)`**:
|
|
* Registers a job, spawns `job_subscriber.py` to capture standard output streams to `.mam/jobs/<job_id>.subscriber.out`, and sleeps for `1` second.
|
|
* Boots the agent pane in tmux:
|
|
```bash
|
|
tmux new-session -d -s "$sess" -c "$WORKDIR" \
|
|
"printf '%s' \"$instructions\" | $bin --dangerously-skip-permissions; echo; read"
|
|
```
|
|
* Pre-seeds agent instruction headers via stdin to enforce that the agent runs `publish_event.py` for its transitions.
|
|
* Blocks on `wait $sub_pid`, and finally prints the audit log directory.
|
|
2. **`multi-agent-mux-monitor` (`reconcile.sh`)**:
|
|
* **Wildcard Monitor Integration**: Runs a unified background subscriber loop (`reconcile.sh --subscribe`) to capture progress, verify security tokens (HMAC) and sequences, write audit logs, and automatically clean up tmux sessions upon terminal events.
|
|
* **Reconciliation loop**: Subscribes to the global job topic. On terminal events, it invokes `lib.sh::atomic_dump_yaml` to sync status drifts (e.g. setting tmux sessions to `terminated` in `agent-sessions.yaml` once the agent exits).
|
|
3. **`multi-agent-mux-create / stop / resume`**:
|
|
* Integrates the job life status into session metadata updates, ensuring standard tmux cleanup triggers state updates in the registry and audit logs.
|
|
|
|
---
|
|
|
|
## 6. Known Limitations & Recommendations
|
|
|
|
### 6.1 Limitations
|
|
|
|
1. **Single-Host File Locking Vulnerability**:
|
|
The advisory locking system previously relied heavily on `fcntl.flock`. While `agent-sessions.yaml` has been migrated to SQLite WAL to solve concurrent writes, the job metadata in `.mam/jobs/` still relies on `fcntl.flock` which may behave non-atomically on NFS.
|
|
2. **Bearer Token Leakage over Plaintext (Public Broker)**:
|
|
The `auth_token` mechanism is a simple plaintext bearer comparison. If the transport layer is unencrypted (e.g., using `broker.hivemq.com` on port `1883`), any eavesdropper on the network can steal the token and spoof legitimate events.
|
|
3. **Subscriber Network Drop Orphanage**:
|
|
`job_subscriber.py` does not implement automatic reconnection loops. If the subscriber loses connection to the broker, it exits, leaving the running tmux agent orphaned and without a validation/collection hook.
|
|
4. **Lack of Ordering Guarantees in QoS 1**:
|
|
QoS 1 guarantees delivery but not strict ordering. Under heavy backoff retries, a late-delivered progress event could land after a terminal event, causing state inconsistencies.
|
|
|
|
---
|
|
|
|
### 6.2 Recommendations
|
|
|
|
1. **[Implemented] Migrate to SQLite WAL Backend**:
|
|
The `agent-sessions.yaml` locking mechanism in `lib.sh` has been upgraded to use a SQLite database (`agent-sessions.db`) configured with Write-Ahead Logging (`PRAGMA journal_mode=WAL`). The YAML file is now only updated as a finalized archive when a session reaches a terminal state (`stopped`, `terminated`, `archived`), eliminating `flock` contention during active session updates.
|
|
**Architecture Decision Note**: This means `agent-sessions.yaml` is **no longer a real-time view** of currently `running` sessions. We have explicitly accepted the trade-off of giving up real-time text readability of running sessions in favor of robust concurrency and solving NFS flock limits. Tooling and status checks must now query the SQLite DB to observe live `running` states.
|
|
2. **Implement Signature-Based Payload Verification**:
|
|
Rather than sending a plaintext token, utilize HMAC signatures. The delegator and worker share a secret key; the worker publishes a signature of the payload (e.g. `HMAC-SHA256(secret_key, payload_bytes)`). The subscriber validates the signature, preventing token interception.
|
|
3. **Enforce Mandatory Broker-Side TLS and ACLs**:
|
|
De-prioritize plaintext support. Enforce connection over port `8883` with verified TLS certificates. Implement client certificates (mTLS) for agent authentication.
|
|
4. **Build Auto-Reconnecting Subscriber Loops**:
|
|
Upgrade `job_subscriber.py` to handle disconnect callbacks. Maintain a persistent queue in memory and allow the client to reconnect with exponential backoff, preventing socket dropout from terminating the orchestration flow.
|
|
|
|
---
|
|
|
|
## Glossary: Session States vs Job States
|
|
|
|
This project manages **two distinct state domains** that are often confused:
|
|
|
|
### Session States (YAML — `.mam/agent-sessions.yaml`)
|
|
Managed by `.agents/skills/lib.sh` and the 6 `multi-agent-mux-*` skills.
|
|
Valid values (see `lib.sh` valid-status set):
|
|
|
|
| State | Meaning | Set by |
|
|
|---|---|---|
|
|
| `running` | tmux session active, agent running | `create`, `resume` |
|
|
| `stopped` | deliberately stopped via `--capture-id`/`--reason`/`--graceful`; conversation preserved for resume | `stop` (STOP mode) |
|
|
| `terminated` | hard-killed via `--mode hard`; tmux session destroyed | `stop` (hard mode), `monitor` reconcile |
|
|
| `archived` | soft-stopped via `--mode soft`; tmux left alive, YAML-only update | `stop` (soft mode) |
|
|
|
|
### Job States (Registry — `.mam/jobs/<id>.json`)
|
|
Managed by `.agents/skills/multi-agent-mux-delegate-job/scripts/registry.py`.
|
|
Valid values:
|
|
|
|
| State | Meaning | Set by |
|
|
|---|---|---|
|
|
| `pending` | job registered, agent not yet started | `registry.py register` |
|
|
| `running` | agent picked up the job, publishing events | `publish_event.py --event started` |
|
|
| `completed` | terminal event — agent finished successfully | `publish_event.py --event completed` |
|
|
| `error` | terminal event — agent failed | `publish_event.py --event error` |
|
|
| `cancelled` | job cancelled by orchestrator | `registry.py cancel` |
|
|
|
|
**Key distinction**: Session states track the **tmux container lifecycle** (create→stop→resume).
|
|
Job states track the **delegated work lifecycle** (submit→run→complete/error).
|
|
A single session can host multiple sequential jobs; a job runs within exactly one session.
|