
# tmux-agent-orchestration

A **Multi-Agent Orchestration & Messaging Backplane** framework built on tmux and MQTT. It coordinates, isolates, and audits long-running agent tasks (such as code generation, refactoring, and security reviews) across multiple LLM backend clients (e.g., Claude, Hermes).

---

## 🚀 Overview

Modern agentic workflows often suffer from session timeouts, weak process isolation, terminal viewport truncation (scrollback limits), and race conditions between concurrent agents.

**tmux-agent-orchestration** addresses these problems by providing:
1. **Tmux-based Process Isolation:** Spawning LLM client sessions inside dedicated, isolated tmux environments to support persistent background runs.
2. **Asynchronous Event-Driven Architecture:** Leveraging an MQTT broker as a message backplane to coordinate state transitions (`started`, `progress`, `completed`, `error`) between collaborating agents.
3. **Multi-Agent Mux (MAM):** Combining local file-based locks (fcntl) and an ACID-compliant SQLite WAL database (`.mam/agent-sessions.db`) to manage concurrent job claims and track running agent sessions without drift.
4. **Automated Review & Quality Loop:** Implementing parallel reviewer loops where worker agents must receive a `PASS` rating from various specialized verification agents (e.g., Claude for high-level logic, Hermes for shell syntax/safety) before merging code.

---

## 🛠️ Core Skills & Scaffolding

All orchestration functionalities are structured under the `.agents/skills/` directory:

*   **`multi-agent-mux-create`**: Spawns isolated tmux sessions running specified agent CLI wrappers. It captures system processes, updates metadata registries, and enforces authentication checks.
*   **`multi-agent-mux-stop`**: Gracefully terminates agent CLI sessions (using key macros like `/exit` or `Exit`) and handles disk purge operations (removing conversation JSON files and SQLite logs for deleted workspaces).
*   **`multi-agent-mux-resume`**: Restores stopped sessions by resolving workspace UUIDs from disk or cache, and invokes the underlying agent using session-resume parameters (e.g., `claude -r <uuid>` or `hermes --resume <uuid>`).
*   **`multi-agent-mux-status`**: Queries the running state of all active sessions and detects PID mismatches, changed command signatures, and drift between live tmux instances and the registry database.
*   **`multi-agent-mux-monitor`**: A long-running Kanban reconcile worker that dynamically monitors tmux sessions and synchronizes states to `.mam/agent-sessions.yaml`.
*   **`multi-agent-mux-delegate-job`**: The core asynchronous task distribution module containing:
    *   `registry.py`: Atomically registers and claims jobs using file advisory locks (`fcntl`).
    *   `job_subscriber.py`: Connects to the MQTT backplane, captures live events, and appends them to audit trails.
    *   `publish_event.py`: Emits execution status transitions and error details from workspace scripts.
    *   `mqtt_common.py`: Manages connection policies, authentication, and HMAC signing.
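
The claim step described for `registry.py` can be illustrated with a minimal sketch. It assumes a sidecar lock file next to each job record and a `claimed_by` field; both names are hypothetical and are not taken from the actual module.

```python
import fcntl
import json
from pathlib import Path


def claim_job(job_path: Path, agent: str) -> bool:
    """Claim the job for `agent`; return False if someone else already holds it."""
    # Hypothetical sidecar lock file; the real registry.py may lock differently.
    lock_path = job_path.with_name(job_path.name + ".lock")
    with open(lock_path, "w") as lock:
        # Advisory exclusive lock: blocks until no other cooperating process holds it.
        fcntl.flock(lock, fcntl.LOCK_EX)
        try:
            record = json.loads(job_path.read_text())
            if record.get("claimed_by"):  # "claimed_by" is an assumed field name
                return False
            record["claimed_by"] = agent
            # Read-check-write happens entirely under the lock, so two agents
            # cannot both observe an unclaimed job.
            job_path.write_text(json.dumps(record))
            return True
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)
```

Advisory locks only protect against processes that take the same lock, so every writer of a job record would need to go through a helper like this one.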

---

## 📐 Big-Picture Architecture

The system coordinates LLM agents across multiple workspaces through two core layers:

1. **Layer A — Tmux Orchestration (lib.sh + status/resume/stop/create)**: Runs the agents (one tmux session per agent-workspace combination) and maintains an authoritative registry in `.mam/agent-sessions.yaml` (+ `.mam/agent-sessions.db`).
2. **Layer B — Async Job Delegation (delegate-job)**: Dispatches a task to an agent and observes progress and completion via an event channel.

These two layers share one lock-guarded chokepoint for file I/O: `lib.sh::atomic_dump_yaml`. Every write is serialized by a SQLite transaction lock and checked by schema validation.
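
As a rough illustration of that chokepoint, the sketch below validates a record and then writes it inside a `BEGIN IMMEDIATE` transaction. The table layout, the required keys, and the function names are assumptions made for the example; the real `atomic_dump_yaml` lives in `lib.sh` and also emits the YAML registry, which is omitted here.

```python
import sqlite3

# Hypothetical schema: the real registry fields are not documented in this README.
REQUIRED_KEYS = {"session", "agent", "workspace", "status"}


def validate(record: dict) -> None:
    """Reject records that are missing any required key."""
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")


def write_session(db_path: str, record: dict) -> None:
    validate(record)
    # isolation_level=None disables implicit transactions so BEGIN/COMMIT are explicit.
    conn = sqlite3.connect(db_path, timeout=30, isolation_level=None)
    try:
        # Take the write lock up front; a competing writer waits up to `timeout` seconds.
        conn.execute("BEGIN IMMEDIATE")
        try:
            conn.execute(
                "CREATE TABLE IF NOT EXISTS sessions ("
                "session TEXT PRIMARY KEY, agent TEXT, workspace TEXT, status TEXT)"
            )
            conn.execute(
                "INSERT OR REPLACE INTO sessions VALUES (?, ?, ?, ?)",
                (record["session"], record["agent"],
                 record["workspace"], record["status"]),
            )
            conn.execute("COMMIT")
        except Exception:
            conn.execute("ROLLBACK")
            raise
    finally:
        conn.close()
```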

### Data Flow Overview

```text
  +-----------+   register_job    +---------------------+
  | delegator | ----------------> | .mam/jobs/<id>.json |  <-- live record
  +-----------+                   +----------+----------+
                                             |
                                             | atomic rename + fsync
                                             v
                                  +---------------------+
                                  |      audit log      |  <-- append-only
                                  | .mam/delegate_      |      events.ndjson
                                  |   job_logs/<id>/    |
                                  +----------+----------+
                                             ^
                                             | (best-effort mirrors)
                                             |
  +-----------+   publish_event   +----------+----------+      +---------+
  |   agent   | ----------------> |     MQTT broker     | <--- | monitor |
  | (claude)  |                   +---------------------+      +----+----+
  +-----------+                                                     |
        ^                                                           v
        | subscriber                                        atomic_dump_yaml
        | (job_subscriber.py)                          (.mam/agent-sessions.yaml)
        |                                                           ^
        +-------- delegator waits here ----------+                  |
                                                            +-------+------+
                                                            | reconcile.sh |
                                                            +--------------+
```
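
The two durability primitives named in the diagram, an atomic rename with `fsync` for the live record and an append-only NDJSON audit log, could look roughly like this. It is a sketch under assumptions: the helper names are hypothetical and the event fields are illustrative only.

```python
import json
import os
from pathlib import Path


def write_job_record(path: Path, record: dict) -> None:
    """Replace the live job record so readers never see a half-written file."""
    tmp = path.with_name(path.name + ".tmp")
    with open(tmp, "w") as fh:
        json.dump(record, fh)
        fh.flush()
        os.fsync(fh.fileno())  # force the bytes to disk before the rename
    os.replace(tmp, path)      # atomic on POSIX when both paths share a filesystem


def append_audit_event(log_dir: Path, event: dict) -> None:
    """Append one event as a single JSON line; existing lines are never rewritten."""
    log_dir.mkdir(parents=True, exist_ok=True)
    with open(log_dir / "events.ndjson", "a") as fh:
        fh.write(json.dumps(event, sort_keys=True) + "\n")
        fh.flush()
        os.fsync(fh.fileno())
```

One JSON object per line keeps the log easy to tail and lets a reader skip a truncated final line after a crash.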

### 🔒 Tmux Server Isolation

To prevent workspace tmux processes from interfering with each other or with system tmux servers, the framework enforces isolated tmux environments:
*   **Per-Workspace Shim:** `_init_tmux_isolation` and `_resolve_real_tmux_path` instantiate a per-workspace shim directory under `/tmp/multi-agent-tmux-shim/<TMUX_SERVER_NAME>/tmux` that intercepts tmux commands and wraps them in `tmux -L <server>`.
*   **PATH Rewriting:** The `PATH` environment variable is dynamically prepended with the shim path in all child processes. This ensures any `tmux` invocation within the agent's process tree is restricted to its isolated socket server.
*   **Environment Restoration:** If `TMUX_SERVER_NAME` is set to `default`, the PATH override is removed, reverting to the default global tmux server.
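
The shim mechanism can be sketched as follows. This is an illustration rather than the `lib.sh` implementation: `make_tmux_shim` and `isolated_env` are hypothetical names, and the shim root is passed in instead of being fixed to `/tmp/multi-agent-tmux-shim`.

```python
import os
import shlex
import stat
from pathlib import Path


def make_tmux_shim(shim_root: Path, server_name: str, real_tmux: str) -> Path:
    """Create <shim_root>/<server_name>/tmux, a wrapper that forces `tmux -L <server>`."""
    shim_dir = shim_root / server_name
    shim_dir.mkdir(parents=True, exist_ok=True)
    shim = shim_dir / "tmux"
    shim.write_text(
        "#!/bin/sh\n"
        f'exec {shlex.quote(real_tmux)} -L {shlex.quote(server_name)} "$@"\n'
    )
    # Add execute bits so the shim can be resolved through PATH.
    shim.chmod(shim.stat().st_mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return shim_dir


def isolated_env(env: dict, shim_dir: Path, server_name: str) -> dict:
    """Return a child environment whose PATH resolves `tmux` to the shim first."""
    child = dict(env)
    if server_name != "default":  # "default" keeps the global tmux server
        child["PATH"] = f"{shim_dir}{os.pathsep}{env.get('PATH', '')}"
    return child
```

The wrapper must call the real binary by absolute path; otherwise it would find itself on the rewritten `PATH` and recurse.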

### 🛡️ Concurrency Design & Write Serialization

The framework implements lock-guarded execution pathways to prevent race conditions during parallel agent operations:
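*   **SQLite Database Locks (`BEGIN IMMEDIATE`):** Every mutation of `agent-sessions.yaml` and the SQLite registry runs through `atomic_dump_yaml` inside `lib.sh`, which serializes writes with a `BEGIN IMMEDIATE` transaction on the SQLite database `.mam/agent-sessions.db`. `BEGIN IMMEDIATE` acquires the write lock at the start of the transaction, so a competing writer waits or fails at that point instead of discovering the conflict mid-write.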
*   **Dual-Interpreter Strategy:** To minimize dependency bloat and guarantee stability, the backplane splits execution environments: the virtual environment `.venv` handles MQTT communication and async jobs (requiring `paho-mqtt`), while the system `python3` executes `atomic_dump_yaml` (relying on system-wide `PyYAML`).
*   **NFS and Network FS Safeguards:** Because file locking (`flock`) and SQLite WAL are unreliable on network filesystems (NFS, CIFS, SSHFS), `lib.sh` performs filesystem detection. If a network mount is identified, it prints a safety warning and the SQLite journal mode is switched from `WAL` to `DELETE`.
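
The journal-mode fallback can be sketched as below. The set of filesystem type strings is an assumption, and the detection step itself is left to the caller, since how `lib.sh` identifies the mount type is not described here.

```python
import sqlite3

# Assumed filesystem type names; the list lib.sh actually checks may differ.
NETWORK_FS_TYPES = {"nfs", "nfs4", "cifs", "smb3", "fuse.sshfs"}


def pick_journal_mode(fs_type: str) -> str:
    """WAL relies on shared memory, so fall back to DELETE on network mounts."""
    return "DELETE" if fs_type.lower() in NETWORK_FS_TYPES else "WAL"


def open_registry(db_path: str, fs_type: str) -> sqlite3.Connection:
    """Open the database and apply the journal mode chosen for this filesystem."""
    conn = sqlite3.connect(db_path)
    mode = pick_journal_mode(fs_type)
    # PRAGMA journal_mode reports the mode actually in effect after the change.
    actual = conn.execute(f"PRAGMA journal_mode={mode}").fetchone()[0]
    if actual.upper() != mode:
        conn.close()
        raise RuntimeError(f"journal mode is {actual}, expected {mode}")
    return conn
```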

---

## 📐 Architecture & Coordination Loop

The interaction between roles (Project Manager, Worker, and Reviewer) is structured as a strict iterative loop:

```mermaid
sequenceDiagram
    autonumber
    actor User as User
    participant PM as Project Manager
    participant W as Worker
    participant R as Reviewers
    participant M as MQTT Backplane

    User->>PM: Hand over requirements
    Note over PM: Plan tasks & register jobs
    PM->>M: Register Job & start Subscriber
    PM->>W: Delegate task (Provide Job ID & Brief)
    W->>M: Publish 'started' event
    Note over W: Implement & verify code
    W->>M: Publish 'completed' (or 'error')
    PM->>R: Request parallel reviews (Provide Diff)
    Note over R: Parallel analysis (Claude, Hermes)
    alt Review Feedback (NOT PASS)
        R->>PM: NOT PASS (Feedback with code blocks)
        Note over PM: Apply fixes or re-delegate
        PM->>W: Re-delegate with comments
    else Verification PASS
        R->>PM: PASS
    end
    PM->>User: Commit changes & Report completion
```
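
The gate at the end of the loop reduces to a simple rule: commit only when every reviewer returns `PASS`, otherwise re-delegate with the collected feedback. A minimal sketch, with a hypothetical `Review` type and action names chosen for the example:

```python
from dataclasses import dataclass


@dataclass
class Review:
    reviewer: str       # e.g. "claude" or "hermes"
    verdict: str        # "PASS" or "NOT PASS"
    feedback: str = ""


def next_action(reviews: list[Review]) -> tuple[str, list[str]]:
    """Return ("commit", []) only if all reviewers passed the diff."""
    if not reviews:
        # No review at all is treated as not approved.
        return "re-delegate", ["no reviews received"]
    failing = [r for r in reviews if r.verdict != "PASS"]
    if failing:
        return "re-delegate", [f"{r.reviewer}: {r.feedback}" for r in failing]
    return "commit", []
```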

---

## 🔒 Security & Replay Attack Defense

To ensure communication integrity across public MQTT brokers, the backplane integrates an **HMAC-SHA256 signature protocol**:
*   **PoC Mode (Unauthenticated):** Default mode where `auth_token` is `null`, skipping cryptographic validations for quick setups.
*   **Production Mode (Authenticated):** A unique cryptographic token is issued per job. Event payloads must include an `hmac_sig` computed with the token.
*   **Replay Attack Mitigation:** Each event carries a monotonically increasing integer sequence counter (`seq`). The subscriber (`job_subscriber.py`) drops any payload whose sequence number is not strictly greater than the highest sequence number it has already accepted for that job. Combined with the HMAC signature on the payload body, this rejects both re-injected and out-of-order packets without relying on clock synchronization. The wire-format timestamp field is advisory metadata only; the backplane does not enforce a clock-skew window.
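
The signature and sequence checks can be combined as in the sketch below. It makes several assumptions: the `job_id` field name, the canonical JSON encoding used as HMAC input, and the `ReplayGuard` class are illustrative, and the authoritative wire format is the one specified in `MESSAGING.md`.

```python
import hashlib
import hmac
import json
from typing import Optional


def sign_event(event: dict, token: bytes) -> dict:
    """Return a copy of the event with `hmac_sig` computed over the other fields."""
    body = {k: v for k, v in event.items() if k != "hmac_sig"}
    # Assumed canonical form: sorted keys, compact separators.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":")).encode()
    sig = hmac.new(token, canonical, hashlib.sha256).hexdigest()
    return {**body, "hmac_sig": sig}


class ReplayGuard:
    """Accept an event only if its signature is valid and its seq is strictly newer."""

    def __init__(self, token: Optional[bytes]):
        self.token = token                  # None models PoC mode (no signature check)
        self.last_seq: dict = {}            # highest accepted seq per job

    def accept(self, event: dict) -> bool:
        if self.token is not None:
            expected = sign_event(event, self.token)["hmac_sig"]
            # Constant-time comparison avoids leaking how many characters matched.
            if not hmac.compare_digest(expected, str(event.get("hmac_sig", ""))):
                return False
        job, seq = event.get("job_id"), event.get("seq")
        last = self.last_seq.get(job)
        if not isinstance(seq, int) or (last is not None and seq <= last):
            return False                    # replayed or out-of-order event
        self.last_seq[job] = seq
        return True
```

Because `seq` is part of the signed body, an attacker cannot bump the counter on a captured message without invalidating its signature.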

---

## 📁 Repository Layout

```text
.
├── .agents/
│   └── skills/                  # Core orchestration shell wrappers & libraries
│       ├── lib.sh               # Shared orchestration library
│       ├── multi-agent-mux-create/
│       ├── multi-agent-mux-stop/
│       ├── multi-agent-mux-resume/
│       ├── multi-agent-mux-status/
│       ├── multi-agent-mux-monitor/
│       └── multi-agent-mux-delegate-job/
│           ├── requirements.txt # Python dependency declaration
│           └── scripts/         # Core backplane implementation (Python)
├── .mam/                        # Multi-Agent Mux metadata (git-ignored)
│   ├── agent-sessions.db        # SQLite WAL session database
│   ├── agent-sessions.yaml      # Human-readable session registry
│   └── jobs/                    # Asynchronous job metadata files
├── scripts/
│   └── generate-env.sh          # Environment bootstrap helper
├── AGENT.md                     # Agent roles, snapshotting rules, and execution charter
├── BOOTSTRAP.md                 # Detailed installation and verification guide
├── MESSAGING.md                 # MQTT wire protocol specification
└── README.md                    # Project introduction and overview (this file)
```

---

## 🚦 Quick Start

For detailed setup instructions, please consult the **[BOOTSTRAP.md](./BOOTSTRAP.md)** file. Below is a quick summary:

1.  **Initialize Environment Config:**
    ```bash
    ./scripts/generate-env.sh
    ```
2.  **Create Virtual Environment and Install Dependencies:**
    ```bash
    python3 -m venv .venv
    source .venv/bin/activate
    pip install -r .agents/skills/multi-agent-mux-delegate-job/requirements.txt
    ```
3.  **Run Registry Diagnostics:**
    ```bash
    .venv/bin/python3 .agents/skills/multi-agent-mux-delegate-job/scripts/registry.py list
    ```

---

## 📝 Guidelines for Collaborating Agents

If you are an AI agent newly onboarded to this project:
1.  Read **[AGENT.md](./AGENT.md)** to align on development constraints and roles (PM, Worker, Reviewer).
2.  Adhere to the **Pane Snapshotting Rules** in `AGENT.md` (Section 4) to prevent scrollback data loss during long execution steps.
3.  Never modify core logic without submitting a diff to the reviewer sessions for evaluation.