Godopu 387b43d8e3 fix(deploy): stage installer download and copy runtime assets no-clobber (FW-D1)
deploy/install.sh extracted the repo archive in-place with
`tar --strip-components=1`, which inside an existing project could silently
overwrite the target's own README.md/FUTURE_WORKS.md/etc and litter it with
this repo's dev docs.

Rebuild the fetch path:
- stage the clone/extract into a `mktemp -d` dir, never in-place
- verify `.agents/skills/lib.sh` is present before copying anything
- copy only runtime assets (.agents/, AGENT.md, .env.example) into the target
  with per-file no-clobber guards (`[ ! -e ]`), so existing files always win
- post-fetch sanity check now tests a file, not just the directory
- fail fast when neither git nor curl is available

Use explicit `[ ! -e ]` guards + a POSIX find merge rather than `cp -n`
(non-portable; emits a deprecation warning on GNU coreutils 9.x). The earlier
`tar --exclude` denylist idea was rejected in review: non-portable and the
unanchored `--exclude="scripts"` pattern stripped the skills' own nested
scripts/ dirs, yielding a silently broken install.

Mark FW-D1 resolved and FW-D2 partially addressed in FUTURE_WORKS.md/.ko.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 10:33:05 +09:00

tmux-agent-orchestration

A multi-agent orchestration and messaging backplane built on tmux and MQTT. It coordinates, isolates, and audits long-running agent tasks (such as code generation, refactoring, and security reviews) across multiple LLM backend clients (e.g., Claude, Hermes).


🚀 Overview

Modern agentic workflows often suffer from session timeouts, a lack of process isolation, terminal viewport truncation (scrollback limits), and complex concurrency issues.

tmux-agent-orchestration addresses these problems by providing:

  1. Tmux-based Process Isolation: Spawning LLM client sessions inside dedicated, isolated tmux environments to support persistent background runs.
  2. Asynchronous Event-Driven Architecture: Leveraging an MQTT broker as a message backplane to coordinate state transitions (started, progress, completed, error) between collaborating agents.
  3. Multi-Agent Mux (MAM): Combining local file-based locks (fcntl) and an ACID-compliant SQLite WAL database (.mam/agent-sessions.db) to manage concurrent job claims and track running agent sessions without drift.
  4. Automated Review & Quality Loop: Running parallel reviewer loops in which worker agents must receive a PASS rating from specialized verification agents (e.g., Claude for high-level logic, Hermes for shell syntax/safety) before their code is merged.

🛠️ Core Skills & Scaffolding

All orchestration functionality lives under the .agents/skills/ directory:

  • multi-agent-mux-create: Spawns isolated tmux sessions running specified agent CLI wrappers. It captures system processes, updates metadata registries, and enforces authentication checks.
  • multi-agent-mux-stop: Gracefully terminates agent CLI sessions (using key macros like /exit or Exit) and handles disk purge operations (removing conversation JSON files and SQLite logs for deleted workspaces).
  • multi-agent-mux-resume: Restores stopped sessions by resolving workspace UUIDs from disk or cache, and invokes the underlying agent using session-resume parameters (e.g., claude -r <uuid> or hermes --resume <uuid>).
  • multi-agent-mux-status: Queries the running states of all active sessions, detecting PID mismatches, command signatures, and drifts between actual tmux instances and the registry database.
  • multi-agent-mux-monitor: A long-running Kanban reconcile worker that dynamically monitors tmux sessions and synchronizes states to .mam/agent-sessions.yaml.
  • multi-agent-mux-delegate-job: The core asynchronous task distribution module containing:
    • registry.py: Atomically registers and claims jobs using file advisory locks (fcntl).
    • job_subscriber.py: Connects to the MQTT backplane, captures live events, and appends them to audit trails.
    • publish_event.py: Emits execution status transitions and error details from workspace scripts.
    • mqtt_common.py: Manages connection policies, authentication, and HMAC signing.
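
The claim step described for registry.py can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name claim_job, the .registry.lock file name, the record fields, and the choice of fcntl.flock (rather than fcntl.lockf) are all assumptions.

```python
import fcntl
import json
from pathlib import Path


def claim_job(jobs_dir: Path, job_id: str, owner: str) -> bool:
    """Try to claim a job record; return True only for the first claimant."""
    jobs_dir.mkdir(parents=True, exist_ok=True)
    lock_path = jobs_dir / ".registry.lock"   # hypothetical lock file name
    record_path = jobs_dir / f"{job_id}.json"

    with open(lock_path, "w") as lock_file:
        # Blocks until this process holds the exclusive advisory lock.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            record = json.loads(record_path.read_text()) if record_path.exists() else {}
            if record.get("owner"):
                return False  # another process already claimed this job
            record.update({"job_id": job_id, "owner": owner, "state": "claimed"})
            record_path.write_text(json.dumps(record))
            return True
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

Because the read, the ownership check, and the write all happen while the lock is held, two workers racing for the same job cannot both see it as unclaimed. Advisory locks only protect against processes that also take the lock.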

📐 Big-Picture Architecture

The system coordinates LLM agents across multiple workspaces through two core layers:

  1. Layer A — Tmux Orchestration (lib.sh + status/resume/stop/create): Runs the agents (one tmux session per agent-workspace combination) and maintains an authoritative registry in .mam/agent-sessions.yaml (+ .mam/agent-sessions.db).
  2. Layer B — Async Job Delegation (delegate-job): Dispatches a task to an agent and observes progress and completion via an event channel.

These two layers share one lock-guarded chokepoint for file I/O: lib.sh::atomic_dump_yaml. Every write is protected by an exclusive SQLite database transaction lock and schema validation.

Data Flow Overview

  +-----------+   register_job    +---------------------+
  | delegator | ----------------> | .mam/jobs/<id>.json |  <-- live record
  +-----------+                   +----------+----------+
                                             |
                                             | atomic rename + fsync
                                             v
                                    +-----------------+
                                    |    audit log    |  <-- append-only
                                    | .mam/delegate_  |      events.ndjson
                                    |  job_logs/<id>/ |
                                    +--------+--------+
                                             ^
                                             | (best-effort mirrors)
                                             |
  +-----------+   publish_event   +----------+--+      +---------+
  |  agent    | ----------------> | MQTT broker | <--- | monitor |
  | (claude)  |                   +-------------+      +----+----+
  +-----------+                                             |
        ^                                                   v
        | subscriber                                atomic_dump_yaml
        | (job_subscriber.py)                  (.mam/agent-sessions.yaml)
        |                                                   ^
        +-------- delegator waits here ----------+          |
                                                     +------+-------+
                                                     | reconcile.sh |
                                                     +--------------+
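
The "atomic rename + fsync" step and the append-only events.ndjson log in the diagram follow a common pattern, sketched below under assumptions. The helper names atomic_write_json and append_event are hypothetical and are not taken from the repository.

```python
import json
import os
import tempfile
from pathlib import Path


def atomic_write_json(path: Path, payload: dict) -> None:
    """Write payload so readers see the old file or the new one, never a partial one."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(payload, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())      # data reaches disk before the rename
        os.replace(tmp_name, path)      # atomic on POSIX within one filesystem
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def append_event(log_path: Path, event: dict) -> None:
    """Append one event as a single NDJSON line."""
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with open(log_path, "a") as log:
        log.write(json.dumps(event) + "\n")
        log.flush()
        os.fsync(log.fileno())
```

The temporary file is created in the same directory as the target so that the rename never crosses a filesystem boundary, which is what keeps it atomic.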

🔒 Tmux Server Isolation

To prevent workspace tmux processes from interfering with each other or with system tmux servers, the framework enforces isolated tmux environments:

  • Per-Workspace Shim: _init_tmux_isolation and _resolve_real_tmux_path create a per-workspace shim at /tmp/multi-agent-tmux-shim/<TMUX_SERVER_NAME>/tmux that intercepts tmux commands and wraps them in tmux -L <server>.
  • PATH Rewriting: The shim directory is prepended to PATH in all child processes, so any tmux invocation within the agent's process tree is confined to its isolated server socket.
  • Environment Restoration: If TMUX_SERVER_NAME is set to default, the PATH override is removed, reverting to the default global tmux server.
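
The shim mechanism can be illustrated with a short sketch. The real logic lives in lib.sh (shell); the Python functions build_shim and isolated_env below are hypothetical stand-ins that only show the idea of a wrapper script plus a PATH prefix.

```python
import os
import stat
from pathlib import Path


def build_shim(shim_root: Path, server_name: str, real_tmux: str) -> Path:
    """Create <shim_root>/<server_name>/tmux forwarding to `real_tmux -L <server_name>`."""
    shim_dir = shim_root / server_name
    shim_dir.mkdir(parents=True, exist_ok=True)
    shim = shim_dir / "tmux"
    # The wrapper pins every invocation to one named server socket.
    shim.write_text(f'#!/bin/sh\nexec "{real_tmux}" -L "{server_name}" "$@"\n')
    shim.chmod(shim.stat().st_mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return shim_dir


def isolated_env(shim_dir: Path, server_name: str) -> dict:
    """Return a child environment; the name 'default' leaves PATH untouched."""
    env = dict(os.environ)
    if server_name != "default":
        env["PATH"] = f"{shim_dir}{os.pathsep}{env.get('PATH', '')}"
    return env
```

Any child process started with the returned environment resolves a bare tmux command to the shim first, so it cannot reach another workspace's server by accident. A process that calls tmux by absolute path bypasses the shim.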

🛡️ Concurrency Design & Write Serialization

The framework implements lock-guarded execution pathways to prevent race conditions during parallel agent operations:

  • SQLite Database Locks (BEGIN IMMEDIATE): Every mutation of agent-sessions.yaml and the SQLite registry runs through atomic_dump_yaml in lib.sh, which serializes writes with a BEGIN IMMEDIATE transaction on .mam/agent-sessions.db. BEGIN IMMEDIATE takes the database write lock when the transaction starts, so a second writer waits or fails at BEGIN instead of partway through its update.
  • Dual-Interpreter Strategy: To minimize dependency bloat and guarantee stability, the backplane splits execution environments: the virtual environment .venv handles MQTT communication and async jobs (requiring paho-mqtt), while the system python3 executes atomic_dump_yaml (relying on system-wide PyYAML).
  • NFS and Network FS Safeguards: Because file locking (flock) and SQLite WAL are unreliable on network filesystems (NFS, CIFS, SSHFS), lib.sh detects the filesystem type. If a network mount is identified, it prints a safety warning and the SQLite journal mode is switched from WAL to DELETE.
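
The write-serialization idea can be shown with Python's standard sqlite3 module. This is a sketch under assumptions: the sessions table, its columns, and the helper names open_registry and write_txn are invented for illustration and do not reflect the real schema.

```python
import sqlite3
from contextlib import contextmanager


def open_registry(db_path: str, timeout: float = 5.0) -> sqlite3.Connection:
    # isolation_level=None disables sqlite3's implicit transactions, so the
    # BEGIN IMMEDIATE below is issued exactly as written.
    conn = sqlite3.connect(db_path, timeout=timeout, isolation_level=None)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("CREATE TABLE IF NOT EXISTS sessions (name TEXT PRIMARY KEY, state TEXT)")
    return conn


@contextmanager
def write_txn(conn: sqlite3.Connection):
    conn.execute("BEGIN IMMEDIATE")   # take the write lock up front
    try:
        yield conn
        conn.execute("COMMIT")
    except BaseException:
        conn.execute("ROLLBACK")
        raise
```

A caller wraps each mutation in `with write_txn(conn):`. While one connection is inside the block, another connection's BEGIN IMMEDIATE waits up to its timeout and then raises sqlite3.OperationalError, so two writers never interleave.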

📐 Architecture & Coordination Loop

The interaction between roles (Project Manager, Worker, and Reviewer) is structured as a strict iterative loop:

sequenceDiagram
    autonumber
    actor User as User
    participant PM as Project Manager
    participant W as Worker
    participant R as Reviewers
    participant M as MQTT Backplane

    User->>PM: Hand over requirements
    Note over PM: Plan tasks & register jobs
    PM->>M: Register Job & start Subscriber
    PM->>W: Delegate task (Provide Job ID & Brief)
    W->>M: Publish 'started' event
    Note over W: Implement & verify code
    W->>M: Publish 'completed' (or 'error')
    PM->>R: Request parallel reviews (Provide Diff)
    Note over R: Parallel analysis (Claude, Hermes)
    alt Review Feedback (NOT PASS)
        R->>PM: NOT PASS (Feedback with code blocks)
        Note over PM: Apply fixes or re-delegate
        PM->>W: Re-delegate with comments
    else Verification PASS
        R->>PM: PASS
    end
    PM->>User: Commit changes & Report completion
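
The review branch of the loop above can be sketched as a small function. This is an illustration only: the reviewer callables, the rework callback, and the max_rounds cap are hypothetical, and the source does not state a round limit.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Tuple

Reviewer = Callable[[str], Tuple[bool, str]]   # diff -> (passed, feedback)


def review_loop(diff: str, reviewers: Dict[str, Reviewer],
                rework: Callable[[str, Dict[str, str]], str],
                max_rounds: int = 3) -> Tuple[bool, str]:
    """Run all reviewers in parallel; re-delegate with feedback until every one passes."""
    for _ in range(max_rounds):
        with ThreadPoolExecutor(max_workers=len(reviewers)) as pool:
            futures = {name: pool.submit(fn, diff) for name, fn in reviewers.items()}
            results = {name: f.result() for name, f in futures.items()}
        feedback = {name: note for name, (ok, note) in results.items() if not ok}
        if not feedback:
            return True, diff          # every reviewer returned PASS
        diff = rework(diff, feedback)  # worker applies the review comments
    return False, diff                 # still NOT PASS after max_rounds
```

In the real system the reviewers are separate agent sessions and the hand-offs travel over the MQTT backplane; plain callables stand in for them here.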

🔒 Security & Replay Attack Defense

To ensure communication integrity across public MQTT brokers, the backplane integrates an HMAC-SHA256 signature protocol:

  • PoC Mode (Unauthenticated): Default mode where auth_token is null, skipping cryptographic validations for quick setups.
  • Production Mode (Authenticated): A unique cryptographic token is issued per job. Event payloads must include an hmac_sig computed with the token.
  • Replay Attack Mitigation: Each event carries a monotonically increasing integer sequence counter (seq). The subscriber (job_subscriber.py) drops any payload whose sequence number is not strictly greater than the highest sequence number it has already accepted for that job. Combined with the HMAC signature on the payload body, this rejects both re-injected and out-of-order packets without relying on clock synchronization. The wire-format timestamp field is advisory metadata only; the backplane does not enforce a clock-skew window.
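
The signature and sequence checks can be sketched with the standard hmac module. The field names hmac_sig and seq come from the description above; the job_id field, the canonical form (sorted-key JSON), and the names sign_event and ReplayGuard are assumptions. MESSAGING.md is the reference for the real wire format.

```python
import hashlib
import hmac
import json
from typing import Optional


def sign_event(token: bytes, event: dict) -> dict:
    """Return a copy of event with hmac_sig computed over its canonical JSON body."""
    body = {k: v for k, v in event.items() if k != "hmac_sig"}
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":")).encode()
    return {**body, "hmac_sig": hmac.new(token, canonical, hashlib.sha256).hexdigest()}


class ReplayGuard:
    """Accept an event only if its signature verifies and seq strictly increases per job."""

    def __init__(self, token: Optional[bytes]):
        self.token = token
        self.last_seq = {}   # job_id -> highest accepted seq

    def accept(self, event: dict) -> bool:
        if self.token is not None:  # a null token means PoC mode: skip the HMAC check
            expected = sign_event(self.token, event)["hmac_sig"]
            if not hmac.compare_digest(expected, str(event.get("hmac_sig", ""))):
                return False
        job_id, seq = event.get("job_id"), event.get("seq")
        if not isinstance(seq, int) or seq <= self.last_seq.get(job_id, -1):
            return False     # replayed or out-of-order event
        self.last_seq[job_id] = seq
        return True
```

Since seq is part of the signed body, an attacker cannot bump the counter on a captured event without invalidating its signature. hmac.compare_digest is used so that the comparison time does not leak how many characters matched.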

📁 Repository Layout

.
├── .agents/
│   └── skills/                  # Core orchestration shell wrappers & libraries
│       ├── lib.sh               # Shared orchestration library
│       ├── multi-agent-mux-create/
│       ├── multi-agent-mux-stop/
│       ├── multi-agent-mux-resume/
│       ├── multi-agent-mux-status/
│       ├── multi-agent-mux-monitor/
│       └── multi-agent-mux-delegate-job/
│           ├── requirements.txt # Python dependency declaration
│           └── scripts/         # Core backplane implementation (Python)
├── .mam/                        # Multi-Agent Mux metadata (git-ignored)
│   ├── agent-sessions.db        # SQLite WAL session database
│   ├── agent-sessions.yaml      # Human-readable session registry
│   └── jobs/                    # Asynchronous job metadata files
├── scripts/
│   └── generate-env.sh          # Environment bootstrap helper
├── AGENT.md                     # Agent roles, snapshotting rules, and execution charter
├── BOOTSTRAP.md                 # Detailed installation and verification guide
├── MESSAGING.md                 # MQTT wire protocol specification
└── README.md                    # Project introduction and overview (this file)

🚦 Quick Start

For detailed setup instructions, please consult the BOOTSTRAP.md file. Below is a quick summary:

  1. Initialize Environment Config:
    ./scripts/generate-env.sh
    
  2. Create Virtual Environment and Install Dependencies:
    python3 -m venv .venv
    source .venv/bin/activate
    pip install -r .agents/skills/multi-agent-mux-delegate-job/requirements.txt
    
  3. Run Registry Diagnostics:
    .venv/bin/python3 .agents/skills/multi-agent-mux-delegate-job/scripts/registry.py list
    

📝 Guidelines for Collaborating Agents

If you are an AI agent newly onboarded to this project:

  1. Read AGENT.md to align on development constraints and roles (PM, Worker, Reviewer).
  2. Adhere to the Pane Snapshotting Rules in AGENT.md (Section 4) to prevent scrollback data loss during long execution steps.
  3. Never modify core logic without submitting a diff to the reviewer sessions for evaluation.