Compare commits

30 Commits

Author SHA1 Message Date
Godopu 6e3c866461 docs: clean up stale create_session usage instructions in comments and markdown examples 2026-06-28 10:31:58 +09:00
Godopu 7c8267240d feat: enforce required agent roles at creation and role immutability in registry 2026-06-28 10:27:36 +09:00
Godopu f457180777 refactor: adapt multi-agent-mux skills and agent guidelines for the Team Leader scenario 2026-06-28 10:21:24 +09:00
Godopu 81474ac3f7 docs: add Step 0 provisioning to BOOTSTRAP.md and update README.md with curl installer 2026-06-28 09:34:52 +09:00
Godopu dd9500a271 feat(multi-agent-mux): integrate cline agent support, fix sqlite3 naming collision, simplify delegation docs, and add SKILL_FEATURES.md 2026-06-28 09:17:11 +09:00
Godopu dfd0a9483d feat: implement loop and discuss task delegation types in multi-agent-mux-delegate-job 2026-06-27 08:28:47 +09:00
Godopu 3b8db1eca2 fix(skills): add 0.5s sleep delay after paste-buffer to prevent key collisions 2026-06-26 23:00:30 +09:00
Godopu 698ea09b27 docs: update AGENT.md references to .agents/AGENT.md 2026-06-26 21:33:26 +09:00
Godopu 57d8f6c2ff refactor: move AGENT.md and AGENT.ko.md to .agents/ directory 2026-06-26 21:28:41 +09:00
Godopu e14ee90243 fix(skills): point HOME_DIR to real home directory and fix Hermes database path 2026-06-26 21:17:43 +09:00
Godopu b47fcbda9b fix(deploy): resolve TARGET_DIR to absolute path in update.sh 2026-06-24 12:29:00 +09:00
Godopu 5da6e59d2f fix(deploy): fix update.sh fallback mode, trap rollback, and database names 2026-06-24 12:28:17 +09:00
Godopu 701d3f10d9 Add MAM updater script (update.sh) and integrate into installer copy block 2026-06-24 12:24:17 +09:00
Godopu 5d69ad4f0b Harden remove.sh: fix fallback data-loss risk, prevent remove.sh clobbering, and ensure macOS compatibility 2026-06-24 12:08:00 +09:00
Godopu db75b7deb0 Add MAM uninstaller script (remove.sh) and integrate into installer copy block 2026-06-24 11:49:45 +09:00
Godopu b37407874d Harden installer: partial-install detection, complete runtime docs, explicit copy checks 2026-06-24 10:43:08 +09:00
Godopu 387b43d8e3 fix(deploy): stage installer download and copy runtime assets no-clobber (FW-D1)
deploy/install.sh extracted the repo archive in-place with
`tar --strip-components=1`, which inside an existing project could silently
overwrite the target's own README.md/FUTURE_WORKS.md/etc and litter it with
this repo's dev docs.

Rebuild the fetch path:
- stage the clone/extract into a `mktemp -d` dir, never in-place
- verify `.agents/skills/lib.sh` is present before copying anything
- copy only runtime assets (.agents/, AGENT.md, .env.example) into the target
  with per-file no-clobber guards (`[ ! -e ]`), so existing files always win
- post-fetch sanity check now tests a file, not just the directory
- fail fast when neither git nor curl is available

Use explicit `[ ! -e ]` guards + a POSIX find merge rather than `cp -n`
(non-portable; emits a deprecation warning on GNU coreutils 9.x). The earlier
`tar --exclude` denylist idea was rejected in review: non-portable and the
unanchored `--exclude="scripts"` pattern stripped the skills' own nested
scripts/ dirs, yielding a silently broken install.

Mark FW-D1 resolved and FW-D2 partially addressed in FUTURE_WORKS.md/.ko.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 10:33:05 +09:00
Godopu 7eaaaf8944 fix(lib,install): update locking doc to SQLite transaction, cache NFS check, verify PyYAML 2026-06-23 23:41:18 +09:00
Godopu 25cf729040 fix(deploy): add directory creation guard and sanity check after download in installer 2026-06-23 23:22:45 +09:00
Godopu 1d2eca57ce fix(deploy): automatically download .agents/skills from Gitea if missing in installer 2026-06-23 23:16:00 +09:00
Godopu 82dcb78a85 fix(deploy): refine install.sh variables, pip upgrade, permissions, and active defaults based on review 2026-06-23 08:32:02 +09:00
Godopu a10c789dc2 refactor(wrapper): update dry-run message to align with existing session delegation 2026-06-23 08:26:13 +09:00
Godopu 6cd0d5333a feat(deploy): add Gitea deployment templates, installer, and CI/CD workflow 2026-06-23 08:22:41 +09:00
Godopu 99ac8b3ce4 refactor(security,concurrency): resolve structural issues, enforce Claude permission skip, update docs 2026-06-23 08:03:43 +09:00
Godopu 12dceb14b2 feat(monitor): consolidate per-job watchdogs into shared wildcard subscriber (FW-W3) 2026-06-23 00:37:39 +09:00
Godopu 31f18b2e5a docs: update FUTURE_WORKS.md and FUTURE_WORKS.ko.md with portability and workflow bottleneck roadmap 2026-06-22 16:28:31 +09:00
Godopu c721d1cd86 refactor: rename skills from tmux-agent-orchestrate-* to multi-agent-mux-* in backplane scripts and documents 2026-06-22 15:58:48 +09:00
Godopu ee48d77d0a docs: add README.md and README.ko.md introducing the orchestration skills and backplane architecture 2026-06-22 14:19:15 +09:00
Godopu 9735258bc5 refactor: rename metadata directory .hermes to .mam in backplane scripts and documents 2026-06-22 14:06:13 +09:00
Godopu 30e447189e refactor: migrate skills/ directory to .agents/skills/ 2026-06-21 14:42:12 +00:00
49 changed files with 2931 additions and 1175 deletions
+55 -40
@@ -10,21 +10,28 @@
역할군 간의 책임 및 권한을 명확히 분리하여 병목을 줄이고 작업의 완성도를 높입니다.
### 👤 Project Manager (PM / Orchestrator)
- **주요 책무**: 사용자 요구사항 접수, 상세 작업 계획 수립, 작업자 할당/지시, 전체 워크플로우 통제 및 최종 결과 보고.
### 👑 General Manager (총괄 매니저)
- **주요 책무**: 사용자와 직접 소통하여 요구사항 접수, 상세 작업 계획 수립, 팀장 에이전트 할당 및 작업 위임, 전체 워크플로우 통제 및 최종 완료 보고.
- **모호성 제거**: 사용자의 요구사항에 모호한 부분이 있다면 작업을 추측하여 진행하지 말고, 즉시 사용자에게 질문하여 명확히 해야 합니다 (`/grill-me` 슬래시 명령어 권장).
- **피드백 루프 조정**: Reviewer들의 검증 의견을 분석하여 개선 방향을 의사결정합니다. 결정하기 까다로운 기술적 난제는 Worker 및 Reviewer들의 조사를 거쳐 PM 본인의 판단을 더한 최종 보고서를 작성해 사용자에게 제시하고 프로젝트의 방향을 결정합니다.
- **자가 치유 (Hermes Fallback Fix)**: Reviewer가 지적한 결함이 아주 경미하거나 단순 오탈자/설정 누락인 경우, Worker에게 재할당하지 않고 PM이 직접 소스코드를 수정하여 전체 왕복(Round-trip) 비용을 최소화합니다.
### 🛠️ Worker (Implementation Agent)
- **주요 책무**: PM으로부터 위임받은 구체적인 비즈니스 로직 설계 및 소스코드 구현.
- **협업 및 소통**: 할당받은 업무 범위에서 구현 방향이 모호하거나 인터페이스 설계 변경이 필요한 경우 PM에게 질문하여 합의를 이룬 후 수술적(Surgical) 변경을 적용합니다.
- **계약 준수**: PM이 전달한 단일 작업 지침(Brief) 및 고유 Job ID 규약을 준수하며, 작업 시작 시 `started`, 종료 시 `completed`/`error` 이벤트를 백플레인에 발행해야 합니다.
### 👥 Team Leaders (팀장)
새롭게 생성되는 에이전트(`antigravity`, `claude`, `cline`, `hermes` 등)는 각 팀의 **팀장** 역할을 수행합니다. 총괄 매니저로부터 작업을 위임받아 개발 또는 리뷰 워크플로우를 주도합니다.
- **Developer Team Leader (개발 팀장)**:
- 총괄 매니저로부터 작업을 위임받습니다.
- **작업 분석 및 계획**: 주어진 작업을 철저히 분석하고, 작은 단위로 문제를 나누어 세부 계획을 수립합니다.
- **내부 병렬 처리**: 내부적으로 subagent를 활용해 위임받은 작업을 병렬적으로 처리할 수 있습니다.
- **리뷰 타당성 검증 및 거부**: 리뷰어가 지적한 피드백을 면밀히 검토합니다. 타당한 제안은 수렴하여 코드를 수정하지만, 타당하지 않다고 판단되는 안건은 반영하지 않고 **그 명확한 이유를 작성하여 리뷰어에게 되돌려 보냅니다**.
- **완료 신호 송신**: 모든 리뷰어들로부터 `PASS`를 획득하고 변경 사항이 검증되면, 최초 작업을 위임받았던 개발 팀장이 총괄 매니저에게 최종 작업 완료 신호를 송신합니다.
- **Reviewer Team Leader (리뷰어 팀장)**:
- 개발 팀장으로부터 리뷰 요청을 접수합니다.
- **문제 제시에 대한 이유와 개선 방향 포함**: 단순한 반려(`NOT PASS`) 통보는 금지됩니다. 이슈를 제기할 때는 **반드시 해당 문제가 발생하는 구체적인 이유와 확실한 개선 방향(코드 대안 포함)을 함께 작성**해야 합니다.
- **합의 루프**: 모든 지적 사항이 해결되고 최종 `PASS`를 발행할 때까지 리뷰 루프에 동참합니다.
### 🔍 Reviewer (Verification Agent)
- **주요 책무**: Worker가 제출한 소스코드 변경 사항(Diff)과 구현 명세를 검증하고, 보안 결함 탐지, 성능 개선안 도출 및 설계 일관성을 심사하는 조력자.
- **구체적 대안 제시**: 단순한 반려(`NOT PASS`) 통보를 금지하며, 문제를 제기할 때는 **안정적이고 검증된 구체적인 코드 대안(Alternative Code)이나 해결 방안을 반드시 함께 제시**해야 합니다.
- **교차 검증의 상호보완성**: 에이전트의 모델 특성(예: Flash 계열은 의미론적 셸 결함 포착에 강하고, Opus/Sonnet 계열은 API 서명 및 논리 회귀 분석에 강함)을 살려 병렬로 상호보완적 심사를 수행합니다.
### 🛡️ 역할 범위 준수 원칙 (Role Suitability Check)
- 모든 에이전트는 자신에게 부여된 역할에 부합하는 작업만을 수행해야 합니다. (예: 개발 팀장은 최종 PASS 여부를 결정하지 않으며, 리뷰어 팀장은 직접 프로젝트 소스코드를 작성하지 않습니다.)
- **자신의 역할에 맞지 않는 작업이 지시된 경우**, 에이전트는 반드시:
1. 해당 작업을 수행하기에 가장 적합한 에이전트 세션을 추천하여 위임을 유도하거나,
2. 프로젝트 연속성을 위해 극히 필요한 경우 직접 작업을 수행합니다.
---
@@ -43,8 +50,8 @@
### 🗃️ 레지스트리 및 상태 관리
- 본 아키텍처는 목적에 따라 두 가지 레지스트리를 분리하여 운영합니다:
- **잡 레지스트리 (Job Registry)**: 각 비동기 잡의 메타데이터와 생명주기는 개별 JSON 파일(`.hermes/jobs/<id>.json`)로 기록되며, 다중 세션 간의 동시 청구(claiming) 경합은 파일 단위의 `fcntl` advisory lock(`registry_lock` via `registry.py`)을 통해 방어합니다.
- **세션 레지스트리 (Session Registry)**: TMUX 모니터링 상태 및 에이전트 구동 정보는 SQLite WAL 데이터베이스(`.hermes/agent-sessions.db`)를 통해 단일 호스트 내에서 안정적인 동시 트랜잭션으로 일관되게 제어합니다. 단, SQLite WAL 모드는 NFS(네트워크 파일 시스템) 환경에서는 완전한 파일 락이 보장되지 않으므로 로컬 파일 시스템 사용을 권장합니다.
- **잡 레지스트리 (Job Registry)**: 각 비동기 잡의 메타데이터와 생명주기는 개별 JSON 파일(`.mam/jobs/<id>.json`)로 기록되며, 다중 세션 간의 동시 청구(claiming) 경합은 파일 단위의 `fcntl` advisory lock(`registry_lock` via `registry.py`)을 통해 방어합니다.
- **세션 레지스트리 (Session Registry)**: TMUX 모니터링 상태 및 에이전트 구동 정보는 SQLite WAL 데이터베이스(`.mam/agent-sessions.db`)를 통해 단일 호스트 내에서 안정적인 동시 트랜잭션으로 일관되게 제어합니다. 단, SQLite WAL 모드는 NFS(네트워크 파일 시스템) 환경에서는 완전한 파일 락이 보장되지 않으므로 로컬 파일 시스템 사용을 권장합니다.
### 🛡️ 보안 프로토콜 (HMAC-SHA256)
- **무인증 PoC 모드**: 잡 레지스트리 생성 시 `auth_token`이 `null`로 지정된 경우(PoC 기본 모드), 별도의 서명 검증을 생략하고 모든 이벤트를 수용합니다 (`verify_hmac`이 항상 `True`를 반환).
@@ -59,34 +66,42 @@
sequenceDiagram
autonumber
actor User as 사용자
participant PM as Project Manager
participant W as Worker
participant R as Reviewers
participant GM as General Manager
participant DTL as Developer Team Leader
participant RTL as Reviewer Team Leaders
participant M as MQTT Backplane
User->>PM: 요구사항 전달
Note over PM: grill-me 및 계획 수립
PM->>M: Job 등록 및 Subscriber 구동
PM->>W: 작업 위임 (Job ID & Brief 전달)
W->>M: 'started' 이벤트 발행
Note over W: 코드 변경 및 구현
W->>M: 'completed' (혹은 'error') 발행
PM->>R: 병렬 리뷰 요청 (Diff 전달)
Note over R: 교차 분석 & 검증
alt 결함 발견
R->>PM: NOT PASS (대안 포함 피드백)
Note over PM: 경미한 결함은 PM이 직접 수정
PM->>W: 피드백 반영 및 재할당
else 검증 통과
R->>PM: PASS
User->>GM: 요구사항 전달
GM->>DTL: 작업 위임 (예: 랜딩 페이지 제작)
Note over DTL: 작업 분석, 세분화 및 subagent 병렬 구동
DTL->>M: 'started' 이벤트 발행
Note over DTL: 코드 변경 및 구현
DTL->>M: 'completed' 발행
DTL->>RTL: 리뷰 요청 (랜딩 페이지를 제작했습니다. 리뷰를 진행해주세요)
Note over RTL: 교차 분석 & 검증
alt 결함 발견 (리뷰어 피드백)
RTL->>DTL: NOT PASS / 피드백 (반드시 이유와 확실한 개선 방향 포함)
Note over DTL: DTL이 피드백의 타당성 검증
alt 타당한 피드백
Note over DTL: DTL이 수용하여 코드 수정
else 타당하지 않은 피드백
DTL->>RTL: 반론 및 거부 이유 전달 (부적절한 항목 미반영)
end
PM->>User: 최종 검증 통과 보고 & 커밋
DTL->>RTL: 재리뷰 요청 (리뷰 안건 수정 완료)
else 검증 통과
RTL->>DTL: PASS
end
DTL->>GM: 최종 완료 신호 송신
GM->>User: 사용자에게 작업 완료 통보
```
1. **계획 수립 및 할당**: PM은 사용자 요청을 구체화하고 의존성이 겹치지 않는 범위에서 잡을 정의합니다.
2. **작업 개시 및 통보**: PM은 구독자를 띄운 뒤 Worker 세션에 잡을 인가하며, Worker는 로직을 수행하고 단말 이벤트를 전송해 세션을 자동 종료합니다.
3. **교차 검수 반복 (Review Loop)**: PM은 작업 완료 후 변경분을 Reviewer 에이전트들에게 병렬 회람시킵니다. 리뷰어 전원이 `PASS` 의견을 낼 때까지 수정-반려 주기를 무한 반복(Loop)하여 코드 완성도를 보증합니다.
4. **릴리즈 및 정리**: 검증이 완료된 코드는 Git에 커밋하고, 임시 세션 리소스를 회수합니다.
1. **계획 수립 및 할당**: 총괄 매니저는 개발 팀장에게 작업을 인가합니다.
2. **분석 및 내부 실행**: 개발 팀장은 작업을 분석하고 세분화하여 계획을 세운 뒤 내부 subagent를 가동하여 구현을 완료합니다. 이후 `started`를 거쳐 `completed` 이벤트를 발행하고 리뷰어에게 검수를 요청합니다.
3. **이의 제기 및 정제 루프**:
- 리뷰어 팀장은 상세 피드백 시 반드시 이유와 보완 방향을 제시해야 합니다.
- 개발 팀장은 의견을 검토해 타당하면 수정하고, 타당하지 않으면 반론과 근거를 회신합니다.
- 리뷰어 전원이 `PASS`를 인가할 때까지 이 과정이 반복됩니다.
4. **최종 보고**: 개발 팀장이 총괄 매니저에게 완료 신호를 보내면 총괄 매니저가 사용자에게 완료를 알립니다.
---
@@ -116,11 +131,11 @@ TMUX 환경에서 실행되는 에이전트가 화면 스크롤 한계로 인해
- [ ] **가상환경 의존성**: `pyyaml`, `paho-mqtt` 등 필요한 Python 패키지가 `.venv` 또는 `requirements.txt`에 포함되었는가?
- [ ] **환경 설정 파일**: MQTT 브로커 주소 및 보안 Credential이 `.env` 파일에 안전하게 로드되고 공유되는가?
- [ ] **디렉토리 규약**: 레지스트리 경로(`.hermes/jobs/`) 및 로깅 경로(`.hermes/delegate_job_logs/`)가 `.gitignore`에 등록되었는가?
- [ ] **디렉토리 규약**: 레지스트리 경로(`.mam/jobs/`) 및 로깅 경로(`.mam/delegate_job_logs/`)가 `.gitignore`에 등록되었는가?
- [ ] **스크립트 구비**: `mqtt_common.py`, `publish_event.py`, `job_subscriber.py`, `registry.py` 등의 핵심 모듈이 배치되었는가?
- [ ] **HMAC 활성화**: 새로운 레지스트리 잡 발급 시 난수 기반의 `auth_token`이 정상적으로 주입되고, 서명 기반의 상호 인증이 활성화되는가?
- [ ] **운영 헌장 배치**: 본 규약 파일(`AGENT.md`)이 새 프로젝트의 **최상위 루트(Root) 디렉터리**에 배치되었는가? (협업을 수행하는 에이전트들이 온보딩 시 규칙을 가장 먼저 인지할 수 있도록 루트 경로 배치가 필수적입니다.)
- [ ] **운영 헌장 배치**: 본 규약 파일(`AGENT.md`)이 새 프로젝트의 **.agents/ 디렉터리**에 배치되었는가? (프로젝트 루트를 깔끔하게 유지하면서도 온보딩하는 에이전트들이 규칙을 이해할 수 있도록 `.agents/` 경로 배치가 권장됩니다.)
---
*본 가이드는 협업 효율성과 코드 보안의 엄격한 균형을 유지하기 위한 규범입니다. 변경 사항이 필요한 경우 PM 및 Reviewer의 전원 합의를 거쳐 본 문서를 업데이트해야 합니다.*
*본 가이드는 협업 효율성과 코드 보안의 엄격한 균형을 유지하기 위한 규범입니다. 변경 사항이 필요한 경우 총괄 매니저 및 전체 팀장의 합의를 거쳐 본 문서를 업데이트해야 합니다.*
+55 -40
@@ -10,21 +10,28 @@ All agents working on a new project must read this document thoroughly and compl
We clearly separate responsibilities and permissions between roles to reduce bottlenecks and enhance the quality of execution.
### 👤 Project Manager (PM / Orchestrator)
- **Core Responsibility**: Receive user requirements, establish detailed task plans, assign and instruct workers, control the overall workflow, and report final results.
### 👑 General Manager (Orchestrator)
- **Core Responsibility**: Interact directly with the user, receive high-level requirements, establish task plans, delegate tasks to Team Leaders, control the overall workflow, and report completion back to the user.
- **Ambiguity Resolution**: If a user's requirements contain ambiguous details, do not guess. Immediately ask the user for clarification (we recommend using the `/grill-me` slash command).
- **Feedback Loop Adjustment**: Analyze verification feedback from Reviewers to decide on improvement paths. For complex technical challenges, direct Workers and Reviewers to research options, add the PM's own assessment, and present a final report to the user to decide the project's direction.
- **Self-Healing (Hermes Fallback Fix)**: If a defect pointed out by a Reviewer is extremely minor or is a simple typo/configuration omission, the PM should directly fix the source code instead of reassigning it to the Worker, thereby minimizing the round-trip cost.
### 🛠️ Worker (Implementation Agent)
- **Core Responsibility**: Design business logic and implement source code as delegated by the PM.
- **Collaboration & Communication**: If the implementation path is ambiguous or interface design changes are required within the assigned scope, ask the PM for consensus before applying surgical changes.
- **Contract Adherence**: Comply with the single task instructions (Brief) and the unique Job ID convention provided by the PM. Workers must publish a `started` event when starting work, and a `completed` or `error` event to the backplane upon termination.
### 👥 Team Leaders
Newly spawned agents (e.g., `antigravity`, `claude`, `cline`, `hermes`) act as **Team Leaders** of their respective groups. They receive delegated tasks from the General Manager and manage implementation or review workflows.
- **Developer Team Leader**:
- Receives tasks from the General Manager.
- **Task Breakdown & Planning**: Thoroughly analyzes the task, breaks it down into small units, and creates a plan.
- **Internal Parallelism**: Can run subagents in parallel internally to handle the delegated work.
- **Review Integrity & Refusal**: Thoroughly reviews feedback from Reviewers and implements the recommendations that are valid. If a recommendation is judged invalid, the Developer Team Leader must **not** implement it and must instead return a refutation with detailed reasons to the Reviewer.
- **Completion Signal**: Once all reviewers yield a `PASS` and changes are verified, the Developer Team Leader who first received the task sends a completion signal back to the General Manager.
- **Reviewer Team Leader**:
- Receives review requests from the Developer Team Leader.
- **Detailed Feedback with Directions**: Simply rejecting changes (`NOT PASS`) is forbidden. Reviewers **must** specify the exact reason for the issue and provide a concrete, stable, and verified alternative direction for improvement.
- **Consensus Loop**: Engages in the review cycle until all objections are resolved and a final `PASS` is issued.
### 🔍 Reviewer (Verification Agent)
- **Core Responsibility**: Verify source code changes (Diff) and implementation specifications submitted by Workers. Reviewers act as facilitators by detecting security vulnerabilities, proposing performance improvements, and examining design consistency.
- **Provide Concrete Alternatives**: Simply rejecting changes (`NOT PASS`) is forbidden. When raising an issue, Reviewers must propose a **concrete, stable, and verified alternative code block or solution**.
- **Complementary Cross-Verification**: Leverage the unique characteristics of different agent models (e.g., Flash-class models are skilled at capturing semantic shell bugs, while Opus/Sonnet-class models excel at API signatures and logical regression analysis) to perform parallel and mutually-supportive reviews.
### 🛡️ Role Suitability Check Principle
- Every agent must only perform tasks suitable for its designated role (e.g., Developer Team Leaders do not decide the final `PASS` verdict, and Reviewer Team Leaders do not write project source code themselves).
- **If an agent receives a task that does not fit its role**, it must either:
1. Recommend the optimal agent session to delegate the task to, or
2. Perform the task directly if strictly necessary for project continuity.
---
@@ -43,8 +50,8 @@ Asynchronous communication and state management between agents are controlled vi
### 🗃️ Registry & State Management
- This architecture maintains two distinct registries based on their purpose:
- **Job Registry**: The metadata and lifecycle of each asynchronous job are recorded in individual JSON files (`.hermes/jobs/<id>.json`). Concurrency conflicts (claiming races) across multiple sessions are prevented via file-based `fcntl` advisory locks (`registry_lock` via `registry.py`).
- **Session Registry**: TMUX monitoring states and running agent metadata are consistently controlled using a SQLite WAL database (`.hermes/agent-sessions.db`) to support reliable concurrent transactions on a single host. However, since SQLite WAL mode does not guarantee complete file locking in Network File System (NFS) environments, we recommend using a local file system.
- **Job Registry**: The metadata and lifecycle of each asynchronous job are recorded in individual JSON files (`.mam/jobs/<id>.json`). Concurrency conflicts (claiming races) across multiple sessions are prevented via file-based `fcntl` advisory locks (`registry_lock` via `registry.py`).
- **Session Registry**: TMUX monitoring states and running agent metadata are consistently controlled using a SQLite WAL database (`.mam/agent-sessions.db`) to support reliable concurrent transactions on a single host. However, since SQLite WAL mode does not guarantee complete file locking in Network File System (NFS) environments, we recommend using a local file system.
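The claim race described for the Job Registry can be sketched as follows. This is a minimal illustration, assuming a sidecar `.lock` file and a `claimed_by` field; the helper names `job_lock` and `claim_job` are hypothetical, and the actual `registry_lock` in `registry.py` is not shown in this diff and may work differently.

```python
import fcntl
import json
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def job_lock(job_path):
    """Hold an exclusive advisory lock on a sidecar lock file (hypothetical helper)."""
    lock_file = open(job_path + ".lock", "w")
    try:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until the lock is free
        yield
    finally:
        fcntl.flock(lock_file, fcntl.LOCK_UN)
        lock_file.close()


def claim_job(job_path, session):
    """Return True if `session` claimed the job, False if it was already claimed."""
    with job_lock(job_path):
        with open(job_path) as f:
            job = json.load(f)
        if job.get("claimed_by"):  # 'claimed_by' is an assumed field name
            return False
        job["claimed_by"] = session
        # Write to a temp file and rename so readers never see a partial file.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(job_path) or ".")
        with os.fdopen(fd, "w") as f:
            json.dump(job, f)
        os.replace(tmp, job_path)
        return True
```

Advisory locks only protect writers that take the same lock, so every code path that mutates a job file would need to go through a helper like this one.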
### 🛡️ Security Protocol (HMAC-SHA256)
- **Unauthenticated PoC Mode**: If the `auth_token` in the job registry is set to `null` (the default PoC mode), signature verification is skipped and all events are accepted (`verify_hmac` always returns `True`).
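The PoC-mode behavior above might look like the sketch below. Only the name `verify_hmac` and the `null` token rule come from the text; the argument order, the payload encoding, the hex digest format, and the `sign_event` helper are assumptions made for illustration.

```python
import hashlib
import hmac


def sign_event(auth_token, payload):
    """Compute a hex HMAC-SHA256 signature over the raw payload bytes."""
    return hmac.new(auth_token.encode(), payload, hashlib.sha256).hexdigest()


def verify_hmac(auth_token, payload, signature):
    """Accept everything in PoC mode (no token); otherwise check the signature."""
    if auth_token is None:
        return True  # unauthenticated PoC mode: every event is accepted
    if not signature:
        return False
    expected = sign_event(auth_token, payload)
    # compare_digest avoids leaking the mismatch position through timing.
    return hmac.compare_digest(expected, signature)
```

Because a `null` token disables verification entirely, a deployment that relies on signatures would want to confirm that each new job really receives a random `auth_token`.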
@@ -59,34 +66,42 @@ Asynchronous communication and state management between agents are controlled vi
sequenceDiagram
autonumber
actor User as User
participant PM as Project Manager
participant W as Worker
participant R as Reviewers
participant GM as General Manager
participant DTL as Developer Team Leader
participant RTL as Reviewer Team Leaders
participant M as MQTT Backplane
User->>PM: Hand over requirements
Note over PM: Run grill-me & plan tasks
PM->>M: Register Job & start Subscriber
PM->>W: Delegate task (Provide Job ID & Brief)
W->>M: Publish 'started' event
Note over W: Modify code & implement
W->>M: Publish 'completed' (or 'error')
PM->>R: Request parallel review (Provide Diff)
Note over R: Cross-analysis & verification
alt Defect Found
R->>PM: NOT PASS (Feedback with alternatives)
Note over PM: PM directly fixes minor defects
PM->>W: Apply feedback & re-delegate
else Verification Pass
R->>PM: PASS
User->>GM: Hand over requirements
GM->>DTL: Delegate task (e.g., create landing page)
Note over DTL: Analyze, break down & spawn parallel subagents
DTL->>M: Publish 'started' event
Note over DTL: Modify code & implement
DTL->>M: Publish 'completed'
DTL->>RTL: Request review (The landing page is ready. Please review it)
Note over RTL: Cross-analysis & verification
alt Defect Found (Reviewer feedback)
RTL->>DTL: NOT PASS / Feedback (Must include reason & improvement direction)
Note over DTL: DTL checks validity of suggestions
alt Valid feedback
Note over DTL: DTL adopts and modifies code
else Invalid feedback
DTL->>RTL: Send refutation & reasons (invalid items not applied)
end
PM->>User: Report final pass & commit changes
DTL->>RTL: Request re-review (review items addressed)
else Verification Pass
RTL->>DTL: PASS
end
DTL->>GM: Send completion signal
GM->>User: Notify task completion
```
1. **Planning and Allocation**: The PM defines requirements and outlines independent jobs to avoid conflicting dependencies.
2. **Execution and Notification**: The PM launches a subscriber, then assigns the job to a Worker session. The Worker performs the logic and sends a terminal event, automatically closing the session.
3. **Cross-Verification Iteration (Review Loop)**: Once the task is complete, the PM circulates the changes to the Reviewer agents in parallel. The modify-reject cycle repeats until all reviewers yield a `PASS`, ensuring high-quality code.
4. **Release and Cleanup**: Code that passes verification is committed to Git, and temporary session resources are reclaimed.
1. **Planning and Allocation**: The General Manager delegates the task to the Developer Team Leader.
2. **Analysis and Internal Execution**: The Developer Team Leader analyzes the task, breaks it down, plans execution, and optionally spawns parallel subagents. It publishes `started`, implements the task, publishes `completed`, and then requests review from the Reviewer Team Leader.
3. **Objection & Refinement Loop**:
- The Reviewer Team Leader must provide clear reasons and improvement directions for any issues.
- The Developer Team Leader validates the feedback. Valid suggestions are implemented; invalid ones are refuted with reasons and returned to the reviewer.
- This cycle repeats until all reviewers issue a `PASS`.
4. **Completion and Report**: The Developer Team Leader sends the final completion signal to the General Manager, who notifies the user.
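The objection and refinement loop in step 3 can be sketched as plain control flow. Everything here is hypothetical: the function name, the callable-based reviewers, and the `max_rounds` safety limit are illustration only (the workflow as written repeats until every reviewer issues `PASS`).

```python
def review_loop(reviewers, is_valid, apply_fix, max_rounds=10):
    """Run review rounds until every reviewer returns PASS.

    reviewers: callables returning ("PASS", None) or ("NOT PASS", feedback)
    is_valid:  the Developer Team Leader's judgment of one feedback item
    apply_fix: applies an accepted feedback item
    Returns the refutations that were sent back to reviewers.
    """
    refutations = []
    for _ in range(max_rounds):
        verdicts = [review() for review in reviewers]
        if all(verdict == "PASS" for verdict, _ in verdicts):
            return refutations  # ready to signal the General Manager
        for verdict, feedback in verdicts:
            if verdict == "PASS":
                continue
            if is_valid(feedback):
                apply_fix(feedback)  # valid feedback: change the code
            else:
                refutations.append(f"rejected: {feedback}")  # invalid: explain why
    raise RuntimeError("no consensus within max_rounds")
```

In practice each reviewer would be a separate agent session rather than a local callable; the sketch only shows who decides what at each step.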
---
@@ -116,11 +131,11 @@ Use this checklist when deploying this agent orchestration model to a new projec
- [ ] **Virtualenv Dependencies**: Are required Python packages like `pyyaml` and `paho-mqtt` included in `.venv` or `requirements.txt`?
- [ ] **Configuration File**: Are the MQTT broker address and security credentials safely loaded and shared via the `.env` file?
- [ ] **Directory Convention**: Are the registry path (`.hermes/jobs/`) and logging path (`.hermes/delegate_job_logs/`) added to `.gitignore`?
- [ ] **Directory Convention**: Are the registry path (`.mam/jobs/`) and logging path (`.mam/delegate_job_logs/`) added to `.gitignore`?
- [ ] **Core Scripts**: Are the core scripts (`mqtt_common.py`, `publish_event.py`, `job_subscriber.py`, and `registry.py`) in place?
- [ ] **HMAC Enablement**: When a new registry job is created, is a random `auth_token` correctly injected, and is signature-based mutual authentication active?
- [ ] **Charter Placement**: Is this protocol file (`AGENT.md`) placed in the **top-level root directory** of the new project? (Placing it at the root is essential so that onboarding agents can recognize the rules immediately.)
- [ ] **Charter Placement**: Is this protocol file (`AGENT.md`) placed in the **.agents/ directory** of the new project? (Placing it in `.agents/` is essential to keep the project root clean while allowing onboarding agents to align on the rules.)
---
*This guide balances collaboration efficiency with strict code security. Any required changes must be discussed and agreed upon by the PM and all Reviewers before updating this document.*
*This guide balances collaboration efficiency with strict code security. Any required changes must be discussed and agreed upon by the General Manager and all Team Leaders before updating this document.*
+139 -49
@@ -1,10 +1,10 @@
#!/usr/bin/env bash
# lib.sh — shared library for the tmux-agent-orchestrate-* skills.
# lib.sh — shared library for the multi-agent-mux-* skills.
#
# Single source of truth for the four things that were inconsistently
# re-implemented across create/resume/delete/monitor (REVIEW.md §4.1):
# - derive_session_name : the tmux session slug (P0-A)
# - atomic_dump_yaml : flock + temp+rename + .bak + validate (P0-B)
# - atomic_dump_yaml : SQLite db transaction + temp+rename + .bak + validate (P0-B)
# - env_python : env-safe Python (no heredoc injection) (P0-B / P1-B)
# - find_workspace_uuid : workspace-SCOPED resume id lookup (P0-C)
#
@@ -15,11 +15,11 @@
# atomic_dump_yaml. Never `open(yaml_path, 'w')` anywhere else.
SKILL_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
WORKSPACE_ROOT="$(cd "$SKILL_DIR/.." && pwd)"
AGENT_SESSIONS_YAML="${AGENT_SESSIONS_YAML:-$WORKSPACE_ROOT/.hermes/agent-sessions.yaml}"
WORKSPACE_ROOT="$(cd "$SKILL_DIR/../.." && pwd)"
AGENT_SESSIONS_YAML="${AGENT_SESSIONS_YAML:-$WORKSPACE_ROOT/.mam/agent-sessions.yaml}"
# Workspace-relative defaults with environment overrides (Phase Z)
HOME_DIR="${HOME_DIR:-$WORKSPACE_ROOT}"
HOME_DIR="${HOME_DIR:-$HOME}"
CLAUDE_PROJECT_DIR="${CLAUDE_PROJECT_DIR:-$HOME/.claude/projects}"
LOCAL_BIN="${LOCAL_BIN:-$HOME/.local/bin}"
@@ -28,7 +28,7 @@ LOCAL_BIN="${LOCAL_BIN:-$HOME/.local/bin}"
# ---------------------------------------------------------------------------
# Paths to exclude when resolving the real tmux binary (shim/wrapper dirs).
_TMUX_SHIM_DIR_PATTERN="${_TMUX_SHIM_DIR_PATTERN:-/multi-agent-tmux-shim/}"
_TMUX_SKILLS_BIN_PATTERN="${_TMUX_SKILLS_BIN_PATTERN:-/skills/.bin}"
_TMUX_SKILLS_BIN_PATTERN="${_TMUX_SKILLS_BIN_PATTERN:-/.agents/skills/.bin}"
TMUX_SERVER_NAME="${TMUX_SERVER_NAME:-default}"
@@ -173,9 +173,16 @@ derive_session_name() {
local workspace="$1" agent="$2"
local abs parent work slug
abs="$(cd "$workspace" 2>/dev/null && pwd)" || abs="$workspace"
parent="$(basename "$(dirname "$abs")")"
work="$(basename "$abs")"
parent="$(basename "$(dirname "$abs")" 2>/dev/null || echo "")"
work="$(basename "$abs" 2>/dev/null || echo "root")"
if [ -z "$parent" ] || [ "$parent" = "/" ] || [ "$parent" = "." ]; then
parent="workspace"
fi
if [ -z "$work" ] || [ "$work" = "/" ] || [ "$work" = "." ]; then
work="root"
fi
slug="$(printf '%s-%s' "$parent" "$work" | tr '[:upper:]' '[:lower:]' | tr '_' '-')"
slug="$(printf '%s' "$slug" | tr -cd 'a-zA-Z0-9-')"
printf '%s-creator-%s' "$slug" "$agent"
}
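For readers who want to check the slug rules without a shell, the updated `derive_session_name` can be mirrored in Python roughly as below. This is a sketch, not part of `lib.sh`; it uses `os.path.abspath` where the shell version uses `cd ... && pwd`, so behavior may differ for paths that do not exist.

```python
import os
import re


def derive_session_name(workspace, agent):
    """Build '<parent>-<work>-creator-<agent>' from a workspace path."""
    abs_path = os.path.abspath(workspace)
    parent = os.path.basename(os.path.dirname(abs_path))
    work = os.path.basename(abs_path)
    # Fall back to fixed names when the path is too shallow to provide them.
    if parent in ("", "/", "."):
        parent = "workspace"
    if work in ("", "/", "."):
        work = "root"
    slug = f"{parent}-{work}".lower().replace("_", "-")
    slug = re.sub(r"[^a-zA-Z0-9-]", "", slug)  # same set as tr -cd 'a-zA-Z0-9-'
    return f"{slug}-creator-{agent}"
```

For example, `derive_session_name("/home/dev/My_Project", "claude")` yields `dev-my-project-creator-claude`, and a workspace of `/` falls back to `workspace-root-creator-<agent>`.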
@@ -189,13 +196,35 @@ derive_session_name() {
# inside the script — never spliced into the source. Read-only by convention;
# use atomic_dump_yaml when you need to write the YAML.
# ---------------------------------------------------------------------------
_validate_env_key() {
local key="$1"
if [[ ! "$key" =~ ^[a-zA-Z_][a-zA-Z0-9_]*$ ]]; then
echo "ERROR: Invalid environment variable name: $key" >&2
return 1
fi
case "$key" in
LD_PRELOAD|LD_LIBRARY_PATH|PYTHONPATH|PYTHONHOME|PYTHONINSPECT|PYTHONSTARTUP)
echo "ERROR: Blocked environment variable: $key" >&2
return 1
;;
esac
return 0
}
env_python() {
local yaml_path="$1"; shift
local -a envs=("YAML_PATH=$yaml_path" "HOME_DIR=$HOME_DIR" "CLAUDE_PROJECT_DIR=$CLAUDE_PROJECT_DIR" "LOCAL_BIN=$LOCAL_BIN")
while [ $# -gt 0 ]; do
case "$1" in
*=*) envs+=("$1"); shift ;;
*) break ;;
*=*)
local key="${1%%=*}"
_validate_env_key "$key" || return 1
envs+=("$1")
shift
;;
*)
break
;;
esac
done
env "${envs[@]}" python3 - "$@"
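The `KEY=VALUE` filtering added to `env_python` can be restated in Python for clarity. This is an illustrative sketch only; the names `validate_env_key` and `split_env_args` are hypothetical and do not exist in `lib.sh`.

```python
import re

# Variables that could change how the child interpreter loads code.
_BLOCKED = {
    "LD_PRELOAD", "LD_LIBRARY_PATH", "PYTHONPATH",
    "PYTHONHOME", "PYTHONINSPECT", "PYTHONSTARTUP",
}
_NAME_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def validate_env_key(key):
    """Return True for a well-formed name that is not on the blocklist."""
    return bool(_NAME_RE.fullmatch(key)) and key not in _BLOCKED


def split_env_args(args):
    """Consume leading KEY=VALUE arguments; return (env dict, remaining args)."""
    env = {}
    rest = list(args)
    while rest and "=" in rest[0]:
        key, value = rest[0].split("=", 1)  # key is everything before the first '='
        if not validate_env_key(key):
            raise ValueError(f"invalid or blocked environment variable: {key}")
        env[key] = value
        rest.pop(0)
    return env, rest
```

A blocklist of this kind is a narrow guard: it stops the listed loader variables but does not make arbitrary caller-supplied values safe.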
@@ -205,36 +234,54 @@ env_python() {
# atomic_dump_yaml <yaml_path> [KEY=VALUE ...] (mutation source from stdin)
#
# The ONLY sanctioned way to write agent-sessions.yaml. It:
# 1. takes an exclusive flock on <yaml_path>.lock (serialises all writers)
# 2. loads the YAML into `d`
# 1. takes an exclusive SQLite BEGIN IMMEDIATE transaction lock on
# agent-sessions.db (serialises all writers)
# 2. loads the current state into `d` (seeds from YAML if DB is empty)
# 3. exec()s the caller's mutation source (sees d, yaml, os, datetime,
# timezone, glob, subprocess; reads values via os.environ). The mutation
# may print and may `raise SystemExit(n)` to abort *without* writing.
# 4. validates the resulting schema
# 5. backs up to <yaml_path>.bak, then writes atomically (temp + os.replace)
# 5. backs up to <yaml_path>.bak, then writes YAML atomically (temp + os.replace)
# when a session transitions to a finished state.
#
# The mutation source is passed via env and exec()'d — it is never string
# spliced and untrusted data never lands in Python source (P0-B / P1-B).
# ---------------------------------------------------------------------------
# Check if the workspace is on NFS — flock is unreliable on NFS
_atomic_dump_yaml_check_nfs() {
# Check if the workspace is on NFS — locking behaves differently on NFS
_check_is_nfs() {
local f="$1"
local mountpoint
mountpoint="$(df --output=target "$f" 2>/dev/null | tail -1)" || return 0
mountpoint="$(df --output=target "$f" 2>/dev/null | tail -1)" || return 1
if mount | grep -q "$mountpoint.*nfs\|$mountpoint.*cifs\|$mountpoint.*fuse.sshfs"; then
echo "WARNING: $mountpoint appears to be a network filesystem (NFS/CIFS/SSHFS)." >&2
echo "WARNING: fcntl.flock-based atomic writes are unreliable on network filesystems." >&2
echo "WARNING: SQLite journal_mode automatically falls back to DELETE." >&2
return 0 # is NFS
fi
return 1 # not NFS
}
atomic_dump_yaml() {
local yaml_path="$1"; shift
local -a envs=("YAML_PATH=$yaml_path" "HOME_DIR=$HOME_DIR" "CLAUDE_PROJECT_DIR=$CLAUDE_PROJECT_DIR" "LOCAL_BIN=$LOCAL_BIN")
if [ -z "${MAM_IS_NFS:-}" ]; then
if _check_is_nfs "$(dirname "$yaml_path")"; then
export MAM_IS_NFS="true"
echo "WARNING: $(dirname "$yaml_path") appears to be a network filesystem (NFS/CIFS/SSHFS)." >&2
echo "WARNING: SQLite journal_mode automatically falls back to DELETE." >&2
else
export MAM_IS_NFS="false"
fi
fi
local -a envs=("YAML_PATH=$yaml_path" "HOME_DIR=$HOME_DIR" "CLAUDE_PROJECT_DIR=$CLAUDE_PROJECT_DIR" "LOCAL_BIN=$LOCAL_BIN" "MAM_IS_NFS=$MAM_IS_NFS")
while [ $# -gt 0 ]; do
case "$1" in
*=*) envs+=("$1"); shift ;;
*) break ;;
*=*)
local key="${1%%=*}"
_validate_env_key "$key" || return 1
envs+=("$1")
shift
;;
*)
break
;;
esac
done
local mutation; mutation="$(cat)"
@@ -258,6 +305,8 @@ def _validate(d):
raise SystemExit(f"VALIDATE: tmux_sessions[{i}] not a mapping")
if not s.get('name') or not s.get('status'):
raise SystemExit(f"VALIDATE: tmux_sessions[{i}] missing name/status")
if s.get('role') is not None and (not isinstance(s['role'], str) or not s['role'].strip()):
raise SystemExit(f"VALIDATE: tmux_sessions[{i}] {s.get('name')!r} role must be a non-empty string")
if s['status'] not in valid:
raise SystemExit(f"VALIDATE: tmux_sessions[{i}] {s.get('name')!r} bad status {s['status']!r}")
if not isinstance(s.get('pane'), dict):
@@ -276,19 +325,8 @@ for f in [db_path, db_path + '-wal', db_path + '-shm']:
except Exception:
pass
def is_nfs(path):
try:
df_out = subprocess.check_output(['df', '--output=target', path], text=True, stderr=subprocess.DEVNULL)
target = df_out.strip().split('\n')[-1].strip()
mount_out = subprocess.check_output(['mount'], text=True)
for line in mount_out.split('\n'):
if f" on {target} " in line and (' type nfs ' in line or ' type cifs ' in line or ' fuse.sshfs ' in line):
return True
except Exception:
pass
return False
if is_nfs(os.path.dirname(db_path) or '.'):
is_nfs = os.environ.get('MAM_IS_NFS') == 'true'
if is_nfs:
conn.execute('PRAGMA journal_mode=DELETE')
else:
conn.execute('PRAGMA journal_mode=WAL')
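The journal-mode choice above, combined with the `BEGIN IMMEDIATE` write lock and the role immutability rule introduced elsewhere in this diff, might fit together as in the sketch below. The `sessions` table, its columns, and both function names are assumptions made for illustration; the real script keeps its state in a different shape.

```python
import sqlite3


def open_registry(db_path, is_nfs):
    """Open the database in autocommit mode and pick the journal mode."""
    conn = sqlite3.connect(db_path, isolation_level=None)
    # WAL relies on shared memory, so fall back to DELETE on network filesystems.
    conn.execute("PRAGMA journal_mode=" + ("DELETE" if is_nfs else "WAL"))
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sessions (name TEXT PRIMARY KEY, role TEXT)"
    )
    return conn


def set_role(conn, name, role):
    """Record a session's role; refuse to change a role that is already set."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        row = conn.execute(
            "SELECT role FROM sessions WHERE name = ?", (name,)
        ).fetchone()
        if row and row[0] and row[0] != role:
            raise ValueError(
                f"role of {name!r} cannot change from {row[0]!r} to {role!r}"
            )
        conn.execute(
            "INSERT OR REPLACE INTO sessions (name, role) VALUES (?, ?)",
            (name, role),
        )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

Taking the write lock before the read means two writers cannot both observe "no role yet" and then race to set different roles.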
@@ -330,10 +368,17 @@ try:
d['tmux_sessions'] = []
old_terminals = get_terminal_set(d)
old_roles = {s.get('name'): s.get('role') for s in db_sessions if s.get('role')}
# --- caller mutation (module scope: sees d, yaml, os, glob, subprocess) ---
exec(compile(os.environ['AGENT_SESSIONS_MUTATION'], '<mutation>', 'exec'), globals())
# Role immutability check
for s in d.get('tmux_sessions', []):
name = s.get('name')
if name in old_roles and s.get('role') != old_roles[name]:
raise SystemExit(f"VALIDATE: role of session {name!r} cannot be modified from {old_roles[name]!r} to {s.get('role')!r}")
_validate(d)
# Separate globals and sessions for normalization
@@ -451,6 +496,10 @@ def hermes_exists(uuid):
return False
def cline_exists(uuid):
return os.path.exists(f"{home}/.cline/data/sessions/{uuid}/{uuid}.json")
def emit(u):
print(u)
raise SystemExit(0)
@@ -500,6 +549,10 @@ for s in sessions:
cand = s.get('hermes_conversation_id_own')
if cand and hermes_exists(cand):
emit(cand)
if agent == 'cline' and name.endswith('-creator-cline'):
cand = s.get('cline_conversation_id_own')
if cand and cline_exists(cand):
emit(cand)
# 2) disk scan scoped to THIS workspace
if agent == 'claude':
@@ -542,6 +595,27 @@ elif agent == 'hermes':
cand = None
if cand:
emit(cand)
elif agent == 'cline':
sessions_dir = f"{home}/.cline/data/sessions"
if os.path.isdir(sessions_dir):
candidates = []
for session_folder in glob.glob(f"{sessions_dir}/*"):
if os.path.isdir(session_folder):
folder_name = os.path.basename(session_folder)
json_file = f"{session_folder}/{folder_name}.json"
if os.path.exists(json_file):
candidates.append(json_file)
candidates.sort(key=os.path.getmtime, reverse=True)
for j in candidates:
try:
with open(j) as f:
sdata = json.load(f)
if sdata.get('cwd') == ws or sdata.get('workspace_root') == ws:
sid = sdata.get('session_id')
if sid:
emit(sid)
except Exception:
pass
# 3) agent_identities cache, ONLY when its project_cwd == this workspace
ai = {}
@@ -573,6 +647,10 @@ if ai_agent.get('project_cwd') == ws:
cand = ai_agent.get('session_id') or ai.get('conversation_id')
if cand and hermes_exists(cand):
emit(cand)
elif agent == 'cline':
cand = ai_agent.get('session_id') or ai.get('conversation_id')
if cand and cline_exists(cand):
emit(cand)
print('')
PYEOF
@@ -646,13 +724,13 @@ PYEOF
}
# ---------------------------------------------------------------------------
# tmux-agent-orchestrate-delegate-job integration helpers
# multi-agent-mux-delegate-job integration helpers
#
# All paths are resolved relative to lib.sh's own location (BASH_SOURCE), so the
# skill tree is relocatable — no hardcoded absolute paths (review item 6).
# ---------------------------------------------------------------------------
# _delegate_py_bin — echo the virtualenv python (walk up from skills/), else python3.
# _delegate_py_bin — echo the virtualenv python (walk up from .agents/skills/), else python3.
_delegate_py_bin() {
# Return cached result if available (shell variable, not exported — avoids cross-workspace pollution)
if [ -n "${AGENT_PYTHON_BIN:-}" ] && [ -x "$AGENT_PYTHON_BIN" ]; then
@@ -671,26 +749,26 @@ _delegate_py_bin() {
printf '%s\n' "$AGENT_PYTHON_BIN"
}
# _delegate_script <name> — echo the path to a tmux-agent-orchestrate-delegate-job script, resolved
# relative to skills/ (lib.sh dir). Empty if not found.
# _delegate_script <name> — echo the path to a multi-agent-mux-delegate-job script, resolved
# relative to .agents/skills/ (lib.sh dir). Empty if not found.
_delegate_script() {
local name="$1" skill_dir cand
skill_dir="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
cand="$skill_dir/tmux-agent-orchestrate-delegate-job/scripts/$name"
cand="$skill_dir/multi-agent-mux-delegate-job/scripts/$name"
if [ -f "$cand" ]; then printf '%s\n' "$cand"; return 0; fi
printf '%s\n' "$(find "$skill_dir" -name "$name" 2>/dev/null | head -n 1 || true)"
}
# delegate_submit_job <prompt> <agent> <agent_session>
#
# Register a job in the tmux-agent-orchestrate-delegate-job registry. Prints the new JID on stdout.
# Register a job in the multi-agent-mux-delegate-job registry. Prints the new JID on stdout.
delegate_submit_job() {
local prompt="$1" agent="$2" session="$3"
local py_bin registry_py
py_bin="$(_delegate_py_bin)"
registry_py="$(_delegate_script registry.py)"
if [ -z "$registry_py" ] || [ ! -f "$registry_py" ]; then
echo "ERROR: tmux-agent-orchestrate-delegate-job registry.py not found under skills/" >&2
echo "ERROR: multi-agent-mux-delegate-job registry.py not found under .agents/skills/" >&2
return 1
fi
"$py_bin" "$registry_py" register \
@@ -701,7 +779,7 @@ delegate_submit_job() {
# delegate_publish_event <job_id> <event> [detail]
#
# Publish a lifecycle event to the tmux-agent-orchestrate-delegate-job registry. Consolidates the
# Publish a lifecycle event to the multi-agent-mux-delegate-job registry. Consolidates the
# inline .venv-walk + publish_event.py blocks that were duplicated across
# create/delete/resume (review item 7). Non-fatal by contract: an empty job id,
# a missing script, or a broker failure never aborts the caller.
@@ -723,16 +801,28 @@ delegate_publish_event() {
start_watchdog() {
local job_id="$1"
local workdir="${2:-$PWD}"
local watchdog_script="$workdir/skills/tmux-agent-orchestrate-monitor/scripts/watchdog.sh"
local log_file="$workdir/.hermes/jobs/${job_id}.watchdog.log"
local monitor_script="$workdir/.agents/skills/multi-agent-mux-monitor/scripts/reconcile.sh"
local log_file="$workdir/.mam/multi-agent-mux-monitor.log"
if [ ! -x "$watchdog_script" ]; then
echo "ERROR: watchdog not found or not executable: $watchdog_script" >&2
if [ ! -f "$monitor_script" ]; then
echo "ERROR: monitor script not found: $monitor_script" >&2
return 1
fi
nohup "$watchdog_script" "$job_id" "$workdir" > "$log_file" 2>&1 &
local pid=$!
# Check if reconcile.sh --subscribe is already running on this workspace
local pid
pid=$(pgrep -f "bash $monitor_script --subscribe" || true)
if [ -z "$pid" ]; then
# Start the wildcard monitor subscriber daemon with --idle-timeout 0 (never idle out)
# and ensure it runs with $workdir as cwd to anchor relative log paths.
local orig_pwd="$PWD"
cd "$workdir"
nohup bash "$monitor_script" --subscribe --idle-timeout 0 >> "$log_file" 2>&1 &
pid=$!
cd "$orig_pwd"
fi
echo "$pid"
}
@@ -1,6 +1,6 @@
---
name: tmux-agent-orchestrate-create
description: "Create a new agent session (claude, antigravity/agy) in a dedicated tmux session for context-preserving long-running work. Always creates a tmux session — never backgrounds with nohup/disown. Writes the new session to .hermes/agent-sessions.yaml. Use when you want to start a fresh agent (no prior UUID) for a new project workspace."
name: multi-agent-mux-create
description: "Create a new agent session (claude, antigravity/agy) in a dedicated tmux session for context-preserving long-running work. Always creates a tmux session — never backgrounds with nohup/disown. Writes the new session to .mam/agent-sessions.yaml. Use when you want to start a fresh agent (no prior UUID) for a new project workspace."
version: 1.0.0
author: godopu
license: MIT
@@ -9,18 +9,18 @@ environments: [terminal, tmux]
metadata:
hermes:
tags: [agent, tmux, claude, antigravity, agy, multi-agent, context, session]
related_skills: [tmux-agent-orchestrate-resume, tmux-agent-orchestrate-stop, tmux-agent-orchestrate-monitor, claude-code]
related_skills: [multi-agent-mux-resume, multi-agent-mux-stop, multi-agent-mux-monitor, claude-code]
prereq_skills: [claude-code]
---
# Multi-Agent Create — Start a Fresh Agent in a tmux Session
> **Companion skills**: `tmux-agent-orchestrate-resume` (resume an existing UUID), `tmux-agent-orchestrate-stop` (terminate), `tmux-agent-orchestrate-monitor` (live status).
> **Single source of truth**: `./.hermes/agent-sessions.yaml` (this skill writes to it; never read it ad-hoc — go through this skill).
> **Companion skills**: `multi-agent-mux-resume` (resume an existing UUID), `multi-agent-mux-stop` (terminate), `multi-agent-mux-monitor` (live status).
> **Single source of truth**: `./.mam/agent-sessions.yaml` (this skill writes to it; never read it ad-hoc — go through this skill).
## What this skill does
Spawn a new agent (`claude` or `agy`/antigravity-cli) in a **dedicated tmux session** for context-preserving long-running work. The tmux session is the *container*; the agent's session ID is *data* inside the container. **This skill creates the container + starts the agent — but does not resume an old conversation** (use `tmux-agent-orchestrate-resume` for that).
Spawn a new agent (`claude` or `agy`/antigravity-cli) in a **dedicated tmux session** for context-preserving long-running work. The tmux session is the *container*; the agent's session ID is *data* inside the container. **This skill creates the container + starts the agent — but does not resume an old conversation** (use `multi-agent-mux-resume` for that).
For all agents: the tmux session name is produced by **`lib.sh::derive_session_name`** — the single source of truth shared by create/resume/stop/status/monitor (P0-A). The rule (verbatim from the function):
@@ -74,12 +74,12 @@ To prevent this, you can run this skill inside an **isolated tmux server** using
```
2. **Via Option Flag**:
```bash
bash scripts/create_session.sh --workspace /path/to/project --agent claude --tmux-server multi-agent-canary
bash scripts/create_session.sh --workspace /path/to/project --agent claude --role developer --tmux-server multi-agent-canary
```
3. **Submit Job Integration**:
You can automatically register a delegated job with a prompt when creating a session:
```bash
bash scripts/create_session.sh --workspace /path/to/project --agent claude --submit-job "Task prompt here"
bash scripts/create_session.sh --workspace /path/to/project --agent claude --role developer --submit-job "Task prompt here"
```
### Recommended Alias
@@ -98,12 +98,12 @@ tmc ls # Lists only your multi-agent sessions
```bash
WORKSPACE=/path/to/project
AGENT=claude # or agy
source skills/lib.sh
source .agents/skills/lib.sh
SESSION_NAME="$(derive_session_name "$WORKSPACE" "$AGENT")"
# 1. If session already alive, fail fast
tmux has-session -t "$SESSION_NAME" 2>/dev/null && {
echo "ERROR: tmux session '$SESSION_NAME' already exists. Use tmux-agent-orchestrate-resume to attach or tmux-agent-orchestrate-stop first."
echo "ERROR: tmux session '$SESSION_NAME' already exists. Use multi-agent-mux-resume to attach or multi-agent-mux-stop first."
exit 1
}
@@ -137,7 +137,7 @@ TMUX_EPOCH=$(tmux list-sessions -F '#{session_created}' -t "$SESSION_NAME" 2>/de
## Registering the session in agent-sessions.yaml
After spawn, append a new `tmux_sessions[]` entry to `.hermes/agent-sessions.yaml`:
After spawn, append a new `tmux_sessions[]` entry to `.mam/agent-sessions.yaml`:
```yaml
- name: <SESSION_NAME>
@@ -172,8 +172,8 @@ After spawn, append a new `tmux_sessions[]` entry to `.hermes/agent-sessions.yam
Use the `agent-sessions-yaml-edit` script in `scripts/` to safely append (preserves comments + format):
```bash
bash skills/tmux-agent-orchestrate-create/scripts/create_session.sh \
--workspace "$WORKSPACE" --agent "$AGENT" --session "$SESSION_NAME"
bash .agents/skills/multi-agent-mux-create/scripts/create_session.sh \
--workspace "$WORKSPACE" --agent "$AGENT" --role "$ROLE" --session "$SESSION_NAME"
```
The script handles the YAML append, pane capture, and the `last_visible_status` placeholder.
@@ -184,7 +184,7 @@ The script handles the YAML append, pane capture, and the `last_visible_status`
- **Don't trust `--session-id <uuid>` flags blindly** — claude/agy may not accept a fixed session id on first spawn. The session id is *assigned* on first user message; you can read it back from `~/.claude/projects/.../session.jsonl` headers or `~/.gemini/.../cache/last_conversations.json` AFTER the first message.
- **Wrapper script MUST NOT be created via `hermes profile alias`** — that command writes a `hermes -p <profile>` wrapper that destroys the tmux behavior. Create wrappers manually (see `lab-landing-page-creator-claude` template).
- **Always use the workspace-relative path** in tmux `cwd` — relative paths break when tmux respawns in a different shell context.
- **The first `claude` message generates the session id**: `tmux-agent-orchestrate-create` only sets up the *container*. If you need a known session id for later resume, send a placeholder message (e.g. "init") and read it back, then call `tmux-agent-orchestrate-resume` later.
- **The first `claude` message generates the session id**: `multi-agent-mux-create` only sets up the *container*. If you need a known session id for later resume, send a placeholder message (e.g. "init") and read it back, then call `multi-agent-mux-resume` later.
## Verification
@@ -200,7 +200,7 @@ tmux list-panes -t "$SESSION_NAME" -F 'cmd=#{pane_current_command} cwd=#{pane_cu
# 3. agent-sessions.yaml has the new entry
python3 -c "
import yaml
d = yaml.safe_load(open('.hermes/agent-sessions.yaml'))
d = yaml.safe_load(open('.mam/agent-sessions.yaml'))
names = [s['name'] for s in d['tmux_sessions']]
assert '$SESSION_NAME' in names, 'session not registered'
print('OK:', names)
@@ -214,7 +214,7 @@ tmux capture-pane -t "$SESSION_NAME" -p -S -20
## When NOT to use this skill
- **Resuming an old conversation** → `tmux-agent-orchestrate-resume`
- **Killing an existing session** → `tmux-agent-orchestrate-stop`
- **Resuming an old conversation** → `multi-agent-mux-resume`
- **Killing an existing session** → `multi-agent-mux-stop`
- **Just attaching to an existing session** → `tmux attach -t <name>` (no skill needed)
- **One-shot print mode (claude -p "...")** → no tmux needed; use `claude-code` skill's print mode
@@ -1,7 +1,7 @@
#!/usr/bin/env bash
# create_session.sh: companion script for tmux-agent-orchestrate-create
# create_session.sh: companion script for multi-agent-mux-create
# Usage:
# bash create_session.sh --workspace <path> --agent <claude|agy> [--session <name>] [--wrapper]
# bash create_session.sh --workspace <path> --agent <claude|agy> --role <role> [--session <name>] [--wrapper]
#
# Behavior:
# 1) preflight: tmux/claude/agy availability, workspace exists
@@ -15,7 +15,7 @@
# 0 = success
# 1 = preflight failure
# 2 = invalid args
# 3 = tmux session already exists (use tmux-agent-orchestrate-resume or delete first)
# 3 = tmux session already exists (use multi-agent-mux-resume or delete first)
# 4 = agent-sessions.yaml append failure
set -euo pipefail
@@ -23,22 +23,24 @@ source "$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)/lib.sh"
usage() {
cat <<EOF
Usage: $0 --workspace <path> --agent <claude|agy|hermes> [options]
Usage: $0 --workspace <path> --agent <claude|agy|hermes|cline> --role <role> [options]
Options:
--workspace PATH project directory (required)
--agent AGENT claude | agy | hermes (required)
--agent AGENT claude | agy | hermes | cline (required)
--role ROLE assigned role (required)
--session NAME tmux session name (default: derived from workspace)
--wrapper force use of ~/.local/bin/<session> wrapper even if not present
--dry-run print commands without executing
--tmux-server NAME specify isolated tmux server name
--submit-job PROMPT submit a job to tmux-agent-orchestrate-delegate-job registry with the given prompt
--submit-job PROMPT submit a job to multi-agent-mux-delegate-job registry with the given prompt
-h, --help this help
EOF
}
WORKSPACE=""
AGENT=""
ROLE=""
SESSION_NAME=""
USE_WRAPPER=0
DRY_RUN=0
@@ -49,6 +51,7 @@ while [ $# -gt 0 ]; do
case "$1" in
--workspace) WORKSPACE="$2"; shift 2 ;;
--agent) AGENT="$2"; shift 2 ;;
--role) ROLE="$2"; shift 2 ;;
--session) SESSION_NAME="$2"; shift 2 ;;
--wrapper) USE_WRAPPER=1; shift ;;
--dry-run) DRY_RUN=1; shift ;;
@@ -66,6 +69,7 @@ fi
# Preflight
[ -n "$WORKSPACE" ] || { echo "ERROR: --workspace required" >&2; usage; exit 2; }
[ -n "$AGENT" ] || { echo "ERROR: --agent required" >&2; usage; exit 2; }
[ -n "$ROLE" ] || { echo "ERROR: --role required" >&2; usage; exit 2; }
[ -d "$WORKSPACE" ] || { echo "ERROR: workspace $WORKSPACE not a directory" >&2; exit 1; }
command -v tmux >/dev/null || { echo "ERROR: tmux not installed" >&2; exit 1; }
command -v "$AGENT" >/dev/null || { echo "ERROR: $AGENT CLI not in PATH" >&2; exit 1; }
@@ -86,6 +90,11 @@ elif [ "$AGENT" = "hermes" ]; then
echo "ERROR: hermes is not functional. Run 'hermes setup' first." >&2
exit 1
fi
elif [ "$AGENT" = "cline" ]; then
if ! cline history --json >/dev/null 2>&1; then
echo "ERROR: cline is not functional or configured." >&2
exit 1
fi
fi
# Session name: lib.sh::derive_session_name is the single source of truth (P0-A)
@@ -95,7 +104,7 @@ fi
# Fail if the session is already alive
if _tmux has-session -t "$SESSION_NAME" 2>/dev/null; then
echo "ERROR: tmux session '$SESSION_NAME' already exists. Use tmux-agent-orchestrate-resume to attach, or tmux-agent-orchestrate-stop first." >&2
echo "ERROR: tmux session '$SESSION_NAME' already exists. Use multi-agent-mux-resume to attach, or multi-agent-mux-stop first." >&2
exit 3
fi
@@ -106,11 +115,11 @@ WRAPPER="$LOCAL_BIN/$SESSION_NAME"
spawn() {
case "$AGENT" in
claude)
if [ -x "$WRAPPER" ] || [ "$USE_WRAPPER" = "1" ]; then
if { [ -x "$WRAPPER" ] && [ "$(basename "$WRAPPER")" != "claude" ]; } || [ "$USE_WRAPPER" = "1" ]; then
nohup "$WRAPPER" >/dev/null 2>&1 &
disown
else
_tmux new-session -d -s "$SESSION_NAME" -x 140 -y 40 -c "$WORKSPACE" "claude"
_tmux new-session -d -s "$SESSION_NAME" -x 140 -y 40 -c "$WORKSPACE" "claude --dangerously-skip-permissions"
fi
;;
agy)
@@ -119,7 +128,10 @@ spawn() {
hermes)
_tmux new-session -d -s "$SESSION_NAME" -x 140 -y 40 -c "$WORKSPACE" "hermes"
;;
*) echo "ERROR: --agent must be claude, agy or hermes, got: $AGENT" >&2; exit 2 ;;
cline)
_tmux new-session -d -s "$SESSION_NAME" -x 140 -y 40 -c "$WORKSPACE" "cline -i"
;;
*) echo "ERROR: --agent must be claude, agy, hermes or cline, got: $AGENT" >&2; exit 2 ;;
esac
}
@@ -142,9 +154,10 @@ NOW_ISO=$(date -u +'%Y-%m-%dT%H:%M:%SZ')
# Determine cmd_full
case "$AGENT" in
claude) CMD_FULL='claude' ;;
claude) CMD_FULL='claude --dangerously-skip-permissions' ;;
agy) CMD_FULL='agy --dangerously-skip-permissions' ;;
hermes) CMD_FULL='hermes' ;;
cline) CMD_FULL='cline -i' ;;
esac
# Start command
@@ -158,10 +171,10 @@ case "$AGENT" in
if [ -x "$WRAPPER" ]; then
START_CMD="$WRAPPER # wrapper in ~/.local/bin"
else
START_CMD="$local_tmux new-session -d -s \"$SESSION_NAME\" -x 140 -y 40 -c \"$WORKSPACE\" \"claude\""
START_CMD="$local_tmux new-session -d -s \"$SESSION_NAME\" -x 140 -y 40 -c \"$WORKSPACE\" \"claude --dangerously-skip-permissions\""
fi
;;
agy|hermes)
agy|hermes|cline)
START_CMD="$local_tmux new-session -d -s \"$SESSION_NAME\" -x 140 -y 40 -c \"$WORKSPACE\" \"$CMD_FULL\""
;;
esac
@@ -174,6 +187,8 @@ if [ -n "$SUBMIT_JOB_PROMPT" ]; then
delegate_agent="claude-code"
elif [ "$AGENT" = "hermes" ]; then
delegate_agent="hermes-agent"
elif [ "$AGENT" = "cline" ]; then
delegate_agent="cline-agent"
else
delegate_agent="antigravity-cli"
fi
@@ -191,7 +206,7 @@ fi
# All values are passed via environment variables; no heredoc interpolation (P1-B).
# The child pid is resolved up front in bash with pgrep (P2: filter by tool name).
CHILD_PID=0
if { [ "$AGENT" = "agy" ] || [ "$AGENT" = "hermes" ]; } && [ -n "$PANE_PID" ]; then
if { [ "$AGENT" = "agy" ] || [ "$AGENT" = "hermes" ] || [ "$AGENT" = "cline" ]; } && [ -n "$PANE_PID" ]; then
CHILD_PID=$(pgrep -P "$PANE_PID" -x "$AGENT" 2>/dev/null | head -1 || true)
CHILD_PID="${CHILD_PID:-0}"
fi
@@ -201,9 +216,10 @@ atomic_dump_yaml "$AGENT_SESSIONS_YAML" \
TMUX_EPOCH="$TMUX_EPOCH" PANE_PID="$PANE_PID" PANE_CWD="$PANE_CWD" \
CMD_FULL="$CMD_FULL" START_CMD="$START_CMD" CHILD_PID="$CHILD_PID" \
TMUX_SERVER_NAME="${TMUX_SERVER_NAME:-default}" \
DELEGATE_JOB_ID="$DELEGATE_JOB_ID" <<'PYEOF'
DELEGATE_JOB_ID="$DELEGATE_JOB_ID" ROLE="$ROLE" <<'PYEOF'
name = os.environ['SESSION_NAME']
agent = os.environ['AGENT']
role = os.environ['ROLE']
pid = os.environ.get('PANE_PID', '')
epoch = os.environ.get('TMUX_EPOCH', '')
server_name = os.environ.get('TMUX_SERVER_NAME', 'default')
@@ -222,6 +238,7 @@ sessions[:] = [s for s in sessions if s.get('name') != name]
entry = {
'name': name,
'status': 'running',
'role': role,
'tmux_session_created_at': os.environ['NOW_ISO'],
'tmux_session_epoch': int(epoch) if epoch.isdigit() else 0,
'tmux_server': server_name,
@@ -265,6 +282,11 @@ elif agent == 'hermes':
entry['child_pid'] = int(cp) if cp.isdigit() else 0
entry['hermes_conversation_id_own'] = None
entry['last_visible_status'] = "TUI started; awaiting first user message"
elif agent == 'cline':
cp = os.environ.get('CHILD_PID', '0')
entry['child_pid'] = int(cp) if cp.isdigit() else 0
entry['cline_conversation_id_own'] = None
entry['last_visible_status'] = "TUI started; awaiting first user message"
sessions.append(entry)
@@ -279,7 +301,7 @@ echo "=== created ==="
echo "tmux session: $SESSION_NAME (pane pid $PANE_PID, cmd $PANE_CMD, cwd $PANE_CWD)"
if [ -n "$DELEGATE_JOB_ID" ]; then
echo "delegate job: $DELEGATE_JOB_ID"
delegate_publish_event "$DELEGATE_JOB_ID" started "tmux-agent-orchestrate session created"
delegate_publish_event "$DELEGATE_JOB_ID" started "multi-agent-mux session created"
WD_PID=$(start_watchdog "$DELEGATE_JOB_ID" "$WORKSPACE")
echo "watchdog PID: $WD_PID"
fi
@@ -290,5 +312,5 @@ if [ -n "${TMUX_SERVER_NAME:-}" ] && [ "$TMUX_SERVER_NAME" != "default" ]; then
else
echo "Attach: tmux attach -t $SESSION_NAME"
fi
echo "Delete: use tmux-agent-orchestrate-stop skill"
echo "Resume: use tmux-agent-orchestrate-resume skill (after first message creates a session id)"
echo "Delete: use multi-agent-mux-stop skill"
echo "Resume: use multi-agent-mux-resume skill (after first message creates a session id)"
@@ -0,0 +1,116 @@
# Task Delegation Types Design Specification
This document defines **Task Delegation Types** for the `multi-agent-mux-delegate-job` skill. It is a design specification for systematically orchestrating agent collaboration structures (loops, discussions, and so on) that go beyond single-agent execution.
---
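## 1. Overview and Motivation
The existing job delegation system was a **one-way direct delegation (Direct)** structure: it passed instructions (a prompt) to a single agent and ended the job on receiving a `completed` or `error` event.
Real collaboration, however, needs more advanced workflows such as the following:
1. **Research & Discussion**: Several agents exchange views until they reach agreement, in order to draw up a plan or review a concept.
2. **Worker-Reviewer Loop**: When the worker modifies code, the reviewer examines it, and feedback and revision repeat until the reviewer gives a `PASS`.
Job types are introduced so that these collaboration workflows can be **controlled at the orchestrator (delegation script) layer** without modifying any individual agent's internal code.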
---
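## 2. Delegation Type Definitions
| Type (`--type`) | Description | Workflow |
|------------------|------|----------------|
| `direct` (default) | Direct delegation to a single agent | Instruct → agent executes → exit on completed/error |
| `loop` | Worker-reviewer feedback loop | Worker runs → reviewer invoked automatically on completion → exit if the review passes; on failure, re-invoke the worker with the feedback (repeat) |
| `discuss` | Research and mutual discussion | Agent A (writes draft) → Agent B (reviews and comments) → Agent A (incorporates and revises) → exit on consensus |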
---
## 3. CLI Specification Extension
The following options are added to the `multi-agent-mux-delegate-job submit` command.
```bash
multi-agent-mux-delegate-job submit \
--prompt <text> \
--agent <worker_agent> \
--agent-session <worker_session> \
[--type <direct|loop|discuss>] \
[--reviewer <reviewer_agent>] \
[--reviewer-session <reviewer_session>] \
[--max-iterations <count>] \
[--validate <script>]
```
### New option details:
* `--type`: Specifies the delegation type (`direct`, `loop`, `discuss`).
* `--reviewer`: Name of the agent responsible for review (default: `hermes`).
* `--reviewer-session`: Name of the tmux session the reviewer agent is running in (default: `tmux:hermes`).
* `--max-iterations`: Maximum number of iterations for a loop or discussion (default: `5`).
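The option set above could be parsed roughly as follows. This is a minimal sketch, not the wrapper's actual parser: the function name `build_submit_parser` is hypothetical, and treating `--prompt`, `--agent`, and `--agent-session` as required is an assumption. The defaults are the ones listed above.

```python
import argparse


def build_submit_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the `submit` options described above."""
    p = argparse.ArgumentParser(prog="multi-agent-mux-delegate-job submit")
    # Assumption: the three worker-side options are mandatory.
    p.add_argument("--prompt", required=True)
    p.add_argument("--agent", required=True)
    p.add_argument("--agent-session", required=True)
    # Defaults below are the ones stated in the option list.
    p.add_argument("--type", dest="job_type",
                   choices=["direct", "loop", "discuss"], default="direct")
    p.add_argument("--reviewer", default="hermes")
    p.add_argument("--reviewer-session", default="tmux:hermes")
    p.add_argument("--max-iterations", type=int, default=5)
    p.add_argument("--validate", default=None)
    return p


args = build_submit_parser().parse_args([
    "--prompt", "fix the bug",
    "--agent", "claude-code",
    "--agent-session", "tmux:demo",
    "--type", "loop",
])
print(args.job_type, args.reviewer, args.reviewer_session, args.max_iterations)
# prints: loop hermes tmux:hermes 5
```

Restricting `--type` with `choices` rejects an unknown delegation type at parse time, before any job is registered.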
---
## 4. Worker-Reviewer Loop (`loop`) State Machine
The orchestrator delegates work sequentially and subscribes to events according to the following state diagram.
```mermaid
stateDiagram-v2
[*] --> Worker_Pending : Submit Job
Worker_Pending --> Worker_Running : Worker picks job
Worker_Running --> Reviewer_Pending : Worker emits "completed"
Worker_Running --> Error_Terminal : Worker emits "error"
Reviewer_Pending --> Reviewer_Running : Reviewer picks job
Reviewer_Running --> Success_Terminal : Reviewer emits "completed" (PASS)
Reviewer_Running --> Worker_Pending : Reviewer emits "error" / "completed" (Feedback) / Increment Iteration
Reviewer_Running --> Error_Terminal : Reviewer emits "error" & Iteration > Max
Success_Terminal --> [*]
Error_Terminal --> [*]
```
### Step-by-step protocol:
1. **Run the worker**:
* The orchestrator registers the job as `pending` and dispatches it with `agent_session` set to the worker session (e.g. `tmux:claude`).
* When the worker finishes and publishes a `completed` event, the orchestrator intercepts it.
2. **Switch to the reviewer**:
* Instead of ending the whole job, the orchestrator changes the job record's `agent_session` to the reviewer session (e.g. `tmux:hermes`).
* It automatically assembles the prompt to send to the reviewer:
> *"Review the changes/artifacts generated for job $JOB_ID. Check if they meet the requirements. If correct, publish completed event with 'PASS'. If there are issues, publish error event with detailed feedback/nits."*
* It resets the status to `pending` so that the reviewer session can pick up the job.
3. **Judge the review result**:
* **On PASS**: If the reviewer sends a `completed` event with a `"PASS"` message, the job is marked as finally `completed` and the orchestrator exits.
* **On feedback**: If the reviewer publishes feedback with an `error` event or a plain `completed` event, the orchestrator increments the iteration count (`iteration`) by 1.
* If the maximum iteration count (`max-iterations`) is exceeded, the job ends with a final `error`.
* Otherwise, it points `agent_session` back at the worker session, assembles a prompt, and returns the job to `pending`:
> *"The reviewer provided the following feedback for job $JOB_ID: <reviewer feedback>. Please modify the code/artifacts to address these comments."*
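The protocol above can be sketched as a small transition function. This is an illustration under stated assumptions, not the wrapper's implementation (which the plan describes as a shell-script state machine): the names `LoopJob` and `on_event` are hypothetical, the worker and reviewer sessions are assumed to be distinct, and a pass is detected by the substring `PASS` in the event detail.

```python
from dataclasses import dataclass


@dataclass
class LoopJob:
    """Hypothetical in-memory stand-in for a `loop` job record."""
    worker_session: str
    reviewer_session: str
    max_iterations: int = 5
    iteration: int = 0
    agent_session: str = ""
    status: str = "pending"

    def __post_init__(self) -> None:
        # A new job always starts with the worker as owner.
        self.agent_session = self.worker_session


def on_event(job: LoopJob, event: str, detail: str = "") -> LoopJob:
    """Advance the job when its current owner emits `completed` or `error`."""
    if job.status != "pending":
        return job  # already terminal, ignore late events
    if job.agent_session == job.worker_session:
        if event == "completed":
            job.agent_session = job.reviewer_session  # hand over, stay pending
        elif event == "error":
            job.status = "error"
        return job
    # Reviewer's turn.
    if event == "completed" and "PASS" in detail:
        job.status = "completed"
        return job
    job.iteration += 1  # feedback via `error` or a plain `completed`
    if job.iteration > job.max_iterations:
        job.status = "error"
    else:
        job.agent_session = job.worker_session  # back to the worker
    return job


job = LoopJob("tmux:claude", "tmux:hermes", max_iterations=1)
on_event(job, "completed")                  # worker done, reviewer takes over
on_event(job, "error", "rename the flag")   # feedback, iteration becomes 1
on_event(job, "completed")                  # worker done again
on_event(job, "completed", "PASS")          # reviewer approves
print(job.status, job.iteration)            # prints: completed 1
```

Keeping the transition logic in one pure function makes the loop testable without tmux or a broker; prompt assembly and event subscription would sit around it.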
---
## 5. Discussion and Consensus (`discuss`) State Machine
In the discuss type, the worker and reviewer refine a plan together as equals (e.g. a PM/planning agent and a design agent).
1. **Draft (PM/Researcher)**:
* The initial prompt is sent to the PM session. The PM analyzes the requirements, writes a draft to a file (e.g. `draft_plan.md`), and publishes `completed`.
2. **Review (Designer/Developer)**:
* The job is switched to the designer session with the following prompt:
> *"Read draft_plan.md. Review its technical feasibility. Write your feedback/objections to draft_plan.md or comments.md. If you agree with the plan, reply with 'AGREE'."*
3. **Check for consensus**:
* If the counterpart sends `"AGREE"`, consensus is reached and the job completes.
* If there are objections, the job is handed back to the PM session so that it revises the plan.
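The alternation above can be illustrated with two plain callables standing in for the PM and designer sessions. This is a sketch under assumptions: `run_discussion` and the stub agents are hypothetical, agreement is detected by the substring `AGREE`, and the round limit reuses the `--max-iterations` default of 5.

```python
from typing import Callable, Optional, Tuple


def run_discussion(
    pm_turn: Callable[[Optional[str]], str],
    designer_turn: Callable[[str], str],
    max_iterations: int = 5,
) -> Tuple[bool, int]:
    """Alternate draft and review until the designer agrees or rounds run out.

    Returns (agreed, rounds_used).
    """
    draft = pm_turn(None)  # initial draft, no feedback yet
    for round_no in range(1, max_iterations + 1):
        reply = designer_turn(draft)
        if "AGREE" in reply:
            return True, round_no
        draft = pm_turn(reply)  # PM revises the plan using the objections
    return False, max_iterations


# Stub agents for illustration only.
def pm(feedback: Optional[str]) -> str:
    return "plan v1" if feedback is None else "plan v2"


def designer(draft: str) -> str:
    return "AGREE" if draft == "plan v2" else "Objection: no rollback step"


print(run_discussion(pm, designer))  # prints: (True, 2)
```

In a real run each callable would wrap a job hand-off to the corresponding tmux session and wait for its `completed` event.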
---
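## 6. Implementation Plan
1. **Extend `registry.py`**:
* Add the fields `job_type` (default `"direct"`), `reviewer`, `reviewer_session`, `max_iterations`, and `iteration` to the job record model.
* Register the new parameters in the `register_job` function and the CLI parser.
2. **Modify the `multi-agent-mux-delegate-job` wrapper script**:
* In `cmd_submit`, accept the delegation type (`--type`) and write a shell-script state machine that runs the loop.
* At the end of each episode, change the state so that ownership passes back and forth between the worker and the reviewer.
3. **Verification**:
* Build a mock worker/reviewer scenario, or run a mutual-review loop directly in claude/hermes sessions, and test that it converges normally.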
@@ -0,0 +1,11 @@
# multi-agent-mux-delegate-job skill
A general-purpose agent collaboration skill that delegates a job to an autonomous agent (claude-code/hermes/agy/cline/codex/opencode/human) and observes it asynchronously over an MQTT
event channel. **Start with [`SKILL.md`](./SKILL.md).**
- Protocol/schema: [`job-protocol.md`](./job-protocol.md)
- Broker PoC → production migration: [`mqtt-broker-setup.md`](./mqtt-broker-setup.md)
- Registry format/concurrency: [`registry.md`](./registry.md)
- Reference implementation: [`multi-agent-mux-delegate-job`](./multi-agent-mux-delegate-job) (bash wrapper), [`scripts/publish_event.py`](./scripts/publish_event.py), [`scripts/job_subscriber.py`](./scripts/job_subscriber.py), [`scripts/registry.py`](./scripts/registry.py), [`scripts/mqtt_common.py`](./scripts/mqtt_common.py)
- Persistent audit log: `.mam/delegate_job_logs/<job_id>/` (`meta.json`·`events.ndjson`·`status.json`)
Query it with `multi-agent-mux-delegate-job logs <id>` or `multi-agent-mux-delegate-job logs --list` (see "Audit Logs" in SKILL.md)
@@ -0,0 +1,94 @@
---
name: multi-agent-mux-delegate-job
description: "Delegate a unit of work to any autonomous agent (claude-code, hermes, agy, cline, codex, or a human) and observe it asynchronously over an MQTT event channel. Supported roles include orchestrator, worker, and reviewer."
version: 1.1.0
author: Multi-Agent System
license: MIT
platforms: [linux, macos, windows]
---
# multi-agent-mux-delegate-job — Async Job Delegation over MQTT
Delegate a unit of work to any autonomous agent, then **observe** it asynchronously instead of blocking. Every job gets a unique ID and a registry record. The worker agent publishes lifecycle events (`started`, `permission_required`, `progress`, `completed`, `error`) to a per-job MQTT topic, and the delegator/orchestrator subscribes to verify the final state.
This skill allows any agent (`claude-code`, `hermes`, `agy`, `cline`, etc.) to play any role: **Orchestrator/Delegator**, **Worker/Implementer**, or **Reviewer**.
---
## Roles in Multi-Agent Mux
- **Orchestrator (Delegator)**: Initiates the job, coordinates other agents, handles loops and reviews, and commits final changes.
- **Worker (Implementer)**: Receives the brief file or task prompt, performs the implementation, and emits started/completed/error events.
- **Reviewer**: Evaluates git diffs or artifacts produced by the worker, and responds with a `completed` event containing `"PASS"` or feedback.
---
## Core Commands (CLI)
The `multi-agent-mux-delegate-job` bash wrapper handles job registration, subscriber management, agent session targeting, and validation hooks:
```bash
# 1) Submit a new job to a targeted agent session (e.g. tmux session name 'demo')
multi-agent-mux-delegate-job submit \
--agent <claude-code|hermes-agent|agy-agent|cline-agent|human> \
--agent-session tmux:<session_name> \
--prompt "Task description or instructions here" \
--timeout 3600 --idle-timeout 120
# 2) Submit a job with a feedback loop (Worker-Reviewer Loop)
multi-agent-mux-delegate-job submit \
--agent <worker_agent> --agent-session tmux:<worker_session> \
--type loop --reviewer <reviewer_agent> --reviewer-session tmux:<reviewer_session> \
--prompt "Task description"
# 3) Check job status and audit logs
multi-agent-mux-delegate-job status --job <JOB_ID>
multi-agent-mux-delegate-job logs <JOB_ID> # Chronological log of events
multi-agent-mux-delegate-job list # Summary of all registered jobs
# 4) Verify job artifacts with a validation script
multi-agent-mux-delegate-job verify --job <JOB_ID> --validate ./validate.sh
```
---
## Task Delegation Types
Supported job types include:
- `direct` (default): Single agent execution (direct tasking).
- `loop` (Worker-Reviewer Loop): Alternates worker execution and reviewer evaluation until reviewer approves (`PASS`) or iterations run out.
- `discuss` (Research & Discussion): Collaboration between two agents to reach a consensus (e.g., agreeing on a design or plan).
For detailed state machine diagrams and configurations, see [DELEGATION_TYPES.md](./DELEGATION_TYPES.md).
---
## The Event Protocol Contract
Every agent participating in the delegation contract must follow the same lifecycle publishing protocol using `publish_event.py`:
1. **On Start**: Publish `started` event.
`python3 .agents/skills/multi-agent-mux-delegate-job/scripts/publish_event.py --job "$JOB_ID" --event started`
2. **On Tool/Permission Prompt**: Publish `permission_required` event.
`python3 ... --job "$JOB_ID" --event permission_required --detail "<tool>:<reason>"`
3. **On Progress Update (Optional)**: Publish `progress` event.
`python3 ... --job "$JOB_ID" --event progress --detail "<status_update>"`
4. **On Success**: Publish `completed` event.
`python3 ... --job "$JOB_ID" --event completed --detail "<summary>"` (Reviewer should include `"PASS"` in the detail to approve).
5. **On Failure/Feedback**: Publish `error` event.
`python3 ... --job "$JOB_ID" --event error --detail "<reason_or_feedback>"`
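An agent written in Python could build these invocations with a small helper. This is a sketch, assuming only the `--job`, `--event`, and `--detail` flags shown above; the helper name `publish_argv` and the example job id are hypothetical, and the real `publish_event.py` may accept further options.

```python
import sys
from typing import List, Optional

# Script path as given in step 1 above, relative to the workspace root.
PUBLISH_SCRIPT = ".agents/skills/multi-agent-mux-delegate-job/scripts/publish_event.py"
LIFECYCLE_EVENTS = ("started", "permission_required", "progress", "completed", "error")


def publish_argv(job_id: str, event: str, detail: Optional[str] = None) -> List[str]:
    """Build the argv for one lifecycle event; does not run anything."""
    if not job_id:
        raise ValueError("job_id must be non-empty")
    if event not in LIFECYCLE_EVENTS:
        raise ValueError(f"unknown lifecycle event: {event!r}")
    argv = [sys.executable, PUBLISH_SCRIPT, "--job", job_id, "--event", event]
    if detail is not None:
        argv += ["--detail", detail]
    return argv


# "job-123" is a made-up id; use the JOB_ID generated for the current run.
print(publish_argv("job-123", "completed", "PASS")[2:])
# prints: ['--job', 'job-123', '--event', 'completed', '--detail', 'PASS']
```

The resulting list can be passed to `subprocess.run(argv, check=False)`, which keeps the detail text out of shell quoting entirely.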
---
## Audit Logs
Job lifecycle execution events are persistently mirrored to an append-only log under `.mam/delegate_job_logs/<job_id>/` (containing `meta.json`, `events.ndjson`, and `status.json`). Use `multi-agent-mux-delegate-job logs <job_id>` to view the timeline.
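For scripted inspection, the event log can also be read directly as newline-delimited JSON. The reader below is a sketch: `read_events` is a hypothetical helper, and the `event` and `detail` keys used in the demo are assumptions about the record shape, which is defined in `job-protocol.md`.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def read_events(job_log_dir: Path) -> List[Dict]:
    """Return the parsed records of <job_log_dir>/events.ndjson in file order."""
    path = job_log_dir / "events.ndjson"
    if not path.exists():
        return []
    events = []
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                events.append(json.loads(line))
            except json.JSONDecodeError:
                continue  # skip a truncated line, e.g. one still being appended
    return events


# Demo against a throwaway directory; the keys below are assumed, not specified.
with tempfile.TemporaryDirectory() as tmp:
    log_dir = Path(tmp)
    (log_dir / "events.ndjson").write_text(
        '{"event": "started"}\n{"event": "completed", "detail": "PASS"}\n',
        encoding="utf-8",
    )
    timeline = [e["event"] for e in read_events(log_dir)]
    print(timeline)  # prints: ['started', 'completed']
```

For a real job, `job_log_dir` would be `Path(".mam/delegate_job_logs") / job_id`.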
---
## Best Practices and Pitfalls
- **Subscribe-Before-Publish**: The subscriber must be running before the agent starts publishing. The `submit` command handles this automatically by launching the subscriber in the background first.
- **Fresh job_id Propagation**: Make sure the worker agent receives the correct `JOB_ID` generated for the current run, rather than reusing stale IDs from previous sessions.
- **Brief delivery via file path**: For long or complex prompts, write the instructions to a file (e.g. `/tmp/task-brief.md`) and pass a short prompt pointing to the file path to prevent terminal buffer overflows.
- **Batch Grouping**: Group non-overlapping tasks into batches to parallelize execution across multiple agent sessions, reducing overhead.
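The file-based brief delivery described above might look like the following. This is a sketch: `write_brief` is a hypothetical helper, and the wording of the short prompt is an assumption rather than a format the skill prescribes.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional, Tuple


def write_brief(instructions: str, directory: Optional[str] = None) -> Tuple[Path, str]:
    """Write long instructions to a uniquely named file; return (path, short prompt)."""
    # mkstemp avoids collisions when several jobs are submitted in parallel.
    fd, name = tempfile.mkstemp(prefix="task-brief-", suffix=".md", dir=directory, text=True)
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        fh.write(instructions)
    path = Path(name)
    # Assumed prompt wording; only this short line travels through the terminal.
    prompt = f"Read the task brief at {path} and follow it."
    return path, prompt


brief_path, short_prompt = write_brief("# Task\n\n1. Refactor the parser.\n2. Run the tests.\n")
print(short_prompt)
```

The returned `short_prompt` is what would be passed to `--prompt`, so the pasted text stays small regardless of how long the brief is.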
@@ -1,6 +1,6 @@
# Job Event Protocol
The wire contract every tmux-agent-orchestrate-delegate-job agent (claude-code, codex, opencode,
The wire contract every multi-agent-mux-delegate-job agent (claude-code, codex, opencode,
human, …) speaks. One job → one MQTT topic → JSON event payloads. Stable across
the PoC (public broker) and production (own broker) stages; only transport
hardening changes, never the payload shape.
@@ -1,6 +1,6 @@
# MQTT Broker Setup — PoC → Production
The tmux-agent-orchestrate-delegate-job scripts read **all** broker settings from environment
The multi-agent-mux-delegate-job scripts read **all** broker settings from environment
variables (or a job record's `broker.*` block) through a single helper,
`broker_config_from_env()` in
[`./scripts/mqtt_common.py`](./scripts/mqtt_common.py). The design goal:
@@ -152,7 +152,7 @@ export MQTT_PASSWORD=… # subscriber side
mosquitto_sub -h "$MQTT_BROKER" -p 8883 --cafile "$MQTT_CA_CERTS" \
-u hermes -P "$MQTT_PASSWORD" -t 'python/mqtt/jobs/+/events' -v &
# 3) run the unchanged tmux-agent-orchestrate-delegate-job loop
# 3) run the unchanged multi-agent-mux-delegate-job loop
PY=.venv/bin/python
JID=$($PY scripts/registry.py register --prompt "broker cutover smoke")
$PY scripts/job_subscriber.py --job "$JID" --timeout 30 &
@@ -0,0 +1,440 @@
#!/usr/bin/env bash
# multi-agent-mux-delegate-job — user-facing orchestrator for the multi-agent-mux-delegate-job skill.
#
# Subcommands:
# submit register a job, start the subscriber FIRST, then run the agent,
# then (optionally) run a validation script.
# status show one job record.
# list list all jobs.
# verify run a user-supplied --validate script against a job's artifacts.
# wait block until all running/pending jobs reach a terminal state.
# logs show the persistent audit log for a job (or --list every job).
#
# This is a reference wrapper: it shells out to the python scripts that live
# next to it. Copy it into your project and customise as needed. With --dry-run
# it prints what it would send to the agent instead of touching tmux.
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# Load local .env if it exists in current dir or workspace root
if [[ -f .env ]]; then
set -a; source .env; set +a
elif [[ -f "$SCRIPT_DIR/../../.env" ]]; then
set -a; source "$SCRIPT_DIR/../../.env"; set +a
fi
# Pick an interpreter: prefer a project .venv, else python3.
pick_python() {
local py_bin
if [[ -n "${DELEGATE_JOB_PYTHON:-}" ]]; then
py_bin="$DELEGATE_JOB_PYTHON"
elif [[ -x "${WORKDIR:-.}/.venv/bin/python" ]]; then
py_bin="${WORKDIR}/.venv/bin/python"
elif [[ -x ".venv/bin/python" ]]; then
py_bin="$(pwd)/.venv/bin/python"
else
py_bin="python3"
fi
if ! "$py_bin" -c "import paho.mqtt" 2>/dev/null; then
echo "ERROR: paho-mqtt package is missing for $py_bin." >&2
echo " Please create a virtual environment and install it:" >&2
echo " python3 -m venv .venv && .venv/bin/pip install -r \"$SCRIPT_DIR/requirements.txt\"" >&2
exit 1
fi
echo "$py_bin"
}
REGISTRY_DIR_DEFAULT=".mam/jobs"
usage() {
cat <<'EOF'
multi-agent-mux-delegate-job <command> [options]
submit --agent <name> --prompt <text> [--workdir <dir>] [--agent-session <label>]
[--timeout <sec>] [--idle-timeout <sec>] [--validate <script>]
[--registry-dir <dir>] [--dry-run]
[--type <direct|loop|discuss>] [--reviewer <reviewer_agent>]
[--reviewer-session <reviewer_session>] [--max-iterations <count>]
# The skill is tmux-interactive only; --mode print was removed.
status --job <id> [--registry-dir <dir>]
list [--registry-dir <dir>]
verify --job <id> --validate <script> [--registry-dir <dir>]
wait [--job <id>] [--timeout <sec>] [--registry-dir <dir>]
logs <job_id> | --list # persistent audit log (.mam/delegate_job_logs/)
EOF
}
# ---- arg parsing helpers --------------------------------------------------
AGENT="claude-code"; PROMPT=""; WORKDIR="$(pwd)"; AGENT_SESSION="tmux:claude"
TIMEOUT=3600; IDLE_TIMEOUT=120; VALIDATE=""; DRY_RUN=0
JOB_ID=""; REGISTRY_DIR="$REGISTRY_DIR_DEFAULT"
TYPE="direct"; REVIEWER="hermes"; REVIEWER_SESSION="tmux:hermes"; MAX_ITERATIONS=5
parse_opts() {
while [[ $# -gt 0 ]]; do
case "$1" in
--agent) AGENT="$2"; shift 2;;
--prompt) PROMPT="$2"; shift 2;;
--workdir) WORKDIR="$2"; shift 2;;
--agent-session) AGENT_SESSION="$2"; shift 2;;
--timeout) TIMEOUT="$2"; shift 2;;
--idle-timeout) IDLE_TIMEOUT="$2"; shift 2;;
--validate) VALIDATE="$2"; shift 2;;
--job) JOB_ID="$2"; shift 2;;
--registry-dir) REGISTRY_DIR="$2"; shift 2;;
--dry-run) DRY_RUN=1; shift;;
--type) TYPE="$2"; shift 2;;
--reviewer) REVIEWER="$2"; shift 2;;
--reviewer-session) REVIEWER_SESSION="$2"; shift 2;;
--max-iterations) MAX_ITERATIONS="$2"; shift 2;;
*) echo "unknown option: $1" >&2; usage; exit 1;;
esac
done
}
cmd_submit() {
parse_opts "$@"
[[ -n "$PROMPT" ]] || { echo "submit requires --prompt" >&2; exit 1; }
PY="$(pick_python)"
cd "$WORKDIR"
mkdir -p "$REGISTRY_DIR"
# 1) register job (prints the new job id)
JOB_ID="$("$PY" "$SCRIPT_DIR/scripts/registry.py" --registry-dir "$REGISTRY_DIR" register \
--prompt "$PROMPT" --agent "$AGENT" --agent-session "$AGENT_SESSION" \
--timeout "$TIMEOUT" --idle-timeout "$IDLE_TIMEOUT" \
--job-type "$TYPE" --reviewer "$REVIEWER" --reviewer-session "$REVIEWER_SESSION" \
--max-iterations "$MAX_ITERATIONS")"
echo "registered job: $JOB_ID"
if [[ "$TYPE" == "direct" ]]; then
# 2) START THE SUBSCRIBER FIRST (ordering dependency — MQTT does not queue
# non-retained messages for absent subscribers).
local logf="$REGISTRY_DIR/$JOB_ID.subscriber.out"
"$PY" "$SCRIPT_DIR/scripts/job_subscriber.py" --registry-dir "$REGISTRY_DIR" \
--job "$JOB_ID" --timeout "$TIMEOUT" --idle-timeout "$IDLE_TIMEOUT" \
>"$logf" 2>&1 &
local sub_pid=$!
echo "subscriber pid: $sub_pid (log: $logf)"
sleep 1 # give the subscriber time to CONNACK + SUBSCRIBE before the agent runs
# 3) run the agent (or print the command for dry-run / missing binary)
local pub="$PY $SCRIPT_DIR/scripts/publish_event.py --registry-dir $REGISTRY_DIR --job $JOB_ID"
# NOTE: the agent MUST use --job "$JOB_ID" (the one we just minted). Hard-coding
# an id from an earlier session is the #1 reason a delegated job sits idle and
# times out (see SKILL.md "Wrong job_id propagated to the agent"). We make the
# freshness explicit in the instruction header.
local instructions="Your job_id is \"$JOB_ID\" (the one just registered for THIS delegation — read it from the registry record, do NOT reuse any job_id you saw in earlier runs).
On start run: $pub --event started.
On permission/tool prompt run: $pub --event permission_required --detail '<tool>:<what>'.
On progress (optional): $pub --event progress --detail '<short status>'.
On success run: $pub --event completed --detail '<one-line summary>'.
On failure run: $pub --event error --detail '<one-line reason>'.
The subscriber for this job_id is already running; your completed/error event ends the job. Exit codes: 0 completed, 1 error, 2 publish failure.
Task: $PROMPT"
run_agent "$JOB_ID" "$instructions"
# 4) optional validation hook
if [[ -n "$VALIDATE" ]]; then
echo "running validation: $VALIDATE"
if JOB_ID="$JOB_ID" REGISTRY_DIR="$REGISTRY_DIR" bash "$VALIDATE"; then
echo "validation: PASS"
else
local rc=$?
echo "validation: FAIL (exit $rc)"
fi
fi
if [[ "$DRY_RUN" == "1" ]]; then
# In dry-run run_agent only prints the instruction, so no completed/error
# event will ever arrive. Stop the background subscriber instead of waiting
# for it to time out, and skip the subscriber log dump.
kill "$sub_pid" 2>/dev/null || true
local logs_root_dry="${DELEGATE_JOB_LOGS_DIR:-$WORKDIR/.mam/delegate_job_logs}"
echo "$logs_root_dry/$JOB_ID"
return 0
fi
wait "$sub_pid" || true
echo "subscriber output:"; cat "$logf" || true
# Last stdout line: the persistent audit-log dir for this job (see SKILL.md
# "Audit Logs"). Callers can scrape `tail -n1` to find it.
local logs_root="${DELEGATE_JOB_LOGS_DIR:-$WORKDIR/.mam/delegate_job_logs}"
echo "$logs_root/$JOB_ID"
else
# Implement loop/discuss orchestrator
local iteration=1
local current_prompt="$PROMPT"
local current_session="$AGENT_SESSION"
local current_role="worker"
if [[ "$DRY_RUN" == "1" ]]; then
echo "[dry-run] orchestrator loop would start for job: $JOB_ID type: $TYPE"
echo "worker session: $AGENT_SESSION, reviewer session: $REVIEWER_SESSION"
local logs_root_dry="${DELEGATE_JOB_LOGS_DIR:-$WORKDIR/.mam/delegate_job_logs}"
echo "$logs_root_dry/$JOB_ID"
return 0
fi
while true; do
echo "=================================================="
echo "Iteration $iteration - Role: $current_role"
echo "Session: $current_session"
echo "=================================================="
# Update job details in registry
"$PY" "$SCRIPT_DIR/scripts/registry.py" --registry-dir "$REGISTRY_DIR" update \
--job "$JOB_ID" \
--agent-session "$current_session" \
--prompt "$current_prompt" \
--iteration "$iteration" \
--status "pending"
# Start subscriber
local logf="$REGISTRY_DIR/${JOB_ID}.iter_${iteration}_${current_role}.subscriber.out"
"$PY" "$SCRIPT_DIR/scripts/job_subscriber.py" --registry-dir "$REGISTRY_DIR" \
--job "$JOB_ID" --timeout "$TIMEOUT" --idle-timeout "$IDLE_TIMEOUT" \
>"$logf" 2>&1 &
local sub_pid=$!
echo "subscriber pid: $sub_pid (log: $logf)"
sleep 1
# Format instruction block
local pub="$PY $SCRIPT_DIR/scripts/publish_event.py --registry-dir $REGISTRY_DIR --job $JOB_ID"
local instructions="Your job_id is \"$JOB_ID\" (the one just registered for THIS delegation — read it from the registry record, do NOT reuse any job_id you saw in earlier runs).
On start run: $pub --event started.
On permission/tool prompt run: $pub --event permission_required --detail '<tool>:<what>'.
On progress (optional): $pub --event progress --detail '<short status>'.
On success run: $pub --event completed --detail '<one-line summary>'.
On failure run: $pub --event error --detail '<one-line reason>'.
The subscriber for this job_id is already running; your completed/error event ends the job. Exit codes: 0 completed, 1 error, 2 publish failure.
Task: $current_prompt"
# Trigger agent
run_agent "$JOB_ID" "$instructions" "$current_session"
# Wait for subscriber
local sub_rc=0
wait "$sub_pid" || sub_rc=$?
echo "subscriber output:"; cat "$logf" || true
# Check job status based on subscriber exit code
local job_status="running"
if [[ $sub_rc -eq 0 ]]; then
job_status="completed"
elif [[ $sub_rc -eq 1 ]]; then
job_status="error"
else
job_status="timeout"
fi
echo "Job role $current_role finished with status: $job_status"
# Retrieve feedback from the last event
local feedback
feedback="$("$PY" "$SCRIPT_DIR/scripts/registry.py" --registry-dir "$REGISTRY_DIR" get-feedback --job "$JOB_ID")"
echo "Feedback/Detail: $feedback"
if [[ "$current_role" == "worker" ]]; then
if [[ "$job_status" != "completed" ]]; then
echo "Worker did not complete successfully (status: $job_status). Terminating workflow."
break
fi
# Worker completed successfully, now switch to reviewer
current_role="reviewer"
current_session="$REVIEWER_SESSION"
# Build reviewer prompt based on type
if [[ "$TYPE" == "loop" ]]; then
current_prompt="Review the changes/artifacts generated for job $JOB_ID. Check if they meet the requirements. If correct, publish completed event with 'PASS'. If there are issues, publish error event with detailed feedback/nits. CRITICAL: When raising issues or giving a review, you MUST include the exact reason for the issue and a clear direction for improvement."
elif [[ "$TYPE" == "discuss" ]]; then
current_prompt="Read draft/documents generated for job $JOB_ID. Review the feasibility and content. Write your feedback/objections. If you agree with the plan, reply with 'AGREE'."
fi
else
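if [[ "$job_status" == "timeout" ]]; then
echo "Reviewer timed out (status: $job_status). Terminating workflow."
break
fi
# Reviewer finished. A reviewer `error` event carries rejection feedback
# (see the reviewer prompt), so it must loop back to the worker instead of
# ending the workflow; only a `completed` event can approve the work.
local success=0
local feedback_lc
# Lowercase via tr: ${var,,} needs bash 4+, which stock macOS does not ship.
feedback_lc="$(printf '%s' "$feedback" | tr '[:upper:]' '[:lower:]')"
if [[ "$job_status" == "completed" ]]; then
if [[ "$TYPE" == "loop" && "$feedback_lc" == *"pass"* ]]; then
success=1
elif [[ "$TYPE" == "discuss" && "$feedback_lc" == *"agree"* ]]; then
success=1
fi
fi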
if [[ "$success" == "1" ]]; then
echo "Reviewer approved the work. Finalizing job as completed."
"$PY" "$SCRIPT_DIR/scripts/registry.py" --registry-dir "$REGISTRY_DIR" status --job "$JOB_ID" --set "completed"
break
else
# Reviewer rejected/provided feedback. Increment & check max iterations
if [[ $iteration -ge $MAX_ITERATIONS ]]; then
echo "Max iterations ($MAX_ITERATIONS) reached without approval. Terminating workflow."
"$PY" "$SCRIPT_DIR/scripts/registry.py" --registry-dir "$REGISTRY_DIR" status --job "$JOB_ID" --set "error"
break
fi
iteration=$((iteration + 1))
current_role="worker"
current_session="$AGENT_SESSION"
current_prompt="The reviewer provided the following feedback for job $JOB_ID: $feedback. Please modify the code/artifacts to address these comments. CRITICAL: As the Developer Team Leader, you must thoroughly review the suggested modifications, verify their validity, adopt/implement them if valid, and if you judge any recommendation to be invalid, do NOT implement it but instead explain your reasons clearly in your response and send it back to the reviewer."
fi
fi
done
# 4) optional validation hook
if [[ -n "$VALIDATE" ]]; then
echo "running validation: $VALIDATE"
if JOB_ID="$JOB_ID" REGISTRY_DIR="$REGISTRY_DIR" bash "$VALIDATE"; then
echo "validation: PASS"
else
local rc=$?
echo "validation: FAIL (exit $rc)"
fi
fi
# Last stdout line: the persistent audit-log dir
local logs_root="${DELEGATE_JOB_LOGS_DIR:-$WORKDIR/.mam/delegate_job_logs}"
echo "$logs_root/$JOB_ID"
fi
}
run_agent() {
local job_id="$1"; local instructions="$2"; local target_session="${3:-$AGENT_SESSION}"
# The skill is INTERACTIVE-ONLY. We never invoke `claude -p` or any other
# one-shot print mode, because:
# - claude -p exits the moment stdin is drained, so there's nothing to
# `tmux attach` to afterwards.
# - fire-and-forget via wrapper defeats the whole point of the audit log
# (you can't tell what happened if the agent crashes mid-turn).
# - the job registry already gives us an authoritative completion signal,
# so we don't need a wrapper-side exit code to know "done".
# The agent must already be running in its tmux session. We paste the first
# prompt into that session through a named tmux buffer and press Enter; the
# user can `tmux attach -t <session>` to watch or type follow-up prompts.
if [ "$AGENT" = "human" ]; then
echo "[human agent] complete the task, then run publish_event.py --event completed"
return
fi
local sess="${target_session#tmux:}"
if [[ "$DRY_RUN" == "1" ]]; then
echo "[dry-run] would delegate task to running agent '$AGENT' in tmux session '$sess' with instructions:"
echo "----"; echo "$instructions"; echo "----"
return
fi
if ! command -v tmux >/dev/null 2>&1; then
echo "ERROR: this skill requires tmux (interactive agent sessions)." >&2
echo " Install with: brew install tmux (or your package manager)" >&2
return 1
fi
local _tmux="tmux"
if [ -n "${TMUX_SERVER_NAME:-}" ]; then
_tmux="tmux -L $TMUX_SERVER_NAME"
fi
if ! $_tmux has-session -t "$sess" 2>/dev/null; then
echo "ERROR: agent session '$sess' does not exist. Start the agent session before delegating work." >&2
echo " Tip: create the agent first with 'multi-agent-mux-resume' or 'multi-agent-mux-create'." >&2
return 1
fi
# Before launching the agent, set up error trap to publish error event
if [ -n "${job_id:-}" ] && [ -n "${PY:-}" ]; then
local pub_script="$SCRIPT_DIR/scripts/publish_event.py"
trap 'rc=$?; if [ $rc -ne 0 ]; then "$PY" "$pub_script" --registry-dir "$REGISTRY_DIR" --job "$job_id" --event error --detail "agent bootstrap failed (exit $rc)"; fi' EXIT
fi
echo "Delegating the task to live agent session '$sess'..."
$_tmux set-buffer -b "job_buf_$job_id" "$instructions"
$_tmux paste-buffer -b "job_buf_$job_id" -t "$sess"
sleep 0.5
$_tmux send-keys -t "$sess" C-m
$_tmux delete-buffer -b "job_buf_$job_id"
echo "Task sent to session '$sess'. (To attach: $_tmux attach -t $sess)"
trap - EXIT
}
cmd_status() {
parse_opts "$@"
[[ -n "$JOB_ID" ]] || { echo "status requires --job" >&2; exit 1; }
PY="$(pick_python)"
"$PY" "$SCRIPT_DIR/scripts/registry.py" --registry-dir "$REGISTRY_DIR" get --job "$JOB_ID"
}
cmd_list() {
parse_opts "$@"
PY="$(pick_python)"
"$PY" "$SCRIPT_DIR/scripts/registry.py" --registry-dir "$REGISTRY_DIR" list
}
cmd_verify() {
parse_opts "$@"
[[ -n "$JOB_ID" ]] || { echo "verify requires --job" >&2; exit 1; }
[[ -n "$VALIDATE" ]] || { echo "verify requires --validate <script>" >&2; exit 1; }
echo "verifying job $JOB_ID with $VALIDATE"
if JOB_ID="$JOB_ID" REGISTRY_DIR="$REGISTRY_DIR" bash "$VALIDATE"; then
echo "verify: PASS (exit 0)"; exit 0
else
rc=$?; echo "verify: FAIL (exit $rc)"; exit "$rc"
fi
}
cmd_logs() {
# logs <job_id> | logs --list: delegates to registry.py's logs CLI, which
# reads the persistent audit log under $DELEGATE_JOB_LOGS_DIR (or
# <cwd>/.mam/delegate_job_logs). Run from your project dir so the default resolves.
PY="$(pick_python)"
if [[ "${1:-}" == "--list" ]]; then
"$PY" "$SCRIPT_DIR/scripts/registry.py" logs --list
else
local jid="${1:-}"
[[ -n "$jid" ]] || { echo "logs requires <job_id> or --list" >&2; exit 1; }
"$PY" "$SCRIPT_DIR/scripts/registry.py" logs "$jid"
fi
}
cmd_wait() {
parse_opts "$@"
PY="$(pick_python)"
if [[ -n "$JOB_ID" ]]; then
"$PY" "$SCRIPT_DIR/scripts/job_subscriber.py" --registry-dir "$REGISTRY_DIR" \
--job "$JOB_ID" --timeout "$TIMEOUT"
else
"$PY" "$SCRIPT_DIR/scripts/job_subscriber.py" --registry-dir "$REGISTRY_DIR" \
--wait-any --timeout "$TIMEOUT"
fi
}
main() {
local sub="${1:-}"; shift || true
case "$sub" in
submit) cmd_submit "$@";;
status) cmd_status "$@";;
list) cmd_list "$@";;
verify) cmd_verify "$@";;
wait) cmd_wait "$@";;
logs) cmd_logs "$@";;
""|-h|--help|help) usage;;
*) echo "unknown command: $sub" >&2; usage; exit 1;;
esac
}
main "$@"
@@ -15,13 +15,13 @@ Reference implementation: [`./scripts/registry.py`](./scripts/registry.py)
## 1. Directory layout
```
.hermes/jobs/
.mam/jobs/
<job_id>.json # job metadata record (schema below)
<job_id>.events.log # append-only JSON-lines event log (debug, optional)
.lock # shared advisory lock (fcntl) for the whole registry
```
`registry_dir` defaults to `.hermes/jobs` and is overridable everywhere via
`registry_dir` defaults to `.mam/jobs` and is overridable everywhere via
`--registry-dir`.
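
Given this layout, a quick read-only summary of the registry can be sketched as below. The helper name `count_jobs_by_status` is hypothetical, and the sketch assumes each `<job_id>.json` record carries a `status` field. It does not take the advisory `.lock`, so a record that is being rewritten at that moment is skipped rather than waited for.

```python
import json
from collections import Counter
from pathlib import Path


def count_jobs_by_status(registry_dir: str = ".mam/jobs") -> Counter:
    """Count job records per status by scanning <registry_dir>/*.json."""
    counts = Counter()
    for path in sorted(Path(registry_dir).glob("*.json")):
        try:
            record = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            # Unreadable or half-written record: skip it, as this is read-only.
            continue
        counts[record.get("status", "unknown")] += 1
    return counts
```

For anything that mutates records, the registry CLI remains the safer route because it goes through the shared lock.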
---
@@ -143,13 +143,13 @@ that session.
## 7. Persistent audit log
Separate from the registry, every job is also mirrored to a durable append-only
audit log at `.hermes/delegate_job_logs/<job_id>/` (override with
`DELEGATE_JOB_LOGS_DIR`, default `<cwd>/.hermes/delegate_job_logs`). The registry
audit log at `.mam/delegate_job_logs/<job_id>/` (override with
`DELEGATE_JOB_LOGS_DIR`, default `<cwd>/.mam/delegate_job_logs`). The registry
is **live state** mutated in place; the audit log is **history** that survives
even after the registry dir is cleaned up. It is git-ignored.
```
.hermes/delegate_job_logs/<job_id>/
.mam/delegate_job_logs/<job_id>/
meta.json # registration snapshot (the full job record at register time)
events.ndjson # append-only, one JSON event per line, time-ordered
status.json # current status only (fast point-query)
@@ -59,11 +59,11 @@ def _format_line(topic: str, payload: Dict[str, Any]) -> str:
class _Watcher:
"""Holds the shared queue + the set of job_ids we accept events for."""
def __init__(self, expected_job_ids: Set[str], expected_tokens: Dict[str, Optional[str]]):
def __init__(self, expected_job_ids: Set[str], expected_tokens: Dict[str, Optional[str]], expected_seqs: Dict[str, int]):
self.events: "queue.Queue[Tuple[str, Dict[str, Any]]]" = queue.Queue()
self.expected = set(expected_job_ids)
self.tokens = expected_tokens # job_id -> expected auth_token (or None)
self.last_seq: Dict[str, int] = {jid: 0 for jid in expected_job_ids}
self.last_seq = dict(expected_seqs)
def on_message(self, _client, _userdata, msg) -> None:
# --- defensive parsing -------------------------------------------
@@ -153,7 +153,8 @@ def main(argv=None) -> int:
expected_ids: Set[str] = {j["job_id"] for j in jobs}
tokens = {j["job_id"]: j.get("auth_token") for j in jobs}
watcher = _Watcher(expected_ids, tokens)
seqs = {j["job_id"]: int(j.get("last_seq", 0)) for j in jobs}
watcher = _Watcher(expected_ids, tokens, seqs)
# Resolve timeouts from CLI, falling back to the (first) job's settings.
base_job = jobs[0]
@@ -1,4 +1,4 @@
"""Shared MQTT + registry helpers for the tmux-agent-orchestrate-delegate-job skill.
"""Shared MQTT + registry helpers for the multi-agent-mux-delegate-job skill.
Single entry point for:
- broker configuration (env -> dataclass),
@@ -71,11 +71,11 @@ _load_dotenv()
# Constants
# --------------------------------------------------------------------------
SCHEMA_VERSION = 1
DEFAULT_REGISTRY_DIR = ".hermes/jobs"
DEFAULT_REGISTRY_DIR = ".mam/jobs"
DEFAULT_TOPIC_ROOT = "python/mqtt/jobs"
LOCK_FILENAME = ".lock"
# Persistent audit-log layout: .hermes/delegate_job_logs/<job_id>/{meta,events,status}.
# Persistent audit-log layout: .mam/delegate_job_logs/<job_id>/{meta,events,status}.
# This is a *separate* artifact from the registry: the registry is the live job
# record (mutated in place), the audit log is an append-only history that
# survives even if the registry dir is cleaned up.
@@ -86,15 +86,15 @@ STATUS_FILENAME = "status.json"
def _default_logs_dir() -> str:
"""Audit-log root. Overridable with ``DELEGATE_JOB_LOGS_DIR``; otherwise
``<cwd>/.hermes/delegate_job_logs`` we keep audit logs next to the
live registry (``.hermes/jobs/``) so the two runtime artifacts sit
``<cwd>/.mam/delegate_job_logs``. We keep audit logs next to the
live registry (``.mam/jobs/``) so the two runtime artifacts sit
under the same parent dir and follow the same ``.gitignore`` rule.
The cwd of whichever process emits events (the bash wrapper and
scripts) is used as the anchor."""
env = os.environ.get("DELEGATE_JOB_LOGS_DIR")
if env and env.strip():
return env
return os.path.join(os.getcwd(), ".hermes", "delegate_job_logs")
return os.path.join(os.getcwd(), ".mam", "delegate_job_logs")
LOGS_DIR = _default_logs_dir()
@@ -328,8 +328,8 @@ def update_job_status(job_id: str, registry_dir: str = DEFAULT_REGISTRY_DIR, **f
This is the single chokepoint for status writes (both ``registry.update_status``
and ``publish_event.py``'s status sync route through here), so it also mirrors
any ``status`` change into the persistent audit log best-effort, after the
registry lock is released so a slow/failed log write never blocks the record."""
any ``status`` change into the persistent audit log. We perform the log mirror
under the lock to guarantee sequential consistency in audit history."""
with registry_lock(registry_dir):
record = load_job(job_id, registry_dir)
old_status = record.get("status")
@@ -376,7 +376,7 @@ def _utcnow_precise() -> str:
# --------------------------------------------------------------------------
# Persistent audit log (.hermes/delegate_job_logs/<job_id>/...)
# Persistent audit log (.mam/delegate_job_logs/<job_id>/...)
#
# Every function here is idempotent, concurrency-safe, and *best-effort*: a
# logging failure is swallowed with a logger.warning and never propagated, so it
@@ -410,6 +410,21 @@ def _file_lock(fh):
fcntl.flock(fh.fileno(), fcntl.LOCK_UN)
def _redact_dict(d: Any) -> Any:
"""Recursively mask sensitive values (passwords, secrets, tokens) inside logs."""
if isinstance(d, dict):
redacted = {}
for k, v in d.items():
if any(s in k.lower() for s in ("password", "token", "secret", "auth_token", "key")):
redacted[k] = "[REDACTED]"
else:
redacted[k] = _redact_dict(v)
return redacted
elif isinstance(d, list):
return [_redact_dict(item) for item in d]
return d
def append_event(job_id: str, event_dict: Dict[str, Any], logs_dir: Optional[str] = None) -> None:
"""Append one event as a JSON line to ``<logs>/<job_id>/events.ndjson``.
@@ -418,7 +433,7 @@ def append_event(job_id: str, event_dict: Dict[str, Any], logs_dir: Optional[str
try:
path = job_log_path(job_id, EVENTS_FILENAME, logs_dir)
path.parent.mkdir(parents=True, exist_ok=True)
record = dict(event_dict)
record = _redact_dict(dict(event_dict))
record.setdefault("logged_at", _utcnow_precise())
line = json.dumps(record, ensure_ascii=False) + "\n"
with open(path, "a", encoding="utf-8") as fh:
@@ -453,8 +468,9 @@ def init_job_log(job_id: str, meta: Dict[str, Any], logs_dir: Optional[str] = No
try:
d = job_log_dir(job_id, logs_dir)
d.mkdir(parents=True, exist_ok=True)
meta_redacted = _redact_dict(meta)
with open(d / META_FILENAME, "w", encoding="utf-8") as fh:
json.dump(meta, fh, ensure_ascii=False, indent=2)
json.dump(meta_redacted, fh, ensure_ascii=False, indent=2)
fh.write("\n")
status = meta.get("status", "pending")
update_logged_status(
@@ -1,4 +1,4 @@
"""Job registry for the tmux-agent-orchestrate-delegate-job skill.
"""Job registry for the multi-agent-mux-delegate-job skill.
A job record is the single source of truth for one delegated unit of work:
its id, prompt, owning agent session, broker connection, timeouts, and status.
@@ -9,7 +9,7 @@ Concurrency is handled via the fcntl lock in :mod:`mqtt_common` (PoC). For
multi-host delegation, migrate to SQLite WAL; see references/registry.md.
Importable as a library and runnable as a CLI (``register``/``list``/``get``/
``status``/``pick``) so the ``tmux-agent-orchestrate-delegate-job`` bash wrapper can shell out.
``status``/``pick``) so the ``multi-agent-mux-delegate-job`` bash wrapper can shell out.
"""
from __future__ import annotations
@@ -59,6 +59,10 @@ def register_job(
expected_artifacts: Optional[List[str]] = None,
bits: int = 32,
auth_token: Optional[str] = None,
job_type: str = "direct",
reviewer: Optional[str] = None,
reviewer_session: Optional[str] = None,
max_iterations: int = 5,
) -> str:
"""Create a new ``pending`` job record and return its id.
@@ -90,6 +94,11 @@ def register_job(
"expected_artifacts": expected_artifacts or [],
"last_seq": 0,
"auth_token": auth_token,
"job_type": job_type,
"reviewer": reviewer,
"reviewer_session": reviewer_session,
"max_iterations": int(max_iterations),
"iteration": 1,
}
with registry_lock(registry_dir):
if mqtt_common._job_path(job_id, registry_dir).exists():
@@ -164,7 +173,7 @@ def append_event(job_id: str, registry_dir: str, payload: Dict[str, Any]) -> Non
# convenience re-export so callers can `from registry import load_job`
__all__ = [
"register_job", "pick_pending", "update_status", "load_job",
"list_jobs", "append_event", "generate_job_id",
"list_jobs", "append_event", "generate_job_id", "get_feedback",
]
@@ -180,11 +189,49 @@ def _iter_records(registry_dir: str):
logger.warning("skipping unreadable record %s: %s", path, exc)
def get_feedback(job_id: str, registry_dir: str = DEFAULT_REGISTRY_DIR) -> str:
"""Read the job's audit log or events log and return the detail of the last completed/error event."""
# 1) Try the unified audit log first (ndjson) since it's written synchronously by the subscriber
try:
import mqtt_common
logs_dir = mqtt_common.LOGS_DIR
events = list(mqtt_common.iter_logged_events(job_id, logs_dir))
for e in reversed(events):
if e.get("source_event") in ("completed", "error"):
return e.get("detail", "")
if e.get("event") in ("completed", "error"):
return e.get("detail", "")
except Exception:
pass
# 2) Fallback to local .events.log
log_path = Path(registry_dir) / f"{job_id}.events.log"
if log_path.exists():
feedback = ""
try:
with open(log_path, "r", encoding="utf-8") as fh:
for line in fh:
if not line.strip():
continue
try:
payload = json.loads(line)
if payload.get("event") in ("completed", "error"):
feedback = payload.get("detail", "")
except json.JSONDecodeError:
continue
except OSError:
pass
if feedback:
return feedback
return ""
# --------------------------------------------------------------------------
# CLI (so the bash wrapper can shell out without inline python)
# --------------------------------------------------------------------------
def _build_parser() -> argparse.ArgumentParser:
parser = argparse.ArgumentParser(description="tmux-agent-orchestrate-delegate-job registry CLI")
parser = argparse.ArgumentParser(description="multi-agent-mux-delegate-job registry CLI")
parser.add_argument("--registry-dir", default=DEFAULT_REGISTRY_DIR)
sub = parser.add_subparsers(dest="command", required=True)
@@ -197,6 +244,10 @@ def _build_parser() -> argparse.ArgumentParser:
p_reg.add_argument("--bits", type=int, default=32, help="32 (PoC) or 128 (prod)")
p_reg.add_argument("--artifact", action="append", default=[], dest="artifacts")
p_reg.add_argument("--auth-token", default=None, help="HMAC auth token for the job (auto-generated if secure broker is detected)")
p_reg.add_argument("--job-type", default="direct", choices=["direct", "loop", "discuss"])
p_reg.add_argument("--reviewer", default=None)
p_reg.add_argument("--reviewer-session", default=None)
p_reg.add_argument("--max-iterations", type=int, default=5)
p_list = sub.add_parser("list", help="list jobs (optionally by status)")
p_list.add_argument("--status", default=None)
@@ -209,6 +260,16 @@ def _build_parser() -> argparse.ArgumentParser:
p_status.add_argument("--job", required=True)
p_status.add_argument("--set", required=True, dest="status")
p_update = sub.add_parser("update", help="update a job record")
p_update.add_argument("--job", required=True)
p_update.add_argument("--status", default=None)
p_update.add_argument("--agent-session", default=None)
p_update.add_argument("--prompt", default=None)
p_update.add_argument("--iteration", type=int, default=None)
p_feedback = sub.add_parser("get-feedback", help="get the last feedback detail (completed/error) for a job")
p_feedback.add_argument("--job", required=True)
p_pick = sub.add_parser("pick", help="claim a pending job for a session; prints id")
p_pick.add_argument("--agent-session", default="tmux:claude")
@@ -222,7 +283,7 @@ def _build_parser() -> argparse.ArgumentParser:
help="summarise every job under the logs dir instead")
p_logs.add_argument("--logs-dir", default=None,
help="override the audit-log root (default: $DELEGATE_JOB_LOGS_DIR "
"or <cwd>/.hermes/delegate_job_logs)")
"or <cwd>/.mam/delegate_job_logs)")
p_logs.add_argument("--tail", type=int, default=0,
help="show only the last N events (0 = all)")
p_logs.add_argument("--json", action="store_true",
@@ -247,6 +308,10 @@ def main(argv: Optional[List[str]] = None) -> int:
expected_artifacts=args.artifacts,
bits=args.bits,
auth_token=args.auth_token,
job_type=args.job_type,
reviewer=args.reviewer,
reviewer_session=args.reviewer_session,
max_iterations=args.max_iterations,
)
print(job_id)
return 0
@@ -279,6 +344,27 @@ def main(argv: Optional[List[str]] = None) -> int:
return 1
return 0
if args.command == "update":
fields = {}
if args.status is not None:
fields["status"] = args.status
if args.agent_session is not None:
fields["agent_session"] = args.agent_session
if args.prompt is not None:
fields["prompt"] = args.prompt
if args.iteration is not None:
fields["iteration"] = args.iteration
try:
mqtt_common.update_job_status(args.job, rd, **fields)
except FileNotFoundError as exc:
print(str(exc), file=sys.stderr)
return 1
return 0
if args.command == "get-feedback":
print(get_feedback(args.job, rd))
return 0
if args.command == "pick":
job_id = pick_pending(args.agent_session, rd)
if job_id is None:
@@ -1,6 +1,6 @@
---
name: tmux-agent-orchestrate-monitor
description: "Run a long-lived Kanban worker that polls .hermes/agent-sessions.yaml against the actual tmux/agent runtime state and reconciles them. Use when you want live visibility into which agent sessions are running, which are dead, which have stale YAML entries, and which have new session ids that haven't been recorded yet. Designed to be dispatched as a Kanban goal_mode task (--goal) so it keeps running until the user stops it."
name: multi-agent-mux-monitor
description: "Run a long-lived Kanban worker that polls .mam/agent-sessions.yaml against the actual tmux/agent runtime state and reconciles them. Use when you want live visibility into which agent sessions are running, which are dead, which have stale YAML entries, and which have new session ids that haven't been recorded yet. Designed to be dispatched as a Kanban goal_mode task (--goal) so it keeps running until the user stops it."
version: 1.0.0
author: godopu
license: MIT
@@ -9,14 +9,14 @@ environments: [kanban, terminal, tmux]
metadata:
hermes:
tags: [agent, tmux, claude, antigravity, agy, monitor, kanban, observation, reconciliation]
related_skills: [tmux-agent-orchestrate-create, tmux-agent-orchestrate-resume, tmux-agent-orchestrate-stop, kanban-orchestrator]
prereq_skills: [kanban-worker, tmux-agent-orchestrate-create]
related_skills: [multi-agent-mux-create, multi-agent-mux-resume, multi-agent-mux-stop, kanban-orchestrator]
prereq_skills: [kanban-worker, multi-agent-mux-create]
---
# Agent Sessions Monitor — Live Reconciliation via Kanban Worker
> **Companion skills**: `tmux-agent-orchestrate-create` / `tmux-agent-orchestrate-resume` / `tmux-agent-orchestrate-stop` (mutators); this skill is the **observer**.
> **Single source of truth**: `./.hermes/agent-sessions.yaml`.
> **Companion skills**: `multi-agent-mux-create` / `multi-agent-mux-resume` / `multi-agent-mux-stop` (mutators); this skill is the **observer**.
> **Single source of truth**: `./.mam/agent-sessions.yaml`.
## What this skill does
@@ -59,16 +59,16 @@ hermes kanban create \
--title "agent-sessions monitor (live reconcile)" \
--assignee default \
--workspace worktree \
--branch wt/tmux-agent-orchestrate-monitor \
--branch wt/multi-agent-mux-monitor \
--goal \
--goal-max-turns 100 \
--max-runtime 8h \
--max-retries 1 \
--skill tmux-agent-orchestrate-monitor \
--skill multi-agent-mux-monitor \
--body "$(cat <<'EOF'
You are the agent-sessions monitor. Every 30 seconds, do:
1. Read .hermes/agent-sessions.yaml
1. Read .mam/agent-sessions.yaml
2. Run `tmux ls` and `tmux list-panes -F 'session=#{session_name} pid=#{pane_pid} cmd=#{pane_current_command} cwd=#{pane_current_path}'`
3. For each session in the YAML, check the corresponding tmux state
4. For each tmux session matching `*-creator-claude` or `*-creator-agy` that's not in the YAML, register it
@@ -79,7 +79,7 @@ If the user comments `stop` or `stop monitoring` on this card, call `kanban_bloc
If you find that a Claude session's `claude_session_id_own` is null but there's a new *.jsonl in the project dir, read the sessionId from the first line and update the YAML.
Use the helper script at skills/tmux-agent-orchestrate-monitor/scripts/reconcile.sh for the YAML updates — it handles all the merge logic and writes a structured comment to this card.
Use the helper script at .agents/skills/multi-agent-mux-monitor/scripts/reconcile.sh for the YAML updates — it handles all the merge logic and writes a structured comment to this card.
EOF
)"
```
@@ -94,17 +94,17 @@ The worker calls this script every 30s. It:
```bash
# Reconcile + auto-update YAML (atomic, flock-guarded). Emits JSON drift to stdout.
bash skills/tmux-agent-orchestrate-monitor/scripts/reconcile.sh --once --emit-diff
bash .agents/skills/multi-agent-mux-monitor/scripts/reconcile.sh --once --emit-diff
# Read-only: compute drift WITHOUT writing the YAML (use for "what's running?" checks).
bash skills/tmux-agent-orchestrate-monitor/scripts/reconcile.sh --once --emit-diff --dry-run
bash .agents/skills/multi-agent-mux-monitor/scripts/reconcile.sh --once --emit-diff --dry-run
# Push-based MQTT Monitor: listen to delegated job events on the broker and update the YAML instantly.
# Bounded run that exits after 5 min idle, or 1 h wall-clock; falls back to polling if the broker is down.
bash skills/tmux-agent-orchestrate-monitor/scripts/reconcile.sh --subscribe --idle-timeout 300 --timeout 3600
bash .agents/skills/multi-agent-mux-monitor/scripts/reconcile.sh --subscribe --idle-timeout 300 --timeout 3600
# Persistent monitor (no timeouts): runs until interrupted; still polls if the broker is unreachable.
bash skills/tmux-agent-orchestrate-monitor/scripts/reconcile.sh --subscribe --idle-timeout 0
bash .agents/skills/multi-agent-mux-monitor/scripts/reconcile.sh --subscribe --idle-timeout 0
```
Flags: `--once` (single pass), `--emit-diff` (print JSON), `--dry-run` (P1-E — no mutation), `--subscribe` (push-based MQTT subscription monitoring). `--subscribe` sub-flags: `--timeout N` (exit after N seconds of wall-clock; `0` = no limit, default), `--idle-timeout N` (exit after N seconds with no message; default `3600`, `0` = never idle-out). On a broker connection failure (connect error **or** non-zero CONNACK), `--subscribe` falls back to a polling loop that re-runs `--once --emit-diff` every `RECONCILE_POLL_INTERVAL` (default 15) seconds until `--timeout`. Terminal-event YAML updates are written through `lib.sh::atomic_dump_yaml` (flock + schema-validate + `.bak`). There are **no** `--workspace` / `--agent` / `--comment-card` flags; the worker turns the emitted JSON `drifts[]` into `kanban_comment` calls itself.
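
A rough Python sketch of the two `--subscribe` exit conditions described above, assuming both timers are in seconds and that `0` disables a timer. The function name `should_exit` and its parameters are illustrative and are not taken from `reconcile.sh`.

```python
def should_exit(now: float, started: float, last_msg: float,
                timeout: int, idle_timeout: int) -> bool:
    """Return True when a bounded --subscribe run should stop.

    timeout:      wall-clock limit in seconds (0 = no limit)
    idle_timeout: limit on seconds without any message (0 = never idle-out)
    """
    # Wall-clock limit: counted from the moment the subscriber started.
    if timeout > 0 and now - started >= timeout:
        return True
    # Idle limit: counted from the most recent message.
    if idle_timeout > 0 and now - last_msg >= idle_timeout:
        return True
    return False
```

With `--idle-timeout 300 --timeout 3600`, a run that last saw a message 301 seconds ago stops even though the wall-clock budget is not used up; with `--idle-timeout 0` and the default `--timeout 0`, neither branch can fire.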
@@ -126,7 +126,7 @@ tmux: no session
**Skip-set**: the auto-terminate only fires for sessions whose status is `running`.
Rows already in a deliberate end state — `terminated`, `archived`, or **`stopped`**
(set by `tmux-agent-orchestrate-stop`) — are
(set by `multi-agent-mux-stop`) — are
left untouched. This is critical: a `stopped` row keeps its `resumable: true` and
captured `*_session_id_own`, so the monitor must **not** overwrite it with
`terminated ("auto-detected")` when its tmux is (expectedly) gone.
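
The skip-set rule can be pictured with a small Python sketch. It assumes each YAML row is a dict with a `status` key; the helper name `should_auto_terminate` is hypothetical and is not the function used inside `reconcile.sh`.

```python
def should_auto_terminate(row: dict, tmux_alive: bool) -> bool:
    """Decide whether the monitor may mark a row as terminated.

    Only rows that still claim to be running are candidates. Rows in a
    deliberate end state (terminated, archived, stopped) are never touched,
    so a stopped row keeps its resumable flag and captured session id.
    """
    if tmux_alive:
        return False
    return row.get("status") == "running"
```

A `stopped` row whose tmux session is gone therefore yields `False`, which is the behavior the paragraph above calls critical.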
@@ -165,8 +165,8 @@ disk: ~/.claude/projects/.../87dc548e-...jsonl: missing
- **Don't run the monitor without `--goal`** — without goal mode, a single turn will spawn, do one reconcile, and complete. Goal mode keeps the worker alive across many turns.
- **The 30s poll is a default** — workers may override if they detect heavy churn. A workspace with 5+ agent sessions should bump to 60s to avoid noise.
- **`kanban_comment` rate limits** — Kanban may throttle if you comment too fast. Coalesce: only comment when the diff is *new* (not the same drift on every poll). The script tracks a state file at `.cache/tmux-agent-orchestrate-monitor/<workspace>.state` in the workspace root for this (overridable via `AGENT_SESSIONS_STATE_DIR`).
- **Don't fight the user's explicit action** — if `tmux-agent-orchestrate-stop` is mid-flight and the monitor sees the same session in two states within 5s, prefer the user's most recent action. The monitor should not auto-revert a fresh `terminated` to `running` because of a stale `tmux has-session` check.
- **`kanban_comment` rate limits** — Kanban may throttle if you comment too fast. Coalesce: only comment when the diff is *new* (not the same drift on every poll). The script tracks a state file at `.cache/multi-agent-mux-monitor/<workspace>.state` in the workspace root for this (overridable via `AGENT_SESSIONS_STATE_DIR`).
- **Don't fight the user's explicit action** — if `multi-agent-mux-stop` is mid-flight and the monitor sees the same session in two states within 5s, prefer the user's most recent action. The monitor should not auto-revert a fresh `terminated` to `running` because of a stale `tmux has-session` check.
- **The monitor should never modify the conversation artifacts** (jsonl, db) — only the YAML. If you see a stale UUID, comment about it but don't delete the file.
- **TUI capture-pane is expensive** — only capture when you need to update `last_visible_status`, not every poll.
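
One way to coalesce comments, as the rate-limit pitfall above suggests, is to remember a fingerprint of each drift between polls. The sketch below is an assumption about how that could look: the state file layout (a JSON list of SHA-256 digests) and the name `new_drifts` are hypothetical and may differ from what the real script stores under `.cache/multi-agent-mux-monitor/`.

```python
import hashlib
import json
from pathlib import Path


def new_drifts(drifts: list, state_file) -> list:
    """Return only drifts not reported on the previous poll, then save the current set."""
    state_path = Path(state_file)
    try:
        seen = set(json.loads(state_path.read_text()))
    except (FileNotFoundError, json.JSONDecodeError):
        seen = set()  # first poll, or unreadable state: treat everything as new

    current = {}
    for drift in drifts:
        # Stable fingerprint: the same drift dict always hashes to the same key.
        key = hashlib.sha256(json.dumps(drift, sort_keys=True).encode()).hexdigest()
        current[key] = drift

    fresh = [drift for key, drift in current.items() if key not in seen]
    state_path.parent.mkdir(parents=True, exist_ok=True)
    state_path.write_text(json.dumps(sorted(current)))
    return fresh
```

The worker would then call `kanban_comment` only for the returned list. A drift that disappears and later comes back is reported again, because only the previous poll's fingerprints are kept.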
@@ -180,7 +180,7 @@ The `--body` of the dispatched task IS the worker's behavior spec. Here's a test
## Loop (every 30s)
1. Read agent-sessions.yaml
2. Bash: `bash skills/tmux-agent-orchestrate-monitor/scripts/reconcile.sh --emit-diff`
2. Bash: `bash .agents/skills/multi-agent-mux-monitor/scripts/reconcile.sh --emit-diff`
3. Parse the JSON diff from stdout
4. If `drifts` is non-empty:
- For each drift, call `kanban_comment` with the diff message
@@ -203,7 +203,7 @@ If `$HERMES_KANBAN_TASK` card has any comment containing "stop" or "stop monitor
- Do NOT modify conversation artifacts (jsonl, db, brain/)
- Do NOT spawn/delete tmux sessions — that's the create/delete skills' job
- Do NOT call tmux-agent-orchestrate-create or tmux-agent-orchestrate-stop — only the user initiates those
- Do NOT call multi-agent-mux-create or multi-agent-mux-stop — only the user initiates those
- Do NOT call `git commit` / `git push`
```
@@ -226,7 +226,7 @@ When using `--subscribe` with the default PoC public broker
```bash
# Run reconcile once and inspect output
bash skills/tmux-agent-orchestrate-monitor/scripts/reconcile.sh --emit-diff --once \
bash .agents/skills/multi-agent-mux-monitor/scripts/reconcile.sh --emit-diff --once \
| python3 -m json.tool
```
@@ -1,5 +1,5 @@
#!/usr/bin/env bash
# reconcile.sh: helper script for tmux-agent-orchestrate-monitor
# reconcile.sh: helper script for multi-agent-mux-monitor
# Detects drift between YAML ↔ tmux ↔ on-disk artifacts (+ automatic YAML update).
#
# Usage:
@@ -7,7 +7,7 @@
# bash reconcile.sh --once --emit-diff --dry-run # compute drift only, no writes (P1-E)
#
# --dry-run: side-effect-free and read-only. Safe for answering "what's running right now?".
# Reused by the tmux-agent-orchestrate-status skill.
# Reused by the multi-agent-mux-status skill.
#
# Output (JSON): {timestamp, yaml_path, tmux_sessions_alive, tmux_confirmed, drifts, actions}
#
@@ -16,7 +16,7 @@ set -euo pipefail
source "$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)/lib.sh"
STATE_DIR="${AGENT_SESSIONS_STATE_DIR:-$WORKSPACE_ROOT/.cache/tmux-agent-orchestrate-monitor}"
STATE_DIR="${AGENT_SESSIONS_STATE_DIR:-$WORKSPACE_ROOT/.cache/multi-agent-mux-monitor}"
ONCE=0
EMIT_DIFF=0
@@ -55,26 +55,42 @@ if [ "$SUBSCRIBE" = "1" ]; then
# The MQTT subscribe loop exits 3 to signal "broker unavailable → poll instead".
set +e
YAML_PATH="$AGENT_SESSIONS_YAML" HOME_DIR="$HOME_DIR" CLAUDE_PROJECT_DIR="$CLAUDE_PROJECT_DIR" LOCAL_BIN="$LOCAL_BIN" \
SUB_TIMEOUT="$SUB_TIMEOUT" SUB_IDLE_TIMEOUT="$SUB_IDLE_TIMEOUT" \
WORKSPACE_ROOT="$WORKSPACE_ROOT" SUB_TIMEOUT="$SUB_TIMEOUT" SUB_IDLE_TIMEOUT="$SUB_IDLE_TIMEOUT" \
SKILLS_DIR="$SKILLS_DIR" LIB_SH="$LIB_SH" \
"$PYBIN" - <<'PYEOF'
import os, sys, json, time, subprocess
lib_sh = os.environ.get('LIB_SH', '')
skills_dir = os.environ.get('SKILLS_DIR', '')
yaml_path = os.environ.get('YAML_PATH', '')
workspace_root = os.environ.get('WORKSPACE_ROOT', '')
timeout = int(os.environ.get('SUB_TIMEOUT', '0') or '0') # 0 = no overall timeout
idle_timeout = int(os.environ.get('SUB_IDLE_TIMEOUT', '3600') or '0') # 0 = no idle timeout
# Locate skills/tmux-agent-orchestrate-delegate-job/scripts to import mqtt_common — relative first, then
# Prevent duplicate wildcard subscribers for this workspace (concurrency race)
import fcntl
lock_file_path = os.path.join(workspace_root or '.', '.mam', 'monitor.lock')
try:
os.makedirs(os.path.dirname(lock_file_path), exist_ok=True)
lock_file = open(lock_file_path, 'w')
fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
except BlockingIOError:
print("MQTT Monitor: another subscriber is already running for this workspace. Exiting.", flush=True)
sys.exit(0)
except Exception as e:
print(f"MQTT Monitor: failed to acquire monitor lock ({e}). Exiting.", flush=True)
sys.exit(1)
# Locate skills/multi-agent-mux-delegate-job/scripts to import mqtt_common — relative first, then
# an upward walk from cwd. No hardcoded absolute path (review item 6).
cand = os.path.join(skills_dir, 'tmux-agent-orchestrate-delegate-job', 'scripts') if skills_dir else ''
cand = os.path.join(skills_dir, 'multi-agent-mux-delegate-job', 'scripts') if skills_dir else ''
if cand and os.path.isdir(cand):
sys.path.append(cand)
else:
d = os.getcwd()
while d and d != '/':
hit = None
for sub in (('skills', 'tmux-agent-orchestrate-delegate-job', 'scripts'), ('tmux-agent-orchestrate-delegate-job', 'scripts')):
for sub in (('.agents', 'skills', 'multi-agent-mux-delegate-job', 'scripts'), ('skills', 'multi-agent-mux-delegate-job', 'scripts'), ('multi-agent-mux-delegate-job', 'scripts')):
p = os.path.join(d, *sub)
if os.path.isdir(p):
hit = p
@@ -85,6 +101,7 @@ else:
d = os.path.dirname(d)
import mqtt_common
import registry
# Executed INSIDE lib.sh::atomic_dump_yaml (system python3 + PyYAML), under the
# YAML flock with schema-validate + .bak (review item 5). Marks matching running
@@ -132,6 +149,7 @@ def handle_terminal(jid, event):
state = {'last_msg': time.time(), 'connected': False, 'failed': False}
last_seqs = {}
def on_message(_client, _userdata, msg):
@@ -140,7 +158,48 @@ def on_message(_client, _userdata, msg):
payload = json.loads(msg.payload.decode("utf-8"))
jid = payload.get("job_id")
event = payload.get("event")
if jid and event in ("completed", "error"):
if not jid or not event:
return
if workspace_root:
registry_dir = os.path.join(workspace_root, '.mam', 'jobs')
else:
yaml_dir = os.path.dirname(yaml_path) if yaml_path else ""
registry_dir = os.path.join(yaml_dir, 'jobs') if yaml_dir else '.mam/jobs'
try:
job = registry.load_job(jid, registry_dir)
except FileNotFoundError:
# Silently ignore events for jobs not in the local registry
return
expected_token = job.get("auth_token")
if not mqtt_common.verify_hmac(payload, expected_token):
print(f"MQTT Monitor: drop event for job {jid}: HMAC verify failed", flush=True)
return
seq = payload.get("seq")
if seq is None or not isinstance(seq, int):
print(f"MQTT Monitor: drop event for job {jid}: missing or invalid seq", flush=True)
return
if seq <= last_seqs.get(jid, 0):
print(f"MQTT Monitor: drop event for job {jid}: seq {seq} not monotonic (last {last_seqs.get(jid, 0)})", flush=True)
return
last_seqs[jid] = seq
# Append the event to events.ndjson audit trail
mqtt_common.append_event(jid, {
"event": "received",
"source_event": event,
"seq": seq,
"topic": msg.topic,
"timestamp": payload.get("timestamp"),
"detail": payload.get("detail", ""),
})
print(f"MQTT Monitor: recorded event {event} for job {jid} (seq={seq})", flush=True)
if event in ("completed", "error"):
print(f"MQTT Monitor: received terminal event {event} for job {jid}", flush=True)
handle_terminal(jid, event)
except Exception as e:
@@ -204,7 +263,7 @@ PYEOF
if [ "$sub_rc" = "3" ]; then
echo "MQTT Monitor: broker unavailable — falling back to polling (interval ${POLL_INTERVAL}s)" >&2
_self="$SKILLS_DIR/tmux-agent-orchestrate-monitor/scripts/reconcile.sh"
_self="$SKILLS_DIR/multi-agent-mux-monitor/scripts/reconcile.sh"
_start=$(date +%s)
while :; do
bash "$_self" --once --emit-diff >/dev/null 2>&1 || true
@@ -223,7 +282,7 @@ mkdir -p "$STATE_DIR"
# runs the same source through atomic_dump_yaml (flock + temp+rename). Inside the atomic wrapper,
# the write is skipped via SystemExit(0) when there are no 'actions' (avoids needless reformatting).
read -r -d '' RECON_SRC <<'PYEOF' || true
import os, json, glob, subprocess, time
import os, json, glob, subprocess, time, sqlite3
from datetime import datetime, timezone
import yaml
@@ -344,14 +403,28 @@ if tmux_confirmed:
name = t['name']
if name in yaml_session_names:
continue
if not (name.endswith('-creator-claude') or name.endswith('-creator-agy')):
if name.endswith('-creator-claude'):
agent = 'claude'
elif name.endswith('-creator-agy'):
agent = 'agy'
elif name.endswith('-creator-hermes'):
agent = 'hermes'
elif name.endswith('-creator-cline'):
agent = 'cline'
else:
continue
srv = t.get('server', 'default')
pm = pane_meta(name, srv)
if not pm:
continue
agent = 'claude' if name.endswith('-creator-claude') else 'agy'
cmd_full = 'claude' if agent == 'claude' else 'agy --dangerously-skip-permissions'
if agent == 'claude':
cmd_full = 'claude --dangerously-skip-permissions'
elif agent == 'agy':
cmd_full = 'agy --dangerously-skip-permissions'
elif agent == 'hermes':
cmd_full = 'hermes'
elif agent == 'cline':
cmd_full = 'cline -i'
server_opt = f"-L {srv} " if srv != 'default' else ""
entry = {
'name': name,
@@ -371,7 +444,7 @@ if tmux_confirmed:
entry['tui'] = {'model': '(unknown — capture after first message)', 'provider': 'anthropic',
'plan': '(unknown)', 'account': '(unknown)', 'version': '(unknown)'}
entry['claude_session_id_own'] = None
else:
elif agent == 'agy':
entry['child_pid'] = 0
entry['agy_conversation_id_own'] = None
entry['mcp_attachments'] = [
@@ -381,6 +454,12 @@ if tmux_confirmed:
'endpoint': 'https://stitch.googleapis.com/mcp'
}
]
elif agent == 'hermes':
entry['child_pid'] = 0
entry['hermes_conversation_id_own'] = None
elif agent == 'cline':
entry['child_pid'] = 0
entry['cline_conversation_id_own'] = None
d.setdefault('tmux_sessions', []).append(entry)
yaml_session_names.add(name)
drifts.append({'class': 'B', 'name': name,
@@ -446,6 +525,66 @@ for s in d.get('tmux_sessions', []):
except Exception:
pass
# === drift C (hermes): materialize a new hermes session id (per-row own id) ===
for s in d.get('tmux_sessions', []):
if not s.get('name', '').endswith('-creator-hermes'):
continue
if s.get('status') != 'running':
continue
if s.get('hermes_conversation_id_own'):
continue
cwd = (s.get('pane') or {}).get('cwd', '')
if not cwd:
continue
hdb = f"{home}/.hermes/state.db"
if os.path.exists(hdb):
try:
conn = sqlite3.connect(hdb)
r = conn.execute("SELECT id FROM sessions WHERE cwd=? ORDER BY started_at DESC LIMIT 1", (cwd,)).fetchone()
conn.close()
if r:
cid = r[0]
s['hermes_conversation_id_own'] = cid
drifts.append({'class': 'C', 'name': s['name'], 'msg': f"{s['name']}: conversation id materialized: {cid}"})
actions.append(f"updated conversation id: {cid}")
except Exception:
pass
# === drift C (cline): materialize a new cline session id (per-row own id) ===
for s in d.get('tmux_sessions', []):
if not s.get('name', '').endswith('-creator-cline'):
continue
if s.get('status') != 'running':
continue
if s.get('cline_conversation_id_own'):
continue
cwd = (s.get('pane') or {}).get('cwd', '')
if not cwd:
continue
sessions_dir = f"{home}/.cline/data/sessions"
if os.path.isdir(sessions_dir):
candidates = []
for session_folder in glob.glob(f"{sessions_dir}/*"):
if os.path.isdir(session_folder):
folder_name = os.path.basename(session_folder)
json_file = f"{session_folder}/{folder_name}.json"
if os.path.exists(json_file):
candidates.append(json_file)
candidates.sort(key=os.path.getmtime, reverse=True)
for j in candidates:
try:
with open(j) as f:
sdata = json.load(f)
if sdata.get('cwd') == cwd or sdata.get('workspace_root') == cwd:
cid = sdata.get('session_id')
if cid:
s['cline_conversation_id_own'] = cid
drifts.append({'class': 'C', 'name': s['name'], 'msg': f"{s['name']}: session id materialized: {cid}"})
actions.append(f"updated session id: {cid}")
break
except Exception:
pass
# === drift D: stale UUID (the cached artifact is gone); report only, no changes ===
ai = d.get('agent_identities', {}) or {}
cl = (ai.get('claude') or {})
@@ -460,6 +599,28 @@ if ag.get('conversation_id'):
if not os.path.exists(f"{home}/.gemini/antigravity-cli/conversations/{cid}.db"):
drifts.append({'class': 'D', 'name': '(agy identity cache)',
'msg': f"stale UUID in agent_identities.agy.conversation_id: {cid} (.db missing)"})
hr = (ai.get('hermes') or {})
if hr.get('session_id'):
sid = hr['session_id']
hdb = f"{home}/.hermes/state.db"
has_session = False
if os.path.exists(hdb):
try:
conn = sqlite3.connect(hdb)
r = conn.execute("SELECT 1 FROM sessions WHERE id=?", (sid,)).fetchone()
conn.close()
has_session = r is not None
except Exception:
pass
if not has_session:
drifts.append({'class': 'D', 'name': '(hermes identity cache)',
'msg': f"stale UUID in agent_identities.hermes.session_id: {sid} (session missing from db)"})
cn = (ai.get('cline') or {})
if cn.get('session_id'):
sid = cn['session_id']
if not os.path.exists(f"{home}/.cline/data/sessions/{sid}/{sid}.json"):
drifts.append({'class': 'D', 'name': '(cline identity cache)',
'msg': f"stale UUID in agent_identities.cline.session_id: {sid} (session file missing)"})
result = {
'timestamp': now_iso,
@@ -1,6 +1,6 @@
---
name: tmux-agent-orchestrate-resume
description: "Resume an existing agent (claude, antigravity/agy) conversation by UUID into a tmux session. Reads .hermes/agent-sessions.yaml for the saved session/conversation id, spawns (or reuses) a tmux session of the matching name, and runs `claude -r <id>` or `agy --conversation <id>` inside. Use when you want to reattach to a previous session's context, or revive a session whose tmux died but the agent's conversation is still on disk."
name: multi-agent-mux-resume
description: "Resume an existing agent (claude, antigravity/agy) conversation by UUID into a tmux session. Reads .mam/agent-sessions.yaml for the saved session/conversation id, spawns (or reuses) a tmux session of the matching name, and runs `claude -r <id>` or `agy --conversation <id>` inside. Use when you want to reattach to a previous session's context, or revive a session whose tmux died but the agent's conversation is still on disk."
version: 1.0.0
author: godopu
license: MIT
@@ -9,15 +9,15 @@ environments: [terminal, tmux]
metadata:
hermes:
tags: [agent, tmux, claude, antigravity, agy, multi-agent, context, resume, session-id]
related_skills: [tmux-agent-orchestrate-create, tmux-agent-orchestrate-stop, tmux-agent-orchestrate-monitor, claude-code]
prereq_skills: [tmux-agent-orchestrate-create]
related_skills: [multi-agent-mux-create, multi-agent-mux-stop, multi-agent-mux-monitor, claude-code]
prereq_skills: [multi-agent-mux-create]
---
# Multi-Agent Resume — Reattach to a Saved Conversation
> **Companion skills**: `tmux-agent-orchestrate-create` (start a fresh agent), `tmux-agent-orchestrate-stop` (terminate), `tmux-agent-orchestrate-monitor` (live status).
> **Tmux Isolation**: If the `TMUX_SERVER_NAME` env var was set at create time, this skill operates on the same server. See [tmux-agent-orchestrate-create/SKILL.md](../tmux-agent-orchestrate-create/SKILL.md) for the detailed isolation patterns.
> **Single source of truth**: `./.hermes/agent-sessions.yaml`.
> **Companion skills**: `multi-agent-mux-create` (start a fresh agent), `multi-agent-mux-stop` (terminate), `multi-agent-mux-monitor` (live status).
> **Tmux Isolation**: If the `TMUX_SERVER_NAME` env var was set at create time, this skill operates on the same server. See [multi-agent-mux-create/SKILL.md](../multi-agent-mux-create/SKILL.md) for the detailed isolation patterns.
> **Single source of truth**: `./.mam/agent-sessions.yaml`.
## What this skill does
@@ -26,12 +26,12 @@ metadata:
Three cases this skill handles:
1. **tmux is dead, conversation lives**: `agent-sessions.yaml` has the UUID. The JSONL/db is on disk. Re-spawn the tmux session + run `claude -r <id>` / `agy --conversation <id>`.
2. **tmux is alive but empty** — You started a session with `tmux-agent-orchestrate-create` but haven't sent a message yet (so no session id was assigned). The user can either send their first message (and the id is auto-assigned), or you can read the *workspace's* most recent conversation from `$HOME_DIR/.gemini/antigravity-cli/cache/last_conversations.json` (defaults to `~/.gemini/...`) for agy, or the latest `*.jsonl` in `$CLAUDE_PROJECT_DIR/<workspace-key>/` (defaults to `~/.claude/projects/`) for claude.
2. **tmux is alive but empty** — You started a session with `multi-agent-mux-create` but haven't sent a message yet (so no session id was assigned). The user can either send their first message (and the id is auto-assigned), or you can read the *workspace's* most recent conversation from `$HOME_DIR/.gemini/antigravity-cli/cache/last_conversations.json` (defaults to `~/.gemini/...`) for agy, or the latest `*.jsonl` in `$CLAUDE_PROJECT_DIR/<workspace-key>/` (defaults to `~/.claude/projects/`) for claude.
3. **tmux is alive AND the agent inside is already running** — Just attach. No re-spawn needed.
### Resuming a `stopped` session (`stopped → running`)
When a session was ended via `tmux-agent-orchestrate-stop` (which captures the ID and gracefully stops by default),
When a session was ended via `multi-agent-mux-stop` (which captures the ID and gracefully stops by default),
its row is `status: stopped` with `resumable: true` and the conversation id
already recorded in `claude_session_id_own` / `agy_conversation_id_own`. This is the
ideal resume path:
@@ -55,26 +55,26 @@ ideal resume path:
- claude: `ls -t $CLAUDE_PROJECT_DIR/<workspace-key>/*.jsonl | head -1` and parse the `sessionId` from the first line
- agy: `jq -r '."<workspace>"' $HOME_DIR/.gemini/antigravity-cli/cache/last_conversations.json`
If all three are empty → the workspace has no conversation yet. Fall back to `tmux-agent-orchestrate-create`.
If all three are empty → the workspace has no conversation yet. Fall back to `multi-agent-mux-create`.
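
The claude fallback above (newest `*.jsonl`, `sessionId` on the first line) can be sketched in Python. This is only an illustration under the assumption that the first line is a JSON object carrying a `sessionId` key, as the text states; the function name `latest_claude_session_id` is hypothetical and is not part of `resolve_session_id.sh`.

```python
import json
from pathlib import Path
from typing import Optional


def latest_claude_session_id(project_dir) -> Optional[str]:
    """Return the sessionId from the first line of the newest *.jsonl, or None."""
    files = sorted(Path(project_dir).glob("*.jsonl"),
                   key=lambda p: p.stat().st_mtime, reverse=True)
    if not files:
        return None  # no conversation recorded for this workspace yet
    with files[0].open() as fh:
        first_line = fh.readline()
    try:
        record = json.loads(first_line)
    except json.JSONDecodeError:
        return None  # empty or malformed first line
    if not isinstance(record, dict):
        return None
    return record.get("sessionId")
```

A `None` result corresponds to the empty case, where falling back to the create skill is the documented next step.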
## Workflow
```bash
WORKSPACE=/path/to/project
AGENT=claude # or agy or hermes
SESSION_NAME=<workspace>-creator-<agent> # same convention as tmux-agent-orchestrate-create
SESSION_NAME=<workspace>-creator-<agent> # same convention as multi-agent-mux-create
# 1. Resolve the session id
UUID=$(bash skills/tmux-agent-orchestrate-resume/scripts/resolve_session_id.sh \
UUID=$(bash .agents/skills/multi-agent-mux-resume/scripts/resolve_session_id.sh \
--workspace "$WORKSPACE" --agent "$AGENT")
if [ -z "$UUID" ]; then
echo "No saved session for $WORKSPACE ($AGENT). Use tmux-agent-orchestrate-create first."
echo "No saved session for $WORKSPACE ($AGENT). Use multi-agent-mux-create first."
exit 1
fi
# Resolve the isolated tmux server name
source skills/lib.sh
source .agents/skills/lib.sh
export TMUX_SERVER_NAME="$(resolve_tmux_server "$SESSION_NAME")"
# 2. If tmux is alive, attach. Done.
@@ -107,8 +107,8 @@ case "$AGENT" in
esac
# 4. Update agent-sessions.yaml: status running, last_visible_status
# (Also automatically publishes a `progress --detail "resumed"` event to the tmux-agent-orchestrate-delegate-job registry if a delegate_job_id exists)
bash skills/tmux-agent-orchestrate-resume/scripts/update_yaml_resumed.sh \
# (Also automatically publishes a `progress --detail "resumed"` event to the multi-agent-mux-delegate-job registry if a delegate_job_id exists)
bash .agents/skills/multi-agent-mux-resume/scripts/update_yaml_resumed.sh \
--session "$SESSION_NAME" --uuid "$UUID"
# 5. Attach
@@ -120,8 +120,8 @@ tmux attach -t "$SESSION_NAME"
- **`claude -r` requires the SAME project directory** — if the workspace path differs from when the session was created, claude will create a new project dir key (`-home-...-different-name`) and put the resume in a different location. Always `-c` (cd to workspace) before running.
- **agy's `--conversation` flag name varies by version** — older versions used `--resume` or `-r`. Check `agy --help | grep -E "conversation|resume"` and use the right flag. v1.0.x: `--conversation`.
- **The first message after resume might re-trigger TUI dialogs** — if the original session was created with `--dangerously-skip-permissions`, those flags are NOT persisted; you must re-apply them on resume. The script above re-passes them.
- **Don't resume if the session is brand new and empty**: `tmux-agent-orchestrate-create` already set up an empty container; sending a probe message ("init") is the right way to materialize a session id, NOT `claude -r` with a placeholder.
- **`agy --conversation <id>` will fail if the conversation was deleted from disk** — check `~/.gemini/antigravity-cli/conversations/<uuid>.db` exists before attempting resume. If missing, the conversation is gone; you need a fresh session via `tmux-agent-orchestrate-create`.
- **Don't resume if the session is brand new and empty**: `multi-agent-mux-create` already set up an empty container; sending a probe message ("init") is the right way to materialize a session id, NOT `claude -r` with a placeholder.
- **`agy --conversation <id>` will fail if the conversation was deleted from disk** — check `~/.gemini/antigravity-cli/conversations/<uuid>.db` exists before attempting resume. If missing, the conversation is gone; you need a fresh session via `multi-agent-mux-create`.
## Verification
@@ -132,7 +132,7 @@ tmux list-panes -t "$SESSION_NAME" -F 'cmd=#{pane_current_command} cwd=#{pane_cu
# 2. agent-sessions.yaml updated
python3 -c "
import yaml
d = yaml.safe_load(open('.hermes/agent-sessions.yaml'))
d = yaml.safe_load(open('.mam/agent-sessions.yaml'))
s = [s for s in d['tmux_sessions'] if s['name'] == '$SESSION_NAME'][0]
print(f' status: {s[\"status\"]}')
print(f' pane.cmd_full: {s[\"pane\"][\"cmd_full\"]}')
@@ -146,6 +146,6 @@ tmux capture-pane -t "$SESSION_NAME" -p -S -30
## When NOT to use this skill
- **No saved session yet** → `tmux-agent-orchestrate-create`
- **Killing an existing session** → `tmux-agent-orchestrate-stop`
- **No saved session yet** → `multi-agent-mux-create`
- **Killing an existing session** → `multi-agent-mux-stop`
- **Just attaching** → `tmux attach -t <name>` (no skill needed)
@@ -1,5 +1,5 @@
#!/usr/bin/env bash
# resolve_session_id.sh: helper script for tmux-agent-orchestrate-resume
# resolve_session_id.sh: helper script for multi-agent-mux-resume
# Usage:
# bash resolve_session_id.sh --workspace <path> --agent <claude|agy>
# Output: a single UUID line on stdout (an empty line + exit 0 if none is found)
@@ -1,5 +1,5 @@
#!/usr/bin/env bash
# update_yaml_resumed.sh: helper script for tmux-agent-orchestrate-resume
# update_yaml_resumed.sh: helper script for multi-agent-mux-resume
# Updates the resumed session's agent-sessions.yaml entry to status=running plus resume metadata.
# Writes the resume UUID into the per-row own id (claude_session_id_own / agy_conversation_id_own);
# the global agent_identities is no longer primary (demoted to a cache, P0-C/step e).
@@ -41,6 +41,7 @@ if [ -z "$AGENT" ]; then
*-creator-claude) AGENT=claude ;;
*-creator-agy) AGENT=agy ;;
*-creator-hermes) AGENT=hermes ;;
*-creator-cline) AGENT=cline ;;
*) echo "ERROR: cannot infer agent from '$SESSION_NAME'; pass --agent" >&2; exit 2 ;;
esac
fi
@@ -51,7 +52,7 @@ NOW_ISO=$(date -u +'%Y-%m-%dT%H:%M:%SZ')
PANE_PID=$(tmux list-panes -t "$SESSION_NAME" -F '#{pane_pid}' 2>/dev/null | head -1 || true)
PANE_PID="${PANE_PID:-}"
CHILD_PID=0
if { [ "$AGENT" = "agy" ] || [ "$AGENT" = "hermes" ]; } && [ -n "$PANE_PID" ]; then
if { [ "$AGENT" = "agy" ] || [ "$AGENT" = "hermes" ] || [ "$AGENT" = "cline" ]; } && [ -n "$PANE_PID" ]; then
CHILD_PID=$(pgrep -P "$PANE_PID" -x "$AGENT" 2>/dev/null | head -1 || true)
CHILD_PID="${CHILD_PID:-0}"
fi
@@ -144,6 +145,13 @@ elif agent == 'hermes':
cp = os.environ.get('CHILD_PID', '0')
if cp.isdigit() and int(cp) > 0:
target['child_pid'] = int(cp)
elif agent == 'cline':
target['pane']['cmd'] = 'cline'
target['pane']['cmd_full'] = f'cline -i --id {uuid}'
target['cline_conversation_id_own'] = uuid
cp = os.environ.get('CHILD_PID', '0')
if cp.isdigit() and int(cp) > 0:
target['child_pid'] = int(cp)
snap = d.setdefault('snapshot', {})
snap['taken_at'] = now
@@ -1,5 +1,5 @@
---
name: tmux-agent-orchestrate-status
name: multi-agent-mux-status
description: "Read-only instant snapshot of all agent tmux sessions — name, YAML status, tmux alive, pane cmd/cwd, resume UUID on disk, and any drift. No Kanban, no mutation. Reuses reconcile.sh --dry-run for the diff logic. Use when you want to know 'what's running RIGHT NOW' without spinning up a Kanban monitor worker."
version: 1.0.0
author: godopu
@@ -9,36 +9,36 @@ environments: [terminal, tmux]
metadata:
hermes:
tags: [agent, tmux, claude, antigravity, agy, status, read-only, snapshot]
related_skills: [tmux-agent-orchestrate-create, tmux-agent-orchestrate-resume, tmux-agent-orchestrate-stop, tmux-agent-orchestrate-monitor]
prereq_skills: [tmux-agent-orchestrate-create, tmux-agent-orchestrate-monitor]
related_skills: [multi-agent-mux-create, multi-agent-mux-resume, multi-agent-mux-stop, multi-agent-mux-monitor]
prereq_skills: [multi-agent-mux-create, multi-agent-mux-monitor]
---
# Multi-Agent Status — Read-Only Instant Snapshot
> **Companion skills**: `tmux-agent-orchestrate-create` (start), `tmux-agent-orchestrate-resume` (re-attach), `tmux-agent-orchestrate-stop` (terminate), `tmux-agent-orchestrate-monitor` (live polling).
> **Companion skills**: `multi-agent-mux-create` (start), `multi-agent-mux-resume` (re-attach), `multi-agent-mux-stop` (terminate), `multi-agent-mux-monitor` (live polling).
> **Tmux Isolation**: The `status` command automatically looks up the isolated server (`tmux_server` field) of every session registered in the YAML, so it reports session state across all isolated servers without `TMUX_SERVER_NAME` being set manually.
> **Single source of truth**: `./.hermes/agent-sessions.yaml`.
> **Single source of truth**: `./.mam/agent-sessions.yaml`.
## What this skill does
Print a single table of every agent tmux session, comparing YAML state to actual tmux state. **No mutation. No Kanban. No polling loop.**
This is the "what's running right now?" answer — faster than dispatching `tmux-agent-orchestrate-monitor` (which polls every 30s) and safer than `reconcile.sh --once --emit-diff` (which mutates as a side effect).
This is the "what's running right now?" answer — faster than dispatching `multi-agent-mux-monitor` (which polls every 30s) and safer than `reconcile.sh --once --emit-diff` (which mutates as a side effect).
## Pre-flight
```bash
command -v tmux
command -v python3
test -f .hermes/agent-sessions.yaml
test -f .mam/agent-sessions.yaml
```
If `agent-sessions.yaml` doesn't exist or is malformed → print clear error, exit 1. **Do not create it.** (Use `tmux-agent-orchestrate-create` first.)
If `agent-sessions.yaml` doesn't exist or is malformed → print clear error, exit 1. **Do not create it.** (Use `multi-agent-mux-create` first.)
## Workflow
```bash
bash skills/tmux-agent-orchestrate-status/scripts/status.sh [--json]
bash .agents/skills/multi-agent-mux-status/scripts/status.sh [--json]
```
The script:
@@ -98,17 +98,17 @@ lab-paper-pdf2md-creator-claude default running alive clau
| Class | Detection | Meaning |
|---|---|---|
| `A` | YAML `running`, tmux dead | session died without going through `tmux-agent-orchestrate-stop`. *Could* auto-terminate but won't — that's `tmux-agent-orchestrate-monitor`'s job. |
| `B` | tmux alive, not in YAML | ad-hoc session someone started without `tmux-agent-orchestrate-create`. Suggest: "use tmux-agent-orchestrate-create to register, or tmux kill-session to clean up." |
| `C` | YAML has `claude_session_id_own: null` AND a new *.jsonl exists | new session id materialized; suggest: "run tmux-agent-orchestrate-resume or reconcile to register it." |
| `D` | YAML has UUID in `agent_identities`, but the on-disk artifact is gone | stale UUID; user should `tmux-agent-orchestrate-stop --purge-conversation` to clean up. |
| `A` | YAML `running`, tmux dead | session died without going through `multi-agent-mux-stop`. *Could* auto-terminate but won't — that's `multi-agent-mux-monitor`'s job. |
| `B` | tmux alive, not in YAML | ad-hoc session someone started without `multi-agent-mux-create`. Suggest: "use multi-agent-mux-create to register, or tmux kill-session to clean up." |
| `C` | YAML has `claude_session_id_own: null` AND a new *.jsonl exists | new session id materialized; suggest: "run multi-agent-mux-resume or reconcile to register it." |
| `D` | YAML has UUID in `agent_identities`, but the on-disk artifact is gone | stale UUID; user should `multi-agent-mux-stop --purge-conversation` to clean up. |
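
The first two drift classes in the table above reduce to a set comparison. The sketch below is a minimal illustration, assuming the YAML statuses and the live tmux session names have already been collected elsewhere; `classify_drift` is a hypothetical name, not a function in `status.sh` or `reconcile.sh`. Classes `C` and `D` depend on on-disk artifacts and are left out.

```python
def classify_drift(yaml_sessions, live_tmux_names):
    """Classify drift classes A and B by comparing YAML state to tmux state.

    yaml_sessions: dict mapping session name -> status string from the YAML.
    live_tmux_names: set of session names currently reported by tmux.
    The function only compares its inputs; it mutates nothing.
    """
    drifts = []
    for name, status in sorted(yaml_sessions.items()):
        # Class A: YAML says running, but tmux no longer has the session.
        if status == "running" and name not in live_tmux_names:
            drifts.append((name, "A"))
    for name in sorted(live_tmux_names):
        # Class B: tmux has a session that the YAML does not know about.
        if name not in yaml_sessions:
            drifts.append((name, "B"))
    return drifts


example = classify_drift(
    {"lab-creator-claude": "running", "lab-creator-agy": "stopped"},
    {"scratch"},
)
print(example)  # [('lab-creator-claude', 'A'), ('scratch', 'B')]
```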
## Pitfalls
- **Do NOT use this skill to drive mutations** — the output is a snapshot, not a call to action. If you need to fix drifts, dispatch `tmux-agent-orchestrate-monitor` (Kanban worker) or run `tmux-agent-orchestrate-resume` / `tmux-agent-orchestrate-stop` manually.
- **Do NOT use this skill to drive mutations** — the output is a snapshot, not a call to action. If you need to fix drifts, dispatch `multi-agent-mux-monitor` (Kanban worker) or run `multi-agent-mux-resume` / `multi-agent-mux-stop` manually.
- **Read-only is enforced by script**: `status.sh` opens the YAML with `open(path)` (no `'w'`), never calls `tmux kill-session`, never writes anywhere. The `reconcile.sh --dry-run` mode is the same path.
- **If `agent-sessions.yaml` is malformed** — print the YAML error verbatim and exit 1. Do NOT attempt recovery (that's `tmux-agent-orchestrate-stop --purge-conversation` or manual edit's job).
- **Sessions outside the `<workspace>-creator-*` naming convention** are still shown but tagged `ad-hoc` — they didn't go through `tmux-agent-orchestrate-create` and aren't tracked in YAML.
- **If `agent-sessions.yaml` is malformed** — print the YAML error verbatim and exit 1. Do NOT attempt recovery (that's `multi-agent-mux-stop --purge-conversation` or manual edit's job).
- **Sessions outside the `<workspace>-creator-*` naming convention** are still shown but tagged `ad-hoc` — they didn't go through `multi-agent-mux-create` and aren't tracked in YAML.
## When to use
@@ -119,6 +119,6 @@ lab-paper-pdf2md-creator-claude default running alive clau
## When NOT to use
- Continuous live tracking → `tmux-agent-orchestrate-monitor` (Kanban worker)
- Continuous live tracking → `multi-agent-mux-monitor` (Kanban worker)
- Recovering from corruption → manual edit + `.bak` restore
- Polling more than once a minute → `tmux-agent-orchestrate-monitor` (it dedupes)
- Polling more than once a minute → `multi-agent-mux-monitor` (it dedupes)
@@ -1,5 +1,5 @@
#!/usr/bin/env bash
# status.sh: helper script for tmux-agent-orchestrate-status (READ-ONLY)
# status.sh: helper script for multi-agent-mux-status (READ-ONLY)
# Prints the current agent session status table in a single call. No side effects.
# Reuses reconcile.sh --dry-run to compute drift (P1-E), then renders a table
# enriched from the YAML and on-disk data. Never modifies the YAML.
@@ -9,12 +9,12 @@ set -euo pipefail
source "$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)/lib.sh"
RECONCILE="$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)/tmux-agent-orchestrate-monitor/scripts/reconcile.sh"
RECONCILE="$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)/multi-agent-mux-monitor/scripts/reconcile.sh"
JSON=0
[ "${1:-}" = "--json" ] && JSON=1
[ -f "$AGENT_SESSIONS_YAML" ] || { echo "ERROR: $AGENT_SESSIONS_YAML not found. Run tmux-agent-orchestrate-create first." >&2; exit 1; }
[ -f "$AGENT_SESSIONS_YAML" ] || { echo "ERROR: $AGENT_SESSIONS_YAML not found. Run multi-agent-mux-create first." >&2; exit 1; }
# read-only drift snapshot — reconcile.sh --dry-run (no side effects)
DRIFT_JSON="$(bash "$RECONCILE" --once --emit-diff --dry-run)"
@@ -24,9 +24,9 @@ if [ "$JSON" = "1" ]; then
exit 0
fi
# Project root (parent of skills/) holds the tmux-agent-orchestrate-delegate-job .hermes registry.
# Project root (parent of .agents/) holds the multi-agent-mux-delegate-job .mam registry.
# Resolved relative to this script — no hardcoded absolute path (review item 6).
PROJECT_ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/../../.." && pwd)"
PROJECT_ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/../../../../" && pwd)"
DRIFT_JSON="$DRIFT_JSON" env_python "$AGENT_SESSIONS_YAML" PROJECT_ROOT="$PROJECT_ROOT" <<'PYEOF'
import os, json, glob
@@ -95,9 +95,9 @@ def get_job_status(s):
# Candidate locations (review item 6: project-root-relative, no hardcoded abs paths):
# 1) cwd-relative registry 2) project-root registry 3) project-root audit log
candidates = [
os.path.join('.hermes', 'jobs', f"{jid}.json"),
os.path.join(project_root, '.hermes', 'jobs', f"{jid}.json"),
os.path.join(project_root, '.hermes', 'delegate_job_logs', jid, 'status.json'),
os.path.join('.mam', 'jobs', f"{jid}.json"),
os.path.join(project_root, '.mam', 'jobs', f"{jid}.json"),
os.path.join(project_root, '.mam', 'delegate_job_logs', jid, 'status.json'),
]
for path in candidates:
if os.path.exists(path):
@@ -1,6 +1,6 @@
---
name: tmux-agent-orchestrate-stop
description: "Stop an agent tmux session (claude, antigravity/agy) and update .hermes/agent-sessions.yaml. Default stops gracefully and marks status=stopped with conversation preserved for resume. Does NOT delete on-disk conversation artifacts (jsonl/db) — those are preserved unless --purge-conversation is passed. Use when ending a work session, switching to a different one, or cleaning up before a fresh start."
name: multi-agent-mux-stop
description: "Stop an agent tmux session (claude, antigravity/agy) and update .mam/agent-sessions.yaml. Default stops gracefully and marks status=stopped with conversation preserved for resume. Does NOT delete on-disk conversation artifacts (jsonl/db) — those are preserved unless --purge-conversation is passed. Use when ending a work session, switching to a different one, or cleaning up before a fresh start."
version: 1.0.0
author: godopu
license: MIT
@@ -9,23 +9,23 @@ environments: [terminal, tmux]
metadata:
hermes:
tags: [agent, tmux, claude, antigravity, agy, multi-agent, stop, terminate, cleanup]
related_skills: [tmux-agent-orchestrate-create, tmux-agent-orchestrate-resume, tmux-agent-orchestrate-monitor]
prereq_skills: [tmux-agent-orchestrate-create, tmux-agent-orchestrate-resume]
related_skills: [multi-agent-mux-create, multi-agent-mux-resume, multi-agent-mux-monitor]
prereq_skills: [multi-agent-mux-create, multi-agent-mux-resume]
---
# Multi-Agent Stop — Stop an Agent tmux Session
> **Companion skills**: `tmux-agent-orchestrate-create` (start), `tmux-agent-orchestrate-resume` (re-attach), `tmux-agent-orchestrate-monitor` (live status).
> **Companion skills**: `multi-agent-mux-create` (start), `multi-agent-mux-resume` (re-attach), `multi-agent-mux-monitor` (live status).
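> **Tmux Isolation**: The `stop` command automatically parses the `tmux_server` field in the YAML and safely terminates (kills) the session on that isolated server, so there is no need to set the `TMUX_SERVER_NAME` environment variable manually.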
> **Single source of truth**: `./.hermes/agent-sessions.yaml`.
> **Single source of truth**: `./.mam/agent-sessions.yaml`.
## What this skill does
Stop an agent's tmux session gracefully, resolve and store the conversation ID, and **mark the YAML entry (status=stopped)**. Preserves:
- The tmux session's recorded `pane.pid / cmd / cwd / mcp_attachments` for audit
- The agent's on-disk conversation (claude `*.jsonl`, agy `conversations/*.db`) — so the user can `tmux-agent-orchestrate-resume` later
- The `start_command` so a future `tmux-agent-orchestrate-create --session <name>` reproduces the same tmux spec
- The agent's on-disk conversation (claude `*.jsonl`, agy `conversations/*.db`) — so the user can `multi-agent-mux-resume` later
- The `start_command` so a future `multi-agent-mux-create --session <name>` reproduces the same tmux spec
The stop command is always **graceful by default**:
1. Sends exit keys to the agent TUI (`/exit` for Claude, `Exit` for Agy) and waits 3 seconds.
@@ -37,7 +37,7 @@ The stop command is always **graceful by default**:
```bash
SESSION_NAME=<workspace>-creator-<agent> # convention
AGENT_SESSIONS_YAML=.hermes/agent-sessions.yaml
AGENT_SESSIONS_YAML=.mam/agent-sessions.yaml
# 1) Session is registered?
python3 -c "
@@ -45,7 +45,7 @@ import yaml
d = yaml.safe_load(open('$AGENT_SESSIONS_YAML'))
names = [s['name'] for s in d.get('tmux_sessions', [])]
if '$SESSION_NAME' not in names:
print('NOT in YAML — refusing to stop (no audit trail). Use tmux-agent-orchestrate-create first, or pass --force-no-yaml.')
print('NOT in YAML — refusing to stop (no audit trail). Use multi-agent-mux-create first, or pass --force-no-yaml.')
raise SystemExit(1)
"
@@ -65,16 +65,16 @@ fi
```bash
# 1. Stop gracefully (default — captures ID, shuts down safely, status=stopped)
bash skills/tmux-agent-orchestrate-stop/scripts/stop_session.sh \
bash .agents/skills/multi-agent-mux-stop/scripts/stop_session.sh \
--session "$SESSION_NAME"
# 2. Stop gracefully + record a custom stop reason
bash skills/tmux-agent-orchestrate-stop/scripts/stop_session.sh \
bash .agents/skills/multi-agent-mux-stop/scripts/stop_session.sh \
--session "$SESSION_NAME" --reason api_error
# 3. Stop gracefully + clean up on-disk conversation (DANGEROUS)
# — this prevents any future resume (status=terminated, resumable=false).
bash skills/tmux-agent-orchestrate-stop/scripts/stop_session.sh \
bash .agents/skills/multi-agent-mux-stop/scripts/stop_session.sh \
--session "$SESSION_NAME" --purge-conversation
```
@@ -94,19 +94,19 @@ If `--purge-conversation` is used: `status: terminated`, `terminated_at`, `termi
The script:
1. Verifies the session is in agent-sessions.yaml
2. If `delegate_job_id` is set, automatically publishes a `progress --detail "terminating"` event to the tmux-agent-orchestrate-delegate-job registry
2. If `delegate_job_id` is set, automatically publishes a `progress --detail "terminating"` event to the multi-agent-mux-delegate-job registry
3. Captures the `last_visible_status` from `tmux capture-pane` (so we have a final TUI snapshot for audit)
4. Attempts graceful exit keys → SIGTERM kill-session → SIGKILL fallback
5. For `purge-conversation`: deletes `~/.claude/projects/.../jsonl` (claude) or `~/.gemini/antigravity-cli/conversations/...db` + `brain/...` (agy)
6. Updates the YAML entry and SQLite database atomically
7. If `delegate_job_id` is set, publishes a `completed` event to the tmux-agent-orchestrate-delegate-job registry
7. If `delegate_job_id` is set, publishes a `completed` event to the multi-agent-mux-delegate-job registry
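
The escalation in step 4 can be pictured as an ordered plan of commands. The sketch below only builds that plan and runs nothing. The exit-key mapping mirrors the `case` statement in `stop_session.sh`; the function name `stop_plan`, the trailing `Enter` key, and the `kill -9` on the pane PID are assumptions made for illustration, and the real script may differ.

```python
# Exit keys per agent, as listed in stop_session.sh; unknown agents fall back to /exit.
EXIT_KEYS = {"claude": "/exit", "agy": "Exit", "hermes": "/exit", "cline": "/exit"}


def stop_plan(session, agent, pane_pid=None):
    """Return the ordered argv lists for a graceful stop, without executing them."""
    exit_key = EXIT_KEYS.get(agent, "/exit")
    plan = [
        # 1) Ask the agent TUI to exit on its own.
        ["tmux", "send-keys", "-t", session, exit_key, "Enter"],
        # 2) If the session is still alive after the wait, kill the tmux session.
        ["tmux", "kill-session", "-t", session],
    ]
    if pane_pid is not None:
        # 3) Last resort (assumed form): force-kill the recorded pane process.
        plan.append(["kill", "-9", str(pane_pid)])
    return plan
```

Keeping the plan as data makes the order easy to audit before anything touches a live session; a caller would wait between steps and stop as soon as the session is gone.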
## Pitfalls
- **Don't delete on-disk artifacts by default** — the agent's `*.jsonl` / `conversations/*.db` is the data that `tmux-agent-orchestrate-resume` needs. `--purge-conversation` is for when the user is genuinely done with the conversation and wants zero recovery chance.
- **YAML is append-only until you write a stop** — if a previous run left the entry as `running` but tmux is actually dead (crash, host reboot), the YAML is stale. Running `tmux-agent-orchestrate-stop` will detect "tmux already dead, just update YAML" and proceed.
- **Don't delete the `claude_session_id_own: null` placeholder** — when the user creates a fresh session with `tmux-agent-orchestrate-create` and never sent a message, the entry has `claude_session_id_own: null`. Stopping must preserve that field.
- **Monitor skill may still be tracking** — if `tmux-agent-orchestrate-monitor` is running a heartbeat loop, stopping a session while it watches will trigger its `tmux ls != yaml` reconciliation. That's expected — let the monitor run, it will mark the entry as `terminated` on its own.
- **Don't delete on-disk artifacts by default** — the agent's `*.jsonl` / `conversations/*.db` is the data that `multi-agent-mux-resume` needs. `--purge-conversation` is for when the user is genuinely done with the conversation and wants zero recovery chance.
- **YAML is append-only until you write a stop** — if a previous run left the entry as `running` but tmux is actually dead (crash, host reboot), the YAML is stale. Running `multi-agent-mux-stop` will detect "tmux already dead, just update YAML" and proceed.
- **Don't delete the `claude_session_id_own: null` placeholder** — when the user creates a fresh session with `multi-agent-mux-create` and never sent a message, the entry has `claude_session_id_own: null`. Stopping must preserve that field.
- **Monitor skill may still be tracking** — if `multi-agent-mux-monitor` is running a heartbeat loop, stopping a session while it watches will trigger its `tmux ls != yaml` reconciliation. That's expected — let the monitor run, it will mark the entry as `terminated` on its own.
## Verification
@@ -133,4 +133,4 @@ print(f' preserved: pane.pid={s[\"pane\"][\"pid\"]}, cmd={s[\"pane\"][\"cmd\"]}
- **Just detaching** → `tmux detach` (Ctrl-B d) or just close the terminal. The tmux session keeps running.
- **Stopping the agent inside but keeping tmux** → send `Ctrl-C` or `/exit` (claude) / `Ctrl-D` (agy) via `tmux send-keys`. The tmux session stays but the agent process is gone.
- **Replacing an existing session with a new one** → `tmux-agent-orchestrate-stop` first, then `tmux-agent-orchestrate-create`.
- **Replacing an existing session with a new one** → `multi-agent-mux-stop` first, then `multi-agent-mux-create`.
@@ -1,5 +1,5 @@
#!/usr/bin/env bash
# stop_session.sh: helper script for tmux-agent-orchestrate-stop
# stop_session.sh: helper script for multi-agent-mux-stop
# Usage:
# bash stop_session.sh --session <name> [--agent claude|agy] \
# [--mode soft|hard] [--purge-conversation] [--yes]
@@ -76,6 +76,7 @@ if [ -z "$AGENT" ]; then
*-creator-claude) AGENT=claude ;;
*-creator-agy) AGENT=agy ;;
*-creator-hermes) AGENT=hermes ;;
*-creator-cline) AGENT=cline ;;
*) echo "ERROR: cannot infer agent from '$SESSION_NAME'; pass --agent" >&2; exit 2 ;;
esac
fi
@@ -139,7 +140,7 @@ fi
if [ "$PURGE" = "1" ] && [ "$YES" != "1" ]; then
echo "DANGER: --purge-conversation will DELETE this workspace's on-disk conversation."
echo " workspace: ${TARGET_CWD:-<unknown>}"
echo " This means: no future tmux-agent-orchestrate-resume for this session."
echo " This means: no future multi-agent-mux-resume for this session."
echo " Re-run with --yes to confirm."
exit 3
fi
@@ -184,6 +185,7 @@ graceful_stop() {
claude) exitkey="/exit" ;;
agy) exitkey="Exit" ;;
hermes) exitkey="/exit" ;;
cline) exitkey="/exit" ;;
*) exitkey="/exit" ;;
esac
echo "graceful: send-keys '$exitkey' to $SESSION_NAME"
@@ -263,6 +265,8 @@ if captured and not purge:
target['agy_conversation_id_own'] = captured
elif agent == 'hermes':
target['hermes_conversation_id_own'] = captured
elif agent == 'cline':
target['cline_conversation_id_own'] = captured
target['resumable'] = True
# --purge-conversation: delete only the on-disk artifacts of the workspace-isolated UUID (P0-C)
@@ -294,15 +298,21 @@ if purge and purge_uuid:
if os.path.exists(hdb):
try:
import sqlite3
conn = sqlite3.connect(hdb)
conn.execute("DELETE FROM sessions WHERE id=?", (purge_uuid,))
conn.execute("DELETE FROM messages WHERE session_id=?", (purge_uuid,))
conn.commit()
conn.close()
hconn = sqlite3.connect(hdb)
hconn.execute("DELETE FROM sessions WHERE id=?", (purge_uuid,))
hconn.execute("DELETE FROM messages WHERE session_id=?", (purge_uuid,))
hconn.commit()
hconn.close()
print(f"purged db records for session: {purge_uuid}", flush=True)
except Exception as e:
print(f"WARN: purge hermes db records failed: {e}", flush=True)
target['hermes_conversation_id_own'] = None
elif agent == 'cline':
sessions_dir = f"{home}/.cline/data/sessions/{purge_uuid}"
if os.path.isdir(sessions_dir):
shutil.rmtree(sessions_dir)
print(f"purged: {sessions_dir}", flush=True)
target['cline_conversation_id_own'] = None
# agent_identities is a cache; clear it only when it belongs to this workspace
ai = (d.get('agent_identities') or {}).get(agent) or {}
if ai.get('project_cwd') == ws:
@@ -317,6 +327,8 @@ if purge and purge_uuid:
ai['conversation_brain_dir'] = None
elif agent == 'hermes' and ai.get('session_id') == purge_uuid:
ai['session_id'] = None
elif agent == 'cline' and ai.get('session_id') == purge_uuid:
ai['session_id'] = None
elif purge and not purge_uuid:
print("WARN: --purge-conversation requested but no workspace-scoped UUID resolved; nothing purged", flush=True)
@@ -337,5 +349,5 @@ echo " captured: ${CAPTURED_UUID:-<none>}"
echo " purge: $PURGE${PURGE_UUID:+ (uuid $PURGE_UUID)}"
echo " time: $NOW_ISO"
echo
echo "Recovery: the same context can be restored with tmux-agent-orchestrate-create + tmux-agent-orchestrate-resume"
echo "Recovery: the same context can be restored with multi-agent-mux-create + multi-agent-mux-resume"
echo " (not restorable if --purge-conversation was used)"
+8 -8
@@ -1,5 +1,5 @@
# ---------------------------------------------------------------------------
# .env.example — committable template for the tmux-agent-orchestrate-* skills
# .env.example — committable template for the multi-agent-mux-* skills
#
# This file is tracked in git and contains NO secrets. To get a working local
# config, copy it to `.env` (which is git-ignored) and edit as needed:
@@ -20,12 +20,12 @@
# ===========================================================================
# Single source of truth for the agent session registry YAML.
#default: <workspace>/.hermes/agent-sessions.yaml
# AGENT_SESSIONS_YAML=/path/to/workspace/.hermes/agent-sessions.yaml
#default: <workspace>/.mam/agent-sessions.yaml
# AGENT_SESSIONS_YAML=/path/to/workspace/.mam/agent-sessions.yaml
# Where the monitor (reconcile.sh) keeps its drift-state cache.
#default: <workspace>/.cache/tmux-agent-orchestrate-monitor
# AGENT_SESSIONS_STATE_DIR=/path/to/workspace/.cache/tmux-agent-orchestrate-monitor
#default: <workspace>/.cache/multi-agent-mux-monitor
# AGENT_SESSIONS_STATE_DIR=/path/to/workspace/.cache/multi-agent-mux-monitor
# Root directory that holds Claude Code per-project conversation logs (*.jsonl).
#default: $HOME/.claude/projects
@@ -72,6 +72,6 @@
#default: (unset → no client key)
# MQTT_KEYFILE=/path/to/client.key
# Directory for delegate-job audit logs (sits beside .hermes/jobs/).
#default: <cwd>/.hermes/delegate_job_logs
# DELEGATE_JOB_LOGS_DIR=/path/to/workspace/.hermes/delegate_job_logs
# Directory for delegate-job audit logs (sits beside .mam/jobs/).
#default: <cwd>/.mam/delegate_job_logs
# DELEGATE_JOB_LOGS_DIR=/path/to/workspace/.mam/delegate_job_logs
+3 -3
@@ -8,7 +8,7 @@ test-sessions*.yaml.bak
test-sessions*.yaml.lock
# delegate-job temporary/runtime artifacts
.hermes/
.mam/
.venv/
__pycache__/
*.pyc
@@ -19,5 +19,5 @@ __pycache__/
!.env.example
# build/deploy HTML artifacts
skills/tmux-agent-orchestrate-delegate-job/USER_MANUAL.html
skills/tmux-agent-orchestrate-delegate-job/mqtt-broker-setup.html
.agents/skills/multi-agent-mux-delegate-job/USER_MANUAL.html
.agents/skills/multi-agent-mux-delegate-job/mqtt-broker-setup.html
+52 -32
@@ -10,30 +10,50 @@
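After cloning this project into a new environment, first get familiar with the locations and roles of its core components.
* `skills/`: A collection of shell scripts that launch multiple agents and process asynchronous jobs
* `.agents/`: Orchestration and custom agent skills directory
* `AGENT.md`: Definition of agent roles (PM, Worker, Reviewer) and event publication rules
* `AGENT.ko.md`: Definition of agent roles (PM, Worker, Reviewer) and event publication rules (Korean)
* `skills/`: A collection of shell scripts that launch multiple agents and process asynchronous jobs
* `lib.sh`: Core orchestration shell functions and virtual environment (venv) auto-loading library
* `tmux-agent-orchestrate-create/`: Script that starts isolated tmux agent sessions
* `tmux-agent-orchestrate-stop/`: Script that gracefully stops sessions and updates their state
* `tmux-agent-orchestrate-resume/`: Script that restores stopped agent sessions to their previous conversation state
* `tmux-agent-orchestrate-status/`: Script that queries the current running state of all agent sessions
* `tmux-agent-orchestrate-monitor/`: Monitor script that synchronizes tmux state with registry state
* `tmux-agent-orchestrate-delegate-job/`: Asynchronous job splitting and execution module
* `multi-agent-mux-create/`: Script that starts isolated tmux agent sessions
* `multi-agent-mux-stop/`: Script that gracefully stops sessions and updates their state
* `multi-agent-mux-resume/`: Script that restores stopped agent sessions to their previous conversation state
* `multi-agent-mux-status/`: Script that queries the current running state of all agent sessions
* `multi-agent-mux-monitor/`: Monitor script that synchronizes tmux state with registry state
* `multi-agent-mux-delegate-job/`: Asynchronous job splitting and execution module
* `requirements.txt`: Python dependency list (paho-mqtt, pyyaml)
* `scripts/`: Directory of Python scripts that run the core business logic
* `registry.py`: Job registration, claiming, and atomic file lock control (CLI supported)
* `job_subscriber.py`: Background event subscriber and audit log generator
* `publish_event.py`: Event publisher for execution states and error traps
* `mqtt_common.py`: Common MQTT broker connection utility
* `AGENT.md`: Definition of agent roles (PM, Worker, Reviewer) and event publication rules
* `MESSAGING.md`: Wire scheme guidelines for the MQTT messages exchanged between agents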
---
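## 2. Environment Configuration (.env)
## 2. Workspace Provisioning (Step 0, Reference Only)
The workspace must be installed before an agent reads this guide. This step has already been completed by the (human) developer using the installer. It is recorded here to give agents a factual reference for how the files were copied and installed when they verify the environment.
### Step 0.1: Run the Install Script (Already Done)
A human ran the following command to bring up the workspace through the installer: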
```bash
curl -fsSL https://git.godopu.com/tmpl/multi-agent-mux/raw/branch/main/deploy/install.sh | bash
```
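The installer validated system dependencies, fetched the repository, copied the `.agents/` skeleton for orchestration and the related documentation files (including this `BOOTSTRAP.ko.md`), and automatically set up the `.env` file and the `.venv` virtual environment.
> [!NOTE]
> **Verification only**: AI agents reading this guide must **not** re-run the installer script themselves. Proceed directly to the environment variable configuration and verification test steps.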
---
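## 3. Environment Configuration (.env)
To configure the messaging broker and execution paths, create and edit a local environment configuration file (`.env`).
### Step 2.1: Run the Generation Script
### Step 3.1: Run the Generation Script
Run the environment template copy script provided in the project root.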
```bash
@@ -44,11 +64,11 @@
./scripts/generate-env.sh --force
```
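### Step 2.2: Modify and Configure Environment Variables
### Step 3.2: Modify and Configure Environment Variables
Open the generated `.env` file and configure the settings as needed.
> [!NOTE]
> In the default `.env` file generated by `generate-env.sh`, every environment variable entry is commented out. If left commented out, paths relative to the local project root (`.hermes/`, etc.) and the default public broker address are applied automatically, so you can use the file as-is.
> In the default `.env` file generated by `generate-env.sh`, every environment variable entry is commented out. If left commented out, paths relative to the local project root (`.mam/`, etc.) and the default public broker address are applied automatically, so you can use the file as-is.
1. **MQTT Broker Setup (`MQTT_BROKER`)**:
* The default is the HiveMQ public broker (`broker.hivemq.com`), but for production work where security and privacy matter, changing it to a personal or private broker address is strongly recommended.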
@@ -60,15 +80,15 @@
> [!WARNING]
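> **Security Mode Default Notice**:
> The system's default setting is the **unauthenticated PoC mode**. If no `auth_token` is explicitly injected at job registration (or it is `null`), HMAC signature verification is skipped.
> In a public broker environment or an actual production phase, you must generate a unique random `auth_token` and inject it at job registration to enable HMAC signing. (For the detailed security rules, see the `2.3 Security Protocol` section of [MESSAGING.md](./MESSAGING.md) and [AGENT.ko.md](./AGENT.ko.md). Automatic token generation and injection via the CLI is planned as roadmap task `FW-N6`.)
> In a public broker environment or an actual production phase, you must generate a unique random `auth_token` and inject it at job registration to enable HMAC signing. (For the detailed security rules, see the `2.3 Security Protocol` section of [MESSAGING.md](./MESSAGING.md) and [AGENT.ko.md](.agents/AGENT.ko.md). Automatic token generation and injection via the CLI is planned as roadmap task `FW-N6`.)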
---
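## 3. Dependency and Virtual Environment Setup (Venv Setup)
## 4. Dependency and Virtual Environment Setup (Venv Setup)
Set up the Python 3 dependencies needed to run orchestration and MQTT messaging.
### Step 3.1: Build the Python Virtual Environment
### Step 4.1: Build the Python Virtual Environment
Create and activate a `.venv` virtual environment in the project root.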
```bash
@@ -79,30 +99,30 @@ python3 -m venv .venv
source .venv/bin/activate
```
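### Step 3.2: Install Dependency Packages
Install the dependencies listed in `requirements.txt` in the `tmux-agent-orchestrate-delegate-job` directory into the virtual environment.
### Step 4.2: Install Dependency Packages
Install the dependencies listed in `requirements.txt` in the `multi-agent-mux-delegate-job` directory into the virtual environment.
```bash
# Install dependency packages (pyyaml, paho-mqtt, etc.)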
pip install -r skills/tmux-agent-orchestrate-delegate-job/requirements.txt
pip install -r .agents/skills/multi-agent-mux-delegate-job/requirements.txt
```
---
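## 4. Directory Preparation and Security Monitoring Guide
## 5. Directory Preparation and Security Monitoring Guide
Confirm that the local registry directories for agent control state and job records were created correctly.
1. **Required local directory structure**:
* `.hermes/jobs/`: Directory where detailed metadata for registered asynchronous jobs is stored as files
* `.hermes/delegate_job_logs/`: Directory that holds the audit log (`events.ndjson`) recording every backplane event published by agents
* `.mam/jobs/`: Directory where detailed metadata for registered asynchronous jobs is stored as files
* `.mam/delegate_job_logs/`: Directory that holds the audit log (`events.ndjson`) recording every backplane event published by agents
2. **Git commit control (.gitignore)**:
* When initializing a new project, check `.gitignore` so that the files below are never committed to the repository. The `!.env.example` exception must be kept so that the template is preserved: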
```text
.env
.env.*
!.env.example
.hermes/
.mam/
.venv/
__pycache__/
*.pyc
@@ -110,19 +130,19 @@ pip install -r skills/tmux-agent-orchestrate-delegate-job/requirements.txt
---
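## 5. Runtime Environment Verification and Bootstrap Test
## 6. Runtime Environment Verification and Bootstrap Test
Run the checklist below to verify that the environment was set up safely and works as intended.
> [!IMPORTANT]
> All verification commands below must be run from the **project root directory** (where the `.hermes/` directory is directly visible), because the default job registry path is resolved relative to the project root as `./.hermes/jobs`.
> All verification commands below must be run from the **project root directory** (where the `.mam/` directory is directly visible), because the default job registry path is resolved relative to the project root as `./.mam/jobs`.
### Verification Test 1: Job Registry Runs Correctly
List the jobs to confirm that the Python scripts and venv libraries load correctly.
```bash
# Run with the virtual environment (.venv) Python interpreter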
.venv/bin/python3 skills/tmux-agent-orchestrate-delegate-job/scripts/registry.py list
.venv/bin/python3 .agents/skills/multi-agent-mux-delegate-job/scripts/registry.py list
```
* **Expected output**: An empty JSON array `[]` or the list of currently registered pending/running jobs should be printed successfully, with no error message.
@@ -131,35 +151,35 @@ Python 스크립트 및 venv 라이브러리가 올바르게 로드되는지 확
```bash
# 1. Register a temporary test job and capture the issued 8-digit hex job ID
JID=$(.venv/bin/python3 skills/tmux-agent-orchestrate-delegate-job/scripts/registry.py register \
JID=$(.venv/bin/python3 .agents/skills/multi-agent-mux-delegate-job/scripts/registry.py register \
--agent "test-agent" \
--prompt "Bootstrap check command" \
--timeout 120)
echo "Generated Job ID: $JID"
# 2. Start the background event subscriber for the captured job ID
.venv/bin/python3 skills/tmux-agent-orchestrate-delegate-job/scripts/job_subscriber.py --job "$JID" &
.venv/bin/python3 .agents/skills/multi-agent-mux-delegate-job/scripts/job_subscriber.py --job "$JID" &
# 3. Wait 2 seconds so the subscriber's MQTT broker socket connection and receiver initialization can complete
sleep 2
# 4. Publish the test start event (following the Subscribe-before-Publish principle)
.venv/bin/python3 skills/tmux-agent-orchestrate-delegate-job/scripts/publish_event.py \
.venv/bin/python3 .agents/skills/multi-agent-mux-delegate-job/scripts/publish_event.py \
--job "$JID" \
--event started \
--detail "Bootstrap MQTT verification connection check"
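# 5. Confirm that the received event is recorded on the terminal (stdout) and in the .hermes/delegate_job_logs/events.ndjson log file
# 5. Confirm that the received event is recorded on the terminal (stdout) and in the .mam/delegate_job_logs/events.ndjson log file
# 6. After verification, stop the background process and manually clean up the test job record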
kill %1
rm -f ".hermes/jobs/$JID.json" ".hermes/jobs/$JID.lock"
rm -f ".mam/jobs/$JID.json" ".mam/jobs/$JID.lock"
```
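
The Subscribe-before-Publish ordering in the test above matters because a broker drops a non-retained message that has no subscriber at publish time. The sketch below illustrates this with a tiny in-memory stand-in rather than a real MQTT client. `TinyBroker` and the topic string are hypothetical and do not reflect the project's actual topic scheme.

```python
class TinyBroker:
    """In-memory stand-in for a broker that does not retain messages."""

    def __init__(self):
        self.subscribers = {}

    def subscribe(self, topic, callback):
        self.subscribers.setdefault(topic, []).append(callback)

    def publish(self, topic, payload):
        # A message with no subscriber is dropped; return how many were delivered.
        delivered = 0
        for callback in self.subscribers.get(topic, []):
            callback(payload)
            delivered += 1
        return delivered


broker = TinyBroker()
received = []
topic = "jobs/deadbeef/events"  # hypothetical topic name

lost = broker.publish(topic, "started")   # published too early: nobody is listening
broker.subscribe(topic, received.append)
kept = broker.publish(topic, "started")   # published after subscribing

print(lost, kept, received)  # 0 1 ['started']
```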
---
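## 6. Agent Onboarding Guide (New Agent Onboarding)
## 7. Agent Onboarding Guide (New Agent Onboarding)
Once this environment setup is complete, the collaborating agent should immediately read **[AGENT.ko.md](./AGENT.ko.md)** in the project root.
Once this environment setup is complete, the collaborating agent should immediately read **[AGENT.ko.md](.agents/AGENT.ko.md)** in the `.agents/` directory.
That document describes the **surgical change rules, cross-verification pass rules, and the snapshot pattern for preventing Tmux viewport loss** that an agent must follow when running in each role (PM, Worker, Reviewer), so that it can contribute to a stable multi-agent workflow right away.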
+52 -32
@@ -10,30 +10,50 @@ A new agent can follow the steps in this guide sequentially to establish a stabl
Before cloning this project into a new environment, you must first understand the locations and roles of its core components:
* `skills/`: A collection of shell scripts that execute multi-agent coordination and asynchronous job processing.
* `.agents/`: Orchestration and custom agent skills root.
* `AGENT.md`: Definition of agent roles (PM, Worker, Reviewer) and event publication rules.
* `AGENT.ko.md`: Definition of agent roles (PM, Worker, Reviewer) and event publication rules (Korean).
* `skills/`: A collection of shell scripts that execute multi-agent coordination and asynchronous job processing.
* `lib.sh`: The core orchestration shell functions and virtual environment (venv) auto-loading library.
* `tmux-agent-orchestrate-create/`: Script to launch isolated tmux agent sessions.
* `tmux-agent-orchestrate-stop/`: Script to gracefully stop agent sessions and update states.
* `tmux-agent-orchestrate-resume/`: Script to restore stopped agent sessions back to their previous conversation state.
* `tmux-agent-orchestrate-status/`: Script to query the current running state of all agent sessions.
* `tmux-agent-orchestrate-monitor/`: Monitor script to sync tmux states with the registry.
* `tmux-agent-orchestrate-delegate-job/`: Asynchronous job splitting and delegation module.
* `multi-agent-mux-create/`: Script to launch isolated tmux agent sessions.
* `multi-agent-mux-stop/`: Script to gracefully stop agent sessions and update states.
* `multi-agent-mux-resume/`: Script to restore stopped agent sessions back to their previous conversation state.
* `multi-agent-mux-status/`: Script to query the current running state of all agent sessions.
* `multi-agent-mux-monitor/`: Monitor script to sync tmux states with the registry.
* `multi-agent-mux-delegate-job/`: Asynchronous job splitting and delegation module.
* `requirements.txt`: Python dependency list (`paho-mqtt`, `pyyaml`).
* `scripts/`: Python scripts running the core business logic.
* `registry.py`: Job registration, claiming, and atomic file lock control (CLI supported).
* `job_subscriber.py`: Background event subscriber and audit log generator.
* `publish_event.py`: Event publisher for runtime states and error traps.
* `mqtt_common.py`: Common utility for connecting to the MQTT broker.
* `AGENT.md`: Definition of agent roles (PM, Worker, Reviewer) and event publication rules.
* `MESSAGING.md`: Messaging scheme and wire protocol guidelines for MQTT communication between agents.
---
## 2. Environment Configuration (.env)
## 2. Workspace Provisioning (Step 0 — Reference Only)
Before any agent can read this guide, the workspace must be provisioned. This step has already been performed by the human developer using the installer. It is documented here strictly as a reference for verifying agents to understand how the assets arrived.
### Step 0.1: Run the Installer One-Liner (Pre-Cloned/Provisioned)
The human developer bootstrapped the workspace by running:
```bash
curl -fsSL https://git.godopu.com/tmpl/multi-agent-mux/raw/branch/main/deploy/install.sh | bash
```
The installer verified system dependencies, staged the repository, copied the runtime `.agents/` scaffolding and documentation files (including this `BOOTSTRAP.md`), and set up the default `.env` and `.venv` environments.
> [!NOTE]
> **Verify Only**: AI agents reading this guide must **not** attempt to run the installation script again. Proceed directly to configuration and verification steps.
---
## 3. Environment Configuration (.env)
To set up the messaging broker and execution paths, you must create and modify a local environment configuration file (`.env`).
### Step 2.1: Run the Generation Script
### Step 3.1: Run the Generation Script
Run the environment template copy script provided in the project root:
```bash
@@ -44,11 +64,11 @@ Run the environment template copy script provided in the project root:
./scripts/generate-env.sh --force
```
### Step 2.2: Modify Environment Variables
### Step 3.2: Modify Environment Variables
Open the generated `.env` file to configure settings as needed.
> [!NOTE]
> The default `.env` file generated by `generate-env.sh` has all environment variables commented out. If left commented out, the system defaults to using relative paths (`.hermes/`, etc.) relative to the local project root, and the public MQTT broker. You can use it as-is without uncommenting anything.
> The default `.env` file generated by `generate-env.sh` has all environment variables commented out. If left commented out, the system defaults to using relative paths (`.mam/`, etc.) relative to the local project root, and the public MQTT broker. You can use it as-is without uncommenting anything.
1. **MQTT Broker Setup (`MQTT_BROKER`)**:
* The default broker is HiveMQ's public sandbox broker (`broker.hivemq.com`). However, for production work where security and privacy are critical, we strongly recommend changing this to a private broker address.
@@ -60,15 +80,15 @@ Open the generated `.env` file to configure settings as needed.
> [!WARNING]
> **Security Mode Default Warning**:
> The system's default setting is the **unauthenticated PoC mode**. If an `auth_token` is not explicitly provided (or is `null`) during job registration, HMAC signature verification is skipped.
> In a public broker environment or production phase, you must generate and inject a unique random `auth_token` during job registration to enable HMAC signature security. (For detailed security protocols, refer to section `2.3 Security Protocol` in [MESSAGING.md](./MESSAGING.md) and [AGENT.md](./AGENT.md). Automated token generation and injection via CLI is on the roadmap under task `FW-N6`.)
> In a public broker environment or production phase, you must generate and inject a unique random `auth_token` during job registration to enable HMAC signature security. (For detailed security protocols, refer to section `2.3 Security Protocol` in [MESSAGING.md](./MESSAGING.md) and [AGENT.md](.agents/AGENT.md). Automated token generation and injection via CLI is on the roadmap under task `FW-N6`.)
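To make the warning above concrete, here is a minimal sketch of generating a random `auth_token` and signing or verifying an event with HMAC. It is illustrative only: the function names, the SHA-256 digest, the canonical-JSON serialization, and the event field names are assumptions made for this example. The authoritative wire format is the one defined in section `2.3 Security Protocol` of MESSAGING.md.

```python
import hashlib
import hmac
import json
import secrets
from typing import Optional


def new_auth_token(nbytes: int = 32) -> str:
    """Return a random hex token (hypothetical helper, not part of registry.py)."""
    return secrets.token_hex(nbytes)


def sign_event(auth_token: str, event: dict) -> str:
    """Sign an event dict with HMAC-SHA256 over a canonical JSON encoding.

    The canonical form (sorted keys, no whitespace) is an assumption of this sketch.
    """
    body = json.dumps(event, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(auth_token.encode("utf-8"), body, hashlib.sha256).hexdigest()


def verify_event(auth_token: Optional[str], event: dict, signature: str) -> bool:
    """Verify a signature; a missing token mirrors the unauthenticated PoC mode."""
    if auth_token is None:
        # PoC mode as described above: verification is skipped entirely.
        return True
    expected = sign_event(auth_token, event)
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(expected, signature)
```

A tampered payload fails verification because the digest covers the whole serialized event, while a `None` token accepts anything, which is why the PoC default is unsafe on a public broker.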
---
## 3. Dependency and Virtualenv Setup
## 4. Dependency and Virtualenv Setup
Set up the Python 3 dependencies required to run the orchestration and MQTT messaging backplane.
### Step 3.1: Build Python Virtual Environment
### Step 4.1: Build Python Virtual Environment
Create and activate a `.venv` virtual environment in the project root:
```bash
@@ -79,30 +99,30 @@ python3 -m venv .venv
source .venv/bin/activate
```
### Step 3.2: Install Dependency Packages
Install the required packages listed in `requirements.txt` under `tmux-agent-orchestrate-delegate-job`:
### Step 4.2: Install Dependency Packages
Install the required packages listed in `requirements.txt` under `multi-agent-mux-delegate-job`:
```bash
# Install dependencies (pyyaml, paho-mqtt, etc.)
pip install -r skills/tmux-agent-orchestrate-delegate-job/requirements.txt
pip install -r .agents/skills/multi-agent-mux-delegate-job/requirements.txt
```
---
## 4. Directory Structure and Security Audit Guide
## 5. Directory Structure and Security Audit Guide
Ensure that the local registry directories required to track agent states and jobs are successfully created:
1. **Required Directory Structure**:
* `.hermes/jobs/`: Holds detailed metadata files for registered asynchronous jobs.
* `.hermes/delegate_job_logs/`: Holds the audit logs (`events.ndjson`) for all backplane events published by agents.
* `.mam/jobs/`: Holds detailed metadata files for registered asynchronous jobs.
* `.mam/delegate_job_logs/`: Holds the audit logs (`events.ndjson`) for all backplane events published by agents.
2. **Git Ignore Configuration (`.gitignore`)**:
* When initializing a new project, verify that the following entries are configured in `.gitignore` to prevent committing local runtimes to the repository. The exception `!.env.example` must be kept to preserve the template:
```text
.env
.env.*
!.env.example
.hermes/
.mam/
.venv/
__pycache__/
*.pyc
@@ -110,19 +130,19 @@ Ensure that the local registry directories required to track agent states and jo
---
## 5. Execution Verification and Bootstrap Tests
## 6. Execution Verification and Bootstrap Tests
To verify that the environment has been successfully built without runtime errors, run the following verification checklist.
> [!IMPORTANT]
> All verification commands below must be executed from the **project root directory** (where the `.hermes/` directory is directly visible). This is because the default job registry path resolved by scripts is relative to the current working directory under `./.hermes/jobs`.
> All verification commands below must be executed from the **project root directory** (where the `.mam/` directory is directly visible). This is because the default job registry path resolved by scripts is relative to the current working directory under `./.mam/jobs`.
### Verification Test 1: Registry Script Load Check
Verify that the Python scripts and virtual environment libraries load correctly by listing jobs:
```bash
# Run using the python interpreter in the virtual environment
.venv/bin/python3 skills/tmux-agent-orchestrate-delegate-job/scripts/registry.py list
.venv/bin/python3 .agents/skills/multi-agent-mux-delegate-job/scripts/registry.py list
```
* **Expected Output**: The command should exit successfully and print an empty JSON array `[]` (or a list of pending/running jobs if any exist) without any python traceback errors.
@@ -131,36 +151,36 @@ Test the end-to-end communication through the broker to verify that events are p
```bash
# 1. Register a temporary test job and capture its 8-character Hex Job ID
JID=$(.venv/bin/python3 skills/tmux-agent-orchestrate-delegate-job/scripts/registry.py register \
JID=$(.venv/bin/python3 .agents/skills/multi-agent-mux-delegate-job/scripts/registry.py register \
--agent "test-agent" \
--prompt "Bootstrap check command" \
--timeout 120)
echo "Generated Job ID: $JID"
# 2. Run the background event subscriber (Subscriber) for this Job ID
.venv/bin/python3 skills/tmux-agent-orchestrate-delegate-job/scripts/job_subscriber.py --job "$JID" &
.venv/bin/python3 .agents/skills/multi-agent-mux-delegate-job/scripts/job_subscriber.py --job "$JID" &
# 3. Wait 2 seconds to allow the Subscriber to establish its MQTT socket connection
sleep 2
# 4. Publish a start event (adhering to the Subscribe-before-Publish rule)
.venv/bin/python3 skills/tmux-agent-orchestrate-delegate-job/scripts/publish_event.py \
.venv/bin/python3 .agents/skills/multi-agent-mux-delegate-job/scripts/publish_event.py \
--job "$JID" \
--event started \
--detail "Bootstrap MQTT verification connection check"
# 5. Verify that the event is printed to stdout and written to the audit log:
# .hermes/delegate_job_logs/events.ndjson
# .mam/delegate_job_logs/events.ndjson
# 6. Stop the background subscriber and clean up the test job records
kill %1
rm -f ".hermes/jobs/$JID.json" ".hermes/jobs/$JID.lock"
rm -f ".mam/jobs/$JID.json" ".mam/jobs/$JID.lock"
```
---
## 6. Onboarding Collaborating Agents (New Agent Onboarding)
## 7. Onboarding Collaborating Agents (New Agent Onboarding)
Once the setup is verified, onboarding agents should immediately read the **[AGENT.md](./AGENT.md)** guidelines in the project root.
Once the setup is verified, onboarding agents should immediately read the **[AGENT.md](.agents/AGENT.md)** guidelines in the .agents/ directory.
The guidelines describe essential workflows—such as **surgical change constraints, cross-verification review loops, and pane snapshotting to prevent viewport truncation**—allowing new agents to quickly and safely integrate with the multi-agent workflow.
+4 -2
@@ -7,7 +7,7 @@
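## Summary
- **Completed Items**: FW-01 ~ FW-16, FW-L1 ~ FW-L3, FW-N1 ~ FW-N4, Infra Pattern (24 items in total)
- **Completed Items**: FW-01 ~ FW-16, FW-L1 ~ FW-L3, FW-N1 ~ FW-N7, FW-W3, Infra Pattern (28 items in total)
- **Working tree**: clean
- **Verification Results**: All long-term tasks, newly discovered items, and analysis infrastructure improvements completed (cross-verification PASS by agy-existing and claude-existing)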
@@ -44,6 +44,7 @@
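| FW-N5 | Update the `job-protocol.md` security protocol spec (HMAC-signature based) | `6a88f10, 450722b` | Hermes direct | Documentation/design consistency pass completed (PASS) |
| FW-N6 | Support automatic `auth_token` generation and CLI integration in `registry.py` | `6a88f10` | Hermes direct | Added the `--auth-token` argument and automatic generation when a secure broker is detected (PASS) |
| FW-N7 | Defend against replay attacks in `job_subscriber.py` by validating that sequence numbers increase monotonically | `6a88f10` | Hermes direct | Implemented last_seq tracking and the monotonic seq check in the watcher (PASS) |
| FW-W3 | Consolidate per-job watchdogs into a single wildcard subscriber | `358c72b` | Antigravity | Removed watchdog.sh and consolidated event handling and the watchdog role into the single reconcile.sh --subscribe subscriber (PASS) |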
---
@@ -98,4 +99,5 @@ a6f7c04 feat(delegate-job): bump default --timeout 600s -> 3600s (1h wall-clock
## Date
2026-06-21 (Sun) 03:52 ~ 07:00 KST
- 2026-06-21 (Sun) 03:52 ~ 07:00 KST (FW-01 ~ FW-16, FW-L1 ~ FW-L3, FW-N1 ~ FW-N7, Infra)
- 2026-06-22 (Mon) 23:44 ~ KST (FW-W3)
+4 -2
@@ -7,7 +7,7 @@
## Summary
- **Completed Items**: FW-01 ~ FW-16, FW-L1 ~ FW-L3, FW-N1 ~ FW-N4, Infra Pattern (total of 24 items)
- **Completed Items**: FW-01 ~ FW-16, FW-L1 ~ FW-L3, FW-N1 ~ FW-N7, FW-W3, Infra Pattern (total of 28 items)
- **Working Tree**: clean
- **Verification Results**: All long-term tasks, newly discovered items, and analysis infrastructure improvements have been completed (mutual verification PASS from `agy-existing` and `claude-existing`).
@@ -44,6 +44,7 @@
| FW-N5 | Update `job-protocol.md` security protocol spec (to HMAC signatures) | `6a88f10, 450722b` | Hermes Direct | Documentation/Design consistency pass completed (PASS) |
| FW-N6 | Support auto-generated `auth_token` and CLI integration in `registry.py` | `6a88f10` | Hermes Direct | Added `--auth-token` argument, auto-generation on secure broker detection (PASS) |
| FW-N7 | Prevent replay attacks in `job_subscriber.py` by validating that sequence numbers increase monotonically | `6a88f10` | Hermes Direct | Added seq tracking in the watcher to verify monotonic increase (PASS) |
| FW-W3 | Consolidate per-job watchdogs into a shared wildcard subscriber | `358c72b` | Antigravity | Consolidated watchdog logic into `reconcile.sh --subscribe` and removed `watchdog.sh` (PASS) |
---
@@ -100,4 +101,5 @@ a6f7c04 feat(delegate-job): bump default --timeout 600s -> 3600s (1h wall-clock
## Date
2026-06-21 (Sun) 03:52 ~ 07:00 KST
- 2026-06-21 (Sun) 03:52 ~ 07:00 KST (FW-01 ~ FW-16, FW-L1 ~ FW-L3, FW-N1 ~ FW-N7, Infra)
- 2026-06-22 (Mon) 23:44 ~ KST (FW-W3)
+35 -3
@@ -1,18 +1,35 @@
# FUTURE_WORKS.md
> **Purpose**: Track future work candidates for the `tmux_agent_orchestration` project.
> **Purpose**: Track future work candidates for the `multi-agent-mux` project.
> For completed items, see `DONE.ko.md`.
> **Last Updated**: 2026-06-21
> **Last Updated**: 2026-06-24
---
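## Future Improvements Roadmap
The following are the currently pending future work items. They were proposed based on the security and concurrency analysis in the `Understand_Anything_Analysis.md` report.
The following are the currently pending future work items. They were proposed based on an analysis of the system's security, concurrency, portability, and workflows.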
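| ID | Task | Priority | Effort | Domain / Description | Dependencies |
|---|---|---|---|---|---|
| **FW-L4** | Migrate the Job Registry to SQLite and overcome NFS flock limitations | P3 (Low) | Large | **Concurrency / Infrastructure Scalability**: As with the session registry, migrate the job registry, which relies on per-file JSON locks (`fcntl.flock`), to a SQLite database transaction structure, fully securing stability on distributed/network file systems such as NFS | **Conditional** (start when an actual multi-host/NFS deployment need arises) |
| **FW-P1** | Remove GNU/Linux userland assumptions in lib.sh | P2 (Medium) | Small | **Portability**: Replace the GNU coreutils-only commands in `lib.sh` (`df --output=target` and the Linux mount-format parsing) with portable commands, closing the blind spot where NFS detection is silently disabled on macOS/BSD | None |
| **FW-P2** | Provide an explicit concurrency control strategy for Windows | P1 (High) | Medium | **Portability / Concurrency**: Because `fcntl` is POSIX-only, importing `mqtt_common.py` raises an exception on failure; detect this at startup and exit early with a user-friendly warning, or dynamically map the lock mechanism to a Windows equivalent such as `msvcrt.locking`. The `_file_lock` that writes the event audit log keeps its best-effort (non-impacting) property as designed | None |
| **FW-P3** | Strengthen virtualenv loading and dependency pre-verification | P2 (Medium) | Medium | **Portability**: Beyond declaring the paho-mqtt 2.x dependency in requirements.txt, block virtualenv interpreter mismatches early under independent toolchains such as UV/Poetry, and add diagnostic logic that verifies at the entrypoint that the required libraries are available | None |
| **FW-P4** | Harden the default MQTT broker and namespace security | P1 (High) | Medium | **Portability / Security**: Instead of the public broker `broker.hivemq.com` and an open namespace, provide private TLS broker credentials as the default template to prevent remote session hijacking and eavesdropping at the source | None |
| **FW-P5** | Fix BASH_SOURCE path misbehavior under zsh | P2 (Medium) | Small | **Portability**: Fix the bug where, when `lib.sh` is sourced interactively in zsh, `${BASH_SOURCE[0]}` evaluates to empty and the skill path (`SKILL_DIR`) is set incorrectly | None |
| **FW-P6** | Detect the project root dynamically via marker-file lookup | P1 (High) | Medium | **Portability**: Resolve the fragility caused by hardcoding relative path depths such as `../..` in several scripts, including `lib.sh`, `status.sh`, and `reconcile.sh`. Apply an upward marker-file walk that looks for `.git`, `.mam`, `.env`, and similar markers, and unify on a single `WORKSPACE_ROOT` environment variable to stabilize orchestration | None |
| **FW-P7** | Strengthen HMAC signature verification and liveness checks on the monitor termination path | P1 (High) | Medium | **Portability / Security**: Resolve the risk that `reconcile.sh` force-terminates a session immediately on a `completed`/`error` event alone, without `verify_hmac` signature verification. Make security-token verification mandatory in the monitoring event handler (`on_message`), and check actual tmux liveness and the preservation state of expected artifacts before `kill-session` | None |
| **FW-W1** | Replace the global registry lock with fine-grained locks | P2 (Medium) | Medium | **Concurrency / Scalability**: Remove the bottleneck created when every session and progress/sequence update passes through the single global fcntl lock on `.mam/jobs/`. Introduce per-job lock files | None |
| **FW-W2** | Verify execution readiness to prevent blind TUI key input | P2 (Medium) | Large | **Workflow**: When creating, resuming, or stopping sessions, use terminal screen scraping or a readiness probe instead of a plain sleep (e.g., 6 seconds) to safely dismiss dialogs and exception prompts | None |
| **FW-W4** | Persist the subscriber sequence number (last_seq) to disk | P1 (High) | Medium | **Workflow / Security**: To prevent the structural weakness where the sequence counter resets when the watchdog restarts, record `subscriber.last_seq` to disk or a DB so that the replay defense covers the entire job lifetime | None |
| **FW-W5** | Define a structured message schema for reviewer verdicts | P2 (Medium) | Medium | **Workflow**: Instead of having the PM agent grep raw terminal scrollback strings, introduce a dedicated review feedback topic (e.g., `reviews/<job_id>/verdicts`) and a structured JSON format (`PASS`/`NOT_PASS` + blocking factors) | None |
| **FW-W6** | Extend the monitoring recovery loop to support the Hermes agent | P2 (Medium) | Medium | **Workflow / Consistency**: Fully incorporate `hermes` sessions into the auto-registration (drift-B) and ID synchronization (drift-C) logic in `reconcile.sh`, providing the same level of monitoring and recovery as Claude/Agy sessions | None |
| **FW-W7** | Resolve directory-path slug name collisions in derive_session_name | P2 (Medium) | Small | **Workflow / Collision Avoidance**: Slugifying only the last two directories causes session-name collisions for same-named nested directories (e.g., `/projectA/src` and `/projectB/src` slugify to the same session name); resolve this by applying a session naming rule that includes a workspace-scoped hash value | None |
| ~~**FW-D1**~~ | ✅ **RESOLVED (2026-06-24)**: the install script no longer extracts in-place | - | - | **Deploy / Safety**: `deploy/install.sh` now stages the download in a `mktemp -d` temporary directory, verifies that `.agents/skills/lib.sh` exists, and then copies only the runtime assets (`.agents/`, `.env.example`) into the target with per-file no-clobber guards (`[ ! -e ]`). Existing target files therefore always take precedence, and repo development docs never enter the workspace. The post-fetch sanity check was also changed to test a file rather than a directory | Done |
| **FW-D2** | Pin and verify the source the install script downloads before it is sourced | P2 (Medium) | Small | **Deploy / Supply Chain**: The install script clones/extracts the moving `main` branch over the network, and the workspace later `source`s those shell scripts (`lib.sh` and others). *Partially resolved (2026-06-24): the staged tree is verified to contain `.agents/skills/lib.sh` before copying.* **Remaining work:** pin to a release tag or commit SHA and verify a published checksum, guaranteeing content integrity and not only structural presence | None |
| **FW-D3** | Remove duplicated NFS detection logic between `install.sh` and `lib.sh` | P2 (Medium) | Small | **Deploy / Portability**: `deploy/install.sh` re-implements the GNU-only `df --output=target` + `mount` NFS check that already exists in `lib.sh::_check_is_nfs`. Extract a single shared helper so that the FW-P1 portability fix also covers this second copy and both call sites behave correctly on macOS/BSD | FW-P1 |
| **FW-D4** | Close the CI shellcheck coverage gap | P3 (Low) | Small | **Deploy / Quality**: `deploy/gitea-ci.yml` shellchecks only 5 scripts; `status.sh`, `resolve_session_id.sh`, `update_yaml_resumed.sh`, and `scripts/generate-env.sh` are not checked. Glob all tracked `*.sh` files so that new scripts are included automatically | None |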
---
@@ -20,3 +37,18 @@
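1. **Conditional Deferral of SQLite Integration (FW-L4)**:
* Unlike the session registry, per-job data is easier to manage and debug as JSON files, and the current deployment is limited to a single-host local FS where `fcntl.flock` locking alone is safe to operate. This item is therefore assigned low priority (P3) and will be started when needed.
2. **Explicit Concurrency Control Strategy for Windows (FW-P1, FW-P2)**:
* In a concurrency control system, a silent failover that simply runs without a lock on error is the most dangerous structure. When the `fcntl` module is missing on Windows, the entrypoint must not tolerate it silently: it should either issue an explicit early warning that directs the user to a POSIX environment or a dedicated wrapper, or dynamically map to the `msvcrt.locking` file control strategy, so that safety is ensured across platforms.
3. **Dynamic Root Anchoring via Marker Files (FW-P6)**:
* A structure that depends on a file's relative path depth (`../..` and the like) when resolving paths becomes a critical weakness when directories are refactored or wrappers are moved. Adopting a dynamic search up the directory tree for known root markers such as `.git` or `.mam` significantly improves script execution stability and porting speed.
4. **Stronger Control of Monitor Termination Authority (FW-P7)**:
* The authority to force-terminate a session (`tmux kill-session`) must be safely controlled. If the monitor (`reconcile.sh`) receives a wildcard topic without verification and cleans up sessions immediately, it becomes vulnerable to forged-injection attacks. The design mandates HMAC signature verification at the termination-event receiver and cross-checks that the expected work artifacts persist before a session is force-stopped.
5. **Consolidation of Per-Job Watchdogs into a Single Wildcard Subscriber (FW-W3)**:
* Instead of the `watchdog.sh` process model, which ran separately for each job and disconnected and reconnected every 2 minutes, event handling, HMAC security verification, and sequence tracking were fully unified under the always-running single wildcard subscriber, `reconcile.sh --subscribe`. This prevents unnecessary surges in MQTT connections at the source, simplifies session cleanup, and, through sequence tracking in an in-memory cache, reliably provides consistent replay-attack blocking for all concurrently running jobs.
6. **Deployment Install Script Hardening (FW-D1 ~ FW-D4)**:
* `deploy/install.sh` and the Gitea templates are the most recently added (after the DONE.md verification round) and least-reviewed area, and the only path that runs *before* the verified orchestration code executes. **FW-D1 (the release blocker) is now resolved (2026-06-24):** instead of the originally proposed `tar --exclude` denylist approach (review confirmed that it is non-portable and, more seriously, that the unanchored `--exclude="scripts"` pattern also removes the `scripts/` directories inside the skill tree, producing a silently broken install), the installer was restructured around temp-directory staging + an allowlist copy of runtime assets + per-file no-clobber guards. This resolves both the destructive-overwrite risk and the development-doc contamination at once. FW-D2 is partially resolved (structural verification of the staged tree before copying), and the remaining supply-chain hardening work is pinning the fetch to a tag/SHA + checksum. FW-D3 (NFS detection divergence, folded into FW-P1) and FW-D4 (CI lint coverage) remain as consistency/quality debt.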
+38 -3
@@ -1,18 +1,36 @@
# FUTURE_WORKS.md
> **Purpose**: Track future work candidates for the `tmux_agent_orchestration` project.
> **Purpose**: Track future work candidates for the `multi-agent-mux` project.
> For completed items, see `DONE.md`.
> **Last Updated**: 2026-06-21
> **Last Updated**: 2026-06-24
---
## Future Improvements Roadmap
Below is the list of pending future work items. These items were proposed based on the security and concurrency analysis in the `Understand_Anything_Analysis.md` report.
Below is the list of pending future work items. These items were proposed based on the security, concurrency, portability, and workflow analysis of the system.
| ID | Task | Priority | Effort | Domain / Description | Dependencies |
|---|---|---|---|---|---|
| **FW-L4** | Migrate Job Registry to SQLite to overcome NFS flock limitations | P3 (Low) | Large | **Concurrency/Infrastructure Scalability**: Similar to the Session Registry, migrate the individual JSON file lock (`fcntl.flock`) registry structure into an integrated SQLite database transaction structure, guaranteeing full reliability in distributed/network file systems like NFS. | **Conditional** (commence only when multi-host/NFS deployment is required) |
| **FW-P1** | Eliminate GNU/Linux userland assumptions in lib.sh | P2 (Medium) | Small | **Portability**: Replace GNU coreutils-specific commands (like `df --output=target` and Linux-specific mount formats) in `lib.sh` with portable equivalents, resolving silent failures of NFS detection on macOS/BSD. | None |
| **FW-P2** | Add explicit Windows concurrency strategy in mqtt_common.py | P1 (High) | Medium | **Portability / Concurrency**: Detect non-POSIX systems at module initialization and either fail fast with a descriptive warning or substitute alternative lock strategies (e.g. `msvcrt.locking`), while preserving the best-effort nature of the `_file_lock` log appender. | None |
| **FW-P3** | Align virtualenv loading and dependency verifications | P2 (Medium) | Medium | **Portability**: Prevent local interpreter mismatches in Poetry/UV environments and ensure the launch scripts fail early with clear diagnostic warnings if required Python dependencies are missing at startup. | None |
| **FW-P4** | Secure default MQTT broker and namespaces | P1 (High) | Medium | **Portability / Security**: Prevent remote session hijack and eavesdropping by providing a private TLS-enabled broker template rather than defaulting to `broker.hivemq.com` in public namespaces. | None |
| **FW-P5** | Resolve BASH_SOURCE path resolution under zsh | P2 (Medium) | Small | **Portability**: Fix `lib.sh` interactive sourcing issues under zsh shell where `${BASH_SOURCE[0]}` resolves to empty. | None |
| **FW-P6** | Anchor project root dynamically via marker-file lookup | P1 (High) | Medium | **Portability**: Resolve structural fragility caused by hardcoded `../..` relative directory traversal in `lib.sh`, `status.sh`, and `reconcile.sh`. Use an upward search for root markers (`.git`, `.mam`, `.env`) to export a single source of truth for `WORKSPACE_ROOT`. | None |
| **FW-P7** | Enforce HMAC verification and liveness checks on monitor termination | P1 (High) | Medium | **Portability / Security**: Prevent remote session killing by unauthorized or spoofed events. Integrate `verify_hmac` inside the monitor (`reconcile.sh`'s `on_message` handler) and confirm expected artifacts exist before executing `tmux kill-session`. | None |
| **FW-P8** | Unify `.env` loading in `lib.sh` to prevent split-brain path resolution | P1 (High) | Small | **Portability / Consistency**: Sourcing the `.env` file inside `lib.sh` is critical to prevent split-brain path resolution where shell scripts query the default session database path while Python scripts query a custom path defined in `.env`. Sourcing `.env` at the top of `lib.sh` ensures all shell utilities automatically inherit user overrides for `TMUX_SERVER_NAME`, `AGENT_SESSIONS_YAML`, etc. | None |
| **FW-W1** | Replace global registry lock with fine-grained locks | P2 (Medium) | Medium | **Concurrency / Scaling**: Eliminate throughput bottlenecks where all progress/sequence updates channel through a single fcntl lock on `.mam/jobs/`. Implement per-job lock files. | None |
| **FW-W2** | Implement readiness probes for blind TUI key inputs | P2 (Medium) | Large | **Workflow**: Replace fixed timing sleeps in create, resume, and stop scripts with dynamic terminal readiness probes (e.g. scrapers or CLI checking hooks) to dismiss trust dialogs robustly. | None |
| **FW-W4** | Persist subscriber sequence numbers alongside job records | P1 (High) | Medium | **Workflow / Security**: Persist `subscriber.last_seq` to disk or SQLite to prevent sequence counter reset on subscriber restart, locking down the replay defense window for the full job lifetime. | None |
| **FW-W5** | Define structured message schema for reviewer verdicts | P2 (Medium) | Medium | **Workflow**: Create a dedicated reviewer topic (e.g., `reviews/<job_id>/verdicts`) emitting structured JSON verdicts (`PASS` / `NOT_PASS` + details) to eliminate raw text grepping by the PM. | None |
| **FW-W6** | Expand monitor reconciliation support to Hermes agent | P2 (Medium) | Medium | **Workflow / Consistency**: Fully integrate `hermes` sessions into auto-registration (drift-B) and ID materialization (drift-C) under `reconcile.sh` to match Claude/Agy monitoring coverage. | None |
| **FW-W7** | Resolve path slug collisions in derive_session_name | P2 (Medium) | Small | **Workflow / Collision Avoidance**: Update `derive_session_name` to handle same-name nested directories (e.g. `/projectA/src` and `/projectB/src` both slugify to identical session names) by incorporating workspace-scoped identifiers or hash digests. | None |
| ~~**FW-D1**~~ | ✅ **RESOLVED (2026-06-24)** — installer no longer extracts in-place | — | — | **Deploy / Safety**: `deploy/install.sh` now stages the download into a `mktemp -d` dir, verifies `.agents/skills/lib.sh` is present, then copies only the runtime assets (`.agents/`, `.env.example`) into the target with per-file no-clobber guards (`[ ! -e ]`), so existing target files always win and repo dev docs never land in the workspace. The post-fetch sanity check now tests a file, not just the directory. | Done |
| **FW-D2** | Pin and verify the source the installer downloads before sourcing it | P2 (Medium) | Small | **Deploy / Supply-chain**: The installer clones/extracts the moving `main` branch over the network, and the workspace later `source`s those shell scripts (`lib.sh` et al.). *Partially addressed (2026-06-24): the staged tree is now verified to contain `.agents/skills/lib.sh` before any file is copied.* **Remaining:** pin to a release tag or commit SHA and/or verify a published checksum so the fetched content is integrity-checked, not merely structurally present. | None |
| **FW-D3** | De-duplicate NFS detection between `install.sh` and `lib.sh` | P2 (Medium) | Small | **Deploy / Portability**: `deploy/install.sh` re-implements the GNU-specific `df --output=target` + `mount` NFS check already present in `lib.sh::_check_is_nfs`. The FW-P1 portability fix must cover this second copy — extract a single shared helper so both call sites stay correct on macOS/BSD. | FW-P1 |
| **FW-D4** | Close CI shellcheck coverage gaps | P3 (Low) | Small | **Deploy / Quality**: `deploy/gitea-ci.yml` shellchecks only 5 scripts; `status.sh`, `resolve_session_id.sh`, `update_yaml_resumed.sh`, and `scripts/generate-env.sh` are never linted. Glob all tracked `*.sh` so new scripts are covered automatically. | None |
---
@@ -20,3 +38,20 @@ Below is the list of pending future work items. These items were proposed based
1. **Conditional Deferral of SQLite Integration (FW-L4)**:
* Unlike the session registry, maintaining individual job data in JSON files is highly intuitive for management and debugging. Since the current deployment is constrained to a single-host local file system, `fcntl.flock` locks are sufficient. Thus, this is assigned a low priority (P3) and will be tackled conditionally.
2. **Explicit Concurrency Strategy on Windows (FW-P1, FW-P2)**:
* Silently falling back to running without a lock is the most dangerous failure mode for a concurrency-control system. Instead of letting Windows environments run unlocked (which is what happens when the `fcntl` import failure is swallowed), we detect POSIX availability at startup. We either fail fast and prompt the user to use a POSIX-compliant shell/wrapper, or dynamically load `msvcrt.locking` to provide an equivalent file-locking mechanism. The goal is consistent synchronization behavior across Windows and Unix platforms.
3. **Dynamic Root Anchor (FW-P6)**:
* Hardcoding relative depth limits (like `../..` relative to a skill's location) creates direct fragility when moving directories or refactoring. By walking up the directory tree to search for known anchors (like `.git` or `.mam`), we establish a single canonical root path and prevent scripts from breaking when their execution wrappers are relocated.
4. **Monitor Termination Authorization (FW-P7)**:
* Auto-termination must not trust unauthenticated events. Since `reconcile.sh` listens to a wildcard topic, any client on a public broker could spoof a terminal message and trigger `tmux kill-session`. Requiring HMAC signature verification on the terminal event path, combined with artifact validation, mitigates spoofing and accidental session cleanup.
5. **Consolidation of per-job watchdogs (FW-W3)**:
* Instead of spawning an independent `watchdog.sh` process for each job, each of which reconnected every 2 minutes, we consolidated event handling, HMAC security verification, and sequence tracking into a single persistent wildcard subscriber running under `reconcile.sh --subscribe`. This sharply reduces the number of MQTT broker connections, simplifies cleanup logic, and keeps the last-seen sequence number of each job in the subscriber's in-memory state so that replay defense (monotonic sequence checks) works across concurrent jobs.
6. **Consistent `.env` Sourcing across Shell and Python (FW-P8)**:
* Sourcing the `.env` configuration file inside `lib.sh` ensures that shell utilities and Python scripts are fully aligned. Without this, customized database locations or isolated tmux server names declared in `.env` are only honored by the Python-based MQTT subsystems, while the shell orchestrators silently fall back to default socket files and paths.
7. **Deployment Installer Hardening (FW-D1 ~ FW-D4)**:
* `deploy/install.sh` and the Gitea templates are the newest, least-reviewed surface (added after the DONE.md verification round) and the one path that runs *before* any of the reviewed orchestration code. **FW-D1 (the release blocker) is now resolved (2026-06-24):** rather than the originally proposed `tar --exclude` denylist — which review showed was non-portable and, worse, stripped the skills' own nested `scripts/` directories via the unanchored `--exclude="scripts"` pattern, yielding a silently broken install — the installer was rebuilt around temp-dir staging + an allowlist copy of runtime assets with per-file no-clobber guards. This closes the destructive-overwrite hole and the dev-doc clutter in one move. FW-D2 is partially addressed (the staged tree is structurally verified before copy); the remaining supply-chain hardening is pinning the fetch to a tag/SHA + checksum. FW-D3 (NFS detection drift, folded into FW-P1) and FW-D4 (CI lint coverage) remain open consistency/quality debt.
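The dynamic root anchor described in item 3 (FW-P6) can be sketched as an upward walk that stops at the first directory containing a known marker. This is a hedged illustration in Python, not the project's implementation: the function name `find_workspace_root` and the `ROOT_MARKERS` tuple are hypothetical, and the real fix would most likely live in `lib.sh`.

```python
from pathlib import Path
from typing import Iterable, Optional

# Marker names taken from the FW-P6 description; the tuple itself is an assumption.
ROOT_MARKERS = (".git", ".mam", ".env")


def find_workspace_root(start: Path,
                        markers: Iterable[str] = ROOT_MARKERS) -> Optional[Path]:
    """Walk upward from `start` and return the nearest directory holding a marker.

    Returns None when no ancestor contains any of the markers.
    """
    marker_names = tuple(markers)
    current = Path(start).resolve()
    # Check the starting directory first, then each ancestor up to the FS root.
    for candidate in (current, *current.parents):
        if any((candidate / name).exists() for name in marker_names):
            return candidate
    return None
```

Because the search starts at the caller's own location and takes the nearest match, a script keeps resolving the same root after it is moved deeper or shallower in the tree, which is the property the hardcoded `../..` approach lacks.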
+1 -1
@@ -64,4 +64,4 @@ Strong success criteria let you loop independently. Weak criteria ("make it work
**These guidelines are working if:** fewer unnecessary changes in diffs, fewer rewrites due to overcomplication, and clarifying questions come before implementation rather than after mistakes.
Read AGENT.md first before working and follow the instructions for orchestration.
Read .agents/AGENT.md first before working and follow the instructions for orchestration.
+17 -17
@@ -1,6 +1,6 @@
# Messaging System Technical Analysis & Architecture Report
This report provides a comprehensive, deep-dive analysis of the messaging system implemented in the `tmux-agent-orchestrate-delegate-job` skill. It covers the MQTT broker architecture, event protocols, job lifecycles, codebase internals, cross-system integration, and a list of known limitations along with production recommendations.
This report provides a comprehensive, deep-dive analysis of the messaging system implemented in the `multi-agent-mux-delegate-job` skill. It covers the MQTT broker architecture, event protocols, job lifecycles, codebase internals, cross-system integration, and a list of known limitations along with production recommendations.
---
@@ -183,8 +183,8 @@ stateDiagram-v2
### 3.1 Step-by-Step Lifecycle Phase Details
#### Phase 1: Registration (`register`)
* **Trigger**: A delegator triggers `registry.py register` (or the `tmux-agent-orchestrate-delegate-job submit` command).
* **Registry State**: Flips from non-existent to `pending` inside `.hermes/jobs/<job_id>.json`. A `last_seq` counter is initialized to `0`.
* **Trigger**: A delegator triggers `registry.py register` (or the `multi-agent-mux-delegate-job submit` command).
* **Registry State**: Flips from non-existent to `pending` inside `.mam/jobs/<job_id>.json`. A `last_seq` counter is initialized to `0`.
* **Locking**: Exclusive fcntl file lock acquired over `.lock` during write.
* **Durable Audit Log**: Writes `<logs>/<job_id>/meta.json`, sets status to `pending` in `status.json`, and appends a `registered` event line to `events.ndjson`.
@@ -219,9 +219,9 @@ stateDiagram-v2
Two concurrency control schemes co-exist in this workspace to coordinate state modification:
1. **`lib.sh::atomic_dump_yaml()`**: Used for workspace-wide tmux session inventory (`agent-sessions.yaml`).
* **Locking**: Uses POSIX advisory locking via python's `fcntl.flock(lock_fh, fcntl.LOCK_EX)` over a sidecar lock file `<yaml_path>.lock`.
* **Locking**: Uses SQLite database transaction serialization via `BEGIN IMMEDIATE` on `agent-sessions.db`.
* **Safe Mutation**: The mutation source code is passed in an environment variable `AGENT_SESSIONS_MUTATION` and executed dynamically using `exec(compile(..., 'exec'), globals())`. This isolates the execution and avoids command-injection vectors.
* **Atomicity**: Writes to a temp file in the same directory using `tempfile.mkstemp()`, then performs an `os.replace()` rename. POSIX guarantees the replacement is atomic, preventing half-written YAML reads. A `.bak` backup copy is also preserved.
* **Atomicity**: Updates the SQLite tables and then, if a session transitions to a finished state, writes to a temp file in the same directory using `tempfile.mkstemp()` and performs an `os.replace()` rename. POSIX guarantees the replacement is atomic, preventing half-written YAML reads. A `.bak` backup copy is also preserved.
2. **`registry.py::register_job() / pick_pending() / _atomic_write_record()`**: Used for job-level metadata JSON files (`<job_id>.json`).
* **Locking**: Wraps operations in a `registry_lock(registry_dir)` context manager, implementing an advisory exclusive lock on `.lock` via `fcntl.flock`.
* **Atomicity**: In `_atomic_write_record()`, it uses `tempfile.mkstemp` inside the parent registry folder, serializes the updated job record to the temp file, flushes it, triggers a physical disk sync via `os.fsync(fh.fileno())`, and executes `os.replace` to replace the main JSON record file. The file permission is restricted to `0o600` immediately.
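The write-then-rename pattern described for `_atomic_write_record()` can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `registry.py`: the function name `atomic_write_json` is hypothetical, the `registry_lock` advisory lock is omitted, and the permission change is applied to the temp file before the rename.

```python
import json
import os
import tempfile
from pathlib import Path


def atomic_write_json(path: Path, record: dict) -> None:
    """Write `record` to `path` so readers never observe a half-written file."""
    path = Path(path)
    # The temp file must live in the same directory so os.replace stays on one
    # file system and the rename is atomic.
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent),
                                    prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(record, fh)
            fh.flush()
            os.fsync(fh.fileno())  # push the data to disk before the rename
        os.chmod(tmp_name, 0o600)  # owner-only, matching the described mode
        os.replace(tmp_name, path)  # atomic swap of the record file
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```

The `os.fsync` call matters for durability rather than atomicity: without it, a crash shortly after the rename could leave a correctly named file whose contents never reached the disk.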
@@ -272,7 +272,7 @@ The delegated messaging system functions as a critical control backplane, bindin
```mermaid
graph LR
User["User/Cron Client"] -->|submit| Wrap["tmux-agent-orchestrate-delegate-job (Bash)"]
User["User/Cron Client"] -->|submit| Wrap["multi-agent-mux-delegate-job (Bash)"]
Wrap -->|registers| Reg["registry.py (Live Registry)"]
Wrap -->|spawns background| Sub["job_subscriber.py"]
Wrap -->|spawns tmux pane| Tmux["tmux Session (Agent Pane)"]
@@ -285,9 +285,9 @@ graph LR
Mon -->|updates| Inv["agent-sessions.yaml <br> (lib.sh::atomic_dump_yaml)"]
```
### 5.1 Orchestration Wrappers (`tmux-agent-orchestrate-*`)
1. **`tmux-agent-orchestrate-delegate-job (submit)`**:
* Registers a job, spawns `job_subscriber.py` to capture standard output streams to `.hermes/jobs/<job_id>.subscriber.out`, and sleeps for `1` second.
### 5.1 Orchestration Wrappers (`multi-agent-mux-*`)
1. **`multi-agent-mux-delegate-job (submit)`**:
* Registers a job, spawns `job_subscriber.py` to capture standard output streams to `.mam/jobs/<job_id>.subscriber.out`, and sleeps for `1` second.
* Boots the agent pane in tmux:
```bash
tmux new-session -d -s "$sess" -c "$WORKDIR" \
@@ -295,10 +295,10 @@ graph LR
```
* Pre-seeds agent instruction headers via stdin to enforce that the agent runs `publish_event.py` for its transitions.
* Blocks on `wait $sub_pid`, and finally prints the audit log directory.
2. **`tmux-agent-orchestrate-monitor` (`reconcile.sh` & `watchdog.sh`)**:
* **Watchdog Integration**: Starts a subscriber monitoring loop (`watchdog.sh`) to detect orphaned agent panes or locked workspaces.
2. **`multi-agent-mux-monitor` (`reconcile.sh`)**:
* **Wildcard Monitor Integration**: Runs a unified background subscriber loop (`reconcile.sh --subscribe`) to capture progress, verify security tokens (HMAC) and sequences, write audit logs, and automatically clean up tmux sessions upon terminal events.
* **Reconciliation loop**: Subscribes to the global job topic. On terminal events, it invokes `lib.sh::atomic_dump_yaml` to sync status drifts (e.g. setting tmux sessions to `terminated` in `agent-sessions.yaml` once the agent exits).
3. **`tmux-agent-orchestrate-create / stop / resume`**:
3. **`multi-agent-mux-create / stop / resume`**:
* Integrates the job life status into session metadata updates, ensuring standard tmux cleanup triggers state updates in the registry and audit logs.
---
@@ -308,7 +308,7 @@ graph LR
### 6.1 Limitations
1. **Single-Host File Locking Vulnerability**:
The advisory locking system previously relied heavily on `fcntl.flock`. While `agent-sessions.yaml` has been migrated to SQLite WAL to solve concurrent writes, the job metadata in `.hermes/jobs/` still relies on `fcntl.flock` which may behave non-atomically on NFS.
The advisory locking system previously relied heavily on `fcntl.flock`. While `agent-sessions.yaml` has been migrated to SQLite WAL to solve concurrent writes, the job metadata in `.mam/jobs/` still relies on `fcntl.flock` which may behave non-atomically on NFS.
2. **Bearer Token Leakage over Plaintext (Public Broker)**:
The `auth_token` mechanism is a simple plaintext bearer comparison. If the transport layer is unencrypted (e.g., using `broker.hivemq.com` on port `1883`), any eavesdropper on the network can steal the token and spoof legitimate events.
3. **Subscriber Network Drop Orphanage**:
@@ -336,8 +336,8 @@ graph LR
This project manages **two distinct state domains** that are often confused:
### Session States (YAML — `.hermes/agent-sessions.yaml`)
Managed by `skills/lib.sh` and the 6 `tmux-agent-orchestrate-*` skills.
### Session States (YAML — `.mam/agent-sessions.yaml`)
Managed by `.agents/skills/lib.sh` and the 6 `multi-agent-mux-*` skills.
Valid values (see `lib.sh` valid-status set):
| State | Meaning | Set by |
@@ -347,8 +347,8 @@ Valid values (see `lib.sh` valid-status set):
| `terminated` | hard-killed via `--mode hard`; tmux session destroyed | `stop` (hard mode), `monitor` reconcile |
| `archived` | soft-stopped via `--mode soft`; tmux left alive, YAML-only update | `stop` (soft mode) |
### Job States (Registry — `.hermes/jobs/<id>.json`)
Managed by `skills/tmux-agent-orchestrate-delegate-job/scripts/registry.py`.
### Job States (Registry — `.mam/jobs/<id>.json`)
Managed by `.agents/skills/multi-agent-mux-delegate-job/scripts/registry.py`.
Valid values:
| State | Meaning | Set by |
+193
@@ -0,0 +1,193 @@
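# tmux-agent-orchestration (Multi-Agent Tmux Orchestration and Messaging Backplane)
A high-reliability **multi-agent orchestration and messaging backplane** framework built on Tmux and an MQTT broker. It is designed to coordinate, isolate, and audit long-running tasks (code generation, refactoring, security reviews, and so on) across agents backed by multiple LLMs such as Claude and Hermes.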
---
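## 🚀 Overview
Recent agent workflows frequently run into session timeouts, missing process isolation, terminal viewport truncation (debug logs lost to scrollback limits), and complex concurrency races.
**tmux-agent-orchestration** addresses these problems with the following:
1. **Tmux-based process isolation:** Launches each agent CLI session inside its own isolated tmux environment so that long-running work persists in the background without interruption.
2. **Asynchronous event-driven architecture:** Uses an MQTT broker as the messaging backplane to control and coordinate state transitions between agents (`started`, `progress`, `completed`, `error`).
3. **Multi-Agent Mux (MAM):** Combines file-based advisory locks (`fcntl`) with a SQLite WAL database (`.mam/agent-sessions.db`) to prevent races when claiming concurrent jobs and to manage the agent session lifecycle without drift.
4. **Reviewer-based high-assurance verification loop:** Automates a cross-verification loop in which code changes implemented by a Worker agent are merged only after they receive a final `PASS` verdict from specialized verification agents with different strengths (for example, Claude for close review of logic flow, Hermes for quick checks of shell syntax and safety).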
---
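## 🛠️ Core Skills and Structure
All orchestration skills are defined under the `.agents/skills/` directory:
* **`multi-agent-mux-create`**: Creates an isolated tmux session and runs the specified agent CLI in the background. It handles process PID capture, metadata registry updates, and agent authentication checks.
* **`multi-agent-mux-stop`**: Closes agent CLI sessions safely by sending a graceful exit keystroke (`/exit` or `Exit`) and performs cleanup that purges the isolated conversation history and database logs.
* **`multi-agent-mux-resume`**: Looks up the session UUID for a given workspace from disk or cache, then restores and resumes the session with its existing conversation state (`claude -r <uuid>` or `hermes --resume <uuid>`).
* **`multi-agent-mux-status`**: Queries the live state of every active session and detects PID inconsistencies, command-format mismatches, and synchronization drift between the actual tmux state and the database.
* **`multi-agent-mux-monitor`**: Runs in the background as a Kanban reconcile process, monitoring tmux session changes in real time and synchronizing state to the `.mam/agent-sessions.yaml` metadata file.
* **`multi-agent-mux-delegate-job`**: The core module that delegates and manages tasks as independent asynchronous jobs:
* `registry.py`: Uses file locks (`fcntl`) to register and claim jobs atomically, without race conditions.
* `job_subscriber.py`: Subscribes to the MQTT backplane channel, collects live status events, and records them in the audit trail.
* `publish_event.py`: Publishes execution state transitions and error details from workspace scripts to the backplane.
* `mqtt_common.py`: Centralizes broker connection rules, credential authentication, and HMAC signing.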
---
## 📐 Big-Picture Architecture
The system coordinates LLM agents running across multiple workspaces through two layers:
1. **Layer A: Tmux Orchestration (lib.sh + status/resume/stop/create)**: Runs each workspace's agent session in a separate tmux instance and maintains a single source of truth for agent session metadata through `.mam/agent-sessions.yaml` and the SQLite database (`.mam/agent-sessions.db`).
2. **Layer B: Async Job Delegation (delegate-job)**: Sends a specific task to an agent and monitors progress and completion over an asynchronous event channel (MQTT).
The two layers share a single chokepoint for file I/O: `lib.sh::atomic_dump_yaml`. Every YAML/DB write goes through a SQLite database transaction lock and data schema validation.
### Data Flow Overview
```text
+-----------+   register_job   +---------------------+
| delegator | ---------------> | .mam/jobs/<id>.json | <-- live job record
+-----------+                  +----------+----------+
                                          |
                                          | atomic rename + fsync
                                          v
                               +---------------------+
                               | audit log           | <-- append-only
                               | .mam/delegate_      |     events.ndjson
                               | job_logs/<id>/      |
                               +----------+----------+
                                          ^
                                          | (best-effort mirroring)
                                          |
+-----------+  publish_event   +----------+----------+      +---------+
| agent     | ---------------> |     MQTT broker     | <--- | monitor |
| (claude)  |                  +---------------------+      +----+----+
+-----------+                                                    |
      ^                                                          v
      |                subscriber                        atomic_dump_yaml
      |            (job_subscriber.py)              (.mam/agent-sessions.yaml)
      |                                                          ^
      +-------- delegator waits here --------+                   |
                                                          +------+-------+
                                                          | reconcile.sh |
                                                          +--------------+
```
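### 🔒 Tmux Server Isolation
To prevent conflicts between agent sessions and confusion with system-wide tmux processes, the framework guarantees an independent server socket environment:
* **Per-workspace shim:** The `_init_tmux_isolation` and `_resolve_real_tmux_path` functions set up an independent shim directory at `/tmp/multi-agent-tmux-shim/<TMUX_SERVER_NAME>/tmux` so that ordinary tmux commands automatically use a separate socket server in the form `tmux -L <server>`.
* **PATH modification:** When child processes are created, the shim directory is inserted at the front of `PATH`. As a result, every `tmux` command run from the agent's inner shell is confined to that isolated server socket.
* **Environment restoration:** When `TMUX_SERVER_NAME` is set to `default`, the PATH override is cleared and the default global tmux server is used.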
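### 🛡️ Concurrency Design and Write Serialization
The framework adheres to lock-based execution patterns to prevent race conditions when several agents run at the same time:
* **SQLite database lock (`BEGIN IMMEDIATE`):** Any write to `agent-sessions.yaml` or the SQLite registry must go through the `atomic_dump_yaml` function in `lib.sh`, which serializes writers by acquiring a `BEGIN IMMEDIATE` transaction lock on the SQLite database `.mam/agent-sessions.db`.
* **Dual-interpreter split:** The environment is split in two to avoid dependency conflicts between libraries and to keep the tooling stable. MQTT and asynchronous job communication use the Python in the `.venv` virtual environment (requires paho-mqtt), while `atomic_dump_yaml`, which handles YAML serialization and validation, calls the system-wide `python3` (requires system PyYAML).
* **NFS and network filesystem handling:** File locks (`flock`) and SQLite WAL can malfunction on network filesystems (NFS, CIFS, SSHFS). `lib.sh` checks the mount type of the target path and, when a network filesystem is detected, logs a warning and automatically switches SQLite's journal mode from `WAL` to `DELETE` to improve concurrency safety.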
---
## 📐 Architecture and Coordination Loop (Review Loop)
Collaboration between the Project Manager (PM), Worker, and Reviewer roles follows a strict cross-verification loop:
```mermaid
sequenceDiagram
autonumber
actor User as User
participant PM as Project Manager
participant W as Worker
participant R as Reviewers
participant M as MQTT Backplane
User->>PM: Hand over requirements
Note over PM: Plan tasks and register Job
PM->>M: Register Job and start Subscriber
PM->>W: Delegate task (provide Job ID and guide)
W->>M: Publish 'started' event
Note over W: Change code and self-verify
W->>M: Publish 'completed' (or 'error')
PM->>R: Request parallel reviews (provide Diff)
Note over R: Agent cross-verification (Claude, Hermes)
alt Review feedback (NOT PASS)
R->>PM: NOT PASS (feedback with alternative code blocks)
Note over PM: PM fixes directly or decides to re-delegate
PM->>W: Request rework based on review feedback
else Verification passed (PASS)
R->>PM: PASS
end
PM->>User: Report final merge and commit changes
```
---
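## 🔒 Security Protocol and Replay Attack Defense
To guarantee message integrity even on public MQTT brokers, the backplane provides an **HMAC-SHA256 cryptographic signature** scheme:
* **PoC mode (unauthenticated):** The default mode. `auth_token` is set to `null`, and cryptographic signature checks are skipped for simple local verification.
* **Production mode (authentication required):** A random cryptographic token is issued per job, and every payload crossing the backplane must carry an `hmac_sig` signature derived from that token before the receiver accepts it.
* **Replay attack defense:** Each event carries a monotonically increasing integer sequence number (`seq`). The subscriber (`job_subscriber.py`) immediately discards any payload whose sequence number is not greater than the highest sequence number it has already accepted for that job. Combined with the HMAC signature over the payload body, this blocks re-injected and reordered packets without relying on clock synchronization. The timestamp field in the message is reference metadata only; the backplane does not validate a clock-skew window.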
---
## 📁 Repository Layout
```text
.
├── .agents/
│   ├── AGENT.md                # Agent role code of conduct and viewport snapshot rules
│   ├── AGENT.ko.md             # Agent role code of conduct (Korean backup)
│   └── skills/                 # Core orchestration shell scripts and libraries
│       ├── lib.sh              # Shared orchestration shell function library
│       ├── multi-agent-mux-create/
│       ├── multi-agent-mux-stop/
│       ├── multi-agent-mux-resume/
│       ├── multi-agent-mux-status/
│       ├── multi-agent-mux-monitor/
│       └── multi-agent-mux-delegate-job/
│           ├── requirements.txt    # Python dependency declaration
│           └── scripts/            # Python backplane implementation
├── .mam/                       # Multi-Agent Mux metadata (git-ignored)
│   ├── agent-sessions.db       # SQLite WAL session database
│   ├── agent-sessions.yaml     # Text-format snapshot of the session registry
│   └── jobs/                   # Asynchronous job metadata JSON files
├── scripts/
│   └── generate-env.sh         # Copies the environment file (.env) template
├── BOOTSTRAP.ko.md             # Initial project setup guide (Korean backup)
├── BOOTSTRAP.md                # Detailed initial setup and verification guide
├── MESSAGING.md                # MQTT messaging protocol wire specification
└── README.md                   # Main project introduction
```
---
## 🚦 Quick Start
For the detailed build procedure, see **[BOOTSTRAP.md](./BOOTSTRAP.md)**. A brief summary follows:
1. **Create the environment configuration file (.env):**
```bash
./scripts/generate-env.sh
```
2. **Create the virtual environment and install dependencies:**
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r .agents/skills/multi-agent-mux-delegate-job/requirements.txt
```
3. **Self-check that the registry and environment work:**
```bash
.venv/bin/python3 .agents/skills/multi-agent-mux-delegate-job/scripts/registry.py list
```
---
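## 📝 Guidelines for Collaborating Agents
Agents newly joining this project must follow these rules:
1. Read **[AGENT.md](.agents/AGENT.md)** carefully to understand the roles of the Project Manager (PM), Worker, and Reviewer, and the development constraints.
2. When running long commands, always apply the **Pane Snapshotting Rules** described in `AGENT.md` (Section 4) to prevent loss of terminal scrollback logs.
3. No arbitrary change to a core file may be merged into the production branch without approval; request diff verification from the reviewer sessions first.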
+210 -1
@@ -1 +1,210 @@
# README.md
# tmux-agent-orchestration
An advanced, high-reliability **Multi-Agent Orchestration & Messaging Backplane** framework built on Tmux and MQTT. It is designed to coordinate, isolate, and audit long-running agent tasks (such as code generation, refactoring, and security reviews) across multiple LLM backend clients (e.g., Claude, Hermes).
---
## 🚀 Overview
Modern agentic workflows often suffer from session timeouts, a lack of process isolation, terminal viewport truncation (scrollback limits), and complex concurrency issues.
**tmux-agent-orchestration** addresses these problems by providing:
1. **Tmux-based Process Isolation:** Spawning LLM client sessions inside dedicated, isolated tmux environments to support persistent background runs.
2. **Asynchronous Event-Driven Architecture:** Leveraging an MQTT broker as a message backplane to coordinate state transitions (`started`, `progress`, `completed`, `error`) between collaborating agents.
3. **Multi-Agent Mux (MAM):** Combining local file-based locks (fcntl) and an ACID-compliant SQLite WAL database (`.mam/agent-sessions.db`) to manage concurrent job claims and track running agent sessions without drift.
4. **Automated Review & Quality Loop:** Implementing parallel reviewer loops where worker agents must receive a `PASS` rating from various specialized verification agents (e.g., Claude for high-level logic, Hermes for shell syntax/safety) before merging code.
---
## 📦 Installation & Setup
You can bootstrap the Multi-Agent Mux (MAM) framework in any workspace directory with a single command:
```bash
curl -fsSL https://git.godopu.com/tmpl/multi-agent-mux/raw/branch/main/deploy/install.sh | bash
```
Alternatively, if you have already cloned the repository locally, run the installer directly:
```bash
bash deploy/install.sh
```
The idempotent installer automatically validates system dependencies (tmux, python3, and PyYAML), creates the python virtual environment (`.venv`), installs dependencies, copies `.env.example` as `.env`, and initializes the `.agents/` scaffolding.
---
## 🛠️ Core Skills & Scaffolding
All orchestration functionalities are structured under the `.agents/skills/` directory:
* **`multi-agent-mux-create`**: Spawns isolated tmux sessions running the specified agent CLI wrappers. It captures process PIDs, updates the metadata registry, and enforces authentication checks.
* **`multi-agent-mux-stop`**: Gracefully terminates agent CLI sessions (using key macros like `/exit` or `Exit`) and handles disk purge operations (removing conversation JSON files and SQLite logs for deleted workspaces).
* **`multi-agent-mux-resume`**: Restores stopped sessions by resolving workspace UUIDs from disk or cache, and invokes the underlying agent using session-resume parameters (e.g., `claude -r <uuid>` or `hermes --resume <uuid>`).
* **`multi-agent-mux-status`**: Queries the running states of all active sessions, detecting PID mismatches, command signatures, and drifts between actual tmux instances and the registry database.
* **`multi-agent-mux-monitor`**: A long-running Kanban reconcile worker that dynamically monitors tmux sessions and synchronizes states to `.mam/agent-sessions.yaml`.
* **`multi-agent-mux-delegate-job`**: The core asynchronous task distribution module containing:
* `registry.py`: Atomically registers and claims jobs using file advisory locks (`fcntl`).
* `job_subscriber.py`: Connects to the MQTT backplane, captures live events, and appends them to audit trails.
* `publish_event.py`: Emits execution status transitions and error details from workspace scripts.
* `mqtt_common.py`: Manages connection policies, authentication, and HMAC signing.
---
## 📐 Big-Picture Architecture
The system coordinates LLM agents across multiple workspaces through two core layers:
1. **Layer A — Tmux Orchestration (lib.sh + status/resume/stop/create)**: Runs the agents (one tmux session per agent-workspace combination) and maintains an authoritative registry in `.mam/agent-sessions.yaml` (+ `.mam/agent-sessions.db`).
2. **Layer B — Async Job Delegation (delegate-job)**: Dispatches a task to an agent and observes progress and completion via an event channel.
These two layers share one lock-guarded chokepoint for file I/O: `lib.sh::atomic_dump_yaml`. Every write is protected by an exclusive SQLite database transaction lock and schema validation.
### Data Flow Overview
```text
+-----------+   register_job   +---------------------+
| delegator | ---------------> | .mam/jobs/<id>.json | <-- live record
+-----------+                  +----------+----------+
                                          |
                                          | atomic rename + fsync
                                          v
                               +---------------------+
                               | audit log           | <-- append-only
                               | .mam/delegate_      |     events.ndjson
                               | job_logs/<id>/      |
                               +----------+----------+
                                          ^
                                          | (best-effort mirrors)
                                          |
+-----------+  publish_event   +----------+----------+      +---------+
| agent     | ---------------> |     MQTT broker     | <--- | monitor |
| (claude)  |                  +---------------------+      +----+----+
+-----------+                                                    |
      ^                                                          v
      |                subscriber                        atomic_dump_yaml
      |            (job_subscriber.py)              (.mam/agent-sessions.yaml)
      |                                                          ^
      +-------- delegator waits here --------+                   |
                                                          +------+-------+
                                                          | reconcile.sh |
                                                          +--------------+
```
### 🔒 Tmux Server Isolation
To prevent workspace tmux processes from interfering with each other or with system tmux servers, the framework enforces isolated tmux environments:
* **Per-Workspace Shim:** `_init_tmux_isolation` and `_resolve_real_tmux_path` instantiate a per-workspace shim directory under `/tmp/multi-agent-tmux-shim/<TMUX_SERVER_NAME>/tmux` that intercepts tmux commands and wraps them in `tmux -L <server>`.
* **PATH Rewriting:** The `PATH` environment variable is dynamically prepended with the shim path in all child processes. This ensures any `tmux` invocation within the agent's process tree is restricted to its isolated socket server.
* **Environment Restoration:** If `TMUX_SERVER_NAME` is set to `default`, the PATH override is removed, reverting to the default global tmux server.
### 🛡️ Concurrency Design & Write Serialization
The framework implements lock-guarded execution pathways to prevent race conditions during parallel agent operations:
* **SQLite Database Locks (`BEGIN IMMEDIATE`):** Every mutation of `agent-sessions.yaml` and the SQLite registry runs through `atomic_dump_yaml` inside `lib.sh`, which serializes writes via an exclusive `BEGIN IMMEDIATE` transaction lock on the SQLite database `.mam/agent-sessions.db`.
* **Dual-Interpreter Strategy:** To minimize dependency bloat and guarantee stability, the backplane splits execution environments: the virtual environment `.venv` handles MQTT communication and async jobs (requiring `paho-mqtt`), while the system `python3` executes `atomic_dump_yaml` (relying on system-wide `PyYAML`).
* **NFS and Network FS Safeguards:** File locking (`flock`) and SQLite WAL behave unreliably on network filesystems (NFS, CIFS, SSHFS), so `lib.sh` checks the mount type of the target path. If a network mount is detected, it logs a warning and switches SQLite's journal mode from `WAL` to `DELETE`.
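
As a rough illustration of the `BEGIN IMMEDIATE` serialization described above, the sketch below performs a read-modify-write on a session row while holding SQLite's write lock. The `sessions` table, its columns, and the function name are hypothetical; the real schema used by `atomic_dump_yaml` is not shown in this README.

```python
import sqlite3


def mutate_session_status(db_path, name, status):
    """Set a session's status under an exclusive write lock; return the old status.

    Hypothetical sketch: table and column names are assumptions.
    """
    # isolation_level=None turns off sqlite3's implicit transactions, so
    # BEGIN IMMEDIATE and COMMIT run exactly where they are written.
    conn = sqlite3.connect(db_path, timeout=30, isolation_level=None)
    try:
        conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
        try:
            conn.execute(
                "CREATE TABLE IF NOT EXISTS sessions "
                "(name TEXT PRIMARY KEY, status TEXT)"
            )
            row = conn.execute(
                "SELECT status FROM sessions WHERE name = ?", (name,)
            ).fetchone()
            previous = row[0] if row else None
            conn.execute(
                "INSERT OR REPLACE INTO sessions (name, status) VALUES (?, ?)",
                (name, status),
            )
            conn.execute("COMMIT")
        except BaseException:
            conn.execute("ROLLBACK")
            raise
        return previous
    finally:
        conn.close()
```

Acquiring the lock before the `SELECT` is the point of `BEGIN IMMEDIATE`: a second writer blocks (up to `timeout`) instead of reading a value that is about to be overwritten.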
---
## 📐 Architecture & Coordination Loop
The interaction between roles (Project Manager, Worker, and Reviewer) is structured as a strict iterative loop:
```mermaid
sequenceDiagram
autonumber
actor User as User
participant PM as Project Manager
participant W as Worker
participant R as Reviewers
participant M as MQTT Backplane
User->>PM: Hand over requirements
Note over PM: Plan tasks & register jobs
PM->>M: Register Job & start Subscriber
PM->>W: Delegate task (Provide Job ID & Brief)
W->>M: Publish 'started' event
Note over W: Implement & verify code
W->>M: Publish 'completed' (or 'error')
PM->>R: Request parallel reviews (Provide Diff)
Note over R: Parallel analysis (Claude, Hermes)
alt Review Feedback (NOT PASS)
R->>PM: NOT PASS (Feedback with code blocks)
Note over PM: Apply fixes or re-delegate
PM->>W: Re-delegate with comments
else Verification PASS
R->>PM: PASS
end
PM->>User: Commit changes & Report completion
```
---
## 🔒 Security & Replay Attack Defense
To ensure communication integrity across public MQTT brokers, the backplane integrates an **HMAC-SHA256 signature protocol**:
* **PoC Mode (Unauthenticated):** Default mode where `auth_token` is `null`, skipping cryptographic validations for quick setups.
* **Production Mode (Authenticated):** A unique cryptographic token is issued per job. Event payloads must include an `hmac_sig` computed with the token.
* **Replay Attack Mitigation:** Each event carries a monotonically increasing integer sequence counter (`seq`). The subscriber (`job_subscriber.py`) drops any payload whose sequence number is not strictly greater than the highest sequence number it has already accepted for that job. Combined with the HMAC signature on the payload body, this rejects both re-injected and out-of-order packets without relying on clock synchronization. The wire-format timestamp field is advisory metadata only; the backplane does not enforce a clock-skew window.
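
The sequence-plus-signature check described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the code in `job_subscriber.py`: the canonical JSON encoding, the `job_id` field name, and the `ReplayGuard` class are all hypothetical, and the actual wire format is defined in `MESSAGING.md`.

```python
import hashlib
import hmac
import json


def sign(token, body):
    """Hex HMAC-SHA256 over a canonical JSON encoding of the event body.

    The canonicalization (sorted keys, compact separators) is an assumption.
    """
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hmac.new(
        token.encode("utf-8"), canonical.encode("utf-8"), hashlib.sha256
    ).hexdigest()


class ReplayGuard:
    """Accept an event only if its signature is valid and its seq is new."""

    def __init__(self, token):
        self.token = token
        self.last_seq = {}  # job_id -> highest accepted seq

    def accept(self, event):
        body = {k: v for k, v in event.items() if k != "hmac_sig"}
        sig = event.get("hmac_sig")
        if not isinstance(sig, str):
            return False
        expected = sign(self.token, body)
        # Constant-time comparison avoids leaking how many characters matched.
        if not hmac.compare_digest(expected.encode("utf-8"), sig.encode("utf-8")):
            return False
        seq = body.get("seq")
        job_id = body.get("job_id")
        # Drop anything not strictly greater than the highest accepted seq.
        if not isinstance(seq, int) or seq <= self.last_seq.get(job_id, -1):
            return False
        self.last_seq[job_id] = seq
        return True
```

Checking the signature before the sequence number means a forged packet cannot advance `last_seq` and lock out legitimate events.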
---
## 📁 Repository Layout
```text
.
├── .agents/
│   ├── AGENT.md                # Agent roles, snapshot rules, and execution charter
│   ├── AGENT.ko.md             # Agent roles, snapshot rules, and execution charter (Korean)
│   └── skills/                 # Core orchestration shell wrappers & libraries
│       ├── lib.sh              # Shared orchestration library
│       ├── multi-agent-mux-create/
│       ├── multi-agent-mux-stop/
│       ├── multi-agent-mux-resume/
│       ├── multi-agent-mux-status/
│       ├── multi-agent-mux-monitor/
│       └── multi-agent-mux-delegate-job/
│           ├── requirements.txt    # Python dependency declaration
│           └── scripts/            # Core backplane implementation (Python)
├── .mam/                       # Multi-Agent Mux metadata (git-ignored)
│   ├── agent-sessions.db       # SQLite WAL session database
│   ├── agent-sessions.yaml     # Human-readable session registry
│   └── jobs/                   # Asynchronous job metadata files
├── scripts/
│   └── generate-env.sh         # Environment bootstrap helper
├── BOOTSTRAP.md                # Detailed installation and verification guide
├── MESSAGING.md                # MQTT wire protocol specification
└── README.md                   # Project introduction and overview (this file)
```
---
## 🚦 Quick Start
For detailed setup instructions, please consult the **[BOOTSTRAP.md](./BOOTSTRAP.md)** file. Below is a quick summary:
1. **Initialize Environment Config:**
```bash
./scripts/generate-env.sh
```
2. **Create Virtual Environment and Install Dependencies:**
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r .agents/skills/multi-agent-mux-delegate-job/requirements.txt
```
3. **Run Registry Diagnostics:**
```bash
.venv/bin/python3 .agents/skills/multi-agent-mux-delegate-job/scripts/registry.py list
```
---
## 📝 Guidelines for Collaborating Agents
If you are an AI agent newly onboarded to this project:
1. Read **[AGENT.md](.agents/AGENT.md)** to align on development constraints and roles (PM, Worker, Reviewer).
2. Adhere to the **Pane Snapshotting Rules** in `AGENT.md` (Section 4) to prevent scrollback data loss during long execution steps.
3. Never modify core logic without submitting a diff to the reviewer sessions for evaluation.
+122
@@ -0,0 +1,122 @@
# Multi-Agent Mux: Skill Features and Architecture
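This document is a consolidated specification of the core features, state machine, CLI specifications, and integration of the six individual skills and the common library implemented in the `multi-agent-mux` workspace. It serves as the baseline reference for skill optimization and refactoring work.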
---
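## 1. Architecture Overview
`multi-agent-mux` is a system that manages multiple autonomous agents (Claude, Agy, Cline, Hermes, and others) in isolated Tmux session environments and lets them communicate with each other.
* **Central state registry**: `.mam/agent-sessions.yaml` and the synchronized `.mam/agent-sessions.db` (SQLite3)
* **Isolated socket**: Can run against a dedicated tmux server socket (for example, the `multi-agent-mux` server)
* **Event bus**: Asynchronous, real-time observation of job state over the MQTT protocol (`multi-agent-mux-delegate-job`)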
---
## 2. Common Library: `lib.sh`
The core shared helper library that every skill script loads.
* **Atomic state file dump (`atomic_dump_yaml`)**:
  * Falls back to SQLite `PRAGMA journal_mode=DELETE` when NFS (a network filesystem) is detected, and sets `PRAGMA journal_mode=WAL` in local environments.
  * Takes an exclusive lock (`BEGIN IMMEDIATE`) to prevent read-modify-write data loss (the lost-update race condition) in multi-process environments.
  * After the transaction commits, creates a `.bak` backup file and replaces the target atomically by writing a temp file and calling `os.replace`.
* **Agent session existence checks (the `*_exists` function family)**:
  * `claude`: existence of `<uuid>.jsonl` under the project directory
  * `agy`: existence of `.gemini/antigravity-cli/conversations/<uuid>.db`
  * `hermes`: existence in the `sessions` table of `~/.hermes/state.db` (verified with a SQLite query)
  * `cline`: existence of `.cline/data/sessions/<uuid>/<uuid>.json`
* **Session ID resolution engine (`find_workspace_uuid` tiers)**:
  * **Tier 1 (direct YAML lookup)**: Looks up the agent-specific field recorded in the YAML (such as `claude_session_id_own`).
  * **Tier 2 (disk residue scan)**: Among the on-disk session logs that match the workspace directory (`cwd` / `workspace_root`), sorts by most recent modification time (`mtime`) and returns the newest UUID.
  * **Tier 3 (identity cache)**: Falls back to the `agent_identities` cache data at the top of the registry.
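
The three-tier lookup described above can be sketched as follows. This is an illustrative approximation, not the logic in `lib.sh`: the function names `newest_session_uuid` and `resolve_uuid`, the flat directory layout, and the `.jsonl` suffix default are assumptions made for the example.

```python
import os


def newest_session_uuid(session_dir, suffix=".jsonl"):
    """Tier 2: return the stem of the most recently modified session log, or None."""
    candidates = []
    for entry in os.scandir(session_dir):
        if entry.is_file() and entry.name.endswith(suffix):
            stem = entry.name[: -len(suffix)]
            candidates.append((entry.stat().st_mtime, stem))
    if not candidates:
        return None
    # max() on (mtime, stem) tuples picks the newest file.
    return max(candidates)[1]


def resolve_uuid(recorded_uuid, session_dir, cached_uuid):
    """Try the recorded value, then the disk scan, then the identity cache."""
    if recorded_uuid:  # Tier 1: value already stored in the registry
        return recorded_uuid
    from_disk = newest_session_uuid(session_dir)  # Tier 2
    if from_disk:
        return from_disk
    return cached_uuid  # Tier 3: may itself be None
```

Ordering the tiers this way prefers explicit registry data over heuristics, and only falls back to the cache when the disk has no matching logs.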
---
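## 3. Skill Specifications
### 3.1. `multi-agent-mux-create` (Create Skill)
* **Purpose**: Creates an isolated Tmux container for a new agent and registers it in the registry.
* **Core features**:
  * **Preflight check**:
    * `claude`: verifies login state (`"loggedIn": true`) via `claude auth status`
    * `agy`: verifies that the API integration is healthy via `agy models`
    * `hermes`: verifies integration state via `hermes status`
    * `cline`: verifies that `cline history --json` works and checks the configuration state
  * **Tmux session creation and initialization**: Creates the session in the background with a screen size (`-x 140 -y 40`) and working directory (`-c`) tuned per agent.
  * **Initial state YAML registration**: Records the mandatory user-specified role (`--role`), `status: running`, `pane` details (index, PID, CWD, CMD_FULL), the start command, and `mcp_attachments`.
  * **Role immutability**: The role (`role`) assigned at creation cannot be modified afterwards; any attempt to change it raises an exception during data validation (`atomic_dump_yaml`) and is rejected.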
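### 3.2. `multi-agent-mux-resume` (Resume Skill)
* **Purpose**: Restores the Tmux session and TUI connection of a stopped or lost agent with its previous context intact.
* **Core features**:
  * **Session ID resolution delegation**: Invokes `lib.sh::find_workspace_uuid` to determine the UUID of the target workspace.
  * **Session restore launch**:
    * `claude`: `claude --dangerously-skip-permissions -r <UUID>`
    * `agy`: `agy --dangerously-skip-permissions --conversation <UUID>`
    * `hermes`: `hermes --resume <UUID>`
    * `cline`: `cline -i --id <UUID>`
  * **TUI bypass automation (Claude)**: Immediately after launch, injects `Enter` → `Down` → `Enter` keystrokes in the background to automatically accept the permission-bypass and restore confirmation dialogs.
  * **Synchronization**: Runs `update_yaml_resumed.sh` to transition the status to `running`, refresh the child PID as of launch time, and remove the previous termination metadata.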
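### 3.3. `multi-agent-mux-stop` (Stop Skill)
* **Purpose**: Cleans up a session safely, and stores and synchronizes its status and UUID.
* **Core features**:
  * **Pre-termination TUI snapshot**: Runs `tmux capture-pane` and preserves the final screen state in the `last_visible_status_at_termination` field.
  * **Multi-stage graceful shutdown protocol**:
    1. Inject the TUI safe-exit keystroke (`/exit` or `Exit`), then wait 3 seconds.
    2. If the session is still alive, send `tmux kill-session` and wait 5 seconds.
    3. As a last resort, send `kill -9` to the detected child PID.
  * **Disk purge (--purge-conversation)**:
    * Sets `resumable` to `false` and records the status as `terminated`.
    * Accesses each agent's data path and destroys the corresponding session files.
      * `claude`: deletes `<proj-key>/<uuid>.jsonl`
      * `agy`: deletes `conversations/<uuid>.db` and the `brain/<uuid>` folder
      * `hermes`: deletes `sessions/session_<uuid>.json` and the history in `state.db` (uses its own internal connection `hconn` to avoid clashing with the parent YAML DB connection)
      * `cline`: removes the `~/.cline/data/sessions/<uuid>` folder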
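
The multi-stage shutdown protocol in section 3.3 follows a generic escalation pattern, sketched below. The function `stop_with_escalation` and its callback-based interface are hypothetical and chosen so the example runs without tmux; the actual `stop_session.sh` is a shell script and its structure is not shown here.

```python
import time


def stop_with_escalation(is_alive, stages, sleep=time.sleep):
    """Run escalating stop actions until the session is gone.

    `stages` is a list of (label, action, wait_seconds) tuples. Returns the
    label of the stage after which the session was no longer alive, or None
    if it survived every stage. All names here are illustrative.
    """
    for label, action, wait_seconds in stages:
        action()             # e.g. send "/exit", then kill-session, then kill -9
        sleep(wait_seconds)  # give the process time to exit
        if not is_alive():
            return label
    return None
```

A caller would pass something like `[("graceful", send_exit, 3), ("kill-session", kill_session, 5), ("kill-9", kill_child, 0)]`, where the three callables are placeholders for whatever issues the corresponding tmux or signal commands.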
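### 3.4. `multi-agent-mux-delegate-job` (Delegate Skill)
* **Purpose**: Delegates work asynchronously to another agent and observes execution state through MQTT events.
* **Core features**:
  * **Delegation types**:
    * `direct` (default): Launches a single target session, passes it the task, and waits.
    * `loop` (collaboration loop): After the implementer (Worker) finishes, a reviewer (Reviewer) inspects the code, and revisions are requested automatically and repeatedly until the review returns `"PASS"`.
    * `discuss` (discussion/consensus): Drives a joint discussion between two agents to reach agreement on a final design and plan.
  * **MQTT event specification**: Pairs `publish_event.py` with `job_subscriber.py` to track the `started` → `permission_required` → `progress` → `completed`/`error` state transitions and to apply an automatic dual timeout check (a 3600-second total execution budget plus a 120-second idle timeout).
  * **Audit log recording**: Persists `meta.json`, `status.json`, and the raw NDJSON file `events.ndjson` under `.mam/delegate_job_logs/<job_id>/`.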
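
The dual timeout check mentioned above (total budget plus idle timeout) reduces to two independent comparisons. The sketch below is a minimal illustration; the function name, the return labels, and the use of plain float timestamps are assumptions rather than the subscriber's actual interface.

```python
def timeout_reason(now, started_at, last_event_at,
                   total_budget=3600.0, idle_timeout=120.0):
    """Return why a job should be timed out, or None if it may continue.

    Hypothetical helper: timestamps are seconds (e.g. from time.monotonic()),
    and the defaults mirror the 3600 s budget and 120 s idle limit in the spec.
    """
    if now - started_at >= total_budget:
        return "total_budget_exceeded"  # job ran too long overall
    if now - last_event_at >= idle_timeout:
        return "idle_timeout"           # no event arrived recently
    return None
```

Resetting `last_event_at` on every received event keeps a chatty but slow job alive until the total budget is exhausted, while a silent job is cut off after the idle window.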
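### 3.5. `multi-agent-mux-status` (Status Skill)
* **Purpose**: Reads the registry and immediately displays the session status of every running agent.
* **Core features**:
  * **Read-only safety**: Performs pure queries without modifying the DB or triggering state transitions.
  * Verifies that the live tmux process state and the YAML agree on session-name mapping, then prints a summary to the console.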
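### 3.6. `multi-agent-mux-monitor` (Reconcile Skill)
* **Purpose**: Detects mismatches between the OS-level Tmux runtime and the YAML registry in a background loop and reconciles them automatically.
* **Core features**:
  * **Drift detection and recovery playbook**:
    * **Drift A (crash/dead session)**: Detects sessions marked `running` in the YAML whose tmux process has died ➔ downgrades the status to `terminated`.
    * **Drift B (new session detected)**: Automatically registers `*-creator-*` sessions that exist in tmux but not in the YAML, and refreshes their child PID information.
    * **Drift C (live UUID refresh)**: When a newly started agent receives its first command and creates a session ID, finds the newest UUID among the on-disk session logs whose modification time matches and injects it into the `*_conversation_id_own` field.
    * **Drift D (cache consistency check)**: Checks whether the session UUIDs in the registry and cache still exist on disk and reports purged sessions.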
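
Drift A and Drift B above amount to a set comparison between the registry and the live tmux session list. The following sketch shows that comparison in isolation; `classify_drift`, its inputs, and the returned dictionary shape are hypothetical, and the real `reconcile.sh` also handles PID refresh and UUID discovery, which are omitted here.

```python
from fnmatch import fnmatch


def classify_drift(registry, live_sessions):
    """Compare registry state with live tmux session names.

    registry: dict mapping session name -> status string.
    live_sessions: iterable of session names currently present in tmux.
    Illustrative only; names and return shape are assumptions.
    """
    live = set(live_sessions)
    # Drift A: registry says running, but tmux no longer has the session.
    dead = sorted(
        name for name, status in registry.items()
        if status == "running" and name not in live
    )
    # Drift B: tmux has a *-creator-* session the registry does not know about.
    unregistered = sorted(
        name for name in live
        if name not in registry and fnmatch(name, "*-creator-*")
    )
    return {"drift_a": dead, "drift_b": unregistered}
```

In practice the `live_sessions` input would come from something like the output of `tmux list-sessions`, parsed by the caller.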
---
## 4. Agent State Machine
Throughout the system, agent sessions transition along the flow below.
```mermaid
stateDiagram-v2
[*] --> running : multi-agent-mux-create / Drift B
running --> stopped : multi-agent-mux-stop (default)
running --> terminated : multi-agent-mux-stop (--purge-conversation) / Drift A
stopped --> running : multi-agent-mux-resume
terminated --> [*]
```
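
The transitions in the diagram above can be encoded as a small lookup table, which makes illegal moves easy to reject. This is a sketch only: the `ALLOWED_TRANSITIONS` set and the `transition` function are illustrative names, and the real validation lives in `lib.sh`.

```python
# Allowed (current, target) pairs, taken from the state diagram above.
ALLOWED_TRANSITIONS = {
    ("running", "stopped"),      # multi-agent-mux-stop (default)
    ("running", "terminated"),   # stop with --purge-conversation, or Drift A
    ("stopped", "running"),      # multi-agent-mux-resume
}


def transition(current, target):
    """Return `target` if the move is allowed, otherwise raise ValueError."""
    if (current, target) not in ALLOWED_TRANSITIONS:
        raise ValueError(f"illegal transition: {current} -> {target}")
    return target
```

Because `terminated` has no outgoing pair in the table, a purged session can never be moved back to `running`.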
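## 5. Cautions for Optimization and Refactoring Work
1. **Do not defeat atomic writes**: The `atomic_dump_yaml` function in `lib.sh` is the backbone that prevents data corruption when multiple agents start in parallel. Do not damage the DB locking or the transaction flow.
2. **Preserve the TUI input bindings for Cline and Claude**: When resuming or stopping sessions, the fine differences between the prompt control commands each agent uses internally (for example `/exit`, `--id <session>`) must be preserved for everything to work without exceptions.
3. **Beware of database variable collisions**: When running subshells or inline Python scripts, never pollute the namespace of the global SQLite connection (`conn`) (for example, to prevent a recurrence of the `stop_session.sh` bug).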
+52
@@ -0,0 +1,52 @@
# 🚀 Multi-Agent Mux (MAM) Deployment & Gitea Integration
This directory contains packaging templates and installation scripts to deploy the **Multi-Agent Mux** framework into workspaces hosted on **Gitea** (or GitHub).
---
## 📁 Deployment Directory Structure
* **`install.sh`**: A self-contained, idempotent shell installer that checks system requirements (`tmux`, `python3`, `pip3`), detects NFS/network filesystem mounts, sets up a local python virtual environment (`.venv`), and initializes environment configuration (`.env`).
* **`plugin.json`**: Metadata declaration file to register MAM as an installable plugin for AI Agent coding platforms (such as Claude Code, Antigravity, or other TUI clients).
* **`gitea-ci.yml`**: CI/CD pipeline definition template for Gitea Actions (running ShellCheck linting on bash scripts, validation on python scripts, and compilation tests).
---
## 📦 How to Install and Deploy
### 1. Simple One-Liner Installation (from Gitea repository)
Once you push this repository to your Gitea instance, users can install it in their local workspace directory by running:
```bash
curl -fsSL https://git.godopu.com/tmpl/multi-agent-mux/raw/branch/main/deploy/install.sh | bash
```
Alternatively, if they have cloned the repository, they can execute:
```bash
bash deploy/install.sh
```
### 2. Registering as a Workspace Plugin
To register these skills globally or for a specific workspace:
* **Workspace Level**: Copy the `.agents/` folder into your project root.
* **Global Level (Gemini/Antigravity)**: Register the plugin path in your global config file at `~/.gemini/config/skills.json`:
```json
{
"entries": [
{ "path": "/absolute/path/to/multi-agent-mux/.agents/skills" }
]
}
```
---
## 🤖 Gitea Actions CI/CD Setup
To automate testing and script linting on your Gitea repository:
1. Ensure Gitea Actions is enabled on your Gitea instance.
2. Copy the Gitea CI workflow to your workspace's workflow folder:
```bash
mkdir -p .gitea/workflows
cp deploy/gitea-ci.yml .gitea/workflows/ci.yml
```
3. Commit and push to your Gitea repository. The pipeline will validate shell syntax and python file compilation on every push to `main` and pull requests.
+70
@@ -0,0 +1,70 @@
# ==============================================================================
# Gitea Actions CI/CD Workflow Template
# ==============================================================================
# Place this file in '.gitea/workflows/ci.yml' or use it directly in Gitea's CI.
# It automatically validates shell scripts and checks Python formatting/syntax.
# ==============================================================================
name: Multi-Agent Mux CI
on:
  push:
    branches: [ main, dev ]
  pull_request:
    branches: [ main ]
jobs:
  lint-shell:
    name: Lint Shell Scripts
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Code
        uses: actions/checkout@v3
      - name: Install ShellCheck
        run: |
          sudo apt-get update
          sudo apt-get install -y shellcheck
      - name: Run ShellCheck
        run: |
          echo "🔍 Linting shell scripts..."
          shellcheck .agents/skills/lib.sh
          shellcheck .agents/skills/multi-agent-mux-create/scripts/create_session.sh
          shellcheck .agents/skills/multi-agent-mux-stop/scripts/stop_session.sh
          shellcheck .agents/skills/multi-agent-mux-monitor/scripts/reconcile.sh
          shellcheck deploy/install.sh
          echo "✅ ShellCheck completed successfully."
  lint-python:
    name: Lint Python Backplane
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Code
        uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
          cache: 'pip'
      - name: Install Dependencies
        run: |
          python -m pip install --upgrade pip
          pip install flake8 pylint
          if [ -f .agents/skills/multi-agent-mux-delegate-job/requirements.txt ]; then
            pip install -r .agents/skills/multi-agent-mux-delegate-job/requirements.txt
          fi
      - name: Run Flake8 (Syntax/Error Check)
        run: |
          echo "🔍 Checking Python syntax with flake8..."
          flake8 .agents/skills/multi-agent-mux-delegate-job/scripts/ --count --select=E9,F63,F7,F82 --show-source --statistics
          # exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
          flake8 .agents/skills/multi-agent-mux-delegate-job/scripts/ --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
      - name: Run Python Syntax Check (Compile test)
        run: |
          echo "🔍 Verifying Python file compilation..."
          python -m py_compile .agents/skills/multi-agent-mux-delegate-job/scripts/*.py
          echo "✅ All Python files compiled successfully."
+256
@@ -0,0 +1,256 @@
#!/usr/bin/env bash
# ==============================================================================
# install.sh — Multi-Agent Mux (MAM) Orchestration Installer
# ==============================================================================
# Idempotent, robust installer to bootstrap MAM orchestration skills
# and Python backplane dependencies on any local workspace.
# ==============================================================================
set -euo pipefail
# --- Configuration & Defaults ---
TARGET_DIR="${1:-$(pwd)}"
VENV_NAME=".venv"
MIN_PYTHON_VERSION="3.9"
echo "===================================================================="
echo "⚡ Starting Multi-Agent Mux (MAM) Installation"
echo "📂 Target Workspace: $TARGET_DIR"
echo "===================================================================="
# --- 1. System Requirements Validation ---
echo "🔍 Checking system dependencies..."
check_cmd() {
local cmd="$1"
if ! command -v "$cmd" &>/dev/null; then
echo "❌ Error: '$cmd' is not installed. Please install it first." >&2
exit 1
fi
}
check_cmd tmux
check_cmd python3
# Verify Python Version
PYTHON_VERSION=$(python3 -c 'import sys; print(f"{sys.version_info.major}.{sys.version_info.minor}")')
PYTHON_MAJOR="${MIN_PYTHON_VERSION%%.*}"
PYTHON_MINOR="${MIN_PYTHON_VERSION##*.}"
if python3 -c "import sys; exit(0 if sys.version_info >= ($PYTHON_MAJOR, $PYTHON_MINOR) else 1)"; then
echo "✅ Python $PYTHON_VERSION detected."
else
echo "❌ Error: Python version must be $MIN_PYTHON_VERSION or higher. Detected: $PYTHON_VERSION" >&2
exit 1
fi
# Verify PyYAML (needed by system python3 for atomic state writes)
if ! python3 -c "import yaml" &>/dev/null; then
echo "❌ Error: 'PyYAML' is not installed in the system python3. Please install it first" >&2
echo " (e.g., 'pip3 install PyYAML' or 'sudo apt-get install python3-yaml')." >&2
exit 1
fi
echo "✅ PyYAML (system dependency) detected."
# --- 2. Workspace Setup ---
mkdir -p "$TARGET_DIR"
cd "$TARGET_DIR"
REPO_URL="https://git.godopu.com/tmpl/multi-agent-mux.git"
ARCHIVE_URL="https://git.godopu.com/tmpl/multi-agent-mux/archive/main.tar.gz"
# Helper to verify presence of all core runtime files.
# Keying off a set of core files helps detect and recover from partial/interrupted installations.
check_assets_present() {
local dir="${1:-.}"
local core_files=(
".agents/skills/lib.sh"
".agents/skills/multi-agent-mux-create/scripts/create_session.sh"
".agents/skills/multi-agent-mux-delegate-job/scripts/registry.py"
".agents/skills/multi-agent-mux-status/scripts/status.sh"
)
for f in "${core_files[@]}"; do
if [ ! -f "$dir/$f" ]; then
return 1
fi
done
return 0
}
# Fetch the orchestration assets if missing or incomplete (for curl one-liner installs).
#
# Safety model (FW-D1): we NEVER extract the repo archive directly into the
# target. Running inside an existing project must not overwrite the target's
# own files (README.md, FUTURE_WORKS.md, …) or litter it with this repo's
# development docs. Instead we stage the download into a throwaway temp dir,
# verify it, then copy ONLY the runtime assets (.agents/, documents, .env.example)
# into the target with per-file no-clobber guards so a pre-existing target file
# always wins.
if ! check_assets_present "."; then
echo "📥 Orchestration skills not found or incomplete. Fetching from Gitea repository..."
STAGE_DIR="$(mktemp -d)"
trap 'rm -rf "$STAGE_DIR"' EXIT
if command -v git &>/dev/null; then
echo "🌐 Cloning repository (shallow) into a staging area..."
git clone --depth 1 "$REPO_URL" "$STAGE_DIR"
elif command -v curl &>/dev/null; then
echo "🌐 Downloading and extracting archive into a staging area..."
curl -fsSL "$ARCHIVE_URL" | tar -xz --strip-components=1 -C "$STAGE_DIR"
else
echo "❌ Error: neither 'git' nor 'curl' is available to fetch the skills." >&2
exit 1
fi
# Verify the staged tree before we trust and copy from it.
if ! check_assets_present "$STAGE_DIR"; then
echo "❌ Error: fetched source is missing core runtime assets. Aborting (no files copied)." >&2
exit 1
fi
# Create metadata directory and initialize manifest before copying
mkdir -p .mam
MANIFEST_FILE=".mam/install_manifest.txt"
touch "$MANIFEST_FILE"
# Copy ONLY runtime assets into the target, never overwriting an existing
# target file. We merge per-file (POSIX find + an explicit "[ ! -e ]" guard)
# instead of `cp -n`: `cp -n` is non-portable and now prints a deprecation
# warning on GNU coreutils 9.x, whereas the explicit guard is portable to
# BSD/macOS and makes the no-clobber intent obvious.
mkdir -p .agents
( cd "$STAGE_DIR/.agents" && find . -type f -print ) | while IFS= read -r rel; do
dest=".agents/${rel#./}"
if [ ! -e "$dest" ]; then
mkdir -p "$(dirname "$dest")"
cp "$STAGE_DIR/.agents/$rel" "$dest" || { echo "❌ Error: Failed to copy $rel" >&2; exit 1; }
echo "$dest" >> "$MANIFEST_FILE"
fi
done
# Copy non-dev documents if they don't already exist.
# We skip dev-specific docs like README.md, DONE.md, and FUTURE_WORKS.md.
for doc in MESSAGING.md BOOTSTRAP.md BOOTSTRAP.ko.md INSTRUCTION.md; do
if [ -f "$STAGE_DIR/$doc" ] && [ ! -e "$doc" ]; then
cp "$STAGE_DIR/$doc" . || { echo "❌ Error: Failed to copy $doc" >&2; exit 1; }
echo "$doc" >> "$MANIFEST_FILE"
fi
done
if [ -f "$STAGE_DIR/deploy/remove.sh" ] && [ ! -e "remove.sh" ]; then
cp "$STAGE_DIR/deploy/remove.sh" remove.sh || { echo "❌ Error: Failed to copy remove.sh" >&2; exit 1; }
chmod +x remove.sh
echo "remove.sh" >> "$MANIFEST_FILE"
fi
if [ -f "$STAGE_DIR/deploy/update.sh" ] && [ ! -e "update.sh" ]; then
cp "$STAGE_DIR/deploy/update.sh" update.sh || { echo "❌ Error: Failed to copy update.sh" >&2; exit 1; }
chmod +x update.sh
echo "update.sh" >> "$MANIFEST_FILE"
fi
if [ -f "$STAGE_DIR/.env.example" ] && [ ! -e ".env.example" ]; then
cp "$STAGE_DIR/.env.example" . || { echo "❌ Error: Failed to copy .env.example" >&2; exit 1; }
echo ".env.example" >> "$MANIFEST_FILE"
fi
rm -rf "$STAGE_DIR"
trap - EXIT
echo "✅ Skills staged into workspace (existing files preserved)."
fi
# Sanity check: verify all core files, not just a single one — an empty or
# incomplete layout would yield a silently broken install.
if ! check_assets_present "."; then
echo "❌ Error: Core runtime assets missing after setup. Target layout might be invalid." >&2
exit 1
fi
echo "✅ Orchestration skills present."
echo "📂 Ensuring metadata directory structure (.mam/)..."
mkdir -p .mam/jobs .mam/delegate_job_logs
# Restrict the metadata/database directory to the current user, but only when we own it
# (chmod on a directory owned by someone else would fail on shared, multi-user systems).
if [ -O .mam ]; then
chmod 0700 .mam
fi
# --- 3. Check Network File System (NFS) Warnings ---
echo "💾 Detecting file system mount type..."
if command -v df &>/dev/null && command -v mount &>/dev/null; then
MOUNTPOINT="$(df --output=target . 2>/dev/null | tail -1 || echo "")"
if [ -n "$MOUNTPOINT" ]; then
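# Match only the mount entry for this exact mount point; a bare "$MOUNTPOINT.*nfs"
# pattern would match every NFS line when the mount point is "/".
if mount | grep -F -- " on $MOUNTPOINT " | grep -Eq 'nfs|cifs|fuse\.sshfs'; then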
echo "⚠️ WARNING: Target directory is on a network filesystem (NFS/CIFS/SSHFS)."
echo " SQLite WAL journaling and file locks are UNRELIABLE on network storage."
echo " The sqlite3 registry will fall back to 'DELETE' journaling instead of WAL."
else
echo "✅ File system supports WAL (Local storage detected)."
fi
fi
fi
# --- 4. Python Virtual Environment Setup ---
echo "🐍 Bootstrapping Python virtual environment (.venv)..."
if [ ! -d "$VENV_NAME" ]; then
python3 -m venv "$VENV_NAME"
echo "✅ Virtual environment created."
else
echo "ℹ️ Virtual environment (.venv) already exists. Skipping creation."
fi
# Activate virtual environment
# shellcheck disable=SC1091
source "$VENV_NAME"/bin/activate
# Upgrade pip
pip install --upgrade pip
# Install requirements
REQ_FILE=".agents/skills/multi-agent-mux-delegate-job/requirements.txt"
if [ -f "$REQ_FILE" ]; then
echo "📦 Installing backplane dependencies from $REQ_FILE..."
pip install -r "$REQ_FILE"
echo "✅ Dependencies installed successfully."
else
echo "⚠️ WARNING: Could not find requirements file: $REQ_FILE"
echo " Installing default packages (paho-mqtt, pyyaml) manually..."
pip install "paho-mqtt>=2.0.0" pyyaml
fi
# --- 5. Generate Environment Template ---
ENV_FILE=".env"
ENV_EXAMPLE=".env.example"
if [ ! -f "$ENV_FILE" ]; then
if [ -f "$ENV_EXAMPLE" ]; then
echo "📝 Creating configuration from $ENV_EXAMPLE..."
cp "$ENV_EXAMPLE" "$ENV_FILE"
else
echo "📝 Creating default $ENV_FILE..."
touch "$ENV_FILE"
fi
# Always append the active defaults to ensure they are set and not commented out
cat <<EOF >> "$ENV_FILE"
# === Installer-applied active defaults ===
MQTT_BROKER=broker.hivemq.com
MQTT_PORT=1883
MQTT_TLS=0
MQTT_CLIENT_ID_PREFIX=mam-agent
TMUX_SERVER_NAME=default
EOF
chmod 0600 "$ENV_FILE"
echo "✅ Config file .env initialized with chmod 0600."
# Record the newly created .env in the manifest
mkdir -p .mam
touch .mam/install_manifest.txt
echo "$ENV_FILE" >> .mam/install_manifest.txt
else
echo "$ENV_FILE already exists. Skipping config override."
fi
echo "===================================================================="
echo "🎉 Installation complete!"
echo "✨ You can now run the status or monitor skills."
echo "💡 Hint: Try executing: .venv/bin/python .agents/skills/multi-agent-mux-delegate-job/scripts/registry.py list"
echo "===================================================================="
+5
@@ -0,0 +1,5 @@
{
"name": "multi-agent-mux",
"description": "Multi-Agent Orchestration & Messaging Backplane on Tmux & MQTT.",
"disabled": false
}
+211
@@ -0,0 +1,211 @@
#!/usr/bin/env bash
# ==============================================================================
# remove.sh — Multi-Agent Mux (MAM) Orchestration Uninstaller
# ==============================================================================
# Safely removes MAM orchestration skills, virtual environment, and metadata.
# Leaves pre-existing user configurations and files untouched by reading
# the installation manifest (.mam/install_manifest.txt).
# ==============================================================================
set -euo pipefail
TARGET_DIR=""
FORCE=0
PURGE_ENV=0
# Parse arguments
while [[ $# -gt 0 ]]; do
case "$1" in
-y|--yes|--force)
FORCE=1
shift
;;
--purge-env)
PURGE_ENV=1
shift
;;
*)
TARGET_DIR="$1"
shift
;;
esac
done
if [ -z "$TARGET_DIR" ]; then
TARGET_DIR="$(pwd)"
fi
echo "===================================================================="
echo "⚡ Starting Multi-Agent Mux (MAM) Uninstallation"
echo "📂 Target Workspace: $TARGET_DIR"
echo "===================================================================="
if [ ! -d "$TARGET_DIR" ]; then
echo "❌ Error: Target directory '$TARGET_DIR' does not exist." >&2
exit 1
fi
cd "$TARGET_DIR"
# 1. Non-interactive input safety guard (set -e read crash prevention)
if [ ! -t 0 ] && [ $FORCE -eq 0 ]; then
echo "❌ Error: Non-interactive terminal detected. Please run with -y/--yes/--force." >&2
exit 1
fi
# Check if there is anything to remove
MANIFEST_FILE=".mam/install_manifest.txt"
any_exist=0
manifest_files=()
# Load the install manifest if it exists
if [ -f "$MANIFEST_FILE" ]; then
any_exist=1
while IFS= read -r line; do
if [ -n "$line" ]; then
manifest_files+=("$line")
fi
done < "$MANIFEST_FILE"
else
# Fallback to the core MAM directories to check if any exist
fallback_assets=(
".agents/skills/lib.sh"
".agents/skills/multi-agent-mux-create"
".agents/skills/multi-agent-mux-delegate-job"
".agents/skills/multi-agent-mux-monitor"
".agents/skills/multi-agent-mux-resume"
".agents/skills/multi-agent-mux-status"
".agents/skills/multi-agent-mux-stop"
".venv"
".mam"
)
for asset in "${fallback_assets[@]}"; do
if [ -e "$asset" ] || [ -h "$asset" ]; then
any_exist=1
break
fi
done
fi
if [ $any_exist -eq 0 ]; then
echo "ℹ️ No MAM assets detected in '$TARGET_DIR'. Nothing to do."
exit 0
fi
# Request confirmation if not forced
if [ $FORCE -eq 0 ]; then
echo "⚠️ WARNING: This will permanently remove the MAM orchestration skills, "
echo " virtual environment (.venv), local metadata (.mam), and docs."
echo " (Your own custom files inside .agents/ will NOT be touched)."
if ! read -p "❓ Are you sure you want to proceed? [y/N]: " -r response; then
response="n"
fi
if [[ ! "$response" =~ ^[yY](es)?$ ]]; then
echo "❌ Uninstallation cancelled by user."
exit 0
fi
fi
delete_asset() {
local asset="$1"
if [ -e "$asset" ] || [ -h "$asset" ]; then
echo "🗑️ Removing: $asset"
rm -rf "$asset"
fi
}
# 2. Uninstall files using the manifest if present
if [ ${#manifest_files[@]} -gt 0 ]; then
echo "📜 Manifest found. Reversing installer-created files..."
for f in ${manifest_files[@]+"${manifest_files[@]}"}; do
# Skip .env and remove.sh for now, they are handled separately
if [ "$f" = ".env" ] || [ "$f" = "remove.sh" ]; then
continue
fi
delete_asset "$f"
done
else
# Fallback: Delete MAM skills manually (only if manifest is missing)
echo "⚠️ No manifest found. Deleting standard MAM skills..."
delete_asset ".agents/skills/lib.sh"
delete_asset ".agents/skills/multi-agent-mux-create"
delete_asset ".agents/skills/multi-agent-mux-delegate-job"
delete_asset ".agents/skills/multi-agent-mux-monitor"
delete_asset ".agents/skills/multi-agent-mux-resume"
delete_asset ".agents/skills/multi-agent-mux-status"
delete_asset ".agents/skills/multi-agent-mux-stop"
fi
# 3. Clean up empty parent directories under .agents recursively to avoid littering
if [ -d ".agents" ]; then
find .agents -depth -type d -exec rmdir {} + 2>/dev/null || true
fi
# 4. Remove virtual environment, monitor cache, and metadata database
delete_asset ".venv"
delete_asset ".cache/multi-agent-mux-monitor"
delete_asset ".mam" # Deletes manifest file too
# 5. Clean up .env file (Only if created by installer, or forced with --purge-env)
# If .env is in manifest, it means MAM created it.
env_created_by_mam=0
for f in ${manifest_files[@]+"${manifest_files[@]}"}; do
if [ "$f" = ".env" ]; then
env_created_by_mam=1
break
fi
done
if [ -f ".env" ]; then
should_delete_env=0
if [ $PURGE_ENV -eq 1 ]; then
should_delete_env=1
elif [ $env_created_by_mam -eq 1 ]; then
# Even if MAM created it, ask or rename to backup to prevent loss of custom secrets
if [ $FORCE -eq 1 ]; then
should_delete_env=1
else
if ! read -p "❓ MAM-created '.env' found. Delete it? (Saying No preserves it) [y/N]: " -r env_response; then
env_response="n"
fi
if [[ "$env_response" =~ ^[yY](es)?$ ]]; then
should_delete_env=1
fi
fi
fi
if [ $should_delete_env -eq 1 ]; then
delete_asset ".env"
else
if [ $env_created_by_mam -eq 1 ]; then
backup_name=".env.mam-backup"
if [ -e "$backup_name" ]; then
backup_name=".env.mam-backup.$(date +%Y%m%d%H%M%S)"
fi
echo "💾 Backing up .env configuration to $backup_name..."
mv ".env" "$backup_name"
else
echo "ℹ️ Preserving user-owned .env configuration."
fi
fi
fi
# 6. Remove uninstaller file itself (if we are in the target root)
# Simple check: only delete remove.sh if it is recorded in the manifest
remove_in_manifest=0
for f in ${manifest_files[@]+"${manifest_files[@]}"}; do
if [ "$f" = "remove.sh" ]; then
remove_in_manifest=1
break
fi
done
if [ -f "remove.sh" ] && [ $remove_in_manifest -eq 1 ]; then
echo "🗑️ Removing uninstaller: remove.sh"
# Self-delete is the final action
rm -f "remove.sh"
fi
echo "===================================================================="
echo "🎉 Uninstallation complete!"
echo "===================================================================="
+181
@@ -0,0 +1,181 @@
#!/usr/bin/env bash
# ==============================================================================
# update.sh — Multi-Agent Mux (MAM) Orchestration Updater
# ==============================================================================
# Safely updates MAM skills, virtual environment, and docs to the latest version.
# Preserves user configuration (.env) and local metadata/jobs database (.mam).
# ==============================================================================
set -euo pipefail
TARGET_DIR=""
FORCE=0
# Parse arguments
while [[ $# -gt 0 ]]; do
case "$1" in
-y|--yes|--force)
FORCE=1
shift
;;
*)
TARGET_DIR="$1"
shift
;;
esac
done
if [ -z "$TARGET_DIR" ]; then
TARGET_DIR="$(pwd)"
else
if [ ! -d "$TARGET_DIR" ]; then
echo "❌ Error: Target directory '$TARGET_DIR' does not exist." >&2
exit 1
fi
TARGET_DIR="$(cd "$TARGET_DIR" && pwd)"
fi
echo "===================================================================="
echo "⚡ Starting Multi-Agent Mux (MAM) Update"
echo "📂 Target Workspace: $TARGET_DIR"
echo "===================================================================="
cd "$TARGET_DIR"
# 1. Verification of existing install
if [ ! -f "remove.sh" ]; then
echo "❌ Error: No MAM installation (remove.sh) found in '$TARGET_DIR'." >&2
echo " Please run install.sh first to set up the workspace." >&2
exit 1
fi
# Request confirmation if not forced
if [ $FORCE -eq 0 ]; then
echo "⚠️ WARNING: This will update MAM orchestration skills, virtual environment, "
echo " and docs to the latest version."
echo " (Your configuration, job history, and custom skills will be preserved)."
if [ ! -t 0 ]; then
echo "❌ Error: Non-interactive terminal detected. Please run with -y/--yes/--force." >&2
exit 1
fi
if ! read -p "❓ Proceed with update? [y/N]: " -r response; then
response="n"
fi
if [[ ! "$response" =~ ^[yY](es)?$ ]]; then
echo "❌ Update cancelled by user."
exit 0
fi
fi
# 2. Stage backups of user configurations and metadata to prevent deletion
echo "💾 Backing up configuration and database..."
HAS_ENV=0
if [ -f ".env" ]; then
HAS_ENV=1
mv ".env" ".env.update-tmp"
fi
HAS_MAM=0
if [ -d ".mam" ]; then
HAS_MAM=1
# Copy database and jobs to temporary backup outside of .mam.
# We do NOT move the .mam folder away so that remove.sh can still read .mam/install_manifest.txt!
mkdir -p .mam.update-tmp
# Copy SQLite databases and session files
for db in .mam/agent-sessions.*; do
if [ -f "$db" ]; then
cp -f "$db" .mam.update-tmp/
fi
done
# Copy jobs history
if [ -d ".mam/jobs" ] && [ "$(ls -A .mam/jobs 2>/dev/null)" ]; then
mkdir -p .mam.update-tmp/jobs
cp -rf .mam/jobs/* .mam.update-tmp/jobs/
fi
# Copy delegate logs
if [ -d ".mam/delegate_job_logs" ] && [ "$(ls -A .mam/delegate_job_logs 2>/dev/null)" ]; then
mkdir -p .mam.update-tmp/delegate_job_logs
cp -rf .mam/delegate_job_logs/* .mam.update-tmp/delegate_job_logs/
fi
# Copy manifest so we have a backup
if [ -f ".mam/install_manifest.txt" ]; then
cp -f .mam/install_manifest.txt .mam.update-tmp/
fi
fi
# Define trap to restore backup files on failure
restore_on_failure() {
echo "❌ Update failed. Reverting configuration and database to previous state..."
if [ $HAS_ENV -eq 1 ] && [ -f ".env.update-tmp" ]; then
mv -f ".env.update-tmp" ".env" 2>/dev/null || true
fi
if [ $HAS_MAM -eq 1 ] && [ -d ".mam.update-tmp" ]; then
# Revert to old database/jobs backup by restoring .mam directory
rm -rf .mam 2>/dev/null || true
mkdir -p .mam
cp -f .mam.update-tmp/agent-sessions.* .mam/ 2>/dev/null || true
if [ -d ".mam.update-tmp/jobs" ]; then
cp -rf .mam.update-tmp/jobs .mam/ 2>/dev/null || true
fi
if [ -d ".mam.update-tmp/delegate_job_logs" ]; then
cp -rf .mam.update-tmp/delegate_job_logs .mam/ 2>/dev/null || true
fi
if [ -f ".mam.update-tmp/install_manifest.txt" ]; then
cp -f .mam.update-tmp/install_manifest.txt .mam/ 2>/dev/null || true
fi
rm -rf .mam.update-tmp 2>/dev/null || true
fi
}
trap restore_on_failure EXIT
# 3. Perform uninstallation of existing files
echo "🗑️ Removing existing installation..."
# remove.sh will run in manifest mode because .mam/install_manifest.txt is still present.
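# It deletes the manifest-listed files (skills under .agents/, docs, helper scripts),
# then .venv and the .mam folder. Files in .agents/ that are not in the manifest are kept.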
bash remove.sh --force
# 4. Fetch and run the latest installer from Gitea
echo "📥 Fetching and running the latest installer..."
INSTALLER_URL="https://git.godopu.com/tmpl/multi-agent-mux/raw/branch/main/deploy/install.sh"
if command -v curl &>/dev/null; then
curl -fsSL "$INSTALLER_URL" | bash -s -- "$TARGET_DIR"
elif command -v wget &>/dev/null; then
wget -qO- "$INSTALLER_URL" | bash -s -- "$TARGET_DIR"
else
echo "❌ Error: Neither 'curl' nor 'wget' is available to fetch the installer." >&2
exit 1
fi
# Disable failure trap since installation succeeded
trap - EXIT
# 5. Restore backups of configuration and database
echo "🔄 Restoring configuration and database..."
if [ $HAS_ENV -eq 1 ]; then
# Overwrite the default .env created by installer (if any) with the user's backup
mv -f ".env.update-tmp" ".env"
fi
if [ $HAS_MAM -eq 1 ]; then
if [ -d ".mam.update-tmp" ]; then
# The installer created a new .mam directory with a new manifest.
# We want to merge the old .mam database/jobs back while keeping the new manifest.
for db in .mam.update-tmp/agent-sessions.*; do
if [ -f "$db" ]; then
cp -f "$db" .mam/
fi
done
if [ -d ".mam.update-tmp/jobs" ] && [ "$(ls -A .mam.update-tmp/jobs 2>/dev/null)" ]; then
mkdir -p .mam/jobs
cp -rf .mam.update-tmp/jobs/* .mam/jobs/
fi
if [ -d ".mam.update-tmp/delegate_job_logs" ] && [ "$(ls -A .mam.update-tmp/delegate_job_logs 2>/dev/null)" ]; then
mkdir -p .mam/delegate_job_logs
cp -rf .mam.update-tmp/delegate_job_logs/* .mam/delegate_job_logs/
fi
rm -rf ".mam.update-tmp"
fi
fi
echo "===================================================================="
echo "🎉 Update complete!"
echo "===================================================================="
@@ -1,11 +0,0 @@
# tmux-agent-orchestrate-delegate-job skill
A Hermes skill that delegates a job to an autonomous agent (claude-code/codex/opencode/human) and observes it
asynchronously over an MQTT event channel. **Start with [`SKILL.md`](./SKILL.md).**
- Protocol/schema: [`job-protocol.md`](./job-protocol.md)
- Broker PoC → production migration: [`mqtt-broker-setup.md`](./mqtt-broker-setup.md)
- Registry format/concurrency: [`registry.md`](./registry.md)
- Reference implementation: [`tmux-agent-orchestrate-delegate-job`](./tmux-agent-orchestrate-delegate-job) (bash wrapper), [`scripts/publish_event.py`](./scripts/publish_event.py), [`scripts/job_subscriber.py`](./scripts/job_subscriber.py), [`scripts/registry.py`](./scripts/registry.py), [`scripts/mqtt_common.py`](./scripts/mqtt_common.py)
- Persistent audit log: `.hermes/delegate_job_logs/<job_id>/` (`meta.json`, `events.ndjson`, `status.json`)
Query it with `tmux-agent-orchestrate-delegate-job logs <id>` or `tmux-agent-orchestrate-delegate-job logs --list` (see "Audit Logs" in SKILL.md)
@@ -1,385 +0,0 @@
---
name: tmux-agent-orchestrate-delegate-job
description: "Delegate a unit of work to any autonomous agent (claude-code, codex, opencode, or a human) and observe it asynchronously over an MQTT event channel. Each job gets a unique id, a registry record (prompt, broker, status, timeouts), and a single per-job topic that carries started/permission_required/progress/completed/error events as schema-versioned JSON. The delegator starts a subscriber first, runs the agent, and treats a completed/error event or a timeout as the job's terminal state. Ships a working reference implementation (publish_event.py, job_subscriber.py, registry.py, mqtt_common.py, tmux-agent-orchestrate-delegate-job wrapper) plus a PoC-to-production path: validate on a public broker, then move to an authenticated TLS broker by changing config only — no code change. Use when you need fire-and-observe delegation, multi-job fan-out across tmux sessions, or a uniform completion-signal protocol shared by several agent types."
version: 1.0.0
author: Hermes Agent
license: MIT
platforms: [linux, macos, windows]
metadata:
hermes:
tags: [agent-delegation, mqtt, jobs, orchestration, async-completion]
related_skills: [claude-code, codex, opencode, hermes-agent-skill-authoring]
---
# tmux-agent-orchestrate-delegate-job — Async Job Delegation over MQTT
Delegate a unit of work to an autonomous agent, then **observe** it instead of
blocking on it. Every job gets a unique id and a registry record; the agent
publishes lifecycle events (`started`, `permission_required`, `progress`,
`completed`, `error`) to a per-job MQTT topic; the delegator subscribes and
treats `completed`/`error` — or a timeout — as the terminal state.
This skill is a **reference implementation**: copy the files in this directory
into your project and customise. The `communication_over_mqtt` project is the
canonical concrete instance.
## Overview
The model is deliberately small. A **job** is one delegated task. An **agent**
is a worker (a claude-code tmux session, a codex run, a human). The **registry**
(`.hermes/jobs/<id>.json`) holds everything about a job so nothing important
lives in environment variables — which means one tmux session can process many
jobs sequentially, and many sessions can fan out in parallel, with no env
collisions. The **event channel** is one MQTT topic per job carrying JSON
payloads; `event` discriminates the type.
Responsibility is split into exactly one entry point each:
[`publish_event.py`](./scripts/publish_event.py) emits events (registry lookup,
monotonic `seq`, retry+backoff) and [`job_subscriber.py`](./scripts/job_subscriber.py)
observes them (timeouts, terminal state machine, defensive parsing). Shared
logic lives in [`mqtt_common.py`](./scripts/mqtt_common.py); registry I/O in
[`registry.py`](./scripts/registry.py). The demo `publisher.py`/`subscriber.py`
in the host project stay frozen.
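The emitter side described above (monotonic `seq`, retry with backoff) can be sketched roughly as follows. This is an illustration only, not the contents of `publish_event.py`: `build_event`, `publish_with_retry`, the `send` callable, and the retry parameters are all hypothetical names chosen for the sketch, and the real script may structure this differently.

```python
import json
import time
from datetime import datetime, timezone
from typing import Callable


def build_event(job_id: str, event: str, last_seq: int, detail: str = "") -> dict:
    """Build a schema_version=1 payload; seq is last_seq + 1 (first event = 1)."""
    return {
        "schema_version": 1,
        "seq": last_seq + 1,
        "job_id": job_id,
        "event": event,
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "detail": detail,
    }


def publish_with_retry(send: Callable[[str, str], None], topic: str, payload: dict,
                       attempts: int = 4, base_delay: float = 0.5,
                       sleep: Callable[[float], None] = time.sleep) -> int:
    """Call send(topic, body) until it succeeds; return the number of tries used.

    `send` is a hypothetical stand-in for the real MQTT publish call. The delay
    doubles after each failure, and the last failure is re-raised to the caller.
    """
    body = json.dumps(payload, ensure_ascii=False)
    for attempt in range(1, attempts + 1):
        try:
            send(topic, body)
            return attempt
        except OSError:
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("attempts must be >= 1")
```

Passing `send` and `sleep` as parameters keeps the sketch free of any broker dependency, so the retry logic can be exercised without a network.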
Two stages, same code. **PoC** runs on the public `broker.hivemq.com` to wire up
the protocol. **Production** moves to your own authenticated TLS broker — the
switch is **config only** (env vars + the registry `broker.*` block), never a
code change. See [`mqtt-broker-setup.md`](./mqtt-broker-setup.md).
## When to Use / When NOT to Use
**Use when:**
- you want **fire-and-observe** delegation — kick off work and get a completion
signal rather than blocking a terminal;
- several agent types (claude-code, codex, opencode, human) must follow **one**
completion protocol;
- you need **multi-job fan-out** across tmux sessions with safe job claiming;
- you want a clean PoC → authenticated-broker upgrade path.
**Do NOT use when:**
- a one-shot `claude -p '…'` that returns inline is enough (no async signal
needed) — just use the [claude-code](../claude-code/SKILL.md) skill directly;
- you need request/response RPC or large artifact transfer (this is a
one-direction event stream, not a data bus);
- the payload would carry secrets and you're still on the public broker — move
to the own-broker stage first.
## Quick Start
The one-line wrapper handles register + subscriber-first + agent launch. If
you're new, **start here** and only fall back to the manual 5-step flow when
you need finer control.
```bash
# 1) one line: register → start subscriber → launch agent in tmux
# (uses public broker by default; last stdout line is the audit-log dir)
tmux-agent-orchestrate-delegate-job submit \
--agent claude-code \
--prompt "Create 10 sorting problems and save them to sort_problems.md" \
--workdir /path/to/project \
--agent-session tmux:demo \
--timeout 3600 --idle-timeout 120
# → stdout: registered job: <JID>
# subscriber pid: …
# agent launched in tmux session: demo
# subscriber output: <one line per event>
# /path/to/project/.hermes/delegate_job_logs/<JID> ← audit log dir
# 2) at any time, query the job or its audit log
tmux-agent-orchestrate-delegate-job status --job <JID>
tmux-agent-orchestrate-delegate-job logs <JID> # pretty timeline
tmux-agent-orchestrate-delegate-job logs --list # every job, live status
# 3) run a user-supplied validator against the job's artifacts
tmux-agent-orchestrate-delegate-job verify --job <JID> --validate ./validate.sh
```
The wrapper enforces the **subscribe-before-publish** ordering and **forwards
the freshly-minted `JOB_ID` into the agent's prompt** (so the agent calls
`publish_event.py --job <JID>` with the right id — see Pitfall §"Wrong job_id
propagated to the agent"). When you need finer control, the manual flow is:
```bash
# Manual 5-step (same outcome, more knobs)
PY=.venv/bin/python
SKILL=./skills/tmux-agent-orchestrate-delegate-job/scripts
# 1) register
JID=$($PY "$SKILL/registry.py" register \
--prompt "…" --agent claude-code --agent-session tmux:demo \
--timeout 3600 --idle-timeout 120)
# 2) START THE SUBSCRIBER FIRST (MQTT does not queue non-retained msgs)
$PY "$SKILL/job_subscriber.py" --job "$JID" --timeout 3600 --idle-timeout 120 &
# 3) pass JID to the agent and instruct it to publish events with --job "$JID"
# (don't hard-code a job id you saw earlier — see Pitfall §"Wrong job_id")
# 4) on completion the subscriber prints events and exits 0/1/2
# 5) inspect any time
$PY "$SKILL/registry.py" get --job "$JID"
$PY "$SKILL/registry.py" logs "$JID" # positional job id
$PY "$SKILL/registry.py" logs --list
```
## Job Protocol
One topic per job: `python/mqtt/jobs/<job_id>/events`. Payload (JSON, UTF-8,
`schema_version=1`):
```json
{ "schema_version": 1, "seq": 7, "job_id": "abc12345",
"event": "started|permission_required|progress|completed|error",
"timestamp": "2026-06-19T09:32:00Z", "detail": "generalised text",
"data": { "optional": "metadata" } }
```
- `seq` is monotonic per job (first = 1); the subscriber uses it to spot
reorder/duplication.
- `timestamp` is advisory — timeouts are measured from **receive** time.
- `detail`/`data` carry **no** secrets or absolute paths.
- A `schema_version` or `job_id` mismatch is **dropped** (defensive parsing).
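The defensive-parsing rules in the list above might look something like this on the subscriber side. It is a minimal sketch, not the code in `job_subscriber.py`: the function names are hypothetical, and dropping unknown `event` values or malformed `seq` values is an extra assumption beyond the `schema_version`/`job_id` check the text names.

```python
import json
from typing import Optional

KNOWN_EVENTS = {"started", "permission_required", "progress", "completed", "error"}


def parse_event(raw: bytes, expected_job_id: str) -> Optional[dict]:
    """Return the decoded event, or None if the message must be dropped."""
    try:
        msg = json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return None
    if not isinstance(msg, dict):
        return None
    # Stated in the protocol: schema_version or job_id mismatch is dropped.
    if msg.get("schema_version") != 1 or msg.get("job_id") != expected_job_id:
        return None
    # Assumption for this sketch: unknown event types and bad seq are dropped too.
    if msg.get("event") not in KNOWN_EVENTS:
        return None
    seq = msg.get("seq")
    if not isinstance(seq, int) or isinstance(seq, bool) or seq < 1:
        return None
    return msg


def classify_seq(last_seq: int, seq: int) -> str:
    """Compare an incoming seq against the last one seen for this job."""
    if seq == last_seq + 1:
        return "in_order"
    if seq <= last_seq:
        return "duplicate_or_reordered"
    return "gap"
```

A caller would keep `last_seq` per job (starting at 0) and decide what to do with the non-`in_order` cases; the protocol text only says the subscriber uses `seq` to spot them.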
`started` and `completed`/`error` are the mandatory bookends; `completed`→exit 0,
`error`→exit 1. Full catalogue + production `auth_token` handling:
[`job-protocol.md`](./job-protocol.md).
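To make the terminal-state rule concrete, here is a small sketch of a watcher that maps `completed`/`error` to exit codes and measures both timeouts from receive time. The class and constant names are hypothetical, and treating exit code 2 as "timeout" is an assumption: the text above only states that the subscriber exits 0/1/2 and that `completed` is 0 and `error` is 1.

```python
from dataclasses import dataclass
from typing import Optional

EXIT_COMPLETED = 0
EXIT_ERROR = 1
EXIT_TIMEOUT = 2  # assumption: the source only says the subscriber "exits 0/1/2"


@dataclass
class JobWatch:
    started_at: float           # local clock when the subscriber started
    timeout_sec: float          # overall budget for the job
    idle_timeout_sec: float     # max silence between received events
    last_received_at: Optional[float] = None
    exit_code: Optional[int] = None

    def on_event(self, event: str, now: float) -> None:
        """Record a validated event; `now` is the local receive time."""
        if self.exit_code is not None:
            return  # already terminal, late events are ignored
        self.last_received_at = now
        if event == "completed":
            self.exit_code = EXIT_COMPLETED
        elif event == "error":
            self.exit_code = EXIT_ERROR

    def on_tick(self, now: float) -> None:
        """Periodic check: the payload timestamp is never consulted."""
        if self.exit_code is not None:
            return
        idle_since = self.started_at if self.last_received_at is None else self.last_received_at
        if now - self.started_at >= self.timeout_sec or now - idle_since >= self.idle_timeout_sec:
            self.exit_code = EXIT_TIMEOUT
```

For example, with `--timeout 3600 --idle-timeout 120`, a job whose last event arrived at t=10 would be declared timed out at the first tick at or after t=130 under this sketch.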
## Registry Format
```
.hermes/jobs/<id>.json # metadata record (single source of truth)
.hermes/jobs/<id>.events.log # append-only JSON-lines log (debug, optional)
.hermes/jobs/.lock # fcntl advisory lock for the registry
```
The record holds `status`, `prompt`, `agent`, `agent_session`, a `broker` block,
`topic_prefix`, `timeout_sec`/`idle_timeout_sec`, `expected_artifacts`,
`last_seq`, and (production) `auth_token`. Because the `broker` block lives in
the record, `publish_event.py` connects from the registry alone. Concurrency,
the atomic rename trick, and multi-session job claiming are in
[`registry.md`](./registry.md).
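As a rough illustration of the lock-plus-atomic-rename pattern mentioned above, the sketch below updates a record under an exclusive advisory lock and swaps it in with `os.replace`. It is not the code in `registry.py`: `update_record` is a hypothetical helper, and using `fcntl.flock` (rather than `fcntl.lockf`) is an assumption. The `fcntl` module is Unix-only.

```python
import fcntl
import json
import os
import tempfile
from pathlib import Path


def update_record(jobs_dir: Path, job_id: str, **changes) -> dict:
    """Read-modify-write <job_id>.json while holding an exclusive lock on .lock."""
    jobs_dir.mkdir(parents=True, exist_ok=True)
    record_path = jobs_dir / f"{job_id}.json"
    with open(jobs_dir / ".lock", "a") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # released when the lock file is closed
        if record_path.exists():
            record = json.loads(record_path.read_text(encoding="utf-8"))
        else:
            record = {"job_id": job_id}
        record.update(changes)
        # Write to a temp file in the same directory, then rename over the
        # target, so readers never observe a half-written record.
        fd, tmp = tempfile.mkstemp(dir=jobs_dir, prefix=f".{job_id}.", suffix=".tmp")
        try:
            with os.fdopen(fd, "w", encoding="utf-8") as f:
                json.dump(record, f)
                f.flush()
                os.fsync(f.fileno())
            os.replace(tmp, record_path)
        except BaseException:
            if os.path.exists(tmp):
                os.unlink(tmp)
            raise
    return record
```

The temp file must live on the same filesystem as the record for the rename to be atomic, which is why it is created inside `jobs_dir` rather than in the system temp directory.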
## Audit Logs
Every job's lifecycle is mirrored to a **persistent, append-only audit log**
under `.hermes/delegate_job_logs/` (override with `DELEGATE_JOB_LOGS_DIR`;
default `<cwd>/.hermes/delegate_job_logs`). Unlike the registry — live state
mutated in place and liable to be cleaned up — the audit log is durable
history you can replay after the fact. It is git-ignored.
```
.hermes/delegate_job_logs/<job_id>/
meta.json # registration snapshot: prompt, agent, broker, timeouts, …
events.ndjson # append-only, one JSON event per line, in time order
status.json # current status only (fast point-query)
```
**What is logged, automatically:**
| When | `events.ndjson` line | Written by |
|------|----------------------|------------|
| job registered | `registered` (also seeds meta.json + status.json) | `registry.register_job` |
| any status change | `status_changed` (`from`/`to`; also rewrites status.json) | `update_job_status`, `pick_pending` |
| event published | `published` (carries the exact payload — reproducible) | `publish_event.py` |
| event received | `received` (subscriber's external view) | `job_subscriber.py` |
Both the emitter side (`published`) and the observer side (`received`) are
recorded, so a dropped publish or a missed receive is still visible from the
other. Every write is **best-effort and isolated** — an fcntl-locked append
guarded by `try/except` that only ever emits a `logger.warning`, so a logging
failure can never break a publish, a subscribe, or a registry write. stdout is
never touched.
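The "best-effort and isolated" write described above might look roughly like this. It is a sketch under assumptions: `append_audit_event` and the `kind` field used in the usage note are hypothetical names, not the actual helpers or schema in `registry.py`.

```python
import fcntl
import json
import logging
import os

logger = logging.getLogger("delegate_job.audit")


def append_audit_event(logs_root: str, job_id: str, event: dict) -> bool:
    """Append one JSON line to <logs_root>/<job_id>/events.ndjson.

    Never raises: any failure is reported through logger.warning only, so the
    caller's publish/subscribe/registry path is unaffected.
    """
    try:
        job_dir = os.path.join(logs_root, job_id)
        os.makedirs(job_dir, exist_ok=True)
        line = json.dumps(event, ensure_ascii=False) + "\n"
        with open(os.path.join(job_dir, "events.ndjson"), "a", encoding="utf-8") as fh:
            fcntl.flock(fh, fcntl.LOCK_EX)  # serialise concurrent appenders
            try:
                fh.write(line)
                fh.flush()
            finally:
                fcntl.flock(fh, fcntl.LOCK_UN)
        return True
    except Exception as exc:
        logger.warning("audit log write failed for %s: %s", job_id, exc)
        return False
```

For example, `append_audit_event(".hermes/delegate_job_logs", "abc12345", {"kind": "published"})` returns `False` on a read-only log directory instead of propagating the error.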
**Reading them:**
```bash
tmux-agent-orchestrate-delegate-job logs <job_id> # pretty-print one job's timeline
tmux-agent-orchestrate-delegate-job logs --list # summarise every logged job (with live status)
# or directly via the registry CLI:
$PY scripts/registry.py logs <job_id> [--tail N] [--json]
$PY scripts/registry.py logs --list [--json]
```
`submit` prints the job's audit-log directory as its last stdout line, so a
caller can `tail -n1` to locate it.
## Broker Setup
| Stage | Broker | Auth | Transport |
|-------|--------|------|-----------|
| PoC | `broker.hivemq.com` | none | 1883 plaintext |
| Production | self-hosted Mosquitto/EMQX | user/pass + ACL | 8883 TLS |
All connection settings come from env (`MQTT_BROKER`, `MQTT_PORT`, `MQTT_TLS`,
`MQTT_USERNAME`/`MQTT_PASSWORD`, `MQTT_CA_CERTS`, …) resolved by
`broker_config_from_env()`, with the registry `broker.*` block overriding per
job. Moving to your own broker is **config only**: install Mosquitto, set
`persistence true` + `acl_file` + `password_file` + a TLS `listener 8883`, grant
the worker `write python/mqtt/jobs/+/events` and Hermes `read`, then flip
`MQTT_TLS=1` and fill the registry `broker.*`. Step-by-step (conf, ACL,
`mosquitto_passwd`, self-signed/private-CA certs, cut-over verification):
[`mqtt-broker-setup.md`](./mqtt-broker-setup.md).
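To make the env-then-registry precedence concrete, here is a hedged sketch of the resolution step. It is not the real `broker_config_from_env()`: the function name, the returned key names (`host`, `port`, `tls`, ...), the accepted truthy values for `MQTT_TLS`, and the TLS-dependent default port are all assumptions for illustration. Only the environment variable names and the PoC defaults come from the text above.

```python
import os
from typing import Mapping, Optional


def resolve_broker_sketch(
    env: Optional[Mapping[str, str]] = None,
    record_broker: Optional[dict] = None,
) -> dict:
    """Build broker settings from env, then let the registry block override."""
    env = os.environ if env is None else env
    tls = env.get("MQTT_TLS", "0").lower() in ("1", "true", "yes")
    cfg = {
        "host": env.get("MQTT_BROKER", "broker.hivemq.com"),  # PoC default
        "port": int(env.get("MQTT_PORT", "8883" if tls else "1883")),
        "tls": tls,
        "username": env.get("MQTT_USERNAME"),
        "password": env.get("MQTT_PASSWORD"),
        "ca_certs": env.get("MQTT_CA_CERTS"),
    }
    # Per-job override: anything set in the record's broker block wins.
    for key, value in (record_broker or {}).items():
        if value is not None:
            cfg[key] = value
    return cfg
```

With no variables set this yields the public PoC broker on 1883; setting `MQTT_TLS=1` plus a registry `broker` block is enough to point one job at a private broker.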
## Agent Adapters
Each agent voluntarily follows the contract: receive a `JOB_ID` (or registry
path), call `publish_event.py` at lifecycle points, exit 0/1/2. **The contract
in one line**: every event call uses `--job "$JOB_ID"` where `$JOB_ID` is the
**freshly-issued id from the registry record for *this* delegation** — never a
job_id you saw in an earlier session (Pitfall §"Wrong job_id propagated to the
agent").
- **claude-code** — Claude Code calls `publish_event.py` via its Bash tool at
lifecycle points. `submit` injects a prompt that already names
`$JOB_ID`; if you drive claude manually, hand it the id explicitly. Reference
instruction block (the wrapper injects something equivalent):
```text
Your job_id is "$JOB_ID" (read it from the registry record for this delegation —
do not reuse any job_id you saw before).
On start: $PY tmux-agent-orchestrate-delegate-job/scripts/publish_event.py --job "$JOB_ID" --event started
On permission: $PY … --job "$JOB_ID" --event permission_required --detail "<tool>:<what>"
On progress: $PY … --job "$JOB_ID" --event progress --detail "<short status>"
On success: $PY … --job "$JOB_ID" --event completed --detail "<one-line summary>"
On failure: $PY … --job "$JOB_ID" --event error --detail "<one-line reason>"
Task: <the user's prompt>
The subscriber for "$JOB_ID" is already running; your completed/error event
ends the job. Exit codes: 0 completed, 1 error, 2 publish failure.
```
See [claude-code](../claude-code/SKILL.md) for tmux orchestration patterns.
- **codex** — same contract. Invoke `codex exec "<instruction-block-above>"` or
wire `publish_event.py` as an MCP tool so the agent can call it directly.
- **opencode** — wire `publish_event.py` as a tool/command the agent can call;
identical event points.
- **human** — a person does the work, reads the registry record, then runs
`publish_event.py --job <id> --event completed` (or `error`) by hand.
## User Interface
The [`tmux-agent-orchestrate-delegate-job`](./tmux-agent-orchestrate-delegate-job) bash wrapper bundles register +
subscribe-first + run-agent + validate:
```bash
tmux-agent-orchestrate-delegate-job submit --agent claude-code \
  --prompt "Create 10 sorting problems and save them to sort_problems.md" \
--workdir /path/to/project --timeout 3600 [--validate ./validate.sh]
tmux-agent-orchestrate-delegate-job status --job <id> # one record, pretty-printed
tmux-agent-orchestrate-delegate-job list # all jobs, one line each
tmux-agent-orchestrate-delegate-job verify --job <id> --validate ./validate.sh # runs it, reports exit code
tmux-agent-orchestrate-delegate-job wait [--job <id>] # block until terminal (else --wait-any)
```
`submit` **always starts the subscriber before the agent** (the ordering
dependency), launches the agent in a detached tmux session (the one-shot
`--mode print` path was removed), and calls `--validate` afterward if given.
The skill automates job-id generation, registry creation, broker resolution,
subscriber-first ordering, agent launch, and completion detection. It does
**not** automate the agent's internals or your business-logic validation;
those are hooks you fill (`validate.sh` reads `$JOB_ID`/`$REGISTRY_DIR`).
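One possible shape for the validation hook is a small Python helper that a `validate.sh` could call. The sketch assumes `expected_artifacts` in the record is a list of paths relative to the working directory; the source names the field but not its format, so treat that as an assumption. The function names are hypothetical.

```python
import json
import os
import sys
from pathlib import Path


def missing_artifacts(registry_dir: str, job_id: str, workdir: str = ".") -> list:
    """Return the expected artifacts that do not exist under workdir."""
    record_path = Path(registry_dir) / f"{job_id}.json"
    record = json.loads(record_path.read_text(encoding="utf-8"))
    # Assumption: a list of relative paths; a missing field means nothing to check.
    expected = record.get("expected_artifacts") or []
    return [rel for rel in expected if not (Path(workdir) / rel).is_file()]


def main(environ=os.environ) -> int:
    """Read JOB_ID / REGISTRY_DIR the way the wrapper exports them."""
    missing = missing_artifacts(environ.get("REGISTRY_DIR", ".hermes/jobs"), environ["JOB_ID"])
    for rel in missing:
        print(f"missing artifact: {rel}", file=sys.stderr)
    return 1 if missing else 0
```

A script entry point would end with `sys.exit(main())`, so the wrapper reports PASS on exit 0 and FAIL otherwise.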
## Common Pitfalls
- **Publishing before subscribing** — MQTT does not queue non-retained messages
for absent subscribers. Start `job_subscriber.py` *before* the agent, or rely
on retained terminal events (production). `submit` enforces this.
- **Wrong job_id propagated to the agent** — the wrapper prints a fresh `JOB_ID`
on every `submit`. If your agent instruction (or the wrapper's prompt template)
hard-codes an old job_id, the agent calls `publish_event.py --job <wrong>`,
the subscriber's defensive parser drops it as a `job_id` mismatch, and the
delegator waits until idle timeout (exit 2). Fix: instruct the agent to
**read the job_id from the registry record for *this* delegation** (or pass it
in via env / `--prompt` interpolation), never from prior runs. `submit`'s
default prompt template interpolates `$JOB_ID` for you — if you build a custom
prompt, do the same.
- **tmux session name collision** — `submit` derives the session
name from `--agent-session tmux:<name>` (default `tmux:claude`). If a session
with that name already exists (e.g. you ran the demo and the previous
session is still open), the wrapper refuses to launch the agent and exits
with an error rather than reusing it. Pick a unique `--agent-session` per
concurrent delegation (e.g. `tmux:demo`, `tmux:claude-a`, `tmux:claude-b`) or
kill the stale one (`tmux kill-session -t claude`) before re-running.
- **Timeout before `started`** — a cold-starting agent may not emit `started`
for a while; the wall-clock timeout starts at subscribe time so a stuck agent
still terminates. Don't set `--timeout` so low you false-positive a slow start.
- **No retry on publish** — a dropped `completed` would hang the delegator
forever; `publish_event.py` retries with exponential backoff and exits 2 if it
still fails, so the delegator is never left waiting silently.
- **QoS-1 duplicates / reorders** — a terminal event can arrive twice, or
`error` can trail `completed`; the subscriber's terminal state machine
finalises each job once and ignores the rest.
- **Trusting the public broker** — anyone can publish there; never make a real
decision on a PoC signal. Add `auth_token` + an authenticated broker first.
- **Secrets in `detail`/`data`** — keep payloads generalised; no paths, keys, or
tokens (except the production `auth_token` in `data`).
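The "finalise once, ignore the rest" behaviour from the QoS-1 pitfall can be sketched as a tiny tracker. This is an illustration of the idea only; the class name `TerminalTracker` is hypothetical and the subscriber's actual state machine may carry more state.

```python
from typing import Dict, Optional

# Terminal events and the exit code each one maps to (from the protocol above).
TERMINAL_EXIT = {"completed": 0, "error": 1}


class TerminalTracker:
    """Remembers the first terminal event per job and ignores later ones."""

    def __init__(self) -> None:
        self._final: Dict[str, str] = {}

    def observe(self, job_id: str, event: str) -> Optional[int]:
        """Return the exit code the first time a job goes terminal, else None."""
        if event not in TERMINAL_EXIT:
            return None  # started / progress / permission_required
        if job_id in self._final:
            return None  # duplicate delivery, or a late `error` after `completed`
        self._final[job_id] = event
        return TERMINAL_EXIT[event]

    def final_event(self, job_id: str) -> Optional[str]:
        return self._final.get(job_id)
```

Because the first terminal event wins, a redelivered `completed` or a trailing `error` cannot flip a job that has already been finalised.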
## Subagent Orchestration Pattern
When using this skill from a Hermes `delegate_task` subagent to dispatch work to
a coding-agent CLI (agy/claude) running in a tmux session, the following pattern
has been verified (2026-06-21, 6-batch refactoring sprint):
### Roles
- **Main worker** (implementation): one agent session (e.g. `agy-new`) receives
brief files and executes code changes.
- **Reviewers** (spec compliance + code quality): two other agent sessions
(e.g. `agy-existing`, `claude-existing`) review the diff in parallel.
- **Hermes** (orchestrator): dispatches subagents, verifies diffs, commits,
and falls back to direct fixes when reviewers find issues.
### Key lessons learned
1. **Brief delivery via file path** — don't paste long briefs inline via
`tmux send-keys`; the TUI may swallow them. Instead, send a short instruction
like "follow /tmp/batch1-brief.md" and let the agent read the file.
2. **Polling vs MQTT subscriber** — for short tasks (<5min), pane polling
(`capture-pane` + grep for completion markers) is simpler and more reliable
than registering a job via `registry.py` + `job_subscriber.py`. Use MQTT
subscriber only for long-running jobs (>5min) where push notification matters.
3. **Reviewers catch different bugs** — in practice, agy (Flash) caught
semantic issues (slash matching, export scope), while claude (Opus) caught
API signature mismatches (paho v2 5-arg vs 4-arg `on_disconnect`). Two
reviewers with different models provide complementary coverage.
4. **Hermes fallback fix** — when reviewers find a small, well-defined issue
(wrong argument count, missing slash), Hermes should fix it directly rather
than re-dispatching the implementer. This saves a full round-trip.
5. **Batch grouping** — group 2-3 FW items per batch when they touch different
files (no file overlap). This amortises the dispatch overhead. Items touching
the same file must be in separate batches to avoid conflicts.
6. **Pane Snapshots & Truncation Prevention** — long agent responses scroll out of the TUI viewport and get truncated, so enforce the following snapshotting pattern:
   - Immediately after dispatching a brief, capture a baseline of the pane buffer, including the scrollback that precedes the brief, via `capture-pane -S -200`.
   - During long execution, run a background loop taking incremental snapshots (e.g. every 30 seconds `>> /tmp/pane-snap.txt`).
   - Immediately after job termination, capture the entire final pane state to ensure no terminal logs are lost.
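The snapshot pattern in lesson 6 could be automated along these lines. It is a sketch, not part of the skill: `capture_pane` and `snapshot_loop` are hypothetical helpers, and the `is_done` callback stands in for whatever completion check you use (registry status, pane marker, subscriber exit).

```python
import subprocess
import time
from typing import Callable


def capture_pane(session: str, history_lines: int = 200) -> str:
    """Return the pane text plus up to `history_lines` of scrollback."""
    result = subprocess.run(
        ["tmux", "capture-pane", "-p", "-t", session, "-S", f"-{history_lines}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def snapshot_loop(
    session: str,
    out_path: str,
    is_done: Callable[[], bool],
    interval: float = 30.0,
    capture: Callable[[str], str] = capture_pane,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Append snapshots until is_done() is true, then one final snapshot.

    Returns the number of snapshots written.
    """
    count = 0
    with open(out_path, "a", encoding="utf-8") as fh:
        while not is_done():
            fh.write(f"--- snapshot {count} ---\n{capture(session)}\n")
            fh.flush()
            count += 1
            sleep(interval)
        # Final capture after termination so the tail of the output is kept.
        fh.write(f"--- snapshot {count} ---\n{capture(session)}\n")
        count += 1
    return count
```

The first loop iteration doubles as the baseline capture, and injecting `capture` and `sleep` keeps the loop testable without a running tmux server.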
## Verification Checklist
- [ ] `started` → `completed` over the public broker: subscriber prints the
lines and exits **0**.
- [ ] `error` path: subscriber exits **1**.
- [ ] timeout path: no terminal event within `--timeout`/`--idle-timeout` →
exit **2**.
- [ ] polluted payload (bad JSON, wrong `schema_version`, wrong `job_id`) is
dropped with a warning, not crashed on.
- [ ] one tmux session processes two registry jobs in sequence; a second
session with a different `agent_session` claims only its own.
- [ ] broker cut-over: same scripts reach an authenticated TLS broker with env
changes only; a credential without write ACL is rejected; a late
subscriber still receives the retained terminal event.
- [ ] `publisher.py`/`subscriber.py`/`README.md` demo on `python/mqtt/sample`
still works unchanged (regression).
- [ ] **audit log integrity** — for a completed job,
`.hermes/delegate_job_logs/<JID>/events.ndjson` contains `registered` →
`received started` → `published completed` (in that order), and
`status.json.status == "completed"` matches the registry record. A
logging failure (e.g. read-only log dir) does not break the publish or
subscribe path — only a `logger.warning` is emitted.
- [ ] **end-to-end demo smoke** — run
`tmux-agent-orchestrate-delegate-job submit --agent claude-code --agent-session tmux:demo-smoke
--prompt "echo hello and call publish_event.py --job <JID>
--event completed" --timeout 120` and confirm
(a) registered job id echoed, (b) subscriber pid echoed, (c) tmux session
name printed, (d) `events.ndjson` grows as the agent runs, (e) final
stdout line is the audit-log dir.
@@ -1,280 +0,0 @@
#!/usr/bin/env bash
# tmux-agent-orchestrate-delegate-job — user-facing orchestrator for the tmux-agent-orchestrate-delegate-job skill.
#
# Subcommands:
# submit register a job, start the subscriber FIRST, then run the agent,
# then (optionally) run a validation script.
# status show one job record.
# list list all jobs.
# verify run a user-supplied --validate script against a job's artifacts.
# wait block until all running/pending jobs reach a terminal state.
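#   logs     show the persistent audit log for one job, or --list to summarise all.
#
# This is a reference wrapper: it shells out to the python scripts that live
# next to it. Copy it into your project and customise as needed. If `tmux` or
# the agent binary is missing, `submit` exits with an error; pass --dry-run to
# print the instructions it would send without launching the agent.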
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# Pick an interpreter: prefer a project .venv, else python3.
pick_python() {
local py_bin
if [[ -n "${DELEGATE_JOB_PYTHON:-}" ]]; then
py_bin="$DELEGATE_JOB_PYTHON"
elif [[ -x "${WORKDIR:-.}/.venv/bin/python" ]]; then
py_bin="${WORKDIR}/.venv/bin/python"
elif [[ -x ".venv/bin/python" ]]; then
py_bin="$(pwd)/.venv/bin/python"
else
py_bin="python3"
fi
if ! "$py_bin" -c "import paho.mqtt" 2>/dev/null; then
echo "ERROR: paho-mqtt package is missing for $py_bin." >&2
echo " Please create a virtual environment and install it:" >&2
echo " python3 -m venv .venv && .venv/bin/pip install -r \"$SCRIPT_DIR/requirements.txt\"" >&2
exit 1
fi
echo "$py_bin"
}
REGISTRY_DIR_DEFAULT=".hermes/jobs"
usage() {
cat <<'EOF'
tmux-agent-orchestrate-delegate-job <command> [options]
submit --agent <name> --prompt <text> [--workdir <dir>] [--agent-session <label>]
[--timeout <sec>] [--idle-timeout <sec>] [--validate <script>]
[--registry-dir <dir>] [--dry-run]
# The skill is tmux-interactive only; --mode print was removed.
status --job <id> [--registry-dir <dir>]
list [--registry-dir <dir>]
verify --job <id> --validate <script> [--registry-dir <dir>]
wait [--job <id>] [--timeout <sec>] [--registry-dir <dir>]
logs <job_id> | --list # persistent audit log (delegate_job_logs/)
EOF
}
# ---- arg parsing helpers --------------------------------------------------
AGENT="claude-code"; PROMPT=""; WORKDIR="$(pwd)"; AGENT_SESSION="tmux:claude"
TIMEOUT=3600; IDLE_TIMEOUT=120; VALIDATE=""; DRY_RUN=0
JOB_ID=""; REGISTRY_DIR="$REGISTRY_DIR_DEFAULT"
parse_opts() {
while [[ $# -gt 0 ]]; do
case "$1" in
--agent) AGENT="$2"; shift 2;;
--prompt) PROMPT="$2"; shift 2;;
--workdir) WORKDIR="$2"; shift 2;;
--agent-session) AGENT_SESSION="$2"; shift 2;;
--timeout) TIMEOUT="$2"; shift 2;;
--idle-timeout) IDLE_TIMEOUT="$2"; shift 2;;
--validate) VALIDATE="$2"; shift 2;;
--job) JOB_ID="$2"; shift 2;;
--registry-dir) REGISTRY_DIR="$2"; shift 2;;
--dry-run) DRY_RUN=1; shift;;
*) echo "unknown option: $1" >&2; usage; exit 1;;
esac
done
}
cmd_submit() {
parse_opts "$@"
[[ -n "$PROMPT" ]] || { echo "submit requires --prompt" >&2; exit 1; }
PY="$(pick_python)"
cd "$WORKDIR"
mkdir -p "$REGISTRY_DIR"
# 1) register job (prints the new job id)
JOB_ID="$("$PY" "$SCRIPT_DIR/scripts/registry.py" --registry-dir "$REGISTRY_DIR" register \
--prompt "$PROMPT" --agent "$AGENT" --agent-session "$AGENT_SESSION" \
--timeout "$TIMEOUT" --idle-timeout "$IDLE_TIMEOUT")"
echo "registered job: $JOB_ID"
# 2) START THE SUBSCRIBER FIRST (ordering dependency — MQTT does not queue
# non-retained messages for absent subscribers).
local logf="$REGISTRY_DIR/$JOB_ID.subscriber.out"
"$PY" "$SCRIPT_DIR/scripts/job_subscriber.py" --registry-dir "$REGISTRY_DIR" \
--job "$JOB_ID" --timeout "$TIMEOUT" --idle-timeout "$IDLE_TIMEOUT" \
>"$logf" 2>&1 &
local sub_pid=$!
echo "subscriber pid: $sub_pid (log: $logf)"
sleep 1 # give the subscriber time to CONNACK + SUBSCRIBE before the agent runs
# 3) run the agent (or print the command for dry-run / missing binary)
local pub="$PY $SCRIPT_DIR/scripts/publish_event.py --registry-dir $REGISTRY_DIR --job $JOB_ID"
# NOTE: the agent MUST use --job "$JOB_ID" (the one we just minted). Hard-coding
# an id from an earlier session is the #1 reason a delegated job sits idle and
# times out (see SKILL.md "Wrong job_id propagated to the agent"). We make the
# freshness explicit in the instruction header.
local instructions="Your job_id is \"$JOB_ID\" (the one just registered for THIS delegation — read it from the registry record, do NOT reuse any job_id you saw in earlier runs).
On start run: $pub --event started.
On permission/tool prompt run: $pub --event permission_required --detail '<tool>:<what>'.
On progress (optional): $pub --event progress --detail '<short status>'.
On success run: $pub --event completed --detail '<one-line summary>'.
On failure run: $pub --event error --detail '<one-line reason>'.
The subscriber for this job_id is already running; your completed/error event ends the job. Exit codes: 0 completed, 1 error, 2 publish failure.
Task: $PROMPT"
run_agent "$JOB_ID" "$instructions"
  if [[ "$DRY_RUN" == "1" ]]; then
    # Dry-run: run_agent only printed the instructions, so no terminal event
    # will ever arrive. Stop the subscriber started in step 2 rather than
    # leaving it running until --timeout, and skip validation and the
    # subscriber log dump.
    kill "$sub_pid" 2>/dev/null || true
    wait "$sub_pid" 2>/dev/null || true
    local logs_root_dry="${DELEGATE_JOB_LOGS_DIR:-$WORKDIR/delegate_job_logs}"
    echo "$logs_root_dry/$JOB_ID"
    return 0
  fi
  # 4) block until the subscriber sees a terminal event or times out. The
  #    agent runs detached in tmux, so this wait is what signals completion.
  wait "$sub_pid" || true
  echo "subscriber output:"; cat "$logf" || true
  # 5) optional validation hook, run only after the job has finished
  if [[ -n "$VALIDATE" ]]; then
    echo "running validation: $VALIDATE"
    if JOB_ID="$JOB_ID" REGISTRY_DIR="$REGISTRY_DIR" bash "$VALIDATE"; then
      echo "validation: PASS"
    else
      local rc=$?
      echo "validation: FAIL (exit $rc)"
    fi
  fi
  # Last stdout line: the persistent audit-log dir for this job (see SKILL.md
  # "Audit Logs"). Callers can scrape `tail -n1` to find it.
  local logs_root="${DELEGATE_JOB_LOGS_DIR:-$WORKDIR/delegate_job_logs}"
  echo "$logs_root/$JOB_ID"
}
run_agent() {
local job_id="$1"; local instructions="$2"
# The skill is INTERACTIVE-ONLY. We never invoke `claude -p` or any other
# one-shot print mode, because:
# - claude -p exits the moment stdin is drained, so there's nothing to
# `tmux attach` to afterwards.
# - fire-and-forget via wrapper defeats the whole point of the audit log
# (you can't tell what happened if the agent crashes mid-turn).
# - the job registry already gives us an authoritative completion signal,
# so we don't need a wrapper-side exit code to know "done".
# The user attaches with `tmux attach -t <session>` and types follow-up
# prompts themselves. We pre-load the first prompt via stdin and `read`
# keeps the pane open after the agent exits so the user can review.
case "$AGENT" in
claude-code) bin="claude";;
codex) bin="codex";;
human) echo "[human agent] complete the task, then run publish_event.py --event completed"; return;;
*) bin="$AGENT";;
esac
if [[ "$DRY_RUN" == "1" ]]; then
echo "[dry-run] would launch agent '$AGENT' in a fresh tmux session with instructions:"
echo "----"; echo "$instructions"; echo "----"
return
fi
if ! command -v tmux >/dev/null 2>&1; then
echo "ERROR: this skill requires tmux (interactive agent sessions)." >&2
echo " Install with: brew install tmux (or your package manager)" >&2
return 1
fi
if ! command -v "$bin" >/dev/null 2>&1; then
echo "ERROR: agent binary '$bin' not found in PATH." >&2
return 1
fi
local sess="${AGENT_SESSION#tmux:}"
# Detect a stale session with the same name (e.g. the user is still attached
# from an earlier run, or a previous wrapper died without cleanup). tmux
# new-session on an existing name fails silently; check first and fail loud.
if tmux has-session -t "$sess" 2>/dev/null; then
local attached
attached=$(tmux list-clients -t "$sess" 2>/dev/null | wc -l | tr -d ' ')
echo "ERROR: tmux session '$sess' already exists (clients attached: $attached)." >&2
echo " Pick a unique --agent-session (e.g. tmux:demo, tmux:claude-a) or" >&2
echo " kill the stale one first: tmux kill-session -t $sess" >&2
return 1
fi
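  # Before launching the agent, set an EXIT trap so a failed launch publishes
  # an error event instead of leaving the subscriber waiting.
  if [ -n "${job_id:-}" ] && [ -n "${PY:-}" ]; then
    local pub_script="$SCRIPT_DIR/scripts/publish_event.py"
    trap 'rc=$?; if [ $rc -ne 0 ]; then "$PY" "$pub_script" --registry-dir "$REGISTRY_DIR" --job "$job_id" --event error --detail "agent bootstrap failed (exit $rc)"; fi' EXIT
  fi
  # Feed the first prompt from a file rather than splicing it into the tmux
  # command string: the instructions contain double and single quotes, which
  # would otherwise end the quoted argument early and break the command.
  local instr_file="$REGISTRY_DIR/$job_id.instructions.txt"
  printf '%s' "$instructions" > "$instr_file"
  tmux new-session -d -s "$sess" -c "$WORKDIR" \
    "$bin --dangerously-skip-permissions < $(printf '%q' "$instr_file"); echo; echo '--- agent exited (job $job_id); press enter to close ---'; read -r _"
  echo "agent launched in tmux session: $sess (attach with: tmux attach -t $sess)"
  trap - EXIT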
}
cmd_status() {
parse_opts "$@"
[[ -n "$JOB_ID" ]] || { echo "status requires --job" >&2; exit 1; }
PY="$(pick_python)"
"$PY" "$SCRIPT_DIR/scripts/registry.py" --registry-dir "$REGISTRY_DIR" get --job "$JOB_ID"
}
cmd_list() {
parse_opts "$@"
PY="$(pick_python)"
"$PY" "$SCRIPT_DIR/scripts/registry.py" --registry-dir "$REGISTRY_DIR" list
}
cmd_verify() {
parse_opts "$@"
[[ -n "$JOB_ID" ]] || { echo "verify requires --job" >&2; exit 1; }
[[ -n "$VALIDATE" ]] || { echo "verify requires --validate <script>" >&2; exit 1; }
echo "verifying job $JOB_ID with $VALIDATE"
if JOB_ID="$JOB_ID" REGISTRY_DIR="$REGISTRY_DIR" bash "$VALIDATE"; then
echo "verify: PASS (exit 0)"; exit 0
else
rc=$?; echo "verify: FAIL (exit $rc)"; exit "$rc"
fi
}
cmd_logs() {
# logs <job_id> | logs --list — delegates to registry.py's logs CLI, which
# reads the persistent audit log under $DELEGATE_JOB_LOGS_DIR (or
# <cwd>/delegate_job_logs). Run from your project dir so the default resolves.
PY="$(pick_python)"
if [[ "${1:-}" == "--list" ]]; then
"$PY" "$SCRIPT_DIR/scripts/registry.py" logs --list
else
local jid="${1:-}"
[[ -n "$jid" ]] || { echo "logs requires <job_id> or --list" >&2; exit 1; }
"$PY" "$SCRIPT_DIR/scripts/registry.py" logs "$jid"
fi
}
cmd_wait() {
parse_opts "$@"
PY="$(pick_python)"
if [[ -n "$JOB_ID" ]]; then
"$PY" "$SCRIPT_DIR/scripts/job_subscriber.py" --registry-dir "$REGISTRY_DIR" \
--job "$JOB_ID" --timeout "$TIMEOUT"
else
"$PY" "$SCRIPT_DIR/scripts/job_subscriber.py" --registry-dir "$REGISTRY_DIR" \
--wait-any --timeout "$TIMEOUT"
fi
}
main() {
local sub="${1:-}"; shift || true
case "$sub" in
submit) cmd_submit "$@";;
status) cmd_status "$@";;
list) cmd_list "$@";;
verify) cmd_verify "$@";;
wait) cmd_wait "$@";;
logs) cmd_logs "$@";;
""|-h|--help|help) usage;;
*) echo "unknown command: $sub" >&2; usage; exit 1;;
esac
}
main "$@"
@@ -1,65 +0,0 @@
#!/usr/bin/env bash
# watchdog.sh — companion script for tmux-agent-orchestrate-monitor
#
# Metadata for SKILL.md:
# description: "Watchdog helper that keeps subscriber alive and exits when JOB is done"
# usage: "watchdog.sh <job_id> <workdir> [--help]"
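if [ "${1:-}" = "--help" ] || [ "${1:-}" = "-h" ]; then
  echo "Usage: $0 <job_id> <workdir>"
  exit 0
fi
# Missing arguments are a usage error, not a successful help request.
if [ $# -lt 2 ]; then
  echo "Usage: $0 <job_id> <workdir>" >&2
  exit 1
fi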
JOB_ID="$1"
WORKDIR="$2"
LOG_DIR="$WORKDIR/.hermes/jobs"
mkdir -p "$LOG_DIR"
log() {
echo "[$(date -u +'%Y-%m-%dT%H:%M:%SZ')] $*"
}
log "watchdog started for JOB=$JOB_ID workdir=$WORKDIR"
while true; do
# 1) Get current job status with robust Python parsing
  STATUS=$(cd "$WORKDIR" && .venv/bin/python skills/tmux-agent-orchestrate-delegate-job/scripts/registry.py get --job "$JOB_ID" 2>/dev/null | python3 -c '
import sys, json
try:
    data = json.load(sys.stdin)
    print(data.get("status", "unknown"))
except Exception:
    print("unknown")
' 2>/dev/null || echo "unknown")
log "JOB status: $STATUS"
# 2) Terminal check
case "$STATUS" in
completed|error|permission_required)
log "JOB reached terminal state ($STATUS), watchdog exiting"
exit 0
;;
esac
# 3) Start subscriber (2min hard limit)
LOG_FILE="$LOG_DIR/subscriber-${JOB_ID}-$(date +%s).log"
log "starting subscriber (2min hard limit, log: $LOG_FILE)"
  (
    cd "$WORKDIR" && timeout 120 .venv/bin/python skills/tmux-agent-orchestrate-delegate-job/scripts/job_subscriber.py \
      --job "$JOB_ID" --timeout 120 --idle-timeout 999999 --registry-dir .hermes/jobs > "$LOG_FILE" 2>&1
    rc=$?  # keep the subscriber's status; the echo below would otherwise mask it
    echo "[$(date -u +'%Y-%m-%dT%H:%M:%SZ')] subscriber exited (code $rc)" >> "$LOG_FILE"
    exit "$rc"
  ) &
SUB_PID=$!
log "subscriber PID=$SUB_PID"
# 4) Wait for subscriber to exit or timeout
wait $SUB_PID 2>/dev/null
EXIT_CODE=$?
log "subscriber exited code=$EXIT_CODE"
sleep 1
done