63 Commits

Author SHA1 Message Date
Godopu 5d69ad4f0b Harden remove.sh: fix fallback data-loss risk, prevent remove.sh clobbering, and ensure macOS compatibility 2026-06-24 12:08:00 +09:00
Godopu db75b7deb0 Add MAM uninstaller script (remove.sh) and integrate into installer copy block 2026-06-24 11:49:45 +09:00
Godopu b37407874d Harden installer: partial-install detection, complete runtime docs, explicit copy checks 2026-06-24 10:43:08 +09:00
Godopu 387b43d8e3 fix(deploy): stage installer download and copy runtime assets no-clobber (FW-D1)
deploy/install.sh extracted the repo archive in-place with
`tar --strip-components=1`, which inside an existing project could silently
overwrite the target's own README.md/FUTURE_WORKS.md/etc and litter it with
this repo's dev docs.

Rebuild the fetch path:
- stage the clone/extract into a `mktemp -d` dir, never in-place
- verify `.agents/skills/lib.sh` is present before copying anything
- copy only runtime assets (.agents/, AGENT.md, .env.example) into the target
  with per-file no-clobber guards (`[ ! -e ]`), so existing files always win
- post-fetch sanity check now tests a file, not just the directory
- fail fast when neither git nor curl is available

Use explicit `[ ! -e ]` guards + a POSIX find merge rather than `cp -n`
(non-portable; emits a deprecation warning on GNU coreutils 9.x). The earlier
`tar --exclude` denylist idea was rejected in review: non-portable and the
unanchored `--exclude="scripts"` pattern stripped the skills' own nested
scripts/ dirs, yielding a silently broken install.

Mark FW-D1 resolved and FW-D2 partially addressed in FUTURE_WORKS.md/.ko.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 10:33:05 +09:00
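
The no-clobber rule described in 387b43d8e3 (existing files always win) can be sketched as follows. The installer itself is a shell script using `[ ! -e ]` guards; this is a Python illustration of the same rule under assumptions, not the installer's code. The function name `copy_no_clobber` and the directory layout in the usage note are hypothetical.

```python
import os
import shutil
from pathlib import Path


def copy_no_clobber(src_root, dst_root):
    """Merge src_root into dst_root, never overwriting anything that exists.

    Hypothetical helper: returns the sorted relative paths actually copied.
    """
    src_root, dst_root = Path(src_root), Path(dst_root)
    copied = []
    for dirpath, dirnames, filenames in os.walk(src_root):
        rel_dir = Path(dirpath).relative_to(src_root)
        target_dir = dst_root / rel_dir
        if target_dir.exists() and not target_dir.is_dir():
            # An existing non-directory in the target wins: skip the subtree.
            dirnames[:] = []
            continue
        target_dir.mkdir(parents=True, exist_ok=True)
        for name in filenames:
            dst = target_dir / name
            if not dst.exists():  # same idea as the shell guard `[ ! -e "$dst" ]`
                shutil.copy2(Path(dirpath) / name, dst)
                copied.append((rel_dir / name).as_posix())
    return sorted(copied)
```

Called as `copy_no_clobber(staging_dir, project_dir)` after the download has been staged in a temporary directory, it would leave a pre-existing `AGENT.md` in the project untouched and copy only the files that are missing.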
Godopu 7eaaaf8944 fix(lib,install): update locking doc to SQLite transaction, cache NFS check, verify PyYAML 2026-06-23 23:41:18 +09:00
Godopu 25cf729040 fix(deploy): add directory creation guard and sanity check after download in installer 2026-06-23 23:22:45 +09:00
Godopu 1d2eca57ce fix(deploy): automatically download .agents/skills from Gitea if missing in installer 2026-06-23 23:16:00 +09:00
Godopu 82dcb78a85 fix(deploy): refine install.sh variables, pip upgrade, permissions, and active defaults based on review 2026-06-23 08:32:02 +09:00
Godopu a10c789dc2 refactor(wrapper): update dry-run message to align with existing session delegation 2026-06-23 08:26:13 +09:00
Godopu 6cd0d5333a feat(deploy): add Gitea deployment templates, installer, and CI/CD workflow 2026-06-23 08:22:41 +09:00
Godopu 99ac8b3ce4 refactor(security,concurrency): resolve structural issues, enforce Claude permission skip, update docs 2026-06-23 08:03:43 +09:00
Godopu 12dceb14b2 feat(monitor): consolidate per-job watchdogs into shared wildcard subscriber (FW-W3) 2026-06-23 00:37:39 +09:00
Godopu 31f18b2e5a docs: update FUTURE_WORKS.md and FUTURE_WORKS.ko.md with portability and workflow bottleneck roadmap 2026-06-22 16:28:31 +09:00
Godopu c721d1cd86 refactor: rename skills from tmux-agent-orchestrate-* to multi-agent-mux-* in backplane scripts and documents 2026-06-22 15:58:48 +09:00
Godopu ee48d77d0a docs: add README.md and README.ko.md introducing the orchestration skills and backplane architecture 2026-06-22 14:19:15 +09:00
Godopu 9735258bc5 refactor: rename metadata directory .hermes to .mam in backplane scripts and documents 2026-06-22 14:06:13 +09:00
Godopu 30e447189e refactor: migrate skills/ directory to .agents/skills/ 2026-06-21 14:42:12 +00:00
Godopu e1d998e1ef feat: add support for hermes agent in tmux orchestration scripts 2026-06-21 14:21:30 +00:00
Godopu aacea05f6a docs: replace CLAUDE.md content with README.md title 2026-06-21 13:11:05 +00:00
Godopu a414158864 docs: rename project reference from advanced_multi_agent to tmux_agent_orchestration 2026-06-21 11:32:57 +00:00
Godopu d21d419a7c docs: update commit references for FW-N5, FW-N6, and FW-N7 2026-06-21 10:45:45 +00:00
Godopu 450722b3df docs(security): correct example metadata in job-protocol.md to refer to hmac_sig 2026-06-21 10:45:26 +00:00
Godopu 6a88f10a74 feat(security): implement FW-N5, FW-N6, FW-N7 (HMAC-SHA256 protocol docs, auto-generate token, replay attack defense) 2026-06-21 10:39:49 +00:00
Godopu 8a4067ca91 docs: internationalize top-level documentation files to English and backup Korean originals to *.ko.md 2026-06-21 10:35:01 +00:00
Godopu 738e4dc8d1 docs: add BOOTSTRAP.md project setup and onboarding guide 2026-06-21 10:27:33 +00:00
Godopu 0017ef572d feat: add interactive Understand-Anything project analysis dashboard 2026-06-21 10:10:55 +00:00
Godopu ec92b6c3fa docs: commit analysis report and instruction documents 2026-06-21 10:07:38 +00:00
Godopu 5d555dbf6d docs: rename Messaging_System_REPORT.md to MESSAGING.md and update references 2026-06-21 10:06:53 +00:00
Godopu 9d9b91dc69 docs: add new recommendations to FUTURE_WORKS.md 2026-06-21 10:03:59 +00:00
Godopu 7a4453a2fc docs: add portability guide to AGENT.md checklist 2026-06-21 10:00:21 +00:00
Godopu 86ca4e2713 docs: add AGENT.md guidelines and update README.md entry point 2026-06-21 09:35:21 +00:00
Godopu 8947bebb10 docs: update DONE.md and FUTURE_WORKS.md to reflect completed tasks 2026-06-21 09:20:01 +00:00
Godopu 5258b5013c feat(lib): implement FW-N1~FW-N4 items and pane snapshot guidelines 2026-06-21 09:19:46 +00:00
Godopu 8097df0cbe feat(lib): SQLite DB normalization (FW-L3) & stop semantics simplification (FW-L2) 2026-06-21 09:05:52 +00:00
Godopu 478be56679 fix(lib): hardening and edge-case bugfixes (FW-12, FW-16 round)
- Restored .bak generation to maintain P0-B backup invariants
- Fixed stale NFS warning message to reflect SQLite DELETE fallback
- Replaced vulnerable yaml.replace with os.path.splitext globally
- Ensured YAML dump occurs after conn.commit() to prevent partial syncs
- Re-applied chmod 0600 on SQLite -wal and -shm files
2026-06-21 08:43:06 +00:00
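
One plausible reading of the `yaml.replace` fix in 478be56679 is that sibling file names were derived with `str.replace(".yaml", ...)`, which rewrites every occurrence in the path rather than only the final extension. That reading is an assumption; the commit does not show the code. The helper name `sibling_path` and the example paths are hypothetical.

```python
import os


def sibling_path(yaml_path, new_ext):
    """Swap only the final extension of yaml_path for new_ext."""
    root, _old_ext = os.path.splitext(yaml_path)
    return root + new_ext


path = "/srv/proj.yaml.d/agent-sessions.yaml"

# str.replace touches the directory component as well.
naive = path.replace(".yaml", ".db")

# os.path.splitext only splits off the last extension of the last component.
safe = sibling_path(path, ".db")
```

Here `naive` points into a different directory (`/srv/proj.db.d/`), while `safe` stays next to the original file.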
Godopu 9b797a5c8c feat(lib): migrate to SQLite WAL backend for robust concurrency (FW-L1)
- Replaces python fcntl.flock with SQLite BEGIN IMMEDIATE.
- Status/Reconcile read from SQLite SSOT, with YAML fallback.
- Explicitly documented tradeoff: YAML is no longer a real-time view.
- Handles PRAGMA wal_checkpoint(TRUNCATE) safely outside transactions.
2026-06-21 08:35:07 +00:00
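
A minimal sketch of the locking pattern named in 9b797a5c8c: writes go through `BEGIN IMMEDIATE` in WAL mode, and `PRAGMA wal_checkpoint(TRUNCATE)` runs only when no transaction is open. The table name `sessions`, its columns, and the function names are assumptions made for illustration; the real schema is not shown in the commit.

```python
import sqlite3


def open_db(path):
    # isolation_level=None puts the connection in autocommit mode, so the
    # only transactions are the ones opened explicitly below.
    conn = sqlite3.connect(path, timeout=5.0, isolation_level=None)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sessions ("
        "name TEXT PRIMARY KEY, status TEXT NOT NULL)"
    )
    return conn


def set_status(conn, name, status):
    # BEGIN IMMEDIATE takes the write lock up front, so a concurrent writer
    # waits (up to `timeout`) instead of failing in the middle of the update.
    conn.execute("BEGIN IMMEDIATE")
    try:
        conn.execute(
            "INSERT OR REPLACE INTO sessions(name, status) VALUES (?, ?)",
            (name, status),
        )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise


def checkpoint(conn):
    # Issued outside any transaction; returns (busy, log_frames, checkpointed).
    return conn.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchone()
```

Readers can keep querying while a writer holds the lock, which is the property that replaces the earlier `fcntl.flock` approach.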
Godopu 623eef814b docs: split FUTURE_WORKS.md -> DONE.md (FW-01~16 completed) + new items (FW-N1~N4, FW-L1~L2)
DONE.md: 16/16 items completed, 11 commits, 3-agent verified.
FUTURE_WORKS.md: rewritten with only remaining items:
  - FW-L1: SQLite WAL migration (FW-02 long-term)
  - FW-L2: stop option semantics Step 2 (FW-03/13 follow-up)
  - FW-N1: reconcile.sh idle timeout vs job timeout mismatch (new)
  - FW-N2: wire format compat (HMAC rollout) (new)
  - FW-N3: log message 'auth_token mismatch' -> 'HMAC verify failed' (new)
  - FW-N4: REPORT.md section 2.4 plaintext auth_token -> HMAC (new)
2026-06-21 07:15:53 +00:00
Godopu 9ee9076d60 docs(delegate-job): add Subagent Orchestration Pattern section to SKILL.md
Verified pattern from 2026-06-21 6-batch refactoring sprint:
- Main worker (agy-new) + 2 reviewers (agy-existing, claude-existing) + Hermes orchestrator
- Brief delivery via file path (not inline tmux send-keys)
- Polling for short tasks, MQTT subscriber for long tasks
- Complementary reviewer coverage (different models catch different bugs)
- Hermes fallback fix for small well-defined issues
- Batch grouping rules (no file overlap)
2026-06-21 06:41:25 +00:00
Godopu f1a98be8de fix(lib.sh): add NFS flock warning (FW-02) + unify venv deps with pyyaml (FW-11)
FW-02: atomic_dump_yaml now calls _atomic_dump_yaml_check_nfs() which
  detects NFS/CIFS/SSHFS mounts and warns that flock is unreliable.
  Long-term fix (SQLite WAL) documented in FUTURE_WORKS.md.

FW-11: pyyaml added to requirements.txt and installed in .venv, so
  both paho-mqtt and yaml are available in a single interpreter.
  Eliminates the system-python3-vs-venv split for monitor --subscribe.
2026-06-21 06:39:12 +00:00
Godopu 7d925de00d fix(monitor): add status enum docs + subscribe security warning (FW-09, FW-15)
FW-09: SKILL.md defines valid last_visible_status values (running/stopped/
  terminated/archived). reconcile.sh now sets last_visible_status to
  'running' and uses last_visible_note for free-form comments.

FW-15: SKILL.md adds Security section for --subscribe on public brokers.
  Documents wildcard subscription risks, auto-kill spoofing, HMAC
  verification mitigation, and recommends --once/polling for PoC.
2026-06-21 06:37:28 +00:00
Godopu 2cffcc46c5 fix(delegate-job): unify .env loading in Python scripts (FW-04) + trap agent bootstrap errors (FW-06)
FW-04: mqtt_common.py now loads .env at module import via _load_dotenv().
  Walks up from script dir to find workspace .env, sets vars not already
  in os.environ (OS env takes precedence). Uses stdlib only — no
  python-dotenv dependency.

FW-06: bash wrapper sets trap EXIT before tmux new-session to publish
  error event if agent bootstrap fails (non-zero exit). Trap is cleared
  after successful session creation. Only active when job_id is set.
2026-06-21 06:35:17 +00:00
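
The FW-04 behaviour in 2cffcc46c5 (walk up to the nearest `.env`, let the OS environment win, stdlib only) might look roughly like this. It is a sketch under assumptions, not the actual `_load_dotenv()`: the function signature, the simple `KEY=VALUE` parsing, and the quote stripping are illustrative choices.

```python
import os
from pathlib import Path


def load_dotenv(start_dir, environ=None):
    """Load the nearest .env at or above start_dir.

    Variables already present in `environ` are left alone. Returns the path
    of the file that was loaded, or None if no .env was found.
    """
    env = os.environ if environ is None else environ
    start = Path(start_dir).resolve()
    for directory in (start, *start.parents):
        candidate = directory / ".env"
        if not candidate.is_file():
            continue
        for raw in candidate.read_text(encoding="utf-8").splitlines():
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue  # skip blanks, comments, and malformed lines
            key, _, value = line.partition("=")
            key = key.strip()
            if key:
                # setdefault keeps any value already set by the OS environment
                env.setdefault(key, value.strip().strip("'\""))
        return candidate
    return None
```

Passing an explicit dictionary as `environ` makes the precedence rule easy to check without touching the real process environment.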
Godopu 155c6e8d5c docs: fix delete->stop in REPORT + add session/job state glossary (FW-03, FW-10, FW-16)
FW-03: replace 'delete' with 'stop' in skill reference (line 299).
  'terminated' retained as valid YAML status value (hard kill mode).

FW-10/FW-16: add Glossary section distinguishing session states
  (running/stopped/terminated/archived in agent-sessions.yaml) from
  job states (pending/running/completed/error/cancelled in registry).
  Documents which skill/function sets each state.
2026-06-21 06:32:29 +00:00
Godopu 3677e4aace feat(delegate-job): add subscriber auto-reconnect (FW-01) + HMAC-SHA256 event signing (FW-05)
FW-01: job_subscriber.py now has on_disconnect callback (5-arg paho v2
  signature), reconnect_delay_set(1,16) for exponential backoff, and
  with_retry-wrapped initial connect (5 attempts). paho loop_start()
  handles auto-reconnect internally.

FW-05: publish_event.py signs payloads with HMAC-SHA256 using auth_token
  as key (replaces plaintext token in wire). mqtt_common.py adds
  verify_hmac() helper using hmac.compare_digest (timing-safe).
  job_subscriber.py validates incoming events via verify_hmac.
  PoC mode (auth_token=None) skips verification — backward compatible.

Reviewed by agy-existing (PASS) and claude-existing (FAIL: on_disconnect
  4-arg signature → fixed to 5-arg matching paho v2 CallbackAPIVersion).
2026-06-21 06:31:39 +00:00
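
The FW-05 signing scheme in 3677e4aace can be sketched as follows: the payload is signed with HMAC-SHA256 keyed by `auth_token`, verification uses `hmac.compare_digest`, and a missing token (PoC mode) skips the check. The name `verify_hmac` comes from the commit message, but its signature here, the `sign_payload` helper, and the canonical JSON encoding are assumptions.

```python
import hashlib
import hmac
import json


def sign_payload(payload, auth_token):
    """Return a hex HMAC-SHA256 signature over a canonical JSON encoding."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(auth_token.encode("utf-8"), body, hashlib.sha256).hexdigest()


def verify_hmac(payload, signature, auth_token):
    if auth_token is None:
        return True  # PoC mode: no shared secret configured, nothing to verify
    if not isinstance(signature, str):
        return False  # a signed deployment rejects unsigned events
    expected = sign_payload(payload, auth_token)
    # compare_digest runs in time independent of where the inputs differ
    return hmac.compare_digest(expected.encode("ascii"), signature.encode("utf-8"))
```

Sorting keys before signing matters in this sketch: publisher and subscriber must serialize the payload identically, or a valid event would fail verification.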
Godopu 4cea11438a refactor(lib.sh): extract hardcoded tmux shim paths to constants (FW-07) + cache _delegate_py_bin result (FW-08)
FW-07: _resolve_real_tmux_path and _init_tmux_isolation now use
  _TMUX_SHIM_DIR_PATTERN and _TMUX_SKILLS_BIN_PATTERN env-overridable
  constants instead of hardcoded path strings. All 4 reference sites
  updated (lines 32, 37, 57, 76). Default values preserve original
  slash semantics (/multi-agent-tmux-shim/, /skills/.bin).

FW-08: _delegate_py_bin caches result in AGENT_PYTHON_BIN shell
  variable (not exported — avoids cross-workspace pollution).
  Fallback uses command -v python3 for absolute path caching.

Reviewed by agy-existing (FAIL->fixed) and claude-existing (FAIL->fixed).
Both reviewers identified: slash omission, incomplete extraction at :57/:76,
export side effects. All issues resolved.
2026-06-21 06:24:31 +00:00
Godopu c68852b8e3 docs: add FUTURE_WORKS.md — 3-agent deep analysis results (FW-01~FW-16) 2026-06-21 06:15:08 +00:00
Godopu 5af1387b5d refactor(stop): rewrite SKILL.md frontmatter/heading/prose for stop semantics (FW-13, FW-03)
- frontmatter description: 'Terminate...mark it terminated' → 'Stop...hard=terminated, stop options=stopped'
- heading: 'Multi-Agent Delete' → 'Multi-Agent Stop'
- tags: delete → stop
- state machine diagram: delete → stop
- prose: soft/hard delete → soft/hard stop throughout
- stop_session.sh: comments + echo 'delete complete' → 'stop complete'
- create/SKILL.md: companion list 'delete' → 'stop' (2 locations)
- preserved: status=terminated (valid YAML value), terminated_at field, --purge-conversation semantics
2026-06-21 06:13:37 +00:00
Godopu 9334352924 docs: rename REPORT.md -> Messaging_System_REPORT.md (FW-14)
Regularize the uncommitted rename via git mv so the working tree
is clean and the authoritative messaging-system spec is unambiguous.
2026-06-21 06:09:39 +00:00
Godopu a6f7c045bc feat(delegate-job): bump default --timeout 600s -> 3600s (1h wall-clock budget)
Changed 11 locations across 5 files:
- scripts/registry.py: timeout_sec dataclass default + argparse default
- scripts/job_subscriber.py: help text + fallback default
- SKILL.md: 4 recommended invocation examples
- registry.md: JSON example + CLI example
- tmux-agent-orchestrate-delegate-job: bash wrapper TIMEOUT var

--idle-timeout 120s preserved unchanged.
Rationale: 10min default was too short for deep analysis / multi-file
generation tasks; 1h aligns with long-running agent delegation patterns.
2026-06-21 06:08:49 +00:00
Godopu 50b2b201b8 refactor(skills): rename tmux-agent-orchestrate-delete -> stop (step 1)
User decision: 2-step approach (Step 1 = simple rename, Step 2 = option
redefinition in a separate round).

Changes (mechanical, history preserved):
- skills/tmux-agent-orchestrate-delete/ -> skills/tmux-agent-orchestrate-stop/ (git mv)
- scripts/delete_session.sh -> scripts/stop_session.sh (git mv)
- sed s/orchestrate-delete/orchestrate-stop/g + delete_session.sh->stop_session.sh
  across 7 files (0 residual of either pattern)
- SKILL.md frontmatter 'name' -> tmux-agent-orchestrate-stop
- related_skills / companion refs in create/status/monitor/resume SKILL.md updated

NOT in this commit (deferred to step 2):
- Option redefinition (--purge-conversation, --mode soft clarification)
- Deprecation shim (external consumers = 0, no need)

6-route surface preserved (create/resume/stop/status/monitor + delegate-job).

Verified on isolated server -L claude-rename-step1-test (kill-server after):
- syntax PASS (all .sh + py_compile)
- E2E via renamed stop_session.sh: capture-id records id + status=stopped,
  status.sh renders it (DRIFT=-), idempotency exit 0
- 0 stale 'tmux-agent-orchestrate-delete' / 'delete_session' references
- git history preserved (rename detected as R)
- Global skill untouched; real YAML + main canary -L multi-agent-canary untouched

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 15:48:27 +00:00
Godopu a2d4f80608 fix(monitor,resume): honor stopped status + clear stop metadata on resume
Implements user choice Option B: the two follow-ups to 0de0f23, in one patch.

Changes:
- skills/tmux-agent-orchestrate-monitor/scripts/reconcile.sh:
  - drift-A skip-set extended: ('terminated', 'archived', 'stopped')
  - prevents the monitor from overwriting a tmux-dead 'stopped' row with
    'terminated (auto-detected)', which would lose resumable + captured id
- skills/tmux-agent-orchestrate-resume/scripts/update_yaml_resumed.sh:
  - pop stopped_at, stopped_at_epoch, stop_reason, resumable on resume
    (alongside the existing terminated_at*/termination_mode/archived_at) so a
    resumed row has no stale end-of-session metadata
- skills/tmux-agent-orchestrate-monitor/SKILL.md: documented 'stopped' in the
  drift class list + a skip-set note on drift class A
- skills/tmux-agent-orchestrate-resume/SKILL.md: documented stopped -> running
  transition + tier-1 race-free resume path

5-route surface preserved (no new directory). delete_session.sh untouched.

Verified on isolated server -L claude-followup-test (kill-server after):
- syntax PASS
- E2E A: stop -> tmux dead -> reconcile --once -> status stays 'stopped'
- E2E B: resume -> stopped_at/stopped_at_epoch/stop_reason/resumable all gone
- E2E C: plain delete -> terminated, reconcile leaves it (no regression)
- Real YAML + main canary untouched

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 15:32:02 +00:00
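
The two behaviours in a2d4f80608 (the drift-A skip-set and the metadata cleanup on resume) can be sketched on plain dictionaries. The real logic lives in `reconcile.sh` and `update_yaml_resumed.sh`; the function names, the row shape, and the `agent_session_id` key in the usage note are hypothetical. The commit writes `terminated_at*` with a wildcard, so the sketch matches that family by prefix rather than guessing the exact key names.

```python
# Statuses the monitor must not overwrite when the tmux session is gone.
SKIP_STATUSES = ("terminated", "archived", "stopped")

# End-of-session metadata that must not survive a resume.
STALE_KEYS = (
    "stopped_at", "stopped_at_epoch", "stop_reason", "resumable",
    "termination_mode", "archived_at",
)


def reconcile_row(row, tmux_alive):
    """Return the row as the monitor should record it after one pass."""
    if row.get("status") in SKIP_STATUSES:
        return row  # a deliberate end state wins over auto-detection
    if not tmux_alive:
        return {**row, "status": "terminated (auto-detected)"}
    return row


def mark_resumed(row):
    """Return a copy of the row with stale end-of-session metadata removed."""
    cleaned = {
        key: value
        for key, value in row.items()
        if key not in STALE_KEYS and not key.startswith("terminated_at")
    }
    cleaned["status"] = "running"
    return cleaned
```

With this shape, a row in `stopped` state keeps its captured id (for example a hypothetical `agent_session_id` field) through reconcile, and loses only the stop metadata once it is resumed.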