Harden installer: partial-install detection, complete runtime docs, explicit copy checks

fix(deploy): stage installer download and copy runtime assets no-clobber (FW-D1)

deploy/install.sh extracted the repo archive in-place with `tar --strip-components=1`, which inside an existing project could silently overwrite the target's own README.md/FUTURE_WORKS.md/etc and litter it with this repo's dev docs.

Rebuild the fetch path:
- stage the clone/extract into a `mktemp -d` dir, never in-place
- verify `.agents/skills/lib.sh` is present before copying anything
- copy only runtime assets (.agents/, AGENT.md, .env.example) into the target with per-file no-clobber guards (`[ ! -e ]`), so existing files always win
- post-fetch sanity check now tests a file, not just the directory
- fail fast when neither git nor curl is available

Use explicit `[ ! -e ]` guards + a POSIX find merge rather than `cp -n` (non-portable; emits a deprecation warning on GNU coreutils 9.x). The earlier `tar --exclude` denylist idea was rejected in review: non-portable and the unanchored `--exclude="scripts"` pattern stripped the skills' own nested scripts/ dirs, yielding a silently broken install.

Mark FW-D1 resolved and FW-D2 partially addressed in FUTURE_WORKS.md/.ko.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 10:43:08 +09:00 · 2026-06-24 10:33:05 +09:00
3 changed files with 107 additions and 17 deletions
@@ -2,7 +2,7 @@

 > **Purpose**: Track future work candidates for the `multi-agent-mux` project.
 > For completed items, see `DONE.ko.md`.
-> **Last Updated**: 2026-06-22
+> **Last Updated**: 2026-06-24

 ---

@@ -26,6 +26,10 @@
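 | **FW-W5** | Define a structured message schema for reviewer verdicts | P2 (Medium) | Medium | **Workflow**: Instead of having the PM agent grep raw terminal scrollback, introduce a dedicated review feedback topic (e.g., `reviews/<job_id>/verdicts`) and a structured JSON format (`PASS`/`NOT_PASS` + blocking factors) | None |
 | **FW-W6** | Extend the monitor recovery loop to support the Hermes agent | P2 (Medium) | Medium | **Workflow / Consistency**: Fully integrate `hermes` sessions into the auto-registration (drift-B) and ID synchronization (drift-C) logic in `reconcile.sh`, giving them the same level of monitoring and recovery as Claude/Agy sessions | None |
 | **FW-W7** | Resolve directory-path slug name collisions in derive_session_name | P2 (Medium) | Small | **Workflow / Collision Avoidance**: Slugifying only the last two directories causes session-name collisions for same-named nested directories (e.g., `/projectA/src` and `/projectB/src` slugify to the same session name); resolve this by adopting a session naming convention that includes a workspace-scoped hash | None |
+| ~~**FW-D1**~~ | ✅ **RESOLVED (2026-06-24)**: the install script no longer extracts in-place | - | - | **Deploy / Safety**: `deploy/install.sh` now stages the download in a `mktemp -d` temporary directory, verifies that `.agents/skills/lib.sh` exists, then copies only the runtime assets (`.agents/`, `AGENT.md`, `.env.example`) into the target with per-file no-clobber guards (`[ ! -e ]`). Existing target files therefore always win, and the repo's development docs never enter the workspace. The post-fetch sanity check now also tests a file rather than a directory | Done |
+| **FW-D2** | Pin and verify the source the install script downloads before it is sourced | P2 (Medium) | Small | **Deploy / Supply-chain**: The install script clones/extracts the moving `main` branch over the network, and the workspace later `source`s those shell scripts (`lib.sh`, etc.). *Partially addressed (2026-06-24): the staged tree is verified to contain `.agents/skills/lib.sh` before copying.* **Remaining work:** pin to a release tag or commit SHA and verify a published checksum, guaranteeing content integrity and not just structural presence | None |
+| **FW-D3** | De-duplicate the NFS detection logic between `install.sh` and `lib.sh` | P2 (Medium) | Small | **Deploy / Portability**: `deploy/install.sh` re-implements the GNU-only `df --output=target` + `mount` NFS check that already exists in `lib.sh::_check_is_nfs`. Extract a single shared helper so that the FW-P1 portability fix also covers this second copy and both call sites behave correctly on macOS/BSD | FW-P1 |
+| **FW-D4** | Close the CI shellcheck coverage gap | P3 (Low) | Small | **Deploy / Quality**: `deploy/gitea-ci.yml` shellchecks only 5 scripts; `status.sh`, `resolve_session_id.sh`, `update_yaml_resumed.sh`, and `scripts/generate-env.sh` are not checked. Glob all tracked `*.sh` files so that new scripts are included automatically | None |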

 ---

@@ -44,4 +48,7 @@
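  * The authority to force-terminate a session (`tmux kill-session`) must be tightly controlled. If the monitor (`reconcile.sh`) accepts wildcard-topic messages without verification and cleans up sessions immediately, it is vulnerable to forged-message injection. The design mandates HMAC signature verification on the termination-event receiver and a cross-check that the expected work artifacts exist before a session is force-stopped.

 5. **Consolidation of per-job watchdogs into a single wildcard subscriber (FW-W3)**:
-   * Instead of the previous approach, in which a `watchdog.sh` process ran separately for each job and disconnected and reconnected every 2 minutes, event handling, HMAC security verification, and sequence tracking were fully unified into a single, always-on wildcard subscriber, `reconcile.sh --subscribe`. This prevents unnecessary surges in MQTT connections at the source, simplifies session cleanup, and, through in-memory-cache-based sequence tracking, reliably provides replay-attack protection for all concurrently running jobs.
+   * Instead of the previous approach, in which a `watchdog.sh` process ran separately for each job and disconnected and reconnected every 2 minutes, event handling, HMAC security verification, and sequence tracking were fully unified into a single, always-on wildcard subscriber, `reconcile.sh --subscribe`. This prevents unnecessary surges in MQTT connections at the source, simplifies session cleanup, and, through in-memory-cache-based sequence tracking, reliably provides replay-attack protection for all concurrently running jobs.
+
+6. **Deployment install script hardening (FW-D1 ~ FW-D4)**:
+   * `deploy/install.sh` and the Gitea templates are the most recently added (after the DONE.md verification round) and least-reviewed area, and the only path that runs *before* the verified orchestration code executes. **FW-D1 (the release blocker) is now resolved (2026-06-24):** the originally proposed `tar --exclude` denylist approach was found in review to be non-portable and, more seriously, its unanchored `--exclude="scripts"` pattern also removed the `scripts/` directories inside the skill tree, producing a silently broken install. It was replaced with a rebuild around temp-dir staging + an allowlist copy of runtime assets + per-file no-clobber guards. This resolves both the destructive-overwrite risk and the dev-doc pollution at once. FW-D2 is partially addressed (the staged tree's structure is verified before copying); the remaining supply-chain hardening is pinning the fetch to a tag/SHA + checksum. FW-D3 (NFS detection divergence, folded into FW-P1) and FW-D4 (CI lint coverage) remain as consistency/quality debt.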
@@ -2,7 +2,7 @@

 > **Purpose**: Track future work candidates for the `multi-agent-mux` project.
 > For completed items, see `DONE.md`.
-> **Last Updated**: 2026-06-22
+> **Last Updated**: 2026-06-24

 ---

@@ -20,12 +20,17 @@ Below is the list of pending future work items. These items were proposed based
 | **FW-P5** | Resolve BASH_SOURCE path resolution under zsh | P2 (Medium) | Small | **Portability**: Fix `lib.sh` interactive sourcing issues under zsh shell where `${BASH_SOURCE[0]}` resolves to empty. | None |
 | **FW-P6** | Anchor project root dynamically via marker-file lookup | P1 (High) | Medium | **Portability**: Resolve structural fragility caused by hardcoded `../..` relative directory traversal in `lib.sh`, `status.sh`, and `reconcile.sh`. Use an upward search for root markers (`.git`, `.mam`, `.env`) to export a single source of truth for `WORKSPACE_ROOT`. | None |
 | **FW-P7** | Enforce HMAC verification and liveness checks on monitor termination | P1 (High) | Medium | **Portability / Security**: Prevent remote session killing by unauthorized or spoofed events. Integrate `verify_hmac` inside the monitor (`reconcile.sh`'s `on_message` handler) and confirm expected artifacts exist before executing `tmux kill-session`. | None |
+| **FW-P8** | Unify `.env` loading in `lib.sh` to prevent split-brain path resolution | P1 (High) | Small | **Portability / Consistency**: Sourcing the `.env` file inside `lib.sh` is critical to prevent split-brain path resolution where shell scripts query the default session database path while Python scripts query a custom path defined in `.env`. Sourcing `.env` at the top of `lib.sh` ensures all shell utilities automatically inherit user overrides for `TMUX_SERVER_NAME`, `AGENT_SESSIONS_YAML`, etc. | None |
 | **FW-W1** | Replace global registry lock with fine-grained locks | P2 (Medium) | Medium | **Concurrency / Scaling**: Eliminate throughput bottlenecks where all progress/sequence updates channel through a single fcntl lock on `.mam/jobs/`. Implement per-job lock files. | None |
 | **FW-W2** | Implement readiness probes for blind TUI key inputs | P2 (Medium) | Large | **Workflow**: Replace fixed timing sleeps in create, resume, and stop scripts with dynamic terminal readiness probes (e.g. scrapers or CLI checking hooks) to dismiss trust dialogs robustly. | None |
 | **FW-W4** | Persist subscriber sequence numbers alongside job records | P1 (High) | Medium | **Workflow / Security**: Persist `subscriber.last_seq` to disk or SQLite to prevent sequence counter reset on subscriber restart, locking down the replay defense window for the full job lifetime. | None |
 | **FW-W5** | Define structured message schema for reviewer verdicts | P2 (Medium) | Medium | **Workflow**: Create a dedicated reviewer topic (e.g., `reviews/<job_id>/verdicts`) emitting structured JSON verdicts (`PASS` / `NOT_PASS` + details) to eliminate raw text grepping by the PM. | None |
 | **FW-W6** | Expand monitor reconciliation support to Hermes agent | P2 (Medium) | Medium | **Workflow / Consistency**: Fully integrate `hermes` sessions into auto-registration (drift-B) and ID materialization (drift-C) under `reconcile.sh` to match Claude/Agy monitoring coverage. | None |
 | **FW-W7** | Resolve path slug collisions in derive_session_name | P2 (Medium) | Small | **Workflow / Collision Avoidance**: Update `derive_session_name` to handle same-name nested directories (e.g. `/projectA/src` and `/projectB/src` both slugify to identical session names) by incorporating workspace-scoped identifiers or hash digests. | None |
+| ~~**FW-D1**~~ | ✅ **RESOLVED (2026-06-24)** — installer no longer extracts in-place | — | — | **Deploy / Safety**: `deploy/install.sh` now stages the download into a `mktemp -d` dir, verifies `.agents/skills/lib.sh` is present, then copies only the runtime assets (`.agents/`, `AGENT.md`, `.env.example`) into the target with per-file no-clobber guards (`[ ! -e ]`), so existing target files always win and repo dev docs never land in the workspace. The post-fetch sanity check now tests a file, not just the directory. | Done |
+| **FW-D2** | Pin and verify the source the installer downloads before sourcing it | P2 (Medium) | Small | **Deploy / Supply-chain**: The installer clones/extracts the moving `main` branch over the network, and the workspace later `source`s those shell scripts (`lib.sh` et al.). *Partially addressed (2026-06-24): the staged tree is now verified to contain `.agents/skills/lib.sh` before any file is copied.* **Remaining:** pin to a release tag or commit SHA and/or verify a published checksum so the fetched content is integrity-checked, not merely structurally present. | None |
+| **FW-D3** | De-duplicate NFS detection between `install.sh` and `lib.sh` | P2 (Medium) | Small | **Deploy / Portability**: `deploy/install.sh` re-implements the GNU-specific `df --output=target` + `mount` NFS check already present in `lib.sh::_check_is_nfs`. The FW-P1 portability fix must cover this second copy — extract a single shared helper so both call sites stay correct on macOS/BSD. | FW-P1 |
+| **FW-D4** | Close CI shellcheck coverage gaps | P3 (Low) | Small | **Deploy / Quality**: `deploy/gitea-ci.yml` shellchecks only 5 scripts; `status.sh`, `resolve_session_id.sh`, `update_yaml_resumed.sh`, and `scripts/generate-env.sh` are never linted. Glob all tracked `*.sh` so new scripts are covered automatically. | None |

 ---

@@ -44,4 +49,9 @@ Below is the list of pending future work items. These items were proposed based
   * Auto-termination must not trust unauthenticated events. Since `reconcile.sh` listens to a wildcard topic, any client on a public broker could spoof a terminal message and trigger `tmux kill-session`. Requiring HMAC signature verification on the terminal event path, combined with artifact validation, mitigates spoofing and accidental session cleanup.

 5. **Consolidation of per-job watchdogs (FW-W3)**:
-   * Instead of spawning an independent `watchdog.sh` process per job that reconnected every 2 minutes, we consolidated event handling, HMAC security verification, and sequence tracking into a single, persistent wildcard subscriber running under `reconcile.sh --subscribe`. This drastically reduces MQTT broker connections, simplifies cleanup logic, and uses Python's in-memory storage to track monotonic sequence numbers, which provides replay-attack prevention across concurrent jobs.
+   * Instead of spawning an independent `watchdog.sh` process per job that reconnected every 2 minutes, we consolidated event handling, HMAC security verification, and sequence tracking into a single, persistent wildcard subscriber running under `reconcile.sh --subscribe`. This drastically reduces MQTT broker connections, simplifies cleanup logic, and uses Python's in-memory storage to track monotonic sequence numbers, which provides replay-attack prevention across concurrent jobs.
+6. **Consistent `.env` Sourcing across Shell and Python (FW-P8)**:
+   * Sourcing the `.env` configuration file inside `lib.sh` ensures that shell utilities and Python scripts are fully aligned. Without this, customized database locations or isolated tmux server names declared in `.env` are only honored by the Python-based MQTT subsystems, while the shell orchestrators silently fall back to default socket files and paths.
+
+7. **Deployment Installer Hardening (FW-D1 ~ FW-D4)**:
+   * `deploy/install.sh` and the Gitea templates are the newest, least-reviewed surface (added after the DONE.md verification round) and the one path that runs *before* any of the reviewed orchestration code. **FW-D1 (the release blocker) is now resolved (2026-06-24).** The originally proposed `tar --exclude` denylist was rejected in review: it was non-portable and, worse, its unanchored `--exclude="scripts"` pattern stripped the skills' own nested `scripts/` directories, yielding a silently broken install. The installer was instead rebuilt around temp-dir staging plus an allowlist copy of runtime assets with per-file no-clobber guards, which closes both the destructive-overwrite hole and the dev-doc clutter. FW-D2 is partially addressed (the staged tree is structurally verified before copy); the remaining supply-chain hardening is pinning the fetch to a tag/SHA and verifying a checksum. FW-D3 (NFS detection drift, folded into FW-P1) and FW-D4 (CI lint coverage) remain open consistency/quality debt.
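
The remaining FW-D2 work (pin the fetch and verify a checksum) might look roughly like the sketch below. It is not part of this change: `sha256_of` and `verify_sha256` are hypothetical helper names, and the tag and published checksum in the usage comment are placeholders that the project would have to define. The sketch assumes either `sha256sum` (GNU coreutils) or `shasum` (macOS/BSD) is installed.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, for illustration only (not in deploy/install.sh).

# Print the SHA-256 hex digest of a file using whichever tool is available.
sha256_of() {
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum "$1" | awk '{print $1}'
  elif command -v shasum >/dev/null 2>&1; then
    shasum -a 256 "$1" | awk '{print $1}'
  else
    return 1
  fi
}

# Succeed only if the file's digest equals the expected digest.
verify_sha256() {
  local file="$1" expected="$2" actual
  actual="$(sha256_of "$file")" || return 1
  # An empty digest means the hashing tool failed (e.g. missing file).
  [ -n "$actual" ] && [ "$actual" = "$expected" ]
}

# Possible use in an installer (placeholders, not real values):
#   curl -fsSL "$PINNED_ARCHIVE_URL" -o "$STAGE_DIR/src.tar.gz"
#   verify_sha256 "$STAGE_DIR/src.tar.gz" "$PUBLISHED_SHA256" || exit 1
#   tar -xz --strip-components=1 -C "$STAGE_DIR" -f "$STAGE_DIR/src.tar.gz"
# With git, a tag can be pinned via: git clone --depth 1 --branch <tag> ...
```

Downloading to a file before extracting, instead of piping `curl` straight into `tar`, is what makes the integrity check possible: nothing is unpacked until the digest matches.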
@@ -54,24 +54,97 @@ echo "✅ PyYAML (system dependency) detected."
 mkdir -p "$TARGET_DIR"
 cd "$TARGET_DIR"

-# Download .agents/skills if missing (for curl one-liner installs)
-if [ ! -d ".agents/skills" ]; then
-  echo "📥 .agents/skills not found. Fetching from Gitea repository..."
-  if command -v git &>/dev/null && [ -z "$(ls -A 2>/dev/null || echo "")" ]; then
-    echo "🌐 Cloning Gitea repository..."
-    git clone "https://git.godopu.com/tmpl/multi-agent-mux.git" .
+REPO_URL="https://git.godopu.com/tmpl/multi-agent-mux.git"
+ARCHIVE_URL="https://git.godopu.com/tmpl/multi-agent-mux/archive/main.tar.gz"
+
+# Helper to verify presence of all core runtime files.
+# Keying off a set of core files helps detect and recover from partial/interrupted installations.
+check_assets_present() {
+  local dir="${1:-.}"
+  local core_files=(
+    ".agents/skills/lib.sh"
+    ".agents/skills/multi-agent-mux-create/scripts/create_session.sh"
+    ".agents/skills/multi-agent-mux-delegate-job/scripts/registry.py"
+    ".agents/skills/multi-agent-mux-status/scripts/status.sh"
+  )
+  for f in "${core_files[@]}"; do
+    if [ ! -f "$dir/$f" ]; then
+      return 1
+    fi
+  done
+  return 0
+}
+
+# Fetch the orchestration assets if missing or incomplete (for curl one-liner installs).
+#
+# Safety model (FW-D1): we NEVER extract the repo archive directly into the
+# target. Running inside an existing project must not overwrite the target's
+# own files (README.md, FUTURE_WORKS.md, …) or litter it with this repo's
+# development docs. Instead we stage the download into a throwaway temp dir,
+# verify it, then copy ONLY the runtime assets (.agents/, documents, .env.example)
+# into the target with per-file no-clobber guards so a pre-existing target file
+# always wins.
+if ! check_assets_present "."; then
+  echo "📥 Orchestration skills not found or incomplete. Fetching from Gitea repository..."
+  STAGE_DIR="$(mktemp -d)"
+  trap 'rm -rf "$STAGE_DIR"' EXIT
+
+  if command -v git &>/dev/null; then
+    echo "🌐 Cloning repository (shallow) into a staging area..."
+    git clone --depth 1 "$REPO_URL" "$STAGE_DIR"
+  elif command -v curl &>/dev/null; then
+    echo "🌐 Downloading and extracting archive into a staging area..."
+    curl -fsSL "$ARCHIVE_URL" | tar -xz --strip-components=1 -C "$STAGE_DIR"
  else
-    echo "🌐 Downloading and extracting skills archive..."
-    curl -fsSL "https://git.godopu.com/tmpl/multi-agent-mux/archive/main.tar.gz" | tar -xz --strip-components=1
-  fi
-  
-  if [ ! -d ".agents/skills" ]; then
-    echo "❌ Error: Fetch completed but '.agents/skills' is still missing. Target layout might be invalid." >&2
+    echo "❌ Error: neither 'git' nor 'curl' is available to fetch the skills." >&2
    exit 1
  fi
-  echo "✅ Skills downloaded successfully."
+
+  # Verify the staged tree before we trust and copy from it.
+  if ! check_assets_present "$STAGE_DIR"; then
+    echo "❌ Error: fetched source is missing core runtime assets. Aborting (no files copied)." >&2
+    exit 1
+  fi
+
+  # Copy ONLY runtime assets into the target, never overwriting an existing
+  # target file. We merge per-file (POSIX find + an explicit "[ ! -e ]" guard)
+  # instead of `cp -n`: `cp -n` is non-portable and now prints a deprecation
+  # warning on GNU coreutils 9.x, whereas the explicit guard is portable to
+  # BSD/macOS and makes the no-clobber intent obvious.
+  mkdir -p .agents
+  ( cd "$STAGE_DIR/.agents" && find . -type f -print ) | while IFS= read -r rel; do
+    dest=".agents/${rel#./}"
+    if [ ! -e "$dest" ]; then
+      mkdir -p "$(dirname "$dest")"
+      cp "$STAGE_DIR/.agents/$rel" "$dest" || { echo "❌ Error: Failed to copy $rel" >&2; exit 1; }
+    fi
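+  done || exit 1  # the loop runs in the pipeline's subshell; propagate a failed copy even without `set -e`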
+
+  # Copy non-dev documents if they don't already exist.
+  # We skip dev-specific docs like README.md, DONE.md, and FUTURE_WORKS.md.
+  for doc in AGENT.md AGENT.ko.md MESSAGING.md BOOTSTRAP.md BOOTSTRAP.ko.md INSTRUCTION.md; do
+    if [ -f "$STAGE_DIR/$doc" ] && [ ! -e "$doc" ]; then
+      cp "$STAGE_DIR/$doc" . || { echo "❌ Error: Failed to copy $doc" >&2; exit 1; }
+    fi
+  done
+
+  if [ -f "$STAGE_DIR/.env.example" ] && [ ! -e ".env.example" ]; then
+    cp "$STAGE_DIR/.env.example" . || { echo "❌ Error: Failed to copy .env.example" >&2; exit 1; }
+  fi
+
+  rm -rf "$STAGE_DIR"
+  trap - EXIT
+  echo "✅ Skills staged into workspace (existing files preserved)."
 fi

+# Sanity check: verify all core files, not just a single one — an empty or
+# incomplete layout would yield a silently broken install.
+if ! check_assets_present "."; then
+  echo "❌ Error: Core runtime assets missing after setup. Target layout might be invalid." >&2
+  exit 1
+fi
+echo "✅ Orchestration skills present."
+
 echo "📂 Ensuring metadata directory structure (.mam/)..."
 mkdir -p .mam/jobs .mam/delegate_job_logs
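
The per-file no-clobber merge from this hunk can be tried in isolation with a sketch like the one below. `merge_no_clobber` is a hypothetical name used only for illustration; the installer inlines the loop. The sketch also shows one detail worth keeping in mind: an `exit` inside a `find | while` loop only leaves the pipeline's subshell, so the caller has to check the pipeline's status.

```shell
#!/usr/bin/env bash
# merge_no_clobber SRC DEST  (hypothetical helper name, for illustration only)
# Copies every regular file under SRC into DEST, skipping any path that
# already exists in DEST. File names containing newlines are not supported.
merge_no_clobber() {
  local src="$1" dest="$2"
  [ -d "$src" ] || return 1
  ( cd "$src" && find . -type f -print ) | while IFS= read -r rel; do
    target="$dest/${rel#./}"
    if [ ! -e "$target" ]; then
      # `exit` here only terminates the pipeline's subshell ...
      mkdir -p "$(dirname "$target")" || exit 1
      cp "$src/$rel" "$target" || exit 1
    fi
  done || return 1  # ... so the pipeline status is checked explicitly.
}
```

Calling `merge_no_clobber "$STAGE_DIR/.agents" .agents` would then leave any file already present under `.agents/` untouched and copy only the missing ones.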