feat(monitor): consolidate per-job watchdogs into shared wildcard subscriber (FW-W3)

2026-06-23 00:35:48 +09:00
parent 31f18b2e5a
commit 12dceb14b2
8 changed files with 97 additions and 83 deletions
+4 -2
@@ -22,7 +22,6 @@ Below is the list of pending future work items. These items were proposed based
| **FW-P7** | Enforce HMAC verification and liveness checks on monitor termination | P1 (High) | Medium | **Portability / Security**: Prevent remote session killing by unauthorized or spoofed events. Integrate `verify_hmac` inside the monitor (`reconcile.sh`'s `on_message` handler) and confirm expected artifacts exist before executing `tmux kill-session`. | None |
| **FW-W1** | Replace global registry lock with fine-grained locks | P2 (Medium) | Medium | **Concurrency / Scaling**: Eliminate throughput bottlenecks where all progress/sequence updates channel through a single fcntl lock on `.mam/jobs/`. Implement per-job lock files. | None |
| **FW-W2** | Implement readiness probes for blind TUI key inputs | P2 (Medium) | Large | **Workflow**: Replace fixed timing sleeps in create, resume, and stop scripts with dynamic terminal readiness probes (e.g. scrapers or CLI checking hooks) to dismiss trust dialogs robustly. | None |
| **FW-W3** | Consolidate per-job watchdogs into shared wildcard subscriber | P2 (Medium) | Medium | **Workflow / Efficiency**: Drop per-job watchdog + subscriber churn (which reconnects every 120s) and migrate their handling to the wildcard MQTT subscriber already running in `reconcile.sh`. | None |
| **FW-W4** | Persist subscriber sequence numbers alongside job records | P1 (High) | Medium | **Workflow / Security**: Persist `subscriber.last_seq` to disk or SQLite to prevent sequence counter reset on subscriber restart, locking down the replay defense window for the full job lifetime. | None |
| **FW-W5** | Define structured message schema for reviewer verdicts | P2 (Medium) | Medium | **Workflow**: Create a dedicated reviewer topic (e.g., `reviews/<job_id>/verdicts`) emitting structured JSON verdicts (`PASS` / `NOT_PASS` + details) to eliminate raw text grepping by the PM. | None |
| **FW-W6** | Expand monitor reconciliation support to Hermes agent | P2 (Medium) | Medium | **Workflow / Consistency**: Fully integrate `hermes` sessions into auto-registration (drift-B) and ID materialization (drift-C) under `reconcile.sh` to match Claude/Agy monitoring coverage. | None |
@@ -41,4 +40,7 @@ Below is the list of pending future work items. These items were proposed based
* Hardcoding relative depth limits (like `../..` relative to a skill's location) creates direct fragility when moving directories or refactoring. By walking up the directory tree to search for known anchors (like `.git` or `.mam`), we establish a single canonical root path and prevent scripts from breaking when their execution wrappers are relocated.
4. **Monitor Termination Authorization (FW-P7)**:
* Auto-termination must not trust unauthenticated events. Since `reconcile.sh` listens to a wildcard topic, any client on a public broker could spoof a terminal message and trigger `tmux kill-session`. Requiring HMAC signature verification on the terminal event path, combined with artifact validation, mitigates spoofing and accidental session cleanup.
5. **Consolidation of per-job watchdogs (FW-W3)**:
* Instead of spawning an independent `watchdog.sh` process for each job, each of which reconnects every 2 minutes, we consolidated event handling, HMAC verification, and sequence tracking into a single persistent wildcard subscriber running under `reconcile.sh --subscribe`. This replaces the per-job MQTT broker connections with one shared connection, simplifies cleanup logic, and uses Python's in-memory state to enforce monotonic sequence numbers (replay-attack prevention) across concurrent jobs.
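
The consolidation described in item 5 can be illustrated with a minimal sketch of a shared handler that verifies an HMAC and tracks one monotonic sequence counter per job in memory. This is not the code in `reconcile.sh`: the topic layout `jobs/<job_id>/events`, the JSON fields `seq`, `body`, and `sig`, the signing format, and the names `SharedSubscriber`, `sign`, and `SECRET` are all assumptions made for illustration.

```python
import hashlib
import hmac
import json

# Hypothetical shared secret; the real key handling is not shown in this commit.
SECRET = b"example-shared-secret"


def sign(job_id: str, seq: int, body: str, secret: bytes = SECRET) -> str:
    """Return a hex HMAC-SHA256 over an assumed 'job_id|seq|body' layout."""
    msg = f"{job_id}|{seq}|{body}".encode()
    return hmac.new(secret, msg, hashlib.sha256).hexdigest()


class SharedSubscriber:
    """One handler for every job, standing in for a wildcard subscription."""

    def __init__(self, secret: bytes = SECRET) -> None:
        self.secret = secret
        # job_id -> highest sequence number accepted so far (in memory only).
        self.last_seq: dict[str, int] = {}

    def accept(self, topic: str, payload: str) -> bool:
        """Return True only for authentic, non-replayed messages."""
        parts = topic.split("/")  # assumed layout: jobs/<job_id>/events
        if len(parts) != 3:
            return False
        job_id = parts[1]
        try:
            data = json.loads(payload)
            seq = int(data["seq"])
            body = str(data["body"])
            sig = str(data["sig"])
        except (ValueError, KeyError, TypeError):
            return False
        expected = sign(job_id, seq, body, self.secret)
        # Constant-time comparison avoids leaking how much of the signature matched.
        if not hmac.compare_digest(expected.encode(), sig.encode()):
            return False
        # Reject anything at or below the last accepted sequence for this job.
        if seq <= self.last_seq.get(job_id, -1):
            return False
        self.last_seq[job_id] = seq
        return True
```

Because `last_seq` lives only in process memory, a subscriber restart resets every counter in this sketch; that is the gap FW-W4 proposes to close by persisting the value.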