FW-09: SKILL.md defines valid last_visible_status values (running/stopped/ terminated/archived). reconcile.sh now sets last_visible_status to 'running' and uses last_visible_note for free-form comments. FW-15: SKILL.md adds Security section for --subscribe on public brokers. Documents wildcard subscription risks, auto-kill spoofing, HMAC verification mitigation, and recommends --once/polling for PoC.
| name | description | version | author | license | platforms | environments | metadata |
|---|---|---|---|---|---|---|---|
| tmux-agent-orchestrate-monitor | Run a long-lived Kanban worker that polls .hermes/agent-sessions.yaml against the actual tmux/agent runtime state and reconciles them. Use when you want live visibility into which agent sessions are running, which are dead, which have stale YAML entries, and which have new session ids that haven't been recorded yet. Designed to be dispatched as a Kanban goal_mode task (--goal) so it keeps running until the user stops it. | 1.0.0 | godopu | MIT | | | |
Agent Sessions Monitor — Live Reconciliation via Kanban Worker
Companion skills: tmux-agent-orchestrate-create, tmux-agent-orchestrate-resume, tmux-agent-orchestrate-stop (mutators); this skill is the observer. Single source of truth: ./.hermes/agent-sessions.yaml.
What this skill does
Dispatch a Kanban worker (in goal_mode) that:
- Every ~30s polls the actual state of:
  - tmux ls (which sessions are alive)
  - tmux list-panes -t <session> ... (pane cmd, cwd, pid)
  - ~/.claude/projects/<workspace-key>/*.jsonl mtime + first-line sessionId
  - ~/.gemini/antigravity-cli/cache/last_conversations.json (agy workspace → conversation mapping)
  - ~/.gemini/antigravity-cli/conversations/<uuid>.db mtime (agy)
- Compares the live state to agent-sessions.yaml
- Detects 4 classes of drift:
  - yaml-only terminated/archived/stopped: tmux dead, YAML says terminated, archived, or stopped → OK, left untouched (deliberate end states)
  - yaml-only running, tmux dead: YAML says running, tmux is gone → mark terminated with timestamp
  - tmux-only running, not in YAML: tmux session exists with <workspace>-creator-* naming but YAML doesn't know about it → register as a new entry
  - stale UUID: YAML has a UUID, but the on-disk artifact is gone → flag in comment
- Writes a Kanban kanban_comment on every drift event with diff details
- Heartbeat every 5 minutes
- Goal loop: judge (auxiliary model) re-checks the card after each turn against the body to decide "is monitoring still wanted?". When the user says "stop monitoring" via comment, the worker blocks with reason=stop-requested.
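The tmux-versus-YAML part of the drift detection above can be sketched as a pure function. This is a minimal illustration, not the logic in reconcile.sh: the flattened `name -> status` mapping, the `classify` name, and the label strings are all assumptions made for the example, and the stale-UUID class is left out because it needs disk access.

```python
# Hypothetical sketch: classify tmux-vs-YAML drift from two in-memory views.
END_STATES = {"terminated", "archived", "stopped"}


def classify(yaml_sessions: dict[str, str], live_tmux: set[str]) -> dict[str, str]:
    """Return a label per session name.

    yaml_sessions: session name -> status as recorded in the YAML (assumed shape).
    live_tmux: session names currently reported by `tmux ls`.
    """
    labels: dict[str, str] = {}
    for name, status in sorted(yaml_sessions.items()):
        if name in live_tmux:
            # The source only describes the running/alive case as healthy.
            labels[name] = "no-drift" if status == "running" else "unspecified"
        elif status == "running":
            labels[name] = "mark-terminated"   # YAML says running, tmux is gone
        elif status in END_STATES:
            labels[name] = "ok-untouched"      # deliberate end state, leave alone
        else:
            labels[name] = "invalid-status"
    for name in sorted(live_tmux - set(yaml_sessions)):
        if "-creator-" in name:                # <workspace>-creator-* naming
            labels[name] = "register"
    return labels
```

Sessions that exist in tmux but do not follow the creator naming are ignored rather than registered, which matches the naming condition stated in the list.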
When to use
- You have multiple workspaces with tmux agent sessions and want a single source of truth
- You suspect YAML drift after a host reboot / crash
- You want a notification when a session id was just created (so you can record it before next restart)
- You're running multi-day work and want to know "what's actually running right now"
When NOT to use
- One-off interactive session: just check tmux ls and read the YAML
- A single, short session: overhead > benefit
- You don't have a Kanban dispatcher running
Dispatching the monitor
# Goal-mode task: keeps running until the user signals stop
hermes kanban create \
--title "agent-sessions monitor (live reconcile)" \
--assignee default \
--workspace worktree \
--branch wt/tmux-agent-orchestrate-monitor \
--goal \
--goal-max-turns 100 \
--max-runtime 8h \
--max-retries 1 \
--skill tmux-agent-orchestrate-monitor \
--body "$(cat <<'EOF'
You are the agent-sessions monitor. Every 30 seconds, do:
1. Read .hermes/agent-sessions.yaml
2. Run `tmux ls` and `tmux list-panes -a -F 'session=#{session_name} pid=#{pane_pid} cmd=#{pane_current_command} cwd=#{pane_current_path}'` (`-a` lists panes from every session, not only the current window)
3. For each session in the YAML, check the corresponding tmux state
4. For each tmux session matching `*-creator-claude` or `*-creator-agy` that's not in the YAML, register it
5. For any drift, call `kanban_comment` with the diff
6. Sleep 30 seconds, then repeat
If the user comments `stop` or `stop monitoring` on this card, call `kanban_block(reason="stop-requested by user")`.
If you find that a Claude session's `claude_session_id_own` is null but there's a new *.jsonl in the project dir, read the sessionId from the first line and update the YAML.
Use the helper script at skills/tmux-agent-orchestrate-monitor/scripts/reconcile.sh for the YAML updates — it handles all the merge logic and writes a structured comment to this card.
EOF
)"
Helper script: reconcile.sh
The worker calls this script every 30s. It:
- Diffs YAML ↔ tmux ↔ disk artifacts
- Updates YAML if needed (only when changes are real, not on every poll — avoids spamming)
- Emits a JSON diff to stdout that the worker turns into a kanban_comment
# Reconcile + auto-update YAML (atomic, flock-guarded). Emits JSON drift to stdout.
bash skills/tmux-agent-orchestrate-monitor/scripts/reconcile.sh --once --emit-diff
# Read-only: compute drift WITHOUT writing the YAML (use for "what's running?" checks).
bash skills/tmux-agent-orchestrate-monitor/scripts/reconcile.sh --once --emit-diff --dry-run
# Push-based MQTT Monitor: listen to delegated job events on the broker and update the YAML instantly.
# Bounded run that exits after 5 min idle, or 1 h wall-clock; falls back to polling if the broker is down.
bash skills/tmux-agent-orchestrate-monitor/scripts/reconcile.sh --subscribe --idle-timeout 300 --timeout 3600
# Persistent monitor (no timeouts): runs until interrupted; still polls if the broker is unreachable.
bash skills/tmux-agent-orchestrate-monitor/scripts/reconcile.sh --subscribe --idle-timeout 0
Flags:
- --once: single pass
- --emit-diff: print JSON
- --dry-run: no mutation (P1-E)
- --subscribe: push-based MQTT subscription monitoring
--subscribe sub-flags:
- --timeout N: exit after N seconds of wall-clock time (default 0 = no limit)
- --idle-timeout N: exit after N seconds with no message (default 600; 0 = never idle out)
On a broker connection failure (connect error or non-zero CONNACK), --subscribe falls back to a polling loop that re-runs --once --emit-diff every RECONCILE_POLL_INTERVAL (default 15) seconds until --timeout. Terminal-event YAML updates are written through lib.sh::atomic_dump_yaml (flock + schema-validate + .bak). There are no --workspace / --agent / --comment-card flags; the worker turns the emitted JSON drifts[] into kanban_comment calls itself.
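Since the worker itself turns the emitted drifts[] into comments, a small parser is enough on the worker side. The sketch below assumes each drift is an object with `session` and `message` keys; only the top-level `drifts` array is named in this document, so those inner field names and the `drift_comments` helper are hypothetical.

```python
# Hypothetical sketch: turn reconcile.sh stdout into comment strings.
import json


def drift_comments(stdout_text: str) -> list[str]:
    """Parse the JSON emitted by --emit-diff and build one comment per drift."""
    payload = json.loads(stdout_text) if stdout_text.strip() else {}
    comments = []
    for drift in payload.get("drifts", []):
        # "session" and "message" are assumed field names; fall back to the
        # raw drift object so nothing is silently dropped.
        session = drift.get("session", "<unknown session>")
        message = drift.get("message", json.dumps(drift, sort_keys=True))
        comments.append(f"{session}: {message}")
    return comments
```

An empty stdout or an empty `drifts` array yields no comments, so a quiet poll produces no Kanban traffic.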
Drift classes (what the script handles)
Status Enum
The status and last_visible_status fields MUST be one of the following exact strings: running, stopped, terminated, archived.
Any unstructured comments or reasons for the status change should be placed in last_visible_note or termination_mode.
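A check for this enum might look like the following. It is a sketch only: the `status_errors` helper is hypothetical, and treating an absent field as acceptable is an assumption, since the text above constrains the values but does not say whether both fields are mandatory.

```python
# Hypothetical sketch: validate the two status fields of one YAML entry.
VALID_STATUSES = frozenset({"running", "stopped", "terminated", "archived"})


def status_errors(entry: dict) -> list[str]:
    """Return one error string per status field holding a non-enum value."""
    errors = []
    for field in ("status", "last_visible_status"):
        value = entry.get(field)
        # Assumption: a missing field is not reported here.
        if value is not None and value not in VALID_STATUSES:
            errors.append(
                f"{field}={value!r} is not one of {sorted(VALID_STATUSES)}"
            )
    return errors
```

A free-form value such as `auto-registered` in last_visible_status would be reported, which is the case that last_visible_note exists to absorb.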
A. tmux dead, YAML says running → auto-terminate
YAML: status=running, pane.pid=201132, cmd=claude
tmux: no session
→ set status=terminated, terminated_at=<now>, termination_mode=auto-detected
→ comment: "lab-landing-page-creator-claude: tmux gone (was pane 201132, cmd claude). Marked terminated."
Skip-set: the auto-terminate only fires for sessions whose status is running.
Rows already in a deliberate end state — terminated, archived, or stopped
(set by tmux-agent-orchestrate-stop --capture-id/--reason/--graceful) — are
left untouched. This is critical: a stopped row keeps its resumable: true and
captured *_session_id_own, so the monitor must not overwrite it with
terminated ("auto-detected") when its tmux is (expectedly) gone.
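The skip-set rule reduces to a single guard before any mutation. The sketch below assumes an entry is a plain dict and that timestamps are ISO 8601 in UTC; neither detail is confirmed by this document, and `auto_terminate` is a hypothetical name.

```python
# Hypothetical sketch: class A auto-terminate with the skip-set guard.
from datetime import datetime, timezone
from typing import Optional


def auto_terminate(entry: dict, tmux_alive: bool,
                   now: Optional[datetime] = None) -> bool:
    """Mark a running entry terminated when its tmux session is gone.

    Returns True only if the entry was changed. Entries that are already
    stopped, terminated, or archived are never touched.
    """
    if tmux_alive or entry.get("status") != "running":
        return False
    now = now or datetime.now(timezone.utc)
    entry["status"] = "terminated"
    entry["terminated_at"] = now.isoformat()   # timestamp format is an assumption
    entry["termination_mode"] = "auto-detected"
    return True
```

Because the guard keys on `status == "running"` alone, a stopped row keeps its resumable flag and captured session id untouched.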
B. tmux alive, not in YAML → auto-register
tmux: session=lab-paper-pdf2md-creator-agy, pid=...,
cmd=agy, cwd=$WORKSPACE_ROOT/paper-pdf2md
YAML: no such session
→ register as new entry: status=running, last_visible_status=running, last_visible_note=auto-registered
→ comment: "lab-paper-pdf2md-creator-agy: tmux found but not in YAML. Auto-registered."
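Selecting which tmux sessions qualify for auto-registration is a name match plus a set difference. This sketch uses the two patterns given in the dispatch body (`*-creator-claude`, `*-creator-agy`); the helper names are hypothetical and the new entry shows only the three fields listed above.

```python
# Hypothetical sketch: pick unregistered creator sessions and build entries.
from fnmatch import fnmatchcase

CREATOR_PATTERNS = ("*-creator-claude", "*-creator-agy")


def sessions_to_register(tmux_names, yaml_names) -> list[str]:
    """Return tmux session names that match a creator pattern but are not in the YAML."""
    known = set(yaml_names)
    return sorted(
        name for name in tmux_names
        if name not in known
        and any(fnmatchcase(name, pattern) for pattern in CREATOR_PATTERNS)
    )


def new_entry() -> dict:
    """Minimal fields for an auto-registered session, per drift class B."""
    return {
        "status": "running",
        "last_visible_status": "running",
        "last_visible_note": "auto-registered",
    }
```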
C. New session id materializes (claude first message sent)
YAML: claude_session_id_own=null (placeholder)
disk: ~/.claude/projects/.../b3a7...c2f.jsonl exists, mtime=now,
first line sessionId=b3a7...c2f
→ update claude_session_id_own=b3a7...c2f
→ comment: "lab-landing-page-creator-claude: session id materialized b3a7...c2f"
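Reading the id from the first line of a transcript could be done as follows. The sketch assumes the first line is a JSON object carrying a string `sessionId` key, as the example above implies; the function names are hypothetical and any unreadable or malformed file is treated as "no id yet".

```python
# Hypothetical sketch: extract sessionId from the first line of a *.jsonl file.
import json
from pathlib import Path
from typing import Optional


def first_line_session_id(jsonl_path: Path) -> Optional[str]:
    """Return the sessionId on the first line, or None if it cannot be read."""
    try:
        with jsonl_path.open("r", encoding="utf-8") as handle:
            record = json.loads(handle.readline())
    except (OSError, json.JSONDecodeError):
        return None
    if not isinstance(record, dict):
        return None
    value = record.get("sessionId")
    return value if isinstance(value, str) and value else None


def newest_session_id(project_dir: Path) -> Optional[str]:
    """Scan *.jsonl files newest-first and return the first usable sessionId."""
    candidates = sorted(
        project_dir.glob("*.jsonl"),
        key=lambda path: path.stat().st_mtime,
        reverse=True,
    )
    for path in candidates:
        session_id = first_line_session_id(path)
        if session_id:
            return session_id
    return None
```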
D. Stale UUID (artifact gone)
YAML: agent_identities.claude.session_id=87dc548e-...
disk: ~/.claude/projects/.../87dc548e-...jsonl: missing
→ flag in comment, but DO NOT delete from YAML
(the user may have moved the file or the disk may be temporarily unavailable;
only `--purge-conversation` should remove the id)
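The comment-only behaviour for class D can be sketched as a check that returns text and never mutates anything. It assumes the artifact is named `<session_id>.jsonl` inside the project directory, as in the example above; `stale_uuid_comment` is a hypothetical helper.

```python
# Hypothetical sketch: class D produces a comment and leaves the YAML alone.
from pathlib import Path
from typing import Optional


def stale_uuid_comment(session_name: str, session_id: str,
                       project_dir: Path) -> Optional[str]:
    """Return a comment if the artifact for session_id is missing, else None."""
    artifact = project_dir / f"{session_id}.jsonl"
    if artifact.exists():
        return None
    # No deletion and no YAML write: the file may have moved or the disk
    # may be temporarily unavailable.
    return (
        f"{session_name}: artifact for {session_id} is missing at {artifact}; "
        "YAML left unchanged"
    )
```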
Pitfalls
- Don't run the monitor without --goal: without goal mode, a single turn will spawn, do one reconcile, and complete. Goal mode keeps the worker alive across many turns.
- The 30s poll is a default: workers may override if they detect heavy churn. A workspace with 5+ agent sessions should bump to 60s to avoid noise.
- kanban_comment rate limits: Kanban may throttle if you comment too fast. Coalesce: only comment when the diff is new (not the same drift on every poll). The script tracks a state file at .cache/tmux-agent-orchestrate-monitor/<workspace>.state in the workspace root for this (overridable via AGENT_SESSIONS_STATE_DIR).
- Don't fight the user's explicit action: if tmux-agent-orchestrate-stop is mid-flight and the monitor sees the same session in two states within 5s, prefer the user's most recent action. The monitor should not auto-revert a fresh terminated to running because of a stale tmux has-session check.
- The monitor should never modify the conversation artifacts (jsonl, db), only the YAML. If you see a stale UUID, comment about it but don't delete the file.
- TUI capture-pane is expensive: only capture when you need to update last_visible_status, not every poll.
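The coalescing idea (comment only on drifts not seen on the previous poll) can be sketched with a small state file. The on-disk format used here, a JSON list of SHA-256 digests, is an assumption for illustration and is not the format reconcile.sh actually writes; `drift_key` and `new_drifts` are hypothetical names.

```python
# Hypothetical sketch: suppress repeated comments for an unchanged drift.
import hashlib
import json
from pathlib import Path


def drift_key(drift: dict) -> str:
    """Stable fingerprint of one drift object."""
    canonical = json.dumps(drift, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def new_drifts(drifts: list[dict], state_file: Path) -> list[dict]:
    """Return drifts absent from the previous poll, then record the current set."""
    try:
        seen = set(json.loads(state_file.read_text(encoding="utf-8")))
    except (OSError, json.JSONDecodeError, TypeError):
        seen = set()  # first run or unreadable state: treat everything as new
    current = {drift_key(d): d for d in drifts}
    fresh = [d for key, d in current.items() if key not in seen]
    state_file.parent.mkdir(parents=True, exist_ok=True)
    state_file.write_text(json.dumps(sorted(current)), encoding="utf-8")
    return fresh
```

Only the current poll's fingerprints are stored, so a drift that clears and later reappears is reported again; keeping a longer history would be a different trade-off.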
Worker body template (for hermes kanban create --body)
The --body of the dispatched task IS the worker's behavior spec. Here's a tested template:
# agent-sessions monitor
## Loop (every 30s)
1. Read agent-sessions.yaml
2. Bash: `bash skills/tmux-agent-orchestrate-monitor/scripts/reconcile.sh --once --emit-diff`
3. Parse the JSON diff from stdout
4. If `drifts` is non-empty:
- For each drift, call `kanban_comment` with the diff message
5. Bash: `sleep 30`
6. Heartbeat every 5 min: `kanban_heartbeat(progress="alive, N drifts detected, last at <time>")`
## Stop condition
If `$HERMES_KANBAN_TASK` card has any comment containing "stop" or "stop monitoring" from a user:
- Call `kanban_block(reason="stop-requested by user at <timestamp>")`
## Drift responses
- A. tmux dead + YAML running: auto-terminate YAML, comment
- B. tmux alive not in YAML: auto-register, comment
- C. New session id from *.jsonl: update YAML, comment
- D. Stale UUID: comment only, no YAML change
## Hard rules
- Do NOT modify conversation artifacts (jsonl, db, brain/)
- Do NOT spawn/delete tmux sessions; that's the create/stop skills' job
- Do NOT call tmux-agent-orchestrate-create or tmux-agent-orchestrate-stop — only the user initiates those
- Do NOT call `git commit` / `git push`
Security: --subscribe on Public Brokers
When using --subscribe with the default PoC public broker
(broker.hivemq.com:1883), be aware that:
- On a public broker anyone can publish, so the monitor's wildcard subscription will receive whatever a third party sends to your job topics.
- Auto-kill on terminal events means a spoofed completed or error event from a third party can terminate your agent session.
- Mitigation: Use --subscribe only on private TLS-enabled brokers (production mode). For PoC, prefer the polling-based monitor (--once or no --subscribe), which reads YAML/tmux state directly without MQTT.
- HMAC verification: Events are now verified via verify_hmac() in mqtt_common.py (see FW-05). Ensure auth_token is set for each job to enable signature validation; unauthenticated events will be dropped.
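For readers unfamiliar with the mitigation, HMAC checking of an event in general looks like the sketch below. It does not reproduce verify_hmac() from mqtt_common.py, whose signature and message layout are not shown in this document; the SHA-256 choice, the hex encoding, and the `sign` / `is_authentic` names are assumptions.

```python
# Hypothetical sketch of HMAC event verification; not the mqtt_common.py code.
import hashlib
import hmac
from typing import Optional


def sign(payload: bytes, auth_token: str) -> str:
    """Hex HMAC-SHA256 of the raw payload, keyed by the job's auth token."""
    return hmac.new(auth_token.encode("utf-8"), payload, hashlib.sha256).hexdigest()


def is_authentic(payload: bytes, signature: str, auth_token: Optional[str]) -> bool:
    """Drop events with no token, no signature, or a mismatched signature."""
    if not auth_token or not signature:
        return False
    expected = sign(payload, auth_token)
    # Constant-time comparison on bytes avoids leaking match position.
    return hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8"))
```

With a check of this kind, a spoofed terminal event published by a third party fails verification unless the publisher also knows the job's token.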
Verification (one-shot)
# Run reconcile once and inspect output
bash skills/tmux-agent-orchestrate-monitor/scripts/reconcile.sh --emit-diff --once \
| python3 -m json.tool
Related skills
- kanban-worker: base lifecycle for the dispatched worker
- kanban-orchestrator: if you want to dispatch this monitor from an orchestrator, use this to know how to phrase the body