feat(tmux-agent-orchestrate-monitor): integrate watchdog pattern as skill

Moved /tmp/subscriber-watchdog.sh → skills/tmux-agent-orchestrate-monitor/scripts/watchdog.sh
(skill-managed lifecycle, no longer lives outside workspace).

Added lib.sh::start_watchdog() helper:
- Spawns watchdog as background nohup process
- Writes watchdog log to .hermes/jobs/<JID>.watchdog.log
- Returns watchdog PID via stdout

Wired create_session.sh --submit-job to auto-start watchdog after JOB registration.

Fixes:
- Bug: registry.py get first-line parse was fragile (empty status → infinite loop)
  → Now uses python3 json.load for robust parsing
- Bug: old path skills/delegate-job/scripts/job_subscriber.py hardcoded
  → Now uses skills/tmux-agent-orchestrate-delegate-job/scripts/job_subscriber.py

Verified on isolated server -L agy-watchdog-skill-test (kill-server after):
- Syntax check PASS
- E2E: register job → start watchdog → publish completed → watchdog exits
- Global skill non-interference verified
- Main isolated server -L multi-agent-canary untouched
This commit is contained in:
2026-06-19 23:33:46 +00:00
parent e9fc763d31
commit e8eebe5eb1
3 changed files with 89 additions and 0 deletions
+65
View File
@@ -0,0 +1,65 @@
#!/usr/bin/env bash
# watchdog.sh — tmux-agent-orchestrate-monitor 의 부속 스크립트
#
# Metadata for SKILL.md:
# description: "Watchdog helper that keeps subscriber alive and exits when JOB is done"
# usage: "watchdog.sh <job_id> <workdir> [--help]"
if [ "${1:-}" = "--help" ] || [ "${1:-}" = "-h" ] || [ $# -lt 2 ]; then
echo "Usage: $0 <job_id> <workdir>"
exit 0
fi
JOB_ID="$1"
WORKDIR="$2"
LOG_DIR="$WORKDIR/.hermes/jobs"
mkdir -p "$LOG_DIR"
log() {
echo "[$(date -u +'%Y-%m-%dT%H:%M:%SZ')] $*"
}
log "watchdog started for JOB=$JOB_ID workdir=$WORKDIR"
while true; do
# 1) Get current job status with robust Python parsing
STATUS=$(cd "$WORKDIR" && .venv/bin/python skills/tmux-agent-orchestrate-delegate-job/scripts/registry.py get --job "$JOB_ID" 2>/dev/null | python3 -c '
import sys, json
try:
data = json.load(sys.stdin)
print(data.get("status", "unknown"))
except Exception:
print("unknown")
' 2>/dev/null || echo "unknown")
log "JOB status: $STATUS"
# 2) Terminal check
case "$STATUS" in
completed|error|permission_required)
log "JOB reached terminal state ($STATUS), watchdog exiting"
exit 0
;;
esac
# 3) Start subscriber (2min hard limit)
LOG_FILE="$LOG_DIR/subscriber-${JOB_ID}-$(date +%s).log"
log "starting subscriber (2min hard limit, log: $LOG_FILE)"
(
cd "$WORKDIR" && timeout 120 .venv/bin/python skills/tmux-agent-orchestrate-delegate-job/scripts/job_subscriber.py \
--job "$JOB_ID" --timeout 120 --idle-timeout 999999 --registry-dir .hermes/jobs > "$LOG_FILE" 2>&1
echo "[$(date -u +'%Y-%m-%dT%H:%M:%SZ')] subscriber exited" >> "$LOG_FILE"
) &
SUB_PID=$!
log "subscriber PID=$SUB_PID"
# 4) Wait for subscriber to exit or timeout
wait $SUB_PID 2>/dev/null
EXIT_CODE=$?
log "subscriber exited code=$EXIT_CODE"
sleep 1
done