feat(skills): integrate delegate-job skill (squashed from delegate-job-skill)

- Copy delegate-job-skill/skills/delegate-job/ → skills/delegate-job/
- Move requirements.txt (paho-mqtt>=2.0.0) into the new location
- Refactor outdated hardcoded paths (~/PuKi/lab/, ~/.hermes/skills/) to dynamic resolution
- Add MQTT connection timeout / retry hardening
- Remove legacy delegate-job-skill/ directory
- Update .gitignore

Note: delegate-job-skill git history is squashed — preserved content, dropped commit lineage.
2026-06-19 14:00:29 +00:00
parent 8a3abff2d6
commit 97f649a3e1
14 changed files with 6055 additions and 2 deletions
@@ -0,0 +1,11 @@
# delegate-job skill
A Hermes skill that delegates a job to an autonomous agent
(claude-code/codex/opencode/human) and observes it asynchronously over an MQTT
event channel. **Start with [`SKILL.md`](./SKILL.md).**
- Protocol/schema: [`job-protocol.md`](./job-protocol.md)
- Broker PoC → production migration: [`mqtt-broker-setup.md`](./mqtt-broker-setup.md)
- Registry format/concurrency: [`registry.md`](./registry.md)
- Reference implementation: [`delegate-job`](./delegate-job) (bash wrapper), [`scripts/publish_event.py`](./scripts/publish_event.py), [`scripts/job_subscriber.py`](./scripts/job_subscriber.py), [`scripts/registry.py`](./scripts/registry.py), [`scripts/mqtt_common.py`](./scripts/mqtt_common.py)
- Persistent audit log: `.hermes/delegate_job_logs/<job_id>/` (`meta.json`·`events.ndjson`·`status.json`)
Query it with `delegate-job logs <id>` or `delegate-job logs --list` (see "Audit Logs" in SKILL.md)
@@ -0,0 +1,348 @@
---
name: delegate-job
description: "Delegate a unit of work to any autonomous agent (claude-code, codex, opencode, or a human) and observe it asynchronously over an MQTT event channel. Each job gets a unique id, a registry record (prompt, broker, status, timeouts), and a single per-job topic that carries started/permission_required/progress/completed/error events as schema-versioned JSON. The delegator starts a subscriber first, runs the agent, and treats a completed/error event or a timeout as the job's terminal state. Ships a working reference implementation (publish_event.py, job_subscriber.py, registry.py, mqtt_common.py, delegate-job wrapper) plus a PoC-to-production path: validate on a public broker, then move to an authenticated TLS broker by changing config only — no code change. Use when you need fire-and-observe delegation, multi-job fan-out across tmux sessions, or a uniform completion-signal protocol shared by several agent types."
version: 1.0.0
author: Hermes Agent
license: MIT
platforms: [linux, macos, windows]
metadata:
hermes:
tags: [agent-delegation, mqtt, jobs, orchestration, async-completion]
related_skills: [claude-code, codex, opencode, hermes-agent-skill-authoring]
---
# delegate-job — Async Job Delegation over MQTT
Delegate a unit of work to an autonomous agent, then **observe** it instead of
blocking on it. Every job gets a unique id and a registry record; the agent
publishes lifecycle events (`started`, `permission_required`, `progress`,
`completed`, `error`) to a per-job MQTT topic; the delegator subscribes and
treats `completed`/`error` — or a timeout — as the terminal state.
This skill is a **reference implementation**: copy the files in this directory
into your project and customise. The `communication_over_mqtt` project is the
canonical concrete instance.
## Overview
The model is deliberately small. A **job** is one delegated task. An **agent**
is a worker (a claude-code tmux session, a codex run, a human). The **registry**
(`.hermes/jobs/<id>.json`) holds everything about a job so nothing important
lives in environment variables — which means one tmux session can process many
jobs sequentially, and many sessions can fan out in parallel, with no env
collisions. The **event channel** is one MQTT topic per job carrying JSON
payloads; `event` discriminates the type.
Each responsibility has exactly one entry point:
[`publish_event.py`](./scripts/publish_event.py) emits events (registry lookup,
monotonic `seq`, retry+backoff) and [`job_subscriber.py`](./scripts/job_subscriber.py)
observes them (timeouts, terminal state machine, defensive parsing). Shared
logic lives in [`mqtt_common.py`](./scripts/mqtt_common.py); registry I/O in
[`registry.py`](./scripts/registry.py). The demo `publisher.py`/`subscriber.py`
in the host project stay frozen.
Two stages, same code. **PoC** runs on the public `broker.hivemq.com` to wire up
the protocol. **Production** moves to your own authenticated TLS broker — the
switch is **config only** (env vars + the registry `broker.*` block), never a
code change. See [`mqtt-broker-setup.md`](./mqtt-broker-setup.md).
## When to Use / When NOT to Use
**Use when:**
- you want **fire-and-observe** delegation — kick off work and get a completion
signal rather than blocking a terminal;
- several agent types (claude-code, codex, opencode, human) must follow **one**
completion protocol;
- you need **multi-job fan-out** across tmux sessions with safe job claiming;
- you want a clean PoC → authenticated-broker upgrade path.
**Do NOT use when:**
- a one-shot `claude -p '…'` that returns inline is enough (no async signal
needed) — just use the [claude-code](../claude-code/SKILL.md) skill directly;
- you need request/response RPC or large artifact transfer (this is a
one-direction event stream, not a data bus);
- the payload would carry secrets and you're still on the public broker — move
to the own-broker stage first.
## Quick Start
The one-line wrapper handles register + subscriber-first + agent launch. If
you're new, **start here** and only fall back to the manual 5-step flow when
you need finer control.
```bash
# 1) one line: register → start subscriber → launch agent in tmux
# (uses public broker by default; last stdout line is the audit-log dir)
delegate-job submit \
--agent claude-code \
--prompt "Create 10 sorting problems and save them to sort_problems.md" \
--workdir /path/to/project \
--agent-session tmux:demo \
--timeout 600 --idle-timeout 120
# → stdout: registered job: <JID>
# subscriber pid: …
# agent launched in tmux session: demo
# subscriber output: <one line per event>
# /path/to/project/.hermes/delegate_job_logs/<JID> ← audit log dir
# 2) at any time, query the job or its audit log
delegate-job status --job <JID>
delegate-job logs <JID> # pretty timeline
delegate-job logs --list # every job, live status
# 3) run a user-supplied validator against the job's artifacts
delegate-job verify --job <JID> --validate ./validate.sh
```
The wrapper enforces the **subscribe-before-publish** ordering and **forwards
the freshly-minted `JOB_ID` into the agent's prompt** (so the agent calls
`publish_event.py --job <JID>` with the right id — see Pitfall §"Wrong job_id
propagated to the agent"). When you need finer control, the manual flow is:
```bash
# Manual 5-step (same outcome, more knobs)
PY=.venv/bin/python
SKILL=./skills/delegate-job/scripts
# 1) register
JID=$($PY "$SKILL/registry.py" register \
--prompt "…" --agent claude-code --agent-session tmux:demo \
--timeout 600 --idle-timeout 120)
# 2) START THE SUBSCRIBER FIRST (MQTT does not queue non-retained msgs)
$PY "$SKILL/job_subscriber.py" --job "$JID" --timeout 600 --idle-timeout 120 &
# 3) pass JID to the agent and instruct it to publish events with --job "$JID"
# (don't hard-code a job id you saw earlier — see Pitfall §"Wrong job_id")
# 4) on completion the subscriber prints events and exits 0/1/2
# 5) inspect any time
$PY "$SKILL/registry.py" get --job "$JID"
$PY "$SKILL/registry.py" logs "$JID" # positional job id
$PY "$SKILL/registry.py" logs --list
```
## Job Protocol
One topic per job: `python/mqtt/jobs/<job_id>/events`. Payload (JSON, UTF-8,
`schema_version=1`):
```json
{ "schema_version": 1, "seq": 7, "job_id": "abc12345",
"event": "started|permission_required|progress|completed|error",
"timestamp": "2026-06-19T09:32:00Z", "detail": "generalised text",
"data": { "optional": "metadata" } }
```
- `seq` is monotonic per job (first = 1); the subscriber uses it to spot
reorder/duplication.
- `timestamp` is advisory — timeouts are measured from **receive** time.
- `detail`/`data` carry **no** secrets or absolute paths.
- A `schema_version` or `job_id` mismatch is **dropped** (defensive parsing).
`started` and `completed`/`error` are the mandatory bookends; `completed`→exit 0,
`error`→exit 1. Full catalogue + production `auth_token` handling:
[`job-protocol.md`](./job-protocol.md).
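
The drop rules above can be sketched as one parsing function. This is a minimal illustration, not the code in `job_subscriber.py`: the name `parse_event` and the logger name are hypothetical, and only the checks stated above (decodable JSON, `schema_version`, `job_id`) are implemented.

```python
import json
import logging
from typing import Optional

logger = logging.getLogger("delegate_job.sketch")  # hypothetical logger name

SCHEMA_VERSION = 1


def parse_event(raw: bytes, expected_job_id: str) -> Optional[dict]:
    """Decode one MQTT payload; return None when it has to be dropped."""
    try:
        event = json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError) as exc:
        logger.warning("dropping undecodable payload: %s", exc)
        return None
    if not isinstance(event, dict):
        logger.warning("dropping payload that is not a JSON object")
        return None
    if event.get("schema_version") != SCHEMA_VERSION:
        logger.warning("dropping event with schema_version=%r", event.get("schema_version"))
        return None
    if event.get("job_id") != expected_job_id:
        logger.warning("dropping event for foreign job_id=%r", event.get("job_id"))
        return None
    return event
```

Returning `None` instead of raising means a single malformed message on a shared broker cannot stop the subscriber loop.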
## Registry Format
```
.hermes/jobs/<id>.json # metadata record (single source of truth)
.hermes/jobs/<id>.events.log # append-only JSON-lines log (debug, optional)
.hermes/jobs/.lock # fcntl advisory lock for the registry
```
The record holds `status`, `prompt`, `agent`, `agent_session`, a `broker` block,
`topic_prefix`, `timeout_sec`/`idle_timeout_sec`, `expected_artifacts`,
`last_seq`, and (production) `auth_token`. Because the `broker` block lives in
the record, `publish_event.py` connects from the registry alone. Concurrency,
the atomic rename trick, and multi-session job claiming are in
[`registry.md`](./registry.md).
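
One common form of the atomic rename trick is sketched below, assuming a filesystem where renaming within a single directory is atomic. `write_record_atomically` is a hypothetical helper for illustration; the scheme actually used by `registry.py` (including its locking) is described in `registry.md` and may differ.

```python
import json
import os
import tempfile
from pathlib import Path


def write_record_atomically(registry_dir: Path, job_id: str, record: dict) -> Path:
    """Write <job_id>.json so readers never observe a half-written file."""
    registry_dir.mkdir(parents=True, exist_ok=True)
    target = registry_dir / f"{job_id}.json"
    # The temp file must live in the same directory as the target so that
    # os.replace() is a rename on one filesystem, not a copy.
    fd, tmp_name = tempfile.mkstemp(
        dir=str(registry_dir), prefix=f".{job_id}.", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(record, handle, ensure_ascii=False, indent=2)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_name, target)  # readers see the old or the new record
    except BaseException:
        # Leave no stray temp file behind if anything failed before the rename.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
    return target
```

A reader that opens `<id>.json` at any moment gets either the previous record or the complete new one, which is what lets several sessions poll the registry safely.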
## Audit Logs
Every job's lifecycle is mirrored to a **persistent, append-only audit log**
under `.hermes/delegate_job_logs/` (override with `DELEGATE_JOB_LOGS_DIR`;
default `<cwd>/.hermes/delegate_job_logs`). Unlike the registry — live state
mutated in place and liable to be cleaned up — the audit log is durable
history you can replay after the fact. It is git-ignored.
```
.hermes/delegate_job_logs/<job_id>/
meta.json # registration snapshot: prompt, agent, broker, timeouts, …
events.ndjson # append-only, one JSON event per line, in time order
status.json # current status only (fast point-query)
```
**What is logged, automatically:**
| When | `events.ndjson` line | Written by |
|------|----------------------|------------|
| job registered | `registered` (also seeds meta.json + status.json) | `registry.register_job` |
| any status change | `status_changed` (`from`/`to`; also rewrites status.json) | `update_job_status`, `pick_pending` |
| event published | `published` (carries the exact payload — reproducible) | `publish_event.py` |
| event received | `received` (subscriber's external view) | `job_subscriber.py` |
Both the emitter side (`published`) and the observer side (`received`) are
recorded, so a dropped publish or a missed receive is still visible from the
other. Every write is **best-effort and isolated** — an fcntl-locked append
guarded by `try/except` that only ever emits a `logger.warning`, so a logging
failure can never break a publish, a subscribe, or a registry write. stdout is
never touched.
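
The best-effort write described above might look like the following sketch. It assumes `fcntl.flock` as the locking call and a `kind` field naming the log line; both are illustrative choices, and `append_audit_event` is a hypothetical name rather than the function in `registry.py`.

```python
import json
import logging
from pathlib import Path

try:
    import fcntl  # POSIX only; the sketch degrades to an unlocked append elsewhere
except ImportError:
    fcntl = None

logger = logging.getLogger("delegate_job.audit_sketch")  # hypothetical name


def append_audit_event(logs_root: Path, job_id: str, entry: dict) -> bool:
    """Append one JSON line to events.ndjson; never raise to the caller."""
    try:
        job_dir = logs_root / job_id
        job_dir.mkdir(parents=True, exist_ok=True)
        line = json.dumps(entry, ensure_ascii=False) + "\n"
        with open(job_dir / "events.ndjson", "a", encoding="utf-8") as handle:
            if fcntl is not None:
                fcntl.flock(handle.fileno(), fcntl.LOCK_EX)
            try:
                handle.write(line)
                handle.flush()
            finally:
                if fcntl is not None:
                    fcntl.flock(handle.fileno(), fcntl.LOCK_UN)
        return True
    except Exception as exc:
        # Logging problems are reported, never propagated.
        logger.warning("audit log write failed for %s: %s", job_id, exc)
        return False
```

The boolean return is only a convenience for tests; callers on the publish or subscribe path can ignore it.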
**Reading them:**
```bash
delegate-job logs <job_id> # pretty-print one job's timeline
delegate-job logs --list # summarise every logged job (with live status)
# or directly via the registry CLI:
$PY scripts/registry.py logs <job_id> [--tail N] [--json]
$PY scripts/registry.py logs --list [--json]
```
`submit` prints the job's audit-log directory as its last stdout line, so a
caller can `tail -n1` to locate it.
## Broker Setup
| Stage | Broker | Auth | Transport |
|-------|--------|------|-----------|
| PoC | `broker.hivemq.com` | none | 1883 plaintext |
| Production | self-hosted Mosquitto/EMQX | user/pass + ACL | 8883 TLS |
All connection settings come from env (`MQTT_BROKER`, `MQTT_PORT`, `MQTT_TLS`,
`MQTT_USERNAME`/`MQTT_PASSWORD`, `MQTT_CA_CERTS`, …) resolved by
`broker_config_from_env()`, with the registry `broker.*` block overriding per
job. Moving to your own broker is **config only**: install Mosquitto, set
`persistence true` + `acl_file` + `password_file` + a TLS `listener 8883`, grant
the worker `write python/mqtt/jobs/+/events` and Hermes `read`, then flip
`MQTT_TLS=1` and fill the registry `broker.*`. Step-by-step (conf, ACL,
`mosquitto_passwd`, self-signed/private-CA certs, cut-over verification):
[`mqtt-broker-setup.md`](./mqtt-broker-setup.md).
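
To make the "env first, registry overrides per job" rule concrete, here is a small sketch. It is not the real `broker_config_from_env()`: the function name `resolve_broker`, the keys of the returned dict, the keys assumed inside the registry `broker` block, and the choice to default the port from the TLS flag are all assumptions for illustration.

```python
import os
from typing import Mapping, Optional


def resolve_broker(
    env: Mapping[str, str],
    record_broker: Optional[Mapping[str, object]] = None,
) -> dict:
    """Build connection settings from env, then apply per-job overrides."""
    tls = env.get("MQTT_TLS", "0").strip().lower() in {"1", "true", "yes"}
    default_port = "8883" if tls else "1883"
    config = {
        "host": env.get("MQTT_BROKER") or "broker.hivemq.com",
        "port": int(env.get("MQTT_PORT") or default_port),
        "tls": tls,
        "username": env.get("MQTT_USERNAME"),
        "password": env.get("MQTT_PASSWORD"),
        "ca_certs": env.get("MQTT_CA_CERTS"),
    }
    # Values present in the job record win over the environment.
    for key, value in (record_broker or {}).items():
        if value is not None:
            config[key] = value
    return config


if __name__ == "__main__":
    print(resolve_broker(os.environ))
```

With no variables set this resolves to the PoC row of the table; setting `MQTT_TLS=1` and `MQTT_BROKER` moves to the production row without touching the code.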
## Agent Adapters
Each agent voluntarily follows the contract: receive a `JOB_ID` (or registry
path), call `publish_event.py` at lifecycle points, exit 0/1/2. **The contract
in one line**: every event call uses `--job "$JOB_ID"` where `$JOB_ID` is the
**freshly-issued id from the registry record for *this* delegation** — never a
job_id you saw in an earlier session (Pitfall §"Wrong job_id propagated to the
agent").
- **claude-code** — Claude Code calls `publish_event.py` via its Bash tool at
lifecycle points. `submit` injects a prompt that already names
`$JOB_ID`; if you drive claude manually, hand it the id explicitly. Reference
instruction block (the wrapper injects something equivalent):
```text
Your job_id is "$JOB_ID" (read it from the registry record for this delegation —
do not reuse any job_id you saw before).
On start: $PY delegate-job/scripts/publish_event.py --job "$JOB_ID" --event started
On permission: $PY … --job "$JOB_ID" --event permission_required --detail "<tool>:<what>"
On progress: $PY … --job "$JOB_ID" --event progress --detail "<short status>"
On success: $PY … --job "$JOB_ID" --event completed --detail "<one-line summary>"
On failure: $PY … --job "$JOB_ID" --event error --detail "<one-line reason>"
Task: <the user's prompt>
The subscriber for "$JOB_ID" is already running; your completed/error event
ends the job. Exit codes: 0 completed, 1 error, 2 publish failure.
```
See [claude-code](../claude-code/SKILL.md) for tmux orchestration patterns.
- **codex** — same contract. Invoke `codex exec "<instruction-block-above>"` or
wire `publish_event.py` as an MCP tool so the agent can call it directly.
- **opencode** — wire `publish_event.py` as a tool/command the agent can call;
identical event points.
- **human** — a person does the work, reads the registry record, then runs
`publish_event.py --job <id> --event completed` (or `error`) by hand.
## User Interface
The [`delegate-job`](./delegate-job) bash wrapper bundles register +
subscribe-first + run-agent + validate:
```bash
delegate-job submit --agent claude-code \
--prompt "Create 10 sorting problems and save them to sort_problems.md" \
--workdir /path/to/project --timeout 600 [--validate ./validate.sh]
delegate-job status --job <id> # one record, pretty-printed
delegate-job list # all jobs, one line each
delegate-job verify --job <id> --validate ./validate.sh # runs it, reports exit code
delegate-job wait [--job <id>] # block until terminal (else --wait-any)
```
`submit` **always starts the subscriber before the agent** (the ordering
dependency), launches the agent in a detached tmux session (the one-shot
`--mode print` path was removed, so the wrapper is interactive-only), and
calls `--validate` afterward if given. The skill automates job-id generation,
registry creation, broker resolution, subscriber-first ordering, agent launch,
and completion detection; it does **not** automate the agent's internals or your
business-logic validation. Those are hooks you fill (`validate.sh` reads
`$JOB_ID`/`$REGISTRY_DIR`).
## Common Pitfalls
- **Publishing before subscribing** — MQTT does not queue non-retained messages
for absent subscribers. Start `job_subscriber.py` *before* the agent, or rely
on retained terminal events (production). `submit` enforces this.
- **Wrong job_id propagated to the agent** — the wrapper prints a fresh `JOB_ID`
on every `submit`. If your agent instruction (or the wrapper's prompt template)
hard-codes an old job_id, the agent calls `publish_event.py --job <wrong>`,
the subscriber's defensive parser drops it as a `job_id` mismatch, and the
delegator waits until idle timeout (exit 2). Fix: instruct the agent to
**read the job_id from the registry record for *this* delegation** (or pass it
in via env / `--prompt` interpolation), never from prior runs. `submit`'s
default prompt template interpolates `$JOB_ID` for you — if you build a custom
prompt, do the same.
- **tmux session name collision**: `submit` derives the session
name from `--agent-session tmux:<name>` (default `tmux:claude`). If a session
with that name already exists (e.g. you ran the demo and the previous
session is still open), `tmux new-session -d -s <name>` cannot create it, so
`submit` stops with an error before launching the agent. Pick a unique
`--agent-session` per concurrent delegation
(e.g. `tmux:demo`, `tmux:claude-a`, `tmux:claude-b`) or kill the stale one
(`tmux kill-session -t claude`) before re-running.
- **Timeout before `started`** — a cold-starting agent may not emit `started`
for a while; the wall-clock timeout starts at subscribe time so a stuck agent
still terminates. Don't set `--timeout` so low you false-positive a slow start.
- **No retry on publish** — a dropped `completed` would hang the delegator
forever; `publish_event.py` retries with exponential backoff and exits 2 if it
still fails, so the delegator is never left waiting silently.
- **QoS-1 duplicates / reorders** — a terminal event can arrive twice, or
`error` can trail `completed`; the subscriber's terminal state machine
finalises each job once and ignores the rest.
- **Trusting the public broker** — anyone can publish there; never make a real
decision on a PoC signal. Add `auth_token` + an authenticated broker first.
- **Secrets in `detail`/`data`** — keep payloads generalised; no paths, keys, or
tokens (except the production `auth_token` in `data`).
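
The retry behaviour mentioned under "No retry on publish" can be sketched as below. The attempt count, delays, and the `publish_once` callable are assumptions chosen for illustration; the real defaults live in `publish_event.py`.

```python
import time
from typing import Callable

EXIT_OK = 0
EXIT_PUBLISH_FAILURE = 2


def publish_with_retry(
    publish_once: Callable[[], bool],
    attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call publish_once until it succeeds; back off exponentially between tries."""
    for attempt in range(attempts):
        try:
            if publish_once():
                return EXIT_OK
        except OSError:
            pass  # a network-level failure counts as one failed attempt
        if attempt < attempts - 1:
            # 0.5 s, 1 s, 2 s, ... capped at max_delay
            sleep(min(max_delay, base_delay * (2 ** attempt)))
    return EXIT_PUBLISH_FAILURE
```

Passing `sleep` as a parameter keeps the function testable without real delays; a caller would hand the return value to `sys.exit`.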
## Verification Checklist
- [ ] `started` → `completed` over the public broker: subscriber prints the
lines and exits **0**.
- [ ] `error` path: subscriber exits **1**.
- [ ] timeout path: no terminal event within `--timeout`/`--idle-timeout` →
exit **2**.
- [ ] polluted payload (bad JSON, wrong `schema_version`, wrong `job_id`) is
dropped with a warning, not crashed on.
- [ ] one tmux session processes two registry jobs in sequence; a second
session with a different `agent_session` claims only its own.
- [ ] broker cut-over: same scripts reach an authenticated TLS broker with env
changes only; a credential without write ACL is rejected; a late
subscriber still receives the retained terminal event.
- [ ] `publisher.py`/`subscriber.py`/`README.md` demo on `python/mqtt/sample`
still works unchanged (regression).
- [ ] **audit log integrity** — for a completed job,
`.hermes/delegate_job_logs/<JID>/events.ndjson` contains `registered` →
`received started` → `published completed` (in that order), and
`status.json.status == "completed"` matches the registry record. A
logging failure (e.g. read-only log dir) does not break the publish or
subscribe path — only a `logger.warning` is emitted.
- [ ] **end-to-end demo smoke** — run
`delegate-job submit --agent claude-code --agent-session tmux:demo-smoke
--prompt "echo hello and call publish_event.py --job <JID>
--event completed" --timeout 120` and confirm
(a) registered job id echoed, (b) subscriber pid echoed, (c) tmux session
name printed, (d) `events.ndjson` grows as the agent runs, (e) final
stdout line is the audit-log dir.
File diff suppressed because it is too large
@@ -0,0 +1,272 @@
#!/usr/bin/env bash
# delegate-job — user-facing orchestrator for the delegate-job skill.
#
# Subcommands:
# submit register a job, start the subscriber FIRST, then run the agent,
# then (optionally) run a validation script.
# status show one job record.
# list list all jobs.
# verify run a user-supplied --validate script against a job's artifacts.
# wait block until all running/pending jobs reach a terminal state.
# logs show a job's persistent audit log, or --list every logged job.
#
# This is a reference wrapper: it shells out to the python scripts that live
# next to it. Copy it into your project and customise as needed. With --dry-run
# it prints the instructions it would send instead of launching an agent;
# without it, a missing `tmux` or agent binary is reported as an error.
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# Pick an interpreter: prefer a project .venv, else python3.
pick_python() {
local py_bin
if [[ -n "${DELEGATE_JOB_PYTHON:-}" ]]; then
py_bin="$DELEGATE_JOB_PYTHON"
elif [[ -x "${WORKDIR:-.}/.venv/bin/python" ]]; then
py_bin="${WORKDIR}/.venv/bin/python"
elif [[ -x ".venv/bin/python" ]]; then
py_bin="$(pwd)/.venv/bin/python"
else
py_bin="python3"
fi
if ! "$py_bin" -c "import paho.mqtt" 2>/dev/null; then
echo "ERROR: paho-mqtt package is missing for $py_bin." >&2
echo " Please create a virtual environment and install it:" >&2
echo " python3 -m venv .venv && .venv/bin/pip install -r \"$SCRIPT_DIR/requirements.txt\"" >&2
exit 1
fi
echo "$py_bin"
}
REGISTRY_DIR_DEFAULT=".hermes/jobs"
usage() {
cat <<'EOF'
delegate-job <command> [options]
submit --agent <name> --prompt <text> [--workdir <dir>] [--agent-session <label>]
[--timeout <sec>] [--idle-timeout <sec>] [--validate <script>]
[--registry-dir <dir>] [--dry-run]
# The skill is tmux-interactive only; --mode print was removed.
status --job <id> [--registry-dir <dir>]
list [--registry-dir <dir>]
verify --job <id> --validate <script> [--registry-dir <dir>]
wait [--job <id>] [--timeout <sec>] [--registry-dir <dir>]
logs <job_id> | --list # persistent audit log (delegate_job_logs/)
EOF
}
# ---- arg parsing helpers --------------------------------------------------
AGENT="claude-code"; PROMPT=""; WORKDIR="$(pwd)"; AGENT_SESSION="tmux:claude"
TIMEOUT=600; IDLE_TIMEOUT=120; VALIDATE=""; DRY_RUN=0
JOB_ID=""; REGISTRY_DIR="$REGISTRY_DIR_DEFAULT"
parse_opts() {
while [[ $# -gt 0 ]]; do
case "$1" in
--agent) AGENT="$2"; shift 2;;
--prompt) PROMPT="$2"; shift 2;;
--workdir) WORKDIR="$2"; shift 2;;
--agent-session) AGENT_SESSION="$2"; shift 2;;
--timeout) TIMEOUT="$2"; shift 2;;
--idle-timeout) IDLE_TIMEOUT="$2"; shift 2;;
--validate) VALIDATE="$2"; shift 2;;
--job) JOB_ID="$2"; shift 2;;
--registry-dir) REGISTRY_DIR="$2"; shift 2;;
--dry-run) DRY_RUN=1; shift;;
*) echo "unknown option: $1" >&2; usage; exit 1;;
esac
done
}
cmd_submit() {
parse_opts "$@"
[[ -n "$PROMPT" ]] || { echo "submit requires --prompt" >&2; exit 1; }
PY="$(pick_python)"
cd "$WORKDIR"
mkdir -p "$REGISTRY_DIR"
# 1) register job (prints the new job id)
JOB_ID="$("$PY" "$SCRIPT_DIR/scripts/registry.py" --registry-dir "$REGISTRY_DIR" register \
--prompt "$PROMPT" --agent "$AGENT" --agent-session "$AGENT_SESSION" \
--timeout "$TIMEOUT" --idle-timeout "$IDLE_TIMEOUT")"
echo "registered job: $JOB_ID"
# 2) START THE SUBSCRIBER FIRST (ordering dependency — MQTT does not queue
# non-retained messages for absent subscribers).
local logf="$REGISTRY_DIR/$JOB_ID.subscriber.out"
"$PY" "$SCRIPT_DIR/scripts/job_subscriber.py" --registry-dir "$REGISTRY_DIR" \
--job "$JOB_ID" --timeout "$TIMEOUT" --idle-timeout "$IDLE_TIMEOUT" \
>"$logf" 2>&1 &
local sub_pid=$!
echo "subscriber pid: $sub_pid (log: $logf)"
sleep 1 # give the subscriber time to CONNACK + SUBSCRIBE before the agent runs
# 3) run the agent (or print the command for dry-run / missing binary)
local pub="$PY $SCRIPT_DIR/scripts/publish_event.py --registry-dir $REGISTRY_DIR --job $JOB_ID"
# NOTE: the agent MUST use --job "$JOB_ID" (the one we just minted). Hard-coding
# an id from an earlier session is the #1 reason a delegated job sits idle and
# times out (see SKILL.md "Wrong job_id propagated to the agent"). We make the
# freshness explicit in the instruction header.
local instructions="Your job_id is \"$JOB_ID\" (the one just registered for THIS delegation — read it from the registry record, do NOT reuse any job_id you saw in earlier runs).
On start run: $pub --event started.
On permission/tool prompt run: $pub --event permission_required --detail '<tool>:<what>'.
On progress (optional): $pub --event progress --detail '<short status>'.
On success run: $pub --event completed --detail '<one-line summary>'.
On failure run: $pub --event error --detail '<one-line reason>'.
The subscriber for this job_id is already running; your completed/error event ends the job. Exit codes: 0 completed, 1 error, 2 publish failure.
Task: $PROMPT"
run_agent "$JOB_ID" "$instructions"
if [[ "$DRY_RUN" == "1" ]]; then
# In dry-run the agent is never launched, but the subscriber from step 2 is
# already running in the background. Stop it rather than leaving it to time
# out, and skip the wait, the subscriber log dump, and validation; the user
# only wants to see the instruction that would have run.
kill "$sub_pid" 2>/dev/null || true
local logs_root_dry="${DELEGATE_JOB_LOGS_DIR:-$WORKDIR/.hermes/delegate_job_logs}"
echo "$logs_root_dry/$JOB_ID"
return 0
fi
# 4) wait for the subscriber to see a terminal event (or time out). tmux
# launches detached, so run_agent returns before the agent has done anything.
wait "$sub_pid" || true
echo "subscriber output:"; cat "$logf" || true
# 5) optional validation hook, run only once the job has reached a terminal state
if [[ -n "$VALIDATE" ]]; then
echo "running validation: $VALIDATE"
if JOB_ID="$JOB_ID" REGISTRY_DIR="$REGISTRY_DIR" bash "$VALIDATE"; then
echo "validation: PASS"
else
local rc=$?
echo "validation: FAIL (exit $rc)"
fi
fi
# Last stdout line: the persistent audit-log dir for this job (see SKILL.md
# "Audit Logs"). Callers can scrape `tail -n1` to find it.
local logs_root="${DELEGATE_JOB_LOGS_DIR:-$WORKDIR/.hermes/delegate_job_logs}"
echo "$logs_root/$JOB_ID"
}
run_agent() {
local job_id="$1"; local instructions="$2"
# The skill is INTERACTIVE-ONLY. We never invoke `claude -p` or any other
# one-shot print mode, because:
# - claude -p exits the moment stdin is drained, so there's nothing to
# `tmux attach` to afterwards.
# - fire-and-forget via wrapper defeats the whole point of the audit log
# (you can't tell what happened if the agent crashes mid-turn).
# - the job registry already gives us an authoritative completion signal,
# so we don't need a wrapper-side exit code to know "done".
# The user attaches with `tmux attach -t <session>` and types follow-up
# prompts themselves. We pre-load the first prompt via stdin and `read`
# keeps the pane open after the agent exits so the user can review.
case "$AGENT" in
claude-code) bin="claude";;
codex) bin="codex";;
human) echo "[human agent] complete the task, then run publish_event.py --event completed"; return;;
*) bin="$AGENT";;
esac
if [[ "$DRY_RUN" == "1" ]]; then
echo "[dry-run] would launch agent '$AGENT' in a fresh tmux session with instructions:"
echo "----"; echo "$instructions"; echo "----"
return
fi
if ! command -v tmux >/dev/null 2>&1; then
echo "ERROR: this skill requires tmux (interactive agent sessions)." >&2
echo " Install with: brew install tmux (or your package manager)" >&2
return 1
fi
if ! command -v "$bin" >/dev/null 2>&1; then
echo "ERROR: agent binary '$bin' not found in PATH." >&2
return 1
fi
local sess="${AGENT_SESSION#tmux:}"
# Detect a stale session with the same name (e.g. the user is still attached
# from an earlier run, or a previous wrapper died without cleanup). tmux
# new-session on an existing name fails with only a terse "duplicate session"
# message; check first and fail with actionable guidance.
if tmux has-session -t "$sess" 2>/dev/null; then
local attached
attached=$(tmux list-clients -t "$sess" 2>/dev/null | wc -l | tr -d ' ')
echo "ERROR: tmux session '$sess' already exists (clients attached: $attached)." >&2
echo " Pick a unique --agent-session (e.g. tmux:demo, tmux:claude-a) or" >&2
echo " kill the stale one first: tmux kill-session -t $sess" >&2
return 1
fi
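# Stage the instructions in a temp file instead of interpolating them into the
# tmux command string: the text contains double quotes, and a user prompt may
# contain $, backticks, or quotes that the pane's shell would re-evaluate.
local instr_file
instr_file="$(mktemp "${TMPDIR:-/tmp}/delegate-job.XXXXXX")"
printf '%s' "$instructions" >"$instr_file"
tmux new-session -d -s "$sess" -c "$WORKDIR" \
"$bin --dangerously-skip-permissions <\"$instr_file\"; rm -f \"$instr_file\"; echo; echo '--- agent exited (job $job_id); press enter to close ---'; read"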
echo "agent launched in tmux session: $sess (attach with: tmux attach -t $sess)"
}
cmd_status() {
parse_opts "$@"
[[ -n "$JOB_ID" ]] || { echo "status requires --job" >&2; exit 1; }
PY="$(pick_python)"
"$PY" "$SCRIPT_DIR/scripts/registry.py" --registry-dir "$REGISTRY_DIR" get --job "$JOB_ID"
}
cmd_list() {
parse_opts "$@"
PY="$(pick_python)"
"$PY" "$SCRIPT_DIR/scripts/registry.py" --registry-dir "$REGISTRY_DIR" list
}
cmd_verify() {
parse_opts "$@"
[[ -n "$JOB_ID" ]] || { echo "verify requires --job" >&2; exit 1; }
[[ -n "$VALIDATE" ]] || { echo "verify requires --validate <script>" >&2; exit 1; }
echo "verifying job $JOB_ID with $VALIDATE"
if JOB_ID="$JOB_ID" REGISTRY_DIR="$REGISTRY_DIR" bash "$VALIDATE"; then
echo "verify: PASS (exit 0)"; exit 0
else
rc=$?; echo "verify: FAIL (exit $rc)"; exit "$rc"
fi
}
cmd_logs() {
# logs <job_id> | logs --list: delegates to registry.py's logs CLI, which
# reads the persistent audit log under $DELEGATE_JOB_LOGS_DIR (or
# <cwd>/.hermes/delegate_job_logs). Run from your project dir so the default resolves.
PY="$(pick_python)"
if [[ "${1:-}" == "--list" ]]; then
"$PY" "$SCRIPT_DIR/scripts/registry.py" logs --list
else
local jid="${1:-}"
[[ -n "$jid" ]] || { echo "logs requires <job_id> or --list" >&2; exit 1; }
"$PY" "$SCRIPT_DIR/scripts/registry.py" logs "$jid"
fi
}
cmd_wait() {
parse_opts "$@"
PY="$(pick_python)"
if [[ -n "$JOB_ID" ]]; then
"$PY" "$SCRIPT_DIR/scripts/job_subscriber.py" --registry-dir "$REGISTRY_DIR" \
--job "$JOB_ID" --timeout "$TIMEOUT"
else
"$PY" "$SCRIPT_DIR/scripts/job_subscriber.py" --registry-dir "$REGISTRY_DIR" \
--wait-any --timeout "$TIMEOUT"
fi
}
main() {
local sub="${1:-}"; shift || true
case "$sub" in
submit) cmd_submit "$@";;
status) cmd_status "$@";;
list) cmd_list "$@";;
verify) cmd_verify "$@";;
wait) cmd_wait "$@";;
logs) cmd_logs "$@";;
""|-h|--help|help) usage;;
*) echo "unknown command: $sub" >&2; usage; exit 1;;
esac
}
main "$@"
@@ -0,0 +1,118 @@
# Job Event Protocol
The wire contract every delegate-job agent (claude-code, codex, opencode,
human, …) speaks. One job → one MQTT topic → JSON event payloads. Stable across
the PoC (public broker) and production (own broker) stages; only transport
hardening changes, never the payload shape.
Reference implementation: [`./scripts/publish_event.py`](./scripts/publish_event.py)
(emit) and [`./scripts/job_subscriber.py`](./scripts/job_subscriber.py) (observe).
---
## 1. Topic design
| Topic | Purpose |
|-------|---------|
| `python/mqtt/sample` | Legacy demo topic — **never changed** (README compat). |
| `python/mqtt/jobs/<job_id>/events` | Per-job event stream (this protocol). |
- One topic per job, JSON payload, `event` field discriminates the type.
- Single-direction publish only (worker → observer). No request/response.
- Future split is reserved but not required:
`<job_id>/events`, `<job_id>/logs`, `<job_id>/artifacts`.
- `topic_prefix` is stored in the job record so publishers resolve the topic
from the registry alone (`<topic_prefix>/events`).
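
As a small illustration of resolving the topic from the record alone, the sketch below assumes `topic_prefix` holds `python/mqtt/jobs/<job_id>` (inferred from the table above) and uses a hypothetical helper name, `event_topic`. The wildcard check reflects the MQTT rule that `+` and `#` are not allowed in a topic you publish to.

```python
def event_topic(record: dict) -> str:
    """Derive the per-job publish topic from a registry record."""
    prefix = str(record["topic_prefix"]).rstrip("/")
    if not prefix:
        raise ValueError("topic_prefix is empty")
    if any(ch in prefix for ch in "+#"):
        # Wildcards are only valid in subscriptions, never in a publish topic.
        raise ValueError("topic_prefix must not contain MQTT wildcards")
    return f"{prefix}/events"


if __name__ == "__main__":
    print(event_topic({"topic_prefix": "python/mqtt/jobs/abc12345"}))
```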
---
## 2. Payload schema (JSON, UTF-8, `schema_version = 1`)
```json
{
"schema_version": 1,
"seq": 7,
"job_id": "abc12345",
"event": "started | permission_required | progress | completed | error",
"timestamp": "2026-06-19T09:32:00Z",
"detail": "generalised, whitelisted human-readable string",
"data": { "optional": "metadata" }
}
```
| Field | Rule |
|-------|------|
| `schema_version` | If publisher/subscriber disagree, the subscriber **drops** the event with a warning (defensive parsing). |
| `seq` | Monotonic **per `job_id`**, first publish = 1. Lets the subscriber detect reorder/duplication. Persisted in the registry (`last_seq`) so it survives restarts. |
| `job_id` | Subscriber drops any event whose `job_id` it did not subscribe for. |
| `timestamp` | Publisher host clock, **advisory only**. The delegator's timeout is measured from *receive* time, not this field. |
| `detail` | Generalised text only. **No absolute paths, keys, or tokens.** |
| `data` | Optional metadata. Production may add `auth_token`, `build_id`, etc. |
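
A publisher-side sketch of the schema and the `seq` rule follows. The record keys `job_id` and `last_seq` match the fields named in this document, but `build_event` itself is a hypothetical helper, and persisting the updated record is left to the caller.

```python
import json
from datetime import datetime, timezone
from typing import Optional

SCHEMA_VERSION = 1
EVENTS = ("started", "permission_required", "progress", "completed", "error")


def build_event(
    record: dict, event: str, detail: str = "", data: Optional[dict] = None
) -> bytes:
    """Return the UTF-8 JSON payload for the next event of this job."""
    if event not in EVENTS:
        raise ValueError(f"unknown event type: {event!r}")
    # seq is monotonic per job and starts at 1; the caller saves the record
    # afterwards so the counter survives a restart.
    record["last_seq"] = int(record.get("last_seq", 0)) + 1
    payload = {
        "schema_version": SCHEMA_VERSION,
        "seq": record["last_seq"],
        "job_id": record["job_id"],
        "event": event,
        # Advisory only: the subscriber times out on receive time instead.
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "detail": detail,
    }
    if data:
        payload["data"] = data
    return json.dumps(payload, ensure_ascii=False).encode("utf-8")
```

Keeping the counter in the record rather than in process memory is what allows two separate `publish_event.py` invocations for the same job to produce `seq` 1 and 2.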
---
## 3. Event catalogue
| event | When emitted | `detail` example | seq |
|-------|--------------|------------------|-----|
| `started` | Agent first picks up the job | `"Job a1b2c3d4 started"` | 1 |
| `permission_required` | Agent needs a tool/permission grant | `"needs to write sort_problems.md"` | as it happens |
| `progress` | Optional intermediate checkpoint | `"creating problem 5/10"` | as it happens |
| `completed` | Successful terminal state | `"saved to sort_problems.md"` | last |
| `error` | Failure / exception terminal state | `"internal error, see logs"` | last |
`started` and `completed`/`error` are mandatory bookends; `permission_required`
and `progress` are optional. `detail` must stay on the whitelist of generalised
phrasings — never leak secrets through it.
### Terminal semantics
- `completed` → subscriber exits 0; `error` → exits 1.
- The subscriber runs a **terminal state machine**: it finalises a job on the
first `completed`/`error` it sees and ignores any later terminal event for
that job (QoS-1 duplicate, or an `error`-after-`completed` reorder). When all
watched jobs are finalised it exits.
- Wall-clock timeout *or* idle timeout before a terminal event → exit 2.
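The terminal rules above can be sketched as a pure function over an already-received event stream. `fold_terminal` is a hypothetical name; it treats the end of the stream as the timeout case and leaves out the actual clock handling.

```python
TERMINAL = ("completed", "error")


def fold_terminal(events, watched):
    """Fold (job_id, event) pairs into an exit code: 0, 1, or 2."""
    terminal = {}
    for job_id, event in events:
        if job_id not in watched or event not in TERMINAL:
            continue
        # First terminal event wins; later ones are duplicates or reorders.
        terminal.setdefault(job_id, event)
    if set(terminal) != set(watched):
        return 2  # stream ended (timeout) before every job finished
    return 1 if "error" in terminal.values() else 0
```

Because only the first terminal event per job is recorded, an `error` that arrives after `completed` cannot flip a finished job.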
---
## 4. Production hardening (own broker stage)
The payload shape is unchanged; the transport and trust model tighten. See
[`mqtt-broker-setup.md`](./mqtt-broker-setup.md) for the broker side.
- **Auth / ACL** — username/password + per-topic ACL. `jobs/+/events` publish is
granted to the worker credential, subscribe to the Hermes credential.
- **`auth_token` (the bonus field)** — each job record carries a per-job
`auth_token` (`secrets.token_urlsafe(32)`). The publisher copies it into
**`data.auth_token`**; the subscriber compares it against the registry's
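expected token and **drops mismatches**. This is an integrity check on top of
the broker ACL. On a shared/public broker it only filters stray or accidental
publishers, because anyone subscribed to the topic can read the token from the
payload and replay it.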
```json
{ "...": "...", "data": { "auth_token": "9f3c…", "build_id": "42" } }
```
- **TLS** — port 8883 + private CA. Toggled with `MQTT_TLS=1` (+ `MQTT_CA_CERTS`);
no code change.
- **Retained terminal events** — `completed`/`error` publish with `retain=True`
so a subscriber that joins late immediately receives the last terminal state
instead of a stale view. The reference publisher auto-retains terminal events;
`--retained` forces it for any event.
- **Dual timeouts** — total wall-clock budget + last-activity idle detection,
both measured from receive time.
- **Clock trust** — never trust the payload `timestamp` for timeout decisions.
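The `auth_token` check in the list above could look roughly like this. `new_auth_token` and `token_ok` are hypothetical names, and the constant-time comparison via `hmac.compare_digest` is an assumption of this sketch; the reference subscriber compares with a plain `!=`.

```python
import hmac
import secrets


def new_auth_token():
    """Per-job token, generated when the job is registered."""
    return secrets.token_urlsafe(32)


def token_ok(payload, expected):
    """True if the event may be accepted for a job expecting `expected`."""
    if expected is None:  # PoC: no token configured, nothing to check
        return True
    got = (payload.get("data") or {}).get("auth_token")
    if not isinstance(got, str):
        return False
    # Compare as bytes so non-ASCII input cannot raise TypeError.
    return hmac.compare_digest(got.encode("utf-8"), expected.encode("utf-8"))
```

A job whose record has `auth_token: null` accepts every event, which matches the PoC behaviour of skipping the check.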
---
## 5. Why a public broker is PoC-only
On `broker.hivemq.com` anyone can publish or subscribe to the same topic. Therefore:
- No secret data in payloads.
- `started`/`completed`/`error` are *signals*, never a basis for a security
decision.
- Non-retained messages are **not queued** for absent subscribers — start the
subscriber **before** the agent (ordering dependency), or rely on retained
terminal events in production.
- Real operational decisions belong to the own-broker stage with auth + ACL.
+176
@@ -0,0 +1,176 @@
# MQTT Broker Setup — PoC → Production
The delegate-job scripts read **all** broker settings from environment
variables (or a job record's `broker.*` block) through a single helper,
`broker_config_from_env()` in
[`./scripts/mqtt_common.py`](./scripts/mqtt_common.py). The design goal:
**switch from the public PoC broker to your own broker with config only — no
code change.**
| Env var | Meaning | PoC default | Production |
|---------|---------|-------------|-----------|
| `MQTT_BROKER` | host | `broker.hivemq.com` | internal hostname/IP |
| `MQTT_PORT` | port | `1883` | `8883` (TLS) |
| `MQTT_TLS` | TLS on/off (`1`/`0`) | `0` | `1` |
| `MQTT_USERNAME` / `MQTT_PASSWORD` | auth | (none) | broker-issued |
| `MQTT_CA_CERTS` | CA bundle path | (none) | private CA path |
| `MQTT_CERTFILE` / `MQTT_KEYFILE` | client cert (optional mTLS) | (none) | per-client |
| `MQTT_CLIENT_ID_PREFIX` | client id prefix | `hermes` | per-environment |
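A simplified sketch of how these variables resolve, assuming the precedence used by `broker_config_from_env`: environment values first, then any non-null key from the job record's `broker` block on top. `resolve_broker` is a hypothetical stand-in that takes a plain dict instead of reading `os.environ`, and it covers only five of the settings.

```python
def resolve_broker(env, job_broker=None):
    """Resolve broker settings from an env mapping plus an optional job block."""
    cfg = {
        "host": env.get("MQTT_BROKER", "broker.hivemq.com"),
        "port": int(env.get("MQTT_PORT", "1883")),
        "tls": env.get("MQTT_TLS", "0").strip().lower() in ("1", "true", "yes", "on"),
        "username": env.get("MQTT_USERNAME") or None,
        "password": env.get("MQTT_PASSWORD") or None,
    }
    # Non-null keys in the job record's broker block win over the environment.
    for key, value in (job_broker or {}).items():
        if value is not None and key in cfg:
            cfg[key] = value
    return cfg
```

With an empty environment the result is the PoC broker; exporting the production variables changes the target without touching code.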
---
## 1. PoC: public broker (`broker.hivemq.com`)
**Pros** — zero setup, reachable from anywhere, perfect for wiring up the
publish/subscribe loop and the timeout/state-machine logic.
**Cons / accepted assumptions** — no auth, no integrity, shared with the world:
- no secrets in payloads;
- `started`/`completed`/`error` are advisory signals only;
- non-retained messages are **not queued** for absent subscribers, so the
subscriber must start before the agent;
- a re-subscribing client cannot recover past (non-retained) events.
Use it only to validate the protocol, never for real decisions.
---
## 2. Production: self-hosted Mosquitto (or EMQX)
Both support MQTT 5 + ACL + TLS. Mosquitto shown below; EMQX is a drop-in for
the same env vars.
### 2.1 Install
```bash
# macOS
brew install mosquitto
# Debian/Ubuntu
sudo apt-get update && sudo apt-get install -y mosquitto mosquitto-clients
# Docker
docker run -d --name mosquitto -p 8883:8883 \
  -v "$PWD/mosquitto.conf:/mosquitto/config/mosquitto.conf" \
  -v "$PWD/certs:/mosquitto/certs" \
  -v "$PWD/auth:/mosquitto/auth" \
  eclipse-mosquitto:2
```
### 2.2 `mosquitto.conf` (key lines)
```conf
persistence true
persistence_location /mosquitto/data/
password_file /mosquitto/auth/passwd
acl_file /mosquitto/auth/acl
allow_anonymous false
listener 8883
cafile /mosquitto/certs/ca.crt
certfile /mosquitto/certs/server.crt
keyfile /mosquitto/certs/server.key
```
`persistence true` + QoS 1 + retained terminal events means a subscriber that
joins after a job finished still sees the final `completed`/`error`.
### 2.3 Users (username/password)
```bash
# create the file with the first user (-c), then add more users without -c
mosquitto_passwd -c /mosquitto/auth/passwd hermes          # subscriber/delegator
mosquitto_passwd    /mosquitto/auth/passwd claude-worker   # publisher/agent
# (-c truncates an existing file, so use it only for the first user)
```
### 2.4 ACL — least privilege
The worker only **publishes** events; Hermes only **subscribes**:
```conf
# /mosquitto/auth/acl
# claude-worker: may publish job events, may not read others' streams
user claude-worker
topic write python/mqtt/jobs/+/events
# hermes: observes every job's events
user hermes
topic read python/mqtt/jobs/+/events
# keep the legacy demo topic usable for both, if desired
pattern readwrite python/mqtt/sample
```
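To see which topics the `+` in these ACL entries covers, here is a small self-contained matcher for MQTT topic filters. `topic_matches` is a hypothetical helper written for illustration; it ignores the special handling of `$`-prefixed topics and is not the broker's own implementation.

```python
def topic_matches(topic_filter, topic):
    """Return True if `topic` matches an MQTT filter using + and # wildcards."""
    f_parts = topic_filter.split("/")
    t_parts = topic.split("/")
    for i, part in enumerate(f_parts):
        if part == "#":
            return True  # multi-level wildcard matches the remainder
        if i >= len(t_parts):
            return False
        if part != "+" and part != t_parts[i]:
            return False
    # Without '#', filter and topic must have the same number of levels.
    return len(f_parts) == len(t_parts)
```

`+` matches exactly one level, so `python/mqtt/jobs/+/events` covers every job's event topic but nothing nested deeper.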
### 2.5 TLS certificates
**Quick self-signed (single host, internal only):**
```bash
mkdir -p certs && cd certs
openssl req -x509 -newkey rsa:2048 -nodes -days 825 \
-keyout server.key -out server.crt \
-subj "/CN=mqtt.internal"
cp server.crt ca.crt # clients trust this as the CA bundle
```
**Private CA (recommended — separate CA from server cert):**
```bash
# 1) CA
openssl genrsa -out ca.key 4096
openssl req -x509 -new -nodes -key ca.key -days 3650 -out ca.crt -subj "/CN=Hermes-CA"
# 2) server cert signed by the CA
openssl genrsa -out server.key 2048
openssl req -new -key server.key -out server.csr -subj "/CN=mqtt.internal"
openssl x509 -req -in server.csr -CA ca.crt -CAkey ca.key -CAcreateserial \
-out server.crt -days 825
```
Clients trust `ca.crt` via `MQTT_CA_CERTS=/path/to/ca.crt`.
---
## 3. Cut-over verification (config-only, no code change)
Goal: prove the **same scripts** talk to your broker by changing only env/registry.
```bash
# 1) point the env at the new broker
export MQTT_BROKER=mqtt.internal
export MQTT_PORT=8883
export MQTT_TLS=1
export MQTT_CA_CERTS=$PWD/certs/ca.crt
export MQTT_USERNAME=hermes
export MQTT_PASSWORD='…'                      # subscriber side (hermes password)
# (publisher side uses claude-worker creds via the job record's broker block)
# 2) sanity-check with the mosquitto CLI first
mosquitto_sub -h "$MQTT_BROKER" -p 8883 --cafile "$MQTT_CA_CERTS" \
-u hermes -P "$MQTT_PASSWORD" -t 'python/mqtt/jobs/+/events' -v &
# 3) run the unchanged delegate-job loop
PY=.venv/bin/python
JID=$($PY scripts/registry.py register --prompt "broker cutover smoke")
$PY scripts/job_subscriber.py --job "$JID" --timeout 30 &
sleep 3
$PY scripts/publish_event.py --job "$JID" --event started
$PY scripts/publish_event.py --job "$JID" --event completed # auto-retained
```
Expected:
- subscriber prints the `started` and `completed` lines and exits 0;
- `mosquitto_sub` shows the same events (ACL allows `hermes` to read);
- publishing as a credential **without** write ACL is rejected by the broker;
- a subscriber started *after* `completed` still receives it (retained).
If all four hold, the migration is config-only. Persist the broker block into
each job record so `publish_event.py` connects from the registry alone:
```json
"broker": { "host": "mqtt.internal", "port": 8883, "tls": true,
"username": "claude-worker", "password": "…" }
```
+183
@@ -0,0 +1,183 @@
# Job Registry
The registry is the **single source of truth** for delegated work. Job metadata
(id, prompt, broker, status, timeouts) lives in files, **not** environment
variables — so one tmux session can handle many jobs sequentially or in
parallel without collisions, and `publish_event.py` / `job_subscriber.py` can
reconstruct everything they need from the registry alone.
Reference implementation: [`./scripts/registry.py`](./scripts/registry.py)
(library + CLI) over the primitives in
[`./scripts/mqtt_common.py`](./scripts/mqtt_common.py).
---
## 1. Directory layout
```
.hermes/jobs/
  <job_id>.json          # job metadata record (schema below)
  <job_id>.events.log    # append-only JSON-lines event log (debug, optional)
  .lock                  # shared advisory lock (fcntl) for the whole registry
```
`registry_dir` defaults to `.hermes/jobs` and is overridable everywhere via
`--registry-dir`.
---
## 2. Job record schema
```json
{
  "schema_version": 1,
  "job_id": "abc12345",
  "status": "pending | running | completed | error | cancelled",
  "created_at": "2026-06-19T09:30:00Z",
  "updated_at": "2026-06-19T09:32:00Z",
  "prompt": "Create 10 sorting problems and save them to sort_problems.md…",
  "agent": "claude-code",
  "agent_session": "tmux:claude",
  "broker": {
    "host": "broker.hivemq.com",
    "port": 1883,
    "tls": false,
    "username": null,
    "password": null
  },
  "topic_prefix": "python/mqtt/jobs/abc12345",
  "timeout_sec": 600,
  "idle_timeout_sec": 120,
  "expected_artifacts": ["sort_problems.md"],
  "last_seq": 0,
  "auth_token": null
}
```
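- `broker` lets `publish_event.py` connect from the record alone. Its non-null
values take precedence over the `MQTT_*` environment variables, which still
supply anything the block omits (for example `MQTT_CA_CERTS`).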
- `topic_prefix` → the events topic is `<topic_prefix>/events`.
- `last_seq` backs the monotonic `seq` counter so it survives process restarts.
- `expected_artifacts` is the hook a user `validate.sh` checks (existence/content).
- `auth_token` is `null` in PoC; production sets `secrets.token_urlsafe(32)`.
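A sketch of building a fresh record in this shape. `new_job_record` is a hypothetical helper, not the reference `register_job`; the 8-hex-character id is an assumption based on the example ids in this document.

```python
import secrets
import time
import uuid


def new_job_record(prompt, agent_session="tmux:claude", production=False):
    """Build a pending job record shaped like the schema above."""
    # Assumption: ids are 8 hex characters, like the examples in this document.
    job_id = uuid.uuid4().hex[:8]
    now = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
    return {
        "schema_version": 1,
        "job_id": job_id,
        "status": "pending",
        "created_at": now,
        "updated_at": now,
        "prompt": prompt,
        "agent": "claude-code",
        "agent_session": agent_session,
        "broker": {"host": "broker.hivemq.com", "port": 1883, "tls": False,
                   "username": None, "password": None},
        "topic_prefix": f"python/mqtt/jobs/{job_id}",
        "timeout_sec": 600,
        "idle_timeout_sec": 120,
        "expected_artifacts": [],
        "last_seq": 0,  # the first published event gets seq 1
        # PoC leaves the token unset; production issues one per job.
        "auth_token": secrets.token_urlsafe(32) if production else None,
    }
```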
---
## 3. Concurrency rules
### PoC — fcntl advisory lock
Every read-modify-write (`register_job`, `pick_pending`, `update_status`,
`next_seq`) runs inside `registry_lock(registry_dir)`, an exclusive
`fcntl.flock` over `.lock`. Single-host, good enough for many tmux sessions on
one machine.
### Production — SQLite WAL
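When delegation spans **multiple hosts**, the file lock no longer serialises
across machines. Migrate the same operations to a SQLite database in WAL mode
(`PRAGMA journal_mode=WAL`) with a transaction per claim. The function
signatures stay identical; only the storage backend changes. Note that WAL mode
needs every process that opens the database to run on the same host (it does
not work over a network filesystem), so remote workers must reach the registry
through a single process on the database host rather than by mounting the file.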
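A minimal sketch of what a claim could look like on the SQLite backend. The `jobs` table layout and the `open_registry` helper are assumptions made for illustration; only `PRAGMA journal_mode=WAL` and the one-transaction-per-claim idea come from the text.

```python
import sqlite3


def open_registry(path):
    """Open (and if needed create) a hypothetical SQLite job registry."""
    # isolation_level=None: we issue BEGIN/COMMIT ourselves.
    conn = sqlite3.connect(path, isolation_level=None, timeout=5.0)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        " job_id TEXT PRIMARY KEY, agent_session TEXT NOT NULL,"
        " status TEXT NOT NULL, created_at TEXT NOT NULL)"
    )
    return conn


def pick_pending(conn, agent_session):
    """Claim the oldest pending job for this session, or return None."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        row = conn.execute(
            "SELECT job_id FROM jobs"
            " WHERE status = 'pending' AND agent_session = ?"
            " ORDER BY created_at, job_id LIMIT 1",
            (agent_session,),
        ).fetchone()
        if row is not None:
            conn.execute("UPDATE jobs SET status = 'running' WHERE job_id = ?", row)
        conn.execute("COMMIT")
    except BaseException:
        conn.execute("ROLLBACK")
        raise
    return row[0] if row else None
```

`BEGIN IMMEDIATE` acquires the write lock up front, so the select and the update form a single claim that no other connection can interleave with.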
---
## 4. How multiple sessions take only their own work
Each tmux session carries an `agent_session` label (`tmux:claude`,
`tmux:claude-a`, `tmux:claude-b`, …). `pick_pending(agent_session)`:
1. acquires the registry lock,
2. scans for the **oldest** record with `status == "pending"` **and**
matching `agent_session`,
3. flips it to `running` and writes it back **atomically**,
4. releases the lock and returns the `job_id` (or `None`).
Because the scan + flip happen under one lock, two sessions can never claim the
same job. Sessions with distinct labels naturally partition the work; sessions
sharing a label compete safely — first to acquire the lock wins, the other sees
the job already `running` and moves on.
```bash
# session A only ever runs its own pending jobs
PY=.venv/bin/python
$PY scripts/registry.py pick --agent-session tmux:claude-a # prints id or exits 3
```
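The four claim steps above can be sketched on the file-based registry as follows. This is a simplified illustration, not the reference `pick_pending`: it takes the registry directory as its first argument, assumes records carry `created_at`, and skips the `fsync` and `updated_at` refresh for brevity.

```python
import fcntl
import json
import os
from pathlib import Path


def pick_pending(registry_dir, agent_session):
    """Claim the oldest pending job for `agent_session`; return its id or None."""
    root = Path(registry_dir)
    root.mkdir(parents=True, exist_ok=True)
    with open(root / ".lock", "a+") as lock:
        fcntl.flock(lock.fileno(), fcntl.LOCK_EX)  # released when the file closes
        candidates = []
        for path in root.glob("*.json"):
            record = json.loads(path.read_text(encoding="utf-8"))
            if (record.get("status") == "pending"
                    and record.get("agent_session") == agent_session):
                candidates.append((record["created_at"], path, record))
        if not candidates:
            return None
        # Oldest first; the file name breaks ties deterministically.
        _, path, record = min(candidates, key=lambda c: (c[0], c[1].name))
        record["status"] = "running"
        tmp = path.with_name(f".{path.stem}.tmp")
        tmp.write_text(json.dumps(record, indent=2), encoding="utf-8")
        os.replace(tmp, path)  # atomic swap on POSIX
        return record["job_id"]
```

Scanning and flipping inside one `flock` is what prevents two sessions with the same label from claiming the same record.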
---
## 5. Atomic status updates
All writes use a temp-file + `os.replace` rename, which is atomic on POSIX:
1. take the registry lock,
2. load the current record,
3. mutate fields + refresh `updated_at` (and `last_seq` for `next_seq`),
4. write to `.<job_id>.<rand>.tmp` in the **same directory**, `fsync`,
5. `os.replace(tmp, <job_id>.json)`,
6. release the lock.
A reader therefore always sees either the old or the new complete record, never
a half-written file. This is the file-based equivalent of the rename trick
(`pending.<session>` → `running.<session>`) and maps cleanly onto a single
SQLite transaction when you migrate.
---
## 6. CLI quick reference
```bash
PY=.venv/bin/python
$PY scripts/registry.py register --prompt "…" --agent claude-code \
--agent-session tmux:claude --timeout 600 --idle-timeout 120 # → prints job_id
$PY scripts/registry.py list # human table
$PY scripts/registry.py list --json # full records
$PY scripts/registry.py get --job <id> # one record
$PY scripts/registry.py status --job <id> --set completed # set status
$PY scripts/registry.py pick --agent-session tmux:claude # claim → running
```
Exit codes: `0` ok, `1` not found / bad status, `3` (`pick`) no pending job for
that session.
---
## 7. Persistent audit log
Separate from the registry, every job is also mirrored to a durable append-only
audit log at `.hermes/delegate_job_logs/<job_id>/` (override with
`DELEGATE_JOB_LOGS_DIR`, default `<cwd>/.hermes/delegate_job_logs`). The registry
is **live state** mutated in place; the audit log is **history** that survives
even after the registry dir is cleaned up. It is git-ignored.
```
.hermes/delegate_job_logs/<job_id>/
  meta.json        # registration snapshot (the full job record at register time)
  events.ndjson    # append-only, one JSON event per line, time-ordered
  status.json      # current status only (fast point-query)
```
`events.ndjson` lines are written automatically at four points:
| Trigger | line `event` | Source |
|---------|-------------|--------|
| `register_job` | `registered` | `registry.register_job` → `mqtt_common.init_job_log` |
| status change (`update_status`, `pick`, publish status sync) | `status_changed` (`from`/`to`) | `mqtt_common.update_job_status` / `pick_pending` |
| event published | `published` (embeds the exact payload) | `publish_event.py` |
| event received | `received` | `job_subscriber.py` |
Helpers live in [`./scripts/mqtt_common.py`](./scripts/mqtt_common.py):
`LOGS_DIR`, `job_log_path`, `init_job_log`, `append_event` (fcntl-locked,
concurrent-append safe), `update_logged_status`, and the readers
`read_logged_meta` / `read_logged_status` / `iter_logged_events` /
`list_logged_jobs`. Every writer is **best-effort and isolated** — wrapped in
`try/except` with a `logger.warning`, so an audit-log failure never breaks the
registry write, the publish, or the subscribe it shadows.
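The best-effort, lock-protected append described above might be sketched like this. `append_event_sketch` is a hypothetical stand-in with its own signature (it takes the log root explicitly); the reference `append_event` is not reproduced here.

```python
import fcntl
import json
import logging
import os
from pathlib import Path

logger = logging.getLogger("audit_sketch")


def append_event_sketch(logs_dir, job_id, event):
    """Append one JSON line to <logs_dir>/<job_id>/events.ndjson; never raise."""
    try:
        job_dir = Path(logs_dir) / job_id
        job_dir.mkdir(parents=True, exist_ok=True)
        line = json.dumps(event, ensure_ascii=False)
        with open(job_dir / "events.ndjson", "a", encoding="utf-8") as fh:
            fcntl.flock(fh.fileno(), fcntl.LOCK_EX)  # serialise concurrent appenders
            fh.write(line + "\n")
            fh.flush()
            os.fsync(fh.fileno())
        return True
    except Exception as exc:
        # Best-effort: report and carry on so the caller's real work continues.
        logger.warning("audit log write failed for %s: %s", job_id, exc)
        return False
```

Returning a boolean instead of raising lets the publisher or subscriber ignore audit failures while still making them visible in the logs.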
Read them via the CLI:
```bash
PY=.venv/bin/python
$PY scripts/registry.py logs <job_id> # pretty timeline
$PY scripts/registry.py logs <job_id> --tail 20 # last 20 events
$PY scripts/registry.py logs <job_id> --json # raw JSON lines
$PY scripts/registry.py logs --list # every job, live status
```
+1
@@ -0,0 +1 @@
paho-mqtt>=2.0.0
+233
@@ -0,0 +1,233 @@
#!/usr/bin/env python3
"""job_subscriber.py — the single entry point for observing Job events.
Subscribes to one job's ``<topic_prefix>/events`` (or, with ``--wait-any``, the
events of every running/pending job in the registry), prints one line to stdout
per accepted event, and exits on a terminal event or a timeout.
Design points (all flagged in the PLAN review):
- terminal state machine: ``completed``/``error`` is acted on exactly once per
job, so QoS-1 duplicates or an ``error``-after-``completed`` reorder are safe.
- dual timeouts: a wall-clock ``--timeout`` (total budget, started at
subscribe time so a cold start can't hang forever) AND an idle
``--idle-timeout`` (no new event for N seconds).
- defensive parsing: undecodable payloads, ``schema_version`` mismatches, and
``job_id`` values we did not subscribe for are logged and dropped.
stdout = event lines only. Diagnostics go to stderr via logging.
Exit codes:
  0  all watched jobs reached ``completed``
  1  any watched job reached ``error``
  2  timed out (wall-clock or idle) before all jobs finished, or there was
     no job to watch (unknown ``--job`` id, empty ``--wait-any`` set)
"""
from __future__ import annotations
import argparse
import json
import logging
import queue
import sys
import time
from typing import Any, Dict, List, Optional, Set, Tuple
import mqtt_common
import registry
from mqtt_common import (
DEFAULT_REGISTRY_DIR,
SCHEMA_VERSION,
broker_config_from_job,
load_job,
make_client,
)
logger = logging.getLogger("delegate_job.job_subscriber")
TERMINAL_EVENTS = ("completed", "error")
def _format_line(topic: str, payload: Dict[str, Any]) -> str:
    return (
        f"{payload.get('timestamp','-')} "
        f"job={payload.get('job_id','?')} "
        f"seq={payload.get('seq','?')} "
        f"{payload.get('event','?'):<20} "
        f"{payload.get('detail','')}"
    )
class _Watcher:
    """Holds the shared queue + the set of job_ids we accept events for."""

    def __init__(self, expected_job_ids: Set[str], expected_tokens: Dict[str, Optional[str]]):
        self.events: "queue.Queue[Tuple[str, Dict[str, Any]]]" = queue.Queue()
        self.expected = set(expected_job_ids)
        self.tokens = expected_tokens  # job_id -> expected auth_token (or None)

    def on_message(self, _client, _userdata, msg) -> None:
        # --- defensive parsing -------------------------------------------
        try:
            payload = json.loads(msg.payload.decode("utf-8"))
        except (UnicodeDecodeError, json.JSONDecodeError) as exc:
            logger.warning("drop unparseable payload on %s: %s", msg.topic, exc)
            return
        if not isinstance(payload, dict):
            logger.warning("drop non-object payload on %s", msg.topic)
            return
        if payload.get("schema_version") != SCHEMA_VERSION:
            logger.warning("drop event with schema_version=%r (expected %d)",
                           payload.get("schema_version"), SCHEMA_VERSION)
            return
        jid = payload.get("job_id")
        if jid not in self.expected:
            logger.warning("drop event for unexpected job_id=%r on %s", jid, msg.topic)
            return
        # --- production auth check: data.auth_token must match if expected ---
        expected_token = self.tokens.get(jid)
        if expected_token is not None:
            got = (payload.get("data") or {}).get("auth_token")
            if got != expected_token:
                logger.warning("drop event for job %s: auth_token mismatch", jid)
                return
        # Persistent audit log from the *subscriber's* vantage point: every event
        # that survives defensive parsing is recorded here, including ones a
        # different host published. This is the external-observer record that
        # backstops the publisher's own "published" line if it never wrote one.
        mqtt_common.append_event(jid, {
            "event": "received",
            "source_event": payload.get("event"),
            "seq": payload.get("seq"),
            "topic": msg.topic,
            "timestamp": payload.get("timestamp"),
            "detail": payload.get("detail", ""),
        })
        self.events.put((msg.topic, payload))
def _collect_jobs(args) -> List[Dict[str, Any]]:
    """Resolve the list of job records this invocation should watch."""
    if args.wait_any:
        jobs = [r for r in registry.list_jobs(args.registry_dir)
                if r.get("status") in ("pending", "running")]
        if not jobs:
            logger.error("no pending/running jobs to wait for")
        return jobs
    job = load_job(args.job, args.registry_dir)  # raises FileNotFoundError
    return [job]
def main(argv=None) -> int:
    parser = argparse.ArgumentParser(description="Subscribe to Job events on MQTT")
    target = parser.add_mutually_exclusive_group(required=True)
    target.add_argument("--job", help="job id to watch")
    target.add_argument("--wait-any", action="store_true",
                        help="watch every pending/running job in the registry")
    parser.add_argument("--timeout", type=float, default=None,
                        help="wall-clock budget in seconds (default: job.timeout_sec or 600)")
    parser.add_argument("--idle-timeout", type=float, default=None,
                        help="max seconds with no new event (default: job.idle_timeout_sec or 120)")
    parser.add_argument("--expect-retention", action="store_true",
                        help="warn if no retained terminal event arrives promptly")
    parser.add_argument("--registry-dir", default=DEFAULT_REGISTRY_DIR)
    parser.add_argument("-v", "--verbose", action="store_true")
    args = parser.parse_args(argv)

    mqtt_common.setup_logging(logging.DEBUG if args.verbose else logging.WARNING)

    try:
        jobs = _collect_jobs(args)
    except FileNotFoundError as exc:
        logger.error("%s", exc)
        return 2
    if not jobs:
        return 2

    expected_ids: Set[str] = {j["job_id"] for j in jobs}
    tokens = {j["job_id"]: j.get("auth_token") for j in jobs}
    watcher = _Watcher(expected_ids, tokens)

    # Resolve timeouts from CLI, falling back to the (first) job's settings.
    base_job = jobs[0]
    wall_timeout = args.timeout if args.timeout is not None else float(base_job.get("timeout_sec", 600))
    idle_timeout = args.idle_timeout if args.idle_timeout is not None else float(base_job.get("idle_timeout_sec", 120))

    # All watched jobs share a broker in practice; connect using the first
    # job's broker and subscribe to each job's events topic.
    config = broker_config_from_job(base_job)
    client = make_client("subscriber", config)
    client.on_message = watcher.on_message

    subscribed_topics = []
    for job in jobs:
        prefix = job.get("topic_prefix") or mqtt_common.topic_prefix_for(job["job_id"])
        subscribed_topics.append(f"{prefix}/events")

    def on_connect(_c, _u, _flags, reason_code, _props):
        if mqtt_common.reason_code_value(reason_code) != 0:
            logger.error("broker connection failed: rc=%s", reason_code)
            return
        for topic in subscribed_topics:
            _c.subscribe(topic, qos=1)
            logger.info("subscribed to %s", topic)

    client.on_connect = on_connect
    client.connect(config.host, config.port, config.keepalive)
    client.loop_start()

    terminal: Dict[str, str] = {}  # job_id -> "completed"/"error"
    pending: Set[str] = set(expected_ids)
    start = time.monotonic()
    wall_deadline = start + wall_timeout
    last_event = start
    retention_checked = not args.expect_retention

    try:
        while pending:
            now = time.monotonic()
            if now >= wall_deadline:
                logger.error("wall-clock timeout (%.0fs); still pending: %s",
                             wall_timeout, ", ".join(sorted(pending)))
                return 2
            idle_left = idle_timeout - (now - last_event)
            if idle_left <= 0:
                logger.error("idle timeout (%.0fs, no events); still pending: %s",
                             idle_timeout, ", ".join(sorted(pending)))
                return 2
            wait = min(wall_deadline - now, idle_left, 1.0)
            try:
                topic, payload = watcher.events.get(timeout=wait)
            except queue.Empty:
                if not retention_checked and (now - start) > 3.0:
                    logger.warning("--expect-retention set but no retained "
                                   "terminal event observed yet")
                    retention_checked = True
                continue

            last_event = time.monotonic()
            retention_checked = True
            print(_format_line(topic, payload), flush=True)

            jid = payload["job_id"]
            event = payload.get("event")
            if event in TERMINAL_EVENTS:
                if jid in terminal:
                    # Already finalised: ignore duplicates / late reorders.
                    logger.info("ignoring duplicate terminal %s for %s", event, jid)
                    continue
                terminal[jid] = event
                pending.discard(jid)
    finally:
        client.loop_stop()
        try:
            client.disconnect()
        except Exception:  # pragma: no cover
            pass

    # All jobs reached a terminal state. error wins over completed.
    if any(state == "error" for state in terminal.values()):
        return 1
    return 0


if __name__ == "__main__":
    sys.exit(main())
+546
@@ -0,0 +1,546 @@
"""Shared MQTT + registry helpers for the delegate-job skill.
Single entry point for:
- broker configuration (env -> dataclass),
- paho client construction (auth + TLS + unique client id),
- monotonic per-job sequence counters,
- retry-with-exponential-backoff,
- atomic registry record load/update under an fcntl lock.
Requires paho-mqtt >= 2.0 (uses CallbackAPIVersion.VERSION2).
This module is the *only* place that talks to the broker config and to the
raw job record file, so PoC -> production migration touches just env/registry
values, never code (see references/mqtt-broker-setup.md).
"""
from __future__ import annotations
import functools
import json
import logging
import os
import tempfile
import time
import uuid
from contextlib import contextmanager
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Any, Callable, Dict, Iterable, List, Optional
import paho.mqtt.client as mqtt
logger = logging.getLogger("delegate_job.mqtt_common")
# --------------------------------------------------------------------------
# Constants
# --------------------------------------------------------------------------
SCHEMA_VERSION = 1
DEFAULT_REGISTRY_DIR = ".hermes/jobs"
DEFAULT_TOPIC_ROOT = "python/mqtt/jobs"
LOCK_FILENAME = ".lock"
# Persistent audit-log layout: .hermes/delegate_job_logs/<job_id>/{meta,events,status}.
# This is a *separate* artifact from the registry: the registry is the live job
# record (mutated in place), the audit log is an append-only history that
# survives even if the registry dir is cleaned up.
META_FILENAME = "meta.json"
EVENTS_FILENAME = "events.ndjson"
STATUS_FILENAME = "status.json"
def _default_logs_dir() -> str:
    """Audit-log root. Overridable with ``DELEGATE_JOB_LOGS_DIR``; otherwise
    ``<cwd>/.hermes/delegate_job_logs``. We keep audit logs next to the
    live registry (``.hermes/jobs/``) so the two runtime artifacts sit
    under the same parent dir and follow the same ``.gitignore`` rule.
    The cwd of whichever process emits events (the bash wrapper and
    scripts) is used as the anchor."""
    env = os.environ.get("DELEGATE_JOB_LOGS_DIR")
    if env and env.strip():
        return env
    return os.path.join(os.getcwd(), ".hermes", "delegate_job_logs")


LOGS_DIR = _default_logs_dir()
# --------------------------------------------------------------------------
# Broker configuration
# --------------------------------------------------------------------------
@dataclass
class BrokerConfig:
    """Resolved broker connection settings.

    PoC defaults target the public HiveMQ broker. Production overrides arrive
    either from environment variables or from a job record's ``broker.*`` block
    (see ``broker_config_from_job``).
    """

    host: str = "broker.hivemq.com"
    port: int = 1883
    tls: bool = False
    username: Optional[str] = None
    password: Optional[str] = None
    client_id_prefix: str = "hermes"
    # TLS material (only consulted when tls is True).
    ca_certs: Optional[str] = None
    certfile: Optional[str] = None
    keyfile: Optional[str] = None
    keepalive: int = 60

    def to_dict(self) -> Dict[str, Any]:
        return asdict(self)

    def to_registry_block(self) -> Dict[str, Any]:
        """The subset that gets persisted into a job record's broker block."""
        return {
            "host": self.host,
            "port": self.port,
            "tls": self.tls,
            "username": self.username,
            "password": self.password,
        }
def _env_bool(name: str, default: bool = False) -> bool:
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in ("1", "true", "yes", "on")


def _env_int(name: str, default: int) -> int:
    raw = os.environ.get(name)
    if raw is None or raw.strip() == "":
        return default
    try:
        return int(raw)
    except ValueError:
        logger.warning("invalid int for %s=%r; using default %d", name, raw, default)
        return default
def broker_config_from_env(overrides: Optional[Dict[str, Any]] = None) -> BrokerConfig:
    """Build a :class:`BrokerConfig` from environment variables.

    Recognised vars (all optional, PoC defaults shown):
        MQTT_BROKER (broker.hivemq.com), MQTT_PORT (1883), MQTT_TLS (0),
        MQTT_USERNAME, MQTT_PASSWORD, MQTT_CLIENT_ID_PREFIX (hermes),
        MQTT_CA_CERTS, MQTT_CERTFILE, MQTT_KEYFILE, MQTT_KEEPALIVE (60).

    ``overrides`` (e.g. a job record's broker block) wins over the env values
    for any key it specifies with a non-None value.
    """
    cfg = BrokerConfig(
        host=os.environ.get("MQTT_BROKER", "broker.hivemq.com"),
        port=_env_int("MQTT_PORT", 1883),
        tls=_env_bool("MQTT_TLS", False),
        username=os.environ.get("MQTT_USERNAME") or None,
        password=os.environ.get("MQTT_PASSWORD") or None,
        client_id_prefix=os.environ.get("MQTT_CLIENT_ID_PREFIX", "hermes"),
        ca_certs=os.environ.get("MQTT_CA_CERTS") or None,
        certfile=os.environ.get("MQTT_CERTFILE") or None,
        keyfile=os.environ.get("MQTT_KEYFILE") or None,
        keepalive=_env_int("MQTT_KEEPALIVE", 60),
    )
    if overrides:
        for key, value in overrides.items():
            if value is not None and hasattr(cfg, key):
                setattr(cfg, key, value)
    return cfg
def broker_config_from_job(job: Dict[str, Any]) -> BrokerConfig:
    """Resolve broker config for a job: env defaults, then the job's broker.*
    block overrides. This lets ``publish_event.py`` connect from the registry
    alone. Environment values still apply to any key the block leaves unset
    (None) and to settings the block never carries, such as MQTT_CA_CERTS; a
    non-None value in the block (including ``tls``) wins over the environment."""
    return broker_config_from_env(overrides=job.get("broker") or {})
def make_client(role: str, config: Optional[BrokerConfig] = None) -> mqtt.Client:
    """Return a configured paho ``Client`` (not yet connected).

    The client id is ``f"{prefix}-{role}-{uuid8}"`` so concurrent publishers /
    subscribers never collide on the broker. Auth and TLS are applied when the
    config supplies them.
    """
    config = config or broker_config_from_env()
    client_id = f"{config.client_id_prefix}-{role}-{uuid.uuid4().hex[:8]}"
    client = mqtt.Client(
        callback_api_version=mqtt.CallbackAPIVersion.VERSION2,
        client_id=client_id,
    )
    if config.username:
        client.username_pw_set(config.username, config.password)
    if config.tls:
        # If ca_certs is None paho uses the system trust store (good enough for
        # public CAs); a private CA bundle path is passed through unchanged.
        client.tls_set(
            ca_certs=config.ca_certs,
            certfile=config.certfile,
            keyfile=config.keyfile,
        )
    logger.debug("built client id=%s tls=%s host=%s", client_id, config.tls, config.host)
    return client


def reason_code_value(rc: Any) -> int:
    """Normalise a paho v2 connect reason code to an int.

    paho-mqtt 2.x hands callbacks a ``ReasonCode`` object (not an int); older
    paths may pass a plain int. ``ReasonCode`` exposes ``.value``; 0 == success.
    """
    return int(getattr(rc, "value", rc))


def topic_prefix_for(job_id: str, root: str = DEFAULT_TOPIC_ROOT) -> str:
    return f"{root}/{job_id}"


def events_topic_for(job_id: str, root: str = DEFAULT_TOPIC_ROOT) -> str:
    return f"{topic_prefix_for(job_id, root)}/events"
# --------------------------------------------------------------------------
# Registry primitives (single source of truth for raw record I/O)
# --------------------------------------------------------------------------
def _job_path(job_id: str, registry_dir: str) -> Path:
    return Path(registry_dir) / f"{job_id}.json"


def _lock_path(registry_dir: str) -> Path:
    return Path(registry_dir) / LOCK_FILENAME


@contextmanager
def registry_lock(registry_dir: str):
    """Advisory exclusive lock over the whole registry dir via fcntl.

    PoC-grade single-host concurrency control. Multiple tmux sessions / scripts
    serialise their read-modify-write of job records through this lock so two
    sessions never claim the same pending job. For multi-host delegation move
    to SQLite WAL (see references/registry.md)."""
    import fcntl  # POSIX only; imported lazily so import works on Windows.

    Path(registry_dir).mkdir(parents=True, exist_ok=True)
    lock_file = _lock_path(registry_dir)
    fh = open(lock_file, "a+")
    try:
        fcntl.flock(fh.fileno(), fcntl.LOCK_EX)
        yield
    finally:
        try:
            fcntl.flock(fh.fileno(), fcntl.LOCK_UN)
        finally:
            fh.close()
def load_job(job_id: str, registry_dir: str = DEFAULT_REGISTRY_DIR) -> Dict[str, Any]:
    """Load and parse a job record. Raises FileNotFoundError if absent."""
    path = _job_path(job_id, registry_dir)
    if not path.exists():
        raise FileNotFoundError(f"job record not found: {path}")
    with open(path, "r", encoding="utf-8") as fh:
        return json.load(fh)


def _atomic_write_record(job_id: str, registry_dir: str, record: Dict[str, Any]) -> None:
    """Write a record atomically: temp file in the same dir + os.replace.

    The rename is atomic on POSIX, so readers never observe a half-written
    file. Callers MUST already hold ``registry_lock`` for read-modify-write
    correctness."""
    Path(registry_dir).mkdir(parents=True, exist_ok=True)
    path = _job_path(job_id, registry_dir)
    fd, tmp = tempfile.mkstemp(dir=str(path.parent), prefix=f".{job_id}.", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(record, fh, ensure_ascii=False, indent=2)
            fh.write("\n")
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp, path)
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
def update_job_status(job_id: str, registry_dir: str = DEFAULT_REGISTRY_DIR, **fields: Any) -> Dict[str, Any]:
    """Atomically merge ``fields`` into a job record under the registry lock.

    Always refreshes ``updated_at``. Returns the new record. Raises
    FileNotFoundError if the job does not exist.

    This is the single chokepoint for status writes (both ``registry.update_status``
    and ``publish_event.py``'s status sync route through here), so it also mirrors
    any ``status`` change into the persistent audit log. The mirroring is
    best-effort and runs after the registry lock is released, so a slow/failed
    log write never blocks the record."""
    with registry_lock(registry_dir):
        record = load_job(job_id, registry_dir)
        old_status = record.get("status")
        record.update(fields)
        record["updated_at"] = _utcnow()
        _atomic_write_record(job_id, registry_dir, record)
    # Audit-log mirroring happens outside the lock.
    if "status" in fields:
        new_status = record.get("status")
        update_logged_status(job_id, new_status, updated_at=record["updated_at"])
        if old_status != new_status:
            append_event(job_id, {
                "event": "status_changed",
                "from": old_status,
                "to": new_status,
                "timestamp": record["updated_at"],
            })
    return record
def next_seq(job_id: str, registry_dir: str = DEFAULT_REGISTRY_DIR) -> int:
    """Return the next monotonic sequence number for a job, persisted in the
    record's ``last_seq`` field so it stays consistent across process restarts.
    First call returns 1."""
    with registry_lock(registry_dir):
        record = load_job(job_id, registry_dir)
        seq = int(record.get("last_seq", 0)) + 1
        record["last_seq"] = seq
        record["updated_at"] = _utcnow()
        _atomic_write_record(job_id, registry_dir, record)
        return seq


def _utcnow() -> str:
    """ISO-8601 UTC timestamp with trailing Z (payload `timestamp` field)."""
    return time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())


def _utcnow_precise() -> str:
    """ISO-8601 UTC timestamp with millisecond resolution. Used for the audit
    log's ``logged_at`` so events sort cleanly even within the same second."""
    now = time.time()
    base = time.strftime("%Y-%m-%dT%H:%M:%S", time.gmtime(now))
    return f"{base}.{int((now % 1) * 1000):03d}Z"


# --------------------------------------------------------------------------
# Persistent audit log (.hermes/delegate_job_logs/<job_id>/...)
#
# Every function here is idempotent, concurrency-safe, and *best-effort*: a
# logging failure is swallowed with a logger.warning and never propagated, so it
# can never break a publish, a subscribe, or a registry write. stdout is never
# touched (it is reserved for data output).
# --------------------------------------------------------------------------
def job_log_dir(job_id: str, logs_dir: Optional[str] = None) -> Path:
    return Path(logs_dir or LOGS_DIR) / job_id


def job_log_path(job_id: str, kind: str, logs_dir: Optional[str] = None) -> Path:
    """Path to one audit-log file for a job. ``kind`` is a filename, e.g. the
    module constants META_FILENAME / EVENTS_FILENAME / STATUS_FILENAME."""
    return job_log_dir(job_id, logs_dir) / kind


@contextmanager
def _file_lock(fh):
    """Best-effort exclusive lock over a single open file via fcntl, so two
    processes appending to events.ndjson never interleave a line. A no-op where
    fcntl is unavailable (Windows); a short append is atomic enough there."""
    try:
        import fcntl
    except ImportError:  # pragma: no cover - non-POSIX
        yield
        return
    fcntl.flock(fh.fileno(), fcntl.LOCK_EX)
    try:
        yield
    finally:
        fcntl.flock(fh.fileno(), fcntl.LOCK_UN)


def append_event(job_id: str, event_dict: Dict[str, Any], logs_dir: Optional[str] = None) -> None:
    """Append one event as a JSON line to ``<logs>/<job_id>/events.ndjson``.

    Concurrency-safe (fcntl lock over the file) and best-effort. A millisecond
    ``logged_at`` is stamped when the caller did not supply one."""
    try:
        path = job_log_path(job_id, EVENTS_FILENAME, logs_dir)
        path.parent.mkdir(parents=True, exist_ok=True)
        record = dict(event_dict)
        record.setdefault("logged_at", _utcnow_precise())
        line = json.dumps(record, ensure_ascii=False) + "\n"
        with open(path, "a", encoding="utf-8") as fh:
            with _file_lock(fh):
                fh.write(line)
                fh.flush()
    except Exception as exc:  # pragma: no cover - best effort
        logger.warning("append_event failed for job %s: %s", job_id, exc)


def update_logged_status(job_id: str, status: str, logs_dir: Optional[str] = None, **extras: Any) -> None:
    """Rewrite ``<logs>/<job_id>/status.json`` (current status for fast point
    queries) atomically. Best-effort; merges any ``extras``."""
    try:
        path = job_log_path(job_id, STATUS_FILENAME, logs_dir)
        path.parent.mkdir(parents=True, exist_ok=True)
        record: Dict[str, Any] = {"job_id": job_id, "status": status, "updated_at": _utcnow()}
        record.update(extras)
        # Per-process temp name: two processes updating the same job must not
        # write into (or rename away) each other's temp file.
        tmp = path.with_name(f"{path.name}.{os.getpid()}.tmp")
        with open(tmp, "w", encoding="utf-8") as fh:
            json.dump(record, fh, ensure_ascii=False, indent=2)
            fh.write("\n")
        os.replace(tmp, path)
    except Exception as exc:  # pragma: no cover - best effort
        logger.warning("update_logged_status failed for job %s: %s", job_id, exc)


def init_job_log(job_id: str, meta: Dict[str, Any], logs_dir: Optional[str] = None) -> None:
    """Seed the per-job audit-log dir: write meta.json, status.json, and a first
    ``registered`` line in events.ndjson. Idempotent (the ``registered`` line is
    written only when events.ndjson does not yet exist) and best-effort."""
    try:
        d = job_log_dir(job_id, logs_dir)
        d.mkdir(parents=True, exist_ok=True)
        with open(d / META_FILENAME, "w", encoding="utf-8") as fh:
            json.dump(meta, fh, ensure_ascii=False, indent=2)
            fh.write("\n")
        status = meta.get("status", "pending")
        update_logged_status(
            job_id, status, logs_dir=logs_dir,
            created_at=meta.get("created_at"), prompt=meta.get("prompt"),
        )
        events_path = d / EVENTS_FILENAME
        first_time = not events_path.exists()
        events_path.touch(exist_ok=True)
        if first_time:
            append_event(job_id, {
                "event": "registered",
                "status": status,
                "agent": meta.get("agent"),
                "agent_session": meta.get("agent_session"),
                "topic_prefix": meta.get("topic_prefix"),
                "timestamp": meta.get("created_at"),
            }, logs_dir=logs_dir)
    except Exception as exc:  # pragma: no cover - best effort
        logger.warning("init_job_log failed for job %s: %s", job_id, exc)


def read_logged_meta(job_id: str, logs_dir: Optional[str] = None) -> Optional[Dict[str, Any]]:
    """Return a job's audit meta.json (registration snapshot), or None."""
    try:
        with open(job_log_path(job_id, META_FILENAME, logs_dir), "r", encoding="utf-8") as fh:
            return json.load(fh)
    except (OSError, json.JSONDecodeError):
        return None


def read_logged_status(job_id: str, logs_dir: Optional[str] = None) -> Optional[Dict[str, Any]]:
    """Return a job's current status.json, or None. This is the fast point-query
    file (current status only), separate from the registration-time meta.json."""
    try:
        with open(job_log_path(job_id, STATUS_FILENAME, logs_dir), "r", encoding="utf-8") as fh:
            return json.load(fh)
    except (OSError, json.JSONDecodeError):
        return None


def iter_logged_events(job_id: str, logs_dir: Optional[str] = None):
    """Yield each parsed event from a job's events.ndjson in file (time) order.
    Malformed lines are skipped with a warning."""
    path = job_log_path(job_id, EVENTS_FILENAME, logs_dir)
    if not path.exists():
        return
    with open(path, "r", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                logger.warning("skipping malformed audit line in %s", path)


def list_logged_jobs(logs_dir: Optional[str] = None) -> List[Dict[str, Any]]:
    """Return one meta record per job directory under the logs root, oldest
    first. Falls back to ``{"job_id": <dir>}`` when meta.json is missing."""
    base = Path(logs_dir or LOGS_DIR)
    out: List[Dict[str, Any]] = []
    if not base.exists():
        return out
    for d in sorted(base.iterdir()):
        if not d.is_dir():
            continue
        meta = read_logged_meta(d.name, logs_dir) or {"job_id": d.name}
        # Overlay the live status.json so the summary reflects current state, not
        # the registration-time snapshot frozen in meta.json.
        status = read_logged_status(d.name, logs_dir)
        if status:
            meta = {**meta,
                    "status": status.get("status", meta.get("status")),
                    "updated_at": status.get("updated_at", meta.get("updated_at"))}
        out.append(meta)
    out.sort(key=lambda m: m.get("created_at") or "")
    return out


# --------------------------------------------------------------------------
# Retry helper
# --------------------------------------------------------------------------
def with_retry(
    fn: Optional[Callable] = None,
    *,
    attempts: int = 3,
    base_delay: float = 0.5,
    factor: float = 2.0,
    max_delay: float = 8.0,
    exceptions: Iterable[type] = (Exception,),
) -> Callable:
    """Retry ``fn`` with exponential backoff.

    Usable two ways::

        result = with_retry(do_publish, attempts=3)()   # wrap-and-call

        @with_retry(attempts=5, base_delay=1.0)          # decorator
        def do_publish(): ...

    Re-raises the last exception once ``attempts`` is exhausted. An ``attempts``
    value below 1 is treated as 1, so ``func`` always runs at least once.
    """
    exc_tuple = tuple(exceptions)
    # Clamp so the loop below runs at least once; otherwise there would be no
    # exception to re-raise and the final ``raise`` would fail confusingly.
    total = max(1, attempts)

    def decorate(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            delay = base_delay
            last_exc: Optional[BaseException] = None
            for attempt in range(1, total + 1):
                try:
                    return func(*args, **kwargs)
                except exc_tuple as exc:
                    last_exc = exc
                    if attempt >= total:
                        break
                    logger.warning(
                        "attempt %d/%d failed: %s; retrying in %.1fs",
                        attempt, total, exc, delay,
                    )
                    time.sleep(delay)
                    delay = min(delay * factor, max_delay)
            assert last_exc is not None
            raise last_exc
        return wrapper

    if fn is not None:
        return decorate(fn)
    return decorate


def setup_logging(level: int = logging.WARNING) -> None:
    """Configure root logging to stderr. stdout is reserved for data output
    (subscriber event lines, registry ids)."""
    import sys
    logging.basicConfig(
        level=level,
        stream=sys.stderr,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )
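
The wait schedule that `with_retry` follows can be sketched in isolation. The helper below, `backoff_delays`, is hypothetical (it is not part of this commit) and only mirrors the arithmetic of the retry loop: start at `base_delay`, multiply by `factor` after each failure, and cap at `max_delay`.

```python
from typing import List


def backoff_delays(
    attempts: int = 3,
    base_delay: float = 0.5,
    factor: float = 2.0,
    max_delay: float = 8.0,
) -> List[float]:
    """Return the sleeps taken between failed attempts (hypothetical helper)."""
    delays: List[float] = []
    delay = base_delay
    # The final attempt is followed by a re-raise, not a sleep, so there is
    # one fewer delay than there are attempts.
    for _ in range(max(0, attempts - 1)):
        delays.append(delay)
        delay = min(delay * factor, max_delay)
    return delays


print(backoff_delays())            # defaults used by with_retry
print(backoff_delays(attempts=7))  # long enough to reach the max_delay cap
```

With the defaults, a call that fails every time sleeps 0.5 s and then 1.0 s before the third failure is re-raised.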
+225
@@ -0,0 +1,225 @@
#!/usr/bin/env python3
"""publish_event.py: the single entry point for emitting a Job event.

Loads the job record from the registry, resolves its broker, assigns the next
monotonic ``seq``, builds the schema-v1 JSON payload, and publishes it to
``<topic_prefix>/events`` over QoS 1 with exponential-backoff retry.

Silent by design: nothing is printed to stdout. Diagnostics go to stderr via
logging. Terminal events (``completed``/``error``) publish with retain=True so
a late subscriber still observes the final state (production hardening).

Exit codes:
    0  published successfully
    1  parameter / registry error (bad args, unknown job, no pending job)
    2  publish failed after retries (network / broker / ACK timeout)

Usage:
    publish_event.py --job <id> --event started [--detail "..."] [--data '{...}']
    publish_event.py --pick-pending --agent-session tmux:claude --event completed
    publish_event.py --job <id> --event completed --retained
"""
from __future__ import annotations

import argparse
import json
import logging
import sys
import time
from typing import Any, Dict, Optional

import mqtt_common
import registry
from mqtt_common import (
    DEFAULT_REGISTRY_DIR,
    SCHEMA_VERSION,
    broker_config_from_job,
    events_topic_for,
    load_job,
    make_client,
    next_seq,
    with_retry,
)

logger = logging.getLogger("delegate_job.publish_event")

VALID_EVENTS = ("started", "permission_required", "progress", "completed", "error")
TERMINAL_EVENTS = ("completed", "error")

# event -> registry status to sync as a best-effort side effect
EVENT_TO_STATUS = {
    "started": "running",
    "completed": "completed",
    "error": "error",
}

CONNECT_ACK_TIMEOUT = 10  # seconds to wait for CONNACK
PUBLISH_ACK_TIMEOUT = 5   # seconds to wait for QoS-1 PUBACK


def build_payload(
    job_id: str,
    seq: int,
    event: str,
    detail: str,
    data: Optional[Dict[str, Any]],
    auth_token: Optional[str],
) -> Dict[str, Any]:
    payload: Dict[str, Any] = {
        "schema_version": SCHEMA_VERSION,
        "seq": seq,
        "job_id": job_id,
        "event": event,
        "timestamp": mqtt_common._utcnow(),
        "detail": detail,
        "data": dict(data) if data else {},
    }
    # Production: carry the per-job auth token so the subscriber can verify the
    # publisher. The token is compared in plain text (bearer-token style) by the
    # subscriber; it is NOT an HMAC. See SKILL.md "Auth token" and PLAN 8.2. The
    # registry stores the per-job token in `auth_token`; only include it on the
    # wire when set so the public broker (no auth) doesn't leak anything.
    if auth_token:
        payload["data"]["auth_token"] = auth_token
    return payload


def _publish_once(config, topic: str, body: bytes, retain: bool) -> None:
    """Connect, publish one QoS-1 message, wait for the broker ACK, disconnect.

    Raises on any failure so ``with_retry`` can re-run the whole sequence (a
    fresh connection per attempt is the robust choice for a PoC)."""
    client = make_client("publisher", config)
    connected = {"rc": None}

    def on_connect(_c, _u, _flags, reason_code, _props):
        connected["rc"] = reason_code

    client.on_connect = on_connect
    client.connect(config.host, config.port, config.keepalive)
    client.loop_start()
    try:
        # Wait for CONNACK so we fail fast on auth/TLS errors.
        deadline = time.monotonic() + CONNECT_ACK_TIMEOUT
        while connected["rc"] is None and time.monotonic() < deadline:
            time.sleep(0.05)
        if connected["rc"] is None:
            raise TimeoutError("no CONNACK from broker")
        if mqtt_common.reason_code_value(connected["rc"]) != 0:
            raise ConnectionError(f"broker refused connection: rc={connected['rc']}")
        info = client.publish(topic, payload=body, qos=1, retain=retain)
        info.wait_for_publish(timeout=PUBLISH_ACK_TIMEOUT)
        if not info.is_published():
            raise TimeoutError("publish not acknowledged within timeout")
    finally:
        client.loop_stop()
        try:
            client.disconnect()
        except Exception:  # pragma: no cover - disconnect best effort
            pass


def _resolve_job_id(args) -> Optional[str]:
    if args.pick_pending:
        return registry.pick_pending(args.agent_session, args.registry_dir)
    return args.job


def main(argv=None) -> int:
    parser = argparse.ArgumentParser(description="Publish a Job event to MQTT")
    target = parser.add_mutually_exclusive_group(required=True)
    target.add_argument("--job", help="job id to publish for")
    target.add_argument("--pick-pending", action="store_true",
                        help="auto-select a pending job for --agent-session")
    parser.add_argument("--agent-session", default="tmux:claude",
                        help="session label used with --pick-pending")
    parser.add_argument("--event", default="progress", choices=VALID_EVENTS)
    parser.add_argument("--detail", default="")
    parser.add_argument("--data", default=None, help="optional JSON object string")
    parser.add_argument("--retained", action="store_true",
                        help="force retain=True (auto for completed/error)")
    parser.add_argument("--registry-dir", default=DEFAULT_REGISTRY_DIR)
    parser.add_argument("--attempts", type=int, default=3)
    parser.add_argument("-v", "--verbose", action="store_true")
    args = parser.parse_args(argv)

    mqtt_common.setup_logging(logging.DEBUG if args.verbose else logging.WARNING)

    # --- parse optional data JSON (parameter error -> exit 1) ---
    data: Optional[Dict[str, Any]] = None
    if args.data:
        try:
            data = json.loads(args.data)
            if not isinstance(data, dict):
                raise ValueError("--data must be a JSON object")
        except (ValueError, json.JSONDecodeError) as exc:
            logger.error("invalid --data: %s", exc)
            return 1

    job_id = _resolve_job_id(args)
    if not job_id:
        logger.error("no job to publish for (unknown --job or no pending job)")
        return 1

    try:
        job = load_job(job_id, args.registry_dir)
    except FileNotFoundError as exc:
        logger.error("%s", exc)
        return 1

    config = broker_config_from_job(job)
    topic = job.get("topic_prefix")
    topic = f"{topic}/events" if topic else events_topic_for(job_id)
    seq = next_seq(job_id, args.registry_dir)
    payload = build_payload(
        job_id=job_id,
        seq=seq,
        event=args.event,
        detail=args.detail,
        data=data,
        auth_token=job.get("auth_token"),
    )
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    retain = args.retained or args.event in TERMINAL_EVENTS

    publish = with_retry(
        _publish_once,
        attempts=args.attempts,
        # RuntimeError is included because paho's wait_for_publish() raises it
        # when the message could not be sent (for example, the connection
        # dropped); without it such a failure would skip the retries.
        exceptions=(OSError, TimeoutError, ConnectionError, ValueError, RuntimeError),
    )
    try:
        publish(config, topic, body, retain)
    except Exception as exc:
        logger.error("publish failed after %d attempts: %s", args.attempts, exc)
        return 2

    # Persistent audit log: record the exact payload we put on the wire so the
    # publish is reproducible from the log alone. Best-effort (isolated inside
    # append_event), so it never fails the publish.
    mqtt_common.append_event(job_id, {
        "event": "published",
        "source_event": args.event,
        "seq": seq,
        "topic": topic,
        "retain": retain,
        "timestamp": payload["timestamp"],
        "detail": args.detail,
        "payload": payload,
    })

    # Best-effort side effects: registry status sync + (debug) event log. Never
    # fail the publish on these.
    registry.append_event(job_id, args.registry_dir, payload)
    new_status = EVENT_TO_STATUS.get(args.event)
    if new_status:
        try:
            mqtt_common.update_job_status(job_id, args.registry_dir, status=new_status)
        except Exception as exc:  # pragma: no cover - best effort
            logger.warning("status sync failed: %s", exc)

    logger.info("published %s seq=%d job=%s retain=%s", args.event, seq, job_id, retain)
    return 0


if __name__ == "__main__":
    sys.exit(main())
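
As a rough illustration of what `publish_event.py` puts on the wire, the sketch below rebuilds the payload shape and the retain decision without importing the skill. The names `sketch_payload` and `should_retain` and the job id `a1b2c3d4` are hypothetical, and `SCHEMA_VERSION = 1` is an assumption inferred from the "schema-v1" wording in the module docstring; the real constant lives in `mqtt_common`.

```python
import json
import time
from typing import Any, Dict, Optional

SCHEMA_VERSION = 1  # assumption: the real value is defined in mqtt_common
TERMINAL_EVENTS = ("completed", "error")


def sketch_payload(job_id: str, seq: int, event: str, detail: str = "",
                   data: Optional[Dict[str, Any]] = None,
                   auth_token: Optional[str] = None) -> Dict[str, Any]:
    """Mirror the shape produced by build_payload (hypothetical helper)."""
    payload: Dict[str, Any] = {
        "schema_version": SCHEMA_VERSION,
        "seq": seq,
        "job_id": job_id,
        "event": event,
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "detail": detail,
        "data": dict(data) if data else {},
    }
    if auth_token:  # only sent when the job record carries a token
        payload["data"]["auth_token"] = auth_token
    return payload


def should_retain(event: str, forced: bool = False) -> bool:
    """Terminal events are retained; --retained forces it for any event."""
    return forced or event in TERMINAL_EVENTS


# Encode exactly as main() does before publishing: UTF-8 JSON bytes.
body = json.dumps(sketch_payload("a1b2c3d4", 3, "completed", "tests pass"),
                  ensure_ascii=False).encode("utf-8")
print(body.decode("utf-8"), should_retain("completed"))
```

A subscriber that connects after the job has finished would still receive this last message, because terminal events are published with the retain flag set.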
+327
@@ -0,0 +1,327 @@
"""Job registry for the delegate-job skill.

A job record is the single source of truth for one delegated unit of work:
its id, prompt, owning agent session, broker connection, timeouts, and status.
Records live as ``<registry_dir>/<job_id>.json`` with an append-only event log
``<registry_dir>/<job_id>.events.log`` and a shared ``<registry_dir>/.lock``.

Concurrency is handled via the fcntl lock in :mod:`mqtt_common` (PoC). For
multi-host delegation, migrate to SQLite WAL (see references/registry.md).

Importable as a library and runnable as a CLI (``register``/``list``/``get``/
``status``/``pick``) so the ``delegate-job`` bash wrapper can shell out.
"""
from __future__ import annotations

import argparse
import json
import logging
import sys
import uuid
from pathlib import Path
from typing import Any, Dict, List, Optional

import mqtt_common
from mqtt_common import (
    DEFAULT_REGISTRY_DIR,
    SCHEMA_VERSION,
    _atomic_write_record,
    _utcnow,
    broker_config_from_env,
    load_job,
    registry_lock,
    topic_prefix_for,
)

logger = logging.getLogger("delegate_job.registry")

TERMINAL_STATUSES = ("completed", "error", "cancelled")
VALID_STATUSES = ("pending", "running", "completed", "error", "cancelled")


def generate_job_id(bits: int = 32) -> str:
    """PoC: 32-bit hex (8 chars). Production: 128-bit (full uuid4 hex)."""
    if bits >= 128:
        return uuid.uuid4().hex
    nibbles = max(1, bits // 4)
    return uuid.uuid4().hex[:nibbles]


def register_job(
    prompt: str,
    agent: str = "claude-code",
    agent_session: str = "tmux:claude",
    broker: Optional[Dict[str, Any]] = None,
    timeout_sec: int = 600,
    idle_timeout_sec: int = 120,
    registry_dir: str = DEFAULT_REGISTRY_DIR,
    job_id: Optional[str] = None,
    expected_artifacts: Optional[List[str]] = None,
    bits: int = 32,
    auth_token: Optional[str] = None,
) -> str:
    """Create a new ``pending`` job record and return its id.

    ``broker`` defaults to the current environment's resolved broker block, so
    the registry alone is enough for ``publish_event.py`` to connect later.
    """
    job_id = job_id or generate_job_id(bits)
    if broker is None:
        broker = broker_config_from_env().to_registry_block()
    now = _utcnow()
    record: Dict[str, Any] = {
        "schema_version": SCHEMA_VERSION,
        "job_id": job_id,
        "status": "pending",
        "created_at": now,
        "updated_at": now,
        "prompt": prompt,
        "agent": agent,
        "agent_session": agent_session,
        "broker": broker,
        "topic_prefix": topic_prefix_for(job_id),
        "timeout_sec": int(timeout_sec),
        "idle_timeout_sec": int(idle_timeout_sec),
        "expected_artifacts": expected_artifacts or [],
        "last_seq": 0,
        "auth_token": auth_token,
    }
    with registry_lock(registry_dir):
        if mqtt_common._job_path(job_id, registry_dir).exists():
            raise FileExistsError(f"job already exists: {job_id}")
        _atomic_write_record(job_id, registry_dir, record)
    # Seed the persistent audit log (meta.json + status.json + a "registered"
    # event). Best-effort inside init_job_log, so it never blocks registration.
    mqtt_common.init_job_log(job_id, meta=record)
    logger.info("registered job %s (agent=%s session=%s)", job_id, agent, agent_session)
    return job_id


def pick_pending(agent_session: str, registry_dir: str = DEFAULT_REGISTRY_DIR) -> Optional[str]:
    """Claim the oldest ``pending`` job for ``agent_session``, flipping it to
    ``running`` atomically under the lock. Returns the job id, or None if no
    pending job matches. This is how each tmux session takes only its own work
    without two sessions grabbing the same job."""
    with registry_lock(registry_dir):
        candidates = []
        for record in _iter_records(registry_dir):
            if record.get("status") == "pending" and record.get("agent_session") == agent_session:
                candidates.append(record)
        if not candidates:
            return None
        candidates.sort(key=lambda r: r.get("created_at", ""))
        chosen = candidates[0]
        chosen["status"] = "running"
        chosen["updated_at"] = _utcnow()
        _atomic_write_record(chosen["job_id"], registry_dir, chosen)
        logger.info("session %s picked job %s", agent_session, chosen["job_id"])
        job_id = chosen["job_id"]
        updated_at = chosen["updated_at"]
    # pick_pending writes the record directly (not via update_job_status), so it
    # mirrors the pending->running transition into the audit log here. Best-effort.
    mqtt_common.update_logged_status(job_id, "running", updated_at=updated_at)
    mqtt_common.append_event(job_id, {
        "event": "status_changed",
        "from": "pending",
        "to": "running",
        "by": agent_session,
        "timestamp": updated_at,
    })
    return job_id


def update_status(job_id: str, registry_dir: str, status: str) -> Dict[str, Any]:
    if status not in VALID_STATUSES:
        raise ValueError(f"invalid status {status!r}; expected one of {VALID_STATUSES}")
    return mqtt_common.update_job_status(job_id, registry_dir, status=status)


def list_jobs(registry_dir: str = DEFAULT_REGISTRY_DIR, status: Optional[str] = None) -> List[Dict[str, Any]]:
    records = list(_iter_records(registry_dir))
    if status:
        records = [r for r in records if r.get("status") == status]
    records.sort(key=lambda r: r.get("created_at", ""))
    return records


def append_event(job_id: str, registry_dir: str, payload: Dict[str, Any]) -> None:
    """Append one event payload as a JSON line to the job's events log. Best
    effort, debug-only; failures are logged but never raised to the caller."""
    try:
        Path(registry_dir).mkdir(parents=True, exist_ok=True)
        log_path = Path(registry_dir) / f"{job_id}.events.log"
        with open(log_path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(payload, ensure_ascii=False) + "\n")
    except OSError as exc:  # pragma: no cover - best effort
        logger.warning("could not append event for %s: %s", job_id, exc)


# convenience re-export so callers can `from registry import load_job`
__all__ = [
    "register_job", "pick_pending", "update_status", "load_job",
    "list_jobs", "append_event", "generate_job_id",
]


def _iter_records(registry_dir: str):
    base = Path(registry_dir)
    if not base.exists():
        return
    for path in sorted(base.glob("*.json")):
        try:
            with open(path, "r", encoding="utf-8") as fh:
                yield json.load(fh)
        except (OSError, json.JSONDecodeError) as exc:
            logger.warning("skipping unreadable record %s: %s", path, exc)


# --------------------------------------------------------------------------
# CLI (so the bash wrapper can shell out without inline python)
# --------------------------------------------------------------------------
def _build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="delegate-job registry CLI")
    parser.add_argument("--registry-dir", default=DEFAULT_REGISTRY_DIR)
    sub = parser.add_subparsers(dest="command", required=True)

    p_reg = sub.add_parser("register", help="create a pending job; prints the job id")
    p_reg.add_argument("--prompt", required=True)
    p_reg.add_argument("--agent", default="claude-code")
    p_reg.add_argument("--agent-session", default="tmux:claude")
    p_reg.add_argument("--timeout", type=int, default=600)
    p_reg.add_argument("--idle-timeout", type=int, default=120)
    p_reg.add_argument("--bits", type=int, default=32, help="32 (PoC) or 128 (prod)")
    p_reg.add_argument("--artifact", action="append", default=[], dest="artifacts")

    p_list = sub.add_parser("list", help="list jobs (optionally by status)")
    p_list.add_argument("--status", default=None)
    p_list.add_argument("--json", action="store_true")

    p_get = sub.add_parser("get", help="print one job record as JSON")
    p_get.add_argument("--job", required=True)

    p_status = sub.add_parser("status", help="set a job status")
    p_status.add_argument("--job", required=True)
    p_status.add_argument("--set", required=True, dest="status")

    p_pick = sub.add_parser("pick", help="claim a pending job for a session; prints id")
    p_pick.add_argument("--agent-session", default="tmux:claude")

    p_logs = sub.add_parser(
        "logs",
        help="show the persistent audit log for a job, or --list every logged job",
    )
    p_logs.add_argument("job_id", nargs="?", default=None,
                        help="job id whose events.ndjson to print")
    p_logs.add_argument("--list", action="store_true", dest="list_all",
                        help="summarise every job under the logs dir instead")
    p_logs.add_argument("--logs-dir", default=None,
                        help="override the audit-log root (default: $DELEGATE_JOB_LOGS_DIR "
                             "or <cwd>/.hermes/delegate_job_logs)")
    p_logs.add_argument("--tail", type=int, default=0,
                        help="show only the last N events (0 = all)")
    p_logs.add_argument("--json", action="store_true",
                        help="emit raw JSON lines / records instead of a table")
    return parser


def main(argv: Optional[List[str]] = None) -> int:
    mqtt_common.setup_logging(logging.INFO)
    args = _build_parser().parse_args(argv)
    rd = args.registry_dir

    if args.command == "register":
        job_id = register_job(
            prompt=args.prompt,
            agent=args.agent,
            agent_session=args.agent_session,
            timeout_sec=args.timeout,
            idle_timeout_sec=args.idle_timeout,
            registry_dir=rd,
            expected_artifacts=args.artifacts,
            bits=args.bits,
        )
        print(job_id)
        return 0

    if args.command == "list":
        records = list_jobs(rd, status=args.status)
        if args.json:
            print(json.dumps(records, ensure_ascii=False, indent=2))
        else:
            if not records:
                print("(no jobs)")
            for r in records:
                # Tolerate hand-edited records: a missing id or a null prompt
                # must not crash the listing.
                print(f"{r.get('job_id','?')} {r.get('status','?'):10s} {r.get('agent_session','')}"
                      f" {(r.get('prompt') or '')[:48]}")
        return 0

    if args.command == "get":
        try:
            print(json.dumps(load_job(args.job, rd), ensure_ascii=False, indent=2))
        except FileNotFoundError as exc:
            print(str(exc), file=sys.stderr)
            return 1
        return 0

    if args.command == "status":
        try:
            update_status(args.job, rd, args.status)
        except (FileNotFoundError, ValueError) as exc:
            print(str(exc), file=sys.stderr)
            return 1
        return 0

    if args.command == "pick":
        job_id = pick_pending(args.agent_session, rd)
        if job_id is None:
            return 3  # no pending job for this session
        print(job_id)
        return 0

    if args.command == "logs":
        return _cmd_logs(args)

    return 1


def _cmd_logs(args) -> int:
    """Pretty-print one job's events.ndjson, or summarise all logged jobs."""
    logs_dir = args.logs_dir or mqtt_common.LOGS_DIR
    if args.list_all:
        jobs = mqtt_common.list_logged_jobs(logs_dir)
        if args.json:
            print(json.dumps(jobs, ensure_ascii=False, indent=2))
            return 0
        if not jobs:
            print(f"(no logged jobs under {logs_dir})")
            return 0
        for m in jobs:
            print(f"{m.get('job_id','?')} {m.get('status','?'):10s} "
                  f"{m.get('created_at','-'):20s} {(m.get('prompt') or '')[:48]}")
        return 0

    if not args.job_id:
        print("logs requires a <job_id> or --list", file=sys.stderr)
        return 1
    events = list(mqtt_common.iter_logged_events(args.job_id, logs_dir))
    if not events and not mqtt_common.job_log_dir(args.job_id, logs_dir).exists():
        print(f"no audit log for job {args.job_id} under {logs_dir}", file=sys.stderr)
        return 1
    if args.tail and args.tail > 0:
        events = events[-args.tail:]
    if args.json:
        for e in events:
            print(json.dumps(e, ensure_ascii=False))
        return 0
    for e in events:
        ts = e.get("logged_at") or e.get("timestamp") or "-"
        extra = e.get("detail") or e.get("to") or e.get("source_event") or ""
        print(f"{ts:24s} {e.get('event','?'):<16s} {extra}")
    return 0


if __name__ == "__main__":
    sys.exit(main())
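
The selection rule inside `pick_pending` (oldest `pending` record for a given session) can be tried on in-memory data. This is a minimal sketch under stated assumptions: `oldest_pending` and the sample records are hypothetical, and the sketch deliberately leaves out the registry lock, the `running` status write, and the audit-log mirroring that the real function performs.

```python
from typing import Any, Dict, List, Optional


def oldest_pending(records: List[Dict[str, Any]], agent_session: str) -> Optional[str]:
    """Return the id of the oldest pending job for a session (hypothetical helper)."""
    candidates = [
        r for r in records
        if r.get("status") == "pending" and r.get("agent_session") == agent_session
    ]
    if not candidates:
        return None
    # ISO-8601 UTC strings of the same shape sort chronologically as text.
    candidates.sort(key=lambda r: r.get("created_at", ""))
    return candidates[0]["job_id"]


# Hypothetical registry contents, not taken from the commit.
records = [
    {"job_id": "aaaa0001", "status": "running", "agent_session": "tmux:claude",
     "created_at": "2026-06-19T14:00:00Z"},
    {"job_id": "bbbb0002", "status": "pending", "agent_session": "tmux:claude",
     "created_at": "2026-06-19T14:00:05Z"},
    {"job_id": "cccc0003", "status": "pending", "agent_session": "tmux:claude",
     "created_at": "2026-06-19T14:00:02Z"},
    {"job_id": "dddd0004", "status": "pending", "agent_session": "tmux:codex",
     "created_at": "2026-06-19T14:00:01Z"},
]

print(oldest_pending(records, "tmux:claude"))
print(oldest_pending(records, "tmux:opencode"))
```

Filtering on `agent_session` is what keeps each tmux session on its own work; the lock in the real function is what prevents two processes for the same session from claiming the same record.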