feat(skills): integrate delegate-job skill (squashed from delegate-job-skill)

- Copy delegate-job-skill/skills/delegate-job/ → skills/delegate-job/
- Move requirements.txt (paho-mqtt>=2.0.0) into the new location
- Refactor outdated hardcoded paths (~/PuKi/lab/, ~/.hermes/skills/) to dynamic resolution
- Add MQTT connection timeout / retry hardening
- Remove legacy delegate-job-skill/ directory
- Update .gitignore

Note: delegate-job-skill git history is squashed: content is preserved, commit lineage is dropped.
+5
-2
@@ -7,5 +7,8 @@ test-sessions.yaml
 test-sessions.yaml.bak
 test-sessions.yaml.lock
 
-# embedded standalone git repo (managed separately)
-delegate-job-skill/
+# delegate-job temporary/runtime artifacts
+.hermes/
+.venv/
+__pycache__/
+*.pyc
@@ -0,0 +1,11 @@
# delegate-job skill

A Hermes skill that delegates a job to an autonomous agent (claude-code/codex/opencode/human) and
observes it asynchronously over an MQTT event channel. **Start with [`SKILL.md`](./SKILL.md).**

- Protocol/schema: [`job-protocol.md`](./job-protocol.md)
- Broker PoC → production migration: [`mqtt-broker-setup.md`](./mqtt-broker-setup.md)
- Registry format/concurrency: [`registry.md`](./registry.md)
- Reference implementation: [`delegate-job`](./delegate-job) (bash wrapper), [`scripts/publish_event.py`](./scripts/publish_event.py), [`scripts/job_subscriber.py`](./scripts/job_subscriber.py), [`scripts/registry.py`](./scripts/registry.py), [`scripts/mqtt_common.py`](./scripts/mqtt_common.py)
- Persistent audit log: `.hermes/delegate_job_logs/<job_id>/` (`meta.json`, `events.ndjson`, `status.json`).
  Query it with `delegate-job logs <id>` or `delegate-job logs --list` (see "Audit Logs" in SKILL.md)
@@ -0,0 +1,348 @@
---
name: delegate-job
description: "Delegate a unit of work to any autonomous agent (claude-code, codex, opencode, or a human) and observe it asynchronously over an MQTT event channel. Each job gets a unique id, a registry record (prompt, broker, status, timeouts), and a single per-job topic that carries started/permission_required/progress/completed/error events as schema-versioned JSON. The delegator starts a subscriber first, runs the agent, and treats a completed/error event or a timeout as the job's terminal state. Ships a working reference implementation (publish_event.py, job_subscriber.py, registry.py, mqtt_common.py, delegate-job wrapper) plus a PoC-to-production path: validate on a public broker, then move to an authenticated TLS broker by changing config only — no code change. Use when you need fire-and-observe delegation, multi-job fan-out across tmux sessions, or a uniform completion-signal protocol shared by several agent types."
version: 1.0.0
author: Hermes Agent
license: MIT
platforms: [linux, macos, windows]
metadata:
  hermes:
    tags: [agent-delegation, mqtt, jobs, orchestration, async-completion]
    related_skills: [claude-code, codex, opencode, hermes-agent-skill-authoring]
---

# delegate-job — Async Job Delegation over MQTT

Delegate a unit of work to an autonomous agent, then **observe** it instead of
blocking on it. Every job gets a unique id and a registry record; the agent
publishes lifecycle events (`started`, `permission_required`, `progress`,
`completed`, `error`) to a per-job MQTT topic; the delegator subscribes and
treats `completed`/`error` — or a timeout — as the terminal state.

This skill is a **reference implementation**: copy the files in this directory
into your project and customise. The `communication_over_mqtt` project is the
canonical concrete instance.

## Overview

The model is deliberately small. A **job** is one delegated task. An **agent**
is a worker (a claude-code tmux session, a codex run, a human). The **registry**
(`.hermes/jobs/<id>.json`) holds everything about a job so nothing important
lives in environment variables — which means one tmux session can process many
jobs sequentially, and many sessions can fan out in parallel, with no env
collisions. The **event channel** is one MQTT topic per job carrying JSON
payloads; `event` discriminates the type.

Each responsibility has exactly one entry point:
[`publish_event.py`](./scripts/publish_event.py) emits events (registry lookup,
monotonic `seq`, retry+backoff) and [`job_subscriber.py`](./scripts/job_subscriber.py)
observes them (timeouts, terminal state machine, defensive parsing). Shared
logic lives in [`mqtt_common.py`](./scripts/mqtt_common.py); registry I/O in
[`registry.py`](./scripts/registry.py). The demo `publisher.py`/`subscriber.py`
in the host project stay frozen.

Two stages, same code. **PoC** runs on the public `broker.hivemq.com` to wire up
the protocol. **Production** moves to your own authenticated TLS broker — the
switch is **config only** (env vars + the registry `broker.*` block), never a
code change. See [`mqtt-broker-setup.md`](./mqtt-broker-setup.md).

## When to Use / When NOT to Use

**Use when:**
- you want **fire-and-observe** delegation — kick off work and get a completion
  signal rather than blocking a terminal;
- several agent types (claude-code, codex, opencode, human) must follow **one**
  completion protocol;
- you need **multi-job fan-out** across tmux sessions with safe job claiming;
- you want a clean PoC → authenticated-broker upgrade path.

**Do NOT use when:**
- a one-shot `claude -p '…'` that returns inline is enough (no async signal
  needed) — just use the [claude-code](../claude-code/SKILL.md) skill directly;
- you need request/response RPC or large artifact transfer (this is a
  one-way event stream, not a data bus);
- the payload would carry secrets and you're still on the public broker — move
  to the own-broker stage first.

## Quick Start

The one-line wrapper handles register + subscriber-first + agent launch. If
you're new, **start here** and only fall back to the manual 5-step flow when
you need finer control.

```bash
# 1) one line: register → start subscriber → launch agent in tmux
#    (uses public broker by default; last stdout line is the audit-log dir)
delegate-job submit \
  --agent claude-code \
  --prompt "Create 10 sorting problems and save them as sort_problems.md" \
  --workdir /path/to/project \
  --agent-session tmux:demo \
  --timeout 600 --idle-timeout 120
# → stdout: registered job: <JID>
#           subscriber pid: …
#           agent launched in tmux session: demo
#           subscriber output: <one line per event>
#           /path/to/project/.hermes/delegate_job_logs/<JID>   ← audit log dir

# 2) at any time, query the job or its audit log
delegate-job status --job <JID>
delegate-job logs <JID>            # pretty timeline
delegate-job logs --list           # every job, live status

# 3) run a user-supplied validator against the job's artifacts
delegate-job verify --job <JID> --validate ./validate.sh
```

The wrapper enforces the **subscribe-before-publish** ordering and **forwards
the freshly-minted `JOB_ID` into the agent's prompt** (so the agent calls
`publish_event.py --job <JID>` with the right id — see Pitfall §"Wrong job_id
propagated to the agent"). When you need finer control, the manual flow is:

```bash
# Manual 5-step (same outcome, more knobs)
PY=.venv/bin/python
SKILL=./skills/delegate-job/scripts

# 1) register
JID=$($PY "$SKILL/registry.py" register \
  --prompt "…" --agent claude-code --agent-session tmux:demo \
  --timeout 600 --idle-timeout 120)

# 2) START THE SUBSCRIBER FIRST (MQTT does not queue non-retained msgs)
$PY "$SKILL/job_subscriber.py" --job "$JID" --timeout 600 --idle-timeout 120 &

# 3) pass JID to the agent and instruct it to publish events with --job "$JID"
#    (don't hard-code a job id you saw earlier — see Pitfall §"Wrong job_id")

# 4) on completion the subscriber prints events and exits 0/1/2

# 5) inspect any time
$PY "$SKILL/registry.py" get --job "$JID"
$PY "$SKILL/registry.py" logs "$JID"   # positional job id
$PY "$SKILL/registry.py" logs --list
```

## Job Protocol

One topic per job: `python/mqtt/jobs/<job_id>/events`. Payload (JSON, UTF-8,
`schema_version=1`):

```json
{ "schema_version": 1, "seq": 7, "job_id": "abc12345",
  "event": "started|permission_required|progress|completed|error",
  "timestamp": "2026-06-19T09:32:00Z", "detail": "generalised text",
  "data": { "optional": "metadata" } }
```

- `seq` is monotonic per job (first = 1); the subscriber uses it to spot
  reorder/duplication.
- `timestamp` is advisory — timeouts are measured from **receive** time.
- `detail`/`data` carry **no** secrets or absolute paths.
- A payload whose `schema_version` or `job_id` does not match is **dropped**
  (defensive parsing).

`started` and `completed`/`error` are the mandatory bookends; `completed`→exit 0,
`error`→exit 1, and a timeout with no terminal event→exit 2. Full catalogue +
production `auth_token` handling:
[`job-protocol.md`](./job-protocol.md).
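
The drop rules above can be sketched as a small, pure parsing function. This is a minimal illustration, not the code in `job_subscriber.py`: the names `parse_event` and `SeqTracker` are hypothetical, and the way `seq` is classified is an assumption about one reasonable way to spot reorder and duplication.

```python
import json

SCHEMA_VERSION = 1
KNOWN_EVENTS = {"started", "permission_required", "progress", "completed", "error"}


def parse_event(raw, expected_job_id):
    """Return the decoded event dict, or None when the payload must be dropped."""
    try:
        text = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        event = json.loads(text)
    except ValueError:
        # Covers both invalid UTF-8 (UnicodeDecodeError) and invalid JSON.
        return None
    if not isinstance(event, dict):
        return None
    if event.get("schema_version") != SCHEMA_VERSION:
        return None
    if event.get("job_id") != expected_job_id:
        return None
    if event.get("event") not in KNOWN_EVENTS:
        return None
    seq = event.get("seq")
    if not isinstance(seq, int) or isinstance(seq, bool):
        return None
    return event


class SeqTracker:
    """Classify each seq as 'ok', 'duplicate', or 'reordered' (hypothetical helper)."""

    def __init__(self):
        self.seen = set()
        self.highest = 0

    def classify(self, seq):
        if seq in self.seen:
            return "duplicate"
        self.seen.add(seq)
        if seq < self.highest:
            return "reordered"
        self.highest = seq
        return "ok"
```

A subscriber callback would call `parse_event(message_payload, job_id)` first and log a warning for every `None`, so a polluted topic never raises.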

## Registry Format

```
.hermes/jobs/<id>.json         # metadata record (single source of truth)
.hermes/jobs/<id>.events.log   # append-only JSON-lines log (debug, optional)
.hermes/jobs/.lock             # fcntl advisory lock for the registry
```

The record holds `status`, `prompt`, `agent`, `agent_session`, a `broker` block,
`topic_prefix`, `timeout_sec`/`idle_timeout_sec`, `expected_artifacts`,
`last_seq`, and (production) `auth_token`. Because the `broker` block lives in
the record, `publish_event.py` connects from the registry alone. Concurrency,
the atomic rename trick, and multi-session job claiming are in
[`registry.md`](./registry.md).
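
As a rough illustration of the lock-plus-atomic-rename pattern mentioned above, the sketch below writes a record to a temporary file and renames it into place while holding an advisory lock. It is not taken from `registry.py`: the function name `write_record_atomically` is hypothetical, and the use of `fcntl.flock` (rather than another fcntl locking call) is an assumption. It runs on POSIX systems only.

```python
import fcntl
import json
import os
import tempfile
from pathlib import Path


def write_record_atomically(registry_dir, job_id, record):
    """Write <job_id>.json so that readers never observe a half-written file."""
    registry = Path(registry_dir)
    registry.mkdir(parents=True, exist_ok=True)
    target = registry / f"{job_id}.json"
    with open(registry / ".lock", "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # serialise writers (advisory lock)
        try:
            fd, tmp_name = tempfile.mkstemp(dir=registry, prefix=f".{job_id}.", suffix=".tmp")
            try:
                with os.fdopen(fd, "w", encoding="utf-8") as tmp:
                    json.dump(record, tmp, ensure_ascii=False, indent=2)
                    tmp.flush()
                    os.fsync(tmp.fileno())
                # Rename within one directory: the record switches atomically.
                os.replace(tmp_name, target)
            except BaseException:
                if os.path.exists(tmp_name):
                    os.unlink(tmp_name)
                raise
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
    return target
```

Keeping the temporary file in the registry directory itself matters: `os.replace` is only atomic when source and destination are on the same filesystem.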

## Audit Logs

Every job's lifecycle is mirrored to a **persistent, append-only audit log**
under `.hermes/delegate_job_logs/` (override with `DELEGATE_JOB_LOGS_DIR`;
default `<cwd>/.hermes/delegate_job_logs`). Unlike the registry — live state
mutated in place and liable to be cleaned up — the audit log is durable
history you can replay after the fact. It is git-ignored.

```
.hermes/delegate_job_logs/<job_id>/
  meta.json        # registration snapshot: prompt, agent, broker, timeouts, …
  events.ndjson    # append-only, one JSON event per line, in time order
  status.json      # current status only (fast point-query)
```

**What is logged, automatically:**

| When | `events.ndjson` line | Written by |
|------|----------------------|------------|
| job registered | `registered` (also seeds meta.json + status.json) | `registry.register_job` |
| any status change | `status_changed` (`from`/`to`; also rewrites status.json) | `update_job_status`, `pick_pending` |
| event published | `published` (carries the exact payload — reproducible) | `publish_event.py` |
| event received | `received` (subscriber's external view) | `job_subscriber.py` |

Both the emitter side (`published`) and the observer side (`received`) are
recorded, so a dropped publish or a missed receive is still visible from the
other. Every write is **best-effort and isolated** — an fcntl-locked append
guarded by `try/except` that only ever emits a `logger.warning`, so a logging
failure can never break a publish, a subscribe, or a registry write. stdout is
never touched.
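
The best-effort write described above might look roughly like the following. This is a sketch under assumptions, not the skill's code: `append_audit_line` and the `kind`/`payload` field names are hypothetical, and `fcntl.flock` stands in for whichever fcntl lock the real implementation uses. It runs on POSIX systems only.

```python
import fcntl
import json
import logging
from pathlib import Path

logger = logging.getLogger("delegate_job.audit")


def append_audit_line(logs_root, job_id, kind, payload):
    """Append one JSON line to <logs_root>/<job_id>/events.ndjson; never raise."""
    try:
        job_dir = Path(logs_root) / job_id
        job_dir.mkdir(parents=True, exist_ok=True)
        line = json.dumps({"kind": kind, "payload": payload}, ensure_ascii=False)
        with open(job_dir / "events.ndjson", "a", encoding="utf-8") as log_file:
            fcntl.flock(log_file, fcntl.LOCK_EX)  # one writer at a time
            try:
                log_file.write(line + "\n")
                log_file.flush()
            finally:
                fcntl.flock(log_file, fcntl.LOCK_UN)
        return True
    except Exception as exc:
        # Logging must never break the publish/subscribe/registry path.
        logger.warning("audit log write failed for job %s: %s", job_id, exc)
        return False
```

The boolean return value is only a convenience for tests; a caller on the publish path would ignore it.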

**Reading them:**

```bash
delegate-job logs <job_id>     # pretty-print one job's timeline
delegate-job logs --list       # summarise every logged job (with live status)
# or directly via the registry CLI:
$PY scripts/registry.py logs <job_id> [--tail N] [--json]
$PY scripts/registry.py logs --list [--json]
```

`submit` prints the job's audit-log directory as its last stdout line, so a
caller can `tail -n1` to locate it.

## Broker Setup

| Stage | Broker | Auth | Transport |
|-------|--------|------|-----------|
| PoC | `broker.hivemq.com` | none | 1883 plaintext |
| Production | self-hosted Mosquitto/EMQX | user/pass + ACL | 8883 TLS |

All connection settings come from env (`MQTT_BROKER`, `MQTT_PORT`, `MQTT_TLS`,
`MQTT_USERNAME`/`MQTT_PASSWORD`, `MQTT_CA_CERTS`, …) resolved by
`broker_config_from_env()`, with the registry `broker.*` block overriding per
job. Moving to your own broker is **config only**: install Mosquitto, set
`persistence true` + `acl_file` + `password_file` + a TLS `listener 8883`, grant
the worker `write python/mqtt/jobs/+/events` and Hermes `read`, then flip
`MQTT_TLS=1` and fill the registry `broker.*`. Step-by-step (conf, ACL,
`mosquitto_passwd`, self-signed/private-CA certs, cut-over verification):
[`mqtt-broker-setup.md`](./mqtt-broker-setup.md).
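
To make the precedence concrete, here is a hedged sketch of env-first resolution with a per-job override. It is not the real `broker_config_from_env()`: the function name `resolve_broker_config`, the result keys (`host`, `port`, `tls`, …), and the assumption that the registry `broker` block uses those same keys are all illustrative. Only the environment variable names and the PoC defaults come from the text above.

```python
import os


def resolve_broker_config(job_record=None, env=None):
    """Build broker settings from env, then let the job's broker block override."""
    env = os.environ if env is None else env
    config = {
        "host": env.get("MQTT_BROKER", "broker.hivemq.com"),  # PoC default
        "port": int(env.get("MQTT_PORT", "1883")),
        "tls": env.get("MQTT_TLS", "0").lower() in {"1", "true", "yes"},
        "username": env.get("MQTT_USERNAME"),
        "password": env.get("MQTT_PASSWORD"),
        "ca_certs": env.get("MQTT_CA_CERTS"),
    }
    # Hypothetical key layout: the record's "broker" block mirrors the keys above.
    overrides = (job_record or {}).get("broker") or {}
    config.update({key: value for key, value in overrides.items() if value is not None})
    return config
```

With this shape, the cut-over to a private broker touches only the environment or the registry record, which is the "config only" property the section describes.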

## Agent Adapters

Each agent voluntarily follows the contract: receive a `JOB_ID` (or registry
path), call `publish_event.py` at lifecycle points, exit 0/1/2. **The contract
in one line**: every event call uses `--job "$JOB_ID"` where `$JOB_ID` is the
**freshly-issued id from the registry record for *this* delegation** — never a
job_id you saw in an earlier session (Pitfall §"Wrong job_id propagated to the
agent").

- **claude-code** — Claude Code calls `publish_event.py` via its Bash tool at
  lifecycle points. `submit` (which always launches the agent in tmux) injects
  a prompt that already names `$JOB_ID`; if you drive claude manually, hand it
  the id explicitly. Reference instruction block (the wrapper injects something
  equivalent):

  ```text
  Your job_id is "$JOB_ID" (read it from the registry record for this delegation —
  do not reuse any job_id you saw before).

  On start:      $PY delegate-job/scripts/publish_event.py --job "$JOB_ID" --event started
  On permission: $PY … --job "$JOB_ID" --event permission_required --detail "<tool>:<what>"
  On progress:   $PY … --job "$JOB_ID" --event progress --detail "<short status>"
  On success:    $PY … --job "$JOB_ID" --event completed --detail "<one-line summary>"
  On failure:    $PY … --job "$JOB_ID" --event error --detail "<one-line reason>"

  Task: <the user's prompt>

  The subscriber for "$JOB_ID" is already running; your completed/error event
  ends the job. Exit codes: 0 completed, 1 error, 2 publish failure.
  ```

  See [claude-code](../claude-code/SKILL.md) for tmux orchestration patterns.
- **codex** — same contract. Invoke `codex exec "<instruction-block-above>"` or
  wire `publish_event.py` as an MCP tool so the agent can call it directly.
- **opencode** — wire `publish_event.py` as a tool/command the agent can call;
  identical event points.
- **human** — a person does the work, reads the registry record, then runs
  `publish_event.py --job <id> --event completed` (or `error`) by hand.

## User Interface

The [`delegate-job`](./delegate-job) bash wrapper bundles register +
subscribe-first + run-agent + validate:

```bash
delegate-job submit --agent claude-code \
  --prompt "Create 10 sorting problems and save them as sort_problems.md" \
  --workdir /path/to/project --timeout 600 [--validate ./validate.sh]
delegate-job status --job <id>                            # one record, pretty-printed
delegate-job list                                         # all jobs, one line each
delegate-job verify --job <id> --validate ./validate.sh   # runs it, reports exit code
delegate-job wait   [--job <id>]                          # block until terminal (else --wait-any)
```

`submit` **always starts the subscriber before the agent** (the ordering
dependency), launches the agent in a detached tmux session (the wrapper is
tmux-interactive only; the one-shot `--mode print` was removed), and
calls `--validate` afterward if given. The skill automates job-id generation,
registry creation, broker resolution, subscriber-first ordering, agent launch,
and completion detection; it does **not** automate the agent's internals or your
business-logic validation — those are hooks you fill (`validate.sh` reads
`$JOB_ID`/`$REGISTRY_DIR`).

## Common Pitfalls

- **Publishing before subscribing** — MQTT does not queue non-retained messages
  for absent subscribers. Start `job_subscriber.py` *before* the agent, or rely
  on retained terminal events (production). `submit` enforces this.
- **Wrong job_id propagated to the agent** — the wrapper prints a fresh `JOB_ID`
  on every `submit`. If your agent instruction (or the wrapper's prompt template)
  hard-codes an old job_id, the agent calls `publish_event.py --job <wrong>`,
  the subscriber's defensive parser drops it as a `job_id` mismatch, and the
  delegator waits until idle timeout (exit 2). Fix: instruct the agent to
  **read the job_id from the registry record for *this* delegation** (or pass it
  in via env / `--prompt` interpolation), never from prior runs. `submit`'s
  default prompt template interpolates `$JOB_ID` for you — if you build a custom
  prompt, do the same.
- **tmux session name collision** — `submit` derives the session
  name from `--agent-session tmux:<name>` (default `tmux:claude`). If a session
  with that name already exists (e.g. you ran the demo and the previous
  session is still open), the wrapper refuses to launch the agent and exits
  with an error. Pick a unique `--agent-session` per concurrent delegation
  (e.g. `tmux:demo`, `tmux:claude-a`, `tmux:claude-b`) or kill the stale one
  (`tmux kill-session -t claude`) before re-running.
- **Timeout before `started`** — a cold-starting agent may not emit `started`
  for a while; the wall-clock timeout starts at subscribe time so a stuck agent
  still terminates. Don't set `--timeout` so low you false-positive a slow start.
- **No retry on publish** — a dropped `completed` would hang the delegator
  forever; `publish_event.py` retries with exponential backoff and exits 2 if it
  still fails, so the delegator is never left waiting silently.
- **QoS-1 duplicates / reorders** — a terminal event can arrive twice, or
  `error` can trail `completed`; the subscriber's terminal state machine
  finalises each job once and ignores the rest.
- **Trusting the public broker** — anyone can publish there; never make a real
  decision on a PoC signal. Add `auth_token` + an authenticated broker first.
- **Secrets in `detail`/`data`** — keep payloads generalised; no paths, keys, or
  tokens (except the production `auth_token` in `data`).
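
The "retry with exponential backoff, then exit 2" behaviour from the publish pitfall can be sketched as below. The function name `publish_with_retry`, the attempt count, and the delay values are assumptions chosen for illustration; only the exit code 2 for a publish failure comes from this document. The `publish_once` callable stands in for whatever actually sends the MQTT message.

```python
import time


def publish_with_retry(publish_once, attempts=5, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call publish_once() until it returns True; return a process exit code."""
    for attempt in range(attempts):
        try:
            if publish_once():
                return 0
        except OSError:
            pass  # connection errors count as a failed attempt
        if attempt < attempts - 1:
            # Delays grow as base, 2*base, 4*base, ... capped at max_delay.
            sleep(min(max_delay, base_delay * (2 ** attempt)))
    return 2  # publish failure, as documented for publish_event.py
```

Injecting `sleep` keeps the function testable without real waiting; a CLI entry point would end with `sys.exit(publish_with_retry(send))`.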

## Verification Checklist

- [ ] `started` → `completed` over the public broker: subscriber prints the
      lines and exits **0**.
- [ ] `error` path: subscriber exits **1**.
- [ ] timeout path: no terminal event within `--timeout`/`--idle-timeout` →
      exit **2**.
- [ ] polluted payload (bad JSON, wrong `schema_version`, wrong `job_id`) is
      dropped with a warning, not crashed on.
- [ ] one tmux session processes two registry jobs in sequence; a second
      session with a different `agent_session` claims only its own.
- [ ] broker cut-over: same scripts reach an authenticated TLS broker with env
      changes only; a credential without write ACL is rejected; a late
      subscriber still receives the retained terminal event.
- [ ] `publisher.py`/`subscriber.py`/`README.md` demo on `python/mqtt/sample`
      still works unchanged (regression).
- [ ] **audit log integrity** — for a completed job,
      `.hermes/delegate_job_logs/<JID>/events.ndjson` contains `registered` →
      `received started` → `published completed` (in that order), and
      `status.json.status == "completed"` matches the registry record. A
      logging failure (e.g. read-only log dir) does not break the publish or
      subscribe path — only a `logger.warning` is emitted.
- [ ] **end-to-end demo smoke** — run
      `delegate-job submit --agent claude-code --agent-session tmux:demo-smoke
      --prompt "echo hello and call publish_event.py --job <JID>
      --event completed" --timeout 120` and confirm
      (a) registered job id echoed, (b) subscriber pid echoed, (c) tmux session
      name printed, (d) `events.ndjson` grows as the agent runs, (e) final
      stdout line is the audit-log dir.
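
The first three checklist items (exit 0, 1, or 2) can be reasoned about as a pure function over receive times. The sketch below is an assumption about how such a decision could be made, not the subscriber's actual state machine: `decide_exit_code` is a hypothetical name, and it assumes both the wall-clock and the idle timer start at subscribe time.

```python
def decide_exit_code(events, started_at, now, timeout, idle_timeout):
    """Return 0, 1, or 2 for a finished job, or None while it is still running.

    events is a list of (receive_time, event_name) tuples in arrival order.
    """
    last_activity = started_at
    for received_at, name in events:
        timed_out = (received_at - started_at > timeout
                     or received_at - last_activity > idle_timeout)
        if timed_out:
            return 2  # a timeout fired before this event arrived
        if name == "completed":
            return 0  # first terminal event wins; later ones are ignored
        if name == "error":
            return 1
        last_activity = received_at
    if now - started_at > timeout or now - last_activity > idle_timeout:
        return 2
    return None
```

Because the function takes receive times rather than payload timestamps, it matches the rule that `timestamp` is advisory, and it can be unit-tested without a broker.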
File diff suppressed because it is too large
Executable
+272
@@ -0,0 +1,272 @@
#!/usr/bin/env bash
# delegate-job — user-facing orchestrator for the delegate-job skill.
#
# Subcommands:
#   submit   register a job, start the subscriber FIRST, then run the agent,
#            then (optionally) run a validation script.
#   status   show one job record.
#   list     list all jobs.
#   verify   run a user-supplied --validate script against a job's artifacts.
#   wait     block until all running/pending jobs reach a terminal state.
#   logs     show one job's persistent audit log, or list all logged jobs.
#
# This is a reference wrapper: it shells out to the python scripts that live
# next to it. Copy it into your project and customise as needed. With --dry-run
# it prints the instructions it would hand to the agent instead of launching
# one; otherwise a missing tmux or agent binary is reported as an error.
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

# Pick an interpreter: an explicit DELEGATE_JOB_PYTHON wins, then a project
# .venv (under --workdir, then the current directory), else python3.
pick_python() {
  local py_bin
  if [[ -n "${DELEGATE_JOB_PYTHON:-}" ]]; then
    py_bin="$DELEGATE_JOB_PYTHON"
  elif [[ -x "${WORKDIR:-.}/.venv/bin/python" ]]; then
    py_bin="${WORKDIR:-.}/.venv/bin/python"
  elif [[ -x ".venv/bin/python" ]]; then
    py_bin="$(pwd)/.venv/bin/python"
  else
    py_bin="python3"
  fi
  if ! "$py_bin" -c "import paho.mqtt" 2>/dev/null; then
    echo "ERROR: paho-mqtt package is missing for $py_bin." >&2
    echo "       Please create a virtual environment and install it:" >&2
    echo "       python3 -m venv .venv && .venv/bin/pip install -r \"$SCRIPT_DIR/requirements.txt\"" >&2
    exit 1
  fi
  echo "$py_bin"
}

REGISTRY_DIR_DEFAULT=".hermes/jobs"

usage() {
  cat <<'EOF'
delegate-job <command> [options]

  submit  --agent <name> --prompt <text> [--workdir <dir>] [--agent-session <label>]
          [--timeout <sec>] [--idle-timeout <sec>] [--validate <script>]
          [--registry-dir <dir>] [--dry-run]
          # The skill is tmux-interactive only; --mode print was removed.
  status  --job <id> [--registry-dir <dir>]
  list    [--registry-dir <dir>]
  verify  --job <id> --validate <script> [--registry-dir <dir>]
  wait    [--job <id>] [--timeout <sec>] [--registry-dir <dir>]
  logs    <job_id> | --list          # persistent audit log (delegate_job_logs/)
EOF
}

# ---- arg parsing helpers --------------------------------------------------
AGENT="claude-code"; PROMPT=""; WORKDIR="$(pwd)"; AGENT_SESSION="tmux:claude"
TIMEOUT=600; IDLE_TIMEOUT=120; VALIDATE=""; DRY_RUN=0
JOB_ID=""; REGISTRY_DIR="$REGISTRY_DIR_DEFAULT"

parse_opts() {
  while [[ $# -gt 0 ]]; do
    case "$1" in
      --agent)         AGENT="$2"; shift 2;;
      --prompt)        PROMPT="$2"; shift 2;;
      --workdir)       WORKDIR="$2"; shift 2;;
      --agent-session) AGENT_SESSION="$2"; shift 2;;
      --timeout)       TIMEOUT="$2"; shift 2;;
      --idle-timeout)  IDLE_TIMEOUT="$2"; shift 2;;
      --validate)      VALIDATE="$2"; shift 2;;
      --job)           JOB_ID="$2"; shift 2;;
      --registry-dir)  REGISTRY_DIR="$2"; shift 2;;
      --dry-run)       DRY_RUN=1; shift;;
      *) echo "unknown option: $1" >&2; usage; exit 1;;
    esac
  done
}

cmd_submit() {
  parse_opts "$@"
  [[ -n "$PROMPT" ]] || { echo "submit requires --prompt" >&2; exit 1; }
  PY="$(pick_python)"
  cd "$WORKDIR"
  mkdir -p "$REGISTRY_DIR"

  # 1) register job (prints the new job id)
  JOB_ID="$("$PY" "$SCRIPT_DIR/scripts/registry.py" --registry-dir "$REGISTRY_DIR" register \
    --prompt "$PROMPT" --agent "$AGENT" --agent-session "$AGENT_SESSION" \
    --timeout "$TIMEOUT" --idle-timeout "$IDLE_TIMEOUT")"
  echo "registered job: $JOB_ID"

  # 2) START THE SUBSCRIBER FIRST (ordering dependency — MQTT does not queue
  #    non-retained messages for absent subscribers).
  local logf="$REGISTRY_DIR/$JOB_ID.subscriber.out"
  "$PY" "$SCRIPT_DIR/scripts/job_subscriber.py" --registry-dir "$REGISTRY_DIR" \
    --job "$JOB_ID" --timeout "$TIMEOUT" --idle-timeout "$IDLE_TIMEOUT" \
    >"$logf" 2>&1 &
  local sub_pid=$!
  echo "subscriber pid: $sub_pid (log: $logf)"
  sleep 1  # give the subscriber time to CONNACK + SUBSCRIBE before the agent runs

  # 3) run the agent (or print the command for dry-run / missing binary)
  local pub="$PY $SCRIPT_DIR/scripts/publish_event.py --registry-dir $REGISTRY_DIR --job $JOB_ID"
  # NOTE: the agent MUST use --job "$JOB_ID" (the one we just minted). Hard-coding
  # an id from an earlier session is the #1 reason a delegated job sits idle and
  # times out (see SKILL.md "Wrong job_id propagated to the agent"). We make the
  # freshness explicit in the instruction header.
  local instructions="Your job_id is \"$JOB_ID\" (the one just registered for THIS delegation — read it from the registry record, do NOT reuse any job_id you saw in earlier runs).

On start run: $pub --event started.
On permission/tool prompt run: $pub --event permission_required --detail '<tool>:<what>'.
On progress (optional): $pub --event progress --detail '<short status>'.
On success run: $pub --event completed --detail '<one-line summary>'.
On failure run: $pub --event error --detail '<one-line reason>'.

The subscriber for this job_id is already running; your completed/error event ends the job. Exit codes: 0 completed, 1 error, 2 publish failure.

Task: $PROMPT"

  run_agent "$JOB_ID" "$instructions"

  if [[ "$DRY_RUN" == "1" ]]; then
    # Dry-run only shows the instructions that would have run. The subscriber
    # from step 2 is already in the background, so stop it rather than wait for
    # a terminal event that will never arrive, and skip validation and the
    # subscriber log dump.
    kill "$sub_pid" 2>/dev/null || true
    local logs_root_dry="${DELEGATE_JOB_LOGS_DIR:-$WORKDIR/.hermes/delegate_job_logs}"
    echo "$logs_root_dry/$JOB_ID"
    return 0
  fi

  wait "$sub_pid" || true
  echo "subscriber output:"; cat "$logf" || true

  # 4) optional validation hook. It runs only after the subscriber has seen a
  #    terminal event or timed out: the tmux agent is launched detached, so
  #    run_agent returns long before the job's artifacts exist.
  if [[ -n "$VALIDATE" ]]; then
    echo "running validation: $VALIDATE"
    if JOB_ID="$JOB_ID" REGISTRY_DIR="$REGISTRY_DIR" bash "$VALIDATE"; then
      echo "validation: PASS"
    else
      local rc=$?
      echo "validation: FAIL (exit $rc)"
    fi
  fi

  # Last stdout line: the persistent audit-log dir for this job (see SKILL.md
  # "Audit Logs"). Callers can scrape `tail -n1` to find it.
  local logs_root="${DELEGATE_JOB_LOGS_DIR:-$WORKDIR/.hermes/delegate_job_logs}"
  echo "$logs_root/$JOB_ID"
}

run_agent() {
  local job_id="$1"; local instructions="$2"
  local bin
  # The skill is INTERACTIVE-ONLY. We never invoke `claude -p` or any other
  # one-shot print mode, because:
  #   - claude -p exits the moment stdin is drained, so there's nothing to
  #     `tmux attach` to afterwards.
  #   - fire-and-forget via wrapper defeats the whole point of the audit log
  #     (you can't tell what happened if the agent crashes mid-turn).
  #   - the job registry already gives us an authoritative completion signal,
  #     so we don't need a wrapper-side exit code to know "done".
  # The user attaches with `tmux attach -t <session>` and types follow-up
  # prompts themselves. We pre-load the first prompt via stdin and `read`
  # keeps the pane open after the agent exits so the user can review.
  case "$AGENT" in
    claude-code) bin="claude";;
    codex)       bin="codex";;
    human)       echo "[human agent] complete the task, then run publish_event.py --event completed"; return;;
    *)           bin="$AGENT";;
  esac

  if [[ "$DRY_RUN" == "1" ]]; then
    echo "[dry-run] would launch agent '$AGENT' in a fresh tmux session with instructions:"
    echo "----"; echo "$instructions"; echo "----"
    return
  fi

  if ! command -v tmux >/dev/null 2>&1; then
    echo "ERROR: this skill requires tmux (interactive agent sessions)." >&2
    echo "       Install with: brew install tmux  (or your package manager)" >&2
    return 1
  fi
  if ! command -v "$bin" >/dev/null 2>&1; then
    echo "ERROR: agent binary '$bin' not found in PATH." >&2
    return 1
  fi

  local sess="${AGENT_SESSION#tmux:}"
  # Detect a stale session with the same name (e.g. the user is still attached
  # from an earlier run, or a previous wrapper died without cleanup). tmux
  # new-session refuses an existing name; check first so the error message
  # says how to recover.
  if tmux has-session -t "$sess" 2>/dev/null; then
    local attached
    attached=$(tmux list-clients -t "$sess" 2>/dev/null | wc -l | tr -d ' ')
    echo "ERROR: tmux session '$sess' already exists (clients attached: $attached)." >&2
    echo "       Pick a unique --agent-session (e.g. tmux:demo, tmux:claude-a) or" >&2
    echo "       kill the stale one first:  tmux kill-session -t $sess" >&2
    return 1
  fi
  # Hand the prompt over via a file. Interpolating $instructions into the tmux
  # command string would let quotes, `$`, or backticks in the prompt break or
  # alter the command that tmux runs.
  local prompt_file="$REGISTRY_DIR/$job_id.prompt.txt"
  printf '%s' "$instructions" >"$prompt_file"
  tmux new-session -d -s "$sess" -c "$WORKDIR" \
    "$bin --dangerously-skip-permissions < '$prompt_file'; echo; echo '--- agent exited (job $job_id); press enter to close ---'; read"
  echo "agent launched in tmux session: $sess  (attach with: tmux attach -t $sess)"
}

cmd_status() {
  parse_opts "$@"
  [[ -n "$JOB_ID" ]] || { echo "status requires --job" >&2; exit 1; }
  PY="$(pick_python)"
  "$PY" "$SCRIPT_DIR/scripts/registry.py" --registry-dir "$REGISTRY_DIR" get --job "$JOB_ID"
}

cmd_list() {
  parse_opts "$@"
  PY="$(pick_python)"
  "$PY" "$SCRIPT_DIR/scripts/registry.py" --registry-dir "$REGISTRY_DIR" list
}

cmd_verify() {
  parse_opts "$@"
  [[ -n "$JOB_ID" ]] || { echo "verify requires --job" >&2; exit 1; }
  [[ -n "$VALIDATE" ]] || { echo "verify requires --validate <script>" >&2; exit 1; }
  echo "verifying job $JOB_ID with $VALIDATE"
  if JOB_ID="$JOB_ID" REGISTRY_DIR="$REGISTRY_DIR" bash "$VALIDATE"; then
    echo "verify: PASS (exit 0)"; exit 0
  else
    rc=$?; echo "verify: FAIL (exit $rc)"; exit "$rc"
  fi
}

cmd_logs() {
  # logs <job_id> | logs --list — delegates to registry.py's logs CLI, which
  # reads the persistent audit log under $DELEGATE_JOB_LOGS_DIR (or
  # <cwd>/delegate_job_logs). Run from your project dir so the default resolves.
  PY="$(pick_python)"
  if [[ "${1:-}" == "--list" ]]; then
    "$PY" "$SCRIPT_DIR/scripts/registry.py" logs --list
  else
    local jid="${1:-}"
    [[ -n "$jid" ]] || { echo "logs requires <job_id> or --list" >&2; exit 1; }
    "$PY" "$SCRIPT_DIR/scripts/registry.py" logs "$jid"
  fi
}

cmd_wait() {
  parse_opts "$@"
  PY="$(pick_python)"
  if [[ -n "$JOB_ID" ]]; then
    "$PY" "$SCRIPT_DIR/scripts/job_subscriber.py" --registry-dir "$REGISTRY_DIR" \
      --job "$JOB_ID" --timeout "$TIMEOUT"
  else
    "$PY" "$SCRIPT_DIR/scripts/job_subscriber.py" --registry-dir "$REGISTRY_DIR" \
      --wait-any --timeout "$TIMEOUT"
  fi
}

main() {
  local sub="${1:-}"; shift || true
  case "$sub" in
    submit) cmd_submit "$@";;
    status) cmd_status "$@";;
    list)   cmd_list "$@";;
    verify) cmd_verify "$@";;
    wait)   cmd_wait "$@";;
    logs)   cmd_logs "$@";;
    ""|-h|--help|help) usage;;
    *) echo "unknown command: $sub" >&2; usage; exit 1;;
  esac
}

main "$@"
@@ -0,0 +1,118 @@
# Job Event Protocol

The wire contract every delegate-job agent (claude-code, codex, opencode,
human, …) speaks. One job → one MQTT topic → JSON event payloads. Stable across
the PoC (public broker) and production (own broker) stages; only transport
hardening changes, never the payload shape.

Reference implementation: [`./scripts/publish_event.py`](./scripts/publish_event.py)
(emit) and [`./scripts/job_subscriber.py`](./scripts/job_subscriber.py) (observe).

---

## 1. Topic design

| Topic | Purpose |
|-------|---------|
| `python/mqtt/sample` | Legacy demo topic — **never changed** (README compat). |
| `python/mqtt/jobs/<job_id>/events` | Per-job event stream (this protocol). |

- One topic per job, JSON payload, `event` field discriminates the type.
- Single-direction publish only (worker → observer). No request/response.
- Future split is reserved but not required:
  `<job_id>/events`, `<job_id>/logs`, `<job_id>/artifacts`.
- `topic_prefix` is stored in the job record so publishers resolve the topic
  from the registry alone (`<topic_prefix>/events`).

---

## 2. Payload schema (JSON, UTF-8, `schema_version = 1`)

```json
{
  "schema_version": 1,
  "seq": 7,
  "job_id": "abc12345",
  "event": "started | permission_required | progress | completed | error",
  "timestamp": "2026-06-19T09:32:00Z",
  "detail": "generalised, whitelisted human-readable string",
  "data": { "optional": "metadata" }
}
```

| Field | Rule |
|-------|------|
| `schema_version` | If publisher/subscriber disagree, the subscriber **drops** the event with a warning (defensive parsing). |
| `seq` | Monotonic **per `job_id`**, first publish = 1. Lets the subscriber detect reorder/duplication. Persisted in the registry (`last_seq`) so it survives restarts. |
| `job_id` | Subscriber drops any event whose `job_id` it did not subscribe for. |
| `timestamp` | Publisher host clock, **advisory only**. The delegator's timeout is measured from *receive* time, not this field. |
| `detail` | Generalised text only. **No absolute paths, keys, or tokens.** |
| `data` | Optional metadata. Production may add `auth_token`, `build_id`, etc. |
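The field rules above can be sketched as a small payload builder plus the subscriber-side drop checks. This is a minimal illustration, not the reference `publish_event.py`: the names `build_event` and `accept` are hypothetical, and the `seq` value is passed in by the caller rather than read from the registry.

```python
import json
from datetime import datetime, timezone

SCHEMA_VERSION = 1
EVENTS = {"started", "permission_required", "progress", "completed", "error"}


def build_event(job_id, seq, event, detail="", data=None):
    """Assemble one event payload following the field table (hypothetical helper)."""
    if event not in EVENTS:
        raise ValueError(f"unknown event: {event!r}")
    payload = {
        "schema_version": SCHEMA_VERSION,
        "seq": seq,
        "job_id": job_id,
        "event": event,
        # Advisory only: the delegator times out on receive time, not on this.
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "detail": detail,
    }
    if data:
        payload["data"] = data
    return payload


def accept(raw, expected_job_ids):
    """Return the decoded payload, or None if the event should be dropped."""
    try:
        payload = json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return None
    if not isinstance(payload, dict):
        return None
    if payload.get("schema_version") != SCHEMA_VERSION:
        return None
    if payload.get("job_id") not in expected_job_ids:
        return None
    return payload


wire = json.dumps(build_event("abc12345", 1, "started", "Job abc12345 started")).encode("utf-8")
print(accept(wire, {"abc12345"})["event"])  # started
print(accept(wire, {"other"}))              # None
```

Returning `None` instead of raising keeps a single malformed message from taking down the observer, which matches the "drop with a warning" rule in the table.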

---

## 3. Event catalogue

| event | When emitted | `detail` example | seq |
|-------|--------------|------------------|-----|
| `started` | Agent first picks up the job | `"Job a1b2c3d4 started"` | 1 |
| `permission_required` | Agent needs a tool/permission grant | `"needs to write sort_problems.md"` | as it happens |
| `progress` | Optional intermediate checkpoint | `"creating problem 5/10"` | as it happens |
| `completed` | Successful terminal state | `"saved to sort_problems.md"` | last |
| `error` | Failure / exception terminal state | `"internal error, see logs"` | last |

`started` and `completed`/`error` are mandatory bookends; `permission_required`
and `progress` are optional. `detail` must stay on the whitelist of generalised
phrasings — never leak secrets through it.

### Terminal semantics

- `completed` → subscriber exits 0; `error` → exits 1.
- The subscriber runs a **terminal state machine**: it finalises a job on the
  first `completed`/`error` it sees and ignores any later terminal event for
  that job (QoS-1 duplicate, or an `error`-after-`completed` reorder). When all
  watched jobs are finalised it exits.
- Wall-clock timeout *or* idle timeout before a terminal event → exit 2.
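The terminal rules above can be sketched as a small tracker that finalises each job exactly once and maps the outcome to an exit code. The class name `TerminalTracker` is made up for this illustration, and the real `job_subscriber.py` may be organised differently.

```python
TERMINAL = ("completed", "error")


class TerminalTracker:
    """Finalise each watched job on its first terminal event only (sketch)."""

    def __init__(self, job_ids):
        self.pending = set(job_ids)
        self.final = {}  # job_id -> "completed" | "error"

    def feed(self, job_id, event):
        """Record an event; return True if it finalised the job."""
        if event not in TERMINAL or job_id not in self.pending:
            return False  # non-terminal, unknown job, or a late duplicate
        self.pending.discard(job_id)
        self.final[job_id] = event
        return True

    def exit_code(self, timed_out=False):
        if timed_out or self.pending:
            return 2  # timeout (or still waiting) before every job finished
        return 1 if "error" in self.final.values() else 0


tracker = TerminalTracker({"abc12345"})
tracker.feed("abc12345", "started")
tracker.feed("abc12345", "completed")
tracker.feed("abc12345", "error")  # ignored: the job is already finalised
print(tracker.exit_code())         # 0
```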

---

## 4. Production hardening (own broker stage)

The payload shape is unchanged; the transport and trust model tighten. See
[`mqtt-broker-setup.md`](./mqtt-broker-setup.md) for the broker side.

- **Auth / ACL** — username/password + per-topic ACL. `jobs/+/events` publish is
  granted to the worker credential, subscribe to the Hermes credential.
- **`auth_token` (the bonus field)** — each job record carries a per-job
  `auth_token` (`secrets.token_urlsafe(32)`). The publisher copies it into
  **`data.auth_token`**; the subscriber compares it against the registry's
  expected token and **drops mismatches**. This is an integrity check on top of
  the broker ACL, useful while still on a shared/public broker.

  ```json
  { "...": "...", "data": { "auth_token": "9f3c…", "build_id": "42" } }
  ```

- **TLS** — port 8883 + private CA. Toggled with `MQTT_TLS=1` (+ `MQTT_CA_CERTS`);
  no code change.
- **Retained terminal events** — `completed`/`error` publish with `retain=True`
  so a subscriber that joins late immediately receives the last terminal state
  instead of a stale view. The reference publisher auto-retains terminal events;
  `--retained` forces it for any event.
- **Dual timeouts** — total wall-clock budget + last-activity idle detection,
  both measured from receive time.
- **Clock trust** — never trust the payload `timestamp` for timeout decisions.
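The dual-timeout and clock-trust rules can be illustrated with a timer that only ever reads the local monotonic clock. This is a sketch under assumptions: `DualTimeout` and its method names are hypothetical, and the injectable `now` argument exists only so the example can run with a fake clock.

```python
import time


class DualTimeout:
    """Wall-clock budget plus idle detection, both on the local clock (sketch)."""

    def __init__(self, wall_sec, idle_sec, now=time.monotonic):
        self._now = now
        self.wall_deadline = now() + wall_sec
        self.idle_sec = idle_sec
        self.last_activity = now()

    def touch(self):
        """Call on every accepted event; the payload timestamp is never consulted."""
        self.last_activity = self._now()

    def expired(self):
        """Return 'wall', 'idle', or None."""
        t = self._now()
        if t >= self.wall_deadline:
            return "wall"
        if t - self.last_activity >= self.idle_sec:
            return "idle"
        return None


clock = [0.0]  # fake clock so the example is deterministic
timer = DualTimeout(wall_sec=600, idle_sec=120, now=lambda: clock[0])
clock[0] = 100.0
timer.touch()           # an event arrived at t=100
clock[0] = 219.0
print(timer.expired())  # None (119 s since the last event)
clock[0] = 220.0
print(timer.expired())  # idle
```

Using `time.monotonic` rather than wall time means a clock adjustment on the delegator host cannot shorten or extend either budget.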

---

## 5. Why a public broker is PoC-only

On `broker.hivemq.com` anyone can publish/subscribe the same topic. Therefore:

- No secret data in payloads.
- `started`/`completed`/`error` are *signals*, never a basis for a security
  decision.
- Non-retained messages are **not queued** for absent subscribers — start the
  subscriber **before** the agent (ordering dependency), or rely on retained
  terminal events in production.
- Real operational decisions belong to the own-broker stage with auth + ACL.
File diff suppressed because it is too large
@@ -0,0 +1,176 @@
# MQTT Broker Setup — PoC → Production

The delegate-job scripts read **all** broker settings from environment
variables (or a job record's `broker.*` block) through a single helper,
`broker_config_from_env()` in
[`./scripts/mqtt_common.py`](./scripts/mqtt_common.py). The design goal:
**switch from the public PoC broker to your own broker with config only — no
code change.**

| Env var | Meaning | PoC default | Production |
|---------|---------|-------------|-----------|
| `MQTT_BROKER` | host | `broker.hivemq.com` | internal hostname/IP |
| `MQTT_PORT` | port | `1883` | `8883` (TLS) |
| `MQTT_TLS` | TLS on/off (`1`/`0`) | `0` | `1` |
| `MQTT_USERNAME` / `MQTT_PASSWORD` | auth | (none) | broker-issued |
| `MQTT_CA_CERTS` | CA bundle path | (none) | private CA path |
| `MQTT_CERTFILE` / `MQTT_KEYFILE` | client cert (optional mTLS) | (none) | per-client |
| `MQTT_CLIENT_ID_PREFIX` | client id prefix | `hermes` | per-environment |
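As a rough illustration of the config-only switch, the table can be read as a single function from environment variables to a settings dict. The sketch below is not the real `broker_config_from_env()`: the name `broker_config_sketch` and the dict keys are assumptions; only the variable names and PoC defaults come from the table.

```python
import os


def broker_config_sketch(env=None):
    """Build a broker config dict from MQTT_* variables (hypothetical helper)."""
    env = os.environ if env is None else env
    return {
        "host": env.get("MQTT_BROKER", "broker.hivemq.com"),
        "port": int(env.get("MQTT_PORT", "1883")),
        "tls": env.get("MQTT_TLS", "0") == "1",
        "username": env.get("MQTT_USERNAME"),
        "password": env.get("MQTT_PASSWORD"),
        "ca_certs": env.get("MQTT_CA_CERTS"),
        "certfile": env.get("MQTT_CERTFILE"),
        "keyfile": env.get("MQTT_KEYFILE"),
        "client_id_prefix": env.get("MQTT_CLIENT_ID_PREFIX", "hermes"),
    }


poc = broker_config_sketch({})
prod = broker_config_sketch(
    {"MQTT_BROKER": "mqtt.internal", "MQTT_PORT": "8883", "MQTT_TLS": "1"}
)
print(poc["host"], poc["port"], poc["tls"])     # broker.hivemq.com 1883 False
print(prod["host"], prod["port"], prod["tls"])  # mqtt.internal 8883 True
```

Because every caller goes through one function, moving to the production broker only changes the environment, not the publisher or subscriber code.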

---

## 1. PoC: public broker (`broker.hivemq.com`)

**Pros** — zero setup, reachable from anywhere, perfect for wiring up the
publish/subscribe loop and the timeout/state-machine logic.

**Cons / accepted assumptions** — no auth, no integrity, shared with the world:

- no secrets in payloads;
- `started`/`completed`/`error` are advisory signals only;
- non-retained messages are **not queued** for absent subscribers, so the
  subscriber must start before the agent;
- a re-subscribing client cannot recover past (non-retained) events.

Use it only to validate the protocol, never for real decisions.

---

## 2. Production: self-hosted Mosquitto (or EMQX)

Both support MQTT 5 + ACL + TLS. Mosquitto shown below; EMQX is a drop-in for
the same env vars.

### 2.1 Install

```bash
# macOS
brew install mosquitto

# Debian/Ubuntu
sudo apt-get update && sudo apt-get install -y mosquitto mosquitto-clients

# Docker
docker run -d --name mosquitto -p 8883:8883 \
  -v "$PWD/mosquitto.conf:/mosquitto/config/mosquitto.conf" \
  -v "$PWD/certs:/mosquitto/certs" \
  -v "$PWD/auth:/mosquitto/auth" \
  eclipse-mosquitto:2
```

### 2.2 `mosquitto.conf` (key lines)

```conf
persistence true
persistence_location /mosquitto/data/

password_file /mosquitto/auth/passwd
acl_file /mosquitto/auth/acl
allow_anonymous false

listener 8883
cafile /mosquitto/certs/ca.crt
certfile /mosquitto/certs/server.crt
keyfile /mosquitto/certs/server.key
```

`persistence true` + QoS 1 + retained terminal events means a subscriber that
joins after a job finished still sees the final `completed`/`error`.

### 2.3 Users (username/password)

```bash
# create the file with the first user (-c), then add further users without -c
mosquitto_passwd -c /mosquitto/auth/passwd hermes          # subscriber/delegator
mosquitto_passwd    /mosquitto/auth/passwd claude-worker   # publisher/agent
# (omit -c after the first; -c truncates the file)
```

### 2.4 ACL — least privilege

The worker only **publishes** events; Hermes only **subscribes**:

```conf
# /mosquitto/auth/acl

# claude-worker: may publish job events, may not read others' streams
user claude-worker
topic write python/mqtt/jobs/+/events

# hermes: observes every job's events
user hermes
topic read python/mqtt/jobs/+/events

# keep the legacy demo topic usable for both, if desired
pattern readwrite python/mqtt/sample
```

### 2.5 TLS certificates

**Quick self-signed (single host, internal only):**

```bash
mkdir -p certs && cd certs
openssl req -x509 -newkey rsa:2048 -nodes -days 825 \
  -keyout server.key -out server.crt \
  -subj "/CN=mqtt.internal"
cp server.crt ca.crt   # clients trust this as the CA bundle
```

**Private CA (recommended — separate CA from server cert):**

```bash
# 1) CA
openssl genrsa -out ca.key 4096
openssl req -x509 -new -nodes -key ca.key -days 3650 -out ca.crt -subj "/CN=Hermes-CA"
# 2) server cert signed by the CA
openssl genrsa -out server.key 2048
openssl req -new -key server.key -out server.csr -subj "/CN=mqtt.internal"
openssl x509 -req -in server.csr -CA ca.crt -CAkey ca.key -CAcreateserial \
  -out server.crt -days 825
```

Clients trust `ca.crt` via `MQTT_CA_CERTS=/path/to/ca.crt`.

---

## 3. Cut-over verification (config-only, no code change)

Goal: prove the **same scripts** talk to your broker by changing only env/registry.

```bash
# 1) point the env at the new broker
export MQTT_BROKER=mqtt.internal
export MQTT_PORT=8883
export MQTT_TLS=1
export MQTT_CA_CERTS=$PWD/certs/ca.crt
export MQTT_USERNAME=hermes
export MQTT_PASSWORD=…           # subscriber side
# (publisher side uses claude-worker creds via the job record's broker block)

# 2) sanity-check with the mosquitto CLI first
mosquitto_sub -h "$MQTT_BROKER" -p 8883 --cafile "$MQTT_CA_CERTS" \
  -u hermes -P "$MQTT_PASSWORD" -t 'python/mqtt/jobs/+/events' -v &

# 3) run the unchanged delegate-job loop
PY=.venv/bin/python
JID=$($PY scripts/registry.py register --prompt "broker cutover smoke")
$PY scripts/job_subscriber.py --job "$JID" --timeout 30 &
sleep 3
$PY scripts/publish_event.py --job "$JID" --event started
$PY scripts/publish_event.py --job "$JID" --event completed   # auto-retained
```

Expected:

- subscriber prints the `started` and `completed` lines and exits 0;
- `mosquitto_sub` shows the same events (ACL allows `hermes` to read);
- publishing as a credential **without** write ACL is rejected by the broker;
- a subscriber started *after* `completed` still receives it (retained).

If all four hold, the migration is config-only. Persist the broker block into
each job record so `publish_event.py` connects from the registry alone:

```json
"broker": { "host": "mqtt.internal", "port": 8883, "tls": true,
            "username": "claude-worker", "password": "…" }
```
@@ -0,0 +1,183 @@
# Job Registry

The registry is the **single source of truth** for delegated work. Job metadata
(id, prompt, broker, status, timeouts) lives in files, **not** environment
variables — so one tmux session can handle many jobs sequentially or in
parallel without collisions, and `publish_event.py` / `job_subscriber.py` can
reconstruct everything they need from the registry alone.

Reference implementation: [`./scripts/registry.py`](./scripts/registry.py)
(library + CLI) over the primitives in
[`./scripts/mqtt_common.py`](./scripts/mqtt_common.py).

---

## 1. Directory layout

```
.hermes/jobs/
  <job_id>.json          # job metadata record (schema below)
  <job_id>.events.log    # append-only JSON-lines event log (debug, optional)
  .lock                  # shared advisory lock (fcntl) for the whole registry
```

`registry_dir` defaults to `.hermes/jobs` and is overridable everywhere via
`--registry-dir`.

---

## 2. Job record schema

```json
{
  "schema_version": 1,
  "job_id": "abc12345",
  "status": "pending | running | completed | error | cancelled",
  "created_at": "2026-06-19T09:30:00Z",
  "updated_at": "2026-06-19T09:32:00Z",
  "prompt": "Create 10 sorting problems and save them to sort_problems.md…",
  "agent": "claude-code",
  "agent_session": "tmux:claude",
  "broker": {
    "host": "broker.hivemq.com",
    "port": 1883,
    "tls": false,
    "username": null,
    "password": null
  },
  "topic_prefix": "python/mqtt/jobs/abc12345",
  "timeout_sec": 600,
  "idle_timeout_sec": 120,
  "expected_artifacts": ["sort_problems.md"],
  "last_seq": 0,
  "auth_token": null
}
```

- `broker` lets `publish_event.py` connect from the record alone (env still
  overrides toggles like `MQTT_TLS`).
- `topic_prefix` → the events topic is `<topic_prefix>/events`.
- `last_seq` backs the monotonic `seq` counter so it survives process restarts.
- `expected_artifacts` is the hook a user `validate.sh` checks (existence/content).
- `auth_token` is `null` in PoC; production sets `secrets.token_urlsafe(32)`.

---

## 3. Concurrency rules

### PoC — fcntl advisory lock

Every read-modify-write (`register_job`, `pick_pending`, `update_status`,
`next_seq`) runs inside `registry_lock(registry_dir)`, an exclusive
`fcntl.flock` over `.lock`. Single-host, good enough for many tmux sessions on
one machine.

### Production — SQLite WAL

When delegation spans **multiple hosts**, the file lock no longer serialises
across machines. Migrate the same operations to a SQLite database in WAL mode
(`PRAGMA journal_mode=WAL`) with a transaction per claim. Note that SQLite's WAL
mode requires every process that opens the database to run on the same host, so
a multi-host deployment has to reach the database through a single service
rather than a shared network filesystem. The function
signatures stay identical; only the storage backend changes.
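A transaction-per-claim backend might look roughly like the sketch below. The table layout, the `open_registry` helper, and the connection-based `pick_pending` signature are all assumptions made for illustration; the document only specifies WAL mode and one transaction per claim. The sketch also assumes every process opening the database runs on the same host, which SQLite's WAL mode requires.

```python
import sqlite3


def open_registry(path):
    """Open (or create) a hypothetical SQLite-backed registry in WAL mode."""
    conn = sqlite3.connect(path, isolation_level=None)  # explicit BEGIN/COMMIT below
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        " job_id TEXT PRIMARY KEY, agent_session TEXT,"
        " status TEXT NOT NULL, created_at TEXT NOT NULL)"
    )
    return conn


def pick_pending(conn, agent_session):
    """Claim the oldest pending job for this session inside one transaction."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        row = conn.execute(
            "SELECT job_id FROM jobs WHERE status = 'pending' AND agent_session = ?"
            " ORDER BY created_at LIMIT 1",
            (agent_session,),
        ).fetchone()
        if row is not None:
            conn.execute("UPDATE jobs SET status = 'running' WHERE job_id = ?", row)
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return row[0] if row else None


conn = open_registry(":memory:")  # in-memory database, for the demo only
conn.execute(
    "INSERT INTO jobs VALUES ('job-1', 'tmux:claude-a', 'pending', '2026-06-19T09:30:00Z')"
)
print(pick_pending(conn, "tmux:claude-a"))  # job-1
print(pick_pending(conn, "tmux:claude-a"))  # None
```

`BEGIN IMMEDIATE` plays the role of the file lock: the select and the update happen under one write lock, so two sessions cannot claim the same row.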

---

## 4. How multiple sessions take only their own work

Each tmux session carries an `agent_session` label (`tmux:claude`,
`tmux:claude-a`, `tmux:claude-b`, …). `pick_pending(agent_session)`:

1. acquires the registry lock,
2. scans for the **oldest** record with `status == "pending"` **and**
   matching `agent_session`,
3. flips it to `running` and writes it back **atomically**,
4. releases the lock and returns the `job_id` (or `None`).

Because the scan + flip happen under one lock, two sessions can never claim the
same job. Sessions with distinct labels naturally partition the work; sessions
sharing a label compete safely — first to acquire the lock wins, the other sees
the job already `running` and moves on.

```bash
# session A only ever runs its own pending jobs
$PY scripts/registry.py pick --agent-session tmux:claude-a   # prints id or exits 3
```
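The four steps can be sketched in a few lines of Python on top of `fcntl.flock`. This is an illustrative approximation rather than the code in `registry.py`: the two-argument `pick_pending(registry_dir, agent_session)` signature, the `.tmp` naming, and the use of `created_at` to decide "oldest" are assumptions. It is POSIX-only because it relies on `fcntl`.

```python
import fcntl
import json
import os
from contextlib import contextmanager


@contextmanager
def registry_lock(registry_dir):
    """Hold an exclusive advisory lock on <registry_dir>/.lock."""
    os.makedirs(registry_dir, exist_ok=True)
    with open(os.path.join(registry_dir, ".lock"), "a") as fh:
        fcntl.flock(fh, fcntl.LOCK_EX)
        try:
            yield
        finally:
            fcntl.flock(fh, fcntl.LOCK_UN)


def pick_pending(registry_dir, agent_session):
    """Claim the oldest pending job for this session; return its id or None."""
    with registry_lock(registry_dir):  # step 1 (released on exit, step 4)
        candidates = []
        for name in os.listdir(registry_dir):  # step 2: scan the records
            if not name.endswith(".json"):
                continue
            with open(os.path.join(registry_dir, name), encoding="utf-8") as fh:
                rec = json.load(fh)
            if rec.get("status") == "pending" and rec.get("agent_session") == agent_session:
                candidates.append(rec)
        if not candidates:
            return None
        job = min(candidates, key=lambda r: r["created_at"])
        job["status"] = "running"  # step 3: flip and write back atomically
        path = os.path.join(registry_dir, job["job_id"] + ".json")
        tmp = path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as fh:
            json.dump(job, fh)
        os.replace(tmp, path)
        return job["job_id"]
```

A second session calling `pick_pending` with the same label blocks on the lock, then finds the record already `running` and picks the next one or returns `None`.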

---

## 5. Atomic status updates

All writes use a temp-file + `os.replace` rename, which is atomic on POSIX:

1. take the registry lock,
2. load the current record,
3. mutate fields + refresh `updated_at` (and `last_seq` for `next_seq`),
4. write to `.<job_id>.<rand>.tmp` in the **same directory**, `fsync`,
5. `os.replace(tmp, <job_id>.json)`,
6. release the lock.

A reader therefore always sees either the old or the new complete record, never
a half-written file. This is the file-based equivalent of the rename trick
(`pending.<session>` → `running.<session>`) and maps cleanly onto a single
SQLite transaction when you migrate.
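Steps 4 and 5 are the part that makes the write atomic, and they can be sketched on their own. The helper name `atomic_write_json` is hypothetical, and the caller is assumed to hold the registry lock (steps 1 and 6) around it.

```python
import json
import os
import tempfile


def atomic_write_json(path, record):
    """Write record so readers see the old or the new file, never a partial one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory: os.replace is only
    # atomic within one filesystem.
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(record, fh, ensure_ascii=False, indent=2)
            fh.flush()
            os.fsync(fh.fileno())  # data is on disk before the rename
        os.replace(tmp, path)
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)  # do not leave stray temp files behind
        raise
```

The leading dot in the temp name keeps half-written files out of any scan that looks for `<job_id>.json` records.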

---

## 6. CLI quick reference

```bash
PY=.venv/bin/python
$PY scripts/registry.py register --prompt "…" --agent claude-code \
    --agent-session tmux:claude --timeout 600 --idle-timeout 120   # → prints job_id
$PY scripts/registry.py list                                       # human table
$PY scripts/registry.py list --json                                # full records
$PY scripts/registry.py get --job <id>                             # one record
$PY scripts/registry.py status --job <id> --set completed          # set status
$PY scripts/registry.py pick --agent-session tmux:claude           # claim → running
```

Exit codes: `0` ok, `1` not found / bad status, `3` (`pick`) no pending job for
that session.

---

## 7. Persistent audit log

Separate from the registry, every job is also mirrored to a durable append-only
audit log at `.hermes/delegate_job_logs/<job_id>/` (override with
`DELEGATE_JOB_LOGS_DIR`, default `<cwd>/.hermes/delegate_job_logs`). The registry
is **live state** mutated in place; the audit log is **history** that survives
even after the registry dir is cleaned up. It is git-ignored.

```
.hermes/delegate_job_logs/<job_id>/
  meta.json        # registration snapshot (the full job record at register time)
  events.ndjson    # append-only, one JSON event per line, time-ordered
  status.json      # current status only (fast point-query)
```

`events.ndjson` lines are written automatically at four points:

| Trigger | line `event` | Source |
|---------|-------------|--------|
| `register_job` | `registered` | `registry.register_job` → `mqtt_common.init_job_log` |
| status change (`update_status`, `pick`, publish status sync) | `status_changed` (`from`/`to`) | `mqtt_common.update_job_status` / `pick_pending` |
| event published | `published` (embeds the exact payload) | `publish_event.py` |
| event received | `received` | `job_subscriber.py` |

Helpers live in [`./scripts/mqtt_common.py`](./scripts/mqtt_common.py):
`LOGS_DIR`, `job_log_path`, `init_job_log`, `append_event` (fcntl-locked,
concurrent-append safe), `update_logged_status`, and the readers
`read_logged_meta` / `read_logged_status` / `iter_logged_events` /
`list_logged_jobs`. Every writer is **best-effort and isolated** — wrapped in
`try/except` with a `logger.warning`, so an audit-log failure never breaks the
registry write, the publish, or the subscribe it shadows.

Read them via the CLI:

```bash
PY=.venv/bin/python
$PY scripts/registry.py logs <job_id>            # pretty timeline
$PY scripts/registry.py logs <job_id> --tail 20  # last 20 events
$PY scripts/registry.py logs <job_id> --json     # raw JSON lines
$PY scripts/registry.py logs --list              # every job, live status
```
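For readers who want to see what a best-effort, lock-protected append could look like, here is a rough sketch. The function names carry a `_sketch` suffix because they are not the helpers in `mqtt_common.py`; the explicit `logs_dir` argument and the boolean return value are assumptions, and the locking relies on POSIX `fcntl`.

```python
import fcntl
import json
import logging
import os

logger = logging.getLogger("audit_sketch")


def append_event_sketch(logs_dir, job_id, event):
    """Append one JSON line to <logs_dir>/<job_id>/events.ndjson; never raise."""
    try:
        job_dir = os.path.join(logs_dir, job_id)
        os.makedirs(job_dir, exist_ok=True)
        with open(os.path.join(job_dir, "events.ndjson"), "a", encoding="utf-8") as fh:
            fcntl.flock(fh, fcntl.LOCK_EX)  # serialise concurrent appenders
            try:
                fh.write(json.dumps(event, ensure_ascii=False) + "\n")
                fh.flush()
            finally:
                fcntl.flock(fh, fcntl.LOCK_UN)
        return True
    except (OSError, TypeError, ValueError) as exc:
        # Best-effort: an audit failure must not break the operation it shadows.
        logger.warning("audit log append failed for %s: %s", job_id, exc)
        return False


def iter_events_sketch(logs_dir, job_id):
    """Yield the logged events for one job in file order."""
    path = os.path.join(logs_dir, job_id, "events.ndjson")
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)
```

Swallowing the error and returning `False` is what keeps the audit log isolated: the registry write or publish that triggered the append carries on regardless.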
@@ -0,0 +1 @@
paho-mqtt>=2.0.0
@@ -0,0 +1,233 @@
#!/usr/bin/env python3
"""job_subscriber.py — the single entry point for observing Job events.

Subscribes to one job's ``<topic_prefix>/events`` (or, with ``--wait-any``, the
events of every running/pending job in the registry), prints one line to stdout
per accepted event, and exits on a terminal event or a timeout.

Design points (all flagged in the PLAN review):
  - terminal state machine: ``completed``/``error`` is acted on exactly once per
    job, so QoS-1 duplicates or an ``error``-after-``completed`` reorder are safe.
  - dual timeouts: a wall-clock ``--timeout`` (total budget, started at
    subscribe time so a cold start can't hang forever) AND an idle
    ``--idle-timeout`` (no new event for N seconds).
  - defensive parsing: undecodable payloads, ``schema_version`` mismatches, and
    ``job_id`` values we did not subscribe for are logged and dropped.

stdout = event lines only. Diagnostics go to stderr via logging.

Exit codes:
  0  all watched jobs reached ``completed``
  1  any watched job reached ``error``
  2  timed out (wall-clock or idle) before all jobs finished
"""
from __future__ import annotations

import argparse
import json
import logging
import queue
import sys
import time
from typing import Any, Dict, List, Optional, Set, Tuple

import mqtt_common
import registry
from mqtt_common import (
    DEFAULT_REGISTRY_DIR,
    SCHEMA_VERSION,
    broker_config_from_job,
    load_job,
    make_client,
)

logger = logging.getLogger("delegate_job.job_subscriber")

TERMINAL_EVENTS = ("completed", "error")


def _format_line(topic: str, payload: Dict[str, Any]) -> str:
    return (
        f"{payload.get('timestamp','-')} "
        f"job={payload.get('job_id','?')} "
        f"seq={payload.get('seq','?')} "
        f"{payload.get('event','?'):<20} "
        f"{payload.get('detail','')}"
    )


class _Watcher:
    """Holds the shared queue + the set of job_ids we accept events for."""

    def __init__(self, expected_job_ids: Set[str], expected_tokens: Dict[str, Optional[str]]):
        self.events: "queue.Queue[Tuple[str, Dict[str, Any]]]" = queue.Queue()
        self.expected = set(expected_job_ids)
        self.tokens = expected_tokens  # job_id -> expected auth_token (or None)

    def on_message(self, _client, _userdata, msg) -> None:
        # --- defensive parsing -------------------------------------------
        try:
            payload = json.loads(msg.payload.decode("utf-8"))
        except (UnicodeDecodeError, json.JSONDecodeError) as exc:
            logger.warning("drop unparseable payload on %s: %s", msg.topic, exc)
            return
        if not isinstance(payload, dict):
            logger.warning("drop non-object payload on %s", msg.topic)
            return
        if payload.get("schema_version") != SCHEMA_VERSION:
            logger.warning("drop event with schema_version=%r (expected %d)",
                           payload.get("schema_version"), SCHEMA_VERSION)
            return
        jid = payload.get("job_id")
        if jid not in self.expected:
            logger.warning("drop event for unexpected job_id=%r on %s", jid, msg.topic)
            return
        # --- production auth check: data.auth_token must match if expected ---
        expected_token = self.tokens.get(jid)
        if expected_token is not None:
            got = (payload.get("data") or {}).get("auth_token")
            if got != expected_token:
                logger.warning("drop event for job %s: auth_token mismatch", jid)
                return
        # Persistent audit log from the *subscriber's* vantage point: every event
        # that survives defensive parsing is recorded here, including ones a
        # different host published. This is the external-observer record that
        # backstops the publisher's own "published" line if it never wrote one.
        mqtt_common.append_event(jid, {
            "event": "received",
            "source_event": payload.get("event"),
            "seq": payload.get("seq"),
            "topic": msg.topic,
            "timestamp": payload.get("timestamp"),
            "detail": payload.get("detail", ""),
        })
        self.events.put((msg.topic, payload))


def _collect_jobs(args) -> List[Dict[str, Any]]:
    """Resolve the list of job records this invocation should watch."""
    if args.wait_any:
        jobs = [r for r in registry.list_jobs(args.registry_dir)
                if r.get("status") in ("pending", "running")]
        if not jobs:
            logger.error("no pending/running jobs to wait for")
        return jobs
    job = load_job(args.job, args.registry_dir)  # raises FileNotFoundError
    return [job]
|
||||||
|
|
||||||
|
|
||||||
|
def main(argv=None) -> int:
    parser = argparse.ArgumentParser(description="Subscribe to Job events on MQTT")
    target = parser.add_mutually_exclusive_group(required=True)
    target.add_argument("--job", help="job id to watch")
    target.add_argument("--wait-any", action="store_true",
                        help="watch every pending/running job in the registry")
    parser.add_argument("--timeout", type=float, default=None,
                        help="wall-clock budget in seconds (default: job.timeout_sec or 600)")
    parser.add_argument("--idle-timeout", type=float, default=None,
                        help="max seconds with no new event (default: job.idle_timeout_sec or 120)")
    parser.add_argument("--expect-retention", action="store_true",
                        help="warn if no retained terminal event arrives promptly")
    parser.add_argument("--registry-dir", default=DEFAULT_REGISTRY_DIR)
    parser.add_argument("-v", "--verbose", action="store_true")
    args = parser.parse_args(argv)

    mqtt_common.setup_logging(logging.DEBUG if args.verbose else logging.WARNING)

    try:
        jobs = _collect_jobs(args)
    except FileNotFoundError as exc:
        logger.error("%s", exc)
        return 2
    if not jobs:
        return 2

    expected_ids: Set[str] = {j["job_id"] for j in jobs}
    tokens = {j["job_id"]: j.get("auth_token") for j in jobs}
    watcher = _Watcher(expected_ids, tokens)

    # Resolve timeouts from CLI, falling back to the (first) job's settings.
    # `or` (rather than a .get default) also covers a record that stores null.
    base_job = jobs[0]
    wall_timeout = args.timeout if args.timeout is not None else float(base_job.get("timeout_sec") or 600)
    idle_timeout = args.idle_timeout if args.idle_timeout is not None else float(base_job.get("idle_timeout_sec") or 120)

    # All watched jobs share a broker in practice; connect using the first
    # job's broker and subscribe to each job's events topic.
    config = broker_config_from_job(base_job)
    client = make_client("subscriber", config)
    client.on_message = watcher.on_message

    subscribed_topics = []
    for job in jobs:
        prefix = job.get("topic_prefix") or mqtt_common.topic_prefix_for(job["job_id"])
        subscribed_topics.append(f"{prefix}/events")

    def on_connect(_c, _u, _flags, reason_code, _props):
        if mqtt_common.reason_code_value(reason_code) != 0:
            logger.error("broker connection failed: rc=%s", reason_code)
            return
        for topic in subscribed_topics:
            _c.subscribe(topic, qos=1)
            logger.info("subscribed to %s", topic)

    client.on_connect = on_connect
    client.connect(config.host, config.port, config.keepalive)
    client.loop_start()

    terminal: Dict[str, str] = {}  # job_id -> "completed"/"error"
    pending: Set[str] = set(expected_ids)
    start = time.monotonic()
    wall_deadline = start + wall_timeout
    last_event = start
    retention_checked = not args.expect_retention

    try:
        while pending:
            now = time.monotonic()
            if now >= wall_deadline:
                logger.error("wall-clock timeout (%.0fs); still pending: %s",
                             wall_timeout, ", ".join(sorted(pending)))
                return 2
            idle_left = idle_timeout - (now - last_event)
            if idle_left <= 0:
                logger.error("idle timeout (%.0fs, no events); still pending: %s",
                             idle_timeout, ", ".join(sorted(pending)))
                return 2
            wait = min(wall_deadline - now, idle_left, 1.0)
            try:
                topic, payload = watcher.events.get(timeout=wait)
            except queue.Empty:
                if not retention_checked and (now - start) > 3.0:
                    logger.warning("--expect-retention set but no retained "
                                   "terminal event observed yet")
                    retention_checked = True
                continue

            last_event = time.monotonic()
            retention_checked = True
            print(_format_line(topic, payload), flush=True)

            jid = payload["job_id"]
            event = payload.get("event")
            if event in TERMINAL_EVENTS:
                if jid in terminal:
                    # Already finalised: ignore duplicates / late reorders.
                    logger.info("ignoring duplicate terminal %s for %s", event, jid)
                    continue
                terminal[jid] = event
                pending.discard(jid)
    finally:
        client.loop_stop()
        try:
            client.disconnect()
        except Exception:  # pragma: no cover
            pass

    # All jobs reached a terminal state. error wins over completed.
    if any(state == "error" for state in terminal.values()):
        return 1
    return 0


if __name__ == "__main__":
    sys.exit(main())
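The subscriber's timeout handling can be read in isolation. The following is a minimal sketch of the two-deadline wait loop (a wall-clock budget plus an idle budget that any event resets), with the MQTT client removed. `wait_for_terminal`, its parameters, and the `(job_id, event)` tuple shape of the queue items are hypothetical names chosen for illustration and are not part of this commit; only the loop structure and the exit-code convention (0 completed, 1 error, 2 timeout) follow the code above. Queue items are assumed to be already validated.

```python
import queue
import time
from typing import Dict, Set, Tuple

TERMINAL_EVENTS = ("completed", "error")


def wait_for_terminal(events: "queue.Queue[Tuple[str, str]]", job_ids: Set[str],
                      wall_timeout: float, idle_timeout: float) -> int:
    """Hypothetical stand-in for the subscriber loop: 0 ok, 1 error, 2 timeout."""
    terminal: Dict[str, str] = {}
    pending = set(job_ids)
    start = time.monotonic()
    wall_deadline = start + wall_timeout
    last_event = start
    while pending:
        now = time.monotonic()
        idle_left = idle_timeout - (now - last_event)
        if now >= wall_deadline or idle_left <= 0:
            return 2
        try:
            # Wake at least once a second so both deadlines are re-checked.
            jid, event = events.get(timeout=min(wall_deadline - now, idle_left, 1.0))
        except queue.Empty:
            continue
        last_event = time.monotonic()  # any event resets the idle budget
        if event in TERMINAL_EVENTS and jid not in terminal:
            terminal[jid] = event  # first terminal event wins; duplicates ignored
            pending.discard(jid)
    return 1 if "error" in terminal.values() else 0
```

Feeding the queue from a test harness instead of a broker callback is one way to exercise the timeout paths without a network connection.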
@@ -0,0 +1,546 @@
"""Shared MQTT + registry helpers for the delegate-job skill.

Single entry point for:
  - broker configuration (env -> dataclass),
  - paho client construction (auth + TLS + unique client id),
  - monotonic per-job sequence counters,
  - retry-with-exponential-backoff,
  - atomic registry record load/update under an fcntl lock,
  - the persistent per-job audit log (meta.json, events.ndjson, status.json).

Requires paho-mqtt >= 2.0 (uses CallbackAPIVersion.VERSION2).

This module is the *only* place that talks to the broker config and to the
raw job record file, so PoC -> production migration touches just env/registry
values, never code (see references/mqtt-broker-setup.md).
"""
from __future__ import annotations

import functools
import json
import logging
import os
import tempfile
import time
import uuid
from contextlib import contextmanager
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Any, Callable, Dict, Iterable, List, Optional

import paho.mqtt.client as mqtt

logger = logging.getLogger("delegate_job.mqtt_common")

# --------------------------------------------------------------------------
# Constants
# --------------------------------------------------------------------------
SCHEMA_VERSION = 1
DEFAULT_REGISTRY_DIR = ".hermes/jobs"
DEFAULT_TOPIC_ROOT = "python/mqtt/jobs"
LOCK_FILENAME = ".lock"

# Persistent audit-log layout: .hermes/delegate_job_logs/<job_id>/{meta,events,status}.
# This is a *separate* artifact from the registry: the registry is the live job
# record (mutated in place), the audit log is an append-only history that
# survives even if the registry dir is cleaned up.
META_FILENAME = "meta.json"
EVENTS_FILENAME = "events.ndjson"
STATUS_FILENAME = "status.json"


def _default_logs_dir() -> str:
    """Audit-log root. Overridable with ``DELEGATE_JOB_LOGS_DIR``; otherwise
    ``<cwd>/.hermes/delegate_job_logs``. Audit logs are kept next to the
    live registry (``.hermes/jobs/``) so the two runtime artifacts sit
    under the same parent dir and follow the same ``.gitignore`` rule.
    The cwd of whichever process emits events (the bash wrapper and
    scripts) is used as the anchor."""
    env = os.environ.get("DELEGATE_JOB_LOGS_DIR")
    if env and env.strip():
        return env
    return os.path.join(os.getcwd(), ".hermes", "delegate_job_logs")


LOGS_DIR = _default_logs_dir()


# --------------------------------------------------------------------------
# Broker configuration
# --------------------------------------------------------------------------
@dataclass
class BrokerConfig:
    """Resolved broker connection settings.

    PoC defaults target the public HiveMQ broker. Production overrides arrive
    either from environment variables or from a job record's ``broker.*`` block
    (see ``broker_config_from_job``).
    """

    host: str = "broker.hivemq.com"
    port: int = 1883
    tls: bool = False
    username: Optional[str] = None
    password: Optional[str] = None
    client_id_prefix: str = "hermes"
    # TLS material (only consulted when tls is True).
    ca_certs: Optional[str] = None
    certfile: Optional[str] = None
    keyfile: Optional[str] = None
    keepalive: int = 60

    def to_dict(self) -> Dict[str, Any]:
        return asdict(self)

    def to_registry_block(self) -> Dict[str, Any]:
        """The subset that gets persisted into a job record's broker block."""
        return {
            "host": self.host,
            "port": self.port,
            "tls": self.tls,
            "username": self.username,
            "password": self.password,
        }


def _env_bool(name: str, default: bool = False) -> bool:
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in ("1", "true", "yes", "on")


def _env_int(name: str, default: int) -> int:
    raw = os.environ.get(name)
    if raw is None or raw.strip() == "":
        return default
    try:
        return int(raw)
    except ValueError:
        logger.warning("invalid int for %s=%r; using default %d", name, raw, default)
        return default


def broker_config_from_env(overrides: Optional[Dict[str, Any]] = None) -> BrokerConfig:
    """Build a :class:`BrokerConfig` from environment variables.

    Recognised vars (all optional, PoC defaults shown):
      MQTT_BROKER (broker.hivemq.com), MQTT_PORT (1883), MQTT_TLS (0),
      MQTT_USERNAME, MQTT_PASSWORD, MQTT_CLIENT_ID_PREFIX (hermes),
      MQTT_CA_CERTS, MQTT_CERTFILE, MQTT_KEYFILE, MQTT_KEEPALIVE (60).

    ``overrides`` (e.g. a job record's broker block) wins over the env values
    for any key it specifies with a non-None value.
    """
    cfg = BrokerConfig(
        host=os.environ.get("MQTT_BROKER", "broker.hivemq.com"),
        port=_env_int("MQTT_PORT", 1883),
        tls=_env_bool("MQTT_TLS", False),
        username=os.environ.get("MQTT_USERNAME") or None,
        password=os.environ.get("MQTT_PASSWORD") or None,
        client_id_prefix=os.environ.get("MQTT_CLIENT_ID_PREFIX", "hermes"),
        ca_certs=os.environ.get("MQTT_CA_CERTS") or None,
        certfile=os.environ.get("MQTT_CERTFILE") or None,
        keyfile=os.environ.get("MQTT_KEYFILE") or None,
        keepalive=_env_int("MQTT_KEEPALIVE", 60),
    )
    if overrides:
        for key, value in overrides.items():
            if value is not None and hasattr(cfg, key):
                setattr(cfg, key, value)
    return cfg


def broker_config_from_job(job: Dict[str, Any]) -> BrokerConfig:
    """Resolve broker config for a job: env defaults, then the job's broker.*
    block overrides. This lets ``publish_event.py`` connect from the registry
    alone. Environment values still apply to any key the block omits or sets
    to null (e.g. MQTT_CA_CERTS, MQTT_KEEPALIVE); a key the block sets
    explicitly, including ``tls: false``, takes precedence over the env."""
    return broker_config_from_env(overrides=job.get("broker") or {})


def make_client(role: str, config: Optional[BrokerConfig] = None) -> mqtt.Client:
    """Return a configured paho ``Client`` (not yet connected).

    The client id is ``f"{prefix}-{role}-{uuid8}"`` so concurrent publishers /
    subscribers never collide on the broker. Auth and TLS are applied when the
    config supplies them.
    """
    config = config or broker_config_from_env()
    client_id = f"{config.client_id_prefix}-{role}-{uuid.uuid4().hex[:8]}"
    client = mqtt.Client(
        callback_api_version=mqtt.CallbackAPIVersion.VERSION2,
        client_id=client_id,
    )
    if config.username:
        client.username_pw_set(config.username, config.password)
    if config.tls:
        # If ca_certs is None paho uses the system trust store (good enough for
        # public CAs); a private CA bundle path is passed through unchanged.
        client.tls_set(
            ca_certs=config.ca_certs,
            certfile=config.certfile,
            keyfile=config.keyfile,
        )
    logger.debug("built client id=%s tls=%s host=%s", client_id, config.tls, config.host)
    return client


def reason_code_value(rc: Any) -> int:
    """Normalise a paho v2 connect reason code to an int.

    paho-mqtt 2.x hands callbacks a ``ReasonCode`` object (not an int); older
    paths may pass a plain int. ``ReasonCode`` exposes ``.value``; 0 == success.
    """
    return int(getattr(rc, "value", rc))


def topic_prefix_for(job_id: str, root: str = DEFAULT_TOPIC_ROOT) -> str:
    return f"{root}/{job_id}"


def events_topic_for(job_id: str, root: str = DEFAULT_TOPIC_ROOT) -> str:
    return f"{topic_prefix_for(job_id, root)}/events"


# --------------------------------------------------------------------------
# Registry primitives (single source of truth for raw record I/O)
# --------------------------------------------------------------------------
def _job_path(job_id: str, registry_dir: str) -> Path:
    return Path(registry_dir) / f"{job_id}.json"


def _lock_path(registry_dir: str) -> Path:
    return Path(registry_dir) / LOCK_FILENAME


@contextmanager
def registry_lock(registry_dir: str):
    """Advisory exclusive lock over the whole registry dir via fcntl.

    PoC-grade single-host concurrency control. Multiple tmux sessions / scripts
    serialise their read-modify-write of job records through this lock so two
    sessions never claim the same pending job. For multi-host delegation move
    to SQLite WAL (see references/registry.md)."""
    import fcntl  # POSIX only; imported lazily so import works on Windows.

    Path(registry_dir).mkdir(parents=True, exist_ok=True)
    lock_file = _lock_path(registry_dir)
    fh = open(lock_file, "a+")
    try:
        fcntl.flock(fh.fileno(), fcntl.LOCK_EX)
        yield
    finally:
        try:
            fcntl.flock(fh.fileno(), fcntl.LOCK_UN)
        finally:
            fh.close()


def load_job(job_id: str, registry_dir: str = DEFAULT_REGISTRY_DIR) -> Dict[str, Any]:
    """Load and parse a job record. Raises FileNotFoundError if absent."""
    path = _job_path(job_id, registry_dir)
    if not path.exists():
        raise FileNotFoundError(f"job record not found: {path}")
    with open(path, "r", encoding="utf-8") as fh:
        return json.load(fh)


def _atomic_write_record(job_id: str, registry_dir: str, record: Dict[str, Any]) -> None:
    """Write a record atomically: temp file in the same dir + os.replace.

    The rename is atomic on POSIX, so readers never observe a half-written
    file. Callers MUST already hold ``registry_lock`` for read-modify-write
    correctness."""
    Path(registry_dir).mkdir(parents=True, exist_ok=True)
    path = _job_path(job_id, registry_dir)
    fd, tmp = tempfile.mkstemp(dir=str(path.parent), prefix=f".{job_id}.", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(record, fh, ensure_ascii=False, indent=2)
            fh.write("\n")
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp, path)
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


def update_job_status(job_id: str, registry_dir: str = DEFAULT_REGISTRY_DIR, **fields: Any) -> Dict[str, Any]:
    """Atomically merge ``fields`` into a job record under the registry lock.

    Always refreshes ``updated_at``. Returns the new record. Raises
    FileNotFoundError if the job does not exist.

    This is the single chokepoint for status writes (both ``registry.update_status``
    and ``publish_event.py``'s status sync route through here), so it also mirrors
    any ``status`` change into the persistent audit log. The mirror is best-effort
    and runs after the registry lock is released, so a slow/failed log write never
    blocks the record."""
    with registry_lock(registry_dir):
        record = load_job(job_id, registry_dir)
        old_status = record.get("status")
        record.update(fields)
        record["updated_at"] = _utcnow()
        _atomic_write_record(job_id, registry_dir, record)
    if "status" in fields:
        new_status = record.get("status")
        update_logged_status(job_id, new_status, updated_at=record["updated_at"])
        if old_status != new_status:
            append_event(job_id, {
                "event": "status_changed",
                "from": old_status,
                "to": new_status,
                "timestamp": record["updated_at"],
            })
    return record


def next_seq(job_id: str, registry_dir: str = DEFAULT_REGISTRY_DIR) -> int:
    """Return the next monotonic sequence number for a job, persisted in the
    record's ``last_seq`` field so it stays consistent across process restarts.
    First call returns 1."""
    with registry_lock(registry_dir):
        record = load_job(job_id, registry_dir)
        seq = int(record.get("last_seq", 0)) + 1
        record["last_seq"] = seq
        record["updated_at"] = _utcnow()
        _atomic_write_record(job_id, registry_dir, record)
    return seq


def _utcnow() -> str:
    """ISO-8601 UTC timestamp with trailing Z (payload `timestamp` field)."""
    return time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())


def _utcnow_precise() -> str:
    """ISO-8601 UTC timestamp with millisecond resolution. Used for the audit
    log's ``logged_at`` so events sort cleanly even within the same second."""
    now = time.time()
    base = time.strftime("%Y-%m-%dT%H:%M:%S", time.gmtime(now))
    return f"{base}.{int((now % 1) * 1000):03d}Z"


# --------------------------------------------------------------------------
# Persistent audit log (.hermes/delegate_job_logs/<job_id>/...)
#
# Every function here is idempotent, concurrency-safe, and *best-effort*: a
# logging failure is swallowed with a logger.warning and never propagated, so it
# can never break a publish, a subscribe, or a registry write. stdout is never
# touched (it is reserved for data output).
# --------------------------------------------------------------------------
def job_log_dir(job_id: str, logs_dir: Optional[str] = None) -> Path:
    return Path(logs_dir or LOGS_DIR) / job_id


def job_log_path(job_id: str, kind: str, logs_dir: Optional[str] = None) -> Path:
    """Path to one audit-log file for a job. ``kind`` is a filename, e.g. the
    module constants META_FILENAME / EVENTS_FILENAME / STATUS_FILENAME."""
    return job_log_dir(job_id, logs_dir) / kind


@contextmanager
def _file_lock(fh):
    """Best-effort exclusive lock over a single open file via fcntl, so two
    processes appending to events.ndjson never interleave a line. A no-op where
    fcntl is unavailable (Windows); a short append is atomic enough there."""
    try:
        import fcntl
    except ImportError:  # pragma: no cover - non-POSIX
        yield
        return
    fcntl.flock(fh.fileno(), fcntl.LOCK_EX)
    try:
        yield
    finally:
        fcntl.flock(fh.fileno(), fcntl.LOCK_UN)


def append_event(job_id: str, event_dict: Dict[str, Any], logs_dir: Optional[str] = None) -> None:
    """Append one event as a JSON line to ``<logs>/<job_id>/events.ndjson``.

    Concurrency-safe (fcntl lock over the file) and best-effort. A millisecond
    ``logged_at`` is stamped when the caller did not supply one."""
    try:
        path = job_log_path(job_id, EVENTS_FILENAME, logs_dir)
        path.parent.mkdir(parents=True, exist_ok=True)
        record = dict(event_dict)
        record.setdefault("logged_at", _utcnow_precise())
        line = json.dumps(record, ensure_ascii=False) + "\n"
        with open(path, "a", encoding="utf-8") as fh:
            with _file_lock(fh):
                fh.write(line)
                fh.flush()
    except Exception as exc:  # pragma: no cover - best effort
        logger.warning("append_event failed for job %s: %s", job_id, exc)


def update_logged_status(job_id: str, status: str, logs_dir: Optional[str] = None, **extras: Any) -> None:
    """Rewrite ``<logs>/<job_id>/status.json`` (current status for fast point
    queries) atomically. Best-effort; merges any ``extras``."""
    try:
        path = job_log_path(job_id, STATUS_FILENAME, logs_dir)
        path.parent.mkdir(parents=True, exist_ok=True)
        record: Dict[str, Any] = {"job_id": job_id, "status": status, "updated_at": _utcnow()}
        record.update(extras)
        # Use a unique temp file per writer (same approach as
        # _atomic_write_record): this function runs outside the registry lock,
        # so a shared "status.json.tmp" could be clobbered by a concurrent writer.
        fd, tmp = tempfile.mkstemp(dir=str(path.parent), prefix=f".{path.name}.", suffix=".tmp")
        try:
            with os.fdopen(fd, "w", encoding="utf-8") as fh:
                json.dump(record, fh, ensure_ascii=False, indent=2)
                fh.write("\n")
            os.replace(tmp, path)
        except BaseException:
            if os.path.exists(tmp):
                os.unlink(tmp)
            raise
    except Exception as exc:  # pragma: no cover - best effort
        logger.warning("update_logged_status failed for job %s: %s", job_id, exc)


def init_job_log(job_id: str, meta: Dict[str, Any], logs_dir: Optional[str] = None) -> None:
    """Seed the per-job audit-log dir: write meta.json, status.json, and a first
    ``registered`` line in events.ndjson. Idempotent (the ``registered`` line is
    written only when events.ndjson does not yet exist) and best-effort."""
    try:
        d = job_log_dir(job_id, logs_dir)
        d.mkdir(parents=True, exist_ok=True)
        with open(d / META_FILENAME, "w", encoding="utf-8") as fh:
            json.dump(meta, fh, ensure_ascii=False, indent=2)
            fh.write("\n")
        status = meta.get("status", "pending")
        update_logged_status(
            job_id, status, logs_dir=logs_dir,
            created_at=meta.get("created_at"), prompt=meta.get("prompt"),
        )
        events_path = d / EVENTS_FILENAME
        first_time = not events_path.exists()
        events_path.touch(exist_ok=True)
        if first_time:
            append_event(job_id, {
                "event": "registered",
                "status": status,
                "agent": meta.get("agent"),
                "agent_session": meta.get("agent_session"),
                "topic_prefix": meta.get("topic_prefix"),
                "timestamp": meta.get("created_at"),
            }, logs_dir=logs_dir)
    except Exception as exc:  # pragma: no cover - best effort
        logger.warning("init_job_log failed for job %s: %s", job_id, exc)


def read_logged_meta(job_id: str, logs_dir: Optional[str] = None) -> Optional[Dict[str, Any]]:
    """Return a job's audit meta.json (registration snapshot), or None."""
    try:
        with open(job_log_path(job_id, META_FILENAME, logs_dir), "r", encoding="utf-8") as fh:
            return json.load(fh)
    except (OSError, json.JSONDecodeError):
        return None


def read_logged_status(job_id: str, logs_dir: Optional[str] = None) -> Optional[Dict[str, Any]]:
    """Return a job's current status.json, or None. This is the fast point-query
    file (current status only), separate from the registration-time meta.json."""
    try:
        with open(job_log_path(job_id, STATUS_FILENAME, logs_dir), "r", encoding="utf-8") as fh:
            return json.load(fh)
    except (OSError, json.JSONDecodeError):
        return None


def iter_logged_events(job_id: str, logs_dir: Optional[str] = None):
    """Yield each parsed event from a job's events.ndjson in file (time) order.
    Malformed lines are skipped with a warning."""
    path = job_log_path(job_id, EVENTS_FILENAME, logs_dir)
    if not path.exists():
        return
    with open(path, "r", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                logger.warning("skipping malformed audit line in %s", path)


def list_logged_jobs(logs_dir: Optional[str] = None) -> List[Dict[str, Any]]:
    """Return one meta record per job directory under the logs root, oldest
    first. Falls back to ``{"job_id": <dir>}`` when meta.json is missing."""
    base = Path(logs_dir or LOGS_DIR)
    out: List[Dict[str, Any]] = []
    if not base.exists():
        return out
    for d in sorted(base.iterdir()):
        if not d.is_dir():
            continue
        meta = read_logged_meta(d.name, logs_dir) or {"job_id": d.name}
        # Overlay the live status.json so the summary reflects current state, not
        # the registration-time snapshot frozen in meta.json.
        status = read_logged_status(d.name, logs_dir)
        if status:
            meta = {**meta,
                    "status": status.get("status", meta.get("status")),
                    "updated_at": status.get("updated_at", meta.get("updated_at"))}
        out.append(meta)
    out.sort(key=lambda m: m.get("created_at") or "")
    return out


# --------------------------------------------------------------------------
# Retry helper
# --------------------------------------------------------------------------
def with_retry(
    fn: Optional[Callable] = None,
    *,
    attempts: int = 3,
    base_delay: float = 0.5,
    factor: float = 2.0,
    max_delay: float = 8.0,
    exceptions: Iterable[type] = (Exception,),
) -> Callable:
    """Retry ``fn`` with exponential backoff.

    Usable two ways::

        result = with_retry(do_publish, attempts=3)()   # wrap-and-call
        @with_retry(attempts=5, base_delay=1.0)         # decorator
        def do_publish(): ...

    Re-raises the last exception once ``attempts`` is exhausted.
    """
    exc_tuple = tuple(exceptions)

    def decorate(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            delay = base_delay
            last_exc: Optional[BaseException] = None
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exc_tuple as exc:
                    last_exc = exc
                    if attempt >= attempts:
                        break
                    logger.warning(
                        "attempt %d/%d failed: %s; retrying in %.1fs",
                        attempt, attempts, exc, delay,
                    )
                    time.sleep(delay)
                    delay = min(delay * factor, max_delay)
            assert last_exc is not None
            raise last_exc

        return wrapper

    if fn is not None:
        return decorate(fn)
    return decorate


def setup_logging(level: int = logging.WARNING) -> None:
    """Configure root logging to stderr. stdout is reserved for data output
    (subscriber event lines, registry ids)."""
    import sys

    logging.basicConfig(
        level=level,
        stream=sys.stderr,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )
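To make the retry timing of `with_retry` concrete, the sketch below computes the sleep schedule its loop would produce for a given set of parameters. `backoff_delays` is a hypothetical helper written for illustration and is not part of this commit; it assumes every attempt fails and mirrors only the `delay = min(delay * factor, max_delay)` update rule, with no sleep after the final attempt.

```python
from typing import List


def backoff_delays(attempts: int = 3, base_delay: float = 0.5,
                   factor: float = 2.0, max_delay: float = 8.0) -> List[float]:
    """Sleeps taken between attempts (one fewer than the number of attempts)."""
    delays: List[float] = []
    delay = base_delay
    for _ in range(max(attempts - 1, 0)):
        delays.append(delay)
        delay = min(delay * factor, max_delay)  # grow, but cap at max_delay
    return delays


print(backoff_delays())            # [0.5, 1.0]
print(backoff_delays(attempts=7))  # [0.5, 1.0, 2.0, 4.0, 8.0, 8.0]
```

With the default parameters, a call that fails on all three attempts therefore spends about 1.5 seconds sleeping before the last exception is re-raised.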
Executable
+225
@@ -0,0 +1,225 @@
#!/usr/bin/env python3
"""publish_event.py: the single entry point for emitting a Job event.

Loads the job record from the registry, resolves its broker, assigns the next
monotonic ``seq``, builds the schema-v1 JSON payload, and publishes it to
``<topic_prefix>/events`` over QoS 1 with exponential-backoff retry.

Silent by design: nothing is printed to stdout. Diagnostics go to stderr via
logging. Terminal events (``completed``/``error``) publish with retain=True so
a late subscriber still observes the final state (production hardening).

Exit codes:
  0  published successfully
  1  parameter / registry error (bad args, unknown job, no pending job)
  2  publish failed after retries (network / broker / ACK timeout)

Usage:
  publish_event.py --job <id> --event started [--detail "..."] [--data '{...}']
  publish_event.py --pick-pending --agent-session tmux:claude --event completed
  publish_event.py --job <id> --event completed --retained
"""
from __future__ import annotations

import argparse
import json
import logging
import sys
import time
from typing import Any, Dict, Optional

import mqtt_common
import registry
from mqtt_common import (
    DEFAULT_REGISTRY_DIR,
    SCHEMA_VERSION,
    broker_config_from_job,
    events_topic_for,
    load_job,
    make_client,
    next_seq,
    with_retry,
)

logger = logging.getLogger("delegate_job.publish_event")

VALID_EVENTS = ("started", "permission_required", "progress", "completed", "error")
|
||||||
|
TERMINAL_EVENTS = ("completed", "error")
|
||||||
|
# event -> registry status to sync as a best-effort side effect
|
||||||
|
EVENT_TO_STATUS = {
|
||||||
|
"started": "running",
|
||||||
|
"completed": "completed",
|
||||||
|
"error": "error",
|
||||||
|
}
|
||||||
|
|
||||||
|
CONNECT_ACK_TIMEOUT = 10 # seconds to wait for CONNACK
|
||||||
|
PUBLISH_ACK_TIMEOUT = 5 # seconds to wait for QoS-1 PUBACK
|
||||||
|
|
||||||
|
|
||||||
|
def build_payload(
    job_id: str,
    seq: int,
    event: str,
    detail: str,
    data: Optional[Dict[str, Any]],
    auth_token: Optional[str],
) -> Dict[str, Any]:
    """Assemble the schema-versioned event payload that goes on the wire."""
    payload: Dict[str, Any] = {
        "schema_version": SCHEMA_VERSION,
        "seq": seq,
        "job_id": job_id,
        "event": event,
        "timestamp": mqtt_common._utcnow(),
        "detail": detail,
        "data": dict(data) if data else {},
    }
    # Production: carry the per-job auth token so the subscriber can verify the
    # publisher. The token is compared in plain text (bearer-token style) by the
    # subscriber; it is NOT an HMAC. See SKILL.md "Auth token" and PLAN 8.2. The
    # registry stores the per-job token in `auth_token`; only include it on the
    # wire when set so the public broker (no auth) doesn't leak anything.
    if auth_token:
        payload["data"]["auth_token"] = auth_token
    return payload


def _publish_once(config, topic: str, body: bytes, retain: bool) -> None:
    """Connect, publish one QoS-1 message, wait for the broker ACK, disconnect.

    Raises on any failure so ``with_retry`` can re-run the whole sequence (a
    fresh connection per attempt is the robust choice for a PoC)."""
    client = make_client("publisher", config)
    connected = {"rc": None}

    def on_connect(_c, _u, _flags, reason_code, _props):
        connected["rc"] = reason_code

    client.on_connect = on_connect
    client.connect(config.host, config.port, config.keepalive)
    client.loop_start()
    try:
        # Wait for CONNACK so we fail fast on auth/TLS errors.
        deadline = time.monotonic() + CONNECT_ACK_TIMEOUT
        while connected["rc"] is None and time.monotonic() < deadline:
            time.sleep(0.05)
        if connected["rc"] is None:
            raise TimeoutError("no CONNACK from broker")
        if mqtt_common.reason_code_value(connected["rc"]) != 0:
            raise ConnectionError(f"broker refused connection: rc={connected['rc']}")

        info = client.publish(topic, payload=body, qos=1, retain=retain)
        info.wait_for_publish(timeout=PUBLISH_ACK_TIMEOUT)
        if not info.is_published():
            raise TimeoutError("publish not acknowledged within timeout")
    finally:
        client.loop_stop()
        try:
            client.disconnect()
        except Exception:  # pragma: no cover - disconnect best effort
            pass


def _resolve_job_id(args) -> Optional[str]:
    """Return the job id to publish for: claim a pending job for the session
    when --pick-pending is set, otherwise use the explicit --job value."""
    if args.pick_pending:
        return registry.pick_pending(args.agent_session, args.registry_dir)
    return args.job


def main(argv=None) -> int:
    parser = argparse.ArgumentParser(description="Publish a Job event to MQTT")
    target = parser.add_mutually_exclusive_group(required=True)
    target.add_argument("--job", help="job id to publish for")
    target.add_argument("--pick-pending", action="store_true",
                        help="auto-select a pending job for --agent-session")
    parser.add_argument("--agent-session", default="tmux:claude",
                        help="session label used with --pick-pending")
    parser.add_argument("--event", default="progress", choices=VALID_EVENTS)
    parser.add_argument("--detail", default="")
    parser.add_argument("--data", default=None, help="optional JSON object string")
    parser.add_argument("--retained", action="store_true",
                        help="force retain=True (auto for completed/error)")
    parser.add_argument("--registry-dir", default=DEFAULT_REGISTRY_DIR)
    parser.add_argument("--attempts", type=int, default=3)
    parser.add_argument("-v", "--verbose", action="store_true")
    args = parser.parse_args(argv)

    mqtt_common.setup_logging(logging.DEBUG if args.verbose else logging.WARNING)

    # --- parse optional data JSON (parameter error -> exit 1) ---
    data: Optional[Dict[str, Any]] = None
    if args.data:
        try:
            data = json.loads(args.data)
            if not isinstance(data, dict):
                raise ValueError("--data must be a JSON object")
        except (ValueError, json.JSONDecodeError) as exc:
            logger.error("invalid --data: %s", exc)
            return 1

    job_id = _resolve_job_id(args)
    if not job_id:
        logger.error("no job to publish for (unknown --job or no pending job)")
        return 1

    try:
        job = load_job(job_id, args.registry_dir)
    except FileNotFoundError as exc:
        logger.error("%s", exc)
        return 1

    config = broker_config_from_job(job)
    topic = job.get("topic_prefix")
    topic = f"{topic}/events" if topic else events_topic_for(job_id)
    seq = next_seq(job_id, args.registry_dir)
    payload = build_payload(
        job_id=job_id,
        seq=seq,
        event=args.event,
        detail=args.detail,
        data=data,
        auth_token=job.get("auth_token"),
    )
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    retain = args.retained or args.event in TERMINAL_EVENTS

    publish = with_retry(
        _publish_once,
        attempts=args.attempts,
        exceptions=(OSError, TimeoutError, ConnectionError, ValueError),
    )
    try:
        publish(config, topic, body, retain)
    except Exception as exc:
        logger.error("publish failed after %d attempts: %s", args.attempts, exc)
        return 2

    # Persistent audit log: record the exact payload we put on the wire so the
    # publish is reproducible from the log alone. Best-effort (isolated inside
    # append_event); it never fails the publish.
    mqtt_common.append_event(job_id, {
        "event": "published",
        "source_event": args.event,
        "seq": seq,
        "topic": topic,
        "retain": retain,
        "timestamp": payload["timestamp"],
        "detail": args.detail,
        "payload": payload,
    })

    # Best-effort side effects: registry status sync + (debug) event log. Never
    # fail the publish on these.
    registry.append_event(job_id, args.registry_dir, payload)
    new_status = EVENT_TO_STATUS.get(args.event)
    if new_status:
        try:
            mqtt_common.update_job_status(job_id, args.registry_dir, status=new_status)
        except Exception as exc:  # pragma: no cover - best effort
            logger.warning("status sync failed: %s", exc)

    logger.info("published %s seq=%d job=%s retain=%s", args.event, seq, job_id, retain)
    return 0


if __name__ == "__main__":
    sys.exit(main())
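The publisher above wraps `_publish_once` in `with_retry(fn, attempts=..., exceptions=...)`, whose body lives in `mqtt_common` and is not shown in this diff. As a rough illustration of what an exponential-backoff wrapper with that call shape might look like, here is a minimal sketch. The name `with_retry_sketch`, the `base_delay` and `sleep` parameters, and the doubling schedule are all assumptions made for this example and are not taken from the actual module.

```python
import time
from typing import Callable, Tuple, Type


def with_retry_sketch(
    fn: Callable,
    attempts: int = 3,
    exceptions: Tuple[Type[BaseException], ...] = (OSError,),
    base_delay: float = 0.5,          # hypothetical: first backoff delay in seconds
    sleep: Callable[[float], None] = time.sleep,  # injectable so demos need not wait
) -> Callable:
    """Wrap ``fn`` so the listed exceptions trigger a retry with doubling delays."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")

    def wrapper(*args, **kwargs):
        for attempt in range(1, attempts + 1):
            try:
                return fn(*args, **kwargs)
            except exceptions:
                if attempt == attempts:
                    raise  # out of attempts: surface the last error to the caller
                # Delay doubles each round: base, 2*base, 4*base, ...
                sleep(base_delay * 2 ** (attempt - 1))

    return wrapper


# Demo: a publish stand-in that fails twice, then succeeds on the third attempt.
delays = []
calls = {"n": 0}


def flaky_publish():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("publish not acknowledged within timeout")
    return "published"


publish = with_retry_sketch(
    flaky_publish,
    attempts=3,
    exceptions=(OSError, TimeoutError),
    sleep=delays.append,  # record the delays instead of sleeping
)
result = publish()
```

Because each attempt in `_publish_once` opens a fresh connection, a wrapper of this kind only needs to re-invoke the callable; it does not have to carry any client state between attempts.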
Executable
+327
@@ -0,0 +1,327 @@
"""Job registry for the delegate-job skill.

A job record is the single source of truth for one delegated unit of work:
its id, prompt, owning agent session, broker connection, timeouts, and status.
Records live as ``<registry_dir>/<job_id>.json`` with an append-only event log
``<registry_dir>/<job_id>.events.log`` and a shared ``<registry_dir>/.lock``.

Concurrency is handled via the fcntl lock in :mod:`mqtt_common` (PoC). For
multi-host delegation, migrate to SQLite WAL (see references/registry.md).

Importable as a library and runnable as a CLI (``register``/``list``/``get``/
``status``/``pick``/``logs``) so the ``delegate-job`` bash wrapper can shell out.
"""
from __future__ import annotations

import argparse
import json
import logging
import sys
import uuid
from pathlib import Path
from typing import Any, Dict, List, Optional

import mqtt_common
from mqtt_common import (
    DEFAULT_REGISTRY_DIR,
    SCHEMA_VERSION,
    _atomic_write_record,
    _utcnow,
    broker_config_from_env,
    load_job,
    registry_lock,
    topic_prefix_for,
)

logger = logging.getLogger("delegate_job.registry")

TERMINAL_STATUSES = ("completed", "error", "cancelled")
VALID_STATUSES = ("pending", "running", "completed", "error", "cancelled")


def generate_job_id(bits: int = 32) -> str:
    """PoC: 32-bit hex (8 chars). Production: 128-bit (full uuid4 hex)."""
    if bits >= 128:
        return uuid.uuid4().hex
    nibbles = max(1, bits // 4)
    return uuid.uuid4().hex[:nibbles]


def register_job(
    prompt: str,
    agent: str = "claude-code",
    agent_session: str = "tmux:claude",
    broker: Optional[Dict[str, Any]] = None,
    timeout_sec: int = 600,
    idle_timeout_sec: int = 120,
    registry_dir: str = DEFAULT_REGISTRY_DIR,
    job_id: Optional[str] = None,
    expected_artifacts: Optional[List[str]] = None,
    bits: int = 32,
    auth_token: Optional[str] = None,
) -> str:
    """Create a new ``pending`` job record and return its id.

    ``broker`` defaults to the current environment's resolved broker block, so
    the registry alone is enough for ``publish_event.py`` to connect later.
    """
    job_id = job_id or generate_job_id(bits)
    if broker is None:
        broker = broker_config_from_env().to_registry_block()
    now = _utcnow()
    record: Dict[str, Any] = {
        "schema_version": SCHEMA_VERSION,
        "job_id": job_id,
        "status": "pending",
        "created_at": now,
        "updated_at": now,
        "prompt": prompt,
        "agent": agent,
        "agent_session": agent_session,
        "broker": broker,
        "topic_prefix": topic_prefix_for(job_id),
        "timeout_sec": int(timeout_sec),
        "idle_timeout_sec": int(idle_timeout_sec),
        "expected_artifacts": expected_artifacts or [],
        "last_seq": 0,
        "auth_token": auth_token,
    }
    with registry_lock(registry_dir):
        if mqtt_common._job_path(job_id, registry_dir).exists():
            raise FileExistsError(f"job already exists: {job_id}")
        _atomic_write_record(job_id, registry_dir, record)
    # Seed the persistent audit log (meta.json + status.json + a "registered"
    # event). Best-effort inside init_job_log; it never blocks registration.
    mqtt_common.init_job_log(job_id, meta=record)
    logger.info("registered job %s (agent=%s session=%s)", job_id, agent, agent_session)
    return job_id


def pick_pending(agent_session: str, registry_dir: str = DEFAULT_REGISTRY_DIR) -> Optional[str]:
    """Claim the oldest ``pending`` job for ``agent_session``, flipping it to
    ``running`` atomically under the lock. Returns the job id, or None if no
    pending job matches. This is how each tmux session takes only its own work
    without two sessions grabbing the same job."""
    with registry_lock(registry_dir):
        candidates = []
        for record in _iter_records(registry_dir):
            if record.get("status") == "pending" and record.get("agent_session") == agent_session:
                candidates.append(record)
        if not candidates:
            return None
        candidates.sort(key=lambda r: r.get("created_at", ""))
        chosen = candidates[0]
        chosen["status"] = "running"
        chosen["updated_at"] = _utcnow()
        _atomic_write_record(chosen["job_id"], registry_dir, chosen)
        logger.info("session %s picked job %s", agent_session, chosen["job_id"])
        job_id = chosen["job_id"]
        updated_at = chosen["updated_at"]
    # pick_pending writes the record directly (not via update_job_status), so it
    # mirrors the pending->running transition into the audit log here. Best-effort.
    mqtt_common.update_logged_status(job_id, "running", updated_at=updated_at)
    mqtt_common.append_event(job_id, {
        "event": "status_changed",
        "from": "pending",
        "to": "running",
        "by": agent_session,
        "timestamp": updated_at,
    })
    return job_id


def update_status(job_id: str, registry_dir: str, status: str) -> Dict[str, Any]:
    """Validate ``status`` against VALID_STATUSES, then persist it via
    ``mqtt_common.update_job_status``."""
    if status not in VALID_STATUSES:
        raise ValueError(f"invalid status {status!r}; expected one of {VALID_STATUSES}")
    return mqtt_common.update_job_status(job_id, registry_dir, status=status)


def list_jobs(registry_dir: str = DEFAULT_REGISTRY_DIR, status: Optional[str] = None) -> List[Dict[str, Any]]:
    """Return job records sorted by ``created_at``, optionally filtered by status."""
    records = list(_iter_records(registry_dir))
    if status:
        records = [r for r in records if r.get("status") == status]
    records.sort(key=lambda r: r.get("created_at", ""))
    return records


def append_event(job_id: str, registry_dir: str, payload: Dict[str, Any]) -> None:
    """Append one event payload as a JSON line to the job's events log. Best
    effort, debug-only; failures are logged but never raised to the caller."""
    try:
        Path(registry_dir).mkdir(parents=True, exist_ok=True)
        log_path = Path(registry_dir) / f"{job_id}.events.log"
        with open(log_path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(payload, ensure_ascii=False) + "\n")
    except OSError as exc:  # pragma: no cover - best effort
        logger.warning("could not append event for %s: %s", job_id, exc)


# convenience re-export so callers can `from registry import load_job`
__all__ = [
    "register_job", "pick_pending", "update_status", "load_job",
    "list_jobs", "append_event", "generate_job_id",
]


def _iter_records(registry_dir: str):
    """Yield every readable ``*.json`` record under ``registry_dir``; unreadable
    or malformed files are logged and skipped."""
    base = Path(registry_dir)
    if not base.exists():
        return
    for path in sorted(base.glob("*.json")):
        try:
            with open(path, "r", encoding="utf-8") as fh:
                yield json.load(fh)
        except (OSError, json.JSONDecodeError) as exc:
            logger.warning("skipping unreadable record %s: %s", path, exc)


# --------------------------------------------------------------------------
# CLI (so the bash wrapper can shell out without inline python)
# --------------------------------------------------------------------------
def _build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="delegate-job registry CLI")
    parser.add_argument("--registry-dir", default=DEFAULT_REGISTRY_DIR)
    sub = parser.add_subparsers(dest="command", required=True)

    p_reg = sub.add_parser("register", help="create a pending job; prints the job id")
    p_reg.add_argument("--prompt", required=True)
    p_reg.add_argument("--agent", default="claude-code")
    p_reg.add_argument("--agent-session", default="tmux:claude")
    p_reg.add_argument("--timeout", type=int, default=600)
    p_reg.add_argument("--idle-timeout", type=int, default=120)
    p_reg.add_argument("--bits", type=int, default=32, help="32 (PoC) or 128 (prod)")
    p_reg.add_argument("--artifact", action="append", default=[], dest="artifacts")

    p_list = sub.add_parser("list", help="list jobs (optionally by status)")
    p_list.add_argument("--status", default=None)
    p_list.add_argument("--json", action="store_true")

    p_get = sub.add_parser("get", help="print one job record as JSON")
    p_get.add_argument("--job", required=True)

    p_status = sub.add_parser("status", help="set a job status")
    p_status.add_argument("--job", required=True)
    p_status.add_argument("--set", required=True, dest="status")

    p_pick = sub.add_parser("pick", help="claim a pending job for a session; prints id")
    p_pick.add_argument("--agent-session", default="tmux:claude")

    p_logs = sub.add_parser(
        "logs",
        help="show the persistent audit log for a job, or --list every logged job",
    )
    p_logs.add_argument("job_id", nargs="?", default=None,
                        help="job id whose events.ndjson to print")
    p_logs.add_argument("--list", action="store_true", dest="list_all",
                        help="summarise every job under the logs dir instead")
    p_logs.add_argument("--logs-dir", default=None,
                        help="override the audit-log root (default: $DELEGATE_JOB_LOGS_DIR "
                             "or <cwd>/.hermes/delegate_job_logs)")
    p_logs.add_argument("--tail", type=int, default=0,
                        help="show only the last N events (0 = all)")
    p_logs.add_argument("--json", action="store_true",
                        help="emit raw JSON lines / records instead of a table")

    return parser


def main(argv: Optional[List[str]] = None) -> int:
    mqtt_common.setup_logging(logging.INFO)
    args = _build_parser().parse_args(argv)
    rd = args.registry_dir

    if args.command == "register":
        job_id = register_job(
            prompt=args.prompt,
            agent=args.agent,
            agent_session=args.agent_session,
            timeout_sec=args.timeout,
            idle_timeout_sec=args.idle_timeout,
            registry_dir=rd,
            expected_artifacts=args.artifacts,
            bits=args.bits,
        )
        print(job_id)
        return 0

    if args.command == "list":
        records = list_jobs(rd, status=args.status)
        if args.json:
            print(json.dumps(records, ensure_ascii=False, indent=2))
        else:
            if not records:
                print("(no jobs)")
            for r in records:
                print(f"{r['job_id']} {r.get('status','?'):10s} {r.get('agent_session','')}"
                      f" {r.get('prompt','')[:48]}")
        return 0

    if args.command == "get":
        try:
            print(json.dumps(load_job(args.job, rd), ensure_ascii=False, indent=2))
        except FileNotFoundError as exc:
            print(str(exc), file=sys.stderr)
            return 1
        return 0

    if args.command == "status":
        try:
            update_status(args.job, rd, args.status)
        except (FileNotFoundError, ValueError) as exc:
            print(str(exc), file=sys.stderr)
            return 1
        return 0

    if args.command == "pick":
        job_id = pick_pending(args.agent_session, rd)
        if job_id is None:
            return 3  # no pending job for this session
        print(job_id)
        return 0

    if args.command == "logs":
        return _cmd_logs(args)

    return 1


def _cmd_logs(args) -> int:
    """Pretty-print one job's events.ndjson, or summarise all logged jobs."""
    logs_dir = args.logs_dir or mqtt_common.LOGS_DIR

    if args.list_all:
        jobs = mqtt_common.list_logged_jobs(logs_dir)
        if args.json:
            print(json.dumps(jobs, ensure_ascii=False, indent=2))
            return 0
        if not jobs:
            print(f"(no logged jobs under {logs_dir})")
            return 0
        for m in jobs:
            print(f"{m.get('job_id','?')} {m.get('status','?'):10s} "
                  f"{m.get('created_at','-'):20s} {(m.get('prompt') or '')[:48]}")
        return 0

    if not args.job_id:
        print("logs requires a <job_id> or --list", file=sys.stderr)
        return 1

    events = list(mqtt_common.iter_logged_events(args.job_id, logs_dir))
    if not events and not mqtt_common.job_log_dir(args.job_id, logs_dir).exists():
        print(f"no audit log for job {args.job_id} under {logs_dir}", file=sys.stderr)
        return 1
    if args.tail and args.tail > 0:
        events = events[-args.tail:]
    if args.json:
        for e in events:
            print(json.dumps(e, ensure_ascii=False))
        return 0
    for e in events:
        ts = e.get("logged_at") or e.get("timestamp") or "-"
        extra = e.get("detail") or e.get("to") or e.get("source_event") or ""
        print(f"{ts:24s} {e.get('event','?'):<16s} {extra}")
    return 0


if __name__ == "__main__":
    sys.exit(main())
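`pick_pending` above relies on `registry_lock` and `_atomic_write_record` from `mqtt_common`, neither of which appears in this diff. The following is a self-contained sketch of the same claim-under-lock idea, assuming a POSIX host where `fcntl.flock` is available. The helper names `dir_lock` and `claim_oldest_pending`, the `.tmp` suffix, and the sample records are hypothetical and chosen only for illustration; the real helpers may differ.

```python
import fcntl
import json
import os
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Optional


@contextmanager
def dir_lock(registry_dir: Path):
    """Hold an exclusive advisory lock on <registry_dir>/.lock (POSIX only)."""
    registry_dir.mkdir(parents=True, exist_ok=True)
    with open(registry_dir / ".lock", "w") as fh:
        fcntl.flock(fh, fcntl.LOCK_EX)
        try:
            yield
        finally:
            fcntl.flock(fh, fcntl.LOCK_UN)


def claim_oldest_pending(registry_dir: Path, agent_session: str) -> Optional[str]:
    """Flip the oldest matching pending record to running; return its job id."""
    with dir_lock(registry_dir):
        pending = []
        for path in registry_dir.glob("*.json"):
            record = json.loads(path.read_text(encoding="utf-8"))
            if record["status"] == "pending" and record["agent_session"] == agent_session:
                pending.append((record["created_at"], path, record))
        if not pending:
            return None
        _, path, record = min(pending, key=lambda item: item[0])
        record["status"] = "running"
        # Write to a sibling temp file, then rename over the record so readers
        # never observe a half-written JSON document.
        tmp = path.with_suffix(".tmp")
        tmp.write_text(json.dumps(record), encoding="utf-8")
        os.replace(tmp, path)
        return record["job_id"]


# Demo: two pending jobs for one session are claimed oldest-first, exactly once.
demo_dir = Path(tempfile.mkdtemp())
for job_id, created in (("b2", "2024-01-02T00:00:00Z"), ("a1", "2024-01-01T00:00:00Z")):
    (demo_dir / f"{job_id}.json").write_text(json.dumps({
        "job_id": job_id,
        "status": "pending",
        "agent_session": "tmux:claude",
        "created_at": created,
    }), encoding="utf-8")

first = claim_oldest_pending(demo_dir, "tmux:claude")
second = claim_oldest_pending(demo_dir, "tmux:claude")
third = claim_oldest_pending(demo_dir, "tmux:claude")
```

Doing the scan and the status flip inside one lock scope is what keeps two sessions from claiming the same job. An advisory lock like this only protects processes on the same host, which is consistent with the module docstring's suggestion to move to SQLite WAL for multi-host delegation.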