refactor(skills): rename multi-agent-* + agent-sessions-monitor + delegate-job to tmux-agent-orchestrate-*
Renamed 6 skill directories to the tmux-agent-orchestrate-* prefix:
- multi-agent-create → tmux-agent-orchestrate-create
- multi-agent-resume → tmux-agent-orchestrate-resume
- multi-agent-delete → tmux-agent-orchestrate-delete
- multi-agent-status → tmux-agent-orchestrate-status
- agent-sessions-monitor → tmux-agent-orchestrate-monitor
- delegate-job → tmux-agent-orchestrate-delegate-job

Updated:
- skills/lib.sh internal paths (delegate_submit_job etc.)
- skills/tmux-agent-orchestrate-status/scripts/status.sh (monitor path)
- skills/tmux-agent-orchestrate-monitor/scripts/reconcile.sh
- .gitignore (HTML ignore patterns)
- 6 SKILL.md frontmatter (name, related_skills, prereq_skills) and body
- All script headers and Korean comments

Notes:
- tmux session naming convention unchanged (<slug>-creator-<agent>): it is based on the workspace identifier and kept for backward compatibility
- Existing 2 sessions in -L multi-agent-canary untouched
- YAML delegate_job_id / agent-session (tmux:canary-...) preserved for log history compatibility

Verified on isolated server -L agy-rename-test (kill-server after).
@@ -0,0 +1,11 @@
# tmux-agent-orchestrate-delegate-job skill

A Hermes skill that delegates a job to an autonomous agent (claude-code/codex/opencode/human) and
observes it asynchronously over an MQTT event channel. **Start with [`SKILL.md`](./SKILL.md).**

- Protocol/schema: [`job-protocol.md`](./job-protocol.md)
- Broker PoC → production cut-over: [`mqtt-broker-setup.md`](./mqtt-broker-setup.md)
- Registry format/concurrency: [`registry.md`](./registry.md)
- Reference implementation: [`tmux-agent-orchestrate-delegate-job`](./tmux-agent-orchestrate-delegate-job) (bash wrapper), [`scripts/publish_event.py`](./scripts/publish_event.py), [`scripts/job_subscriber.py`](./scripts/job_subscriber.py), [`scripts/registry.py`](./scripts/registry.py), [`scripts/mqtt_common.py`](./scripts/mqtt_common.py)
- Persistent audit log: `.hermes/delegate_job_logs/<job_id>/` (`meta.json`, `events.ndjson`, `status.json`);
  query it with `tmux-agent-orchestrate-delegate-job logs <id>` or `tmux-agent-orchestrate-delegate-job logs --list` (see "Audit Logs" in SKILL.md)
@@ -0,0 +1,348 @@
---
name: tmux-agent-orchestrate-delegate-job
description: "Delegate a unit of work to any autonomous agent (claude-code, codex, opencode, or a human) and observe it asynchronously over an MQTT event channel. Each job gets a unique id, a registry record (prompt, broker, status, timeouts), and a single per-job topic that carries started/permission_required/progress/completed/error events as schema-versioned JSON. The delegator starts a subscriber first, runs the agent, and treats a completed/error event or a timeout as the job's terminal state. Ships a working reference implementation (publish_event.py, job_subscriber.py, registry.py, mqtt_common.py, tmux-agent-orchestrate-delegate-job wrapper) plus a PoC-to-production path: validate on a public broker, then move to an authenticated TLS broker by changing config only, with no code change. Use when you need fire-and-observe delegation, multi-job fan-out across tmux sessions, or a uniform completion-signal protocol shared by several agent types."
version: 1.0.0
author: Hermes Agent
license: MIT
platforms: [linux, macos, windows]
metadata:
  hermes:
    tags: [agent-delegation, mqtt, jobs, orchestration, async-completion]
    related_skills: [claude-code, codex, opencode, hermes-agent-skill-authoring]
---

# tmux-agent-orchestrate-delegate-job: Async Job Delegation over MQTT

Delegate a unit of work to an autonomous agent, then **observe** it instead of
blocking on it. Every job gets a unique id and a registry record; the agent
publishes lifecycle events (`started`, `permission_required`, `progress`,
`completed`, `error`) to a per-job MQTT topic; the delegator subscribes and
treats `completed`/`error` (or a timeout) as the terminal state.

This skill is a **reference implementation**: copy the files in this directory
into your project and customise. The `communication_over_mqtt` project is the
canonical concrete instance.

## Overview

The model is deliberately small. A **job** is one delegated task. An **agent**
is a worker (a claude-code tmux session, a codex run, a human). The **registry**
(`.hermes/jobs/<id>.json`) holds everything about a job so nothing important
lives in environment variables, which means one tmux session can process many
jobs sequentially, and many sessions can fan out in parallel, with no env
collisions. The **event channel** is one MQTT topic per job carrying JSON
payloads; the `event` field discriminates the type.

Each responsibility has exactly one entry point:
[`publish_event.py`](./scripts/publish_event.py) emits events (registry lookup,
monotonic `seq`, retry+backoff) and [`job_subscriber.py`](./scripts/job_subscriber.py)
observes them (timeouts, terminal state machine, defensive parsing). Shared
logic lives in [`mqtt_common.py`](./scripts/mqtt_common.py); registry I/O in
[`registry.py`](./scripts/registry.py). The demo `publisher.py`/`subscriber.py`
in the host project stay frozen.

Two stages, same code. **PoC** runs on the public `broker.hivemq.com` to wire up
the protocol. **Production** moves to your own authenticated TLS broker: the
switch is **config only** (env vars + the registry `broker.*` block), never a
code change. See [`mqtt-broker-setup.md`](./mqtt-broker-setup.md).

## When to Use / When NOT to Use

**Use when:**
- you want **fire-and-observe** delegation: kick off work and get a completion
  signal rather than blocking a terminal;
- several agent types (claude-code, codex, opencode, human) must follow **one**
  completion protocol;
- you need **multi-job fan-out** across tmux sessions with safe job claiming;
- you want a clean PoC → authenticated-broker upgrade path.

**Do NOT use when:**
- a one-shot `claude -p '…'` that returns inline is enough (no async signal
  needed); just use the [claude-code](../claude-code/SKILL.md) skill directly;
- you need request/response RPC or large artifact transfer (this is a
  one-direction event stream, not a data bus);
- the payload would carry secrets and you're still on the public broker; move
  to the own-broker stage first.

## Quick Start

The one-line wrapper handles register + subscriber-first + agent launch. If
you're new, **start here** and only fall back to the manual 5-step flow when
you need finer control.

```bash
# 1) one line: register → start subscriber → launch agent in tmux
#    (uses public broker by default; last stdout line is the audit-log dir)
tmux-agent-orchestrate-delegate-job submit \
  --agent claude-code \
  --prompt "Create 10 sorting problems and save them to sort_problems.md" \
  --workdir /path/to/project \
  --agent-session tmux:demo \
  --timeout 600 --idle-timeout 120
# → stdout:  registered job: <JID>
#            subscriber pid: …
#            agent launched in tmux session: demo
#            subscriber output: <one line per event>
#            /path/to/project/.hermes/delegate_job_logs/<JID>   ← audit log dir

# 2) at any time, query the job or its audit log
tmux-agent-orchestrate-delegate-job status --job <JID>
tmux-agent-orchestrate-delegate-job logs <JID>          # pretty timeline
tmux-agent-orchestrate-delegate-job logs --list         # every job, live status

# 3) run a user-supplied validator against the job's artifacts
tmux-agent-orchestrate-delegate-job verify --job <JID> --validate ./validate.sh
```

The wrapper enforces the **subscribe-before-publish** ordering and **forwards
the freshly-minted `JOB_ID` into the agent's prompt**, so the agent calls
`publish_event.py --job <JID>` with the right id (see Pitfall §"Wrong job_id
propagated to the agent"). When you need finer control, the manual flow is:

```bash
# Manual 5-step (same outcome, more knobs)
PY=.venv/bin/python
SKILL=./skills/tmux-agent-orchestrate-delegate-job/scripts

# 1) register
JID=$($PY "$SKILL/registry.py" register \
  --prompt "…" --agent claude-code --agent-session tmux:demo \
  --timeout 600 --idle-timeout 120)

# 2) START THE SUBSCRIBER FIRST (MQTT does not queue non-retained msgs)
$PY "$SKILL/job_subscriber.py" --job "$JID" --timeout 600 --idle-timeout 120 &

# 3) pass JID to the agent and instruct it to publish events with --job "$JID"
#    (don't hard-code a job id you saw earlier; see Pitfall §"Wrong job_id")

# 4) on completion the subscriber prints events and exits 0/1/2

# 5) inspect any time
$PY "$SKILL/registry.py" get --job "$JID"
$PY "$SKILL/registry.py" logs "$JID"        # positional job id
$PY "$SKILL/registry.py" logs --list
```

## Job Protocol

One topic per job: `python/mqtt/jobs/<job_id>/events`. Payload (JSON, UTF-8,
`schema_version=1`):

```json
{ "schema_version": 1, "seq": 7, "job_id": "abc12345",
  "event": "started|permission_required|progress|completed|error",
  "timestamp": "2026-06-19T09:32:00Z", "detail": "generalised text",
  "data": { "optional": "metadata" } }
```

- `seq` is monotonic per job (first = 1); the subscriber uses it to spot
  reorder/duplication.
- `timestamp` is advisory; timeouts are measured from **receive** time.
- `detail`/`data` carry **no** secrets or absolute paths.
- An event with a `schema_version` or `job_id` mismatch is **dropped** (defensive parsing).

`started` and `completed`/`error` are the mandatory bookends; `completed`→exit 0,
`error`→exit 1. Full catalogue + production `auth_token` handling:
[`job-protocol.md`](./job-protocol.md).
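
The payload shape and the drop-on-mismatch rule above can be sketched as follows. This is a minimal illustration, assuming two hypothetical helpers, `build_event` and `parse_event`; they are not the functions inside `publish_event.py` or `job_subscriber.py`, whose internals are not shown here.

```python
import json
from datetime import datetime, timezone

SCHEMA_VERSION = 1
EVENTS = {"started", "permission_required", "progress", "completed", "error"}


def build_event(job_id, event, seq, detail="", data=None):
    """Serialise one event as UTF-8 JSON bytes (hypothetical helper)."""
    if event not in EVENTS:
        raise ValueError(f"unknown event: {event}")
    payload = {
        "schema_version": SCHEMA_VERSION,
        "seq": seq,
        "job_id": job_id,
        "event": event,
        # Advisory only: the observer measures timeouts from receive time.
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "detail": detail,
        "data": data or {},
    }
    return json.dumps(payload, ensure_ascii=False).encode("utf-8")


def parse_event(raw, expected_job_id):
    """Return the decoded event, or None when it must be dropped."""
    try:
        msg = json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, ValueError):
        return None  # bad encoding or bad JSON
    if not isinstance(msg, dict):
        return None
    if msg.get("schema_version") != SCHEMA_VERSION:
        return None  # publisher and subscriber disagree on the schema
    if msg.get("job_id") != expected_job_id:
        return None  # event belongs to a different job
    return msg
```

Returning `None` instead of raising keeps a polluted message on a shared topic from crashing the observer; a caller would typically log a warning and carry on.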

## Registry Format

```
.hermes/jobs/<id>.json          # metadata record (single source of truth)
.hermes/jobs/<id>.events.log    # append-only JSON-lines log (debug, optional)
.hermes/jobs/.lock              # fcntl advisory lock for the registry
```

The record holds `status`, `prompt`, `agent`, `agent_session`, a `broker` block,
`topic_prefix`, `timeout_sec`/`idle_timeout_sec`, `expected_artifacts`,
`last_seq`, and (production) `auth_token`. Because the `broker` block lives in
the record, `publish_event.py` connects from the registry alone. Concurrency,
the atomic rename trick, and multi-session job claiming are in
[`registry.md`](./registry.md).
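
As a rough illustration of the lock plus atomic-rename idea mentioned above, the sketch below writes a record under an exclusive lock on `.lock` and swaps it into place with `os.replace`. The helper names `write_record` and `read_record` are hypothetical, and the use of `fcntl.flock` (rather than another fcntl locking call) is an assumption; the actual approach is documented in `registry.md`. The `fcntl` module is POSIX-only.

```python
import fcntl
import json
import os
import tempfile
from pathlib import Path


def write_record(registry_dir, job_id, record):
    """Write <job_id>.json atomically while holding the registry lock."""
    registry = Path(registry_dir)
    registry.mkdir(parents=True, exist_ok=True)
    with open(registry / ".lock", "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # released when the lock file closes
        fd, tmp = tempfile.mkstemp(dir=registry, suffix=".tmp")
        try:
            with os.fdopen(fd, "w", encoding="utf-8") as fh:
                json.dump(record, fh, ensure_ascii=False)
                fh.flush()
                os.fsync(fh.fileno())
            # Readers see either the old file or the new one, never a partial write.
            os.replace(tmp, registry / f"{job_id}.json")
        except BaseException:
            os.unlink(tmp)
            raise


def read_record(registry_dir, job_id):
    """Load one job record from the registry directory."""
    with open(Path(registry_dir) / f"{job_id}.json", encoding="utf-8") as fh:
        return json.load(fh)
```

Writing to a temporary file in the same directory matters: `os.replace` is only atomic when source and destination are on the same filesystem.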

## Audit Logs

Every job's lifecycle is mirrored to a **persistent, append-only audit log**
under `.hermes/delegate_job_logs/` (override with `DELEGATE_JOB_LOGS_DIR`;
default `<cwd>/.hermes/delegate_job_logs`). Unlike the registry (live state
mutated in place and liable to be cleaned up), the audit log is durable
history you can replay after the fact. It is git-ignored.

```
.hermes/delegate_job_logs/<job_id>/
  meta.json       # registration snapshot: prompt, agent, broker, timeouts, …
  events.ndjson   # append-only, one JSON event per line, in time order
  status.json     # current status only (fast point-query)
```

**What is logged, automatically:**

| When | `events.ndjson` line | Written by |
|------|----------------------|------------|
| job registered | `registered` (also seeds meta.json + status.json) | `registry.register_job` |
| any status change | `status_changed` (`from`/`to`; also rewrites status.json) | `update_job_status`, `pick_pending` |
| event published | `published` (carries the exact payload, so it is reproducible) | `publish_event.py` |
| event received | `received` (subscriber's external view) | `job_subscriber.py` |

Both the emitter side (`published`) and the observer side (`received`) are
recorded, so a dropped publish or a missed receive is still visible from the
other. Every write is **best-effort and isolated**: an fcntl-locked append
guarded by `try/except` that only ever emits a `logger.warning`, so a logging
failure can never break a publish, a subscribe, or a registry write. stdout is
never touched.
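
The best-effort write described above might look roughly like this. It is a sketch under stated assumptions: `append_audit` is a hypothetical name, and the `kind` field is an illustrative stand-in for however the real log lines tag `registered`/`published`/`received`.

```python
import fcntl
import json
import logging
from pathlib import Path

logger = logging.getLogger("delegate_job.audit")


def append_audit(log_dir, job_id, kind, payload):
    """Append one line to events.ndjson; never raise. Returns True on success."""
    try:
        job_dir = Path(log_dir) / job_id
        job_dir.mkdir(parents=True, exist_ok=True)
        line = json.dumps({"kind": kind, **payload}, ensure_ascii=False)
        with open(job_dir / "events.ndjson", "a", encoding="utf-8") as fh:
            fcntl.flock(fh, fcntl.LOCK_EX)  # serialise concurrent appenders
            fh.write(line + "\n")
        return True
    except Exception as exc:
        # A logging failure must not break publish, subscribe, or registry writes.
        logger.warning("audit log write failed for %s: %s", job_id, exc)
        return False
```

The warning goes through `logging` (stderr by default), which is consistent with leaving stdout untouched so callers can still parse it.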

**Reading them:**

```bash
tmux-agent-orchestrate-delegate-job logs <job_id>     # pretty-print one job's timeline
tmux-agent-orchestrate-delegate-job logs --list       # summarise every logged job (with live status)
# or directly via the registry CLI:
$PY scripts/registry.py logs <job_id> [--tail N] [--json]
$PY scripts/registry.py logs --list [--json]
```

`submit` prints the job's audit-log directory as its last stdout line, so a
caller can `tail -n1` to locate it.

## Broker Setup

| Stage | Broker | Auth | Transport |
|-------|--------|------|-----------|
| PoC | `broker.hivemq.com` | none | 1883 plaintext |
| Production | self-hosted Mosquitto/EMQX | user/pass + ACL | 8883 TLS |

All connection settings come from env (`MQTT_BROKER`, `MQTT_PORT`, `MQTT_TLS`,
`MQTT_USERNAME`/`MQTT_PASSWORD`, `MQTT_CA_CERTS`, …) resolved by
`broker_config_from_env()`, with the registry `broker.*` block overriding per
job. Moving to your own broker is **config only**: install Mosquitto, set
`persistence true` + `acl_file` + `password_file` + a TLS `listener 8883`, grant
the worker `write python/mqtt/jobs/+/events` and Hermes `read`, then flip
`MQTT_TLS=1` and fill the registry `broker.*`. Step-by-step (conf, ACL,
`mosquitto_passwd`, self-signed/private-CA certs, cut-over verification):
[`mqtt-broker-setup.md`](./mqtt-broker-setup.md).
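
The precedence described above (defaults, then environment, then the per-job `broker.*` block) can be illustrated with a small sketch. `resolve_broker` is a hypothetical stand-in for `broker_config_from_env()`, and the dictionary keys (`host`, `port`, `tls`, …) are assumptions, since the real record layout is not shown here.

```python
import os

# PoC defaults taken from the table above.
DEFAULTS = {"host": "broker.hivemq.com", "port": 1883, "tls": False}

OPTIONAL_VARS = (
    ("username", "MQTT_USERNAME"),
    ("password", "MQTT_PASSWORD"),
    ("ca_certs", "MQTT_CA_CERTS"),
)


def resolve_broker(env=None, record_broker=None):
    """Merge defaults < environment < per-job registry `broker` block."""
    env = os.environ if env is None else env
    cfg = dict(DEFAULTS)
    if env.get("MQTT_BROKER"):
        cfg["host"] = env["MQTT_BROKER"]
    if env.get("MQTT_PORT"):
        cfg["port"] = int(env["MQTT_PORT"])
    if env.get("MQTT_TLS"):
        cfg["tls"] = env["MQTT_TLS"] == "1"
    for key, var in OPTIONAL_VARS:
        if env.get(var):
            cfg[key] = env[var]
    cfg.update(record_broker or {})  # the registry block wins for this job
    return cfg
```

Because nothing but configuration feeds this function, the same publisher and subscriber code can target the public broker or a private one.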

## Agent Adapters

Each agent voluntarily follows the contract: receive a `JOB_ID` (or registry
path), call `publish_event.py` at lifecycle points, exit 0/1/2. **The contract
in one line**: every event call uses `--job "$JOB_ID"` where `$JOB_ID` is the
**freshly-issued id from the registry record for *this* delegation**, never a
job_id you saw in an earlier session (Pitfall §"Wrong job_id propagated to the
agent").

- **claude-code**: Claude Code calls `publish_event.py` via its Bash tool at
  lifecycle points. `submit --mode tmux` injects a prompt that already names
  `$JOB_ID`; if you drive claude manually, hand it the id explicitly. Reference
  instruction block (the wrapper injects something equivalent):

  ```text
  Your job_id is "$JOB_ID" (read it from the registry record for this delegation;
  do not reuse any job_id you saw before).

  On start:      $PY tmux-agent-orchestrate-delegate-job/scripts/publish_event.py --job "$JOB_ID" --event started
  On permission: $PY … --job "$JOB_ID" --event permission_required --detail "<tool>:<what>"
  On progress:   $PY … --job "$JOB_ID" --event progress --detail "<short status>"
  On success:    $PY … --job "$JOB_ID" --event completed --detail "<one-line summary>"
  On failure:    $PY … --job "$JOB_ID" --event error --detail "<one-line reason>"

  Task: <the user's prompt>

  The subscriber for "$JOB_ID" is already running; your completed/error event
  ends the job. Exit codes: 0 completed, 1 error, 2 publish failure.
  ```

  See [claude-code](../claude-code/SKILL.md) for tmux orchestration patterns.
- **codex**: same contract. Invoke `codex exec "<instruction-block-above>"` or
  wire `publish_event.py` as an MCP tool so the agent can call it directly.
- **opencode**: wire `publish_event.py` as a tool/command the agent can call;
  identical event points.
- **human**: a person does the work, reads the registry record, then runs
  `publish_event.py --job <id> --event completed` (or `error`) by hand.

## User Interface

The [`tmux-agent-orchestrate-delegate-job`](./tmux-agent-orchestrate-delegate-job) bash wrapper bundles register +
subscribe-first + run-agent + validate:

```bash
tmux-agent-orchestrate-delegate-job submit --agent claude-code \
  --prompt "Create 10 sorting problems and save them to sort_problems.md" \
  --workdir /path/to/project --timeout 600 [--validate ./validate.sh]
tmux-agent-orchestrate-delegate-job status --job <id>     # one record, pretty-printed
tmux-agent-orchestrate-delegate-job list                  # all jobs, one line each
tmux-agent-orchestrate-delegate-job verify --job <id> --validate ./validate.sh   # runs it, reports exit code
tmux-agent-orchestrate-delegate-job wait [--job <id>]     # block until terminal (else --wait-any)
```

`submit` **always starts the subscriber before the agent** (the ordering
dependency), runs the agent in `--mode print` (one-shot) or `--mode tmux`, and
calls `--validate` afterward if given. The skill automates job-id generation,
registry creation, broker resolution, subscriber-first ordering, agent launch,
and completion detection; it does **not** automate the agent's internals or your
business-logic validation. Those are hooks you fill in (`validate.sh` reads
`$JOB_ID`/`$REGISTRY_DIR`).
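
To make the validation hook concrete, here is one possible validator written in Python rather than shell. It is only a sketch: it assumes `expected_artifacts` is a list of paths relative to the working directory, and the function names `validate` and `main` are hypothetical. Only the `JOB_ID` and `REGISTRY_DIR` variable names come from the text above.

```python
import json
import os
import sys
from pathlib import Path


def validate(job_id, registry_dir, workdir="."):
    """Return the expected artifacts that are missing or empty."""
    record_path = Path(registry_dir) / f"{job_id}.json"
    record = json.loads(record_path.read_text(encoding="utf-8"))
    missing = []
    for rel in record.get("expected_artifacts", []):
        path = Path(workdir) / rel
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(rel)
    return missing


def main(env=None):
    """Entry point a wrapper script could call; returns a process exit code."""
    env = os.environ if env is None else env
    problems = validate(env["JOB_ID"], env["REGISTRY_DIR"])
    for rel in problems:
        print(f"missing artifact: {rel}", file=sys.stderr)
    return 1 if problems else 0
```

A `validate.sh` could then be a one-liner that runs this module and exits with the value returned by `main()`.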

## Common Pitfalls

- **Publishing before subscribing**: MQTT does not queue non-retained messages
  for absent subscribers. Start `job_subscriber.py` *before* the agent, or rely
  on retained terminal events (production). `submit` enforces this.
- **Wrong job_id propagated to the agent**: the wrapper prints a fresh `JOB_ID`
  on every `submit`. If your agent instruction (or the wrapper's prompt template)
  hard-codes an old job_id, the agent calls `publish_event.py --job <wrong>`,
  the subscriber's defensive parser drops it as a `job_id` mismatch, and the
  delegator waits until idle timeout (exit 2). Fix: instruct the agent to
  **read the job_id from the registry record for *this* delegation** (or pass it
  in via env / `--prompt` interpolation), never from prior runs. `submit`'s
  default prompt template interpolates `$JOB_ID` for you; if you build a custom
  prompt, do the same.
- **tmux session name collision**: `submit --mode tmux` derives the session
  name from `--agent-session tmux:<name>` (default `tmux:claude`). If a session
  with that name already exists (e.g. you ran the demo and the previous
  session is still open), `tmux new-session -d -s <name>` fails and the agent
  never launches. Pick a unique `--agent-session` per concurrent delegation
  (e.g. `tmux:demo`, `tmux:claude-a`, `tmux:claude-b`) or kill the stale one
  (`tmux kill-session -t claude`) before re-running.
- **Timeout before `started`**: a cold-starting agent may not emit `started`
  for a while; the wall-clock timeout starts at subscribe time so a stuck agent
  still terminates. Don't set `--timeout` so low that a slow start registers as
  a false timeout.
- **No retry on publish**: a dropped `completed` would hang the delegator
  forever; `publish_event.py` retries with exponential backoff and exits 2 if it
  still fails, so the delegator is never left waiting silently.
- **QoS-1 duplicates / reorders**: a terminal event can arrive twice, or
  `error` can trail `completed`; the subscriber's terminal state machine
  finalises each job once and ignores the rest.
- **Trusting the public broker**: anyone can publish there; never make a real
  decision on a PoC signal. Add `auth_token` + an authenticated broker first.
- **Secrets in `detail`/`data`**: keep payloads generalised; no paths, keys, or
  tokens (except the production `auth_token` in `data`).
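
The retry behaviour mentioned under "No retry on publish" follows the usual exponential-backoff pattern. The sketch below is illustrative only: `publish_with_retry` is a hypothetical helper, the attempt count and delays are made-up defaults, and treating `OSError` as the retryable failure is an assumption about how a broker outage would surface.

```python
import time


def publish_with_retry(publish, attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call publish() until it succeeds; return False after the last failure."""
    for attempt in range(attempts):
        try:
            publish()
            return True
        except OSError:
            # Back off 0.5s, 1s, 2s, ... but do not sleep after the final attempt.
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    return False
```

A caller would map a `False` result to exit code 2, so the delegator sees an explicit publish failure instead of waiting silently.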

## Verification Checklist

- [ ] `started` → `completed` over the public broker: subscriber prints the
      lines and exits **0**.
- [ ] `error` path: subscriber exits **1**.
- [ ] timeout path: no terminal event within `--timeout`/`--idle-timeout` →
      exit **2**.
- [ ] polluted payload (bad JSON, wrong `schema_version`, wrong `job_id`) is
      dropped with a warning, not crashed on.
- [ ] one tmux session processes two registry jobs in sequence; a second
      session with a different `agent_session` claims only its own.
- [ ] broker cut-over: same scripts reach an authenticated TLS broker with env
      changes only; a credential without write ACL is rejected; a late
      subscriber still receives the retained terminal event.
- [ ] `publisher.py`/`subscriber.py`/`README.md` demo on `python/mqtt/sample`
      still works unchanged (regression).
- [ ] **audit log integrity**: for a completed job,
      `.hermes/delegate_job_logs/<JID>/events.ndjson` contains `registered` →
      `received started` → `published completed` (in that order), and
      `status.json.status == "completed"` matches the registry record. A
      logging failure (e.g. read-only log dir) does not break the publish or
      subscribe path; only a `logger.warning` is emitted.
- [ ] **end-to-end demo smoke**: run
      `tmux-agent-orchestrate-delegate-job submit --agent claude-code --agent-session tmux:demo-smoke
       --prompt "echo hello and call publish_event.py --job <JID>
       --event completed" --timeout 120` and confirm
      (a) registered job id echoed, (b) subscriber pid echoed, (c) tmux session
      name printed, (d) `events.ndjson` grows as the agent runs, (e) final
      stdout line is the audit-log dir.
@@ -0,0 +1,118 @@
# Job Event Protocol

The wire contract every tmux-agent-orchestrate-delegate-job agent (claude-code, codex, opencode,
human, …) speaks. One job → one MQTT topic → JSON event payloads. Stable across
the PoC (public broker) and production (own broker) stages; only transport
hardening changes, never the payload shape.

Reference implementation: [`./scripts/publish_event.py`](./scripts/publish_event.py)
(emit) and [`./scripts/job_subscriber.py`](./scripts/job_subscriber.py) (observe).

---

## 1. Topic design

| Topic | Purpose |
|-------|---------|
| `python/mqtt/sample` | Legacy demo topic, **never changed** (README compat). |
| `python/mqtt/jobs/<job_id>/events` | Per-job event stream (this protocol). |

- One topic per job, JSON payload, `event` field discriminates the type.
- Single-direction publish only (worker → observer). No request/response.
- Future split is reserved but not required:
  `<job_id>/events`, `<job_id>/logs`, `<job_id>/artifacts`.
- `topic_prefix` is stored in the job record so publishers resolve the topic
  from the registry alone (`<topic_prefix>/events`).
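
The topic rules above reduce to two small string operations, sketched here with hypothetical helpers `job_topic` and `job_id_from_topic`. The sketch assumes `topic_prefix` holds `python/mqtt/jobs/<job_id>`, which is what `<topic_prefix>/events` implies given the per-job topic in the table.

```python
def job_topic(record):
    """Topic a publisher should use, derived from the job record alone."""
    return f"{record['topic_prefix'].rstrip('/')}/events"


def job_id_from_topic(topic, base="python/mqtt/jobs"):
    """Extract <job_id> from '<base>/<job_id>/events', or None if it does not match."""
    parts = topic.split("/")
    base_parts = base.split("/")
    if len(parts) != len(base_parts) + 2:
        return None
    if parts[:len(base_parts)] != base_parts or parts[-1] != "events":
        return None
    return parts[-2] or None
```

An observer subscribed with a wildcard such as `python/mqtt/jobs/+/events` could use the second helper to route each message to the right job.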

---

## 2. Payload schema (JSON, UTF-8, `schema_version = 1`)

```json
{
  "schema_version": 1,
  "seq": 7,
  "job_id": "abc12345",
  "event": "started | permission_required | progress | completed | error",
  "timestamp": "2026-06-19T09:32:00Z",
  "detail": "generalised, whitelisted human-readable string",
  "data": { "optional": "metadata" }
}
```

| Field | Rule |
|-------|------|
| `schema_version` | If publisher/subscriber disagree, the subscriber **drops** the event with a warning (defensive parsing). |
| `seq` | Monotonic **per `job_id`**, first publish = 1. Lets the subscriber detect reorder/duplication. Persisted in the registry (`last_seq`) so it survives restarts. |
| `job_id` | Subscriber drops any event whose `job_id` it did not subscribe for. |
| `timestamp` | Publisher host clock, **advisory only**. The delegator's timeout is measured from *receive* time, not this field. |
| `detail` | Generalised text only. **No absolute paths, keys, or tokens.** |
| `data` | Optional metadata. Production may add `auth_token`, `build_id`, etc. |
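
One way a subscriber could use `seq` to spot duplicates and reorders is sketched below. `SeqTracker` and its verdict labels are hypothetical and only illustrate the idea; the reference subscriber may classify events differently.

```python
class SeqTracker:
    """Classify each incoming seq for one job as in_order, gap, duplicate or reordered."""

    def __init__(self):
        self.last_seq = 0   # highest seq accepted so far; the first publish is 1
        self.seen = set()

    def observe(self, seq):
        if seq in self.seen:
            return "duplicate"      # e.g. a QoS-1 redelivery
        self.seen.add(seq)
        if seq < self.last_seq:
            return "reordered"      # arrived after a higher seq
        verdict = "in_order" if seq == self.last_seq + 1 else "gap"
        self.last_seq = seq
        return verdict
```

For example, the arrival order 1, 2, 2, 4, 3 yields `in_order`, `in_order`, `duplicate`, `gap`, `reordered`.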

---

## 3. Event catalogue

| event | When emitted | `detail` example | seq |
|-------|--------------|------------------|-----|
| `started` | Agent first picks up the job | `"Job a1b2c3d4 started"` | 1 |
| `permission_required` | Agent needs a tool/permission grant | `"needs to write sort_problems.md"` | as it happens |
| `progress` | Optional intermediate checkpoint | `"creating problem 5/10"` | as it happens |
| `completed` | Successful terminal state | `"saved to sort_problems.md"` | last |
| `error` | Failure / exception terminal state | `"internal error, see logs"` | last |

`started` and `completed`/`error` are mandatory bookends; `permission_required`
and `progress` are optional. `detail` must stay on the whitelist of generalised
phrasings; never leak secrets through it.

### Terminal semantics

- `completed` → subscriber exits 0; `error` → exits 1.
- The subscriber runs a **terminal state machine**: it finalises a job on the
  first `completed`/`error` it sees and ignores any later terminal event for
  that job (QoS-1 duplicate, or an `error`-after-`completed` reorder). When all
  watched jobs are finalised it exits.
- Wall-clock timeout *or* idle timeout before a terminal event → exit 2.
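
The terminal rules above, together with the two receive-side timeouts, fit in a small state machine. The following is a single-job sketch with a hypothetical `JobWatcher` class; time is passed in explicitly so the logic can be exercised without waiting, and `time.monotonic()` is used as the default clock by assumption.

```python
import time

EXIT_COMPLETED, EXIT_ERROR, EXIT_TIMEOUT = 0, 1, 2


class JobWatcher:
    """Finalise once on the first terminal event; time out on receive-side clocks."""

    def __init__(self, timeout, idle_timeout, now=None):
        start = time.monotonic() if now is None else now
        self.deadline = start + timeout      # total wall-clock budget
        self.idle_timeout = idle_timeout
        self.last_activity = start
        self.exit_code = None                # None while the job is still open

    def on_event(self, event, now=None):
        if self.exit_code is not None:
            return self.exit_code            # duplicate or late terminal event: ignored
        self.last_activity = time.monotonic() if now is None else now
        if event == "completed":
            self.exit_code = EXIT_COMPLETED
        elif event == "error":
            self.exit_code = EXIT_ERROR
        return self.exit_code

    def check_timeout(self, now=None):
        now = time.monotonic() if now is None else now
        if self.exit_code is None and (
            now >= self.deadline or now - self.last_activity >= self.idle_timeout
        ):
            self.exit_code = EXIT_TIMEOUT
        return self.exit_code
```

Because the first terminal event wins, an `error` that trails a `completed` leaves the exit code at 0, which matches the reorder case described above.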

---

## 4. Production hardening (own broker stage)

The payload shape is unchanged; the transport and trust model tighten. See
[`mqtt-broker-setup.md`](./mqtt-broker-setup.md) for the broker side.

- **Auth / ACL**: username/password + per-topic ACL. `jobs/+/events` publish is
  granted to the worker credential, subscribe to the Hermes credential.
- **`auth_token` (the bonus field)**: each job record carries a per-job
  `auth_token` (`secrets.token_urlsafe(32)`). The publisher copies it into
  **`data.auth_token`**; the subscriber compares it against the registry's
  expected token and **drops mismatches**. This is an integrity check on top of
  the broker ACL, useful while still on a shared/public broker.

  ```json
  { "...": "...", "data": { "auth_token": "9f3c…", "build_id": "42" } }
  ```

- **TLS**: port 8883 + private CA. Toggled with `MQTT_TLS=1` (+ `MQTT_CA_CERTS`);
  no code change.
- **Retained terminal events**: `completed`/`error` publish with `retain=True`
  so a subscriber that joins late immediately receives the last terminal state
  instead of a stale view. The reference publisher auto-retains terminal events;
  `--retained` forces it for any event.
- **Dual timeouts**: total wall-clock budget + last-activity idle detection,
  both measured from receive time.
- **Clock trust**: never trust the payload `timestamp` for timeout decisions.
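
The `auth_token` check in the list above can be sketched as follows. `new_auth_token` and `token_ok` are hypothetical names; `secrets.token_urlsafe(32)` comes from the text, while the constant-time comparison via `hmac.compare_digest` is a choice made for this sketch rather than something the source confirms.

```python
import hmac
import secrets


def new_auth_token():
    """Per-job token, generated once at registration."""
    return secrets.token_urlsafe(32)


def token_ok(event, expected_token):
    """True when data.auth_token matches the registry's expected token."""
    data = event.get("data")
    presented = data.get("auth_token") if isinstance(data, dict) else None
    if not isinstance(presented, str):
        return False  # missing or malformed token: drop the event
    # Compare as bytes so non-ASCII input cannot raise.
    return hmac.compare_digest(presented.encode(), expected_token.encode())
```

A subscriber would call `token_ok` after the schema and `job_id` checks and drop the event when it returns `False`.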

---

## 5. Why a public broker is PoC-only

On `broker.hivemq.com` anyone can publish/subscribe the same topic. Therefore:

- No secret data in payloads.
- `started`/`completed`/`error` are *signals*, never a basis for a security
  decision.
- Non-retained messages are **not queued** for absent subscribers, so start the
  subscriber **before** the agent (ordering dependency), or rely on retained
  terminal events in production.
- Real operational decisions belong to the own-broker stage with auth + ACL.
@@ -0,0 +1,176 @@
# MQTT Broker Setup: PoC → Production

The tmux-agent-orchestrate-delegate-job scripts read **all** broker settings from environment
variables (or a job record's `broker.*` block) through a single helper,
`broker_config_from_env()` in
[`./scripts/mqtt_common.py`](./scripts/mqtt_common.py). The design goal:
**switch from the public PoC broker to your own broker with config only, no
code change.**

| Env var | Meaning | PoC default | Production |
|---------|---------|-------------|-----------|
| `MQTT_BROKER` | host | `broker.hivemq.com` | internal hostname/IP |
| `MQTT_PORT` | port | `1883` | `8883` (TLS) |
| `MQTT_TLS` | TLS on/off (`1`/`0`) | `0` | `1` |
| `MQTT_USERNAME` / `MQTT_PASSWORD` | auth | (none) | broker-issued |
| `MQTT_CA_CERTS` | CA bundle path | (none) | private CA path |
| `MQTT_CERTFILE` / `MQTT_KEYFILE` | client cert (optional mTLS) | (none) | per-client |
| `MQTT_CLIENT_ID_PREFIX` | client id prefix | `hermes` | per-environment |

---

## 1. PoC: public broker (`broker.hivemq.com`)

**Pros**: zero setup, reachable from anywhere, perfect for wiring up the
publish/subscribe loop and the timeout/state-machine logic.

**Cons / accepted assumptions**: no auth, no integrity, shared with the world:

- no secrets in payloads;
- `started`/`completed`/`error` are advisory signals only;
- non-retained messages are **not queued** for absent subscribers, so the
  subscriber must start before the agent;
- a re-subscribing client cannot recover past (non-retained) events.

Use it only to validate the protocol, never for real decisions.

---

## 2. Production: self-hosted Mosquitto (or EMQX)

Both support MQTT 5 + ACL + TLS. Mosquitto shown below; EMQX is a drop-in for
the same env vars.

### 2.1 Install

```bash
# macOS
brew install mosquitto

# Debian/Ubuntu
sudo apt-get update && sudo apt-get install -y mosquitto mosquitto-clients

# Docker
docker run -d --name mosquitto -p 8883:8883 \
  -v "$PWD/mosquitto.conf:/mosquitto/config/mosquitto.conf" \
  -v "$PWD/certs:/mosquitto/certs" \
  -v "$PWD/auth:/mosquitto/auth" \
  eclipse-mosquitto:2
```

### 2.2 `mosquitto.conf` (key lines)

```conf
persistence true
persistence_location /mosquitto/data/

password_file /mosquitto/auth/passwd
acl_file /mosquitto/auth/acl
allow_anonymous false

listener 8883
cafile /mosquitto/certs/ca.crt
certfile /mosquitto/certs/server.crt
keyfile /mosquitto/certs/server.key
```

With `persistence true`, QoS 1, and retained terminal events, a subscriber that
joins after a job finished still sees the final `completed`/`error`.
|
||||
|
||||
### 2.3 Users (username/password)
|
||||
|
||||
```bash
|
||||
# create the file with the first user, then add more with -b
|
||||
mosquitto_passwd -c /mosquitto/auth/passwd hermes # subscriber/delegator
|
||||
mosquitto_passwd /mosquitto/auth/passwd claude-worker # publisher/agent
|
||||
# (omit -c after the first; -c truncates the file)
|
||||
```
|
||||
|
||||
### 2.4 ACL — least privilege
|
||||
|
||||
The worker only **publishes** events; Hermes only **subscribes**:
|
||||
|
||||
```conf
|
||||
# /mosquitto/auth/acl
|
||||
|
||||
# claude-worker: may publish job events, may not read others' streams
|
||||
user claude-worker
|
||||
topic write python/mqtt/jobs/+/events
|
||||
|
||||
# hermes: observes every job's events
|
||||
user hermes
|
||||
topic read python/mqtt/jobs/+/events
|
||||
|
||||
# keep the legacy demo topic usable for both, if desired
|
||||
pattern readwrite python/mqtt/sample
|
||||
```
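
The `+` in the ACL topics above matches exactly one topic level, which is what scopes each rule to `jobs/<job_id>/events`. The following is a simplified sketch of that matching rule, written for illustration only: the function name `topic_matches` is hypothetical, and it ignores special cases such as `$`-prefixed topics. For real use, paho-mqtt ships `paho.mqtt.client.topic_matches_sub`.

```python
def topic_matches(filter_: str, topic: str) -> bool:
    """Return True if an MQTT topic filter matches a concrete topic.

    Handles '+' (exactly one level) and a trailing '#' (all remaining levels).
    """
    f_parts = filter_.split("/")
    t_parts = topic.split("/")
    for i, part in enumerate(f_parts):
        if part == "#":
            # '#' is only valid as the last level of a filter
            return i == len(f_parts) - 1
        if i >= len(t_parts):
            return False
        if part != "+" and part != t_parts[i]:
            return False
    # Without '#', filter and topic must have the same number of levels
    return len(f_parts) == len(t_parts)


if __name__ == "__main__":
    acl = "python/mqtt/jobs/+/events"
    print(topic_matches(acl, "python/mqtt/jobs/abc12345/events"))
    print(topic_matches(acl, "python/mqtt/jobs/abc12345/status"))
```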
|
||||
|
||||
### 2.5 TLS certificates
|
||||
|
||||
**Quick self-signed (single host, internal only):**
|
||||
|
||||
```bash
|
||||
mkdir -p certs && cd certs
|
||||
openssl req -x509 -newkey rsa:2048 -nodes -days 825 \
|
||||
-keyout server.key -out server.crt \
|
||||
-subj "/CN=mqtt.internal"
|
||||
cp server.crt ca.crt # clients trust this as the CA bundle
|
||||
```
|
||||
|
||||
**Private CA (recommended — separate CA from server cert):**
|
||||
|
||||
```bash
|
||||
# 1) CA
|
||||
openssl genrsa -out ca.key 4096
|
||||
openssl req -x509 -new -nodes -key ca.key -days 3650 -out ca.crt -subj "/CN=Hermes-CA"
|
||||
# 2) server cert signed by the CA
|
||||
openssl genrsa -out server.key 2048
|
||||
openssl req -new -key server.key -out server.csr -subj "/CN=mqtt.internal"
|
||||
openssl x509 -req -in server.csr -CA ca.crt -CAkey ca.key -CAcreateserial \
|
||||
-out server.crt -days 825
|
||||
```
|
||||
|
||||
Clients trust `ca.crt` via `MQTT_CA_CERTS=/path/to/ca.crt`.
|
||||
|
||||
---
|
||||
|
||||
## 3. Cut-over verification (config-only, no code change)
|
||||
|
||||
Goal: prove the **same scripts** talk to your broker by changing only env/registry.
|
||||
|
||||
```bash
|
||||
# 1) point the env at the new broker
|
||||
export MQTT_BROKER=mqtt.internal
|
||||
export MQTT_PORT=8883
|
||||
export MQTT_TLS=1
|
||||
export MQTT_CA_CERTS=$PWD/certs/ca.crt
|
||||
export MQTT_USERNAME=hermes
|
||||
export MQTT_PASSWORD=… # subscriber side
|
||||
# (publisher side uses claude-worker creds via the job record's broker block)
|
||||
|
||||
# 2) sanity-check with the mosquitto CLI first
|
||||
mosquitto_sub -h "$MQTT_BROKER" -p 8883 --cafile "$MQTT_CA_CERTS" \
|
||||
-u hermes -P "$MQTT_PASSWORD" -t 'python/mqtt/jobs/+/events' -v &
|
||||
|
||||
# 3) run the unchanged tmux-agent-orchestrate-delegate-job loop
|
||||
PY=.venv/bin/python
|
||||
JID=$($PY scripts/registry.py register --prompt "broker cutover smoke")
|
||||
$PY scripts/job_subscriber.py --job "$JID" --timeout 30 &
|
||||
sleep 3
|
||||
$PY scripts/publish_event.py --job "$JID" --event started
|
||||
$PY scripts/publish_event.py --job "$JID" --event completed # auto-retained
|
||||
```
|
||||
|
||||
Expected:
|
||||
- subscriber prints the `started` and `completed` lines and exits 0;
|
||||
- `mosquitto_sub` shows the same events (ACL allows `hermes` to read);
|
||||
- publishing as a credential **without** write ACL is rejected by the broker;
|
||||
- a subscriber started *after* `completed` still receives it (retained).
|
||||
|
||||
If all four hold, the migration is config-only. Persist the broker block into
|
||||
each job record so `publish_event.py` connects from the registry alone:
|
||||
|
||||
```json
|
||||
"broker": { "host": "mqtt.internal", "port": 8883, "tls": true,
|
||||
"username": "claude-worker", "password": "…" }
|
||||
```
|
||||
@@ -0,0 +1,183 @@
|
||||
# Job Registry
|
||||
|
||||
The registry is the **single source of truth** for delegated work. Job metadata
|
||||
(id, prompt, broker, status, timeouts) lives in files, **not** environment
|
||||
variables — so one tmux session can handle many jobs sequentially or in
|
||||
parallel without collisions, and `publish_event.py` / `job_subscriber.py` can
|
||||
reconstruct everything they need from the registry alone.
|
||||
|
||||
Reference implementation: [`./scripts/registry.py`](./scripts/registry.py)
|
||||
(library + CLI) over the primitives in
|
||||
[`./scripts/mqtt_common.py`](./scripts/mqtt_common.py).
|
||||
|
||||
---
|
||||
|
||||
## 1. Directory layout
|
||||
|
||||
```
|
||||
.hermes/jobs/
  <job_id>.json          # job metadata record (schema below)
  <job_id>.events.log    # append-only JSON-lines event log (debug, optional)
  .lock                  # shared advisory lock (fcntl) for the whole registry
|
||||
```
|
||||
|
||||
`registry_dir` defaults to `.hermes/jobs` and is overridable everywhere via
|
||||
`--registry-dir`.
|
||||
|
||||
---
|
||||
|
||||
## 2. Job record schema
|
||||
|
||||
```json
|
||||
{
  "schema_version": 1,
  "job_id": "abc12345",
  "status": "pending | running | completed | error | cancelled",
  "created_at": "2026-06-19T09:30:00Z",
  "updated_at": "2026-06-19T09:32:00Z",
  "prompt": "Create 10 sorting problems and save them to sort_problems.md…",
  "agent": "claude-code",
  "agent_session": "tmux:claude",
  "broker": {
    "host": "broker.hivemq.com",
    "port": 1883,
    "tls": false,
    "username": null,
    "password": null
  },
  "topic_prefix": "python/mqtt/jobs/abc12345",
  "timeout_sec": 600,
  "idle_timeout_sec": 120,
  "expected_artifacts": ["sort_problems.md"],
  "last_seq": 0,
  "auth_token": null
}
|
||||
```
|
||||
|
||||
- `broker` lets `publish_event.py` connect from the record alone (env still
|
||||
overrides toggles like `MQTT_TLS`).
|
||||
- `topic_prefix` → the events topic is `<topic_prefix>/events`.
|
||||
- `last_seq` backs the monotonic `seq` counter so it survives process restarts.
|
||||
- `expected_artifacts` is the hook a user `validate.sh` checks (existence/content).
|
||||
- `auth_token` is `null` in PoC; production sets `secrets.token_urlsafe(32)`.
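
As an illustration of how these fields relate, here is a minimal sketch that builds a record and derives its events topic. The function names `new_job_record` and `events_topic`, the 8-character id scheme, and the `production` switch are assumptions made for this example; the real `registry.py` may differ.

```python
import secrets
import time
import uuid


def new_job_record(prompt: str, agent_session: str, production: bool = False) -> dict:
    """Build a job record shaped like the schema above (hypothetical helper)."""
    job_id = uuid.uuid4().hex[:8]  # assumption: short random id like "abc12345"
    now = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
    return {
        "schema_version": 1,
        "job_id": job_id,
        "status": "pending",
        "created_at": now,
        "updated_at": now,
        "prompt": prompt,
        "agent_session": agent_session,
        "topic_prefix": f"python/mqtt/jobs/{job_id}",
        "last_seq": 0,
        # null in PoC; a random token in production
        "auth_token": secrets.token_urlsafe(32) if production else None,
    }


def events_topic(record: dict) -> str:
    """The events topic is always <topic_prefix>/events."""
    return f"{record['topic_prefix']}/events"


if __name__ == "__main__":
    rec = new_job_record("broker cutover smoke", "tmux:claude")
    print(rec["job_id"], events_topic(rec))
```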
|
||||
|
||||
---
|
||||
|
||||
## 3. Concurrency rules
|
||||
|
||||
### PoC — fcntl advisory lock
|
||||
|
||||
Every read-modify-write (`register_job`, `pick_pending`, `update_status`,
|
||||
`next_seq`) runs inside `registry_lock(registry_dir)`, an exclusive
|
||||
`fcntl.flock` over `.lock`. Single-host, good enough for many tmux sessions on
|
||||
one machine.
|
||||
|
||||
### Production — SQLite WAL
|
||||
|
||||
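When delegation spans **multiple hosts**, the file lock no longer serialises
across machines. Migrate the same operations to a SQLite database in WAL mode
(`PRAGMA journal_mode=WAL`) with one transaction per claim. WAL mode requires
every process that opens the database to run on the same host (it does not work
over a network filesystem), so remote hosts should reach the registry through a
process on the database host rather than open the file directly. The function
signatures stay identical; only the storage backend changes.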
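
A minimal sketch of a per-claim transaction, assuming a hypothetical `jobs` table with `job_id`, `agent_session`, `status`, and `created_at` columns (this schema is not defined anywhere in the skill). `BEGIN IMMEDIATE` takes the write lock before the read, so the select and the update cannot interleave with another claimer:

```python
import sqlite3
from typing import Optional


def open_registry(path: str) -> sqlite3.Connection:
    """Open (or create) the hypothetical SQLite registry in WAL mode."""
    # isolation_level=None: autocommit, transactions are managed explicitly
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        " job_id TEXT PRIMARY KEY,"
        " agent_session TEXT NOT NULL,"
        " status TEXT NOT NULL,"
        " created_at TEXT NOT NULL)"
    )
    return conn


def pick_pending(conn: sqlite3.Connection, agent_session: str) -> Optional[str]:
    """Claim the oldest pending job for this session, or return None."""
    conn.execute("BEGIN IMMEDIATE")  # acquire the write lock up front
    try:
        row = conn.execute(
            "SELECT job_id FROM jobs"
            " WHERE status = 'pending' AND agent_session = ?"
            " ORDER BY created_at, job_id LIMIT 1",
            (agent_session,),
        ).fetchone()
        if row is not None:
            conn.execute(
                "UPDATE jobs SET status = 'running' WHERE job_id = ?", (row[0],)
            )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return row[0] if row is not None else None
```

The signature mirrors the file-based `pick_pending(agent_session)` apart from the explicit connection argument, which is a choice made for this sketch.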
|
||||
|
||||
---
|
||||
|
||||
## 4. How multiple sessions take only their own work
|
||||
|
||||
Each tmux session carries an `agent_session` label (`tmux:claude`,
|
||||
`tmux:claude-a`, `tmux:claude-b`, …). `pick_pending(agent_session)`:
|
||||
|
||||
1. acquires the registry lock,
|
||||
2. scans for the **oldest** record with `status == "pending"` **and**
|
||||
matching `agent_session`,
|
||||
3. flips it to `running` and writes it back **atomically**,
|
||||
4. releases the lock and returns the `job_id` (or `None`).
|
||||
|
||||
Because the scan + flip happen under one lock, two sessions can never claim the
|
||||
same job. Sessions with distinct labels naturally partition the work; sessions
|
||||
sharing a label compete safely — first to acquire the lock wins, the other sees
|
||||
the job already `running` and moves on.
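
The four steps above can be sketched as follows. This is an illustrative reimplementation, not the code in `registry.py`: it assumes records are ordered by their `created_at` field and takes `registry_dir` as an explicit first argument. It is POSIX-only because of `fcntl`.

```python
import fcntl
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def pick_pending(registry_dir: str, agent_session: str) -> Optional[str]:
    """Claim the oldest pending job for agent_session, or return None."""
    reg = Path(registry_dir)
    reg.mkdir(parents=True, exist_ok=True)
    with open(reg / ".lock", "a+") as lock:
        fcntl.flock(lock.fileno(), fcntl.LOCK_EX)          # 1. acquire the lock
        try:
            candidates = []
            for path in reg.glob("*.json"):                # 2. scan the records
                rec = json.loads(path.read_text(encoding="utf-8"))
                if (rec.get("status") == "pending"
                        and rec.get("agent_session") == agent_session):
                    candidates.append((rec.get("created_at", ""), path.name, path, rec))
            if not candidates:
                return None
            _, _, path, rec = min(candidates, key=lambda c: (c[0], c[1]))
            rec["status"] = "running"                      # 3. flip and write back
            fd, tmp = tempfile.mkstemp(dir=str(reg), prefix=f".{rec['job_id']}.",
                                       suffix=".tmp")
            with os.fdopen(fd, "w", encoding="utf-8") as fh:
                json.dump(rec, fh)
                fh.flush()
                os.fsync(fh.fileno())
            os.replace(tmp, path)                          # atomic rename
            return rec["job_id"]
        finally:
            fcntl.flock(lock.fileno(), fcntl.LOCK_UN)      # 4. release the lock
```

Because the scan and the flip sit inside one `flock` critical section, a second session calling `pick_pending` with the same label blocks until the first has written `running`, then no longer sees that job as a candidate.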
|
||||
|
||||
```bash
|
||||
# session A only ever runs its own pending jobs
|
||||
PY=.venv/bin/python
$PY scripts/registry.py pick --agent-session tmux:claude-a   # prints id or exits 3
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 5. Atomic status updates
|
||||
|
||||
All writes use a temp-file + `os.replace` rename, which is atomic on POSIX:
|
||||
|
||||
1. take the registry lock,
|
||||
2. load the current record,
|
||||
3. mutate fields + refresh `updated_at` (and `last_seq` for `next_seq`),
|
||||
4. write to `.<job_id>.<rand>.tmp` in the **same directory**, `fsync`,
|
||||
5. `os.replace(tmp, <job_id>.json)`,
|
||||
6. release the lock.
|
||||
|
||||
A reader therefore always sees either the old or the new complete record, never
|
||||
a half-written file. This is the file-based equivalent of the rename trick
|
||||
(`pending.<session>` → `running.<session>`) and maps cleanly onto a single
|
||||
SQLite transaction when you migrate.
|
||||
|
||||
---
|
||||
|
||||
## 6. CLI quick reference
|
||||
|
||||
```bash
|
||||
PY=.venv/bin/python
|
||||
$PY scripts/registry.py register --prompt "…" --agent claude-code \
|
||||
--agent-session tmux:claude --timeout 600 --idle-timeout 120 # → prints job_id
|
||||
$PY scripts/registry.py list # human table
|
||||
$PY scripts/registry.py list --json # full records
|
||||
$PY scripts/registry.py get --job <id> # one record
|
||||
$PY scripts/registry.py status --job <id> --set completed # set status
|
||||
$PY scripts/registry.py pick --agent-session tmux:claude # claim → running
|
||||
```
|
||||
|
||||
Exit codes: `0` ok, `1` not found / bad status, `3` (`pick`) no pending job for
|
||||
that session.
|
||||
|
||||
---
|
||||
|
||||
## 7. Persistent audit log
|
||||
|
||||
Separate from the registry, every job is also mirrored to a durable append-only
|
||||
audit log at `.hermes/delegate_job_logs/<job_id>/` (override with
|
||||
`DELEGATE_JOB_LOGS_DIR`, default `<cwd>/.hermes/delegate_job_logs`). The registry
|
||||
is **live state** mutated in place; the audit log is **history** that survives
|
||||
even after the registry dir is cleaned up. It is git-ignored.
|
||||
|
||||
```
|
||||
.hermes/delegate_job_logs/<job_id>/
  meta.json        # registration snapshot (the full job record at register time)
  events.ndjson    # append-only, one JSON event per line, time-ordered
  status.json      # current status only (fast point-query)
|
||||
```
|
||||
|
||||
`events.ndjson` lines are written automatically at four points:
|
||||
|
||||
| Trigger | line `event` | Source |
|
||||
|---------|-------------|--------|
|
||||
| `register_job` | `registered` | `registry.register_job` → `mqtt_common.init_job_log` |
|
||||
| status change (`update_status`, `pick`, publish status sync) | `status_changed` (`from`/`to`) | `mqtt_common.update_job_status` / `pick_pending` |
|
||||
| event published | `published` (embeds the exact payload) | `publish_event.py` |
|
||||
| event received | `received` | `job_subscriber.py` |
|
||||
|
||||
Helpers live in [`./scripts/mqtt_common.py`](./scripts/mqtt_common.py):
|
||||
`LOGS_DIR`, `job_log_path`, `init_job_log`, `append_event` (fcntl-locked,
|
||||
concurrent-append safe), `update_logged_status`, and the readers
|
||||
`read_logged_meta` / `read_logged_status` / `iter_logged_events` /
|
||||
`list_logged_jobs`. Every writer is **best-effort and isolated** — wrapped in
|
||||
`try/except` with a `logger.warning`, so an audit-log failure never breaks the
|
||||
registry write, the publish, or the subscribe it shadows.
|
||||
|
||||
Read them via the CLI:
|
||||
|
||||
```bash
|
||||
PY=.venv/bin/python
|
||||
$PY scripts/registry.py logs <job_id> # pretty timeline
|
||||
$PY scripts/registry.py logs <job_id> --tail 20 # last 20 events
|
||||
$PY scripts/registry.py logs <job_id> --json # raw JSON lines
|
||||
$PY scripts/registry.py logs --list # every job, live status
|
||||
```
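
The same timeline can be rebuilt directly from `events.ndjson`. The sketch below is a standalone illustration with hypothetical function names (`iter_events`, `timeline`); it does not reproduce the `iter_logged_events` helper in `mqtt_common.py`, whose implementation may differ. It tolerates blank or corrupt lines so that one bad write does not hide the rest of the history.

```python
import json
from pathlib import Path
from typing import Any, Dict, Iterator, List


def iter_events(logs_dir: str, job_id: str) -> Iterator[Dict[str, Any]]:
    """Yield each JSON object from <logs_dir>/<job_id>/events.ndjson in file order."""
    path = Path(logs_dir) / job_id / "events.ndjson"
    if not path.exists():
        return
    with open(path, "r", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip a torn or corrupt line
            if isinstance(event, dict):
                yield event


def timeline(logs_dir: str, job_id: str) -> List[str]:
    """Render one short line per logged event."""
    return [
        f"{e.get('timestamp', '-')} {e.get('event', '?')}"
        for e in iter_events(logs_dir, job_id)
    ]


if __name__ == "__main__":
    for entry in timeline(".hermes/delegate_job_logs", "abc12345"):
        print(entry)
```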
|
||||
@@ -0,0 +1 @@
|
||||
paho-mqtt>=2.0.0
|
||||
@@ -0,0 +1,233 @@
|
||||
#!/usr/bin/env python3
|
||||
"""job_subscriber.py — the single entry point for observing Job events.
|
||||
|
||||
Subscribes to one job's ``<topic_prefix>/events`` (or, with ``--wait-any``, the
|
||||
events of every running/pending job in the registry), prints one line to stdout
|
||||
per accepted event, and exits on a terminal event or a timeout.
|
||||
|
||||
Design points (all flagged in the PLAN review):
|
||||
- terminal state machine: ``completed``/``error`` is acted on exactly once per
|
||||
job, so QoS-1 duplicates or an ``error``-after-``completed`` reorder are safe.
|
||||
- dual timeouts: a wall-clock ``--timeout`` (total budget, started at
|
||||
subscribe time so a cold start can't hang forever) AND an idle
|
||||
``--idle-timeout`` (no new event for N seconds).
|
||||
- defensive parsing: undecodable payloads, ``schema_version`` mismatches, and
|
||||
``job_id`` values we did not subscribe for are logged and dropped.
|
||||
|
||||
stdout = event lines only. Diagnostics go to stderr via logging.
|
||||
|
||||
Exit codes:
|
||||
0 all watched jobs reached ``completed``
|
||||
1 any watched job reached ``error``
|
||||
2 timed out (wall-clock or idle) before all jobs finished
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import logging
|
||||
import queue
|
||||
import sys
|
||||
import time
|
||||
from typing import Any, Dict, List, Optional, Set, Tuple
|
||||
|
||||
import mqtt_common
|
||||
import registry
|
||||
from mqtt_common import (
|
||||
DEFAULT_REGISTRY_DIR,
|
||||
SCHEMA_VERSION,
|
||||
broker_config_from_job,
|
||||
load_job,
|
||||
make_client,
|
||||
)
|
||||
|
||||
logger = logging.getLogger("delegate_job.job_subscriber")
|
||||
|
||||
TERMINAL_EVENTS = ("completed", "error")
|
||||
|
||||
|
||||
def _format_line(topic: str, payload: Dict[str, Any]) -> str:
    return (
        f"{payload.get('timestamp','-')} "
        f"job={payload.get('job_id','?')} "
        f"seq={payload.get('seq','?')} "
        f"{payload.get('event','?'):<20} "
        f"{payload.get('detail','')}"
    )
|
||||
|
||||
|
||||
class _Watcher:
    """Holds the shared queue + the set of job_ids we accept events for."""

    def __init__(self, expected_job_ids: Set[str], expected_tokens: Dict[str, Optional[str]]):
        self.events: "queue.Queue[Tuple[str, Dict[str, Any]]]" = queue.Queue()
        self.expected = set(expected_job_ids)
        self.tokens = expected_tokens  # job_id -> expected auth_token (or None)

    def on_message(self, _client, _userdata, msg) -> None:
        # --- defensive parsing -------------------------------------------
        try:
            payload = json.loads(msg.payload.decode("utf-8"))
        except (UnicodeDecodeError, json.JSONDecodeError) as exc:
            logger.warning("drop unparseable payload on %s: %s", msg.topic, exc)
            return
        if not isinstance(payload, dict):
            logger.warning("drop non-object payload on %s", msg.topic)
            return
        if payload.get("schema_version") != SCHEMA_VERSION:
            logger.warning("drop event with schema_version=%r (expected %d)",
                           payload.get("schema_version"), SCHEMA_VERSION)
            return
        jid = payload.get("job_id")
        # The isinstance guard keeps an unhashable job_id (e.g. a list) from
        # raising TypeError on the set lookup inside the network thread.
        if not isinstance(jid, str) or jid not in self.expected:
            logger.warning("drop event for unexpected job_id=%r on %s", jid, msg.topic)
            return
        # --- production auth check: data.auth_token must match if expected ---
        expected_token = self.tokens.get(jid)
        if expected_token is not None:
            got = (payload.get("data") or {}).get("auth_token")
            if got != expected_token:
                logger.warning("drop event for job %s: auth_token mismatch", jid)
                return
        # Persistent audit log from the *subscriber's* vantage point: every event
        # that survives defensive parsing is recorded here, including ones a
        # different host published. This is the external-observer record that
        # backstops the publisher's own "published" line if it never wrote one.
        mqtt_common.append_event(jid, {
            "event": "received",
            "source_event": payload.get("event"),
            "seq": payload.get("seq"),
            "topic": msg.topic,
            "timestamp": payload.get("timestamp"),
            "detail": payload.get("detail", ""),
        })
        self.events.put((msg.topic, payload))
|
||||
|
||||
|
||||
def _collect_jobs(args) -> List[Dict[str, Any]]:
    """Resolve the list of job records this invocation should watch."""
    if args.wait_any:
        jobs = [r for r in registry.list_jobs(args.registry_dir)
                if r.get("status") in ("pending", "running")]
        if not jobs:
            logger.error("no pending/running jobs to wait for")
        return jobs
    job = load_job(args.job, args.registry_dir)  # raises FileNotFoundError
    return [job]
|
||||
|
||||
|
||||
def main(argv=None) -> int:
    parser = argparse.ArgumentParser(description="Subscribe to Job events on MQTT")
    target = parser.add_mutually_exclusive_group(required=True)
    target.add_argument("--job", help="job id to watch")
    target.add_argument("--wait-any", action="store_true",
                        help="watch every pending/running job in the registry")
    parser.add_argument("--timeout", type=float, default=None,
                        help="wall-clock budget in seconds (default: job.timeout_sec or 600)")
    parser.add_argument("--idle-timeout", type=float, default=None,
                        help="max seconds with no new event (default: job.idle_timeout_sec or 120)")
    parser.add_argument("--expect-retention", action="store_true",
                        help="warn if no retained terminal event arrives promptly")
    parser.add_argument("--registry-dir", default=DEFAULT_REGISTRY_DIR)
    parser.add_argument("-v", "--verbose", action="store_true")
    args = parser.parse_args(argv)

    mqtt_common.setup_logging(logging.DEBUG if args.verbose else logging.WARNING)

    try:
        jobs = _collect_jobs(args)
    except FileNotFoundError as exc:
        logger.error("%s", exc)
        return 2
    if not jobs:
        return 2

    expected_ids: Set[str] = {j["job_id"] for j in jobs}
    tokens = {j["job_id"]: j.get("auth_token") for j in jobs}
    watcher = _Watcher(expected_ids, tokens)

    # Resolve timeouts from CLI, falling back to the (first) job's settings.
    base_job = jobs[0]
    wall_timeout = args.timeout if args.timeout is not None else float(base_job.get("timeout_sec", 600))
    idle_timeout = args.idle_timeout if args.idle_timeout is not None else float(base_job.get("idle_timeout_sec", 120))

    # All watched jobs share a broker in practice; connect using the first
    # job's broker and subscribe to each job's events topic.
    config = broker_config_from_job(base_job)
    client = make_client("subscriber", config)
    client.on_message = watcher.on_message

    subscribed_topics = []
    for job in jobs:
        prefix = job.get("topic_prefix") or mqtt_common.topic_prefix_for(job["job_id"])
        subscribed_topics.append(f"{prefix}/events")

    def on_connect(_c, _u, _flags, reason_code, _props):
        if mqtt_common.reason_code_value(reason_code) != 0:
            logger.error("broker connection failed: rc=%s", reason_code)
            return
        for topic in subscribed_topics:
            _c.subscribe(topic, qos=1)
            logger.info("subscribed to %s", topic)

    client.on_connect = on_connect
    client.connect(config.host, config.port, config.keepalive)
    client.loop_start()

    terminal: Dict[str, str] = {}  # job_id -> "completed"/"error"
    pending: Set[str] = set(expected_ids)
    start = time.monotonic()
    wall_deadline = start + wall_timeout
    last_event = start
    retention_checked = not args.expect_retention

    try:
        while pending:
            now = time.monotonic()
            if now >= wall_deadline:
                logger.error("wall-clock timeout (%.0fs); still pending: %s",
                             wall_timeout, ", ".join(sorted(pending)))
                return 2
            idle_left = idle_timeout - (now - last_event)
            if idle_left <= 0:
                logger.error("idle timeout (%.0fs, no events); still pending: %s",
                             idle_timeout, ", ".join(sorted(pending)))
                return 2
            wait = min(wall_deadline - now, idle_left, 1.0)
            try:
                topic, payload = watcher.events.get(timeout=wait)
            except queue.Empty:
                if not retention_checked and (now - start) > 3.0:
                    logger.warning("--expect-retention set but no retained "
                                   "terminal event observed yet")
                    retention_checked = True
                continue

            last_event = time.monotonic()
            retention_checked = True
            print(_format_line(topic, payload), flush=True)

            jid = payload["job_id"]
            event = payload.get("event")
            if event in TERMINAL_EVENTS:
                if jid in terminal:
                    # Already finalised: ignore duplicates / late reorders.
                    logger.info("ignoring duplicate terminal %s for %s", event, jid)
                    continue
                terminal[jid] = event
                pending.discard(jid)
    finally:
        client.loop_stop()
        try:
            client.disconnect()
        except Exception:  # pragma: no cover
            pass

    # All jobs reached a terminal state. error wins over completed.
    if any(state == "error" for state in terminal.values()):
        return 1
    return 0
|
||||
|
||||
|
||||
if __name__ == "__main__":
    sys.exit(main())
|
||||
@@ -0,0 +1,550 @@
|
||||
"""Shared MQTT + registry helpers for the tmux-agent-orchestrate-delegate-job skill.
|
||||
|
||||
Single entry point for:
|
||||
- broker configuration (env -> dataclass),
|
||||
- paho client construction (auth + TLS + unique client id),
|
||||
- monotonic per-job sequence counters,
|
||||
- retry-with-exponential-backoff,
|
||||
- atomic registry record load/update under an fcntl lock.
|
||||
|
||||
Requires paho-mqtt >= 2.0 (uses CallbackAPIVersion.VERSION2).
|
||||
|
||||
This module is the *only* place that talks to the broker config and to the
|
||||
raw job record file, so PoC -> production migration touches just env/registry
|
||||
values, never code (see references/mqtt-broker-setup.md).
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import functools
|
||||
import json
|
||||
import logging
|
||||
import os
|
||||
import tempfile
|
||||
import time
|
||||
import uuid
|
||||
from contextlib import contextmanager
|
||||
from dataclasses import asdict, dataclass
|
||||
from pathlib import Path
|
||||
from typing import Any, Callable, Dict, Iterable, List, Optional
|
||||
|
||||
import paho.mqtt.client as mqtt
|
||||
|
||||
logger = logging.getLogger("delegate_job.mqtt_common")
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# Constants
|
||||
# --------------------------------------------------------------------------
|
||||
SCHEMA_VERSION = 1
|
||||
DEFAULT_REGISTRY_DIR = ".hermes/jobs"
|
||||
DEFAULT_TOPIC_ROOT = "python/mqtt/jobs"
|
||||
LOCK_FILENAME = ".lock"
|
||||
|
||||
# Persistent audit-log layout: .hermes/delegate_job_logs/<job_id>/{meta,events,status}.
|
||||
# This is a *separate* artifact from the registry: the registry is the live job
|
||||
# record (mutated in place), the audit log is an append-only history that
|
||||
# survives even if the registry dir is cleaned up.
|
||||
META_FILENAME = "meta.json"
|
||||
EVENTS_FILENAME = "events.ndjson"
|
||||
STATUS_FILENAME = "status.json"
|
||||
|
||||
|
||||
def _default_logs_dir() -> str:
    """Audit-log root. Overridable with ``DELEGATE_JOB_LOGS_DIR``; otherwise
    ``<cwd>/.hermes/delegate_job_logs``. We keep audit logs next to the
    live registry (``.hermes/jobs/``) so the two runtime artifacts sit
    under the same parent dir and follow the same ``.gitignore`` rule.
    The cwd of whichever process emits events (the bash wrapper and
    scripts) is used as the anchor."""
    env = os.environ.get("DELEGATE_JOB_LOGS_DIR")
    if env and env.strip():
        return env
    return os.path.join(os.getcwd(), ".hermes", "delegate_job_logs")
|
||||
|
||||
|
||||
LOGS_DIR = _default_logs_dir()
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# Broker configuration
|
||||
# --------------------------------------------------------------------------
|
||||
@dataclass
class BrokerConfig:
    """Resolved broker connection settings.

    PoC defaults target the public HiveMQ broker. Production overrides arrive
    either from environment variables or from a job record's ``broker.*`` block
    (see ``broker_config_from_job``).
    """

    host: str = "broker.hivemq.com"
    port: int = 1883
    tls: bool = False
    username: Optional[str] = None
    password: Optional[str] = None
    client_id_prefix: str = "hermes"
    # TLS material (only consulted when tls is True).
    ca_certs: Optional[str] = None
    certfile: Optional[str] = None
    keyfile: Optional[str] = None
    keepalive: int = 60

    def to_dict(self) -> Dict[str, Any]:
        return asdict(self)

    def to_registry_block(self) -> Dict[str, Any]:
        """The subset that gets persisted into a job record's broker block."""
        return {
            "host": self.host,
            "port": self.port,
            "tls": self.tls,
            "username": self.username,
            "password": self.password,
        }
|
||||
|
||||
|
||||
def _env_bool(name: str, default: bool = False) -> bool:
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in ("1", "true", "yes", "on")


def _env_int(name: str, default: int) -> int:
    raw = os.environ.get(name)
    if raw is None or raw.strip() == "":
        return default
    try:
        return int(raw)
    except ValueError:
        logger.warning("invalid int for %s=%r; using default %d", name, raw, default)
        return default
|
||||
|
||||
|
||||
def broker_config_from_env(overrides: Optional[Dict[str, Any]] = None) -> BrokerConfig:
    """Build a :class:`BrokerConfig` from environment variables.

    Recognised vars (all optional, PoC defaults shown):
        MQTT_BROKER (broker.hivemq.com), MQTT_PORT (1883), MQTT_TLS (0),
        MQTT_USERNAME, MQTT_PASSWORD, MQTT_CLIENT_ID_PREFIX (hermes),
        MQTT_CA_CERTS, MQTT_CERTFILE, MQTT_KEYFILE, MQTT_KEEPALIVE (60).

    ``overrides`` (e.g. a job record's broker block) wins over the env values
    for any key it specifies with a non-None value.
    """
    cfg = BrokerConfig(
        host=os.environ.get("MQTT_BROKER", "broker.hivemq.com"),
        port=_env_int("MQTT_PORT", 1883),
        tls=_env_bool("MQTT_TLS", False),
        username=os.environ.get("MQTT_USERNAME") or None,
        password=os.environ.get("MQTT_PASSWORD") or None,
        client_id_prefix=os.environ.get("MQTT_CLIENT_ID_PREFIX", "hermes"),
        ca_certs=os.environ.get("MQTT_CA_CERTS") or None,
        certfile=os.environ.get("MQTT_CERTFILE") or None,
        keyfile=os.environ.get("MQTT_KEYFILE") or None,
        keepalive=_env_int("MQTT_KEEPALIVE", 60),
    )
    if overrides:
        for key, value in overrides.items():
            if value is not None and hasattr(cfg, key):
                setattr(cfg, key, value)
    return cfg
|
||||
|
||||
|
||||
def broker_config_from_job(job: Dict[str, Any]) -> BrokerConfig:
    """Resolve broker config for a job: env defaults, then the job's broker.*
    block overrides. This lets ``publish_event.py`` connect from the registry
    alone, while still honouring environment toggles (e.g. MQTT_TLS=1)."""
    return broker_config_from_env(overrides=job.get("broker") or {})
|
||||
|
||||
|
||||
def make_client(role: str, config: Optional[BrokerConfig] = None) -> mqtt.Client:
    """Return a configured paho ``Client`` (not yet connected).

    The client id is ``f"{prefix}-{role}-{uuid8}"`` so concurrent publishers /
    subscribers never collide on the broker. Auth and TLS are applied when the
    config supplies them.
    """
    config = config or broker_config_from_env()
    client_id = f"{config.client_id_prefix}-{role}-{uuid.uuid4().hex[:8]}"
    client = mqtt.Client(
        callback_api_version=mqtt.CallbackAPIVersion.VERSION2,
        client_id=client_id,
    )
    if config.username:
        client.username_pw_set(config.username, config.password)
    if config.tls:
        # If ca_certs is None paho uses the system trust store (good enough for
        # public CAs); a private CA bundle path is passed through unchanged.
        client.tls_set(
            ca_certs=config.ca_certs,
            certfile=config.certfile,
            keyfile=config.keyfile,
        )
    logger.debug("built client id=%s tls=%s host=%s", client_id, config.tls, config.host)
    return client
|
||||
|
||||
|
||||
def reason_code_value(rc: Any) -> int:
    """Normalise a paho v2 connect reason code to an int.

    paho-mqtt 2.x hands callbacks a ``ReasonCode`` object (not an int); older
    paths may pass a plain int. ``ReasonCode`` exposes ``.value``; 0 == success.
    """
    return int(getattr(rc, "value", rc))
|
||||
|
||||
|
||||
def topic_prefix_for(job_id: str, root: str = DEFAULT_TOPIC_ROOT) -> str:
    return f"{root}/{job_id}"


def events_topic_for(job_id: str, root: str = DEFAULT_TOPIC_ROOT) -> str:
    return f"{topic_prefix_for(job_id, root)}/events"
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# Registry primitives (single source of truth for raw record I/O)
|
||||
# --------------------------------------------------------------------------
|
||||
def _job_path(job_id: str, registry_dir: str) -> Path:
    return Path(registry_dir) / f"{job_id}.json"


def _lock_path(registry_dir: str) -> Path:
    return Path(registry_dir) / LOCK_FILENAME
|
||||
|
||||
|
||||
@contextmanager
def registry_lock(registry_dir: str):
    """Advisory exclusive lock over the whole registry dir via fcntl.

    PoC-grade single-host concurrency control. Multiple tmux sessions / scripts
    serialise their read-modify-write of job records through this lock so two
    sessions never claim the same pending job. For multi-host delegation move
    to SQLite WAL (see references/registry.md)."""
    import fcntl  # POSIX only; imported lazily so import works on Windows.

    Path(registry_dir).mkdir(parents=True, exist_ok=True)
    lock_file = _lock_path(registry_dir)
    fh = open(lock_file, "a+")
    try:
        fcntl.flock(fh.fileno(), fcntl.LOCK_EX)
        yield
    finally:
        try:
            fcntl.flock(fh.fileno(), fcntl.LOCK_UN)
        finally:
            fh.close()
|
||||
|
||||
|
||||
def load_job(job_id: str, registry_dir: str = DEFAULT_REGISTRY_DIR) -> Dict[str, Any]:
    """Load and parse a job record. Raises FileNotFoundError if absent."""
    path = _job_path(job_id, registry_dir)
    if not path.exists():
        raise FileNotFoundError(f"job record not found: {path}")
    with open(path, "r", encoding="utf-8") as fh:
        return json.load(fh)
|
||||
|
||||
|
||||
def _atomic_write_record(job_id: str, registry_dir: str, record: Dict[str, Any]) -> None:
    """Write a record atomically: temp file in the same dir + os.replace.

    The rename is atomic on POSIX, so readers never observe a half-written
    file. Callers MUST already hold ``registry_lock`` for read-modify-write
    correctness."""
    Path(registry_dir).mkdir(parents=True, exist_ok=True)
    path = _job_path(job_id, registry_dir)
    fd, tmp = tempfile.mkstemp(dir=str(path.parent), prefix=f".{job_id}.", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(record, fh, ensure_ascii=False, indent=2)
            fh.write("\n")
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp, path)
        try:
            os.chmod(path, 0o600)
        except Exception:
            pass
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
|
||||
|
||||
|
||||
def update_job_status(job_id: str, registry_dir: str = DEFAULT_REGISTRY_DIR, **fields: Any) -> Dict[str, Any]:
    """Atomically merge ``fields`` into a job record under the registry lock.

    Always refreshes ``updated_at``. Returns the new record. Raises
    FileNotFoundError if the job does not exist.

    This is the single chokepoint for status writes (both ``registry.update_status``
    and ``publish_event.py``'s status sync route through here), so it also mirrors
    any ``status`` change into the persistent audit log (best-effort, after the
    registry lock is released, so a slow/failed log write never blocks the record)."""
    with registry_lock(registry_dir):
        record = load_job(job_id, registry_dir)
        old_status = record.get("status")
        record.update(fields)
        record["updated_at"] = _utcnow()
        _atomic_write_record(job_id, registry_dir, record)
    if "status" in fields:
        new_status = record.get("status")
        update_logged_status(job_id, new_status, updated_at=record["updated_at"])
        if old_status != new_status:
            append_event(job_id, {
                "event": "status_changed",
                "from": old_status,
                "to": new_status,
                "timestamp": record["updated_at"],
            })
    return record
|
||||
|
||||
|
||||
def next_seq(job_id: str, registry_dir: str = DEFAULT_REGISTRY_DIR) -> int:
|
||||
"""Return the next monotonic sequence number for a job, persisted in the
|
||||
record's ``last_seq`` field so it stays consistent across process restarts.
|
||||
First call returns 1."""
|
||||
with registry_lock(registry_dir):
|
||||
record = load_job(job_id, registry_dir)
|
||||
seq = int(record.get("last_seq", 0)) + 1
|
||||
record["last_seq"] = seq
|
||||
record["updated_at"] = _utcnow()
|
||||
_atomic_write_record(job_id, registry_dir, record)
|
||||
return seq
|
||||
|
||||
|
||||
def _utcnow() -> str:
|
||||
"""ISO-8601 UTC timestamp with trailing Z (payload `timestamp` field)."""
|
||||
return time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
|
||||
|
||||
|
||||
def _utcnow_precise() -> str:
|
||||
"""ISO-8601 UTC timestamp with millisecond resolution. Used for the audit
|
||||
log's ``logged_at`` so events sort cleanly even within the same second."""
|
||||
now = time.time()
|
||||
base = time.strftime("%Y-%m-%dT%H:%M:%S", time.gmtime(now))
|
||||
return f"{base}.{int((now % 1) * 1000):03d}Z"
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# Persistent audit log (.hermes/delegate_job_logs/<job_id>/...)
|
||||
#
|
||||
# Every function here is idempotent, concurrency-safe, and *best-effort*: a
|
||||
# logging failure is swallowed with a logger.warning and never propagated, so it
|
||||
# can never break a publish, a subscribe, or a registry write. stdout is never
|
||||
# touched (it is reserved for data output).
|
||||
# --------------------------------------------------------------------------
|
||||
def job_log_dir(job_id: str, logs_dir: Optional[str] = None) -> Path:
|
||||
return Path(logs_dir or LOGS_DIR) / job_id
|
||||
|
||||
|
||||
def job_log_path(job_id: str, kind: str, logs_dir: Optional[str] = None) -> Path:
|
||||
"""Path to one audit-log file for a job. ``kind`` is a filename, e.g. the
|
||||
module constants META_FILENAME / EVENTS_FILENAME / STATUS_FILENAME."""
|
||||
return job_log_dir(job_id, logs_dir) / kind
|
||||
|
||||
|
||||
@contextmanager
|
||||
def _file_lock(fh):
|
||||
"""Best-effort exclusive lock over a single open file via fcntl, so two
|
||||
processes appending to events.ndjson never interleave a line. A no-op where
|
||||
fcntl is unavailable (Windows); a short append is atomic enough there."""
|
||||
try:
|
||||
import fcntl
|
||||
except ImportError: # pragma: no cover - non-POSIX
|
||||
yield
|
||||
return
|
||||
fcntl.flock(fh.fileno(), fcntl.LOCK_EX)
|
||||
try:
|
||||
yield
|
||||
finally:
|
||||
fcntl.flock(fh.fileno(), fcntl.LOCK_UN)
|
||||
|
||||
|
||||
def append_event(job_id: str, event_dict: Dict[str, Any], logs_dir: Optional[str] = None) -> None:
|
||||
"""Append one event as a JSON line to ``<logs>/<job_id>/events.ndjson``.
|
||||
|
||||
Concurrency-safe (fcntl lock over the file) and best-effort. A millisecond
|
||||
``logged_at`` is stamped when the caller did not supply one."""
|
||||
try:
|
||||
path = job_log_path(job_id, EVENTS_FILENAME, logs_dir)
|
||||
path.parent.mkdir(parents=True, exist_ok=True)
|
||||
record = dict(event_dict)
|
||||
record.setdefault("logged_at", _utcnow_precise())
|
||||
line = json.dumps(record, ensure_ascii=False) + "\n"
|
||||
with open(path, "a", encoding="utf-8") as fh:
|
||||
with _file_lock(fh):
|
||||
fh.write(line)
|
||||
fh.flush()
|
||||
except Exception as exc: # pragma: no cover - best effort
|
||||
logger.warning("append_event failed for job %s: %s", job_id, exc)
|
||||
|
||||
|
||||
def update_logged_status(job_id: str, status: str, logs_dir: Optional[str] = None, **extras: Any) -> None:
    """Rewrite ``<logs>/<job_id>/status.json`` (current status for fast point
    queries) atomically. Best-effort; merges any ``extras``."""
    try:
        path = job_log_path(job_id, STATUS_FILENAME, logs_dir)
        path.parent.mkdir(parents=True, exist_ok=True)
        record: Dict[str, Any] = {"job_id": job_id, "status": status, "updated_at": _utcnow()}
        record.update(extras)
        # Unique temp name per writer: a shared "<name>.tmp" would let two
        # concurrent writers clobber each other's temp file before os.replace.
        fd, tmp = tempfile.mkstemp(dir=str(path.parent), prefix=f".{path.name}.", suffix=".tmp")
        try:
            with os.fdopen(fd, "w", encoding="utf-8") as fh:
                json.dump(record, fh, ensure_ascii=False, indent=2)
                fh.write("\n")
            os.replace(tmp, path)
        except BaseException:
            if os.path.exists(tmp):
                os.unlink(tmp)
            raise
    except Exception as exc:  # pragma: no cover - best effort
        logger.warning("update_logged_status failed for job %s: %s", job_id, exc)
|
||||
|
||||
|
||||
def init_job_log(job_id: str, meta: Dict[str, Any], logs_dir: Optional[str] = None) -> None:
|
||||
"""Seed the per-job audit-log dir: write meta.json, status.json, and a first
|
||||
``registered`` line in events.ndjson. Idempotent (the ``registered`` line is
|
||||
written only when events.ndjson does not yet exist) and best-effort."""
|
||||
try:
|
||||
d = job_log_dir(job_id, logs_dir)
|
||||
d.mkdir(parents=True, exist_ok=True)
|
||||
with open(d / META_FILENAME, "w", encoding="utf-8") as fh:
|
||||
json.dump(meta, fh, ensure_ascii=False, indent=2)
|
||||
fh.write("\n")
|
||||
status = meta.get("status", "pending")
|
||||
update_logged_status(
|
||||
job_id, status, logs_dir=logs_dir,
|
||||
created_at=meta.get("created_at"), prompt=meta.get("prompt"),
|
||||
)
|
||||
events_path = d / EVENTS_FILENAME
|
||||
first_time = not events_path.exists()
|
||||
events_path.touch(exist_ok=True)
|
||||
if first_time:
|
||||
append_event(job_id, {
|
||||
"event": "registered",
|
||||
"status": status,
|
||||
"agent": meta.get("agent"),
|
||||
"agent_session": meta.get("agent_session"),
|
||||
"topic_prefix": meta.get("topic_prefix"),
|
||||
"timestamp": meta.get("created_at"),
|
||||
}, logs_dir=logs_dir)
|
||||
except Exception as exc: # pragma: no cover - best effort
|
||||
logger.warning("init_job_log failed for job %s: %s", job_id, exc)
|
||||
|
||||
|
||||
def read_logged_meta(job_id: str, logs_dir: Optional[str] = None) -> Optional[Dict[str, Any]]:
|
||||
"""Return a job's audit meta.json (registration snapshot), or None."""
|
||||
try:
|
||||
with open(job_log_path(job_id, META_FILENAME, logs_dir), "r", encoding="utf-8") as fh:
|
||||
return json.load(fh)
|
||||
except (OSError, json.JSONDecodeError):
|
||||
return None
|
||||
|
||||
|
||||
def read_logged_status(job_id: str, logs_dir: Optional[str] = None) -> Optional[Dict[str, Any]]:
|
||||
"""Return a job's current status.json, or None. This is the fast point-query
|
||||
file (current status only), separate from the registration-time meta.json."""
|
||||
try:
|
||||
with open(job_log_path(job_id, STATUS_FILENAME, logs_dir), "r", encoding="utf-8") as fh:
|
||||
return json.load(fh)
|
||||
except (OSError, json.JSONDecodeError):
|
||||
return None
|
||||
|
||||
|
||||
def iter_logged_events(job_id: str, logs_dir: Optional[str] = None):
|
||||
"""Yield each parsed event from a job's events.ndjson in file (time) order.
|
||||
Malformed lines are skipped with a warning."""
|
||||
path = job_log_path(job_id, EVENTS_FILENAME, logs_dir)
|
||||
if not path.exists():
|
||||
return
|
||||
with open(path, "r", encoding="utf-8") as fh:
|
||||
for line in fh:
|
||||
line = line.strip()
|
||||
if not line:
|
||||
continue
|
||||
try:
|
||||
yield json.loads(line)
|
||||
except json.JSONDecodeError:
|
||||
logger.warning("skipping malformed audit line in %s", path)
|
||||
|
||||
|
||||
def list_logged_jobs(logs_dir: Optional[str] = None) -> List[Dict[str, Any]]:
|
||||
"""Return one meta record per job directory under the logs root, oldest
|
||||
first. Falls back to ``{"job_id": <dir>}`` when meta.json is missing."""
|
||||
base = Path(logs_dir or LOGS_DIR)
|
||||
out: List[Dict[str, Any]] = []
|
||||
if not base.exists():
|
||||
return out
|
||||
for d in sorted(base.iterdir()):
|
||||
if not d.is_dir():
|
||||
continue
|
||||
meta = read_logged_meta(d.name, logs_dir) or {"job_id": d.name}
|
||||
# Overlay the live status.json so the summary reflects current state, not
|
||||
# the registration-time snapshot frozen in meta.json.
|
||||
status = read_logged_status(d.name, logs_dir)
|
||||
if status:
|
||||
meta = {**meta,
|
||||
"status": status.get("status", meta.get("status")),
|
||||
"updated_at": status.get("updated_at", meta.get("updated_at"))}
|
||||
out.append(meta)
|
||||
out.sort(key=lambda m: m.get("created_at") or "")
|
||||
return out
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# Retry helper
|
||||
# --------------------------------------------------------------------------
|
||||
def with_retry(
    fn: Optional[Callable] = None,
    *,
    attempts: int = 3,
    base_delay: float = 0.5,
    factor: float = 2.0,
    max_delay: float = 8.0,
    exceptions: Iterable[type] = (Exception,),
) -> Callable:
    """Retry ``fn`` with exponential backoff.

    Usable two ways::

        result = with_retry(do_publish, attempts=3)()   # wrap-and-call
        @with_retry(attempts=5, base_delay=1.0)         # decorator
        def do_publish(): ...

    Re-raises the last exception once ``attempts`` is exhausted. Raises
    ValueError when called with ``attempts`` below 1, because ``fn`` is then
    never invoked.
    """
    exc_tuple = tuple(exceptions)

    def decorate(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            delay = base_delay
            last_exc: Optional[BaseException] = None
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exc_tuple as exc:
                    last_exc = exc
                    if attempt >= attempts:
                        break
                    logger.warning(
                        "attempt %d/%d failed: %s; retrying in %.1fs",
                        attempt, attempts, exc, delay,
                    )
                    time.sleep(delay)
                    delay = min(delay * factor, max_delay)
            if last_exc is None:
                # attempts < 1: the loop never ran. An explicit error also
                # survives ``python -O``, which strips assert statements.
                raise ValueError("attempts must be >= 1")
            raise last_exc

        return wrapper

    if fn is not None:
        return decorate(fn)
    return decorate
|
||||
|
||||
|
||||
def setup_logging(level: int = logging.WARNING) -> None:
|
||||
"""Configure root logging to stderr. stdout is reserved for data output
|
||||
(subscriber event lines, registry ids)."""
|
||||
import sys
|
||||
|
||||
logging.basicConfig(
|
||||
level=level,
|
||||
stream=sys.stderr,
|
||||
format="%(asctime)s %(levelname)s %(name)s: %(message)s",
|
||||
)
|
||||
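The backoff schedule followed by `with_retry` can be made concrete with a small standalone sketch. `backoff_delays` is a hypothetical helper, not part of `mqtt_common`; it only mirrors the `delay = min(delay * factor, max_delay)` update and lists the sleeps taken between attempts.

```python
from typing import List


def backoff_delays(attempts: int = 3, base_delay: float = 0.5,
                   factor: float = 2.0, max_delay: float = 8.0) -> List[float]:
    """Return the sleeps taken between attempts (hypothetical helper).

    No sleep follows the final attempt, so the list has ``attempts - 1``
    entries.
    """
    delays: List[float] = []
    delay = base_delay
    for _ in range(max(0, attempts - 1)):
        delays.append(delay)
        delay = min(delay * factor, max_delay)
    return delays


if __name__ == "__main__":
    print(backoff_delays())            # defaults used by with_retry
    print(backoff_delays(attempts=7))  # growth is capped at max_delay
```

With the defaults, a publish that fails on every attempt sleeps 1.5 s in total before the last exception is re-raised.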
@@ -0,0 +1,225 @@
|
||||
#!/usr/bin/env python3
|
||||
"""publish_event.py — the single entry point for emitting a Job event.
|
||||
|
||||
Loads the job record from the registry, resolves its broker, assigns the next
|
||||
monotonic ``seq``, builds the schema-v1 JSON payload, and publishes it to
|
||||
``<topic_prefix>/events`` over QoS 1 with exponential-backoff retry.
|
||||
|
||||
Silent by design: nothing is printed to stdout. Diagnostics go to stderr via
|
||||
logging. Terminal events (``completed``/``error``) publish with retain=True so
|
||||
a late subscriber still observes the final state (production hardening).
|
||||
|
||||
Exit codes:
  0  published successfully
  1  parameter / registry error (invalid --data, unknown job, no pending job)
  2  publish failed after retries (network / broker / ACK timeout); argparse
     usage errors also exit with 2
|
||||
|
||||
Usage:
|
||||
publish_event.py --job <id> --event started [--detail "..."] [--data '{...}']
|
||||
publish_event.py --pick-pending --agent-session tmux:claude --event completed
|
||||
publish_event.py --job <id> --event completed --retained
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import logging
|
||||
import sys
|
||||
import time
|
||||
from typing import Any, Dict, Optional
|
||||
|
||||
import mqtt_common
|
||||
import registry
|
||||
from mqtt_common import (
|
||||
DEFAULT_REGISTRY_DIR,
|
||||
SCHEMA_VERSION,
|
||||
broker_config_from_job,
|
||||
events_topic_for,
|
||||
load_job,
|
||||
make_client,
|
||||
next_seq,
|
||||
with_retry,
|
||||
)
|
||||
|
||||
logger = logging.getLogger("delegate_job.publish_event")
|
||||
|
||||
VALID_EVENTS = ("started", "permission_required", "progress", "completed", "error")
|
||||
TERMINAL_EVENTS = ("completed", "error")
|
||||
# event -> registry status to sync as a best-effort side effect
|
||||
EVENT_TO_STATUS = {
|
||||
"started": "running",
|
||||
"completed": "completed",
|
||||
"error": "error",
|
||||
}
|
||||
|
||||
CONNECT_ACK_TIMEOUT = 10 # seconds to wait for CONNACK
|
||||
PUBLISH_ACK_TIMEOUT = 5 # seconds to wait for QoS-1 PUBACK
|
||||
|
||||
|
||||
def build_payload(
|
||||
job_id: str,
|
||||
seq: int,
|
||||
event: str,
|
||||
detail: str,
|
||||
data: Optional[Dict[str, Any]],
|
||||
auth_token: Optional[str],
|
||||
) -> Dict[str, Any]:
|
||||
payload: Dict[str, Any] = {
|
||||
"schema_version": SCHEMA_VERSION,
|
||||
"seq": seq,
|
||||
"job_id": job_id,
|
||||
"event": event,
|
||||
"timestamp": mqtt_common._utcnow(),
|
||||
"detail": detail,
|
||||
"data": dict(data) if data else {},
|
||||
}
|
||||
# Production: carry the per-job auth token so the subscriber can verify the
|
||||
# publisher. The token is compared in plain text (bearer-token style) by the
|
||||
# subscriber — NOT an HMAC. See SKILL.md "Auth token" and PLAN 8.2. The
|
||||
# registry stores the per-job token in `auth_token`; only include it on the
|
||||
# wire when set so the public broker (no auth) doesn't leak anything.
|
||||
if auth_token:
|
||||
payload["data"]["auth_token"] = auth_token
|
||||
return payload
|
||||
|
||||
|
||||
def _publish_once(config, topic: str, body: bytes, retain: bool) -> None:
|
||||
"""Connect, publish one QoS-1 message, wait for the broker ACK, disconnect.
|
||||
|
||||
Raises on any failure so ``with_retry`` can re-run the whole sequence (a
|
||||
fresh connection per attempt is the robust choice for a PoC)."""
|
||||
client = make_client("publisher", config)
|
||||
connected = {"rc": None}
|
||||
|
||||
def on_connect(_c, _u, _flags, reason_code, _props):
|
||||
connected["rc"] = reason_code
|
||||
|
||||
client.on_connect = on_connect
|
||||
client.connect(config.host, config.port, config.keepalive)
|
||||
client.loop_start()
|
||||
try:
|
||||
# Wait for CONNACK so we fail fast on auth/TLS errors.
|
||||
deadline = time.monotonic() + CONNECT_ACK_TIMEOUT
|
||||
while connected["rc"] is None and time.monotonic() < deadline:
|
||||
time.sleep(0.05)
|
||||
if connected["rc"] is None:
|
||||
raise TimeoutError("no CONNACK from broker")
|
||||
if mqtt_common.reason_code_value(connected["rc"]) != 0:
|
||||
raise ConnectionError(f"broker refused connection: rc={connected['rc']}")
|
||||
|
||||
info = client.publish(topic, payload=body, qos=1, retain=retain)
|
||||
info.wait_for_publish(timeout=PUBLISH_ACK_TIMEOUT)
|
||||
if not info.is_published():
|
||||
raise TimeoutError("publish not acknowledged within timeout")
|
||||
finally:
|
||||
client.loop_stop()
|
||||
try:
|
||||
client.disconnect()
|
||||
except Exception: # pragma: no cover - disconnect best effort
|
||||
pass
|
||||
|
||||
|
||||
def _resolve_job_id(args) -> Optional[str]:
|
||||
if args.pick_pending:
|
||||
return registry.pick_pending(args.agent_session, args.registry_dir)
|
||||
return args.job
|
||||
|
||||
|
||||
def main(argv=None) -> int:
|
||||
parser = argparse.ArgumentParser(description="Publish a Job event to MQTT")
|
||||
target = parser.add_mutually_exclusive_group(required=True)
|
||||
target.add_argument("--job", help="job id to publish for")
|
||||
target.add_argument("--pick-pending", action="store_true",
|
||||
help="auto-select a pending job for --agent-session")
|
||||
parser.add_argument("--agent-session", default="tmux:claude",
|
||||
help="session label used with --pick-pending")
|
||||
parser.add_argument("--event", default="progress", choices=VALID_EVENTS)
|
||||
parser.add_argument("--detail", default="")
|
||||
parser.add_argument("--data", default=None, help="optional JSON object string")
|
||||
parser.add_argument("--retained", action="store_true",
|
||||
help="force retain=True (auto for completed/error)")
|
||||
parser.add_argument("--registry-dir", default=DEFAULT_REGISTRY_DIR)
|
||||
parser.add_argument("--attempts", type=int, default=3)
|
||||
parser.add_argument("-v", "--verbose", action="store_true")
|
||||
args = parser.parse_args(argv)
|
||||
|
||||
mqtt_common.setup_logging(logging.DEBUG if args.verbose else logging.WARNING)
|
||||
|
||||
# --- parse optional data JSON (parameter error -> exit 1) ---
|
||||
data: Optional[Dict[str, Any]] = None
|
||||
if args.data:
|
||||
try:
|
||||
data = json.loads(args.data)
|
||||
if not isinstance(data, dict):
|
||||
raise ValueError("--data must be a JSON object")
|
||||
except (ValueError, json.JSONDecodeError) as exc:
|
||||
logger.error("invalid --data: %s", exc)
|
||||
return 1
|
||||
|
||||
job_id = _resolve_job_id(args)
|
||||
if not job_id:
|
||||
logger.error("no job to publish for (unknown --job or no pending job)")
|
||||
return 1
|
||||
|
||||
try:
|
||||
job = load_job(job_id, args.registry_dir)
|
||||
except FileNotFoundError as exc:
|
||||
logger.error("%s", exc)
|
||||
return 1
|
||||
|
||||
config = broker_config_from_job(job)
|
||||
topic = job.get("topic_prefix")
|
||||
topic = f"{topic}/events" if topic else events_topic_for(job_id)
|
||||
seq = next_seq(job_id, args.registry_dir)
|
||||
payload = build_payload(
|
||||
job_id=job_id,
|
||||
seq=seq,
|
||||
event=args.event,
|
||||
detail=args.detail,
|
||||
data=data,
|
||||
auth_token=job.get("auth_token"),
|
||||
)
|
||||
body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
|
||||
retain = args.retained or args.event in TERMINAL_EVENTS
|
||||
|
||||
publish = with_retry(
|
||||
_publish_once,
|
||||
attempts=args.attempts,
|
||||
exceptions=(OSError, TimeoutError, ConnectionError, ValueError),
|
||||
)
|
||||
try:
|
||||
publish(config, topic, body, retain)
|
||||
except Exception as exc:
|
||||
logger.error("publish failed after %d attempts: %s", args.attempts, exc)
|
||||
return 2
|
||||
|
||||
# Persistent audit log: record the exact payload we put on the wire so the
|
||||
# publish is reproducible from the log alone. Best-effort (isolated inside
|
||||
# append_event) — never fails the publish.
|
||||
mqtt_common.append_event(job_id, {
|
||||
"event": "published",
|
||||
"source_event": args.event,
|
||||
"seq": seq,
|
||||
"topic": topic,
|
||||
"retain": retain,
|
||||
"timestamp": payload["timestamp"],
|
||||
"detail": args.detail,
|
||||
"payload": payload,
|
||||
})
|
||||
|
||||
# Best-effort side effects: registry status sync + (debug) event log. Never
|
||||
# fail the publish on these.
|
||||
registry.append_event(job_id, args.registry_dir, payload)
|
||||
new_status = EVENT_TO_STATUS.get(args.event)
|
||||
if new_status:
|
||||
try:
|
||||
mqtt_common.update_job_status(job_id, args.registry_dir, status=new_status)
|
||||
except Exception as exc: # pragma: no cover - best effort
|
||||
logger.warning("status sync failed: %s", exc)
|
||||
|
||||
logger.info("published %s seq=%d job=%s retain=%s", args.event, seq, job_id, retain)
|
||||
return 0
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
sys.exit(main())
|
||||
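The payload layout and the retain rule used by `publish_event.py` can be tried without a broker. The sketch below is an illustration only: `sketch_payload` and `should_retain` are hypothetical names, `SCHEMA_VERSION = 1` is an assumption (the real constant lives in `mqtt_common`), the job id `deadbeef` is made up, and the timestamp is a fixed placeholder.

```python
import json
from typing import Any, Dict, Optional

SCHEMA_VERSION = 1  # assumption: the real value is defined in mqtt_common
TERMINAL_EVENTS = ("completed", "error")


def sketch_payload(job_id: str, seq: int, event: str, detail: str = "",
                   data: Optional[Dict[str, Any]] = None,
                   auth_token: Optional[str] = None) -> Dict[str, Any]:
    """Mirror build_payload's field layout with a fixed timestamp."""
    payload: Dict[str, Any] = {
        "schema_version": SCHEMA_VERSION,
        "seq": seq,
        "job_id": job_id,
        "event": event,
        "timestamp": "2024-01-01T00:00:00Z",  # placeholder, not a clock read
        "detail": detail,
        "data": dict(data) if data else {},
    }
    if auth_token:  # the token travels inside "data" only when one is set
        payload["data"]["auth_token"] = auth_token
    return payload


def should_retain(event: str, forced: bool = False) -> bool:
    """Terminal events are retained; --retained forces it for any event."""
    return forced or event in TERMINAL_EVENTS


if __name__ == "__main__":
    body = sketch_payload("deadbeef", 1, "started", detail="agent launched")
    print(json.dumps(body, indent=2))
    print(should_retain("progress"), should_retain("completed"))
```

Copying `data` before adding the token keeps the caller's dictionary unchanged, which matches the `dict(data)` call in `build_payload`.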
@@ -0,0 +1,327 @@
|
||||
"""Job registry for the tmux-agent-orchestrate-delegate-job skill.
|
||||
|
||||
A job record is the single source of truth for one delegated unit of work:
|
||||
its id, prompt, owning agent session, broker connection, timeouts, and status.
|
||||
Records live as ``<registry_dir>/<job_id>.json`` with an append-only event log
|
||||
``<registry_dir>/<job_id>.events.log`` and a shared ``<registry_dir>/.lock``.
|
||||
|
||||
Concurrency is handled via the fcntl lock in :mod:`mqtt_common` (PoC). For
|
||||
multi-host delegation, migrate to SQLite WAL — see references/registry.md.
|
||||
|
||||
Importable as a library and runnable as a CLI (``register``/``list``/``get``/
``status``/``pick``/``logs``) so the ``tmux-agent-orchestrate-delegate-job`` bash wrapper can shell out.
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import logging
|
||||
import sys
|
||||
import uuid
|
||||
from pathlib import Path
|
||||
from typing import Any, Dict, List, Optional
|
||||
|
||||
import mqtt_common
|
||||
from mqtt_common import (
|
||||
DEFAULT_REGISTRY_DIR,
|
||||
SCHEMA_VERSION,
|
||||
_atomic_write_record,
|
||||
_utcnow,
|
||||
broker_config_from_env,
|
||||
load_job,
|
||||
registry_lock,
|
||||
topic_prefix_for,
|
||||
)
|
||||
|
||||
logger = logging.getLogger("delegate_job.registry")
|
||||
|
||||
TERMINAL_STATUSES = ("completed", "error", "cancelled")
|
||||
VALID_STATUSES = ("pending", "running", "completed", "error", "cancelled")
|
||||
|
||||
|
||||
def generate_job_id(bits: int = 32) -> str:
|
||||
"""PoC: 32-bit hex (8 chars). Production: 128-bit (full uuid4 hex)."""
|
||||
if bits >= 128:
|
||||
return uuid.uuid4().hex
|
||||
nibbles = max(1, bits // 4)
|
||||
return uuid.uuid4().hex[:nibbles]
|
||||
|
||||
|
||||
def register_job(
|
||||
prompt: str,
|
||||
agent: str = "claude-code",
|
||||
agent_session: str = "tmux:claude",
|
||||
broker: Optional[Dict[str, Any]] = None,
|
||||
timeout_sec: int = 600,
|
||||
idle_timeout_sec: int = 120,
|
||||
registry_dir: str = DEFAULT_REGISTRY_DIR,
|
||||
job_id: Optional[str] = None,
|
||||
expected_artifacts: Optional[List[str]] = None,
|
||||
bits: int = 32,
|
||||
auth_token: Optional[str] = None,
|
||||
) -> str:
|
||||
"""Create a new ``pending`` job record and return its id.
|
||||
|
||||
``broker`` defaults to the current environment's resolved broker block, so
|
||||
the registry alone is enough for ``publish_event.py`` to connect later.
|
||||
"""
|
||||
job_id = job_id or generate_job_id(bits)
|
||||
if broker is None:
|
||||
broker = broker_config_from_env().to_registry_block()
|
||||
now = _utcnow()
|
||||
record: Dict[str, Any] = {
|
||||
"schema_version": SCHEMA_VERSION,
|
||||
"job_id": job_id,
|
||||
"status": "pending",
|
||||
"created_at": now,
|
||||
"updated_at": now,
|
||||
"prompt": prompt,
|
||||
"agent": agent,
|
||||
"agent_session": agent_session,
|
||||
"broker": broker,
|
||||
"topic_prefix": topic_prefix_for(job_id),
|
||||
"timeout_sec": int(timeout_sec),
|
||||
"idle_timeout_sec": int(idle_timeout_sec),
|
||||
"expected_artifacts": expected_artifacts or [],
|
||||
"last_seq": 0,
|
||||
"auth_token": auth_token,
|
||||
}
|
||||
with registry_lock(registry_dir):
|
||||
if mqtt_common._job_path(job_id, registry_dir).exists():
|
||||
raise FileExistsError(f"job already exists: {job_id}")
|
||||
_atomic_write_record(job_id, registry_dir, record)
|
||||
# Seed the persistent audit log (meta.json + status.json + a "registered"
|
||||
# event). Best-effort inside init_job_log — never blocks registration.
|
||||
mqtt_common.init_job_log(job_id, meta=record)
|
||||
logger.info("registered job %s (agent=%s session=%s)", job_id, agent, agent_session)
|
||||
return job_id
|
||||
|
||||
|
||||
def pick_pending(agent_session: str, registry_dir: str = DEFAULT_REGISTRY_DIR) -> Optional[str]:
|
||||
"""Claim the oldest ``pending`` job for ``agent_session``, flipping it to
|
||||
``running`` atomically under the lock. Returns the job id, or None if no
|
||||
pending job matches. This is how each tmux session takes only its own work
|
||||
without two sessions grabbing the same job."""
|
||||
with registry_lock(registry_dir):
|
||||
candidates = []
|
||||
for record in _iter_records(registry_dir):
|
||||
if record.get("status") == "pending" and record.get("agent_session") == agent_session:
|
||||
candidates.append(record)
|
||||
if not candidates:
|
||||
return None
|
||||
candidates.sort(key=lambda r: r.get("created_at", ""))
|
||||
chosen = candidates[0]
|
||||
chosen["status"] = "running"
|
||||
chosen["updated_at"] = _utcnow()
|
||||
_atomic_write_record(chosen["job_id"], registry_dir, chosen)
|
||||
logger.info("session %s picked job %s", agent_session, chosen["job_id"])
|
||||
job_id = chosen["job_id"]
|
||||
updated_at = chosen["updated_at"]
|
||||
# pick_pending writes the record directly (not via update_job_status), so it
|
||||
# mirrors the pending->running transition into the audit log here. Best-effort.
|
||||
mqtt_common.update_logged_status(job_id, "running", updated_at=updated_at)
|
||||
mqtt_common.append_event(job_id, {
|
||||
"event": "status_changed",
|
||||
"from": "pending",
|
||||
"to": "running",
|
||||
"by": agent_session,
|
||||
"timestamp": updated_at,
|
||||
})
|
||||
return job_id
|
||||
|
||||
|
||||
def update_status(job_id: str, registry_dir: str, status: str) -> Dict[str, Any]:
|
||||
if status not in VALID_STATUSES:
|
||||
raise ValueError(f"invalid status {status!r}; expected one of {VALID_STATUSES}")
|
||||
return mqtt_common.update_job_status(job_id, registry_dir, status=status)
|
||||
|
||||
|
||||
def list_jobs(registry_dir: str = DEFAULT_REGISTRY_DIR, status: Optional[str] = None) -> List[Dict[str, Any]]:
|
||||
records = list(_iter_records(registry_dir))
|
||||
if status:
|
||||
records = [r for r in records if r.get("status") == status]
|
||||
records.sort(key=lambda r: r.get("created_at", ""))
|
||||
return records
|
||||
|
||||
|
||||
def append_event(job_id: str, registry_dir: str, payload: Dict[str, Any]) -> None:
    """Append one event payload as a JSON line to the job's events log. Best
    effort, debug-only; failures are logged but never raised to the caller."""
    try:
        Path(registry_dir).mkdir(parents=True, exist_ok=True)
        log_path = Path(registry_dir) / f"{job_id}.events.log"
        with open(log_path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(payload, ensure_ascii=False) + "\n")
    except (OSError, TypeError, ValueError) as exc:  # pragma: no cover - best effort
        # TypeError/ValueError cover payloads that json.dumps cannot serialise.
        logger.warning("could not append event for %s: %s", job_id, exc)
|
||||
|
||||
|
||||
# convenience re-export so callers can `from registry import load_job`
|
||||
__all__ = [
|
||||
"register_job", "pick_pending", "update_status", "load_job",
|
||||
"list_jobs", "append_event", "generate_job_id",
|
||||
]
|
||||
|
||||
|
||||
def _iter_records(registry_dir: str):
|
||||
base = Path(registry_dir)
|
||||
if not base.exists():
|
||||
return
|
||||
for path in sorted(base.glob("*.json")):
|
||||
try:
|
||||
with open(path, "r", encoding="utf-8") as fh:
|
||||
yield json.load(fh)
|
||||
except (OSError, json.JSONDecodeError) as exc:
|
||||
logger.warning("skipping unreadable record %s: %s", path, exc)
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# CLI (so the bash wrapper can shell out without inline python)
|
||||
# --------------------------------------------------------------------------
|
||||
def _build_parser() -> argparse.ArgumentParser:
|
||||
parser = argparse.ArgumentParser(description="tmux-agent-orchestrate-delegate-job registry CLI")
|
||||
parser.add_argument("--registry-dir", default=DEFAULT_REGISTRY_DIR)
|
||||
sub = parser.add_subparsers(dest="command", required=True)
|
||||
|
||||
p_reg = sub.add_parser("register", help="create a pending job; prints the job id")
|
||||
p_reg.add_argument("--prompt", required=True)
|
||||
p_reg.add_argument("--agent", default="claude-code")
|
||||
p_reg.add_argument("--agent-session", default="tmux:claude")
|
||||
p_reg.add_argument("--timeout", type=int, default=600)
|
||||
p_reg.add_argument("--idle-timeout", type=int, default=120)
|
||||
p_reg.add_argument("--bits", type=int, default=32, help="32 (PoC) or 128 (prod)")
|
||||
p_reg.add_argument("--artifact", action="append", default=[], dest="artifacts")
|
||||
|
||||
p_list = sub.add_parser("list", help="list jobs (optionally by status)")
|
||||
p_list.add_argument("--status", default=None)
|
||||
p_list.add_argument("--json", action="store_true")
|
||||
|
||||
p_get = sub.add_parser("get", help="print one job record as JSON")
|
||||
p_get.add_argument("--job", required=True)
|
||||
|
||||
p_status = sub.add_parser("status", help="set a job status")
|
||||
p_status.add_argument("--job", required=True)
|
||||
p_status.add_argument("--set", required=True, dest="status")
|
||||
|
||||
p_pick = sub.add_parser("pick", help="claim a pending job for a session; prints id")
|
||||
p_pick.add_argument("--agent-session", default="tmux:claude")
|
||||
|
||||
p_logs = sub.add_parser(
|
||||
"logs",
|
||||
help="show the persistent audit log for a job, or --list every logged job",
|
||||
)
|
||||
p_logs.add_argument("job_id", nargs="?", default=None,
|
||||
help="job id whose events.ndjson to print")
|
||||
p_logs.add_argument("--list", action="store_true", dest="list_all",
|
||||
help="summarise every job under the logs dir instead")
|
||||
p_logs.add_argument("--logs-dir", default=None,
|
||||
help="override the audit-log root (default: $DELEGATE_JOB_LOGS_DIR "
|
||||
"or <cwd>/.hermes/delegate_job_logs)")
|
||||
p_logs.add_argument("--tail", type=int, default=0,
|
||||
help="show only the last N events (0 = all)")
|
||||
p_logs.add_argument("--json", action="store_true",
|
||||
help="emit raw JSON lines / records instead of a table")
|
||||
|
||||
return parser
|
||||
|
||||
|
||||
def main(argv: Optional[List[str]] = None) -> int:
|
||||
mqtt_common.setup_logging(logging.INFO)
|
||||
args = _build_parser().parse_args(argv)
|
||||
rd = args.registry_dir
|
||||
|
||||
if args.command == "register":
|
||||
job_id = register_job(
|
||||
prompt=args.prompt,
|
||||
agent=args.agent,
|
||||
agent_session=args.agent_session,
|
||||
timeout_sec=args.timeout,
|
||||
idle_timeout_sec=args.idle_timeout,
|
||||
registry_dir=rd,
|
||||
expected_artifacts=args.artifacts,
|
||||
bits=args.bits,
|
||||
)
|
||||
print(job_id)
|
||||
return 0
|
||||
|
||||
if args.command == "list":
|
||||
records = list_jobs(rd, status=args.status)
|
||||
if args.json:
|
||||
print(json.dumps(records, ensure_ascii=False, indent=2))
|
||||
else:
|
||||
if not records:
|
||||
print("(no jobs)")
|
||||
for r in records:
|
||||
print(f"{r['job_id']} {r.get('status','?'):10s} {r.get('agent_session','')}"
|
||||
f" {r.get('prompt','')[:48]}")
|
||||
return 0
|
||||
|
||||
if args.command == "get":
|
||||
try:
|
||||
print(json.dumps(load_job(args.job, rd), ensure_ascii=False, indent=2))
|
||||
except FileNotFoundError as exc:
|
||||
print(str(exc), file=sys.stderr)
|
||||
return 1
|
||||
return 0
|
||||
|
||||
if args.command == "status":
|
||||
try:
|
||||
update_status(args.job, rd, args.status)
|
||||
except (FileNotFoundError, ValueError) as exc:
|
||||
print(str(exc), file=sys.stderr)
|
||||
return 1
|
||||
return 0
|
||||
|
||||
if args.command == "pick":
|
||||
job_id = pick_pending(args.agent_session, rd)
|
||||
if job_id is None:
|
||||
return 3 # no pending job for this session
|
||||
print(job_id)
|
||||
return 0
|
||||
|
||||
if args.command == "logs":
|
||||
return _cmd_logs(args)
|
||||
|
||||
return 1
|
||||
|
||||
|
||||
def _cmd_logs(args) -> int:
    """Pretty-print one job's events.ndjson, or summarise all logged jobs."""
    logs_dir = args.logs_dir or mqtt_common.LOGS_DIR

    if args.list_all:
        jobs = mqtt_common.list_logged_jobs(logs_dir)
        if args.json:
            print(json.dumps(jobs, ensure_ascii=False, indent=2))
            return 0
        if not jobs:
            print(f"(no logged jobs under {logs_dir})")
            return 0
        for m in jobs:
            # The `or` fallbacks also cover keys that are present but null in
            # meta.json; formatting None with a width spec raises TypeError.
            print(f"{m.get('job_id') or '?'} {(m.get('status') or '?'):10s} "
                  f"{(m.get('created_at') or '-'):20s} {(m.get('prompt') or '')[:48]}")
        return 0

    if not args.job_id:
        print("logs requires a <job_id> or --list", file=sys.stderr)
        return 1

    events = list(mqtt_common.iter_logged_events(args.job_id, logs_dir))
    if not events and not mqtt_common.job_log_dir(args.job_id, logs_dir).exists():
        print(f"no audit log for job {args.job_id} under {logs_dir}", file=sys.stderr)
        return 1
    if args.tail and args.tail > 0:
        events = events[-args.tail:]
    if args.json:
        for e in events:
            print(json.dumps(e, ensure_ascii=False))
        return 0
    for e in events:
        ts = e.get("logged_at") or e.get("timestamp") or "-"
        extra = e.get("detail") or e.get("to") or e.get("source_event") or ""
        print(f"{ts:24s} {(e.get('event') or '?'):<16s} {extra}")
    return 0
|
||||
|
||||
|
||||
if __name__ == "__main__":
    sys.exit(main())
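Review note on the `logs` output: the per-event line in `_cmd_logs` applies the `s` format code to whatever `logged_at` or `timestamp` holds, and Python raises `ValueError` for that code when the value is a number. Below is a minimal sketch of a more tolerant formatter, assuming the same event keys as the loop above. `format_event_line` is a hypothetical helper, not part of `registry.py`, and this diff does not show whether real events ever carry numeric timestamps.

```python
from typing import Any, Dict


def format_event_line(event: Dict[str, Any]) -> str:
    """Render one audit-log event as a fixed-width text line."""
    # Same fallbacks as _cmd_logs; str() keeps the `s` format code valid
    # even when the stored value is a float or an int.
    ts = str(event.get("logged_at") or event.get("timestamp") or "-")
    name = str(event.get("event", "?"))
    extra = event.get("detail") or event.get("to") or event.get("source_event") or ""
    return f"{ts:24s} {name:<16s} {extra}"


if __name__ == "__main__":
    print(format_event_line({"event": "started", "timestamp": 1700000000.5}))
```

If adopted, the final loop in `_cmd_logs` could call `print(format_event_line(e))` with no change to the column layout.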
@@ -0,0 +1,272 @@
#!/usr/bin/env bash
# tmux-agent-orchestrate-delegate-job: user-facing orchestrator for the tmux-agent-orchestrate-delegate-job skill.
#
# Subcommands:
#   submit   register a job, start the subscriber FIRST, then run the agent,
#            then (optionally) run a validation script.
#   status   show one job record.
#   list     list all jobs.
#   verify   run a user-supplied --validate script against a job's artifacts.
#   wait     block until all running/pending jobs reach a terminal state.
#   logs     show one job's persistent audit log, or list all logged jobs.
#
# This is a reference wrapper: it shells out to the python scripts that live
# next to it. Copy it into your project and customise as needed. With --dry-run
# it prints the instructions it would send instead of launching an agent;
# otherwise a missing `tmux` or agent binary is reported as an error.
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

# Pick an interpreter: $DELEGATE_JOB_PYTHON if set, else a project .venv, else python3.
pick_python() {
  local py_bin
  if [[ -n "${DELEGATE_JOB_PYTHON:-}" ]]; then
    py_bin="$DELEGATE_JOB_PYTHON"
  elif [[ -x "${WORKDIR:-.}/.venv/bin/python" ]]; then
    # Same default as the test above, so an unset WORKDIR cannot trip `set -u`.
    py_bin="${WORKDIR:-.}/.venv/bin/python"
  elif [[ -x ".venv/bin/python" ]]; then
    py_bin="$(pwd)/.venv/bin/python"
  else
    py_bin="python3"
  fi
  if ! "$py_bin" -c "import paho.mqtt" 2>/dev/null; then
    echo "ERROR: paho-mqtt package is missing for $py_bin." >&2
    echo "       Please create a virtual environment and install it:" >&2
    echo "       python3 -m venv .venv && .venv/bin/pip install -r \"$SCRIPT_DIR/requirements.txt\"" >&2
    exit 1
  fi
  echo "$py_bin"
}

REGISTRY_DIR_DEFAULT=".hermes/jobs"

usage() {
  cat <<'EOF'
tmux-agent-orchestrate-delegate-job <command> [options]

  submit  --agent <name> --prompt <text> [--workdir <dir>] [--agent-session <label>]
          [--timeout <sec>] [--idle-timeout <sec>] [--validate <script>]
          [--registry-dir <dir>] [--dry-run]
          # The skill is tmux-interactive only; --mode print was removed.
  status  --job <id> [--registry-dir <dir>]
  list    [--registry-dir <dir>]
  verify  --job <id> --validate <script> [--registry-dir <dir>]
  wait    [--job <id>] [--timeout <sec>] [--registry-dir <dir>]
  logs    <job_id> | --list    # persistent audit log (delegate_job_logs/)
EOF
}

# ---- arg parsing helpers --------------------------------------------------
AGENT="claude-code"; PROMPT=""; WORKDIR="$(pwd)"; AGENT_SESSION="tmux:claude"
TIMEOUT=600; IDLE_TIMEOUT=120; VALIDATE=""; DRY_RUN=0
JOB_ID=""; REGISTRY_DIR="$REGISTRY_DIR_DEFAULT"

parse_opts() {
  while [[ $# -gt 0 ]]; do
    case "$1" in
      --agent) AGENT="$2"; shift 2;;
      --prompt) PROMPT="$2"; shift 2;;
      --workdir) WORKDIR="$2"; shift 2;;
      --agent-session) AGENT_SESSION="$2"; shift 2;;
      --timeout) TIMEOUT="$2"; shift 2;;
      --idle-timeout) IDLE_TIMEOUT="$2"; shift 2;;
      --validate) VALIDATE="$2"; shift 2;;
      --job) JOB_ID="$2"; shift 2;;
      --registry-dir) REGISTRY_DIR="$2"; shift 2;;
      --dry-run) DRY_RUN=1; shift;;
      *) echo "unknown option: $1" >&2; usage; exit 1;;
    esac
  done
}

cmd_submit() {
  parse_opts "$@"
  [[ -n "$PROMPT" ]] || { echo "submit requires --prompt" >&2; exit 1; }
  PY="$(pick_python)"
  cd "$WORKDIR"
  mkdir -p "$REGISTRY_DIR"

  # 1) register job (prints the new job id)
  JOB_ID="$("$PY" "$SCRIPT_DIR/scripts/registry.py" --registry-dir "$REGISTRY_DIR" register \
    --prompt "$PROMPT" --agent "$AGENT" --agent-session "$AGENT_SESSION" \
    --timeout "$TIMEOUT" --idle-timeout "$IDLE_TIMEOUT")"
  echo "registered job: $JOB_ID"

  # 2) START THE SUBSCRIBER FIRST (ordering dependency: without a persistent
  #    session, the broker does not deliver non-retained messages to a client
  #    that subscribes after they were published).
  local logf="$REGISTRY_DIR/$JOB_ID.subscriber.out"
  "$PY" "$SCRIPT_DIR/scripts/job_subscriber.py" --registry-dir "$REGISTRY_DIR" \
    --job "$JOB_ID" --timeout "$TIMEOUT" --idle-timeout "$IDLE_TIMEOUT" \
    >"$logf" 2>&1 &
  local sub_pid=$!
  echo "subscriber pid: $sub_pid (log: $logf)"
  sleep 1 # give the subscriber time to CONNACK + SUBSCRIBE before the agent runs

  # 3) run the agent (or print the command for dry-run / missing binary)
  local pub="$PY $SCRIPT_DIR/scripts/publish_event.py --registry-dir $REGISTRY_DIR --job $JOB_ID"
  # NOTE: the agent MUST use --job "$JOB_ID" (the one we just minted). Hard-coding
  # an id from an earlier session is the #1 reason a delegated job sits idle and
  # times out (see SKILL.md "Wrong job_id propagated to the agent"). We make the
  # freshness explicit in the instruction header.
  local instructions="Your job_id is \"$JOB_ID\" (the one just registered for THIS delegation; read it from the registry record, do NOT reuse any job_id you saw in earlier runs).

On start run: $pub --event started.
On permission/tool prompt run: $pub --event permission_required --detail '<tool>:<what>'.
On progress (optional): $pub --event progress --detail '<short status>'.
On success run: $pub --event completed --detail '<one-line summary>'.
On failure run: $pub --event error --detail '<one-line reason>'.

The subscriber for this job_id is already running; your completed/error event ends the job. Exit codes: 0 completed, 1 error, 2 publish failure.

Task: $PROMPT"

  run_agent "$JOB_ID" "$instructions"

  local logs_root="${DELEGATE_JOB_LOGS_DIR:-$WORKDIR/delegate_job_logs}"

  if [[ "$DRY_RUN" == "1" ]]; then
    # Dry-run: no agent was launched, so no terminal event will ever arrive.
    # Stop the subscriber started in step 2 instead of leaving it to time out,
    # skip validation and the subscriber log dump, and print only the audit-log
    # dir a real run would have used.
    kill "$sub_pid" 2>/dev/null || true
    wait "$sub_pid" 2>/dev/null || true
    echo "$logs_root/$JOB_ID"
    return 0
  fi

  # 4) wait for the subscriber to see a terminal event (or time out)
  wait "$sub_pid" || true
  echo "subscriber output:"; cat "$logf" || true

  # 5) optional validation hook. It runs after the wait because the agent is
  #    launched detached in tmux; validating right after launch would inspect
  #    artifacts that do not exist yet.
  if [[ -n "$VALIDATE" ]]; then
    echo "running validation: $VALIDATE"
    if JOB_ID="$JOB_ID" REGISTRY_DIR="$REGISTRY_DIR" bash "$VALIDATE"; then
      echo "validation: PASS"
    else
      local rc=$?
      echo "validation: FAIL (exit $rc)"
    fi
  fi

  # Last stdout line: the persistent audit-log dir for this job (see SKILL.md
  # "Audit Logs"). Callers can scrape `tail -n1` to find it.
  echo "$logs_root/$JOB_ID"
}

run_agent() {
  local job_id="$1"; local instructions="$2"
  local bin
  # The skill is INTERACTIVE-ONLY. We never invoke `claude -p` or any other
  # one-shot print mode, because:
  #   - claude -p exits the moment stdin is drained, so there's nothing to
  #     `tmux attach` to afterwards.
  #   - fire-and-forget via wrapper defeats the whole point of the audit log
  #     (you can't tell what happened if the agent crashes mid-turn).
  #   - the job registry already gives us an authoritative completion signal,
  #     so we don't need a wrapper-side exit code to know "done".
  # The user attaches with `tmux attach -t <session>` and types follow-up
  # prompts themselves. We pre-load the first prompt via stdin and `read`
  # keeps the pane open after the agent exits so the user can review.
  case "$AGENT" in
    claude-code) bin="claude";;
    codex) bin="codex";;
    human) echo "[human agent] complete the task, then run publish_event.py --event completed"; return;;
    *) bin="$AGENT";;
  esac

  if [[ "$DRY_RUN" == "1" ]]; then
    echo "[dry-run] would launch agent '$AGENT' in a fresh tmux session with instructions:"
    echo "----"; echo "$instructions"; echo "----"
    return
  fi

  if ! command -v tmux >/dev/null 2>&1; then
    echo "ERROR: this skill requires tmux (interactive agent sessions)." >&2
    echo "       Install with: brew install tmux (or your package manager)" >&2
    return 1
  fi
  if ! command -v "$bin" >/dev/null 2>&1; then
    echo "ERROR: agent binary '$bin' not found in PATH." >&2
    return 1
  fi

  local sess="${AGENT_SESSION#tmux:}"
  # Detect a stale session with the same name (e.g. the user is still attached
  # from an earlier run, or a previous wrapper died without cleanup). tmux
  # new-session on an existing name only reports "duplicate session"; check
  # first so the error says how to recover.
  if tmux has-session -t "$sess" 2>/dev/null; then
    local attached
    attached=$(tmux list-clients -t "$sess" 2>/dev/null | wc -l | tr -d ' ')
    echo "ERROR: tmux session '$sess' already exists (clients attached: $attached)." >&2
    echo "       Pick a unique --agent-session (e.g. tmux:demo, tmux:claude-a) or" >&2
    echo "       kill the stale one first: tmux kill-session -t $sess" >&2
    return 1
  fi
  # Write the instructions to a file and feed it to the agent on stdin. Splicing
  # "$instructions" into the tmux command string breaks the inner shell command,
  # because the text always contains double quotes (the job_id line).
  local prompt_file="$REGISTRY_DIR/$job_id.prompt.txt"
  printf '%s' "$instructions" >"$prompt_file"
  tmux new-session -d -s "$sess" -c "$WORKDIR" \
    "$bin --dangerously-skip-permissions < '$prompt_file'; echo; echo '--- agent exited (job $job_id); press enter to close ---'; read"
  echo "agent launched in tmux session: $sess (attach with: tmux attach -t $sess)"
}

cmd_status() {
  parse_opts "$@"
  [[ -n "$JOB_ID" ]] || { echo "status requires --job" >&2; exit 1; }
  PY="$(pick_python)"
  "$PY" "$SCRIPT_DIR/scripts/registry.py" --registry-dir "$REGISTRY_DIR" get --job "$JOB_ID"
}

cmd_list() {
  parse_opts "$@"
  PY="$(pick_python)"
  "$PY" "$SCRIPT_DIR/scripts/registry.py" --registry-dir "$REGISTRY_DIR" list
}

cmd_verify() {
  parse_opts "$@"
  [[ -n "$JOB_ID" ]] || { echo "verify requires --job" >&2; exit 1; }
  [[ -n "$VALIDATE" ]] || { echo "verify requires --validate <script>" >&2; exit 1; }
  echo "verifying job $JOB_ID with $VALIDATE"
  if JOB_ID="$JOB_ID" REGISTRY_DIR="$REGISTRY_DIR" bash "$VALIDATE"; then
    echo "verify: PASS (exit 0)"; exit 0
  else
    rc=$?; echo "verify: FAIL (exit $rc)"; exit "$rc"
  fi
}

cmd_logs() {
  # logs <job_id> | logs --list: delegates to registry.py's logs CLI, which
  # reads the persistent audit log under $DELEGATE_JOB_LOGS_DIR (or
  # <cwd>/delegate_job_logs). Run from your project dir so the default resolves.
  PY="$(pick_python)"
  if [[ "${1:-}" == "--list" ]]; then
    "$PY" "$SCRIPT_DIR/scripts/registry.py" logs --list
  else
    local jid="${1:-}"
    [[ -n "$jid" ]] || { echo "logs requires <job_id> or --list" >&2; exit 1; }
    "$PY" "$SCRIPT_DIR/scripts/registry.py" logs "$jid"
  fi
}

cmd_wait() {
  parse_opts "$@"
  PY="$(pick_python)"
  if [[ -n "$JOB_ID" ]]; then
    "$PY" "$SCRIPT_DIR/scripts/job_subscriber.py" --registry-dir "$REGISTRY_DIR" \
      --job "$JOB_ID" --timeout "$TIMEOUT"
  else
    "$PY" "$SCRIPT_DIR/scripts/job_subscriber.py" --registry-dir "$REGISTRY_DIR" \
      --wait-any --timeout "$TIMEOUT"
  fi
}

main() {
  local sub="${1:-}"; shift || true
  case "$sub" in
    submit) cmd_submit "$@";;
    status) cmd_status "$@";;
    list) cmd_list "$@";;
    verify) cmd_verify "$@";;
    wait) cmd_wait "$@";;
    logs) cmd_logs "$@";;
    ""|-h|--help|help) usage;;
    *) echo "unknown command: $sub" >&2; usage; exit 1;;
  esac
}

main "$@"
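Review note on `run_agent`: the command it hands to `tmux new-session` is a single shell string, so anything spliced into it has to be quoted for the shell that tmux starts. Below is a minimal sketch of assembling such a command with each piece quoted. It is in Python only because the scripts the wrapper calls are Python; `build_tmux_command` and its parameters are hypothetical names, and the sketch leaves out the agent flags and the trailing `read` that the wrapper uses.

```python
import shlex
from typing import List


def build_tmux_command(session: str, workdir: str, agent_bin: str,
                       instructions: str) -> List[str]:
    """Build an argv list for `tmux new-session` with the prompt shell-quoted."""
    # shlex.quote wraps the text in single quotes and escapes embedded ones, so
    # double quotes, `$` and newlines in the instructions stay literal.
    inner = "printf '%s' {} | {}".format(
        shlex.quote(instructions), shlex.quote(agent_bin)
    )
    # argv form: session name and workdir are separate arguments, not shell text.
    return ["tmux", "new-session", "-d", "-s", session, "-c", workdir, inner]


if __name__ == "__main__":
    argv = build_tmux_command("demo", "/tmp", "claude", 'Your job_id is "abc123"')
    print(shlex.join(argv))
```

The returned list can be passed to `subprocess.run` unchanged. Only the last element is interpreted by a shell, which is why it alone goes through `shlex.quote`.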