---
name: tmux-agent-orchestrate-delegate-job
description: "Delegate a unit of work to any autonomous agent (claude-code, codex, opencode, or a human) and observe it asynchronously over an MQTT event channel. Each job gets a unique id, a registry record (prompt, broker, status, timeouts), and a single per-job topic that carries started/permission_required/progress/completed/error events as schema-versioned JSON. The delegator starts a subscriber first, runs the agent, and treats a completed/error event or a timeout as the job's terminal state. Ships a working reference implementation (publish_event.py, job_subscriber.py, registry.py, mqtt_common.py, tmux-agent-orchestrate-delegate-job wrapper) plus a PoC-to-production path: validate on a public broker, then move to an authenticated TLS broker by changing config only, with no code change. Use when you need fire-and-observe delegation, multi-job fan-out across tmux sessions, or a uniform completion-signal protocol shared by several agent types."
version: 1.0.0
author: Hermes Agent
license: MIT
platforms: [linux, macos, windows]
metadata:
  hermes:
    tags: [agent-delegation, mqtt, jobs, orchestration, async-completion]
    related_skills: [claude-code, codex, opencode, hermes-agent-skill-authoring]
---

# tmux-agent-orchestrate-delegate-job: Async Job Delegation over MQTT

Delegate a unit of work to an autonomous agent, then **observe** it instead of blocking on it. Every job gets a unique id and a registry record; the agent publishes lifecycle events (`started`, `permission_required`, `progress`, `completed`, `error`) to a per-job MQTT topic; the delegator subscribes and treats `completed`/`error`, or a timeout, as the terminal state.

This skill is a **reference implementation**: copy the files in this directory into your project and customise. The `communication_over_mqtt` project is the canonical concrete instance.

## Overview

The model is deliberately small. A **job** is one delegated task.
An **agent** is a worker (a claude-code tmux session, a codex run, a human). The **registry** (`.hermes/jobs/.json`) holds everything about a job so nothing important lives in environment variables, which means one tmux session can process many jobs sequentially, and many sessions can fan out in parallel, with no env collisions. The **event channel** is one MQTT topic per job carrying JSON payloads; `event` discriminates the type.

Each responsibility has exactly one entry point: [`publish_event.py`](./scripts/publish_event.py) emits events (registry lookup, monotonic `seq`, retry+backoff) and [`job_subscriber.py`](./scripts/job_subscriber.py) observes them (timeouts, terminal state machine, defensive parsing). Shared logic lives in [`mqtt_common.py`](./scripts/mqtt_common.py); registry I/O in [`registry.py`](./scripts/registry.py). The demo `publisher.py`/`subscriber.py` in the host project stay frozen.

Two stages, same code. **PoC** runs on the public `broker.hivemq.com` to wire up the protocol. **Production** moves to your own authenticated TLS broker; the switch is **config only** (env vars + the registry `broker.*` block), never a code change. See [`mqtt-broker-setup.md`](./mqtt-broker-setup.md).

## When to Use / When NOT to Use

**Use when:**

- you want **fire-and-observe** delegation: kick off work and get a completion signal rather than blocking a terminal;
- several agent types (claude-code, codex, opencode, human) must follow **one** completion protocol;
- you need **multi-job fan-out** across tmux sessions with safe job claiming;
- you want a clean PoC → authenticated-broker upgrade path.

**Do NOT use when:**

- a one-shot `claude -p '…'` that returns inline is enough (no async signal needed); just use the [claude-code](../claude-code/SKILL.md) skill directly;
- you need request/response RPC or large artifact transfer (this is a one-direction event stream, not a data bus);
- the payload would carry secrets and you're still on the public broker; move to the own-broker stage first.

## Quick Start

The one-line wrapper handles register + subscriber-first + agent launch. If you're new, **start here** and only fall back to the manual 5-step flow when you need finer control.

```bash
# 1) one line: register → start subscriber → launch agent in tmux
#    (uses public broker by default; last stdout line is the audit-log dir)
tmux-agent-orchestrate-delegate-job submit \
  --agent claude-code \
  --prompt "Create 10 sorting problems and save them to sort_problems.md" \
  --workdir /path/to/project \
  --agent-session tmux:demo \
  --timeout 3600 --idle-timeout 120
# → stdout: registered job:
#           subscriber pid: …
#           agent launched in tmux session: demo
#           subscriber output:
#           /path/to/project/.hermes/delegate_job_logs/   ← audit log dir

# 2) at any time, query the job or its audit log
tmux-agent-orchestrate-delegate-job status --job
tmux-agent-orchestrate-delegate-job logs              # pretty timeline
tmux-agent-orchestrate-delegate-job logs --list       # every job, live status

# 3) run a user-supplied validator against the job's artifacts
tmux-agent-orchestrate-delegate-job verify --job --validate ./validate.sh
```

The wrapper enforces the **subscribe-before-publish** ordering and **forwards the freshly-minted `JOB_ID` into the agent's prompt** (so the agent calls `publish_event.py --job ` with the right id; see Pitfall §"Wrong job_id propagated to the agent").
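
If you build your own prompt instead of relying on the wrapper, the id forwarding described above can be sketched as follows. This is a minimal illustration, not the wrapper's actual template: `PROMPT_TEMPLATE` and `build_agent_prompt` are hypothetical names, and the wording of the template is an assumption.

```python
from string import Template

# Hypothetical template: the wrapper's real prompt text is not shown in this
# document, so this wording is an assumption made for illustration only.
PROMPT_TEMPLATE = Template(
    'Your job_id is "$job_id". Publish every event with --job "$job_id".\n'
    "Task: $task"
)


def build_agent_prompt(job_id: str, task: str) -> str:
    """Interpolate the id issued for *this* delegation into the agent prompt."""
    if not job_id or not job_id.strip():
        # An empty id would make the agent publish to the wrong job.
        raise ValueError("job_id must be the id issued for this delegation")
    return PROMPT_TEMPLATE.substitute(job_id=job_id, task=task)
```

Passing the id as a function argument, rather than embedding a literal in the template, is what keeps a stale id from an earlier run out of the prompt.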

When you need finer control, the manual flow is:

```bash
# Manual 5-step (same outcome, more knobs)
PY=.venv/bin/python
SKILL=./.agents/skills/tmux-agent-orchestrate-delegate-job/scripts

# 1) register
JID=$($PY "$SKILL/registry.py" register \
  --prompt "…" --agent claude-code --agent-session tmux:demo \
  --timeout 3600 --idle-timeout 120)

# 2) START THE SUBSCRIBER FIRST (MQTT does not queue non-retained msgs)
$PY "$SKILL/job_subscriber.py" --job "$JID" --timeout 3600 --idle-timeout 120 &

# 3) pass JID to the agent and instruct it to publish events with --job "$JID"
#    (don't hard-code a job id you saw earlier; see Pitfall §"Wrong job_id")

# 4) on completion the subscriber prints events and exits 0/1/2

# 5) inspect any time
$PY "$SKILL/registry.py" get --job "$JID"
$PY "$SKILL/registry.py" logs "$JID"    # positional job id
$PY "$SKILL/registry.py" logs --list
```

## Job Protocol

One topic per job: `python/mqtt/jobs//events`. Payload (JSON, UTF-8, `schema_version=1`):

```json
{
  "schema_version": 1,
  "seq": 7,
  "job_id": "abc12345",
  "event": "started|permission_required|progress|completed|error",
  "timestamp": "2026-06-19T09:32:00Z",
  "detail": "generalised text",
  "data": { "optional": "metadata" }
}
```

- `seq` is monotonic per job (first = 1); the subscriber uses it to spot reorder/duplication.
- `timestamp` is advisory; timeouts are measured from **receive** time.
- `detail`/`data` carry **no** secrets or absolute paths.
- A `schema_version` or `job_id` mismatch is **dropped** (defensive parsing).

`started` and `completed`/`error` are the mandatory bookends; `completed`→exit 0, `error`→exit 1. Full catalogue + production `auth_token` handling: [`job-protocol.md`](./job-protocol.md).
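
The defensive-parsing and `seq` rules above can be sketched roughly like this. It is an illustration under stated assumptions, not the code in `job_subscriber.py`: `parse_event` and `classify_seq` are hypothetical names, and dropping unknown `event` values is an assumption beyond the two mismatches the protocol names.

```python
import json
from typing import Optional

SCHEMA_VERSION = 1
KNOWN_EVENTS = {"started", "permission_required", "progress", "completed", "error"}


def parse_event(raw: bytes, expected_job_id: str) -> Optional[dict]:
    """Return the decoded event, or None if the payload should be dropped."""
    try:
        msg = json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return None  # bad bytes or bad JSON: drop, never crash
    if not isinstance(msg, dict):
        return None
    if msg.get("schema_version") != SCHEMA_VERSION:
        return None  # schema_version mismatch
    if msg.get("job_id") != expected_job_id:
        return None  # job_id mismatch (e.g. a stale id from an earlier run)
    if msg.get("event") not in KNOWN_EVENTS:
        return None  # assumption: unknown event types are dropped as well
    seq = msg.get("seq")
    if not isinstance(seq, int) or isinstance(seq, bool) or seq < 1:
        return None  # seq starts at 1 and must be an integer
    return msg


def classify_seq(last_seq: int, seq: int) -> str:
    """Compare an incoming seq with the last one seen for this job."""
    if seq == last_seq + 1:
        return "in_order"
    if seq <= last_seq:
        return "duplicate_or_reorder"
    return "gap"
```

A caller would keep `last_seq` per job (starting at 0) and decide what to do with a `gap` or `duplicate_or_reorder` result; the protocol text only says the subscriber uses `seq` to spot these cases.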

## Registry Format

```
.hermes/jobs/.json          # metadata record (single source of truth)
.hermes/jobs/.events.log    # append-only JSON-lines log (debug, optional)
.hermes/jobs/.lock          # fcntl advisory lock for the registry
```

The record holds `status`, `prompt`, `agent`, `agent_session`, a `broker` block, `topic_prefix`, `timeout_sec`/`idle_timeout_sec`, `expected_artifacts`, `last_seq`, and (production) `auth_token`. Because the `broker` block lives in the record, `publish_event.py` connects from the registry alone. Concurrency, the atomic rename trick, and multi-session job claiming are in [`registry.md`](./registry.md).

## Audit Logs

Every job's lifecycle is mirrored to a **persistent, append-only audit log** under `.hermes/delegate_job_logs/` (override with `DELEGATE_JOB_LOGS_DIR`; default `/.hermes/delegate_job_logs`). Unlike the registry (live state mutated in place and liable to be cleaned up), the audit log is durable history you can replay after the fact. It is git-ignored.

```
.hermes/delegate_job_logs//
  meta.json       # registration snapshot: prompt, agent, broker, timeouts, …
  events.ndjson   # append-only, one JSON event per line, in time order
  status.json     # current status only (fast point-query)
```

**What is logged, automatically:**

| When | `events.ndjson` line | Written by |
|------|----------------------|------------|
| job registered | `registered` (also seeds meta.json + status.json) | `registry.register_job` |
| any status change | `status_changed` (`from`/`to`; also rewrites status.json) | `update_job_status`, `pick_pending` |
| event published | `published` (carries the exact payload, so it is reproducible) | `publish_event.py` |
| event received | `received` (subscriber's external view) | `job_subscriber.py` |

Both the emitter side (`published`) and the observer side (`received`) are recorded, so a dropped publish or a missed receive is still visible from the other.
Every write is **best-effort and isolated**: an fcntl-locked append guarded by `try/except` that only ever emits a `logger.warning`, so a logging failure can never break a publish, a subscribe, or a registry write. stdout is never touched.

**Reading them:**

```bash
tmux-agent-orchestrate-delegate-job logs          # pretty-print one job's timeline
tmux-agent-orchestrate-delegate-job logs --list   # summarise every logged job (with live status)
# or directly via the registry CLI:
$PY scripts/registry.py logs [--tail N] [--json]
$PY scripts/registry.py logs --list [--json]
```

`submit` prints the job's audit-log directory as its last stdout line, so a caller can `tail -n1` to locate it.

## Broker Setup

| Stage | Broker | Auth | Transport |
|-------|--------|------|-----------|
| PoC | `broker.hivemq.com` | none | 1883 plaintext |
| Production | self-hosted Mosquitto/EMQX | user/pass + ACL | 8883 TLS |

All connection settings come from env (`MQTT_BROKER`, `MQTT_PORT`, `MQTT_TLS`, `MQTT_USERNAME`/`MQTT_PASSWORD`, `MQTT_CA_CERTS`, …) resolved by `broker_config_from_env()`, with the registry `broker.*` block overriding per job. Moving to your own broker is **config only**: install Mosquitto, set `persistence true` + `acl_file` + `password_file` + a TLS `listener 8883`, grant the worker `write python/mqtt/jobs/+/events` and Hermes `read`, then flip `MQTT_TLS=1` and fill the registry `broker.*`. Step-by-step (conf, ACL, `mosquitto_passwd`, self-signed/private-CA certs, cut-over verification): [`mqtt-broker-setup.md`](./mqtt-broker-setup.md).

## Agent Adapters

Each agent voluntarily follows the contract: receive a `JOB_ID` (or registry path), call `publish_event.py` at lifecycle points, exit 0/1/2.

**The contract in one line**: every event call uses `--job "$JOB_ID"` where `$JOB_ID` is the **freshly-issued id from the registry record for *this* delegation**, never a job_id you saw in an earlier session (Pitfall §"Wrong job_id propagated to the agent").

- **claude-code**: Claude Code calls `publish_event.py` via its Bash tool at lifecycle points. `submit --mode tmux` injects a prompt that already names `$JOB_ID`; if you drive claude manually, hand it the id explicitly. Reference instruction block (the wrapper injects something equivalent):

  ```text
  Your job_id is "$JOB_ID" (read it from the registry record for this delegation; do not reuse any job_id you saw before).
  On start:      $PY tmux-agent-orchestrate-delegate-job/scripts/publish_event.py --job "$JOB_ID" --event started
  On permission: $PY … --job "$JOB_ID" --event permission_required --detail ":"
  On progress:   $PY … --job "$JOB_ID" --event progress --detail ""
  On success:    $PY … --job "$JOB_ID" --event completed --detail ""
  On failure:    $PY … --job "$JOB_ID" --event error --detail ""
  Task:
  The subscriber for "$JOB_ID" is already running; your completed/error event ends the job.
  Exit codes: 0 completed, 1 error, 2 publish failure.
  ```

  See [claude-code](../claude-code/SKILL.md) for tmux orchestration patterns.
- **codex**: same contract. Invoke `codex exec ""` or wire `publish_event.py` as an MCP tool so the agent can call it directly.
- **opencode**: wire `publish_event.py` as a tool/command the agent can call; identical event points.
- **human**: a person does the work, reads the registry record, then runs `publish_event.py --job --event completed` (or `error`) by hand.
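
For an adapter that wraps its work in Python rather than in a prompt, the same contract can be sketched as a context manager that emits the mandatory bookends. This is an assumption-laden illustration: `publish_cmd`, `run_cmd`, and `job_lifecycle` are hypothetical helpers, the script path is supplied by the caller, and only the `--job`, `--event`, and `--detail` flags shown above are used.

```python
import subprocess
import sys
from contextlib import contextmanager
from typing import Callable, Iterator, List, Optional


def publish_cmd(script: str, job_id: str, event: str,
                detail: Optional[str] = None) -> List[str]:
    """Build the argv for one publish_event.py call (flags as shown above)."""
    cmd = [sys.executable, script, "--job", job_id, "--event", event]
    if detail:
        cmd += ["--detail", detail]
    return cmd


def run_cmd(cmd: List[str]) -> int:
    """Default runner: execute the command and return its exit code."""
    return subprocess.run(cmd, check=False).returncode


@contextmanager
def job_lifecycle(script: str, job_id: str,
                  run: Callable[[List[str]], int] = run_cmd) -> Iterator[None]:
    """Emit `started` on entry, then `completed` or `error` on exit."""
    run(publish_cmd(script, job_id, "started"))
    try:
        yield
    except Exception as exc:
        # Keep detail generalised: the exception type only, no message or path.
        run(publish_cmd(script, job_id, "error", type(exc).__name__))
        raise
    else:
        run(publish_cmd(script, job_id, "completed"))
```

Usage would look like `with job_lifecycle(path_to_publish_event, job_id): do_work()`. The `run` parameter exists so the sequence can be exercised without a broker.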

## User Interface

The [`tmux-agent-orchestrate-delegate-job`](./tmux-agent-orchestrate-delegate-job) bash wrapper bundles register + subscribe-first + run-agent + validate:

```bash
tmux-agent-orchestrate-delegate-job submit --agent claude-code \
  --prompt "Create 10 sorting problems and save them to sort_problems.md" \
  --workdir /path/to/project --timeout 3600 [--validate ./validate.sh]
tmux-agent-orchestrate-delegate-job status --job                           # one record, pretty-printed
tmux-agent-orchestrate-delegate-job list                                   # all jobs, one line each
tmux-agent-orchestrate-delegate-job verify --job --validate ./validate.sh  # runs it, reports exit code
tmux-agent-orchestrate-delegate-job wait [--job ]                          # block until terminal (else --wait-any)
```

`submit` **always starts the subscriber before the agent** (the ordering dependency), runs the agent in `--mode print` (one-shot) or `--mode tmux`, and calls `--validate` afterward if given. The skill automates job-id generation, registry creation, broker resolution, subscriber-first ordering, agent launch, and completion detection; it does **not** automate the agent's internals or your business-logic validation. Those are hooks you fill (`validate.sh` reads `$JOB_ID`/`$REGISTRY_DIR`).

## Common Pitfalls

- **Publishing before subscribing**: MQTT does not queue non-retained messages for absent subscribers. Start `job_subscriber.py` *before* the agent, or rely on retained terminal events (production). `submit` enforces this.
- **Wrong job_id propagated to the agent**: the wrapper prints a fresh `JOB_ID` on every `submit`. If your agent instruction (or the wrapper's prompt template) hard-codes an old job_id, the agent calls `publish_event.py --job `, the subscriber's defensive parser drops it as a `job_id` mismatch, and the delegator waits until idle timeout (exit 2). Fix: instruct the agent to **read the job_id from the registry record for *this* delegation** (or pass it in via env / `--prompt` interpolation), never from prior runs.
  `submit`'s default prompt template interpolates `$JOB_ID` for you; if you build a custom prompt, do the same.
- **tmux session name collision**: `submit --mode tmux` derives the session name from `--agent-session tmux:` (default `tmux:claude`). If a session with that name already exists (e.g. you ran the demo and the previous session is still open), `tmux new-session -d -s ` fails and the agent never launches. Pick a unique `--agent-session` per concurrent delegation (e.g. `tmux:demo`, `tmux:claude-a`, `tmux:claude-b`) or kill the stale one (`tmux kill-session -t claude`) before re-running.
- **Timeout before `started`**: a cold-starting agent may not emit `started` for a while; the wall-clock timeout starts at subscribe time so a stuck agent still terminates. Don't set `--timeout` so low that a slow start is misread as a failure.
- **No retry on publish**: a dropped `completed` would hang the delegator forever; `publish_event.py` retries with exponential backoff and exits 2 if it still fails, so the delegator is never left waiting silently.
- **QoS-1 duplicates / reorders**: a terminal event can arrive twice, or `error` can trail `completed`; the subscriber's terminal state machine finalises each job once and ignores the rest.
- **Trusting the public broker**: anyone can publish there; never make a real decision on a PoC signal. Add `auth_token` + an authenticated broker first.
- **Secrets in `detail`/`data`**: keep payloads generalised; no paths, keys, or tokens (except the production `auth_token` in `data`).

## Subagent Orchestration Pattern

When using this skill from a Hermes `delegate_task` subagent to dispatch work to a coding-agent CLI (agy/claude) running in a tmux session, the following pattern has been verified (2026-06-21, 6-batch refactoring sprint):

### Roles

- **Main worker** (implementation): one agent session (e.g. `agy-new`) receives brief files and executes code changes.
- **Reviewers** (spec compliance + code quality): two other agent sessions (e.g. `agy-existing`, `claude-existing`) review the diff in parallel.
- **Hermes** (orchestrator): dispatches subagents, verifies diffs, commits, and falls back to direct fixes when reviewers find issues.

### Key lessons learned

1. **Brief delivery via file path**: don't paste long briefs inline via `tmux send-keys`; the TUI may swallow them. Instead, send a short instruction like "follow /tmp/batch1-brief.md" and let the agent read the file.
2. **Polling vs MQTT subscriber**: for short tasks (<5min), pane polling (`capture-pane` + grep for completion markers) is simpler and more reliable than registering a job via `registry.py` + `job_subscriber.py`. Use MQTT subscriber only for long-running jobs (>5min) where push notification matters.
3. **Reviewers catch different bugs**: in practice, agy (Flash) caught semantic issues (slash matching, export scope), while claude (Opus) caught API signature mismatches (paho v2 5-arg vs 4-arg `on_disconnect`). Two reviewers with different models provide complementary coverage.
4. **Hermes fallback fix**: when reviewers find a small, well-defined issue (wrong argument count, missing slash), Hermes should fix it directly rather than re-dispatching the implementer. This saves a full round-trip.
5. **Batch grouping**: group 2-3 FW items per batch when they touch different files (no file overlap). This amortises the dispatch overhead. Items touching the same file must be in separate batches to avoid conflicts.
6. **Pane Snapshots & Truncation Prevention**: to prevent long agent responses from being scrolled out and truncated due to TUI viewport limitations, enforce the following snapshotting pattern:
   - Immediately after dispatching a brief, capture the pre-brief pane buffer via `capture-pane -S -200`.
   - During long execution, run a background loop taking incremental snapshots (e.g. every 30 seconds `>> /tmp/pane-snap.txt`).
   - Immediately after job termination, capture the entire final pane state to ensure no terminal logs are lost.

## Verification Checklist

- [ ] `started` → `completed` over the public broker: subscriber prints the lines and exits **0**.
- [ ] `error` path: subscriber exits **1**.
- [ ] timeout path: no terminal event within `--timeout`/`--idle-timeout` → exit **2**.
- [ ] polluted payload (bad JSON, wrong `schema_version`, wrong `job_id`) is dropped with a warning, not crashed on.
- [ ] one tmux session processes two registry jobs in sequence; a second session with a different `agent_session` claims only its own.
- [ ] broker cut-over: same scripts reach an authenticated TLS broker with env changes only; a credential without write ACL is rejected; a late subscriber still receives the retained terminal event.
- [ ] `publisher.py`/`subscriber.py`/`README.md` demo on `python/mqtt/sample` still works unchanged (regression).
- [ ] **audit log integrity**: for a completed job, `.hermes/delegate_job_logs//events.ndjson` contains `registered` → `received started` → `published completed` (in that order), and `status.json.status == "completed"` matches the registry record. A logging failure (e.g. read-only log dir) does not break the publish or subscribe path; only a `logger.warning` is emitted.
- [ ] **end-to-end demo smoke**: run `tmux-agent-orchestrate-delegate-job submit --agent claude-code --agent-session tmux:demo-smoke --prompt "echo hello and call publish_event.py --job --event completed" --timeout 120` and confirm (a) registered job id echoed, (b) subscriber pid echoed, (c) tmux session name printed, (d) `events.ndjson` grows as the agent runs, (e) final stdout line is the audit-log dir.
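
The audit-log integrity item can be checked mechanically. The sketch below assumes a record shape this document does not specify: each `events.ndjson` line is taken to be a JSON object with a `kind` field (`registered`, `published`, `received`, …) and, for published/received lines, a `payload` object carrying `event`. Those field names, and the helpers `timeline` and `contains_in_order`, are hypothetical; adjust them to the real log format.

```python
import json
from typing import Iterable, List, Tuple

Step = Tuple[str, str]  # (log line kind, lifecycle event or "")


def timeline(lines: Iterable[str]) -> List[Step]:
    """Reduce NDJSON lines to (kind, event) pairs, skipping unreadable lines."""
    steps: List[Step] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue  # a corrupt line should not abort the check
        if not isinstance(rec, dict):
            continue
        payload = rec.get("payload")
        event = payload.get("event", "") if isinstance(payload, dict) else ""
        steps.append((rec.get("kind", ""), event))
    return steps


def contains_in_order(seen: List[Step], expected: List[Step]) -> bool:
    """True if `expected` occurs as an ordered subsequence of `seen`."""
    remaining = iter(seen)
    # `step in remaining` advances the iterator, which enforces the ordering.
    return all(step in remaining for step in expected)


# The order named in the checklist item above.
EXPECTED: List[Step] = [
    ("registered", ""),
    ("received", "started"),
    ("published", "completed"),
]
```

With a real job you would pass the open `events.ndjson` file object to `timeline` and then call `contains_in_order(steps, EXPECTED)`; other lines such as `status_changed` may appear in between without failing the check.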