---
name: tmux-agent-orchestrate-delegate-job
description: "Delegate a unit of work to any autonomous agent (claude-code, codex, opencode, or a human) and observe it asynchronously over an MQTT event channel. Each job gets a unique id, a registry record (prompt, broker, status, timeouts), and a single per-job topic that carries started/permission_required/progress/completed/error events as schema-versioned JSON. The delegator starts a subscriber first, runs the agent, and treats a completed/error event or a timeout as the job's terminal state. Ships a working reference implementation (publish_event.py, job_subscriber.py, registry.py, mqtt_common.py, tmux-agent-orchestrate-delegate-job wrapper) plus a PoC-to-production path: validate on a public broker, then move to an authenticated TLS broker by changing config only — no code change. Use when you need fire-and-observe delegation, multi-job fan-out across tmux sessions, or a uniform completion-signal protocol shared by several agent types."
version: 1.0.0
author: Hermes Agent
license: MIT
platforms: [linux, macos, windows]
metadata:
  hermes:
    tags: [agent-delegation, mqtt, jobs, orchestration, async-completion]
    related_skills: [claude-code, codex, opencode, hermes-agent-skill-authoring]
---

# tmux-agent-orchestrate-delegate-job — Async Job Delegation over MQTT

Delegate a unit of work to an autonomous agent, then **observe** it instead of
blocking on it. Every job gets a unique id and a registry record; the agent
publishes lifecycle events (`started`, `permission_required`, `progress`,
`completed`, `error`) to a per-job MQTT topic; the delegator subscribes and
treats `completed`/`error` — or a timeout — as the terminal state.

This skill is a **reference implementation**: copy the files in this directory
into your project and customise. The `communication_over_mqtt` project is the
canonical concrete instance.

## Overview

The model is deliberately small. A **job** is one delegated task. An **agent**
is a worker (a claude-code tmux session, a codex run, a human). The **registry**
(`.hermes/jobs/<id>.json`) holds everything about a job so nothing important
lives in environment variables — which means one tmux session can process many
jobs sequentially, and many sessions can fan out in parallel, with no env
collisions. The **event channel** is one MQTT topic per job carrying JSON
payloads; the `event` field discriminates the type.

Each responsibility has exactly one entry point:
[`publish_event.py`](./scripts/publish_event.py) emits events (registry lookup,
monotonic `seq`, retry+backoff) and [`job_subscriber.py`](./scripts/job_subscriber.py)
observes them (timeouts, terminal state machine, defensive parsing). Shared
logic lives in [`mqtt_common.py`](./scripts/mqtt_common.py); registry I/O in
[`registry.py`](./scripts/registry.py). The demo `publisher.py`/`subscriber.py`
in the host project stay frozen.

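
The retry+backoff behaviour on the emitter side can be sketched roughly as
follows. This is an illustration, not the code in `publish_event.py`: the
function name `publish_with_retry`, the injected `publish_once` callable, and
the default attempt count and delays are all assumptions. Only the exit code 2
on a persistent failure comes from this document.

```python
import time
from typing import Callable


def publish_with_retry(
    publish_once: Callable[[], bool],
    attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call publish_once up to `attempts` times and return an exit code.

    Returns 0 on the first success and 2 when every attempt fails.
    (Hypothetical helper; publish_once stands in for one MQTT publish.)
    """
    for attempt in range(attempts):
        try:
            if publish_once():
                return 0
        except OSError:
            pass  # treat a connection error like a failed attempt
        if attempt < attempts - 1:
            # Exponential backoff: base_delay, 2x, 4x, ...
            sleep(base_delay * (2 ** attempt))
    return 2
```

Injecting `sleep` keeps the sketch testable without real waiting; a real
publisher would pass a callable that connects, publishes at QoS 1, and reports
whether the broker acknowledged the message.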
Two stages, same code. **PoC** runs on the public `broker.hivemq.com` to wire up
the protocol. **Production** moves to your own authenticated TLS broker — the
switch is **config only** (env vars + the registry `broker.*` block), never a
code change. See [`mqtt-broker-setup.md`](./mqtt-broker-setup.md).

## When to Use / When NOT to Use

**Use when:**
- you want **fire-and-observe** delegation — kick off work and get a completion
  signal rather than blocking a terminal;
- several agent types (claude-code, codex, opencode, human) must follow **one**
  completion protocol;
- you need **multi-job fan-out** across tmux sessions with safe job claiming;
- you want a clean PoC → authenticated-broker upgrade path.

**Do NOT use when:**
- a one-shot `claude -p '…'` that returns inline is enough (no async signal
  needed) — just use the [claude-code](../claude-code/SKILL.md) skill directly;
- you need request/response RPC or large artifact transfer (this is a
  one-direction event stream, not a data bus);
- the payload would carry secrets and you're still on the public broker — move
  to the own-broker stage first.

## Quick Start

The one-line wrapper handles register + subscriber-first + agent launch. If
you're new, **start here** and only fall back to the manual 5-step flow when
you need finer control.

```bash
# 1) one line: register → start subscriber → launch agent in tmux
#    (uses public broker by default; last stdout line is the audit-log dir)
tmux-agent-orchestrate-delegate-job submit \
  --agent claude-code \
  --prompt "Create 10 sorting problems and save them to sort_problems.md" \
  --workdir /path/to/project \
  --agent-session tmux:demo \
  --timeout 600 --idle-timeout 120
# → stdout: registered job: <JID>
#           subscriber pid: …
#           agent launched in tmux session: demo
#           subscriber output: <one line per event>
#           /path/to/project/.hermes/delegate_job_logs/<JID>   ← audit log dir

# 2) at any time, query the job or its audit log
tmux-agent-orchestrate-delegate-job status --job <JID>
tmux-agent-orchestrate-delegate-job logs <JID>        # pretty timeline
tmux-agent-orchestrate-delegate-job logs --list       # every job, live status

# 3) run a user-supplied validator against the job's artifacts
tmux-agent-orchestrate-delegate-job verify --job <JID> --validate ./validate.sh
```

The wrapper enforces the **subscribe-before-publish** ordering and **forwards
the freshly-minted `JOB_ID` into the agent's prompt** (so the agent calls
`publish_event.py --job <JID>` with the right id — see Pitfall §"Wrong job_id
propagated to the agent"). When you need finer control, the manual flow is:

```bash
# Manual 5-step (same outcome, more knobs)
PY=.venv/bin/python
SKILL=./skills/tmux-agent-orchestrate-delegate-job/scripts

# 1) register
JID=$($PY "$SKILL/registry.py" register \
  --prompt "…" --agent claude-code --agent-session tmux:demo \
  --timeout 600 --idle-timeout 120)

# 2) START THE SUBSCRIBER FIRST (MQTT does not queue non-retained msgs)
$PY "$SKILL/job_subscriber.py" --job "$JID" --timeout 600 --idle-timeout 120 &

# 3) pass JID to the agent and instruct it to publish events with --job "$JID"
#    (don't hard-code a job id you saw earlier — see Pitfall §"Wrong job_id")

# 4) on completion the subscriber prints events and exits 0/1/2

# 5) inspect any time
$PY "$SKILL/registry.py" get --job "$JID"
$PY "$SKILL/registry.py" logs "$JID"     # positional job id
$PY "$SKILL/registry.py" logs --list
```

## Job Protocol

One topic per job: `python/mqtt/jobs/<job_id>/events`. Payload (JSON, UTF-8,
`schema_version=1`):

```json
{ "schema_version": 1, "seq": 7, "job_id": "abc12345",
  "event": "started|permission_required|progress|completed|error",
  "timestamp": "2026-06-19T09:32:00Z", "detail": "generalised text",
  "data": { "optional": "metadata" } }
```

- `seq` is monotonic per job (first = 1); the subscriber uses it to spot
  reorder/duplication.
- `timestamp` is advisory — timeouts are measured from **receive** time.
- `detail`/`data` carry **no** secrets or absolute paths.
- A `schema_version` or `job_id` mismatch is **dropped** (defensive parsing).

`started` and `completed`/`error` are the mandatory bookends; `completed` → exit 0,
`error` → exit 1. Full catalogue + production `auth_token` handling:
[`job-protocol.md`](./job-protocol.md).

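
As an illustration of the defensive-parsing rules above, a subscriber-side
check might look roughly like this. The names `parse_event` and `topic_for` are
hypothetical, not the API of `job_subscriber.py`, and rejecting unknown `event`
values is an extra assumption beyond the two mismatches the protocol names.

```python
import json
import logging
from typing import Optional

logger = logging.getLogger("job_subscriber_sketch")

SCHEMA_VERSION = 1
KNOWN_EVENTS = {"started", "permission_required", "progress", "completed", "error"}


def topic_for(job_id: str, prefix: str = "python/mqtt/jobs") -> str:
    """Build the single per-job topic name."""
    return f"{prefix}/{job_id}/events"


def parse_event(raw: bytes, expected_job_id: str) -> Optional[dict]:
    """Return the decoded payload, or None when it must be dropped."""
    try:
        payload = json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        logger.warning("dropping payload: not valid UTF-8 JSON")
        return None
    if not isinstance(payload, dict):
        logger.warning("dropping payload: not a JSON object")
        return None
    if payload.get("schema_version") != SCHEMA_VERSION:
        logger.warning("dropping payload: schema_version mismatch")
        return None
    if payload.get("job_id") != expected_job_id:
        logger.warning("dropping payload: job_id mismatch")
        return None
    event = payload.get("event")
    if not isinstance(event, str) or event not in KNOWN_EVENTS:
        # Assumption of this sketch: unknown event names are dropped too.
        logger.warning("dropping payload: unknown event")
        return None
    return payload
```

Every rejection path logs a warning and returns `None` instead of raising, so a
polluted message on the topic cannot crash the observer.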
## Registry Format

```
.hermes/jobs/<id>.json          # metadata record (single source of truth)
.hermes/jobs/<id>.events.log    # append-only JSON-lines log (debug, optional)
.hermes/jobs/.lock              # fcntl advisory lock for the registry
```

The record holds `status`, `prompt`, `agent`, `agent_session`, a `broker` block,
`topic_prefix`, `timeout_sec`/`idle_timeout_sec`, `expected_artifacts`,
`last_seq`, and (production) `auth_token`. Because the `broker` block lives in
the record, `publish_event.py` connects from the registry alone. Concurrency,
the atomic rename trick, and multi-session job claiming are in
[`registry.md`](./registry.md).

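
For readers who have not met the atomic rename trick, here is a minimal sketch
of the general technique: take the advisory lock, write to a temporary file in
the same directory, then rename it over the record. The function `write_record`
is hypothetical and is not taken from `registry.py`; `fcntl` is POSIX-only, so
this sketch assumes Linux or macOS.

```python
import fcntl
import json
import os
import tempfile
from pathlib import Path


def write_record(registry_dir: Path, job_id: str, record: dict) -> Path:
    """Replace <job_id>.json atomically while holding the registry lock."""
    registry_dir.mkdir(parents=True, exist_ok=True)
    target = registry_dir / f"{job_id}.json"
    with open(registry_dir / ".lock", "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # released when the file is closed
        fd, tmp_name = tempfile.mkstemp(dir=registry_dir, suffix=".tmp")
        try:
            with os.fdopen(fd, "w", encoding="utf-8") as tmp:
                json.dump(record, tmp)
                tmp.flush()
                os.fsync(tmp.fileno())
            # Rename within one filesystem is atomic: readers see either
            # the old record or the new one, never a half-written file.
            os.replace(tmp_name, target)
        except BaseException:
            if os.path.exists(tmp_name):
                os.unlink(tmp_name)
            raise
    return target
```

Creating the temporary file inside the registry directory matters: a rename
across filesystems is not atomic.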
## Audit Logs

Every job's lifecycle is mirrored to a **persistent, append-only audit log**
under `.hermes/delegate_job_logs/` (override with `DELEGATE_JOB_LOGS_DIR`;
default `<cwd>/.hermes/delegate_job_logs`). Unlike the registry — live state
mutated in place and liable to be cleaned up — the audit log is durable
history you can replay after the fact. It is git-ignored.

```
.hermes/delegate_job_logs/<job_id>/
  meta.json        # registration snapshot: prompt, agent, broker, timeouts, …
  events.ndjson    # append-only, one JSON event per line, in time order
  status.json      # current status only (fast point-query)
```

**What is logged, automatically:**

| When | `events.ndjson` line | Written by |
|------|----------------------|------------|
| job registered | `registered` (also seeds meta.json + status.json) | `registry.register_job` |
| any status change | `status_changed` (`from`/`to`; also rewrites status.json) | `update_job_status`, `pick_pending` |
| event published | `published` (carries the exact payload — reproducible) | `publish_event.py` |
| event received | `received` (subscriber's external view) | `job_subscriber.py` |

Both the emitter side (`published`) and the observer side (`received`) are
recorded, so a dropped publish or a missed receive is still visible from the
other. Every write is **best-effort and isolated** — an fcntl-locked append
guarded by `try/except` that only ever emits a `logger.warning`, so a logging
failure can never break a publish, a subscribe, or a registry write. stdout is
never touched.

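
The best-effort write described above might be sketched as follows. The
function name `append_audit_line` and the `kind` field are assumptions made for
illustration; the actual line layout of `events.ndjson` is defined by the
skill's scripts, not by this sketch. `fcntl` again limits it to POSIX systems.

```python
import fcntl
import json
import logging
from pathlib import Path

logger = logging.getLogger("audit_log_sketch")


def append_audit_line(log_dir: Path, job_id: str, kind: str, body: dict) -> bool:
    """Append one JSON line to <log_dir>/<job_id>/events.ndjson.

    Never raises: any failure is reported with logger.warning and the
    function returns False so the caller's publish/subscribe path goes on.
    """
    try:
        job_dir = log_dir / job_id
        job_dir.mkdir(parents=True, exist_ok=True)
        line = json.dumps({"kind": kind, **body}, ensure_ascii=False)
        with open(job_dir / "events.ndjson", "a", encoding="utf-8") as fh:
            fcntl.flock(fh, fcntl.LOCK_EX)  # released when the file is closed
            fh.write(line + "\n")
        return True
    except (OSError, TypeError, ValueError) as exc:
        logger.warning("audit log write failed for %s: %s", job_id, exc)
        return False
```

Returning a boolean rather than raising is what keeps the logging path
isolated: callers may ignore the result entirely.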
**Reading them:**

```bash
tmux-agent-orchestrate-delegate-job logs <job_id>     # pretty-print one job's timeline
tmux-agent-orchestrate-delegate-job logs --list       # summarise every logged job (with live status)
# or directly via the registry CLI:
$PY scripts/registry.py logs <job_id> [--tail N] [--json]
$PY scripts/registry.py logs --list [--json]
```

`submit` prints the job's audit-log directory as its last stdout line, so a
caller can `tail -n1` to locate it.

## Broker Setup

| Stage | Broker | Auth | Transport |
|-------|--------|------|-----------|
| PoC | `broker.hivemq.com` | none | 1883 plaintext |
| Production | self-hosted Mosquitto/EMQX | user/pass + ACL | 8883 TLS |

All connection settings come from env (`MQTT_BROKER`, `MQTT_PORT`, `MQTT_TLS`,
`MQTT_USERNAME`/`MQTT_PASSWORD`, `MQTT_CA_CERTS`, …) resolved by
`broker_config_from_env()`, with the registry `broker.*` block overriding per
job. Moving to your own broker is **config only**: install Mosquitto, set
`persistence true` + `acl_file` + `password_file` + a TLS `listener 8883`, grant
the worker `write python/mqtt/jobs/+/events` and Hermes `read`, then flip
`MQTT_TLS=1` and fill the registry `broker.*`. Step-by-step (conf, ACL,
`mosquitto_passwd`, self-signed/private-CA certs, cut-over verification):
[`mqtt-broker-setup.md`](./mqtt-broker-setup.md).

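
The env-then-registry precedence can be pictured with a short sketch. This is
not the body of `broker_config_from_env()`: the function name `resolve_broker`
and the dictionary keys (`host`, `port`, `tls`, …) are assumptions, while the
environment variable names and the PoC defaults come from the table and text
above.

```python
import os
from typing import Mapping, Optional


def resolve_broker(
    env: Mapping[str, str] = os.environ,
    record_broker: Optional[dict] = None,
) -> dict:
    """Resolve connection settings: env first, registry block overrides."""
    config = {
        "host": env.get("MQTT_BROKER", "broker.hivemq.com"),  # PoC default
        "port": int(env.get("MQTT_PORT", "1883")),
        "tls": env.get("MQTT_TLS", "0") == "1",
        "username": env.get("MQTT_USERNAME"),
        "password": env.get("MQTT_PASSWORD"),
        "ca_certs": env.get("MQTT_CA_CERTS"),
    }
    # Per-job values from the registry record win over the environment.
    for key, value in (record_broker or {}).items():
        if value is not None:
            config[key] = value
    return config
```

Because the same function serves both stages, cutting over to an authenticated
broker changes only the inputs, which is the config-only property the section
describes.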
## Agent Adapters

Each agent voluntarily follows the contract: receive a `JOB_ID` (or registry
path), call `publish_event.py` at lifecycle points, exit 0/1/2. **The contract
in one line**: every event call uses `--job "$JOB_ID"` where `$JOB_ID` is the
**freshly-issued id from the registry record for *this* delegation** — never a
job_id you saw in an earlier session (Pitfall §"Wrong job_id propagated to the
agent").

- **claude-code** — Claude Code calls `publish_event.py` via its Bash tool at
  lifecycle points. `submit --mode tmux` injects a prompt that already names
  `$JOB_ID`; if you drive claude manually, hand it the id explicitly. Reference
  instruction block (the wrapper injects something equivalent):

  ```text
  Your job_id is "$JOB_ID" (read it from the registry record for this delegation —
  do not reuse any job_id you saw before).

  On start:      $PY tmux-agent-orchestrate-delegate-job/scripts/publish_event.py --job "$JOB_ID" --event started
  On permission: $PY … --job "$JOB_ID" --event permission_required --detail "<tool>:<what>"
  On progress:   $PY … --job "$JOB_ID" --event progress --detail "<short status>"
  On success:    $PY … --job "$JOB_ID" --event completed --detail "<one-line summary>"
  On failure:    $PY … --job "$JOB_ID" --event error --detail "<one-line reason>"

  Task: <the user's prompt>

  The subscriber for "$JOB_ID" is already running; your completed/error event
  ends the job. Exit codes: 0 completed, 1 error, 2 publish failure.
  ```

  See [claude-code](../claude-code/SKILL.md) for tmux orchestration patterns.
- **codex** — same contract. Invoke `codex exec "<instruction-block-above>"` or
  wire `publish_event.py` as an MCP tool so the agent can call it directly.
- **opencode** — wire `publish_event.py` as a tool/command the agent can call;
  identical event points.
- **human** — a person does the work, reads the registry record, then runs
  `publish_event.py --job <id> --event completed` (or `error`) by hand.

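
If you build a custom prompt instead of relying on the wrapper, the job id has
to be interpolated at submit time rather than pasted in. A minimal sketch of
that interpolation, assuming a hypothetical `build_instruction` helper and a
shortened template (the wrapper's real template is not reproduced here):

```python
import shlex
from string import Template

# Shortened, illustrative template; the wording is an assumption.
_TEMPLATE = Template(
    'Your job_id is "$job_id". Do not reuse any job_id you saw before.\n'
    "On start:   $cmd --event started\n"
    'On success: $cmd --event completed --detail "<one-line summary>"\n'
    'On failure: $cmd --event error --detail "<one-line reason>"\n'
    "\n"
    "Task: $task\n"
)


def build_instruction(
    job_id: str,
    task: str,
    publish_cmd: str = "python publish_event.py",  # placeholder command
) -> str:
    """Return an instruction block with this delegation's job id baked in."""
    cmd = f"{publish_cmd} --job {shlex.quote(job_id)}"
    return _TEMPLATE.substitute(job_id=job_id, cmd=cmd, task=task)
```

Generating the block from the freshly registered id on every submit leaves no
place for a stale id to hide in a saved prompt.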
## User Interface

The [`tmux-agent-orchestrate-delegate-job`](./tmux-agent-orchestrate-delegate-job) bash wrapper bundles register +
subscribe-first + run-agent + validate:

```bash
tmux-agent-orchestrate-delegate-job submit --agent claude-code \
  --prompt "Create 10 sorting problems and save them to sort_problems.md" \
  --workdir /path/to/project --timeout 600 [--validate ./validate.sh]
tmux-agent-orchestrate-delegate-job status --job <id>                            # one record, pretty-printed
tmux-agent-orchestrate-delegate-job list                                         # all jobs, one line each
tmux-agent-orchestrate-delegate-job verify --job <id> --validate ./validate.sh   # runs it, reports exit code
tmux-agent-orchestrate-delegate-job wait [--job <id>]                            # block until terminal (else --wait-any)
```

`submit` **always starts the subscriber before the agent** (the ordering
dependency), runs the agent in `--mode print` (one-shot) or `--mode tmux`, and
calls `--validate` afterward if given. The skill automates job-id generation,
registry creation, broker resolution, subscriber-first ordering, agent launch,
and completion detection; it does **not** automate the agent's internals or your
business-logic validation — those are hooks you fill (`validate.sh` reads
`$JOB_ID`/`$REGISTRY_DIR`).

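
A validator hook can be any executable that reads `$JOB_ID` and
`$REGISTRY_DIR`. As one possible shape, the sketch below checks that every
entry in the record's `expected_artifacts` exists. It assumes that field is a
list of paths relative to the working directory, which this document does not
specify; `validate` and `main` are hypothetical names.

```python
import json
import os
import sys
from pathlib import Path


def validate(job_id: str, registry_dir: Path, workdir: Path) -> int:
    """Return 0 when all expected artifacts exist under workdir, else 1."""
    record_path = registry_dir / f"{job_id}.json"
    record = json.loads(record_path.read_text(encoding="utf-8"))
    # Assumption: expected_artifacts is a list of workdir-relative paths.
    missing = [
        name
        for name in record.get("expected_artifacts", [])
        if not (workdir / name).is_file()
    ]
    for name in missing:
        print(f"missing artifact: {name}", file=sys.stderr)
    return 1 if missing else 0


def main() -> int:
    """Entry point for use as a hook: reads the two documented env vars."""
    return validate(
        os.environ["JOB_ID"],
        Path(os.environ["REGISTRY_DIR"]),
        Path.cwd(),
    )
```

To use it as a hook, a small `validate.sh` could call a script that ends with
`sys.exit(main())`; the exit code is then what `verify` reports.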
## Common Pitfalls

- **Publishing before subscribing** — MQTT does not queue non-retained messages
  for absent subscribers. Start `job_subscriber.py` *before* the agent, or rely
  on retained terminal events (production). `submit` enforces this.
- **Wrong job_id propagated to the agent** — the wrapper prints a fresh `JOB_ID`
  on every `submit`. If your agent instruction (or the wrapper's prompt template)
  hard-codes an old job_id, the agent calls `publish_event.py --job <wrong>`,
  the subscriber's defensive parser drops it as a `job_id` mismatch, and the
  delegator waits until idle timeout (exit 2). Fix: instruct the agent to
  **read the job_id from the registry record for *this* delegation** (or pass it
  in via env / `--prompt` interpolation), never from prior runs. `submit`'s
  default prompt template interpolates `$JOB_ID` for you — if you build a custom
  prompt, do the same.
- **tmux session name collision** — `submit --mode tmux` derives the session
  name from `--agent-session tmux:<name>` (default `tmux:claude`). If a session
  with that name already exists (e.g. you ran the demo and the previous
  session is still open), `tmux new-session -d -s <name>` fails and the agent
  never launches. Pick a unique `--agent-session` per concurrent delegation
  (e.g. `tmux:demo`, `tmux:claude-a`, `tmux:claude-b`) or kill the stale one
  (`tmux kill-session -t claude`) before re-running.
- **Timeout before `started`** — a cold-starting agent may not emit `started`
  for a while; the wall-clock timeout starts at subscribe time so a stuck agent
  still terminates. Don't set `--timeout` so low you false-positive a slow start.
- **No retry on publish** — a dropped `completed` would leave the delegator
  waiting until its timeout with no signal; `publish_event.py` retries with
  exponential backoff and exits 2 if it still fails, so the delegator is never
  left waiting silently.
- **QoS-1 duplicates / reorders** — a terminal event can arrive twice, or
  `error` can trail `completed`; the subscriber's terminal state machine
  finalises each job once and ignores the rest.
- **Trusting the public broker** — anyone can publish there; never make a real
  decision on a PoC signal. Add `auth_token` + an authenticated broker first.
- **Secrets in `detail`/`data`** — keep payloads generalised; no paths, keys, or
  tokens (except the production `auth_token` in `data`).

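
Several of these pitfalls (late `started`, duplicate or trailing terminal
events, idle agents) meet in the subscriber's terminal state machine. The
sketch below shows one way such a machine could work; the class `JobWatcher`
and its methods are hypothetical, and starting the idle timer at subscribe
time is an assumption rather than documented behaviour.

```python
from typing import Optional

TERMINAL_EXIT = {"completed": 0, "error": 1}
TIMEOUT_EXIT = 2


class JobWatcher:
    """Finalise a job exactly once from events or from a timeout."""

    def __init__(self, timeout: float, idle_timeout: float, now: float) -> None:
        self.deadline = now + timeout      # wall clock starts at subscribe time
        self.idle_timeout = idle_timeout
        self.last_activity = now           # assumption: idle timer starts here too
        self.seen_seq: set = set()
        self.exit_code: Optional[int] = None

    def on_event(self, event: dict, now: float) -> None:
        if self.exit_code is not None:
            return  # already finalised: ignore trailing or duplicate events
        seq = event.get("seq")
        if seq in self.seen_seq:
            return  # QoS-1 redelivery of an event we already handled
        self.seen_seq.add(seq)
        # Receive time, not the payload's advisory timestamp, resets idleness.
        self.last_activity = now
        code = TERMINAL_EXIT.get(event.get("event"))
        if code is not None:
            self.exit_code = code

    def on_tick(self, now: float) -> None:
        if self.exit_code is not None:
            return
        if now >= self.deadline or now - self.last_activity >= self.idle_timeout:
            self.exit_code = TIMEOUT_EXIT
```

Passing `now` explicitly keeps the logic deterministic; a real subscriber would
feed it `time.monotonic()` from its MQTT callback and a periodic timer.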
## Verification Checklist

- [ ] `started` → `completed` over the public broker: subscriber prints the
      lines and exits **0**.
- [ ] `error` path: subscriber exits **1**.
- [ ] timeout path: no terminal event within `--timeout`/`--idle-timeout` →
      exit **2**.
- [ ] polluted payload (bad JSON, wrong `schema_version`, wrong `job_id`) is
      dropped with a warning, not crashed on.
- [ ] one tmux session processes two registry jobs in sequence; a second
      session with a different `agent_session` claims only its own.
- [ ] broker cut-over: same scripts reach an authenticated TLS broker with env
      changes only; a credential without write ACL is rejected; a late
      subscriber still receives the retained terminal event.
- [ ] `publisher.py`/`subscriber.py`/`README.md` demo on `python/mqtt/sample`
      still works unchanged (regression).
- [ ] **audit log integrity** — for a completed job,
      `.hermes/delegate_job_logs/<JID>/events.ndjson` contains `registered` →
      `received started` → `published completed` (in that order), and
      `status.json.status == "completed"` matches the registry record. A
      logging failure (e.g. read-only log dir) does not break the publish or
      subscribe path — only a `logger.warning` is emitted.
- [ ] **end-to-end demo smoke** — run
      `tmux-agent-orchestrate-delegate-job submit --agent claude-code --agent-session tmux:demo-smoke
      --prompt "echo hello and call publish_event.py --job <JID>
      --event completed" --timeout 120` and confirm
      (a) registered job id echoed, (b) subscriber pid echoed, (c) tmux session
      name printed, (d) `events.ndjson` grows as the agent runs, (e) final
      stdout line is the audit-log dir.