Job Registry
The registry is the single source of truth for delegated work. Job metadata
(id, prompt, broker, status, timeouts) lives in files, not environment
variables — so one tmux session can handle many jobs sequentially or in
parallel without collisions, and publish_event.py / job_subscriber.py can
reconstruct everything they need from the registry alone.
Reference implementation: ./scripts/registry.py (library + CLI), built on the
primitives in ./scripts/mqtt_common.py.
1. Directory layout
.hermes/jobs/
  <job_id>.json         # job metadata record (schema below)
  <job_id>.events.log   # append-only JSON-lines event log (debug, optional)
  .lock                 # shared advisory lock (fcntl) for the whole registry
registry_dir defaults to .hermes/jobs and is overridable everywhere via
--registry-dir.
2. Job record schema
{
  "schema_version": 1,
  "job_id": "abc12345",
  "status": "pending | running | completed | error | cancelled",
  "created_at": "2026-06-19T09:30:00Z",
  "updated_at": "2026-06-19T09:32:00Z",
  "prompt": "Create 10 sorting problems and save them as sort_problems.md…",
  "agent": "claude-code",
  "agent_session": "tmux:claude",
  "broker": {
    "host": "broker.hivemq.com",
    "port": 1883,
    "tls": false,
    "username": null,
    "password": null
  },
  "topic_prefix": "python/mqtt/jobs/abc12345",
  "timeout_sec": 3600,
  "idle_timeout_sec": 120,
  "expected_artifacts": ["sort_problems.md"],
  "last_seq": 0,
  "auth_token": null
}
- broker lets publish_event.py connect from the record alone (env still overrides toggles like MQTT_TLS).
- topic_prefix → the events topic is <topic_prefix>/events.
- last_seq backs the monotonic seq counter so it survives process restarts.
- expected_artifacts is the hook a user validate.sh checks (existence/content).
- auth_token is null in PoC; production sets secrets.token_urlsafe(32).
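The schema maps naturally onto a dataclass. The sketch below is illustrative only and is not the contents of scripts/registry.py: the names JobRecord, new_job_id, default_broker and events_topic are hypothetical, and generating the id with secrets.token_hex(4) is an assumption chosen to match the 8-character example id.

```python
import json
import secrets
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


def utc_now() -> str:
    """UTC timestamp in the same shape as the example record."""
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def new_job_id() -> str:
    """Hypothetical id scheme: 8 hex characters."""
    return secrets.token_hex(4)


def default_broker() -> dict:
    """Broker block copied from the example record."""
    return {"host": "broker.hivemq.com", "port": 1883, "tls": False,
            "username": None, "password": None}


@dataclass
class JobRecord:
    job_id: str
    prompt: str
    agent: str = "claude-code"
    agent_session: str = "tmux:claude"
    status: str = "pending"
    schema_version: int = 1
    created_at: str = field(default_factory=utc_now)
    updated_at: str = field(default_factory=utc_now)
    broker: dict = field(default_factory=default_broker)
    topic_prefix: str = ""
    timeout_sec: int = 3600
    idle_timeout_sec: int = 120
    expected_artifacts: List[str] = field(default_factory=list)
    last_seq: int = 0
    auth_token: Optional[str] = None  # null in the PoC

    def __post_init__(self) -> None:
        # Derive the prefix from the job id when the caller does not set one.
        if not self.topic_prefix:
            self.topic_prefix = f"python/mqtt/jobs/{self.job_id}"

    @property
    def events_topic(self) -> str:
        return f"{self.topic_prefix}/events"

    def to_json(self) -> str:
        return json.dumps(asdict(self), ensure_ascii=False, indent=2)
```

Keeping the defaults on the dataclass means a record written with only a prompt still carries every field a publisher or subscriber needs.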
3. Concurrency rules
PoC — fcntl advisory lock
Every read-modify-write (register_job, pick_pending, update_status,
next_seq) runs inside registry_lock(registry_dir), an exclusive
fcntl.flock over .lock. Single-host, good enough for many tmux sessions on
one machine.
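A minimal sketch of such a lock, assuming a POSIX host. The function name mirrors the registry_lock mentioned above, but the body is an illustration and not the code in scripts/registry.py.

```python
import fcntl
import os
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def registry_lock(registry_dir=".hermes/jobs"):
    """Hold an exclusive advisory lock on <registry_dir>/.lock for the block."""
    directory = Path(registry_dir)
    directory.mkdir(parents=True, exist_ok=True)
    fd = os.open(directory / ".lock", os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until other holders release
        yield
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)
```

The lock is advisory: it only protects callers that also go through registry_lock, so every read-modify-write path has to use it.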
Production — SQLite WAL
When delegation spans multiple hosts, the file lock no longer serialises
across machines. Migrate the same operations to a SQLite database in WAL mode
(PRAGMA journal_mode=WAL) with a transaction per claim. The function
signatures stay identical; only the storage backend changes.
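A rough sketch of a transaction-per-claim pick_pending on SQLite. The table layout and the open_registry helper are hypothetical, invented for this illustration. Note that SQLite's documentation requires all processes using a WAL-mode database to run on the same host, so the sketch assumes the database file is opened by processes on one machine (for example, a small service that other hosts call).

```python
import sqlite3


def open_registry(db_path):
    """Open (or create) the jobs database in WAL mode."""
    # isolation_level=None puts the connection in autocommit mode so that
    # transactions can be managed explicitly with BEGIN / COMMIT.
    conn = sqlite3.connect(db_path, isolation_level=None)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute(
        """CREATE TABLE IF NOT EXISTS jobs (
               job_id TEXT PRIMARY KEY,
               agent_session TEXT NOT NULL,
               status TEXT NOT NULL,
               created_at TEXT NOT NULL
           )"""
    )
    return conn


def pick_pending(conn, agent_session):
    """Claim the oldest pending job for this session, or return None."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        row = conn.execute(
            "SELECT job_id FROM jobs "
            "WHERE status = 'pending' AND agent_session = ? "
            "ORDER BY created_at LIMIT 1",
            (agent_session,),
        ).fetchone()
        if row is not None:
            conn.execute(
                "UPDATE jobs SET status = 'running' WHERE job_id = ?",
                (row[0],),
            )
        conn.execute("COMMIT")
        return row[0] if row is not None else None
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

BEGIN IMMEDIATE plays the role of the file lock: the select and the update happen inside one write transaction, so two claimers cannot both see the same pending row.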
4. How multiple sessions take only their own work
Each tmux session carries an agent_session label (tmux:claude,
tmux:claude-a, tmux:claude-b, …). pick_pending(agent_session):
- acquires the registry lock,
- scans for the oldest record with status == "pending" and matching agent_session,
- flips it to running and writes it back atomically,
- releases the lock and returns the job_id (or None).
Because the scan + flip happen under one lock, two sessions can never claim the
same job. Sessions with distinct labels naturally partition the work; sessions
sharing a label compete safely — first to acquire the lock wins, the other sees
the job already running and moves on.
# session A only ever runs its own pending jobs
$PY scripts/registry.py pick --agent-session tmux:claude-a # prints id or exits 3
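The claim sequence for the file-based registry can be sketched roughly as follows. This is an illustration under assumptions, not the code in scripts/registry.py: it treats created_at as the ordering key for "oldest" and inlines the lock and the atomic write so the example stands alone.

```python
import fcntl
import json
import os
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def pick_pending(agent_session, registry_dir=".hermes/jobs"):
    """Claim the oldest pending job for agent_session; return its id or None."""
    directory = Path(registry_dir)
    directory.mkdir(parents=True, exist_ok=True)
    with open(directory / ".lock", "a+") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # scan and flip under one lock
        try:
            candidates = []
            for path in directory.glob("*.json"):
                record = json.loads(path.read_text(encoding="utf-8"))
                if (record.get("status") == "pending"
                        and record.get("agent_session") == agent_session):
                    candidates.append((record["created_at"], path, record))
            if not candidates:
                return None
            _, path, record = min(candidates, key=lambda item: item[0])
            record["status"] = "running"
            record["updated_at"] = datetime.now(timezone.utc).strftime(
                "%Y-%m-%dT%H:%M:%SZ")
            # Write to a temp file in the same directory, then rename over
            # the record so readers never see a partial file.
            fd, tmp_name = tempfile.mkstemp(
                dir=directory, prefix=f".{record['job_id']}.", suffix=".tmp")
            with os.fdopen(fd, "w", encoding="utf-8") as handle:
                json.dump(record, handle, ensure_ascii=False)
                handle.flush()
                os.fsync(handle.fileno())
            os.replace(tmp_name, path)
            return record["job_id"]
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)
```

A session that shares a label with another simply calls this in a loop; whichever caller takes the lock second finds the record already running and gets the next pending job or None.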
5. Atomic status updates
All writes use a temp-file + os.replace rename, which is atomic on POSIX:
- take the registry lock,
- load the current record,
- mutate fields + refresh updated_at (and last_seq for next_seq),
- write to .<job_id>.<rand>.tmp in the same directory, fsync,
- os.replace(tmp, <job_id>.json),
- release the lock.
A reader therefore always sees either the old or the new complete record, never
a half-written file. This is the file-based equivalent of the rename trick
(pending.<session> → running.<session>) and maps cleanly onto a single
SQLite transaction when you migrate.
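The write path above can be sketched as a small helper plus a next_seq built on it. Both functions are illustrative: write_record_atomically is a hypothetical name, and next_seq here assumes the caller already holds the registry lock.

```python
import json
import os
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def write_record_atomically(registry_dir, record):
    """Write <job_id>.json via a temp file and an atomic rename."""
    directory = Path(registry_dir)
    target = directory / f"{record['job_id']}.json"
    # The temp file must live in the same directory (same filesystem),
    # otherwise os.replace cannot rename it atomically.
    fd, tmp_name = tempfile.mkstemp(
        dir=directory, prefix=f".{record['job_id']}.", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(record, handle, ensure_ascii=False, indent=2)
            handle.flush()
            os.fsync(handle.fileno())  # data reaches disk before the rename
        os.replace(tmp_name, target)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)  # do not leave stray temp files behind
        raise
    return target


def next_seq(registry_dir, job_id):
    """Increment and persist last_seq; the caller holds the registry lock."""
    path = Path(registry_dir) / f"{job_id}.json"
    record = json.loads(path.read_text(encoding="utf-8"))
    record["last_seq"] = record.get("last_seq", 0) + 1
    record["updated_at"] = datetime.now(timezone.utc).strftime(
        "%Y-%m-%dT%H:%M:%SZ")
    write_record_atomically(registry_dir, record)
    return record["last_seq"]
```

Because the counter is persisted with every increment, a publisher that restarts picks up numbering where the previous process stopped.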
6. CLI quick reference
PY=.venv/bin/python
$PY scripts/registry.py register --prompt "…" --agent claude-code \
--agent-session tmux:claude --timeout 3600 --idle-timeout 120 # → prints job_id
$PY scripts/registry.py list # human table
$PY scripts/registry.py list --json # full records
$PY scripts/registry.py get --job <id> # one record
$PY scripts/registry.py status --job <id> --set completed # set status
$PY scripts/registry.py pick --agent-session tmux:claude # claim → running
Exit codes: 0 ok, 1 not found / bad status, 3 (pick) no pending job for
that session.
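A caller that wraps the CLI can use these exit codes to tell "nothing to do" apart from a real failure. The helper below is a hypothetical sketch (claim_job is not part of the scripts); it only assumes what is stated above, namely that pick prints the job id on success and exits 3 when no job is pending.

```python
import subprocess
import sys

NO_PENDING_JOB = 3  # exit code of `pick` when the session has no pending job


def claim_job(command):
    """Run a `registry.py pick ...` command; return the job id or None."""
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode == 0:
        return result.stdout.strip()
    if result.returncode == NO_PENDING_JOB:
        return None
    raise RuntimeError(
        f"pick failed with exit code {result.returncode}: "
        f"{result.stderr.strip()}")
```

A typical call would pass the same arguments as the shell form, for example claim_job([".venv/bin/python", "scripts/registry.py", "pick", "--agent-session", "tmux:claude"]).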
7. Persistent audit log
Separate from the registry, every job is also mirrored to a durable append-only
audit log at .hermes/delegate_job_logs/<job_id>/ (override with
DELEGATE_JOB_LOGS_DIR, default <cwd>/.hermes/delegate_job_logs). The registry
is live state mutated in place; the audit log is history that survives
even after the registry dir is cleaned up. It is git-ignored.
.hermes/delegate_job_logs/<job_id>/
  meta.json       # registration snapshot (the full job record at register time)
  events.ndjson   # append-only, one JSON event per line, time-ordered
  status.json     # current status only (fast point-query)
events.ndjson lines are written automatically at four points:
| Trigger | line event | Source |
|---|---|---|
| register_job | registered | registry.register_job → mqtt_common.init_job_log |
| status change (update_status, pick, publish status sync) | status_changed (from/to) | mqtt_common.update_job_status / pick_pending |
| event published | published (embeds the exact payload) | publish_event.py |
| event received | received | job_subscriber.py |
Helpers live in ./scripts/mqtt_common.py:
LOGS_DIR, job_log_path, init_job_log, append_event (fcntl-locked,
concurrent-append safe), update_logged_status, and the readers
read_logged_meta / read_logged_status / iter_logged_events /
list_logged_jobs. Every writer is best-effort and isolated — wrapped in
try/except with a logger.warning, so an audit-log failure never breaks the
registry write, the publish, or the subscribe it shadows.
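The best-effort pattern can be sketched like this. The signature and return value are assumptions made for the illustration and may differ from the real append_event in mqtt_common.py; the "event" key in the usage below is likewise hypothetical.

```python
import fcntl
import json
import logging
import os
from pathlib import Path

logger = logging.getLogger("delegate_job_log")


def logs_dir():
    """Audit-log root: DELEGATE_JOB_LOGS_DIR or <cwd>/.hermes/delegate_job_logs."""
    default = Path.cwd() / ".hermes" / "delegate_job_logs"
    return Path(os.environ.get("DELEGATE_JOB_LOGS_DIR", default))


def append_event(job_id, event):
    """Append one JSON line to events.ndjson; never raise to the caller."""
    try:
        job_dir = logs_dir() / job_id
        job_dir.mkdir(parents=True, exist_ok=True)
        line = json.dumps(event, ensure_ascii=False) + "\n"
        with open(job_dir / "events.ndjson", "a", encoding="utf-8") as handle:
            fcntl.flock(handle, fcntl.LOCK_EX)  # one writer at a time
            try:
                handle.write(line)
                handle.flush()
            finally:
                fcntl.flock(handle, fcntl.LOCK_UN)
        return True
    except Exception as exc:
        # An audit-log failure is logged and swallowed on purpose.
        logger.warning("audit log append failed for %s: %s", job_id, exc)
        return False
```

Returning a boolean instead of raising lets the caller ignore the result entirely, which is what keeps the registry write or publish independent of the log.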
Read them via the CLI:
PY=.venv/bin/python
$PY scripts/registry.py logs <job_id> # pretty timeline
$PY scripts/registry.py logs <job_id> --tail 20 # last 20 events
$PY scripts/registry.py logs <job_id> --json # raw JSON lines
$PY scripts/registry.py logs --list # every job, live status
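For programmatic access, events.ndjson can also be read directly. The reader below is a hypothetical sketch of what a --tail style query might do; tail_events is not one of the helpers listed above.

```python
import json
from collections import deque


def tail_events(events_path, count=20):
    """Return the last `count` events from an events.ndjson file."""
    recent = deque(maxlen=count)  # keeps only the newest `count` entries
    with open(events_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            try:
                recent.append(json.loads(line))
            except json.JSONDecodeError:
                continue  # skip a torn or partially written line
    return list(recent)
```

Streaming the file through a bounded deque keeps memory flat even for long-running jobs with large logs.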