Godopu a6f7c045bc feat(delegate-job): bump default --timeout 600s -> 3600s (1h wall-clock budget)

Changed 11 locations across 5 files:
- scripts/registry.py: timeout_sec dataclass default + argparse default
- scripts/job_subscriber.py: help text + fallback default
- SKILL.md: 4 recommended invocation examples
- registry.md: JSON example + CLI example
- tmux-agent-orchestrate-delegate-job: bash wrapper TIMEOUT var

--idle-timeout 120s preserved unchanged.
Rationale: 10min default was too short for deep analysis / multi-file
generation tasks; 1h aligns with long-running agent delegation patterns.

2026-06-21 06:08:49 +00:00

6.7 KiB

Raw Blame History

Job Registry

The registry is the single source of truth for delegated work. Job metadata (id, prompt, broker, status, timeouts) lives in files, not environment variables — so one tmux session can handle many jobs sequentially or in parallel without collisions, and publish_event.py / job_subscriber.py can reconstruct everything they need from the registry alone.

Reference implementation: ./scripts/registry.py (library + CLI) over the primitives in ./scripts/mqtt_common.py.

1. Directory layout

.hermes/jobs/
  <job_id>.json          # job metadata record (schema below)
  <job_id>.events.log    # append-only JSON-lines event log (debug, optional)
  .lock                  # shared advisory lock (fcntl) for the whole registry

registry_dir defaults to .hermes/jobs and is overridable everywhere via --registry-dir.

2. Job record schema

{
  "schema_version": 1,
  "job_id": "abc12345",
  "status": "pending | running | completed | error | cancelled",
  "created_at": "2026-06-19T09:30:00Z",
  "updated_at": "2026-06-19T09:32:00Z",
  "prompt": "Create 10 sorting problems and save them as sort_problems.md…",
  "agent": "claude-code",
  "agent_session": "tmux:claude",
  "broker": {
    "host": "broker.hivemq.com",
    "port": 1883,
    "tls": false,
    "username": null,
    "password": null
  },
  "topic_prefix": "python/mqtt/jobs/abc12345",
  "timeout_sec": 3600,
  "idle_timeout_sec": 120,
  "expected_artifacts": ["sort_problems.md"],
  "last_seq": 0,
  "auth_token": null
}

broker lets publish_event.py connect from the record alone (environment variables still override toggles such as MQTT_TLS).
topic_prefix determines the events topic, which is <topic_prefix>/events.
last_seq backs the monotonic seq counter so that it survives process restarts.
expected_artifacts is the hook that a user-supplied validate.sh checks (existence/content).
auth_token is null in the PoC; production sets it to secrets.token_urlsafe(32).
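As an illustration of the schema, the record could be modelled as a dataclass. This is a minimal sketch, not the code in ./scripts/registry.py: the class name JobRecord, the helper utc_now, the events_topic property, and the rule that topic_prefix defaults to python/mqtt/jobs/<job_id> are assumptions inferred from the example record above.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional


def utc_now() -> str:
    """UTC timestamp in the same shape as created_at / updated_at above."""
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def default_broker() -> dict:
    """Broker block copied from the example record (public test broker)."""
    return {"host": "broker.hivemq.com", "port": 1883, "tls": False,
            "username": None, "password": None}


@dataclass
class JobRecord:  # hypothetical name, for illustration only
    job_id: str
    prompt: str
    agent: str = "claude-code"
    agent_session: str = "tmux:claude"
    status: str = "pending"
    schema_version: int = 1
    created_at: str = field(default_factory=utc_now)
    updated_at: str = field(default_factory=utc_now)
    broker: dict = field(default_factory=default_broker)
    topic_prefix: str = ""          # filled in below when left empty
    timeout_sec: int = 3600         # 1h wall-clock budget
    idle_timeout_sec: int = 120
    expected_artifacts: list = field(default_factory=list)
    last_seq: int = 0
    auth_token: Optional[str] = None  # null in the PoC

    def __post_init__(self) -> None:
        # Assumption: the prefix follows the pattern seen in the example record.
        if not self.topic_prefix:
            self.topic_prefix = f"python/mqtt/jobs/{self.job_id}"

    @property
    def events_topic(self) -> str:
        """The events topic is <topic_prefix>/events."""
        return f"{self.topic_prefix}/events"


record = JobRecord(job_id="abc12345", prompt="…")
print(json.dumps(asdict(record), ensure_ascii=False, indent=2))
```

Because events_topic is a property rather than a field, asdict leaves it out, so the serialised record keeps exactly the keys listed in the schema.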

3. Concurrency rules

PoC — fcntl advisory lock

Every read-modify-write (register_job, pick_pending, update_status, next_seq) runs inside registry_lock(registry_dir), an exclusive fcntl.flock over .lock. The lock only serialises processes on a single host, which is good enough for many tmux sessions on one machine.
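A lock of this kind can be written as a small context manager. The sketch below assumes only what the paragraph states (an exclusive fcntl.flock over .lock inside the registry directory); the real registry_lock may differ in details such as timeouts or error handling.

```python
import fcntl
import os
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def registry_lock(registry_dir=".hermes/jobs"):
    """Hold an exclusive advisory lock on <registry_dir>/.lock for the block."""
    registry = Path(registry_dir)
    registry.mkdir(parents=True, exist_ok=True)
    fd = os.open(registry / ".lock", os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)   # blocks until no other holder remains
        yield
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)
```

Usage is `with registry_lock(registry_dir): ...` around each read-modify-write. The lock is advisory: it only protects against other processes that also take it, and fcntl is available on POSIX systems only.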

Production — SQLite WAL

When delegation spans multiple hosts, the file lock no longer serialises across machines. Migrate the same operations to a SQLite database in WAL mode (PRAGMA journal_mode=WAL) with a transaction per claim. The function signatures stay identical; only the storage backend changes.
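One way the claim could look on SQLite is sketched below. The table layout, the function names open_registry and pick_pending_sqlite, and the use of BEGIN IMMEDIATE are assumptions made for illustration; the document does not specify them. Note also that SQLite's own documentation requires all processes using a WAL-mode database to run on the same host, so a true multi-host deployment would probably need the database behind a single service, which this sketch does not cover.

```python
import sqlite3


def open_registry(db_path):
    """Open the database in WAL mode and create a minimal jobs table."""
    # isolation_level=None disables implicit transactions; we issue BEGIN ourselves.
    conn = sqlite3.connect(db_path, isolation_level=None)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute(
        """CREATE TABLE IF NOT EXISTS jobs (
               job_id        TEXT PRIMARY KEY,
               agent_session TEXT NOT NULL,
               status        TEXT NOT NULL,
               created_at    TEXT NOT NULL
           )"""
    )
    return conn


def pick_pending_sqlite(conn, agent_session):
    """Claim the oldest pending job for agent_session in one transaction."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        row = conn.execute(
            "SELECT job_id FROM jobs "
            "WHERE status = 'pending' AND agent_session = ? "
            "ORDER BY created_at, job_id LIMIT 1",
            (agent_session,),
        ).fetchone()
        if row is not None:
            conn.execute(
                "UPDATE jobs SET status = 'running' WHERE job_id = ?", (row[0],)
            )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return row[0] if row is not None else None
```

BEGIN IMMEDIATE plays the role of the file lock here: the select and the update happen while this connection holds the write lock, so two claimers cannot both see the same row as pending.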

4. How multiple sessions take only their own work

Each tmux session carries an agent_session label (tmux:claude, tmux:claude-a, tmux:claude-b, …). pick_pending(agent_session):

acquires the registry lock,
scans for the oldest record with status == "pending" and matching agent_session,
flips it to running and writes it back atomically,
releases the lock and returns the job_id (or None).

Because the scan + flip happen under one lock, two sessions can never claim the same job. Sessions with distinct labels naturally partition the work; sessions sharing a label compete safely — first to acquire the lock wins, the other sees the job already running and moves on.

# session A only ever runs its own pending jobs
$PY scripts/registry.py pick --agent-session tmux:claude-a   # prints id or exits 3
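The four steps of pick_pending can be sketched over plain files as follows. This is an illustration under assumptions, not the reference implementation: it orders "oldest" by the created_at string (ISO timestamps in one format sort lexicographically), inlines the lock instead of calling registry_lock, and uses a simplified temp-file name.

```python
import fcntl
import json
import os
from datetime import datetime, timezone
from pathlib import Path


def pick_pending(agent_session, registry_dir=".hermes/jobs"):
    """Claim the oldest pending job for agent_session; return its id or None."""
    registry = Path(registry_dir)
    registry.mkdir(parents=True, exist_ok=True)
    with open(registry / ".lock", "a") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)          # 1. acquire the lock
        try:
            # 2. scan for pending records that carry this session's label
            mine = []
            for path in registry.glob("*.json"):
                record = json.loads(path.read_text(encoding="utf-8"))
                if (record.get("status") == "pending"
                        and record.get("agent_session") == agent_session):
                    mine.append((record["created_at"], path.name, record))
            if not mine:
                return None
            _, name, record = min(mine, key=lambda item: item[:2])

            # 3. flip to running and write back via temp file + rename
            record["status"] = "running"
            record["updated_at"] = datetime.now(timezone.utc).strftime(
                "%Y-%m-%dT%H:%M:%SZ")
            tmp = registry / f".{name}.tmp"
            tmp.write_text(json.dumps(record, ensure_ascii=False, indent=2),
                           encoding="utf-8")
            os.replace(tmp, registry / name)
            return record["job_id"]
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)      # 4. release the lock
```

A second caller with the same label blocks at step 1 until the first has finished step 3, so by the time it scans, the claimed job is already running and is skipped.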

5. Atomic status updates

All writes use a temp-file + os.replace rename, which is atomic on POSIX:

take the registry lock,
load the current record,
mutate fields + refresh updated_at (and last_seq for next_seq),
write to .<job_id>.<rand>.tmp in the same directory, fsync,
os.replace(tmp, <job_id>.json),
release the lock.

A reader therefore always sees either the old or the new complete record, never a half-written file. This is the file-based equivalent of the rename trick (pending.<session> → running.<session>) and maps cleanly onto a single SQLite transaction when you migrate.
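The write half of that sequence (temp file, fsync, os.replace) might look like the sketch below. The helper name atomic_write_json is hypothetical, and the caller is assumed to already hold the registry lock and to have loaded and mutated the record.

```python
import json
import os
import tempfile
from pathlib import Path


def atomic_write_json(path, record):
    """Replace `path` with `record` so readers never see a partial file."""
    path = Path(path)
    # Temp file lives in the same directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(
        prefix=f".{path.stem}.", suffix=".tmp", dir=str(path.parent))
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(record, tmp, ensure_ascii=False, indent=2)
            tmp.flush()
            os.fsync(tmp.fileno())      # data is on disk before the rename
        os.replace(tmp_name, path)      # atomic swap on POSIX
    except BaseException:
        # Leave no stray temp file behind if anything failed before the rename.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

Keeping the temp file in the same directory matters: os.replace is only atomic when source and destination are on the same filesystem.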

6. CLI quick reference

PY=.venv/bin/python
$PY scripts/registry.py register --prompt "…" --agent claude-code \
    --agent-session tmux:claude --timeout 3600 --idle-timeout 120   # → prints job_id
$PY scripts/registry.py list                                       # human table
$PY scripts/registry.py list --json                                # full records
$PY scripts/registry.py get    --job <id>                          # one record
$PY scripts/registry.py status --job <id> --set completed          # set status
$PY scripts/registry.py pick   --agent-session tmux:claude         # claim → running

Exit codes: 0 ok, 1 not found / bad status, 3 (pick) no pending job for that session.
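A caller that drives the CLI from Python could translate these exit codes as in the following sketch. The function names are hypothetical, and it assumes that pick prints the claimed job id on stdout, as the comment in section 4 suggests.

```python
import subprocess

EXIT_OK = 0          # ok
EXIT_ERROR = 1       # not found / bad status
EXIT_NO_PENDING = 3  # pick: no pending job for that session


def interpret_pick(returncode, stdout):
    """Map a `pick` result to a job id, None (nothing to do), or an error."""
    if returncode == EXIT_OK:
        return stdout.strip()
    if returncode == EXIT_NO_PENDING:
        return None
    raise RuntimeError(f"registry.py pick failed with exit code {returncode}")


def pick(agent_session, python=".venv/bin/python"):
    """Run the pick subcommand and interpret its exit code."""
    proc = subprocess.run(
        [python, "scripts/registry.py", "pick", "--agent-session", agent_session],
        capture_output=True,
        text=True,
    )
    return interpret_pick(proc.returncode, proc.stdout)
```

Treating exit code 3 as "no work" rather than as a failure lets a polling loop sleep and retry without logging an error on every empty pass.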

7. Persistent audit log

Separate from the registry, every job is also mirrored to a durable append-only audit log at .hermes/delegate_job_logs/<job_id>/ (override with DELEGATE_JOB_LOGS_DIR, default <cwd>/.hermes/delegate_job_logs). The registry is live state mutated in place; the audit log is history that survives even after the registry dir is cleaned up. It is git-ignored.

.hermes/delegate_job_logs/<job_id>/
  meta.json      # registration snapshot (the full job record at register time)
  events.ndjson  # append-only, one JSON event per line, time-ordered
  status.json    # current status only (fast point-query)

events.ndjson lines are written automatically at four points:

| Trigger | line `event` | Source |
| --- | --- | --- |
| `register_job` | `registered` | `registry.register_job` → `mqtt_common.init_job_log` |
| status change (`update_status`, `pick`, publish status sync) | `status_changed` (`from`/`to`) | `mqtt_common.update_job_status` / `pick_pending` |
| event published | `published` (embeds the exact payload) | `publish_event.py` |
| event received | `received` | `job_subscriber.py` |

Helpers live in ./scripts/mqtt_common.py: LOGS_DIR, job_log_path, init_job_log, append_event (fcntl-locked, concurrent-append safe), update_logged_status, and the readers read_logged_meta / read_logged_status / iter_logged_events / list_logged_jobs. Every writer is best-effort and isolated — wrapped in try/except with a logger.warning, so an audit-log failure never breaks the registry write, the publish, or the subscribe it shadows.
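A best-effort, lock-protected append in the spirit of append_event could be sketched like this. The signature, the ts field, the boolean return value, and the logger name are assumptions; only the events.ndjson file name, the event key, the fcntl locking, and the try/except plus logger.warning behaviour come from the description above.

```python
import fcntl
import json
import logging
import os
from datetime import datetime, timezone
from pathlib import Path

logger = logging.getLogger("delegate_job_audit")  # hypothetical logger name


def logs_dir():
    """DELEGATE_JOB_LOGS_DIR if set, else <cwd>/.hermes/delegate_job_logs."""
    default = Path.cwd() / ".hermes" / "delegate_job_logs"
    return Path(os.environ.get("DELEGATE_JOB_LOGS_DIR", default))


def append_event(job_id, event, **fields):
    """Append one JSON line to events.ndjson; never raise to the caller."""
    try:
        job_dir = logs_dir() / job_id
        job_dir.mkdir(parents=True, exist_ok=True)
        line = json.dumps(
            {"ts": datetime.now(timezone.utc).isoformat(),  # assumed field
             "event": event, **fields},
            ensure_ascii=False,
        )
        with open(job_dir / "events.ndjson", "a", encoding="utf-8") as f:
            fcntl.flock(f, fcntl.LOCK_EX)   # serialise concurrent appenders
            try:
                f.write(line + "\n")
                f.flush()
            finally:
                fcntl.flock(f, fcntl.LOCK_UN)
        return True
    except Exception as exc:
        # Audit logging must not break the operation it shadows.
        logger.warning("audit log append failed for %s: %s", job_id, exc)
        return False
```

For example, `append_event("abc12345", "status_changed", **{"from": "pending", "to": "running"})` would add one line; `from` is a Python keyword, hence the dict unpacking.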

Read them via the CLI:

PY=.venv/bin/python
$PY scripts/registry.py logs <job_id>            # pretty timeline
$PY scripts/registry.py logs <job_id> --tail 20  # last 20 events
$PY scripts/registry.py logs <job_id> --json     # raw JSON lines
$PY scripts/registry.py logs --list              # every job, live status
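The same data can be read directly from Python. The sketch below shows one way a reader such as iter_logged_events and a --tail style view could work; the function names and signatures here are illustrative and are not taken from mqtt_common.py.

```python
import json
from collections import deque
from pathlib import Path


def iter_events(events_path):
    """Yield one parsed event per non-empty line of an events.ndjson file."""
    with open(events_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def tail_events(events_path, n):
    """Return the last n events in file order."""
    # A bounded deque keeps only the newest n items while streaming the file.
    return list(deque(iter_events(events_path), maxlen=n))
```

Since the log is append-only and time-ordered, the last n lines are also the n most recent events, so no sorting is needed.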
