Job Registry
The registry is the single source of truth for delegated work. Job metadata
(id, prompt, broker, status, timeouts) lives in files, not environment
variables — so one tmux session can handle many jobs sequentially or in
parallel without collisions, and publish_event.py / job_subscriber.py can
reconstruct everything they need from the registry alone.
Reference implementation: ./scripts/registry.py (library + CLI), built on the
primitives in ./scripts/mqtt_common.py.
1. Directory layout
.mam/jobs/
<job_id>.json # job metadata record (schema below)
<job_id>.events.log # append-only JSON-lines event log (debug, optional)
.lock # shared advisory lock (fcntl) for the whole registry
registry_dir defaults to .mam/jobs and is overridable everywhere via
--registry-dir.
2. Job record schema
{
"schema_version": 1,
"job_id": "abc12345",
"status": "pending | running | completed | error | cancelled",
"created_at": "2026-06-19T09:30:00Z",
"updated_at": "2026-06-19T09:32:00Z",
"prompt": "Create 10 sorting problems and save them as sort_problems.md…",
"agent": "claude-code",
"agent_session": "tmux:claude",
"broker": {
"host": "broker.hivemq.com",
"port": 1883,
"tls": false,
"username": null,
"password": null
},
"topic_prefix": "python/mqtt/jobs/abc12345",
"timeout_sec": 3600,
"idle_timeout_sec": 120,
"expected_artifacts": ["sort_problems.md"],
"last_seq": 0,
"auth_token": null
}
- broker lets publish_event.py connect from the record alone (env still overrides toggles like MQTT_TLS).
- topic_prefix → the events topic is <topic_prefix>/events.
- last_seq backs the monotonic seq counter so it survives process restarts.
- expected_artifacts is the hook a user's validate.sh checks (existence/content).
- auth_token is null in PoC; production sets secrets.token_urlsafe(32).
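As a rough illustration of the schema, the sketch below builds a record with the same fields and derives the events topic from topic_prefix. The function names new_job_record and events_topic, the job-id scheme, and the defaults (copied from the example record) are assumptions for illustration, not the actual registry.py API.

```python
import secrets
from datetime import datetime, timezone


def new_job_record(prompt, agent_session, *, production=False):
    """Build a dict shaped like the job record schema (hypothetical helper)."""
    job_id = secrets.token_hex(4)  # 8 hex chars; the real id scheme is an assumption
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "schema_version": 1,
        "job_id": job_id,
        "status": "pending",
        "created_at": now,
        "updated_at": now,
        "prompt": prompt,
        "agent": "claude-code",
        "agent_session": agent_session,
        # Broker values below are simply the ones shown in the example record.
        "broker": {"host": "broker.hivemq.com", "port": 1883, "tls": False,
                   "username": None, "password": None},
        "topic_prefix": f"python/mqtt/jobs/{job_id}",
        "timeout_sec": 3600,
        "idle_timeout_sec": 120,
        "expected_artifacts": [],
        "last_seq": 0,
        # null in PoC; a random URL-safe token in production.
        "auth_token": secrets.token_urlsafe(32) if production else None,
    }


def events_topic(record):
    """The events topic is <topic_prefix>/events."""
    return f"{record['topic_prefix']}/events"
```

Deriving the topic from the record keeps publisher and subscriber in agreement without any shared environment variable.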
3. Concurrency rules
PoC — fcntl advisory lock
Every read-modify-write (register_job, pick_pending, update_status,
next_seq) runs inside registry_lock(registry_dir), an exclusive
fcntl.flock over .lock. Single-host, good enough for many tmux sessions on
one machine.
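A minimal sketch of such a lock, assuming registry_lock is a context manager that takes the registry directory (the real signature in registry.py may differ):

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def registry_lock(registry_dir=".mam/jobs"):
    """Hold an exclusive advisory lock on <registry_dir>/.lock for the block."""
    os.makedirs(registry_dir, exist_ok=True)
    lock_path = os.path.join(registry_dir, ".lock")
    # Append mode creates the file if needed and never truncates it.
    with open(lock_path, "a") as fh:
        fcntl.flock(fh, fcntl.LOCK_EX)  # blocks until no other holder remains
        try:
            yield
        finally:
            fcntl.flock(fh, fcntl.LOCK_UN)
```

The lock is advisory: it only protects callers that also go through registry_lock, which is why every read-modify-write has to use it.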
Production — SQLite WAL
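When delegation spans multiple hosts, the file lock no longer serialises
across machines. Migrate the same operations to a SQLite database in WAL mode
(PRAGMA journal_mode=WAL) with a transaction per claim. The function
signatures stay identical; only the storage backend changes. Note that WAL
mode requires every process that opens the database to run on the same host
(it does not work over a network filesystem), so remote hosts cannot open the
file directly over a network share.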
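One way the claim could look on SQLite, as a sketch only: the jobs table layout, the open_registry helper, and passing a connection to pick_pending are assumptions, not the project's actual migration.

```python
import sqlite3


def open_registry(db_path):
    """Open (and if needed create) a hypothetical jobs table in WAL mode."""
    # isolation_level=None puts the connection in autocommit mode so that
    # transactions are controlled explicitly with BEGIN/COMMIT below.
    conn = sqlite3.connect(db_path, isolation_level=None)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        " job_id TEXT PRIMARY KEY,"
        " agent_session TEXT NOT NULL,"
        " status TEXT NOT NULL,"
        " created_at TEXT NOT NULL)"
    )
    return conn


def pick_pending(conn, agent_session):
    """Claim the oldest pending job for this session in one transaction."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        row = conn.execute(
            "SELECT job_id FROM jobs"
            " WHERE status = 'pending' AND agent_session = ?"
            " ORDER BY created_at LIMIT 1",
            (agent_session,),
        ).fetchone()
        if row is not None:
            conn.execute(
                "UPDATE jobs SET status = 'running' WHERE job_id = ?", (row[0],)
            )
        conn.execute("COMMIT")
        return row[0] if row is not None else None
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

BEGIN IMMEDIATE plays the role of the exclusive file lock: the select and the update cannot interleave with another writer's claim.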
4. How multiple sessions take only their own work
Each tmux session carries an agent_session label (tmux:claude,
tmux:claude-a, tmux:claude-b, …). pick_pending(agent_session):
- acquires the registry lock,
- scans for the oldest record with status == "pending" and matching agent_session,
- flips it to running and writes it back atomically,
- releases the lock and returns the job_id (or None).
Because the scan + flip happen under one lock, two sessions can never claim the
same job. Sessions with distinct labels naturally partition the work; sessions
sharing a label compete safely — first to acquire the lock wins, the other sees
the job already running and moves on.
# session A only ever runs its own pending jobs
$PY scripts/registry.py pick --agent-session tmux:claude-a # prints id or exits 3
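The claim can be sketched in a few lines of file-based Python. This is an illustrative stand-in, not the code in registry.py: the registry_dir parameter, the inline lock, and ordering by created_at are assumptions.

```python
import fcntl
import glob
import json
import os
import tempfile


def pick_pending(agent_session, registry_dir=".mam/jobs"):
    """Claim the oldest pending job for agent_session; return its id or None."""
    os.makedirs(registry_dir, exist_ok=True)
    with open(os.path.join(registry_dir, ".lock"), "a") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # released when the file is closed
        candidates = []
        # "*.json" skips dotfiles such as .lock and the hidden temp files.
        for path in glob.glob(os.path.join(registry_dir, "*.json")):
            with open(path, encoding="utf-8") as fh:
                rec = json.load(fh)
            if (rec.get("status") == "pending"
                    and rec.get("agent_session") == agent_session):
                candidates.append((rec["created_at"], path, rec))
        if not candidates:
            return None
        # ISO-8601 UTC timestamps sort correctly as plain strings.
        _, path, rec = min(candidates, key=lambda c: c[:2])
        rec["status"] = "running"
        # Write a sibling temp file, then rename over the record atomically.
        fd, tmp = tempfile.mkstemp(
            dir=registry_dir, prefix=f".{rec['job_id']}.", suffix=".tmp"
        )
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(rec, fh)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp, path)
        return rec["job_id"]
```

Because the scan and the flip sit inside the same locked region, a second session calling pick_pending with the same label only ever sees the job after it is already running.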
5. Atomic status updates
All writes use a temp-file + os.replace rename, which is atomic on POSIX:
- take the registry lock,
- load the current record,
- mutate fields + refresh updated_at (and last_seq for next_seq),
- write to .<job_id>.<rand>.tmp in the same directory, fsync,
- os.replace(tmp, <job_id>.json),
- release the lock.
A reader therefore always sees either the old or the new complete record, never
a half-written file. This is the file-based equivalent of the rename trick
(pending.<session> → running.<session>) and maps cleanly onto a single
SQLite transaction when you migrate.
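The write sequence above, applied to next_seq, might look like the following sketch. The signature and the inline locking are assumptions made to keep the example self-contained; the reference implementation may factor this differently.

```python
import fcntl
import json
import os
import secrets
from datetime import datetime, timezone


def next_seq(job_id, registry_dir=".mam/jobs"):
    """Increment and persist last_seq for a job, returning the new value."""
    path = os.path.join(registry_dir, f"{job_id}.json")
    with open(os.path.join(registry_dir, ".lock"), "a") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)          # take the registry lock
        with open(path, encoding="utf-8") as fh:   # load the current record
            record = json.load(fh)
        record["last_seq"] = record.get("last_seq", 0) + 1
        record["updated_at"] = datetime.now(timezone.utc).strftime(
            "%Y-%m-%dT%H:%M:%SZ"
        )
        # Temp file in the same directory, so the rename stays on one filesystem.
        tmp = os.path.join(registry_dir, f".{job_id}.{secrets.token_hex(4)}.tmp")
        with open(tmp, "w", encoding="utf-8") as fh:
            json.dump(record, fh, ensure_ascii=False, indent=2)
            fh.flush()
            os.fsync(fh.fileno())                  # data on disk before the rename
        os.replace(tmp, path)                      # atomic swap of old for new
    return record["last_seq"]                      # lock released on file close
```

Keeping the temp file in the registry directory matters: os.replace is only atomic when source and destination are on the same filesystem.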
6. CLI quick reference
PY=.venv/bin/python
$PY scripts/registry.py register --prompt "…" --agent claude-code \
--agent-session tmux:claude --timeout 3600 --idle-timeout 120 # → prints job_id
$PY scripts/registry.py list # human table
$PY scripts/registry.py list --json # full records
$PY scripts/registry.py get --job <id> # one record
$PY scripts/registry.py status --job <id> --set completed # set status
$PY scripts/registry.py pick --agent-session tmux:claude # claim → running
Exit codes: 0 ok, 1 not found / bad status, 3 (pick) no pending job for
that session.
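A caller that drives the CLI from Python could map these exit codes as in the sketch below. The helper names interpret_pick and claim_next are hypothetical, and the sketch assumes pick prints only the job id on stdout.

```python
import subprocess


def interpret_pick(returncode, stdout):
    """Translate the pick exit code into a job id, None, or an error."""
    if returncode == 0:
        return stdout.strip()        # claimed: stdout carries the job id
    if returncode == 3:
        return None                  # no pending job for that session
    raise RuntimeError(f"registry pick failed with exit code {returncode}")


def claim_next(agent_session, python=".venv/bin/python"):
    """Run the pick subcommand and interpret its result."""
    proc = subprocess.run(
        [python, "scripts/registry.py", "pick", "--agent-session", agent_session],
        capture_output=True,
        text=True,
    )
    return interpret_pick(proc.returncode, proc.stdout)
```

Treating exit code 3 as an ordinary "nothing to do" result lets a polling loop distinguish an empty queue from a real failure.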
7. Persistent audit log
Separate from the registry, every job is also mirrored to a durable append-only
audit log at .mam/delegate_job_logs/<job_id>/ (override with
DELEGATE_JOB_LOGS_DIR, default <cwd>/.mam/delegate_job_logs). The registry
is live state mutated in place; the audit log is history that survives
even after the registry dir is cleaned up. It is git-ignored.
.mam/delegate_job_logs/<job_id>/
meta.json # registration snapshot (the full job record at register time)
events.ndjson # append-only, one JSON event per line, time-ordered
status.json # current status only (fast point-query)
events.ndjson lines are written automatically at four points:
| Trigger | line event | Source |
|---|---|---|
| register_job | registered | registry.register_job → mqtt_common.init_job_log |
| status change (update_status, pick, publish status sync) | status_changed (from/to) | mqtt_common.update_job_status / pick_pending |
| event published | published (embeds the exact payload) | publish_event.py |
| event received | received | job_subscriber.py |
Helpers live in ./scripts/mqtt_common.py:
LOGS_DIR, job_log_path, init_job_log, append_event (fcntl-locked,
concurrent-append safe), update_logged_status, and the readers
read_logged_meta / read_logged_status / iter_logged_events /
list_logged_jobs. Every writer is best-effort and isolated — wrapped in
try/except with a logger.warning, so an audit-log failure never breaks the
registry write, the publish, or the subscribe it shadows.
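The best-effort pattern can be sketched as follows. This is not the helper from mqtt_common.py: the signature, the logs_dir parameter, the boolean return value, and the event field names used in the usage note are assumptions for illustration.

```python
import fcntl
import json
import logging
import os

logger = logging.getLogger(__name__)


def append_event(job_id, event, logs_dir=".mam/delegate_job_logs"):
    """Append one JSON line to <logs_dir>/<job_id>/events.ndjson, never raising."""
    try:
        job_dir = os.path.join(logs_dir, job_id)
        os.makedirs(job_dir, exist_ok=True)
        # Serialise first so a bad payload fails before the file is touched.
        line = json.dumps(event, ensure_ascii=False) + "\n"
        with open(os.path.join(job_dir, "events.ndjson"), "a",
                  encoding="utf-8") as fh:
            fcntl.flock(fh, fcntl.LOCK_EX)  # serialise concurrent appenders
            try:
                fh.write(line)
                fh.flush()
            finally:
                fcntl.flock(fh, fcntl.LOCK_UN)
        return True
    except Exception as exc:
        # Audit logging must not break the operation it shadows.
        logger.warning("audit log append failed for %s: %s", job_id, exc)
        return False
```

A call such as append_event("abc12345", {"event": "registered"}) then either records the line or logs a warning, and the caller proceeds in both cases.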
Read them via the CLI:
PY=.venv/bin/python
$PY scripts/registry.py logs <job_id> # pretty timeline
$PY scripts/registry.py logs <job_id> --tail 20 # last 20 events
$PY scripts/registry.py logs <job_id> --json # raw JSON lines
$PY scripts/registry.py logs --list # every job, live status