multi-agent-mux/.agents/skills/tmux-agent-orchestrate-delegate-job/job-protocol.md

Job Event Protocol

The wire contract every tmux-agent-orchestrate-delegate-job agent (claude-code, codex, opencode, human, …) speaks. One job → one MQTT topic → JSON event payloads. Stable across the PoC (public broker) and production (own broker) stages; only transport hardening changes, never the payload shape.

Reference implementation: ./scripts/publish_event.py (emit) and ./scripts/job_subscriber.py (observe).


1. Topic design

| Topic | Purpose |
| --- | --- |
| python/mqtt/sample | Legacy demo topic; never changed (README compat). |
| python/mqtt/jobs/<job_id>/events | Per-job event stream (this protocol). |
  • One topic per job, JSON payload, event field discriminates the type.
  • Single-direction publish only (worker → observer). No request/response.
  • Future split is reserved but not required: <job_id>/events, <job_id>/logs, <job_id>/artifacts.
  • topic_prefix is stored in the job record so publishers resolve the topic from the registry alone (<topic_prefix>/events).
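The topic resolution described above can be sketched as follows. This is a minimal illustration, not the reference implementation: the helper names `default_topic_prefix` and `events_topic` and the dict-shaped job record are assumptions, and the rejection of `/`, `+` and `#` in a job id is an extra precaution that the protocol text does not mandate.

```python
LEGACY_TOPIC = "python/mqtt/sample"  # legacy demo topic, left untouched


def default_topic_prefix(job_id: str) -> str:
    """Build the per-job prefix following the python/mqtt/jobs/<job_id> layout."""
    # '/', '+' and '#' are the MQTT level separator and wildcards; keeping them
    # out of job ids is an assumption of this sketch.
    if not job_id or any(ch in job_id for ch in "/+#"):
        raise ValueError(f"unsafe job_id for a topic: {job_id!r}")
    return f"python/mqtt/jobs/{job_id}"


def events_topic(job_record: dict) -> str:
    """Resolve the event topic from the job record alone (<topic_prefix>/events)."""
    return f"{job_record['topic_prefix'].rstrip('/')}/events"
```

With a record such as `{"job_id": "abc12345", "topic_prefix": default_topic_prefix("abc12345")}`, `events_topic` yields `python/mqtt/jobs/abc12345/events`, so a publisher needs nothing beyond the registry entry.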

2. Payload schema (JSON, UTF-8, schema_version = 1)

{
  "schema_version": 1,
  "seq": 7,
  "job_id": "abc12345",
  "event": "started | permission_required | progress | completed | error",
  "timestamp": "2026-06-19T09:32:00Z",
  "detail": "generalised, whitelisted human-readable string",
  "data": { "optional": "metadata" }
}
| Field | Rule |
| --- | --- |
| schema_version | If publisher/subscriber disagree, the subscriber drops the event with a warning (defensive parsing). |
| seq | Monotonic per job_id, first publish = 1. Lets the subscriber detect reorder/duplication. Persisted in the registry (last_seq) so it survives restarts. |
| job_id | Subscriber drops any event whose job_id it did not subscribe for. |
| timestamp | Publisher host clock, advisory only. The delegator's timeout is measured from receive time, not this field. |
| detail | Generalised text only. No absolute paths, keys, or tokens. |
| data | Optional metadata. Production may add hmac_sig, build_id, etc. |
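The field rules above might translate into code roughly like this. It is a hedged sketch rather than a copy of publish_event.py or job_subscriber.py: the function names, the logger name, and the labels returned by `classify_seq` are all assumptions, and dropping unknown event types is a choice made here that the rules do not spell out.

```python
import json
import logging
from datetime import datetime, timezone
from typing import Optional

log = logging.getLogger("job_events")  # hypothetical logger name
SCHEMA_VERSION = 1
EVENTS = {"started", "permission_required", "progress", "completed", "error"}


def build_event(job_id: str, event: str, seq: int, detail: str,
                data: Optional[dict] = None) -> str:
    """Serialise one event; the caller supplies seq (registry last_seq + 1)."""
    if event not in EVENTS:
        raise ValueError(f"unknown event type: {event!r}")
    payload = {
        "schema_version": SCHEMA_VERSION,
        "seq": seq,
        "job_id": job_id,
        "event": event,
        # Publisher host clock in UTC, advisory only.
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "detail": detail,
    }
    if data:
        payload["data"] = data
    return json.dumps(payload, ensure_ascii=False)


def parse_event(raw: bytes, watched: set) -> Optional[dict]:
    """Return the decoded event, or None if defensive parsing drops it."""
    try:
        msg = json.loads(raw.decode("utf-8"))
    except ValueError:  # covers both bad UTF-8 and bad JSON
        log.warning("dropping undecodable payload")
        return None
    if not isinstance(msg, dict) or msg.get("schema_version") != SCHEMA_VERSION:
        log.warning("dropping event with unexpected schema_version")
        return None
    job_id = msg.get("job_id")
    if not isinstance(job_id, str) or job_id not in watched:
        return None  # not a job this subscriber asked for
    seq = msg.get("seq")
    if isinstance(seq, bool) or not isinstance(seq, int) or seq < 1:
        log.warning("dropping event with invalid seq")
        return None
    if msg.get("event") not in EVENTS:  # assumption: unknown types are dropped
        log.warning("dropping event with unknown type")
        return None
    return msg


def classify_seq(last_seen: int, seq: int) -> str:
    """Compare seq with the last one seen for the job (0 before any event)."""
    if seq == last_seen + 1:
        return "in_order"
    if seq == last_seen:
        return "duplicate"
    return "stale" if seq < last_seen else "gap"
```

Keeping `classify_seq` separate from `parse_event` leaves the reaction to a duplicate or a gap up to the caller, since the rules only say that `seq` makes such cases detectable.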

3. Event catalogue

| event | When emitted | detail example | seq |
| --- | --- | --- | --- |
| started | Agent first picks up the job | "Job a1b2c3d4 started" | 1 |
| permission_required | Agent needs a tool/permission grant | "needs to write sort_problems.md" | as it happens |
| progress | Optional intermediate checkpoint | "creating problem 5/10" | as it happens |
| completed | Successful terminal state | "saved to sort_problems.md" | last |
| error | Failure / exception terminal state | "internal error, see logs" | last |

started and completed/error are mandatory bookends; permission_required and progress are optional. detail must stay on the whitelist of generalised phrasings — never leak secrets through it.
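One way to keep detail on a whitelist is to build it only from fixed templates with validated parameters. The sketch below assumes that approach. The template keys, the `make_detail` helper and the character rule are hypothetical, and the rule is a heuristic that cannot recognise every possible secret.

```python
import re

# Hypothetical whitelist, modelled on the detail examples in the catalogue.
DETAIL_TEMPLATES = {
    "job_started": "Job {job_id} started",
    "needs_write": "needs to write {filename}",
    "progress_count": "creating problem {done}/{total}",
    "saved": "saved to {filename}",
    "internal_error": "internal error, see logs",
}

# Short values without path separators or whitespace; the 32-character cap is
# an assumption meant to keep long token-like strings out.
_SAFE_VALUE = re.compile(r"[A-Za-z0-9_.-]{1,32}")


def make_detail(template_key: str, **params) -> str:
    """Render a whitelisted phrase; reject parameters that look unsafe."""
    template = DETAIL_TEMPLATES[template_key]  # KeyError for unknown phrases
    for name, value in params.items():
        if not _SAFE_VALUE.fullmatch(str(value)):
            raise ValueError(f"unsafe value for {name!r}")
    return template.format(**params)
```

For example, `make_detail("saved", filename="sort_problems.md")` renders the catalogue's example string, while a filename containing a directory separator raises `ValueError` before anything is published.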

Terminal semantics

  • completed → subscriber exits 0; error → exits 1.
  • The subscriber runs a terminal state machine: it finalises a job on the first completed/error it sees and ignores any later terminal event for that job (QoS-1 duplicate, or an error-after-completed reorder). When all watched jobs are finalised it exits.
  • Wall-clock timeout or idle timeout before a terminal event → exit 2.
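The terminal semantics above can be sketched as a small state machine. This is an illustration under assumptions, not the code of job_subscriber.py: the `JobWatcher` class and its method names are hypothetical, and returning 1 when any watched job ends in error is one plausible reading of the exit rule for the multi-job case.

```python
import time

EXIT_OK, EXIT_ERROR, EXIT_TIMEOUT = 0, 1, 2
TERMINAL = {"completed", "error"}


class JobWatcher:
    """Tracks the first terminal event per job plus the two timeouts."""

    def __init__(self, job_ids, total_timeout, idle_timeout, clock=time.monotonic):
        self._clock = clock
        self._start = self._last_activity = clock()
        self._total = total_timeout
        self._idle = idle_timeout
        self.outcome = {job_id: None for job_id in job_ids}

    def on_event(self, job_id, event):
        """Call at receive time; the payload timestamp is never consulted."""
        if job_id not in self.outcome:
            return  # not a watched job
        self._last_activity = self._clock()
        # Only the first terminal event finalises a job; later ones
        # (QoS-1 duplicates, error-after-completed reorders) are ignored.
        if event in TERMINAL and self.outcome[job_id] is None:
            self.outcome[job_id] = event

    def exit_code(self):
        """None while still waiting, otherwise the process exit code."""
        results = list(self.outcome.values())
        if all(result is not None for result in results):
            # Assumption: any failed job makes the whole run exit 1.
            return EXIT_ERROR if "error" in results else EXIT_OK
        now = self._clock()
        if now - self._start > self._total or now - self._last_activity > self._idle:
            return EXIT_TIMEOUT
        return None
```

A subscriber loop would call `on_event` from its message callback and poll `exit_code()` periodically. Injecting `clock` makes the timeout paths testable without sleeping.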

4. Production hardening (own broker stage)

The payload shape is unchanged; the transport and trust model tighten. See mqtt-broker-setup.md for the broker side.

  • Auth / ACL — username/password + per-topic ACL. jobs/+/events publish is granted to the worker credential, subscribe to the Hermes credential.

  • HMAC signature verification (data.hmac_sig): authenticates the publisher and verifies message integrity without sending the raw secret over the wire. Each job record contains a per-job auth_token (secrets.token_urlsafe(32)). The publisher computes an HMAC-SHA256 signature over the serialized payload (excluding data.hmac_sig itself) with auth_token as the key and stores it in data.hmac_sig. The subscriber recomputes the signature and drops any message whose signature is missing or does not match.

    { "...": "...", "data": { "hmac_sig": "d2f3...", "build_id": "42" } }
    
  • TLS — port 8883 + private CA. Toggled with MQTT_TLS=1 (+ MQTT_CA_CERTS); no code change.

  • Retained terminal events: completed/error publish with retain=True so a subscriber that joins late immediately receives the last terminal state instead of a stale view. The reference publisher auto-retains terminal events; --retained forces it for any event.

  • Dual timeouts — total wall-clock budget + last-activity idle detection, both measured from receive time.

  • Clock trust — never trust the payload timestamp for timeout decisions.
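The HMAC step in the list above could look roughly like the following. The protocol text does not fix a canonical serialisation or a signature encoding, so this sketch assumes JSON with sorted keys and compact separators plus a hex digest. The `sign` and `verify` helper names are likewise hypothetical, and both sides would have to agree on the same choices.

```python
import hashlib
import hmac
import json
import secrets


def _canonical(payload: dict) -> bytes:
    """Serialise the payload without data.hmac_sig (assumed canonical form)."""
    unsigned = dict(payload)
    data = dict(unsigned.get("data") or {})
    data.pop("hmac_sig", None)
    if data:
        unsigned["data"] = data
    else:
        # An empty data object is treated like an absent one, so signing a
        # payload that had no data field still verifies afterwards.
        unsigned.pop("data", None)
    return json.dumps(unsigned, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False).encode("utf-8")


def _digest(payload: dict, auth_token: str) -> str:
    return hmac.new(auth_token.encode("utf-8"), _canonical(payload),
                    hashlib.sha256).hexdigest()


def sign(payload: dict, auth_token: str) -> dict:
    """Return a copy of the payload with data.hmac_sig filled in."""
    signed = dict(payload)
    signed["data"] = {**(payload.get("data") or {}),
                      "hmac_sig": _digest(payload, auth_token)}
    return signed


def verify(payload: dict, auth_token: str) -> bool:
    """True only if data.hmac_sig is present and matches the recomputed value."""
    data = payload.get("data")
    if not isinstance(data, dict) or not isinstance(data.get("hmac_sig"), str):
        return False
    expected = _digest(payload, auth_token)
    # Constant-time comparison on bytes avoids leaking the match position.
    return hmac.compare_digest(expected.encode("utf-8"),
                               data["hmac_sig"].encode("utf-8"))


auth_token = secrets.token_urlsafe(32)  # per-job secret kept in the job record
```

A subscriber following this sketch would call `verify` before any other processing and drop the message on `False`. The token itself never appears in the payload; only the digest does.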


5. Why a public broker is PoC-only

On broker.hivemq.com anyone can publish or subscribe to the same topic. Therefore:

  • No secret data in payloads.
  • started/completed/error are signals, never a basis for a security decision.
  • Non-retained messages are not queued for absent subscribers — start the subscriber before the agent (ordering dependency), or rely on retained terminal events in production.
  • Real operational decisions belong to the own-broker stage with auth + ACL.