Job Event Protocol
The wire contract every delegate-job agent (claude-code, codex, opencode, human, …) speaks. One job → one MQTT topic → JSON event payloads. Stable across the PoC (public broker) and production (own broker) stages; only transport hardening changes, never the payload shape.
Reference implementation: ./scripts/publish_event.py (emit) and ./scripts/job_subscriber.py (observe).
1. Topic design
| Topic | Purpose |
|---|---|
| python/mqtt/sample | Legacy demo topic, never changed (README compat). |
| python/mqtt/jobs/<job_id>/events | Per-job event stream (this protocol). |
- One topic per job, JSON payload; the event field discriminates the type.
- Single-direction publish only (worker → observer). No request/response.
- Future split is reserved but not required: <job_id>/events, <job_id>/logs, <job_id>/artifacts.
- topic_prefix is stored in the job record so publishers resolve the topic from the registry alone (<topic_prefix>/events).
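The registry-only topic resolution described above can be sketched as follows. This is a minimal illustration, assuming the job record is a plain dict and that topic_prefix holds python/mqtt/jobs/<job_id>; the function name and the record layout beyond topic_prefix are hypothetical, not taken from the reference scripts.

```python
def events_topic(job_record: dict) -> str:
    """Resolve the per-job events topic from a registry record alone."""
    # Tolerate a trailing slash so the result never contains "//".
    prefix = job_record["topic_prefix"].rstrip("/")
    return f"{prefix}/events"


# Hypothetical registry record; only topic_prefix is named by the protocol.
record = {"job_id": "abc12345", "topic_prefix": "python/mqtt/jobs/abc12345"}
print(events_topic(record))
```

Because the publisher never rebuilds the prefix from job_id, a later move to a different topic root would only need the registry entry to change.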
2. Payload schema (JSON, UTF-8, schema_version = 1)
{
"schema_version": 1,
"seq": 7,
"job_id": "abc12345",
"event": "started | permission_required | progress | completed | error",
"timestamp": "2026-06-19T09:32:00Z",
"detail": "generalised, whitelisted human-readable string",
"data": { "optional": "metadata" }
}
| Field | Rule |
|---|---|
| schema_version | If publisher/subscriber disagree, the subscriber drops the event with a warning (defensive parsing). |
| seq | Monotonic per job_id, first publish = 1. Lets the subscriber detect reorder/duplication. Persisted in the registry (last_seq) so it survives restarts. |
| job_id | Subscriber drops any event whose job_id it did not subscribe for. |
| timestamp | Publisher host clock, advisory only. The delegator's timeout is measured from receive time, not this field. |
| detail | Generalised text only. No absolute paths, keys, or tokens. |
| data | Optional metadata. Production may add auth_token, build_id, etc. |
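The field rules above might translate into code roughly like this. It is a sketch under stated assumptions, not the reference implementation: build_event, parse_event, and seq_status are hypothetical names, the registry is reduced to a last_seq integer passed in by the caller, and tracking a set of seen seq values is one possible way to tell duplicates from reorders.

```python
import json
from datetime import datetime, timezone

SCHEMA_VERSION = 1
EVENTS = {"started", "permission_required", "progress", "completed", "error"}


def build_event(job_id, event, detail, last_seq, data=None):
    """Return (payload_bytes, seq); the caller persists seq as last_seq."""
    if event not in EVENTS:
        raise ValueError(f"unknown event: {event!r}")
    seq = last_seq + 1  # last_seq starts at 0, so the first publish is seq 1
    body = {
        "schema_version": SCHEMA_VERSION,
        "seq": seq,
        "job_id": job_id,
        "event": event,
        # Advisory only: the subscriber must not use this for timeouts.
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "detail": detail,
    }
    if data is not None:
        body["data"] = data
    return json.dumps(body).encode("utf-8"), seq


def parse_event(raw, watched_job_ids):
    """Decode one payload; return None for anything the subscriber should drop."""
    try:
        msg = json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return None
    if not isinstance(msg, dict):
        return None
    if msg.get("schema_version") != SCHEMA_VERSION:
        return None  # version disagreement (a real subscriber would also warn)
    job_id = msg.get("job_id")
    if not isinstance(job_id, str) or job_id not in watched_job_ids:
        return None  # not a job this subscriber asked for
    return msg


def seq_status(seq, seen):
    """Classify seq against the set of seq values already seen for one job."""
    if seq in seen:
        return "duplicate"
    status = "reordered" if seen and seq < max(seen) else "in_order"
    seen.add(seq)
    return status
```

What the subscriber does with a "reordered" or "duplicate" result (log, drop, or accept) is left open here, since the table only requires that it can detect them.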
3. Event catalogue
| event | When emitted | detail example | seq |
|---|---|---|---|
| started | Agent first picks up the job | "Job a1b2c3d4 started" | 1 |
| permission_required | Agent needs a tool/permission grant | "needs to write sort_problems.md" | as it happens |
| progress | Optional intermediate checkpoint | "creating problem 5/10" | as it happens |
| completed | Successful terminal state | "saved to sort_problems.md" | last |
| error | Failure / exception terminal state | "internal error, see logs" | last |
started and completed/error are mandatory bookends; permission_required
and progress are optional. detail must stay on the whitelist of generalised
phrasings — never leak secrets through it.
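As a small illustration of the bookend rule, the check below accepts an ordered list of event names only if it starts with started, ends with exactly one terminal event, and has nothing but optional events in between. The function name is hypothetical and the check is a sketch for tests or log audits, not something the reference scripts are known to contain.

```python
TERMINAL = {"completed", "error"}
OPTIONAL = {"permission_required", "progress"}


def has_valid_bookends(events):
    """True if the ordered event names form: started, optional*, terminal."""
    if len(events) < 2:
        return False  # need at least the two mandatory bookends
    first, *middle, last = events
    return (
        first == "started"
        and last in TERMINAL
        and all(name in OPTIONAL for name in middle)
    )


print(has_valid_bookends(["started", "progress", "completed"]))
```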
Terminal semantics
- completed → subscriber exits 0; error → exits 1.
- The subscriber runs a terminal state machine: it finalises a job on the first completed/error it sees and ignores any later terminal event for that job (QoS-1 duplicate, or an error-after-completed reorder). When all watched jobs are finalised it exits.
- Wall-clock timeout or idle timeout before a terminal event → exit 2.
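The terminal state machine and its exit codes could be sketched as below. This is a minimal sketch with several assumptions: the JobWatcher class and its method names are hypothetical, the clock is injectable so the logic can be exercised without waiting, and when several jobs are watched any error yields exit 1 (the protocol text does not specify how multiple results are combined).

```python
import time

TERMINAL = {"completed", "error"}
EXIT_TIMEOUT = 2


class JobWatcher:
    """Tracks watched jobs and decides when the subscriber should exit."""

    def __init__(self, job_ids, wall_timeout, idle_timeout, now=time.monotonic):
        self._now = now
        # Both timeouts are measured from receive time, never payload timestamps.
        self._start = self._last_activity = now()
        self._wall = wall_timeout
        self._idle = idle_timeout
        self._pending = set(job_ids)
        self._results = {}

    def on_event(self, job_id, event):
        """Record one received event; return True if it finalised the job."""
        if job_id not in self._pending:
            return False  # unwatched or already finalised: later events are ignored
        self._last_activity = self._now()
        if event in TERMINAL:
            self._pending.discard(job_id)
            self._results[job_id] = event  # first terminal event wins
            return True
        return False

    def exit_code(self):
        """Return an exit code once the watcher should stop, else None."""
        if not self._pending:
            # Assumption: any failed job makes the whole run exit 1.
            return 1 if "error" in self._results.values() else 0
        t = self._now()
        if t - self._start > self._wall or t - self._last_activity > self._idle:
            return EXIT_TIMEOUT
        return None
```

A subscriber loop would call on_event from its message callback and poll exit_code periodically, exiting with the returned value as soon as it is not None.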
4. Production hardening (own broker stage)
The payload shape is unchanged; the transport and trust model tighten. See
mqtt-broker-setup.md for the broker side.
- Auth / ACL: username/password + per-topic ACL. jobs/+/events publish is granted to the worker credential, subscribe to the Hermes credential.
- auth_token (the bonus field): each job record carries a per-job auth_token (secrets.token_urlsafe(32)). The publisher copies it into data.auth_token; the subscriber compares it against the registry's expected token and drops mismatches. This is an integrity check on top of the broker ACL, useful while still on a shared/public broker.
  Example: { "...": "...", "data": { "auth_token": "9f3c…", "build_id": "42" } }
- TLS: port 8883 + private CA. Toggled with MQTT_TLS=1 (+ MQTT_CA_CERTS); no code change.
- Retained terminal events: completed/error publish with retain=True so a subscriber that joins late immediately receives the last terminal state instead of a stale view. The reference publisher auto-retains terminal events; --retained forces it for any event.
- Dual timeouts: total wall-clock budget + last-activity idle detection, both measured from receive time.
- Clock trust: never trust the payload timestamp for timeout decisions.
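Three of the hardening points above (the per-job auth_token, the MQTT_TLS toggle, and auto-retained terminal events) can be sketched with the standard library alone. The helper names are hypothetical, the constant-time comparison via hmac.compare_digest is a swapped-in choice (the text only says the subscriber compares tokens), and falling back to port 1883 when TLS is off is an assumption based on the standard unencrypted MQTT port.

```python
import hmac
import os
import secrets


def new_auth_token():
    """Per-job token, generated when the job record is created."""
    return secrets.token_urlsafe(32)


def token_matches(event, expected_token):
    """True only if data.auth_token equals the registry's expected token."""
    data = event.get("data")
    presented = data.get("auth_token") if isinstance(data, dict) else None
    if not isinstance(presented, str):
        return False  # missing or malformed token counts as a mismatch
    # Compare as bytes in constant time (assumption: not mandated by the protocol).
    return hmac.compare_digest(presented.encode("utf-8"), expected_token.encode("utf-8"))


def transport_settings(env=os.environ):
    """Derive port and CA file from the environment; no code change to switch."""
    tls = env.get("MQTT_TLS") == "1"
    return {
        "tls": tls,
        "port": 8883 if tls else 1883,  # 1883 is an assumed non-TLS default
        "ca_certs": env.get("MQTT_CA_CERTS") if tls else None,
    }


def should_retain(event_name, force=False):
    """Terminal events are retained automatically; force mirrors --retained."""
    return force or event_name in ("completed", "error")
```

The resulting settings would then be handed to whichever MQTT client the publisher and subscriber use; that wiring is omitted here because it depends on the client library version.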
5. Why a public broker is PoC-only
On broker.hivemq.com anyone can publish/subscribe the same topic. Therefore:
- No secret data in payloads.
- started/completed/error are signals, never a basis for a security decision.
- Non-retained messages are not queued for absent subscribers: start the subscriber before the agent (ordering dependency), or rely on retained terminal events in production.
- Real operational decisions belong to the own-broker stage with auth + ACL.