refactor: migrate skills/ directory to .agents/skills/
@@ -0,0 +1,114 @@
# Job Event Protocol

The wire contract every tmux-agent-orchestrate-delegate-job agent (claude-code, codex, opencode,
human, …) speaks. One job → one MQTT topic → JSON event payloads. Stable across
the PoC (public broker) and production (own broker) stages; only transport
hardening changes, never the payload shape.

Reference implementation: [`./scripts/publish_event.py`](./scripts/publish_event.py)
(emit) and [`./scripts/job_subscriber.py`](./scripts/job_subscriber.py) (observe).

---

## 1. Topic design

| Topic | Purpose |
|-------|---------|
| `python/mqtt/sample` | Legacy demo topic; **never changed** (README compat). |
| `python/mqtt/jobs/<job_id>/events` | Per-job event stream (this protocol). |

- One topic per job, JSON payload, `event` field discriminates the type.
- Single-direction publish only (worker → observer). No request/response.
- Future split is reserved but not required:
  `<job_id>/events`, `<job_id>/logs`, `<job_id>/artifacts`.
- `topic_prefix` is stored in the job record so publishers resolve the topic
  from the registry alone (`<topic_prefix>/events`).
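
The topic rules above can be sketched in a few lines. This is a minimal illustration, not the reference implementation: the helper names `make_topic_prefix` and `events_topic` are hypothetical, the job record is assumed to be a plain dict with a `topic_prefix` key, and rejecting `/`, `+` and `#` in a job id is an assumption made here because those characters are MQTT topic separators and wildcards.

```python
JOBS_ROOT = "python/mqtt/jobs"  # root of the per-job topics from the table above


def make_topic_prefix(job_id: str) -> str:
    """Build the per-job prefix that would be stored in the job record."""
    # Assumption: job ids must not contain MQTT separators or wildcards.
    if not job_id or any(ch in job_id for ch in "/+#"):
        raise ValueError(f"unsafe job_id for a topic segment: {job_id!r}")
    return f"{JOBS_ROOT}/{job_id}"


def events_topic(job_record: dict) -> str:
    """Resolve the event topic from the registry record alone."""
    return f"{job_record['topic_prefix']}/events"


# Hypothetical job record, shaped only for this example.
record = {"job_id": "abc12345", "topic_prefix": make_topic_prefix("abc12345")}
print(events_topic(record))
```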

---

## 2. Payload schema (JSON, UTF-8, `schema_version = 1`)

```json
{
  "schema_version": 1,
  "seq": 7,
  "job_id": "abc12345",
  "event": "started | permission_required | progress | completed | error",
  "timestamp": "2026-06-19T09:32:00Z",
  "detail": "generalised, whitelisted human-readable string",
  "data": { "optional": "metadata" }
}
```
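
On the publisher side, a payload of this shape could be assembled roughly as follows. This is a sketch under stated assumptions rather than the code in `publish_event.py`: `build_event` and its parameters are hypothetical names, and the caller is assumed to supply `seq` from the registry. The optional `now` argument exists only to make the timestamp reproducible in tests.

```python
import json
from datetime import datetime, timezone

SCHEMA_VERSION = 1
EVENT_TYPES = {"started", "permission_required", "progress", "completed", "error"}


def build_event(job_id, event, detail, seq, data=None, now=None):
    """Serialize one event as a UTF-8 friendly JSON string (hypothetical helper)."""
    if event not in EVENT_TYPES:
        raise ValueError(f"unknown event type: {event!r}")
    if seq < 1:
        raise ValueError("seq starts at 1")
    now = now or datetime.now(timezone.utc)
    payload = {
        "schema_version": SCHEMA_VERSION,
        "seq": seq,
        "job_id": job_id,
        "event": event,
        # Advisory only: consumers must not base timeouts on this field.
        "timestamp": now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "detail": detail,
    }
    if data:
        payload["data"] = data  # optional metadata, omitted when empty
    return json.dumps(payload, ensure_ascii=False)


print(build_event("abc12345", "started", "Job abc12345 started", seq=1))
```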

| Field | Rule |
|-------|------|
| `schema_version` | If publisher and subscriber disagree, the subscriber **drops** the event with a warning (defensive parsing). |
| `seq` | Monotonic **per `job_id`**, first publish = 1. Lets the subscriber detect reordering and duplication. Persisted in the registry (`last_seq`) so it survives restarts. |
| `job_id` | Subscriber drops any event whose `job_id` it did not subscribe for. |
| `timestamp` | Publisher host clock, **advisory only**. The delegator's timeout is measured from *receive* time, not this field. |
| `detail` | Generalised text only. **No absolute paths, keys, or tokens.** |
| `data` | Optional metadata. Production may add `hmac_sig`, `build_id`, etc. |
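
The field rules above suggest a defensive filter on the subscriber side. The sketch below is one possible reading, not the behaviour of `job_subscriber.py`: the `EventFilter` class is hypothetical, and dropping events whose `seq` is not greater than the last accepted one is a policy assumed here for illustration, since the rules only require that reordering and duplication be detectable. Gaps in `seq` are accepted.

```python
import json
import logging

log = logging.getLogger("job_subscriber_sketch")
SCHEMA_VERSION = 1


class EventFilter:
    """Drops events that violate the schema_version, job_id, or seq rules."""

    def __init__(self, watched_job_ids):
        # Highest seq accepted so far per watched job (0 = nothing seen yet).
        self.last_seq = {job_id: 0 for job_id in watched_job_ids}

    def accept(self, raw: bytes):
        """Return the parsed event dict, or None if it must be dropped."""
        try:
            event = json.loads(raw.decode("utf-8"))
        except (UnicodeDecodeError, json.JSONDecodeError):
            log.warning("dropping payload that is not UTF-8 JSON")
            return None
        if not isinstance(event, dict):
            return None
        if event.get("schema_version") != SCHEMA_VERSION:
            log.warning("dropping event with unexpected schema_version")
            return None
        job_id = event.get("job_id")
        if not isinstance(job_id, str) or job_id not in self.last_seq:
            return None  # not a job this subscriber asked for
        seq = event.get("seq")
        if not isinstance(seq, int) or isinstance(seq, bool) or seq < 1:
            return None
        if seq <= self.last_seq[job_id]:
            # Assumed policy: treat stale seq as duplicate or reordered.
            log.warning("dropping duplicate or reordered seq %s", seq)
            return None
        self.last_seq[job_id] = seq
        return event
```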

---

## 3. Event catalogue

| Event | When emitted | `detail` example | `seq` |
|-------|--------------|------------------|-------|
| `started` | Agent first picks up the job | `"Job a1b2c3d4 started"` | 1 |
| `permission_required` | Agent needs a tool/permission grant | `"needs to write sort_problems.md"` | as it happens |
| `progress` | Optional intermediate checkpoint | `"creating problem 5/10"` | as it happens |
| `completed` | Successful terminal state | `"saved to sort_problems.md"` | last |
| `error` | Failure / exception terminal state | `"internal error, see logs"` | last |

`started` and `completed`/`error` are mandatory bookends; `permission_required`
and `progress` are optional. `detail` must stay on the whitelist of generalised
phrasings; never leak secrets through it.

### Terminal semantics

- `completed` → subscriber exits 0; `error` → exits 1.
- The subscriber runs a **terminal state machine**: it finalises a job on the
  first `completed`/`error` it sees and ignores any later terminal event for
  that job (QoS-1 duplicate, or an `error`-after-`completed` reorder). When all
  watched jobs are finalised it exits.
- Wall-clock timeout *or* idle timeout before a terminal event → exit 2.
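
The terminal semantics above might be modelled as a small state tracker. This is a sketch only: `TerminalTracker` is a hypothetical name, times are plain floats that a real subscriber would take from `time.monotonic()` at receive time, and the multi-job exit code (1 if any watched job ended in `error`, else 0) is an assumption, since the rules above define exit codes per terminal event only. Ignoring every later event for an already finalised job is likewise an assumed simplification.

```python
from typing import Optional

TERMINAL_EVENTS = {"completed", "error"}
EXIT_TIMEOUT = 2


class TerminalTracker:
    """First terminal event per job wins; later ones are ignored."""

    def __init__(self, job_ids, started_at, wall_timeout, idle_timeout):
        self.final = {job_id: None for job_id in job_ids}
        self.started_at = started_at
        self.last_activity = started_at
        self.wall_timeout = wall_timeout
        self.idle_timeout = idle_timeout

    def on_event(self, job_id, event, received_at):
        # Unknown jobs and already finalised jobs are ignored entirely.
        if job_id not in self.final or self.final[job_id] is not None:
            return
        self.last_activity = received_at  # receive time, never payload timestamp
        if event in TERMINAL_EVENTS:
            self.final[job_id] = event

    def exit_code(self, now) -> Optional[int]:
        """Return 0, 1 or 2 once the subscriber should exit, else None."""
        states = list(self.final.values())
        if all(state is not None for state in states):
            return 1 if "error" in states else 0  # assumed aggregation rule
        wall_expired = now - self.started_at > self.wall_timeout
        idle_expired = now - self.last_activity > self.idle_timeout
        if wall_expired or idle_expired:
            return EXIT_TIMEOUT
        return None
```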

---

## 4. Production hardening (own broker stage)

The payload shape is unchanged; the transport and trust model tighten. See
[`mqtt-broker-setup.md`](./mqtt-broker-setup.md) for the broker side.

- **Auth / ACL**: username/password + per-topic ACL. Publish on `jobs/+/events`
  is granted to the worker credential, subscribe to the Hermes credential.
- **HMAC signature verification (`data.hmac_sig`)**: authenticates the publisher
  and verifies message integrity without sending the raw secret over the wire.
  Each job record contains a per-job `auth_token` (`secrets.token_urlsafe(32)`).
  The publisher computes an HMAC-SHA256 signature over the serialized payload
  (excluding `data.hmac_sig` itself) with `auth_token` as the key and stores it
  in **`data.hmac_sig`**. The subscriber recomputes the signature and **drops
  any message whose signature is missing or does not match**.

  ```json
  { "...": "...", "data": { "hmac_sig": "d2f3...", "build_id": "42" } }
  ```

- **TLS**: port 8883 + private CA. Toggled with `MQTT_TLS=1` (+ `MQTT_CA_CERTS`);
  no code change.
- **Retained terminal events**: `completed`/`error` publish with `retain=True`
  so a subscriber that joins late immediately receives the last terminal state
  instead of a stale view. The reference publisher auto-retains terminal events;
  `--retained` forces it for any event.
- **Dual timeouts**: total wall-clock budget + last-activity idle detection,
  both measured from receive time.
- **Clock trust**: never trust the payload `timestamp` for timeout decisions.
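
The HMAC step in the list above could look roughly like this. It is a sketch under assumptions, not the reference code: the function names are hypothetical, and the canonical form (sorted keys, compact separators, UTF-8, an empty `data` object treated as absent) and the hex digest encoding are choices made here for illustration, because the text only says the signature covers the serialized payload without `data.hmac_sig`. Publisher and subscriber must agree on whatever canonical form is actually used.

```python
import copy
import hashlib
import hmac
import json
import secrets


def _canonical_bytes(payload: dict) -> bytes:
    """Serialize the payload without data.hmac_sig in an assumed canonical form."""
    unsigned = copy.deepcopy(payload)
    data = unsigned.get("data")
    if isinstance(data, dict):
        data.pop("hmac_sig", None)
        if not data:
            del unsigned["data"]  # empty data is treated the same as no data
    return json.dumps(unsigned, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False).encode("utf-8")


def sign(payload: dict, auth_token: str) -> dict:
    """Return a copy of the payload with data.hmac_sig filled in."""
    signed = copy.deepcopy(payload)
    digest = hmac.new(auth_token.encode("utf-8"), _canonical_bytes(signed),
                      hashlib.sha256).hexdigest()
    signed.setdefault("data", {})["hmac_sig"] = digest
    return signed


def verify(payload: dict, auth_token: str) -> bool:
    """True only if data.hmac_sig is present and matches the recomputed value."""
    data = payload.get("data")
    sig = data.get("hmac_sig") if isinstance(data, dict) else None
    if not isinstance(sig, str):
        return False
    expected = hmac.new(auth_token.encode("utf-8"), _canonical_bytes(payload),
                        hashlib.sha256).hexdigest()
    # Constant-time comparison on bytes, so odd input cannot raise.
    return hmac.compare_digest(sig.encode("utf-8"), expected.encode("ascii"))


token = secrets.token_urlsafe(32)  # per-job auth_token, as in the job record
event = {"schema_version": 1, "seq": 7, "job_id": "abc12345",
         "event": "completed", "detail": "saved to sort_problems.md"}
signed_event = sign(event, token)
```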

---

## 5. Why a public broker is PoC-only

On `broker.hivemq.com` anyone can publish or subscribe to the same topic. Therefore:

- No secret data in payloads.
- `started`/`completed`/`error` are *signals*, never a basis for a security
  decision.
- Non-retained messages are **not queued** for absent subscribers: start the
  subscriber **before** the agent (ordering dependency), or rely on retained
  terminal events in production.
- Real operational decisions belong to the own-broker stage with auth + ACL.