feat(lib): migrate to SQLite WAL backend for robust concurrency (FW-L1)

- Replaces Python fcntl.flock with SQLite BEGIN IMMEDIATE.
- Status/Reconcile read from SQLite SSOT, with YAML fallback.
- Explicitly documented tradeoff: YAML is no longer a real-time view.
- Handles PRAGMA wal_checkpoint(TRUNCATE) safely outside transactions.
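
The write path described in the bullets above can be sketched roughly as follows. This is a minimal illustration using Python's standard `sqlite3` module, not the code in this commit (the actual change lives in `lib.sh`). The table name `sessions` and the helpers `open_db`, `set_state`, and `checkpoint` are hypothetical names chosen for the example.

```python
import sqlite3


def open_db(path):
    """Open the session database in WAL mode (hypothetical schema)."""
    # isolation_level=None disables the module's implicit transactions,
    # so BEGIN/COMMIT below are issued explicitly.
    conn = sqlite3.connect(path, timeout=5.0, isolation_level=None)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sessions ("
        "id TEXT PRIMARY KEY, state TEXT NOT NULL)"
    )
    return conn


def set_state(conn, session_id, state):
    """Update one session under BEGIN IMMEDIATE.

    BEGIN IMMEDIATE takes the write lock up front, so a second writer
    waits (up to the connection timeout) instead of failing mid-transaction.
    """
    conn.execute("BEGIN IMMEDIATE")
    try:
        conn.execute(
            "INSERT OR REPLACE INTO sessions (id, state) VALUES (?, ?)",
            (session_id, state),
        )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise


def checkpoint(conn):
    """Run wal_checkpoint(TRUNCATE) only when no transaction is open."""
    if conn.in_transaction:
        raise RuntimeError("checkpoint must run outside a transaction")
    busy, _log_frames, _checkpointed = conn.execute(
        "PRAGMA wal_checkpoint(TRUNCATE)"
    ).fetchone()
    return busy  # 0 means the checkpoint was not blocked


if __name__ == "__main__":
    import os
    import tempfile

    db_path = os.path.join(tempfile.mkdtemp(), "agent-sessions.db")
    conn = open_db(db_path)
    set_state(conn, "session-1", "running")
    set_state(conn, "session-1", "stopped")
    checkpoint(conn)
    conn.close()
```

Keeping the checkpoint in its own helper, guarded by `in_transaction`, makes the "outside transactions" rule explicit: SQLite refuses to checkpoint from a connection that holds an open write transaction.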
2026-06-21 08:35:01 +00:00
parent 623eef814b
commit 9b797a5c8c
6 changed files with 155 additions and 52 deletions
+4 -3
@@ -306,7 +306,7 @@ graph LR
### 6.1 Limitations
1. **Single-Host File Locking Vulnerability**:
- The advisory locking system relies on `fcntl.flock` on `.hermes/jobs/.lock`. This works perfectly for local processes but is **broken on network filesystems (NFS)** or across multi-host environments where locks may fail or behave non-atomically.
+ The advisory locking system previously relied heavily on `fcntl.flock`. While `agent-sessions.yaml` has been migrated to SQLite WAL to solve concurrent writes, the job metadata in `.hermes/jobs/` still relies on `fcntl.flock` which may behave non-atomically on NFS.
2. **Bearer Token Leakage over Plaintext (Public Broker)**:
The `auth_token` mechanism is a simple plaintext bearer comparison. If the transport layer is unencrypted (e.g., using `broker.hivemq.com` on port `1883`), any eavesdropper on the network can steal the token and spoof legitimate events.
3. **Subscriber Network Drop Orphanage**:
@@ -318,8 +318,9 @@ graph LR
### 6.2 Recommendations
- 1. **Migrate to SQLite WAL Backend**:
- Replace the raw directory lock in `registry.py` and `mqtt_common.py` with a SQLite database configured with Write-Ahead Logging (`PRAGMA journal_mode=WAL`). SQLite handles concurrent reads and serializes writes safely across multi-process applications without blocking.
+ 1. **[Implemented] Migrate to SQLite WAL Backend**:
+ The `agent-sessions.yaml` locking mechanism in `lib.sh` has been upgraded to use a SQLite database (`agent-sessions.db`) configured with Write-Ahead Logging (`PRAGMA journal_mode=WAL`). The YAML file is now only updated as a finalized archive when a session reaches a terminal state (`stopped`, `terminated`, `archived`), eliminating `flock` contention during active session updates.
+ **Architecture Decision Note**: This means `agent-sessions.yaml` is **no longer a real-time view** of currently `running` sessions. We have explicitly accepted the trade-off of giving up real-time text readability of running sessions in favor of robust concurrency and removing the `flock` dependency for session state. Note that SQLite WAL mode itself requires all processes to run on the same host, so `agent-sessions.db` should live on a local filesystem rather than NFS. Tooling and status checks must now query the SQLite DB to observe live `running` states.
2. **Implement Signature-Based Payload Verification**:
Rather than sending a plaintext token, utilize HMAC signatures. The delegator and worker share a secret key; the worker publishes a signature of the payload (e.g. `HMAC-SHA256(secret_key, payload_bytes)`). The subscriber validates the signature, preventing token interception.
3. **Enforce Mandatory Broker-Side TLS and ACLs**: