feat(lib): migrate to SQLite WAL backend for robust concurrency (FW-L1)

- Replaces python fcntl.flock with SQLite BEGIN IMMEDIATE.
- Status/Reconcile read from SQLite SSOT, with YAML fallback.
- Explicitly documented tradeoff: YAML is no longer a real-time view.
- Handles PRAGMA wal_checkpoint(TRUNCATE) safely outside transactions.
This commit is contained in:
2026-06-21 08:35:01 +00:00
parent 623eef814b
commit 9b797a5c8c
6 changed files with 155 additions and 52 deletions
+1
View File
@@ -35,6 +35,7 @@
| FW-14 | REPORT.md -> Messaging_System_REPORT.md git rename 정규화 | `9334352` | Hermes 직접 | git mv로 정규화 |
| FW-15 | monitor --subscribe 보안 경고 문서화 (SKILL.md Security 섹션) | `7d925de` | agy-new | Hermes spec 검토 PASS |
| FW-16 | 세션 상태 vs 잡 상태 도메인 분리 (glossary) | `155c6e8` | Hermes 직접 | FW-10과 동일 커밋 |
| FW-L1 | SQLite WAL 도입 및 YAML 최종 스냅샷 분리 | (미커밋) | Hermes 직접 | SQLite DB 런타임 갱신, 세션 종료 시 YAML 덤프 구현 |
---
+6 -6
View File
@@ -8,12 +8,11 @@
## 1. 장기 과제 (근본적 구조 변경)
### FW-L1. SQLite WAL 마이그레이션 (FW-02 장기 후속)
- **상태**: FW-02 단기 대응(NFS 경고) 완료. 장기 해결 미진행.
- **제**: `atomic_dump_yaml`의 fcntl.flock이 NFS/NAS 환경에서 무시됨. 현재는 WARNING 로그만 출력.
- **해결 방안**: YAML 레지스트리를 SQLite WAL(Write-Ahead Logging) 백엔드로 마이그레이션.
- **작업량**: 대 (Large) — 데이터 레이어 전면 교체
- **우선순위**: NFS 환경 배포 시 필수, 로컬 단일 환경에서는 낮음
### FW-L3. SQLite 테이블 정규화 (FW-L1 후속)
- **상태**: 대기
- **제**: 현재 `.db`에는 전체 JSON 상태를 하나의 `data TEXT` 컬럼에 덤프하고 있음. 이를 `CREATE TABLE sessions (name TEXT PRIMARY KEY, status TEXT, pane_cwd TEXT, data JSON)` 형태로 정규화하면 O(1) 수준의 상태 조회가 가능해짐.
- **주의**: 현재 상태 조회 스크립트(`status.sh`, `reconcile.sh`) 역시 `SELECT data` 후 Python 단에서 전체 JSON을 파싱하는 구조이므로, O(1) 이점을 누리기 위해서는 이 조회 스크립트들도 per-column 쿼리(예: `SELECT status FROM sessions WHERE name=?`)로 함께 변경해야 함.
### FW-L2. stop 옵션 시맨틱 Step 2 (FW-03/FW-13 후속)
- **상태**: Step 1(디렉터리/식별자 rename) + frontmatter/산문 재작성 완료. Step 2 미진행.
@@ -78,3 +77,4 @@
|---|---|
| 2026-06-21 | 초기 작성 — 3개 에이전트 분석 결과 (FW-01~FW-16) |
| 2026-06-21 | FW-01~FW-16 전부 완료 -> DONE.md로 이동. 본 파일은 신규 발견 항목(FW-N1~N4) + 장기 과제(FW-L1~L2)만 남김. |
| 2026-06-21 | FW-L1 구현 완료 (사용자 피드백 재수용: 런타임은 SQLite DB, 종료 시에만 YAML 스냅샷 덤프). 항목 DONE.md로 이동. |
+4 -3
View File
@@ -306,7 +306,7 @@ graph LR
### 6.1 Limitations
1. **Single-Host File Locking Vulnerability**:
The advisory locking system relies on `fcntl.flock` on `.hermes/jobs/.lock`. This works perfectly for local processes but is **broken on network filesystems (NFS)** or across multi-host environments where locks may fail or behave non-atomically.
The advisory locking system previously relied heavily on `fcntl.flock`. While `agent-sessions.yaml` has been migrated to SQLite WAL to solve concurrent writes, the job metadata in `.hermes/jobs/` still relies on `fcntl.flock` which may behave non-atomically on NFS.
2. **Bearer Token Leakage over Plaintext (Public Broker)**:
The `auth_token` mechanism is a simple plaintext bearer comparison. If the transport layer is unencrypted (e.g., using `broker.hivemq.com` on port `1883`), any eavesdropper on the network can steal the token and spoof legitimate events.
3. **Subscriber Network Drop Orphanage**:
@@ -318,8 +318,9 @@ graph LR
### 6.2 Recommendations
1. **Migrate to SQLite WAL Backend**:
Replace the raw directory lock in `registry.py` and `mqtt_common.py` with a SQLite database configured with Write-Ahead Logging (`PRAGMA journal_mode=WAL`). SQLite handles concurrent reads and serializes writes safely across multi-process applications without blocking.
1. **[Implemented] Migrate to SQLite WAL Backend**:
The `agent-sessions.yaml` locking mechanism in `lib.sh` has been upgraded to use a SQLite database (`agent-sessions.db`) configured with Write-Ahead Logging (`PRAGMA journal_mode=WAL`). The YAML file is now only updated as a finalized archive when a session reaches a terminal state (`stopped`, `terminated`, `archived`), eliminating `flock` contention during active session updates.
**Architecture Decision Note**: This means `agent-sessions.yaml` is **no longer a real-time view** of currently `running` sessions. We have explicitly accepted the trade-off of giving up real-time text readability of running sessions in favor of robust concurrency and solving NFS flock limits. Tooling and status checks must now query the SQLite DB to observe live `running` states.
2. **Implement Signature-Based Payload Verification**:
Rather than sending a plaintext token, utilize HMAC signatures. The delegator and worker share a secret key; the worker publishes a signature of the payload (e.g. `HMAC-SHA256(secret_key, payload_bytes)`). The subscriber validates the signature, preventing token interception.
3. **Enforce Mandatory Broker-Side TLS and ACLs**:
+91 -14
View File
@@ -109,11 +109,18 @@ tmux() {
resolve_tmux_server() {
local session_name="$1"
SESSION_NAME="$session_name" env_python "$AGENT_SESSIONS_YAML" <<'PYEOF'
import os, sys, yaml
import os, sys, sqlite3, json, yaml
name = os.environ['SESSION_NAME']
yaml_path = os.environ['YAML_PATH']
if os.path.exists(yaml_path):
db_path = yaml_path.replace('.yaml', '.db')
d = {}
try:
if os.path.exists(db_path):
conn = sqlite3.connect(db_path, timeout=10.0)
row = conn.execute('SELECT data FROM state WHERE id=1').fetchone()
if row: d = json.loads(row[0])
conn.close()
elif os.path.exists(yaml_path):
with open(yaml_path) as f:
d = yaml.safe_load(f) or {}
for s in d.get('tmux_sessions', []):
@@ -207,7 +214,6 @@ _atomic_dump_yaml_check_nfs() {
atomic_dump_yaml() {
local yaml_path="$1"; shift
_atomic_dump_yaml_check_nfs "$yaml_path"
local -a envs=("YAML_PATH=$yaml_path" "HOME_DIR=$HOME_DIR" "CLAUDE_PROJECT_DIR=$CLAUDE_PROJECT_DIR" "LOCAL_BIN=$LOCAL_BIN")
while [ $# -gt 0 ]; do
case "$1" in
@@ -217,13 +223,12 @@ atomic_dump_yaml() {
done
local mutation; mutation="$(cat)"
env "${envs[@]}" AGENT_SESSIONS_MUTATION="$mutation" python3 - <<'PYEOF'
import os, sys, fcntl, tempfile, shutil, glob, subprocess, json
import os, sys, tempfile, shutil, glob, subprocess, json, sqlite3
from datetime import datetime, timezone
import yaml
yaml_path = os.environ['YAML_PATH']
lock_path = yaml_path + '.lock'
db_path = yaml_path.replace('.yaml', '.db')
def _validate(d):
if not isinstance(d, dict):
@@ -242,21 +247,66 @@ def _validate(d):
if not isinstance(s.get('pane'), dict):
raise SystemExit(f"VALIDATE: tmux_sessions[{i}] {s.get('name')!r} missing pane")
def get_terminal_set(d):
return {s.get('name'): s.get('status') for s in d.get('tmux_sessions', []) if s.get('status') in ('stopped', 'terminated', 'archived')}
lock_fh = open(lock_path, 'w')
fcntl.flock(lock_fh, fcntl.LOCK_EX)
os.makedirs(os.path.dirname(db_path) or '.', exist_ok=True)
conn = sqlite3.connect(db_path, timeout=60.0)
for f in [db_path, db_path + '-wal', db_path + '-shm']:
if os.path.exists(f):
try:
os.chmod(f, 0o600)
except Exception:
pass
def is_nfs(path):
try:
df_out = subprocess.check_output(['df', '--output=target', path], text=True, stderr=subprocess.DEVNULL)
target = df_out.strip().split('\n')[-1].strip()
mount_out = subprocess.check_output(['mount'], text=True)
for line in mount_out.split('\n'):
if f" on {target} " in line and (' type nfs ' in line or ' type cifs ' in line or ' fuse.sshfs ' in line):
return True
except Exception:
pass
return False
if is_nfs(os.path.dirname(db_path) or '.'):
conn.execute('PRAGMA journal_mode=DELETE')
else:
conn.execute('PRAGMA journal_mode=WAL')
try:
# Disable auto-commit by explicitly starting a transaction with BEGIN IMMEDIATE
# This prevents the read-modify-write lost update race condition.
conn.execute('BEGIN IMMEDIATE')
conn.execute('CREATE TABLE IF NOT EXISTS state (id INTEGER PRIMARY KEY, data TEXT)')
row = conn.execute('SELECT data FROM state WHERE id=1').fetchone()
if row:
d = json.loads(row[0])
else:
# Seed from YAML
if os.path.exists(yaml_path):
with open(yaml_path) as f:
d = yaml.safe_load(f) or {}
else:
d = {}
conn.execute('INSERT INTO state (id, data) VALUES (1, ?)', (json.dumps(d),))
old_terminals = get_terminal_set(d)
# --- caller mutation (module scope: sees d, yaml, os, glob, subprocess) ---
exec(compile(os.environ['AGENT_SESSIONS_MUTATION'], '<mutation>', 'exec'), globals())
_validate(d)
conn.execute('REPLACE INTO state (id, data) VALUES (1, ?)', (json.dumps(d),))
new_terminals = get_terminal_set(d)
# Write to YAML ONLY when a session transitions to a finished state
if new_terminals != old_terminals:
if os.path.exists(yaml_path):
try:
shutil.copy2(yaml_path, yaml_path + '.bak')
@@ -274,9 +324,18 @@ try:
if os.path.exists(tmp):
os.remove(tmp)
raise
conn.commit()
if new_terminals != old_terminals:
try:
conn.execute('PRAGMA wal_checkpoint(TRUNCATE)')
except Exception:
pass
except Exception:
conn.rollback()
raise
finally:
fcntl.flock(lock_fh, fcntl.LOCK_UN)
lock_fh.close()
conn.close()
PYEOF
}
@@ -298,19 +357,28 @@ find_workspace_uuid() {
local workspace="$1" agent="$2"
local abs; abs="$(cd "$workspace" 2>/dev/null && pwd)" || abs="$workspace"
WS_ABS="$abs" AGENT="$agent" env_python "$AGENT_SESSIONS_YAML" <<'PYEOF'
import os, json, glob
import os, json, glob, sqlite3
import yaml
ws = os.environ['WS_ABS']
agent = os.environ['AGENT']
home = os.environ['HOME_DIR']
yaml_path = os.environ['YAML_PATH']
db_path = yaml_path.replace('.yaml', '.db')
claude_project_dir = os.environ.get('CLAUDE_PROJECT_DIR', f"{home}/.claude/projects")
d = {}
if os.path.exists(yaml_path):
try:
if os.path.exists(db_path):
conn = sqlite3.connect(db_path, timeout=10.0)
row = conn.execute('SELECT data FROM state WHERE id=1').fetchone()
if row: d = json.loads(row[0])
conn.close()
elif os.path.exists(yaml_path):
with open(yaml_path) as f:
d = yaml.safe_load(f) or {}
except Exception:
pass
def jsonl_exists(uuid):
@@ -412,13 +480,22 @@ capture_conversation_id() {
is_already_stopped() {
local session_name="$1"
SESSION_NAME="$session_name" env_python "$AGENT_SESSIONS_YAML" <<'PYEOF'
import os, yaml
import os, yaml, sqlite3, json
name = os.environ['SESSION_NAME']
yaml_path = os.environ['YAML_PATH']
db_path = yaml_path.replace('.yaml', '.db')
d = {}
if os.path.exists(yaml_path):
try:
if os.path.exists(db_path):
conn = sqlite3.connect(db_path, timeout=10.0)
row = conn.execute('SELECT data FROM state WHERE id=1').fetchone()
if row: d = json.loads(row[0])
conn.close()
elif os.path.exists(yaml_path):
with open(yaml_path) as f:
d = yaml.safe_load(f) or {}
except Exception:
pass
for s in d.get('tmux_sessions', []):
if s.get('name') == name and s.get('status') == 'stopped':
print(f"stopped_at={s.get('stopped_at', '?')}")
@@ -237,8 +237,20 @@ now_iso = datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')
try:
d
except NameError:
import sqlite3
db_path = yaml_path.replace('.yaml', '.db')
d = {}
try:
if os.path.exists(db_path):
conn = sqlite3.connect(db_path, timeout=10.0)
row = conn.execute('SELECT data FROM state WHERE id=1').fetchone()
if row: d = json.loads(row[0])
conn.close()
elif os.path.exists(yaml_path):
with open(yaml_path) as f:
d = yaml.safe_load(f) or {}
except Exception:
pass
drifts = []
actions = []
@@ -37,8 +37,20 @@ home = os.environ['HOME_DIR']
claude_project_dir = os.environ.get('CLAUDE_PROJECT_DIR', f"{home}/.claude/projects")
drift = json.loads(os.environ['DRIFT_JSON'])
db_path = yaml_path.replace('.yaml', '.db')
d = {}
import sqlite3
try:
if os.path.exists(db_path):
conn = sqlite3.connect(db_path, timeout=10.0)
row = conn.execute('SELECT data FROM state WHERE id=1').fetchone()
if row: d = json.loads(row[0])
conn.close()
elif os.path.exists(yaml_path):
with open(yaml_path) as f:
d = yaml.safe_load(f) or {}
except Exception:
pass
alive = set(drift.get('tmux_sessions_alive', []))
drift_by_name = {}