feat(lib): migrate to SQLite WAL backend for robust concurrency (FW-L1)

- Replaces python fcntl.flock with SQLite BEGIN IMMEDIATE.
- Status/Reconcile read from SQLite SSOT, with YAML fallback.
- Explicitly documented tradeoff: YAML is no longer a real-time view.
- Handles PRAGMA wal_checkpoint(TRUNCATE) safely outside transactions.
2026-06-21 08:35:01 +00:00
parent 623eef814b
commit 9b797a5c8c
6 changed files with 155 additions and 52 deletions
+1
@@ -35,6 +35,7 @@
 | FW-14 | Normalize the REPORT.md -> Messaging_System_REPORT.md git rename | `9334352` | Hermes (direct) | Normalized with git mv |
 | FW-15 | Document the monitor --subscribe security warning (SKILL.md Security section) | `7d925de` | agy-new | Hermes spec review PASS |
 | FW-16 | Separate the session-state and job-state domains (glossary) | `155c6e8` | Hermes (direct) | Same commit as FW-10 |
+| FW-L1 | Introduce SQLite WAL and split off the final YAML snapshot | (uncommitted) | Hermes (direct) | SQLite DB updated at runtime; YAML dump on session end implemented |
 ---
+6 -6
@@ -8,12 +8,11 @@
 ## 1. Long-term tasks (fundamental structural changes)
-### FW-L1. SQLite WAL migration (long-term follow-up to FW-02)
-- **Status**: FW-02 short-term mitigation (NFS warning) done. Long-term fix not started.
-- **Problem**: The `fcntl.flock` in `atomic_dump_yaml` is ignored in NFS/NAS environments. Currently only a WARNING log is emitted.
-- **Solution**: Migrate the YAML registry to a SQLite WAL (Write-Ahead Logging) backend.
-- **Effort**: Large (full replacement of the data layer)
-- **Priority**: Required when deploying to NFS environments; low for a local single-host environment
+### FW-L3. SQLite table normalization (follow-up to FW-L1)
+- **Status**: Pending
+- **Task**: The `.db` currently dumps the entire JSON state into a single `data TEXT` column. Normalizing it into `CREATE TABLE sessions (name TEXT PRIMARY KEY, status TEXT, pane_cwd TEXT, data JSON)` would allow roughly O(1) status lookups.
+- **Caution**: The status scripts (`status.sh`, `reconcile.sh`) also run `SELECT data` and then parse the whole JSON in Python, so to benefit from the O(1) lookups they must be changed to per-column queries (e.g. `SELECT status FROM sessions WHERE name=?`) at the same time.
 ### FW-L2. stop option semantics, Step 2 (follow-up to FW-03/FW-13)
 - **Status**: Step 1 (directory/identifier rename) + frontmatter/prose rewrite done. Step 2 not started.
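Reviewer note on FW-L3: the per-column lookup described in the new TODO entry could look roughly like the sketch below. It reuses the `sessions` schema quoted in the entry; the helper names `upsert_session` and `get_status`, the in-memory database, and the `pane["cwd"]` key are assumptions made for illustration, not code from this commit.

```python
import json
import sqlite3

# Hypothetical in-memory database; the real file would be agent-sessions.db.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sessions ("
    "name TEXT PRIMARY KEY, status TEXT, pane_cwd TEXT, data JSON)"
)


def upsert_session(conn, session):
    """Store one session row; the full record stays in `data` as JSON text."""
    conn.execute(
        "REPLACE INTO sessions (name, status, pane_cwd, data) VALUES (?, ?, ?, ?)",
        (
            session["name"],
            session.get("status"),
            session.get("pane", {}).get("cwd"),  # assumed key inside `pane`
            json.dumps(session),
        ),
    )


def get_status(conn, name):
    """Per-column lookup by primary key; no parsing of the whole registry."""
    row = conn.execute(
        "SELECT status FROM sessions WHERE name=?", (name,)
    ).fetchone()
    return row[0] if row else None


upsert_session(conn, {"name": "demo", "status": "running", "pane": {"cwd": "/tmp"}})
upsert_session(conn, {"name": "demo", "status": "stopped", "pane": {"cwd": "/tmp"}})
conn.commit()
```

Because `name` is the primary key, `REPLACE` overwrites the existing row instead of appending a second one, so a status change touches a single row rather than rewriting the whole JSON blob.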
@@ -78,3 +77,4 @@
 |---|---|
 | 2026-06-21 | Initial draft: analysis results from three agents (FW-01~FW-16) |
 | 2026-06-21 | FW-01~FW-16 all completed -> moved to DONE.md. This file keeps only newly discovered items (FW-N1~N4) and long-term tasks (FW-L1~L2). |
+| 2026-06-21 | FW-L1 implemented (user feedback re-incorporated: SQLite DB at runtime, YAML snapshot dumped only on termination). Item moved to DONE.md. |
+4 -3
@@ -306,7 +306,7 @@ graph LR
 ### 6.1 Limitations
 1. **Single-Host File Locking Vulnerability**:
-The advisory locking system relies on `fcntl.flock` on `.hermes/jobs/.lock`. This works perfectly for local processes but is **broken on network filesystems (NFS)** or across multi-host environments where locks may fail or behave non-atomically.
+The advisory locking system previously relied heavily on `fcntl.flock`. While `agent-sessions.yaml` has been migrated to SQLite WAL to solve concurrent writes, the job metadata in `.hermes/jobs/` still relies on `fcntl.flock` which may behave non-atomically on NFS.
 2. **Bearer Token Leakage over Plaintext (Public Broker)**:
 The `auth_token` mechanism is a simple plaintext bearer comparison. If the transport layer is unencrypted (e.g., using `broker.hivemq.com` on port `1883`), any eavesdropper on the network can steal the token and spoof legitimate events.
 3. **Subscriber Network Drop Orphanage**:
@@ -318,8 +318,9 @@ graph LR
 ### 6.2 Recommendations
-1. **Migrate to SQLite WAL Backend**:
-Replace the raw directory lock in `registry.py` and `mqtt_common.py` with a SQLite database configured with Write-Ahead Logging (`PRAGMA journal_mode=WAL`). SQLite handles concurrent reads and serializes writes safely across multi-process applications without blocking.
+1. **[Implemented] Migrate to SQLite WAL Backend**:
+The `agent-sessions.yaml` locking mechanism in `lib.sh` has been upgraded to use a SQLite database (`agent-sessions.db`) configured with Write-Ahead Logging (`PRAGMA journal_mode=WAL`). The YAML file is now only updated as a finalized archive when a session reaches a terminal state (`stopped`, `terminated`, `archived`), eliminating `flock` contention during active session updates.
+**Architecture Decision Note**: This means `agent-sessions.yaml` is **no longer a real-time view** of currently `running` sessions. We have explicitly accepted the trade-off of giving up real-time text readability of running sessions in favor of robust concurrency and solving NFS flock limits. Tooling and status checks must now query the SQLite DB to observe live `running` states.
 2. **Implement Signature-Based Payload Verification**:
 Rather than sending a plaintext token, utilize HMAC signatures. The delegator and worker share a secret key; the worker publishes a signature of the payload (e.g. `HMAC-SHA256(secret_key, payload_bytes)`). The subscriber validates the signature, preventing token interception.
 3. **Enforce Mandatory Broker-Side TLS and ACLs**:
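Reviewer note on the WAL claim in 6.2: `PRAGMA journal_mode=WAL` returns the journal mode that is actually in effect, so a caller can check that WAL was accepted rather than assume it. The sketch below also shows `PRAGMA wal_checkpoint(TRUNCATE)` issued after the commit, outside any transaction, as the commit message describes. The helper names `open_registry` and `checkpoint_truncate` and the temporary directory are assumptions for illustration only.

```python
import os
import sqlite3
import tempfile


def open_registry(db_path):
    """Open the DB, request WAL, and report the journal mode actually in effect."""
    conn = sqlite3.connect(db_path, timeout=10.0)
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    return conn, mode


def checkpoint_truncate(conn):
    """Run outside any open transaction; returns (busy, log_frames, checkpointed)."""
    return conn.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchone()


# Hypothetical location; the real DB sits next to agent-sessions.yaml.
tmp_dir = tempfile.mkdtemp()
db_path = os.path.join(tmp_dir, "agent-sessions.db")

conn, mode = open_registry(db_path)
conn.execute("CREATE TABLE IF NOT EXISTS state (id INTEGER PRIMARY KEY, data TEXT)")
conn.execute("REPLACE INTO state (id, data) VALUES (1, ?)", ("{}",))
conn.commit()  # the checkpoint must come after the commit

busy, log_frames, checkpointed = checkpoint_truncate(conn)
wal_path = db_path + "-wal"
```

On a local filesystem `mode` is expected to be `wal`. A non-zero `busy` value means another connection blocked the checkpoint, which is why the diff treats a failed checkpoint as non-fatal.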
+91 -14
@@ -109,11 +109,18 @@ tmux() {
 resolve_tmux_server() {
   local session_name="$1"
   SESSION_NAME="$session_name" env_python "$AGENT_SESSIONS_YAML" <<'PYEOF'
-import os, sys, yaml
+import os, sys, sqlite3, json, yaml
 name = os.environ['SESSION_NAME']
 yaml_path = os.environ['YAML_PATH']
-if os.path.exists(yaml_path):
+db_path = yaml_path.replace('.yaml', '.db')
+d = {}
 try:
+    if os.path.exists(db_path):
+        conn = sqlite3.connect(db_path, timeout=10.0)
+        row = conn.execute('SELECT data FROM state WHERE id=1').fetchone()
+        if row: d = json.loads(row[0])
+        conn.close()
+    elif os.path.exists(yaml_path):
         with open(yaml_path) as f:
             d = yaml.safe_load(f) or {}
     for s in d.get('tmux_sessions', []):
@@ -207,7 +214,6 @@ _atomic_dump_yaml_check_nfs() {
 atomic_dump_yaml() {
   local yaml_path="$1"; shift
-  _atomic_dump_yaml_check_nfs "$yaml_path"
   local -a envs=("YAML_PATH=$yaml_path" "HOME_DIR=$HOME_DIR" "CLAUDE_PROJECT_DIR=$CLAUDE_PROJECT_DIR" "LOCAL_BIN=$LOCAL_BIN")
   while [ $# -gt 0 ]; do
     case "$1" in
@@ -217,13 +223,12 @@ atomic_dump_yaml() {
   done
   local mutation; mutation="$(cat)"
   env "${envs[@]}" AGENT_SESSIONS_MUTATION="$mutation" python3 - <<'PYEOF'
-import os, sys, fcntl, tempfile, shutil, glob, subprocess, json
+import os, sys, tempfile, shutil, glob, subprocess, json, sqlite3
 from datetime import datetime, timezone
 import yaml
 yaml_path = os.environ['YAML_PATH']
-lock_path = yaml_path + '.lock'
+db_path = yaml_path.replace('.yaml', '.db')
 def _validate(d):
     if not isinstance(d, dict):
@@ -242,21 +247,66 @@ def _validate(d):
         if not isinstance(s.get('pane'), dict):
             raise SystemExit(f"VALIDATE: tmux_sessions[{i}] {s.get('name')!r} missing pane")
+def get_terminal_set(d):
+    return {s.get('name'): s.get('status') for s in d.get('tmux_sessions', []) if s.get('status') in ('stopped', 'terminated', 'archived')}
-lock_fh = open(lock_path, 'w')
-fcntl.flock(lock_fh, fcntl.LOCK_EX)
-try:
+os.makedirs(os.path.dirname(db_path) or '.', exist_ok=True)
+conn = sqlite3.connect(db_path, timeout=60.0)
+for f in [db_path, db_path + '-wal', db_path + '-shm']:
+    if os.path.exists(f):
+        try:
+            os.chmod(f, 0o600)
+        except Exception:
+            pass
+def is_nfs(path):
+    try:
+        df_out = subprocess.check_output(['df', '--output=target', path], text=True, stderr=subprocess.DEVNULL)
+        target = df_out.strip().split('\n')[-1].strip()
+        mount_out = subprocess.check_output(['mount'], text=True)
+        for line in mount_out.split('\n'):
+            if f" on {target} " in line and (' type nfs ' in line or ' type cifs ' in line or ' fuse.sshfs ' in line):
+                return True
+    except Exception:
+        pass
+    return False
+if is_nfs(os.path.dirname(db_path) or '.'):
+    conn.execute('PRAGMA journal_mode=DELETE')
+else:
+    conn.execute('PRAGMA journal_mode=WAL')
+try:
+    # Disable auto-commit by explicitly starting a transaction with BEGIN IMMEDIATE
+    # This prevents the read-modify-write lost update race condition.
+    conn.execute('BEGIN IMMEDIATE')
+    conn.execute('CREATE TABLE IF NOT EXISTS state (id INTEGER PRIMARY KEY, data TEXT)')
+    row = conn.execute('SELECT data FROM state WHERE id=1').fetchone()
+    if row:
+        d = json.loads(row[0])
+    else:
+        # Seed from YAML
         if os.path.exists(yaml_path):
             with open(yaml_path) as f:
                 d = yaml.safe_load(f) or {}
         else:
             d = {}
+        conn.execute('INSERT INTO state (id, data) VALUES (1, ?)', (json.dumps(d),))
+    old_terminals = get_terminal_set(d)
     # --- caller mutation (module scope: sees d, yaml, os, glob, subprocess) ---
     exec(compile(os.environ['AGENT_SESSIONS_MUTATION'], '<mutation>', 'exec'), globals())
     _validate(d)
+    conn.execute('REPLACE INTO state (id, data) VALUES (1, ?)', (json.dumps(d),))
+    new_terminals = get_terminal_set(d)
+    # Write to YAML ONLY when a session transitions to a finished state
+    if new_terminals != old_terminals:
         if os.path.exists(yaml_path):
             try:
                 shutil.copy2(yaml_path, yaml_path + '.bak')
@@ -274,9 +324,18 @@ try:
             if os.path.exists(tmp):
                 os.remove(tmp)
             raise
+    conn.commit()
+    if new_terminals != old_terminals:
+        try:
+            conn.execute('PRAGMA wal_checkpoint(TRUNCATE)')
+        except Exception:
+            pass
+except Exception:
+    conn.rollback()
+    raise
 finally:
-    fcntl.flock(lock_fh, fcntl.LOCK_UN)
-    lock_fh.close()
+    conn.close()
 PYEOF
 }
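Reviewer note on the `atomic_dump_yaml` hunks: the read-modify-write pattern they introduce can be shown in isolation as below. This is a minimal sketch, not the code in `lib.sh`: `mutate_state` and `add_session` are hypothetical names, the connection uses `isolation_level=None` so that the explicit `BEGIN IMMEDIATE` is the only transaction control in play, and the handler catches `BaseException` on the assumption that a validator may raise `SystemExit`, as `_validate` does.

```python
import json
import os
import sqlite3
import tempfile


def mutate_state(db_path, mutate):
    """Read-modify-write the single state row while holding the write lock."""
    # isolation_level=None: the sqlite3 module issues no implicit BEGIN,
    # so the explicit BEGIN IMMEDIATE below fully controls the transaction.
    conn = sqlite3.connect(db_path, timeout=60.0, isolation_level=None)
    try:
        conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
        conn.execute(
            "CREATE TABLE IF NOT EXISTS state (id INTEGER PRIMARY KEY, data TEXT)"
        )
        row = conn.execute("SELECT data FROM state WHERE id=1").fetchone()
        state = json.loads(row[0]) if row else {}
        mutate(state)
        conn.execute(
            "REPLACE INTO state (id, data) VALUES (1, ?)", (json.dumps(state),)
        )
        conn.execute("COMMIT")
        return state
    except BaseException:
        # BaseException so that SystemExit from a validator also rolls back.
        if conn.in_transaction:
            conn.execute("ROLLBACK")
        raise
    finally:
        conn.close()


def add_session(state):
    """Hypothetical mutation: append one running session."""
    state.setdefault("tmux_sessions", []).append(
        {"name": "demo", "status": "running"}
    )


db_path = os.path.join(tempfile.mkdtemp(), "agent-sessions.db")
mutate_state(db_path, add_session)
final = mutate_state(db_path, add_session)
```

Taking the write lock before the `SELECT` is what closes the lost-update window: a second writer waits (up to `timeout`) at `BEGIN IMMEDIATE` and only then reads the row, so it sees the first writer's committed result.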
@@ -298,19 +357,28 @@ find_workspace_uuid() {
   local workspace="$1" agent="$2"
   local abs; abs="$(cd "$workspace" 2>/dev/null && pwd)" || abs="$workspace"
   WS_ABS="$abs" AGENT="$agent" env_python "$AGENT_SESSIONS_YAML" <<'PYEOF'
-import os, json, glob
+import os, json, glob, sqlite3
 import yaml
 ws = os.environ['WS_ABS']
 agent = os.environ['AGENT']
 home = os.environ['HOME_DIR']
 yaml_path = os.environ['YAML_PATH']
+db_path = yaml_path.replace('.yaml', '.db')
 claude_project_dir = os.environ.get('CLAUDE_PROJECT_DIR', f"{home}/.claude/projects")
 d = {}
-if os.path.exists(yaml_path):
+try:
+    if os.path.exists(db_path):
+        conn = sqlite3.connect(db_path, timeout=10.0)
+        row = conn.execute('SELECT data FROM state WHERE id=1').fetchone()
+        if row: d = json.loads(row[0])
+        conn.close()
+    elif os.path.exists(yaml_path):
         with open(yaml_path) as f:
             d = yaml.safe_load(f) or {}
+except Exception:
+    pass
 def jsonl_exists(uuid):
@@ -412,13 +480,22 @@ capture_conversation_id() {
 is_already_stopped() {
   local session_name="$1"
   SESSION_NAME="$session_name" env_python "$AGENT_SESSIONS_YAML" <<'PYEOF'
-import os, yaml
+import os, yaml, sqlite3, json
 name = os.environ['SESSION_NAME']
 yaml_path = os.environ['YAML_PATH']
+db_path = yaml_path.replace('.yaml', '.db')
 d = {}
-if os.path.exists(yaml_path):
+try:
+    if os.path.exists(db_path):
+        conn = sqlite3.connect(db_path, timeout=10.0)
+        row = conn.execute('SELECT data FROM state WHERE id=1').fetchone()
+        if row: d = json.loads(row[0])
+        conn.close()
+    elif os.path.exists(yaml_path):
         with open(yaml_path) as f:
             d = yaml.safe_load(f) or {}
+except Exception:
+    pass
 for s in d.get('tmux_sessions', []):
     if s.get('name') == name and s.get('status') == 'stopped':
         print(f"stopped_at={s.get('stopped_at', '?')}")
@@ -237,8 +237,20 @@ now_iso = datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')
 try:
     d
 except NameError:
+    import sqlite3
+    db_path = yaml_path.replace('.yaml', '.db')
+    d = {}
+    try:
+        if os.path.exists(db_path):
+            conn = sqlite3.connect(db_path, timeout=10.0)
+            row = conn.execute('SELECT data FROM state WHERE id=1').fetchone()
+            if row: d = json.loads(row[0])
+            conn.close()
+        elif os.path.exists(yaml_path):
             with open(yaml_path) as f:
                 d = yaml.safe_load(f) or {}
+    except Exception:
+        pass
 drifts = []
 actions = []
@@ -37,8 +37,20 @@ home = os.environ['HOME_DIR']
 claude_project_dir = os.environ.get('CLAUDE_PROJECT_DIR', f"{home}/.claude/projects")
 drift = json.loads(os.environ['DRIFT_JSON'])
+db_path = yaml_path.replace('.yaml', '.db')
+d = {}
+import sqlite3
+try:
+    if os.path.exists(db_path):
+        conn = sqlite3.connect(db_path, timeout=10.0)
+        row = conn.execute('SELECT data FROM state WHERE id=1').fetchone()
+        if row: d = json.loads(row[0])
+        conn.close()
+    elif os.path.exists(yaml_path):
         with open(yaml_path) as f:
             d = yaml.safe_load(f) or {}
+except Exception:
+    pass
 alive = set(drift.get('tmux_sessions_alive', []))
 drift_by_name = {}
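Reviewer note: the same "SQLite first, YAML fallback" read block is repeated in five places in this diff. One way to factor it out is sketched below. `load_state` is a hypothetical helper, not something this commit adds. It derives the DB path with `os.path.splitext`, which swaps only the final extension (the diff's `str.replace('.yaml', '.db')` would also rewrite a `.yaml` occurring earlier in the path). It imports PyYAML lazily, so only the fallback branch depends on it.

```python
import json
import os
import sqlite3
import tempfile


def load_state(yaml_path):
    """Return the registry dict: SQLite first, YAML snapshot as fallback."""
    db_path = os.path.splitext(yaml_path)[0] + ".db"
    try:
        if os.path.exists(db_path):
            conn = sqlite3.connect(db_path, timeout=10.0)
            try:
                row = conn.execute("SELECT data FROM state WHERE id=1").fetchone()
            finally:
                conn.close()
            return json.loads(row[0]) if row else {}
        if os.path.exists(yaml_path):
            import yaml  # PyYAML, needed only for the fallback path
            with open(yaml_path) as f:
                return yaml.safe_load(f) or {}
    except Exception:
        pass  # same best-effort behaviour as the diff: fall through to {}
    return {}


# Demo with a throwaway DB in a temporary directory (hypothetical paths).
demo_dir = tempfile.mkdtemp()
yaml_path = os.path.join(demo_dir, "agent-sessions.yaml")
conn = sqlite3.connect(os.path.join(demo_dir, "agent-sessions.db"))
conn.execute("CREATE TABLE state (id INTEGER PRIMARY KEY, data TEXT)")
conn.execute(
    "INSERT INTO state (id, data) VALUES (1, ?)",
    (json.dumps({"tmux_sessions": [{"name": "demo", "status": "running"}]}),),
)
conn.commit()
conn.close()

state = load_state(yaml_path)
```

As in the diff, an existing DB with no `state` row yields `{}` rather than falling back to YAML, and any read error is swallowed; whether a silent `{}` is the right behaviour for `reconcile` is left as an open design question.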