refactor: move AGENT.md and AGENT.ko.md to .agents/ directory

2026-06-26 21:28:41 +09:00
parent e14ee90243
commit 57d8f6c2ff
9 changed files with 15 additions and 14 deletions
@@ -0,0 +1,126 @@
+# AGENT.md
+
+This document defines the common guidelines and protocol for introducing the **MQTT messaging backplane and Tmux-based multi-agent orchestration workflow** to a new project. It sets out the rules and architecture that let collaborating agents perform tasks safely, robustly, and consistently.
+
+All agents working on a new project must read this document thoroughly and comply with the defined protocols before starting any tasks.
+
+---
+
+## 1. Agent Roles Definition (Agent Roles)
+
+We clearly separate responsibilities and permissions between roles to reduce bottlenecks and enhance the quality of execution.
+
+### 👤 Project Manager (PM / Orchestrator)
+- **Core Responsibility**: Receive user requirements, establish detailed task plans, assign and instruct workers, control the overall workflow, and report final results.
+- **Ambiguity Resolution**: If a user's requirements contain ambiguous details, do not guess. Immediately ask the user for clarification (we recommend using the `/grill-me` slash command).
+- **Feedback Loop Adjustment**: Analyze verification feedback from Reviewers to decide on improvement paths. For complex technical challenges, direct Workers and Reviewers to research options, add the PM's own assessment, and present a final report to the user to decide the project's direction.
+- **Self-Healing (Hermes Fallback Fix)**: If a defect pointed out by a Reviewer is extremely minor or is a simple typo/configuration omission, the PM should directly fix the source code instead of reassigning it to the Worker, thereby minimizing the round-trip cost.
+
+### 🛠️ Worker (Implementation Agent)
+- **Core Responsibility**: Design business logic and implement source code as delegated by the PM.
+- **Collaboration & Communication**: If the implementation path is ambiguous, or if interface design changes are required within the assigned scope, get the PM's agreement first and then keep the changes narrowly scoped (surgical).
+- **Contract Adherence**: Comply with the single task instructions (Brief) and the unique Job ID convention provided by the PM. Workers must publish a `started` event when starting work, and a `completed` or `error` event to the backplane upon termination.
+
+### 🔍 Reviewer (Verification Agent)
+- **Core Responsibility**: Verify source code changes (Diff) and implementation specifications submitted by Workers. Reviewers act as facilitators by detecting security vulnerabilities, proposing performance improvements, and examining design consistency.
+- **Provide Concrete Alternatives**: Simply rejecting changes (`NOT PASS`) is forbidden. When raising an issue, Reviewers must propose a **concrete, stable, and verified alternative code block or solution**.
+- **Complementary Cross-Verification**: Leverage the distinct strengths of different agent models (e.g., Flash-class models are good at catching semantic shell bugs, while Opus/Sonnet-class models excel at API signatures and logical regression analysis) to run parallel, mutually supportive reviews.
+
+---
+
+## 2. Messaging Backplane & Registry Protocol
+
+Asynchronous communication and state management between agents are controlled via distributed event channels and file/DB registries.
+
+### 📡 MQTT Backplane
+- **Event Lifecycle**:
+  - `started` (Job execution starts) ➡️ `progress`/`permission_required` (Share intermediate progress) ➡️ `completed` (Successful termination) or `error` (Failed termination)
+  - `completed` and `error` are terminal events. Each job publishes exactly one terminal event, and publishes it exactly once.
+- **Publish/Subscribe Rules**:
+  - An MQTT broker does not queue messages for clients that have not yet subscribed (unless a persistent session already exists), so the subscriber (`job_subscriber.py`) **must be running in the background before the agent starts** (the Subscribe-before-Publish principle).
+  - Publish terminal events with `retain=True` so that the broker stores the final state and delivers it to subscribers that join late.
+  - Sanitize all transmitted data so that secrets such as passwords and private keys, as well as absolute system paths, are never included.
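
The lifecycle and retain rules above can be sketched as a small emitter that refuses out-of-order events and sets the retain flag only on terminal events. This is a minimal sketch, not the project's `publish_event.py`: the class name `JobEventEmitter`, the topic layout `mam/jobs/<id>/events`, the payload fields, and the choice of QoS 1 are all assumptions made for illustration. The `client` argument is any object with a `publish(topic, payload, qos, retain)` method, which is the shape of the paho-mqtt client's `publish` call.

```python
import json
from typing import Any, Optional, Protocol

TERMINAL_EVENTS = {"completed", "error"}
_RUNNING = {"progress", "permission_required", "completed", "error"}

# Allowed transitions; the key None means "nothing published yet".
# Terminal events have no entry, so nothing may follow them.
ALLOWED_NEXT = {
    None: {"started"},
    "started": _RUNNING,
    "progress": _RUNNING,
    "permission_required": _RUNNING,
}


class Publisher(Protocol):
    """Anything with a paho-style publish method."""

    def publish(self, topic: str, payload: str, qos: int = 0, retain: bool = False) -> Any: ...


class JobEventEmitter:
    """Hypothetical helper that enforces the event lifecycle for one job."""

    def __init__(self, client: Publisher, job_id: str, topic_prefix: str = "mam/jobs"):
        self.client = client
        self.job_id = job_id
        self.topic = f"{topic_prefix}/{job_id}/events"  # assumed topic layout
        self.last_event: Optional[str] = None

    def emit(self, event: str, detail: str = "") -> None:
        allowed = ALLOWED_NEXT.get(self.last_event, set())
        if event not in allowed:
            raise ValueError(f"{event!r} is not allowed after {self.last_event!r}")
        payload = json.dumps({"job_id": self.job_id, "event": event, "detail": detail})
        # Only terminal events are retained, so late subscribers see the final state.
        self.client.publish(self.topic, payload, qos=1, retain=event in TERMINAL_EVENTS)
        self.last_event = event
```

Keeping the transition table in one place makes the "exactly one terminal event" rule a property of the emitter rather than a convention each Worker has to remember.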
+
+### 🗃️ Registry & State Management
+- This architecture maintains two distinct registries based on their purpose:
+  - **Job Registry**: The metadata and lifecycle of each asynchronous job are recorded in an individual JSON file (`.mam/jobs/<id>.json`). Claiming races between multiple sessions are prevented with file-based `fcntl` advisory locks (`registry_lock`, provided by `registry.py`).
+  - **Session Registry**: TMUX monitoring states and running agent metadata are kept in a SQLite database in WAL mode (`.mam/agent-sessions.db`), which supports reliable concurrent transactions on a single host. WAL mode relies on shared memory between processes on the same machine and does not work reliably over network file systems such as NFS, so keep the database on a local file system.
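
The two registries can be illustrated with a short sketch. It assumes a single lock file per jobs directory, a `claimed_by` field in the job JSON, and the helper names `claim_job` and `open_session_db`; none of these are confirmed by the source, and the real `registry_lock` in `registry.py` may differ (for example, it could use `fcntl.lockf` rather than `fcntl.flock`). The `fcntl` module is POSIX-only.

```python
import contextlib
import fcntl
import json
import os
import sqlite3
from pathlib import Path


@contextlib.contextmanager
def registry_lock(lock_path: Path):
    """Hold an exclusive advisory lock on lock_path for the duration of the block."""
    lock_path.parent.mkdir(parents=True, exist_ok=True)
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until other holders release
        yield
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)


def claim_job(jobs_dir: Path, job_id: str, session: str) -> bool:
    """Claim a job for a session; return False if another session already holds it."""
    job_file = jobs_dir / f"{job_id}.json"
    with registry_lock(jobs_dir / ".lock"):  # assumed lock-file location
        job = json.loads(job_file.read_text(encoding="utf-8"))
        if job.get("claimed_by"):  # hypothetical field name
            return False
        job["claimed_by"] = session
        # Write to a temp file and rename so readers never see a partial file.
        tmp = job_file.parent / (job_file.name + ".tmp")
        tmp.write_text(json.dumps(job), encoding="utf-8")
        os.replace(tmp, job_file)
        return True


def open_session_db(db_path: Path) -> sqlite3.Connection:
    """Open the session registry in WAL mode (local file systems only)."""
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA journal_mode=WAL")
    return conn
```

Because the lock is advisory, it only protects against other processes that take the same lock; any tool that edits `.mam/jobs/*.json` directly bypasses it.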
+
+### 🛡️ Security Protocol (HMAC-SHA256)
+- **Unauthenticated PoC Mode**: If the `auth_token` in the job registry is set to `null` (the default PoC mode), signature verification is skipped and all events are accepted (`verify_hmac` always returns `True`).
+- **Authenticated Production Mode**: In production environments or integrations requiring authentication, a unique cryptographic token (`auth_token`) is issued for each job. The publisher must include an `hmac_sig` signature in the payload keyed by this token, and the receiving end (`verify_hmac`) will immediately drop messages that lack a signature or have mismatching signatures to prevent downgrade attacks.
+- **Rollout Strategy**: To avoid event drops caused by inconsistencies between publishing and receiving nodes when updating security schemes, hybrid transition formats (which risk leaking plaintext tokens) must not be used. Instead, adopt a **"Simultaneous Rollout"** where all nodes are updated at once.
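
The two verification modes can be sketched with the standard library alone. The function name `verify_hmac` and the payload field `hmac_sig` come from the text above; everything else is an assumption for illustration, in particular the helper `sign_payload` and the canonical form (sorted-key compact JSON of every field except `hmac_sig`). The project's actual serialization may differ, and signer and verifier must agree on it byte for byte.

```python
import hashlib
import hmac
import json
from typing import Optional


def _canonical(payload: dict) -> bytes:
    """Assumed canonical form: compact, key-sorted JSON without the signature field."""
    body = {k: v for k, v in payload.items() if k != "hmac_sig"}
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")


def _digest(payload: dict, auth_token: str) -> str:
    return hmac.new(auth_token.encode("utf-8"), _canonical(payload), hashlib.sha256).hexdigest()


def sign_payload(payload: dict, auth_token: str) -> dict:
    """Return a copy of payload with an HMAC-SHA256 signature attached."""
    return {**payload, "hmac_sig": _digest(payload, auth_token)}


def verify_hmac(payload: dict, auth_token: Optional[str]) -> bool:
    """Accept everything in PoC mode; otherwise require a valid signature."""
    if auth_token is None:
        return True  # unauthenticated PoC mode
    sig = payload.get("hmac_sig")
    if not isinstance(sig, str):
        return False  # a missing signature is rejected, never downgraded
    expected = _digest(payload, auth_token)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(sig.encode("utf-8"), expected.encode("utf-8"))
```

Note that the token itself never appears in the payload; only the digest does, which is the property the rollout rule is protecting.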
+
+---
+
+## 3. Collaborative Workflow Execution Loop (Workflow Loop)
+
+```mermaid
+sequenceDiagram
+    autonumber
+    actor User as User
+    participant PM as Project Manager
+    participant W as Worker
+    participant R as Reviewers
+    participant M as MQTT Backplane
+
+    User->>PM: Hand over requirements
+    Note over PM: Run grill-me & plan tasks
+    PM->>M: Register Job & start Subscriber
+    PM->>W: Delegate task (Provide Job ID & Brief)
+    W->>M: Publish 'started' event
+    Note over W: Modify code & implement
+    W->>M: Publish 'completed' (or 'error')
+    PM->>R: Request parallel review (Provide Diff)
+    Note over R: Cross-analysis & verification
+    alt Defect Found
+        R->>PM: NOT PASS (Feedback with alternatives)
+        Note over PM: PM directly fixes minor defects
+        PM->>W: Apply feedback & re-delegate
+    else Verification Pass
+        R->>PM: PASS
+    end
+    PM->>User: Report final pass & commit changes
+```
+
+1. **Planning and Allocation**: The PM defines requirements and outlines independent jobs to avoid conflicting dependencies.
+2. **Execution and Notification**: The PM launches a subscriber, then assigns the job to a Worker session. The Worker implements the change and publishes a terminal event, which automatically closes the session.
+3. **Cross-Verification Iteration (Review Loop)**: Once the task is complete, the PM circulates the changes to the Reviewer agents in parallel. The modify-reject cycle repeats until all reviewers yield a `PASS`, ensuring high-quality code.
+4. **Release and Cleanup**: Code that passes verification is committed to Git, and temporary session resources are reclaimed.
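
The review loop in step 3 can be sketched as a plain function. This is an illustration under assumptions rather than the project's orchestration code: the names `Review` and `run_review_loop` are hypothetical, reviewers are called one after another here even though the protocol runs them in parallel, and the `max_rounds` cap is an added safety limit that the protocol itself does not specify.

```python
from typing import Callable, List, NamedTuple, Sequence


class Review(NamedTuple):
    passed: bool
    minor: bool = False    # True for typos or small configuration omissions
    alternative: str = ""  # concrete fix proposed by the reviewer


Reviewer = Callable[[str], Review]
Fixer = Callable[[str, List[Review]], str]


def run_review_loop(
    diff: str,
    reviewers: Sequence[Reviewer],
    pm_fix: Fixer,
    worker_fix: Fixer,
    max_rounds: int = 5,  # assumed cap, not part of the protocol
) -> str:
    """Repeat review and fix until every reviewer passes; return the final diff."""
    for _ in range(max_rounds):
        reviews = [review(diff) for review in reviewers]
        failures = [r for r in reviews if not r.passed]
        if not failures:
            return diff
        if any(not r.alternative for r in failures):
            raise ValueError("a NOT PASS review must include a concrete alternative")
        # The PM fixes the diff directly only when every defect is minor.
        fixer = pm_fix if all(r.minor for r in failures) else worker_fix
        diff = fixer(diff, failures)
    raise RuntimeError(f"reviewers did not all pass within {max_rounds} rounds")
```

Rejecting a review that lacks an alternative mirrors the Reviewer rule in section 1, so a bare `NOT PASS` surfaces as an error instead of silently stalling the loop.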
+
+---
+
+## 4. Analysis Infrastructure Patterns & Practical Guide (Infra Patterns)
+
+These are critical instructions for preventing data loss and infrastructure-level failures during long-running agent analyses.
+
+### 📸 Preventing TUI Viewport Truncation (The 3 Pane Snapshotting Rules)
+To ensure that agents running in TMUX environments do not lose debug logs or previous outputs due to screen scrollback limits, the following **snapshotting pattern must be enforced**:
+1. **Pre-brief Capture**: Capture the pane (`capture-pane -S -200`) immediately after sending the task instruction (Brief) to back up the starting point of the input history.
+2. **Loop Snapshot**: For long-running agent sessions (5 minutes or more), periodically (e.g., every 30 seconds) scan the viewport and append the incremental data to `/tmp/pane-snap.txt`.
+3. **Post-job Capture**: Capture the complete pane state one final time immediately after a job completes or returns an error to preserve the entire execution trajectory.
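
The three rules above can be sketched with `tmux capture-pane` and a small overlap check. The `tmux` flags used here (`-p` to print to stdout, `-t` for the target pane, `-S` for the start line) are standard; the helper names and the overlap heuristic are assumptions for illustration. The heuristic only handles plain scrolling output, so a TUI that redraws the whole screen will produce duplicated lines in the snapshot file.

```python
import subprocess
from pathlib import Path
from typing import List


def capture_pane_argv(target: str, history_lines: int = 200) -> List[str]:
    """Build the tmux command that prints the pane plus recent scrollback."""
    return ["tmux", "capture-pane", "-p", "-t", target, "-S", f"-{history_lines}"]


def capture_pane(target: str, history_lines: int = 200) -> List[str]:
    """Run tmux and return the captured pane as a list of lines."""
    result = subprocess.run(
        capture_pane_argv(target, history_lines),
        check=True, capture_output=True, text=True,
    )
    return result.stdout.splitlines()


def new_lines(previous: List[str], current: List[str]) -> List[str]:
    """Return the lines of current that come after its overlap with previous.

    The overlap is the longest tail of previous that equals a head of current.
    """
    for size in range(min(len(previous), len(current)), 0, -1):
        if previous[-size:] == current[:size]:
            return current[size:]
    return current  # no overlap: treat everything as new


def append_snapshot(snap_path: Path, lines: List[str]) -> None:
    """Append incremental lines to the snapshot file (e.g. /tmp/pane-snap.txt)."""
    if not lines:
        return
    with snap_path.open("a", encoding="utf-8") as fh:
        fh.write("\n".join(lines) + "\n")
```

A loop snapshot would then call `capture_pane` every 30 seconds, pass the previous and current captures to `new_lines`, and hand the result to `append_snapshot`.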
+
+### 📄 Handling Long Briefing Instructions
+- Sending long instructions or prompts (hundreds of lines) sequentially via TMUX `send-keys` or input buffers can overwhelm the agent's TUI, leading to lost characters or truncated paragraphs.
+- **Resolution**: If instructions are long, write them separately to a file path (e.g., `/tmp/brief-<job_id>.md`) and send a simplified execution command to the agent: `"Read /tmp/brief-... and execute"`.
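
A minimal sketch of this resolution follows, assuming hypothetical helper names `write_brief` and `send_brief_argv`. It writes the brief to `/tmp/brief-<job_id>.md` and builds two `tmux send-keys` commands: one that types the short instruction literally (`-l` disables key-name lookup) and one that presses Enter.

```python
import subprocess
from pathlib import Path
from typing import List


def write_brief(job_id: str, text: str, brief_dir: Path = Path("/tmp")) -> Path:
    """Write the full brief to a file and return its path."""
    path = brief_dir / f"brief-{job_id}.md"
    path.write_text(text, encoding="utf-8")
    return path


def send_brief_argv(target: str, brief_path: Path) -> List[List[str]]:
    """Build the tmux commands that type the short instruction, then press Enter."""
    message = f"Read {brief_path} and execute"
    return [
        ["tmux", "send-keys", "-t", target, "-l", message],  # literal text
        ["tmux", "send-keys", "-t", target, "Enter"],        # key name
    ]


def send_brief(target: str, job_id: str, text: str) -> Path:
    """Write the brief, then send the one-line instruction to the agent's pane."""
    path = write_brief(job_id, text)
    for argv in send_brief_argv(target, path):
        subprocess.run(argv, check=True)
    return path
```

Only a single short line ever passes through the TUI input buffer, so the length of the brief no longer affects delivery. Since `/tmp` is shared, the brief should follow the same sanitization rule as backplane payloads and contain no secrets.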
+
+### ⏱️ Timeout Configuration & Alignment Rules
+- **Job Execution Limits (`timeout_sec` & `idle_timeout_sec`)**: Each job independently manages its overall execution timeout (`timeout_sec`, default 3600s) and its idle timeout, the maximum time allowed without any message being received (`idle_timeout_sec`, default 120s).
+- **Monitor Idle Waiting (`SUB_IDLE_TIMEOUT`)**: The idle timeout for the monitor script (`reconcile.sh`), `SUB_IDLE_TIMEOUT`, must always be set generously to `3600s` (1 hour) or more to align with the maximum job budget. This prevents the monitor from terminating early due to idle detection, which would lose control over background tasks before they finish.
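
The alignment rule can be checked mechanically before a monitor is launched. The sketch below is an illustration only: `JobTimeouts` and `check_monitor_timeout` are hypothetical names, and how `SUB_IDLE_TIMEOUT` actually reaches `reconcile.sh` is not specified in this document, so the value is simply passed in as an integer number of seconds.

```python
from dataclasses import dataclass
from typing import Iterable, List

MIN_SUB_IDLE_TIMEOUT = 3600  # one hour, the floor stated in the rule above


@dataclass(frozen=True)
class JobTimeouts:
    timeout_sec: int = 3600      # overall execution budget
    idle_timeout_sec: int = 120  # maximum silence between messages


def check_monitor_timeout(sub_idle_timeout: int, jobs: Iterable[JobTimeouts]) -> List[str]:
    """Return a list of problems; an empty list means the timeouts are aligned."""
    problems: List[str] = []
    # The monitor must outlast the longest job budget it supervises.
    max_budget = max((job.timeout_sec for job in jobs), default=MIN_SUB_IDLE_TIMEOUT)
    if sub_idle_timeout < MIN_SUB_IDLE_TIMEOUT:
        problems.append(
            f"SUB_IDLE_TIMEOUT={sub_idle_timeout}s is below the {MIN_SUB_IDLE_TIMEOUT}s minimum"
        )
    if sub_idle_timeout < max_budget:
        problems.append(
            f"SUB_IDLE_TIMEOUT={sub_idle_timeout}s is shorter than the longest job budget ({max_budget}s)"
        )
    return problems
```

For example, a monitor configured with 3600 seconds passes for default jobs but is flagged as soon as one job raises `timeout_sec` to 7200.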
+
+---
+
+## 5. Setup Checklist for New Projects (Setup Checklist)
+
+Use this checklist when deploying this agent orchestration model to a new project:
+
+- [ ] **Virtualenv Dependencies**: Are required Python packages like `pyyaml` and `paho-mqtt` included in `.venv` or `requirements.txt`?
+- [ ] **Configuration File**: Are the MQTT broker address and security credentials safely loaded and shared via the `.env` file?
+- [ ] **Directory Convention**: Are the registry path (`.mam/jobs/`) and logging path (`.mam/delegate_job_logs/`) added to `.gitignore`?
+- [ ] **Core Scripts**: Are the core scripts (`mqtt_common.py`, `publish_event.py`, `job_subscriber.py`, and `registry.py`) in place?
+- [ ] **HMAC Enablement**: When a new registry job is created, is a random `auth_token` correctly injected, and is signature-based mutual authentication active?
+- [ ] **Charter Placement**: Is this protocol file (`AGENT.md`) placed in the **.agents/ directory** of the new project? (Placing it in `.agents/` is essential to keep the project root clean while allowing onboarding agents to align on the rules.)
+
+---
+
+*This guide balances collaboration efficiency with strict code security. Any required changes must be discussed and agreed upon by the PM and all Reviewers before updating this document.*