docs: add detailed answers for 21 learning roadmap questions

2026-05-13 03:40:13 +00:00
parent a8d90d2980
commit c4753da8f5
10 changed files with 1308 additions and 0 deletions
--- a/docs/answers/05-delivery-and-system-actions.md
+++ b/docs/answers/05-delivery-and-system-actions.md
@@ -0,0 +1,175 @@
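+# Q13-Q15: Outbound Delivery and System Actions
+
+---
+
+## Q13: After the agent replies, how does delivery.ts know which channel adapter to send through? How are retries and failures handled?
+
+### Answer
+
+The delivery system uses **two-tier polling**: the active poll (every 1s) scans sessions that have a running container, and the sweep poll (every 60s) scans all active sessions. Both read from `outbound.db` (container-owned) and track delivery state in the `delivered` table of `inbound.db`. The channel adapter is a single global `ChannelDeliveryAdapter` set at boot.
+
+### Adapter selection
+
+- Each `messages_out` row carries `channel_type` and `platform_id` fields (filled in by the container's `writeMessageOut()`)
+- `delivery.ts:356-363`: `deliveryAdapter.deliver(channelType, platformId, threadId, kind, content, files)` is called. The adapter receives `channelType` + `platformId` and is responsible for routing to the correct platform
+- The adapter is set once via `setDeliveryAdapter()` (line 95) and is a `ChannelDeliveryAdapter` that wraps all channel adapters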
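
A minimal sketch of how a single wrapping adapter could route on `channelType`. The `ChannelAdapter` interface, the `CompositeDeliveryAdapter` class, and the synchronous `send` signature are assumptions made for illustration; the real `ChannelDeliveryAdapter` in `delivery.ts` may be shaped differently.

```typescript
// Hypothetical per-platform adapter; not the project's actual interface.
interface ChannelAdapter {
  send(platformId: string, threadId: string | null, content: string): string;
}

// Hypothetical wrapper: one global object that fans out by channel type.
class CompositeDeliveryAdapter {
  private readonly adapters = new Map<string, ChannelAdapter>();

  register(channelType: string, adapter: ChannelAdapter): void {
    this.adapters.set(channelType, adapter);
  }

  deliver(channelType: string, platformId: string, threadId: string | null, content: string): string {
    const adapter = this.adapters.get(channelType);
    if (!adapter) {
      // An unknown channel type is surfaced as an error so the caller can count it as a failed attempt.
      throw new Error(`no adapter registered for channel type "${channelType}"`);
    }
    return adapter.send(platformId, threadId, content);
  }
}
```

Under this shape the delivery loop never needs to know which platforms exist; it only passes along the `channel_type` and `platform_id` stored on the row.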
+
+### Full delivery flow
+
+**1. Poll trigger:** `pollActive()` (1s, line 121-133) → `getRunningSessions()`; `pollSweep()` (60s, line 136-149) → `getActiveSessions()`
+
+**2. Race guard** (line 151-162): `inflightDeliveries` is a `Set<string>`. If the active poll and the sweep poll race on the same session, the second caller skips it, which prevents duplicate delivery
+
+**3. Drain session** (`drainSession()`, line 164-232):
+- Opens `outbound.db` read-only and `inbound.db` read-write
+- `getDueOutboundMessages(outDb)` reads `messages_out` (`deliver_after <= now`)
+- `getDeliveredIds(inDb)` is used for the dedup comparison (line 183)
+- Calls `deliverMessage()` for each undelivered message (line 192)
+
+**4. Message routing** (`deliverMessage()`, line 234-375):
+- **System actions** (line 255-258): `msg.kind === 'system'` → `handleSystemAction()` → looks up the `actionHandlers` Map. Modules register handlers via `registerDeliveryAction()`
+- **Agent-to-agent** (line 264-271): `msg.channel_type === 'agent'` → `routeAgentMessage()`
+- **Channel delivery** (line 289-375):
+  - **Permission check** (line 289-311): verifies that the source agent is allowed to send to the target channel. Either the target is the origin of its own session (`session.messaging_group_id` matches), or there is an explicit row in `agent_destinations`
+  - **Pending question tracking** (line 317-340): messages of type `ask_question` create a `pending_questions` row
+  - **File attachments** (line 348-354): files are read from the session's `outbox/<messageId>/`
+  - **Adapter call** (line 356-363): the actual send
+  - **Cleanup** (line 372): `clearOutbox()` deletes the outbox directory
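
Steps 2 and 3 above can be sketched in memory as follows. This is an illustration only: the `OutboundMessage` shape, the array and `Set` standing in for `outbound.db` and the `delivered` table, and the synchronous `deliver` callback are all assumptions, not the code in `delivery.ts`.

```typescript
// Hypothetical row shape; the real messages_out schema has more columns.
interface OutboundMessage {
  id: string;
  deliverAfter: number; // epoch milliseconds
}

const inflightDeliveries = new Set<string>();

// Returns the IDs delivered in this pass, or null if another poll already holds the session.
function drainSession(
  sessionId: string,
  outbound: OutboundMessage[],
  deliveredIds: Set<string>,
  now: number,
  deliver: (msg: OutboundMessage) => void,
): string[] | null {
  if (inflightDeliveries.has(sessionId)) return null; // the second caller skips
  inflightDeliveries.add(sessionId);
  try {
    const sent: string[] = [];
    for (const msg of outbound) {
      if (msg.deliverAfter > now) continue; // not due yet
      if (deliveredIds.has(msg.id)) continue; // already delivered, skip
      deliver(msg);
      deliveredIds.add(msg.id);
      sent.push(msg.id);
    }
    return sent;
  } finally {
    // Always release the guard, even if deliver() throws.
    inflightDeliveries.delete(sessionId);
  }
}
```

Releasing the guard in `finally` matters in this sketch: without it, one failed delivery would block the session from being drained by any later poll.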
+
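+### Retries and failure handling
+
+- Delivery attempts are tracked in memory in the `deliveryAttempts` Map, keyed by message ID (the Map resets on process restart, which gives failed messages a fresh start)
+- After `MAX_DELIVERY_ATTEMPTS = 3` failures, the message is marked as permanently failed (line 206-225)
+- Retries are lazy: the message stays undelivered in the `messages_out` table, and the next poll iteration picks it up again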
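
The attempt counting described above can be sketched like this. The `attemptDelivery` function and its `Outcome` type are hypothetical names; only `deliveryAttempts` and `MAX_DELIVERY_ATTEMPTS = 3` come from the text, and how the real code records a permanent failure is not shown here.

```typescript
const MAX_DELIVERY_ATTEMPTS = 3;

// In-memory only: a process restart empties this map and resets every counter.
const deliveryAttempts = new Map<string, number>();

type Outcome = 'delivered' | 'retry' | 'failed';

// Hypothetical helper: runs one delivery attempt and reports what the poll loop should do next.
function attemptDelivery(messageId: string, send: () => void): Outcome {
  try {
    send();
    deliveryAttempts.delete(messageId);
    return 'delivered';
  } catch {
    const attempts = (deliveryAttempts.get(messageId) ?? 0) + 1;
    if (attempts >= MAX_DELIVERY_ATTEMPTS) {
      deliveryAttempts.delete(messageId);
      return 'failed'; // the caller would mark the message as permanently failed
    }
    deliveryAttempts.set(messageId, attempts);
    return 'retry'; // the message stays undelivered; the next poll picks it up
  }
}
```

There is no timer or backoff in this sketch; the poll interval itself acts as the retry delay, which is what "lazy" means above.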
+
+---
+
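+## Q14: What is the full approval-and-execution chain when an agent requests `install_packages` or `add_mcp_server`?
+
+### Answer
+
+Self-modification uses a **fire-and-forget** pattern gated by admin approval. The agent calls an MCP tool, and the tool writes a system message to `outbound.db`. The host delivery loop picks it up, validates and sanitizes the request, and queues it for approval. Once an admin approves, the host applies the change, rebuilds the image (if needed), kills the container, and writes an `on_wake` message for the new container.
+
+### Full chain (step by step)
+
+**Phase 1: Agent request (container side)**
+
+1. The agent calls the `install_packages` or `add_mcp_server` MCP tool (`container/agent-runner/src/mcp-tools/self-mod.ts`)
+   - `install_packages` (line 53-78): validates package names with regexes (`APT_RE`, `NPM_RE`, at most 20 packages) and writes an outbound message with `kind: 'system'` + `action: 'install_packages'`
+   - `add_mcp_server` (line 97-117): validates that name and command are present and writes `action: 'add_mcp_server'`
+
+2. Write to `outbound.db`: uses `writeMessageOut()`, which writes an **odd seq**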
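
The package-name check in step 1 could look roughly like the sketch below. The two patterns are assumptions written for this example and are not the project's actual `APT_RE` and `NPM_RE`; applying the limit of 20 to the combined count is also an assumption, as is the `InstallRequest` shape.

```typescript
// Assumed patterns for illustration only; the real APT_RE and NPM_RE may differ.
const APT_RE = /^[a-z0-9][a-z0-9+.-]+$/;
const NPM_RE = /^(@[a-z0-9~-][a-z0-9._~-]*\/)?[a-z0-9~-][a-z0-9._~-]*$/;
const MAX_PACKAGES = 20; // assumed to apply to apt + npm combined

// Hypothetical request shape.
interface InstallRequest {
  apt: string[];
  npm: string[];
}

// Returns an error string for the first problem found, or null if the request looks safe.
function validateInstallRequest(req: InstallRequest): string | null {
  const total = req.apt.length + req.npm.length;
  if (total === 0) return 'no packages requested';
  if (total > MAX_PACKAGES) return `too many packages (max ${MAX_PACKAGES})`;
  for (const pkg of req.apt) {
    if (!APT_RE.test(pkg)) return `invalid apt package name: ${pkg}`;
  }
  for (const pkg of req.npm) {
    if (!NPM_RE.test(pkg)) return `invalid npm package name: ${pkg}`;
  }
  return null;
}
```

Anchoring both patterns with `^` and `$` is the important part: the names end up in a build command, so anything containing spaces, semicolons, or shell substitution has to be rejected outright.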
+
+**Phase 2: Host delivery picks it up**
+
+3. `delivery.ts` → `deliverMessage()` sees `kind === 'system'` → `handleSystemAction()`
+4. `handleSystemAction()` (line 410-425): looks up the `actionHandlers` Map. The self-mod module registers these handlers:
+   - `handleInstallPackages` (`self-mod/request.ts:20-64`)
+   - `handleAddMcpServer` (`self-mod/request.ts:66-91`)
+
+**Phase 3: Request validation + approval queueing**
+
+5. **Host-side validation** (second layer of defense in depth): package names are validated again (same regexes). On failure, `notifyAgent()` informs the agent and **no approval is created**
+
+6. **Approval request** (`approvals/primitive.ts:164-220`):
+   - `pickApprover(session.agent_group_id)` → scoped admins → global admins → owners
+   - `pickApprovalDelivery`: finds an approver who can be DMed, preferring the same channel type
+   - Sends an `ask_question` card to the admin's DM via `deliveryAdapter.deliver()`
+   - Creates a `pending_approvals` row containing `action`, `payload` (JSON), `approval_id`, and `session_id`
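
The approver fallback in step 6 can be sketched as a tiered lookup. This reads the arrows as "the first non-empty tier wins", which is an interpretation of the text rather than something it states; the `ApproverDirectory` shape is hypothetical, and the real `pickApprover` takes only the agent group ID and presumably queries the database.

```typescript
// Hypothetical in-memory stand-in for the role tables.
interface ApproverDirectory {
  scopedAdmins: Record<string, string[]>; // agentGroupId -> user IDs
  globalAdmins: string[];
  owners: string[];
}

// Assumed semantics: return the most specific tier that has at least one member.
function pickApprover(agentGroupId: string, dir: ApproverDirectory): string[] {
  const scoped = dir.scopedAdmins[agentGroupId] ?? [];
  if (scoped.length > 0) return scoped;
  if (dir.globalAdmins.length > 0) return dir.globalAdmins;
  return dir.owners;
}
```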
+
+**Phase 4: Admin response**
+
+7. `approvals/response-handler.ts:24-43` → `handleApprovalsResponse()`:
+   - **Reject** (line 72-77): calls `notify()` to inform the agent
+   - **Approve** (line 80-105): looks up the handler registered for `getApprovalHandler(approval.action)` and passes it `{ session, payload, userId, notify }`
+
+**Phase 5: Applying the change**
+
+8. **`install_packages` handler**, `self-mod/apply.ts:22-83`:
+   - Deduplicates the new apt/npm packages and appends them to the list already stored in the DB (line 37-49)
+   - `buildAgentGroupImage()` (line 57): builds a per-agent-group Docker image (`container-runner.ts:468-515`) using `docker build -t nanoclaw-agent:<agentGroupId>` with a 900s timeout
+   - Writes an `on_wake: 1` message to inform the agent
+   - **Kills the container with an `onExit` callback** (line 72-75): `killContainer(sessionId, 'rebuild applied', () => { wakeContainer(s) })`, which guarantees the new container is spawned only after the old one has exited
+   - Rebuild failure (line 77-82): notifies the admin and does not kill the container
+
+9. **`add_mcp_server` handler**, `self-mod/apply.ts:85-125`:
+   - Adds the MCP server to the `mcp_servers` JSON in the DB (line 99-105)
+   - Writes an `on_wake: 1` message (line 107-120)
+   - Kills the container with an `onExit` → `wakeContainer` callback (line 121-124)
+   - **No image rebuild is needed**: Bun runs TS directly, so a pure MCP wiring change does not require a Docker build
+
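+**Phase 6: New container startup**
+
+10. The `onExit` callback fires → `wakeContainer()` → `spawnContainer()`:
+    - Materializes a fresh `container.json` from the DB (line 127)
+    - `composeGroupClaudeMd()` regenerates CLAUDE.md (line 261)
+    - `clearStaleProcessingAcks()` clears old processing acks (`connection.ts:175-177`)
+    - `getPendingMessages(isFirstPoll=true)` picks up the `on_wake: 1` message, which is visible only on the first poll
+
+### Why `on_wake` is race-safe
+
+The `on_wake` column of the `messages_in` table works together with the `isFirstPoll` gate in `getPendingMessages()`: the first poll includes `on_wake = 1` rows, and later polls exclude them (by then they are `completed`). Combined with the `onExit` callback of `killContainer`, this means the old container can never grab the on_wake message before the new container does.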
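
A minimal sketch of the `isFirstPoll` gate, using an in-memory array in place of the `messages_in` table. The `InboundRow` shape and the two-argument signature are assumptions; the real `getPendingMessages()` reads from `inbound.db`.

```typescript
// Hypothetical row shape; only the columns relevant to the gate are modeled.
interface InboundRow {
  id: string;
  status: 'pending' | 'completed';
  onWake: 0 | 1;
}

// On the first poll every pending row is returned; on later polls on_wake rows are filtered out.
function getPendingMessages(rows: InboundRow[], isFirstPoll: boolean): InboundRow[] {
  return rows.filter((row) => row.status === 'pending' && (isFirstPoll || row.onWake === 0));
}
```

In this sketch a long-running container whose first poll has already happened cannot see a newly written `on_wake = 1` row, so only a freshly spawned container can consume it.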
+
+---
+
+## Q15: How are scheduled tasks (cron) implemented?
+
+### Answer
+
+Scheduled tasks are implemented as **rows in the `messages_in` table with `kind='task'`**, piggybacking on the core schema with no dedicated table. The agent creates tasks through an MCP tool, the host writes them into `inbound.db`, and recurrence is driven by a host sweep hook that clones completed recurring tasks into new pending rows.
+
+### Creating a task
+
+1. Through the `schedule_task` MCP tool (`container/agent-runner/src/mcp-tools/scheduling.ts`), the agent writes an outbound message with `kind: 'system'` + `action: 'schedule_task'`, containing `taskId`, `prompt`, `processAfter` (ISO timestamp of the first run), and an optional `recurrence` (cron expression)
+
+2. Host delivery picks it up → `handleSystemAction()` → the handler registered for `action: 'schedule_task'` (`scheduling/actions.ts:19-40`):
+   - Calls `insertTask()` (`scheduling/db.ts:17-36`): inserts a `messages_in` row with `kind = 'task'`, `status = 'pending'`, `process_after = <first run time>`, `recurrence = <cron-expr>`, `series_id = <taskId>`
+   - The content is stored as JSON `{ prompt, script }`
+
+3. The agent can also create non-recurring scheduled messages: the `schedule_message` tool works the same way, except that `kind` matches the original message type
+
+### Triggering: host sweep + `countDueMessages`
+
+4. The host sweep wakes the container (`host-sweep.ts:180-186`):
+   - `countDueMessages(inDb)` counts rows with `status = 'pending' AND process_after <= now`
+   - `dueCount > 0 && !isContainerRunning` → `wakeContainer(session)`
+
+5. The container processes the task: `getPendingMessages()` reads pending rows, including task rows, and formats them for the provider
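
The wake decision in step 4 reduces to a count plus a boolean check. The sketch below models it over an in-memory array; the `PendingRow` shape and the `shouldWake` helper are hypothetical, and the real `countDueMessages` runs a query against `inbound.db`.

```typescript
// Hypothetical row shape with only the two columns the query inspects.
interface PendingRow {
  status: string;
  processAfter: string; // ISO 8601 timestamp
}

// Mirrors: status = 'pending' AND process_after <= now
function countDueMessages(rows: PendingRow[], now: Date): number {
  return rows.filter(
    (row) => row.status === 'pending' && Date.parse(row.processAfter) <= now.getTime(),
  ).length;
}

// Hypothetical helper: wake only when there is due work and no container is running yet.
function shouldWake(dueCount: number, isContainerRunning: boolean): boolean {
  return dueCount > 0 && !isContainerRunning;
}
```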
+
+### Recurrence: host sweep + `handleRecurrence`
+
+6. Recurrence fanout (`scheduling/recurrence.ts:21-53`), called on every 60s sweep cycle (`host-sweep.ts:205-206`):
+   - `getCompletedRecurring(inDb)` (`scheduling/db.ts:122-126`): finds rows with `status = 'completed' AND recurrence IS NOT NULL`
+   - For each completed row:
+     - Parses the cron expression in the user's timezone (not UTC)
+     - Computes `nextRun = interval.next().toISOString()`
+     - `insertRecurrence()` (`scheduling/db.ts:128-149`): copies the original row and sets `process_after = nextRun`, `status = 'pending'`
+     - `clearRecurrence()` (line 151-153): sets `recurrence = NULL` on the original row so it is not cloned again on the next cycle
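
The clone-then-clear cycle above can be sketched over an in-memory array. The `TaskRow` shape is hypothetical, and the cron computation is injected as a `nextRun` callback so that the example does not depend on a particular cron library; the real code parses the expression itself.

```typescript
// Hypothetical row shape modeling the columns named in the text.
interface TaskRow {
  id: string;
  seriesId: string;
  status: 'pending' | 'completed' | 'paused';
  processAfter: string; // ISO 8601 timestamp
  recurrence: string | null; // cron expression
  content: string;
}

// Clones every completed recurring row into a new pending row, then clears the original's recurrence.
function handleRecurrence(
  rows: TaskRow[],
  nextRun: (cronExpr: string) => Date,
  newId: () => string,
): TaskRow[] {
  const clones: TaskRow[] = [];
  for (const row of rows) {
    if (row.status !== 'completed' || row.recurrence === null) continue;
    clones.push({
      ...row, // keeps seriesId, recurrence, and content
      id: newId(),
      status: 'pending',
      processAfter: nextRun(row.recurrence).toISOString(),
    });
    row.recurrence = null; // the original is never cloned again
  }
  rows.push(...clones);
  return clones;
}
```

Because the clone keeps its `recurrence` while the original loses it, exactly one row per series is eligible for cloning at any time, which makes the sweep safe to run repeatedly.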
+
+### Other task lifecycle operations
+
+7. **Cancel/Pause/Resume** (`actions.ts:42-70`):
+   - `cancelTask()` (line 38-42): matches on `id OR series_id` and sets all `pending`/`paused` rows to `completed`
+   - `pauseTask()` (line 44-48), `resumeTask()` (line 50-54): same approach
+   - Matching on `series_id` means the agent can reference any single occurrence to cancel the whole series
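
A sketch of the `id OR series_id` match used by cancel, again over an in-memory array. The `SeriesRow` shape and the returned count are assumptions; the real `cancelTask()` presumably issues a single SQL `UPDATE`.

```typescript
// Hypothetical row shape.
interface SeriesRow {
  id: string;
  seriesId: string;
  status: 'pending' | 'paused' | 'completed';
}

// Marks every pending or paused row whose id or series_id equals ref as completed.
// Returns the number of rows changed.
function cancelTask(rows: SeriesRow[], ref: string): number {
  let changed = 0;
  for (const row of rows) {
    const matches = row.id === ref || row.seriesId === ref;
    if (matches && (row.status === 'pending' || row.status === 'paused')) {
      row.status = 'completed';
      changed++;
    }
  }
  return changed;
}
```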
+
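+### How host sweep and recurrence are ordered
+
+```
+Host sweep (every 60s):
+  Step 1: syncProcessingAcks()
+  Step 2: countDueMessages() → if messages are due + container is not running → wakeContainer()
+  Step 3: enforceRunningContainerSla() ← heartbeat/claim-stuck check
+  Step 4: resetStuckProcessingRows() ← cleanup after crashed containers
+  Step 5: handleRecurrence() ← scan completed recurring tasks, clone the next occurrence
+```
+
+**Key ordering:** Step 2 (waking for due messages) runs before Step 4 (crashed-container cleanup), which gives a new container the chance to clear its own orphaned `processing_ack` at startup.
+
+### Edge cases
+
+- **No dedicated table**: tasks are `messages_in` rows; the `kind` field in the core `messages_in` schema is enough to tell them apart
+- **`series_id`**: every occurrence of a recurring task shares the same `series_id`. Cancel/pause/resume match on `id OR series_id` and affect the whole series
+- **Timezone**: cron expressions are parsed in the user-configured `TIMEZONE` (from `.env`), not UTC
+- **Pre-task scripts gating**: a task row can carry a `script` field. The `applyPreTaskScripts()` hook (`poll-loop.ts:149,323`) runs the script first. If it returns `wakeAgent: false`, the task is marked completed without waking the agent, which enables conditions such as "run only when the user is active"
+- **The agent cannot write to inbound.db**: the container writes the task as an outbound message with `kind: 'system'`; the host's delivery action handler is the component that actually inserts into `messages_in`, which preserves the single-writer invariant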