docs: add detailed answers for 21 learning roadmap questions
docs/answers/04-container-lifecycle.md
@@ -0,0 +1,142 @@

# Q10-Q12: Container lifecycle

---

## Q10: What gets mounted when an agent container starts?

### Answer

Container mounts are built by `buildMounts()` in `src/container-runner.ts:242-335`. The filesystem layout inside the container:

| Container path | Host source | Access | Purpose |
|----------------|-------------|--------|---------|
| `/workspace/` | `data/v2-sessions/<agentGroup>/<session>/` | RW | Session directory: `inbound.db`, `outbound.db`, `.heartbeat`, `outbox/`, `inbox/` |
| `/workspace/agent/` | `groups/<folder>/` | RW | Per-group working files plus `CLAUDE.local.md` |
| `/workspace/agent/container.json` | `groups/<folder>/container.json` | **RO** | Nested RO overlay: the agent can read it but not modify it (line 276-278) |
| `/workspace/agent/CLAUDE.md` | `groups/<folder>/CLAUDE.md` | **RO** | Composed CLAUDE.md, regenerated at spawn (line 287-290) |
| `/workspace/agent/.claude-fragments/` | `groups/<folder>/.claude-fragments/` | **RO** | Per-skill/per-MCP instruction fragments |
| `/workspace/global/` | `groups/global/` | **RO** | Shared global memory |
| `/app/CLAUDE.md` | `container/CLAUDE.md` | **RO** | Shared base CLAUDE.md, imported through the `.claude-shared.md` symlink |
| `/home/node/.claude/` | `data/v2-sessions/<agentGroup>/.claude-shared/` | RW | Claude SDK state, `settings.json`, skill symlinks |
| `/app/src/` | `container/agent-runner/src/` | **RO** | Shared agent-runner TypeScript source |
| `/app/skills/` | `container/skills/` | **RO** | Shared container skills |
| Extra | `containerConfig.additionalMounts` | → | Provider-contributed mounts (line 330) |
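
To make the nested-overlay idea concrete, here is a minimal sketch of how a mount list like the one above could be turned into volume arguments. The `Mount` shape, `sketchMounts()`, and `toVolumeArgs()` are hypothetical names for illustration, and the Docker-style `-v host:container[:ro]` syntax is an assumption; the real `buildMounts()` and `buildContainerArgs()` may look quite different.

```typescript
// Hypothetical shape for illustration; not taken from the real source.
interface Mount {
  hostPath: string;
  containerPath: string;
  readonly: boolean;
}

// A subset of the mount table. Order matters: the RW group directory comes
// first, and the RO overlays that live inside it are listed after it.
function sketchMounts(sessionDir: string, groupDir: string): Mount[] {
  return [
    { hostPath: sessionDir, containerPath: "/workspace", readonly: false },
    { hostPath: groupDir, containerPath: "/workspace/agent", readonly: false },
    {
      hostPath: groupDir + "/container.json",
      containerPath: "/workspace/agent/container.json",
      readonly: true,
    },
    {
      hostPath: groupDir + "/CLAUDE.md",
      containerPath: "/workspace/agent/CLAUDE.md",
      readonly: true,
    },
    {
      hostPath: groupDir + "/.claude-fragments",
      containerPath: "/workspace/agent/.claude-fragments",
      readonly: true,
    },
  ];
}

// Render each mount as a Docker-style `-v host:container[:ro]` pair (assumed syntax).
function toVolumeArgs(mounts: Mount[]): string[] {
  const args: string[] = [];
  for (const m of mounts) {
    args.push("-v", m.hostPath + ":" + m.containerPath + (m.readonly ? ":ro" : ""));
  }
  return args;
}

const args = toVolumeArgs(sketchMounts("data/v2-sessions/g1/s1", "groups/demo"));
console.log(args.join(" "));
```

The sketch uses relative host paths for readability; a real container runtime would typically expect absolute ones.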

### Call chain

1. `spawnContainer()` (line 108) → `buildMounts()` (line 134)
2. Pre-mount initialization:
   - `initGroupFilesystem(agentGroup)` (line 253): idempotently creates `groups/<folder>/`, `CLAUDE.local.md`, the `.claude-shared/` directory, and the DB row
   - `syncSkillSymlinks()` (line 257): creates symlinks under `.claude-shared/skills/` according to the `skills` selection in `container.json`
   - `composeGroupClaudeMd(agentGroup)` (line 261): regenerates the composed CLAUDE.md
3. Mounts are assembled in order (line 267-333)
4. All volume mounts are passed to `buildContainerArgs()` (line 447-453)
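
### Edge cases

- `CLAUDE.md` and `.claude-fragments/` are nested RO mounts layered over the RW group directory, so the only CLAUDE.md file the agent can write is `CLAUDE.local.md`
- `container.json` gets its own RO mount so the agent cannot modify its own configuration
- Skill symlinks point at in-container paths (`/app/skills/<name>`); they dangle on the host and resolve only inside the container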
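
The dangling-symlink point can be reproduced in a few lines. This is a sketch under stated assumptions: it uses a temporary directory as a stand-in for `.claude-shared/skills/`, the skill name `welcome` is only an example, and it assumes a Linux or macOS host where creating a symlink to a missing target is allowed.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Temporary stand-in for `.claude-shared/skills/` on the host.
const skillsDir = fs.mkdtempSync(path.join(os.tmpdir(), "skills-"));

// The link target is an in-container path; nothing has to exist there on the host.
const linkPath = path.join(skillsDir, "welcome");
const target = "/app/skills/welcome";
fs.symlinkSync(target, linkPath);

// lstat inspects the link itself, so it succeeds even when the target is missing.
const isLink = fs.lstatSync(linkPath).isSymbolicLink();
// readlink returns the stored target string without resolving it.
const storedTarget = fs.readlinkSync(linkPath);
// existsSync follows the link, so it is false wherever the target is absent.
const resolvesHere = fs.existsSync(linkPath);

console.log({ isLink: isLink, storedTarget: storedTarget, resolvesHere: resolvesHere });

// Clean up the temporary directory.
fs.rmSync(skillsDir, { recursive: true, force: true });
```

On a typical host `resolvesHere` is false, while inside the container, where `/app/skills/` is mounted, the same link would resolve.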

---

## Q11: How is the agent's system prompt assembled?

### Answer

The agent's system prompt is assembled from three parts: **(A)** the shared base `CLAUDE.md`, **(B)** per-skill/per-MCP instruction fragments, and **(C)** a runtime addendum (identity + destinations).

### Host-side composition (at spawn)

`src/claude-md-compose.ts:43-136` → `composeGroupClaudeMd()`:

1. **Shared base symlink** (line 49-50): `groups/<folder>/.claude-shared.md` → `/app/CLAUDE.md` (21 lines of general agent instructions: communication style, workspace, memory, conversation history)

2. **Fragment discovery** (line 58-107):
   - **Skill fragments** (line 66-76): any skill under `container/skills/` that has an `instructions.md`
   - **Built-in module fragments** (line 83-96): `.instructions.md` files under `container/agent-runner/src/mcp-tools/`. **`cli.instructions.md` is skipped when `cli_scope='disabled'`**
   - **MCP server fragments** (line 100-107): the `instructions` field of each external MCP server in `container.json`, written out as an inline fragment file

3. **Fragment reconciliation** (line 110-122): removes stale fragments that are no longer needed and creates or updates symlinks

4. **Composed entry point** (line 125-130): writes `groups/<folder>/CLAUDE.md`, which contains only import directives:

   ```
   @./.claude-shared.md
   @./.claude-fragments/skill-onecli-gateway.md
   @./.claude-fragments/skill-welcome.md
   @./.claude-fragments/module-cli.md
   ```

   Claude Code follows the `@` imports to resolve every fragment

5. **Per-group memory** (line 132-135): ensures `CLAUDE.local.md` exists; it is the only writable CLAUDE.md file
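
The discovery, reconciliation, and entry-point steps above can be sketched as three small pure functions. Everything here is illustrative: `FragmentPlan`, `wantedFragments()`, `staleFragments()`, and `composeEntry()` are hypothetical names, and the `skill-<name>.md` / `module-<name>.md` naming is inferred from the example imports rather than confirmed from the source. MCP server fragments are left out for brevity.

```typescript
// Hypothetical input: what fragment discovery might hand to the composer.
interface FragmentPlan {
  skills: string[];   // skills that ship an instructions.md
  modules: string[];  // built-in modules that ship a .instructions.md
  cliScope: string;   // value of cli_scope, e.g. "disabled"
}

// Names of the fragment files that should exist after this spawn.
function wantedFragments(plan: FragmentPlan): string[] {
  const names: string[] = [];
  for (const s of plan.skills) names.push("skill-" + s + ".md");
  for (const m of plan.modules) {
    // Mirrors the rule above: the cli fragment is skipped when cli_scope is disabled.
    if (m === "cli" && plan.cliScope === "disabled") continue;
    names.push("module-" + m + ".md");
  }
  return names;
}

// Fragments already on disk that are no longer wanted and should be removed.
function staleFragments(existing: string[], wanted: string[]): string[] {
  return existing.filter((name) => wanted.indexOf(name) === -1);
}

// The composed CLAUDE.md holds nothing but `@` import lines.
function composeEntry(wanted: string[]): string {
  const lines = ["@./.claude-shared.md"];
  for (const name of wanted) lines.push("@./.claude-fragments/" + name);
  return lines.join("\n") + "\n";
}

const wanted = wantedFragments({
  skills: ["onecli-gateway", "welcome"],
  modules: ["cli"],
  cliScope: "disabled",
});
const stale = staleFragments(["skill-welcome.md", "module-cli.md"], wanted);
const entry = composeEntry(wanted);
console.log(entry);
console.log("stale:", stale);
```

Because `CLAUDE.md` is regenerated from the wanted set on every spawn, a fragment that drops out of the plan (here `module-cli.md`) disappears from both the directory and the import list in one pass.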

### Container-side runtime addendum

`container/agent-runner/src/destinations.ts:82-92` → `buildSystemPromptAddendum()`:

- **Identity** (line 85-87): if `assistantName` is set, emits `"You are <name>"` plus guidance on self-introduction and signing messages
- **Destination map** (line 94-130): read from the `destinations` table in `inbound.db` and rendered as a "Sending messages" section
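
A rough sketch of how such an addendum might be assembled follows. The `Destination` row shape, the exact wording of each block, and the choice to omit the "Sending messages" section when there are no destinations are all assumptions for illustration; only the two-part structure (identity, then destination map) comes from the description above.

```typescript
// Hypothetical row shape; the real `destinations` table columns are not shown here.
interface Destination {
  name: string;
  description: string;
}

function sketchAddendum(
  assistantName: string | undefined,
  destinations: Destination[]
): string {
  const parts: string[] = [];

  // Identity block: emitted only when a name is configured.
  if (assistantName) {
    parts.push("You are " + assistantName + ".");
  }

  // Destination map rendered as a "Sending messages" section.
  if (destinations.length > 0) {
    const rows = destinations.map((d) => "- " + d.name + ": " + d.description);
    parts.push(["## Sending messages"].concat(rows).join("\n"));
  }

  return parts.join("\n\n");
}

const addendum = sketchAddendum("Nova", [
  { name: "ops", description: "Ops channel" },
]);
console.log(addendum);
```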

### What each part contributes

| Source | Content | Agent-modifiable? |
|--------|---------|-------------------|
| `container/CLAUDE.md` (shared base) | General agent behavior | No (RO mount) |
| Skill `instructions.md` | Per-skill guidance | No (RO) |
| MCP tool `.instructions.md` | How to use the built-in tools | No (RO) |
| MCP server `instructions` | Guidance for external MCP servers | Admin only (in DB) |
| `CLAUDE.local.md` | Per-group memory | **Yes** (the only writable one) |
| `buildSystemPromptAddendum()` | Identity + destination map | Generated automatically |

---

## Q12: How is the container heartbeat detected? How is a live process with a stuck poll loop discovered?

### Answer

The heartbeat is a **file-touch mechanism**, not a DB write, which avoids cross-mount DB write contention.

### Container side: heartbeat touch

- **Path**: `/workspace/.heartbeat` (`container/agent-runner/src/db/connection.ts:25`)
- **`touchHeartbeat()`** (`connection.ts:156-168`): updates the file's mtime with `fs.utimesSync()`, falling back to `fs.writeFileSync()` on failure
- **When it fires**: `poll-loop.ts:361`, inside the `for await (const event of query.events)` loop, so every SDK event (`init`, `result`, `error`, `progress`) triggers a touch. This means the heartbeat is **updated only while the agent is actively streaming a reply**; it is not updated in the idle gaps between polls
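
The touch-with-fallback pattern is small enough to sketch in full. The function names below are hypothetical, and treating a missing file as mtime `0` on the reading side is an assumption made here to match the "no heartbeat yet" case; the real `touchHeartbeat()` may differ in detail.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Container side: bump the mtime, creating the file if the touch fails.
function touchHeartbeatSketch(file: string): void {
  const now = new Date();
  try {
    // Fast path: update atime/mtime on an existing file.
    fs.utimesSync(file, now, now);
  } catch {
    // Fallback: the file is missing (for example right after spawn), so create it.
    fs.writeFileSync(file, "");
  }
}

// Host side: read the mtime; 0 stands for "no heartbeat file yet" (assumption).
function heartbeatMtimeMsSketch(file: string): number {
  try {
    return fs.statSync(file).mtimeMs;
  } catch {
    return 0;
  }
}

// Demo in a temporary directory standing in for /workspace.
const hbDir = fs.mkdtempSync(path.join(os.tmpdir(), "hb-"));
const hbFile = path.join(hbDir, ".heartbeat");
const before = heartbeatMtimeMsSketch(hbFile);
touchHeartbeatSketch(hbFile);
const after = heartbeatMtimeMsSketch(hbFile);
console.log({ before: before, after: after });
fs.rmSync(hbDir, { recursive: true, force: true });
```

Only file metadata changes on the fast path, which is what lets the host read liveness without opening either SQLite database.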

### Host side: stuck detection

The host sweep runs every 60s (`host-sweep.ts:61`) and calls `enforceRunningContainerSla()` (line 192 → line 228) for every session that has a running container.

**Two detection tiers** (`decideStuckAction()`, line 82-118):

**1. Absolute ceiling** (line 91-105): heartbeat mtime age > `max(30 min, current Bash timeout)` → `kill-ceiling`
- `ABSOLUTE_CEILING_MS = 30 * 60 * 1000`
- Extended by `declaredBashMs`: `container_state` is read from `outbound.db`; if the current tool is Bash and it has a `tool_declared_timeout_ms`, the ceiling is extended to that value
- **Key guard** (line 92-98): `heartbeatMtimeMs === 0` (just spawned, no SDK activity yet) → skip

**2. Claim-stuck, per message** (line 107-115): for each `processing` row in `processing_ack`, if `(claim_age > tolerance)` and `(heartbeat_mtime <= status_changed)` → `kill-claim`
- `CLAIM_STUCK_MS = 60s`: no heartbeat at all within 60s of claiming a message → the poll loop is stuck
- The condition `heartbeat_mtime <= claimedAt` (with `claimedAt` being the row's `status_changed`) detects exactly "claimed a message, then showed no sign of life"
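
The two tiers can be expressed as one pure decision function. This is a sketch, not the real `decideStuckAction()`: the `StuckInput` shape and the `"ok"` result are hypothetical, and using `max(CLAIM_STUCK_MS, declared Bash timeout)` as the claim tolerance is an assumption based on the statement that a declared Bash timeout extends the tolerance.

```typescript
const ABSOLUTE_CEILING_MS = 30 * 60 * 1000;
const CLAIM_STUCK_MS = 60 * 1000;

type StuckAction = "ok" | "kill-ceiling" | "kill-claim";

// Hypothetical input shape for illustration.
interface StuckInput {
  nowMs: number;
  heartbeatMtimeMs: number;      // 0 when no heartbeat file exists yet
  declaredBashMs: number | null; // tool_declared_timeout_ms if the current tool is Bash
  claimedAtMs: number[];         // status_changed of each `processing` row
}

function decideStuckActionSketch(input: StuckInput): StuckAction {
  const bashMs = input.declaredBashMs === null ? 0 : input.declaredBashMs;

  // Tier 1: absolute ceiling, skipped for a fresh spawn with no heartbeat yet.
  if (input.heartbeatMtimeMs !== 0) {
    const ceiling = Math.max(ABSOLUTE_CEILING_MS, bashMs);
    if (input.nowMs - input.heartbeatMtimeMs > ceiling) return "kill-ceiling";
  }

  // Tier 2: a claim older than the tolerance with no heartbeat since the claim.
  // Assumption: tolerance = max(CLAIM_STUCK_MS, declared Bash timeout).
  const tolerance = Math.max(CLAIM_STUCK_MS, bashMs);
  for (const claimedAt of input.claimedAtMs) {
    const claimAge = input.nowMs - claimedAt;
    if (claimAge > tolerance && input.heartbeatMtimeMs <= claimedAt) {
      return "kill-claim";
    }
  }
  return "ok";
}

const now = 10 * 60 * 60 * 1000; // arbitrary "current time" for the demo
const stuckClaim = decideStuckActionSketch({
  nowMs: now,
  heartbeatMtimeMs: now - 120 * 1000, // last event two minutes ago
  declaredBashMs: null,
  claimedAtMs: [now - 90 * 1000],     // claimed 90s ago, after the last heartbeat
});
console.log(stuckClaim);
```

Note that a heartbeat of `0` still satisfies `heartbeat_mtime <= claimedAt`, so a freshly spawned container that claims a message and never emits an event is caught by tier 2 even though tier 1 is skipped.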

**3. Container not running** (line 199-201, outside `decideStuckAction()`): `!isContainerRunning` → `resetStuckProcessingRows()` reschedules the messages with exponential backoff (base 5s × 2^tries, at most 5 attempts)
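
As a worked example of the backoff formula, the sketch below computes the delay for each attempt. The function name is hypothetical, `tries` is assumed to start at 0, and returning `null` once the budget is spent is just a placeholder; what the host does with a message after the last attempt is not described here.

```typescript
const BACKOFF_BASE_MS = 5000;
const MAX_TRIES = 5;

// Delay before the next attempt, or null once the retry budget is spent.
function nextRetryDelayMs(tries: number): number | null {
  if (tries >= MAX_TRIES) return null;
  return BACKOFF_BASE_MS * Math.pow(2, tries);
}

const delays = [0, 1, 2, 3, 4].map((t) => nextRetryDelayMs(t));
console.log(delays); // [ 5000, 10000, 20000, 40000, 80000 ]
```

Under these assumptions the five attempts are spaced 5s, 10s, 20s, 40s, and 80s apart.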

### Full scenario: process alive, poll loop stuck

```
Container process alive, poll loop stuck:
  poll-loop.ts:101 → markProcessing(ids) → DB: status='processing', status_changed=NOW
  poll-loop.ts:174 → config.provider.query(...) → starts, but the SDK hangs
  [no heartbeat touch, because no event fires]
  ...
Host sweep (60s later):
  → getProcessingClaims(outDb) → finds the claim; status_changed is old
  → heartbeatMtimeMs() → returns a stale mtime (from the last event before the hang)
  → decideStuckAction(): claimAge > 60s, heartbeat_mtime <= claimedAt
  → action: 'kill-claim' → killContainer() + resetStuckProcessingRows()
```
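
### Edge cases

- **Grace period for a fresh spawn**: the ceiling check is skipped while the heartbeat file does not exist, but the claim-stuck check still covers "claimed a message and got stuck at the door"
- **Heartbeat cleared on every spawn** (`container-runner.ts:155`): prevents a stale mtime left by the previous container from triggering an immediate kill
- **Orphan claim cleanup** (line 319): after a kill, `deleteOrphanProcessingClaims()` removes the processing rows from `outbound.db` so the sweep does not immediately kill the new container
- **Custom Bash timeout**: extends the tolerance so that long-running Bash commands are not killed by mistake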