nanoclaw/docs/v1-vs-v2/container-runner.md
gavrielc 47950671fa docs: add v1→v2 action-items analysis + SDK signal probe tool
- docs/v1-vs-v2/: full v1→v2 regression analysis (SUMMARY + 21 per-module
  docs + ACTION-ITEMS rollup with decisions + timezone recreation spec).
- container/agent-runner/scripts/sdk-signal-probe.ts: empirical harness
  used to characterise Claude Agent SDK event/hook/stderr timing for the
  stuck-detection design in item 9.
- src/channels/chat-sdk-bridge.ts: document the conversations Map staleness
  in a code comment; fix deferred to when dynamic group registration lands
  (ACTION-ITEMS item 17).

No runtime behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 01:00:04 +03:00

# container-runner: v1 vs v2
## Scope
- v1: `src/v1/container-runner.ts` (677 LOC) + `container-runner.test.ts` (204 LOC) — spawn + IPC plumbing + stdin/stdout JSON + process supervision + output-marker parsing
- v2: `src/container-runner.ts` (405 LOC) + `src/container-config.ts` (114 LOC) + `src/session-manager.ts` (DB paths). The runner file itself is 272 LOC smaller than in v1 (677 → 405), mainly from eliminating IPC and output parsing
## Capability map
| v1 behavior | v2 location | Status | Notes |
|---|---|---|---|
| Image selection | `container-runner.ts:348-349` | kept | Reads `imageTag` from container.json or env |
| Env injection | `container-runner.ts:266-284` | **changed** | Replaced IPC vars with `SESSION_INBOUND/OUTBOUND_DB_PATH`, `SESSION_HEARTBEAT_PATH`, `AGENT_PROVIDER`, `NANOCLAW_*` admin IDs |
| Volume mounts | `container-runner.ts:200-252` | **changed** | Removed per-group IPC dir; added session folder `/workspace` + agent group `/workspace/agent` |
| Mount validation | `container-runner.ts:240-244` | kept | Validates `additionalMounts` from container.json |
| Provider integration | `container-runner.ts:184-198` | **new** | `resolveProviderContribution()` wires provider host-side configs |
| stdin/stdout IPC | — | **removed** | v1 lines 318-387; v2 uses DB polling only; stdio=`['ignore','pipe','pipe']` |
| Process spawn | `container-runner.ts:119` | kept | |
| OneCLI `ensureAgent` + `applyContainerConfig` | `container-runner.ts:301-313` | enhanced | v2 calls `ensureAgent` first |
| Admin ID injection | `container-runner.ts:289-295` | **new** | Queries `getOwners/getGlobalAdmins/getAdminsOfAgentGroup` at wake |
| Idle timeout | `container-runner.ts:135-140` | **changed** | v2 uses `resetIdle()` callback on activeContainers entry, settable by `delivery.ts` |
| Timeout logic | — | **removed** | v1 had configurable per-group timeout reset on output markers |
| Output parsing | — | **removed** | v1 parsed `---NANOCLAW_OUTPUT_START/END---` from stdout; v2 ignores stdout |
| Streaming output callback | — | **removed** | v1 had `onOutput()` for real-time delivery |
| Per-exit log file | — | **removed** | v1 wrote `groups/<folder>/logs/container-*.log` with full I/O; v2 only logs stderr to logger.debug |
| Graceful SIGTERM→SIGKILL | — | simplified | v2 just calls `stopContainer()` |
| Concurrent wake dedup | `container-runner.ts:44-82` | **new** | `wakePromises` Map prevents race on spawn |
| Per-group image builds | `container-runner.ts:357-405` | **new** | `buildAgentGroupImage()` writes `imageTag` |
| Session folder init | `container-runner.ts:210` | **new** | `initGroupFilesystem()` at spawn |
| Heartbeat file `/workspace/.heartbeat` | session-manager.ts | **new** | File-touch replaces IPC liveness |
| Task/group JSON snapshots (`current_tasks.json`, `available_groups.json`) | — | **removed** | v2 pushes data via inbound.db writeDestinations/writeSessionRouting |
| Container name | `container-runner.ts:103` | **changed** | `nanoclaw-v2-${folder}-${Date.now()}` |
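
The "Concurrent wake dedup" row can be illustrated with a minimal sketch. Only the `wakePromises` name comes from the table; the key type (a session id), the `wakeOnce` helper, and the `SpawnFn` signature are assumptions made for illustration and are not taken from `container-runner.ts`.

```typescript
// Hypothetical sketch of in-flight wake deduplication.
// Assumption: wakes are keyed by a session id string.
type SpawnFn = (sessionId: string) => Promise<void>;

const wakePromises = new Map<string, Promise<void>>();

function wakeOnce(sessionId: string, spawn: SpawnFn): Promise<void> {
  // A wake for this session is already in flight: share its promise.
  const inFlight = wakePromises.get(sessionId);
  if (inFlight) return inFlight;

  const pending = spawn(sessionId).finally(() => {
    // Clear the entry once the spawn settles so a later wake starts fresh.
    wakePromises.delete(sessionId);
  });
  wakePromises.set(sessionId, pending);
  return pending;
}
```

Two callers that wake the same session while the spawn is still pending receive the same promise, so only one container is started; a wake issued after the spawn settles starts a new one.
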
## Missing from v2
1. **Streaming output markers**: `---NANOCLAW_OUTPUT_START/END---` enabled pre-completion delivery; v2 must wait for the outbound.db poll to deliver results
2. **Configurable per-group timeout**: the `group.containerConfig.timeout` override is gone; all groups share `IDLE_TIMEOUT`
3. **Per-exit detailed logs**: v1 wrote timestamped logs with full I/O + mounts + stderr + stdout; invaluable for post-mortem
4. **Graceful-stop sentinel**: v1 sent SIGTERM and waited for `_close` marker before SIGKILL
5. **JSON snapshots for tasks/groups**: `current_tasks.json` / `available_groups.json` in the group IPC dir
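
To make item 1 concrete, here is a rough sketch of incremental marker parsing over stdout chunks. It assumes the shorthand `---NANOCLAW_OUTPUT_START/END---` expands to the two literal markers shown below. `createMarkerParser` and its callback are hypothetical names, and this is not the v1 implementation.

```typescript
// Assumed literal markers (expanded from the shorthand used in this doc).
const START = '---NANOCLAW_OUTPUT_START---';
const END = '---NANOCLAW_OUTPUT_END---';

// Returns a feed() function that accepts stdout chunks and calls onOutput
// once per complete START...END block, even when markers span chunks.
function createMarkerParser(onOutput: (payload: string) => void) {
  let buffer = '';
  return function feed(chunk: string): void {
    buffer += chunk;
    for (;;) {
      const start = buffer.indexOf(START);
      if (start === -1) {
        // Keep only a tail that could be the beginning of a split START marker.
        buffer = buffer.slice(-(START.length - 1));
        return;
      }
      const end = buffer.indexOf(END, start + START.length);
      if (end === -1) {
        // Block is still open: drop leading noise, wait for more data.
        buffer = buffer.slice(start);
        return;
      }
      onOutput(buffer.slice(start + START.length, end).trim());
      buffer = buffer.slice(end + END.length);
    }
  };
}
```

Because each block is emitted as soon as its end marker arrives, a caller can deliver output before the container exits, which is the latency property that v2's poll-only model gives up.
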
## Behavioral discrepancies
1. **Async result model**: v1 `runContainerAgent()` returned `Promise<ContainerOutput>` with the result inline; v2 `wakeContainer()` is fire-and-forget, and results arrive asynchronously via the delivery poll
2. **No stdin**: v1 wrote full `ContainerInput` JSON to stdin; v2 container reads everything from inbound.db
3. **Admin injection at wake**: v2 queries admins fresh on every spawn (`NANOCLAW_ADMIN_USER_IDS`)
4. **Destination routing timing**: v2 calls `writeDestinations()` + `writeSessionRouting()` on every wake so changes apply without restart
5. **Session lifecycle**: v1 created a session per spawn; v2 resolves session via router before wake
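
The fire-and-forget model in item 1 can be pictured with a toy sketch. An in-memory array stands in for outbound.db, and every name here (`OutboundMessage`, `wakeContainerSketch`, `pollOnce`) is hypothetical; the real v2 code path is not shown in this document.

```typescript
// Hypothetical stand-in for outbound.db: an in-memory queue of pending results.
interface OutboundMessage {
  sessionId: string;
  text: string;
}

const outbound: OutboundMessage[] = [];

// v2-style wake: returns nothing, so the caller never sees a result inline.
// Here the "container" enqueues its result immediately for simplicity.
function wakeContainerSketch(sessionId: string): void {
  outbound.push({ sessionId, text: `result for ${sessionId}` });
}

// Delivery poll: drain whatever is pending and hand each message to a sender.
// Returns how many messages were delivered in this pass.
function pollOnce(send: (message: OutboundMessage) => void): number {
  const batch = outbound.splice(0, outbound.length);
  for (const message of batch) send(message);
  return batch.length;
}
```

The caller of the wake function has nothing to await; delivery latency is bounded by how often the poll runs rather than by when the agent produced its output.
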
## Worth preserving?
- **Streaming output**: Meaningful latency improvement. Hybrid model (DB polling + optional marker pre-delivery) could reduce perceived latency for long outputs
- **Per-group timeout**: Restore — different agent groups have different expected latencies
- **Per-exit logs**: At minimum, restore on non-zero exit. Cheap forensics, huge debug value
- **Graceful-stop sentinel**: Not critical — bun container is disposable
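
If the per-group timeout and exit logs are restored, the decision logic might look roughly like this. `group.containerConfig.timeout` and `IDLE_TIMEOUT` are named above; the millisecond unit, the placeholder default value, and both helper functions are assumptions for illustration only.

```typescript
// Assumption: timeouts are in milliseconds; the default below is a placeholder,
// not the project's actual IDLE_TIMEOUT value.
const IDLE_TIMEOUT = 30 * 60 * 1000;

interface ContainerConfig {
  timeout?: number;
}

// Prefer a valid per-group override, otherwise fall back to the shared default.
function resolveTimeout(config?: ContainerConfig): number {
  const override = config?.timeout;
  const valid =
    typeof override === 'number' && Number.isFinite(override) && override > 0;
  return valid ? override : IDLE_TIMEOUT;
}

// Minimum policy suggested above: write a detailed log only when the exit
// was not clean. A null exit code (killed by a signal) counts as not clean.
function shouldWriteExitLog(exitCode: number | null): boolean {
  return exitCode !== 0;
}
```

Rejecting non-positive or non-finite overrides keeps a malformed `container.json` from disabling the idle timeout entirely.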