nanoclaw/docs/v1-vs-v2/index-host.md
gavrielc 47950671fa docs: add v1→v2 action-items analysis + SDK signal probe tool
- docs/v1-vs-v2/: full v1→v2 regression analysis (SUMMARY + 21 per-module
  docs + ACTION-ITEMS rollup with decisions + timezone recreation spec).
- container/agent-runner/scripts/sdk-signal-probe.ts: empirical harness
  used to characterise Claude Agent SDK event/hook/stderr timing for the
  stuck-detection design in item 9.
- src/channels/chat-sdk-bridge.ts: document the conversations Map staleness
  in a code comment; fix deferred to when dynamic group registration lands
  (ACTION-ITEMS item 17).

No runtime behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 01:00:04 +03:00


host index: v1 vs v2

Scope

  • v1: src/v1/index.ts (647 LOC) — monolithic entry: config, DB, state, channels, queues, scheduler, IPC watcher, message loop
  • v2: src/index.ts (345 LOC) — lean entry: DB+migrations, channels, delivery/sweep polls, OneCLI handler

Startup sequence diff

| # | v1 step | v2 step | Status |
|---|---------|---------|--------|
| 1 | ensureContainerRuntimeRunning() + cleanupOrphans() | same | kept |
| 2 | initDatabase() | initDb() + runMigrations() | enhanced (explicit migrations) |
| 3 | loadState(): cursor, groups, agent timestamps | | removed (no global state) |
| 4 | OneCLI ensureAgent per group | | removed (now per-wake in container-runner.ts) |
| 5 | restoreRemoteControl() | | removed |
| 6 | SIGTERM/SIGINT handlers | same | kept |
| 7 | handleRemoteControl bind | | removed |
| 8 | Channel options + callbacks | initChannelAdapters() | rewritten (adapter API) |
| 9 | Channel discovery + connection | | absorbed into adapters |
| 10 | startSchedulerLoop() | | removed (folded into startHostSweep) |
| 11 | startIpcWatcher() | | removed (no IPC in v2) |
| 12 | startSessionCleanup() | | removed (folded into startHostSweep) |
| 13 | queue.setProcessMessagesFn() | | removed (GroupQueue gone) |
| 14 | recoverPendingMessages() | | removed (implicit in sweep) |
| 15 | startMessageLoop() (polling) | startActiveDeliveryPoll() + startSweepDeliveryPoll() | fundamentally changed (event-driven) |
| 16 | | startHostSweep() | new |
| 17 | | startOneCLIApprovalHandler() | new |
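
The v2 steps listed above can be read as one ordered, fail-fast sequence. The following is a minimal sketch of that ordering, not the contents of src/index.ts: the step names come from the v2 column, while `runStartup`, the `Step` type, the stub bodies, and the name `registerSignalHandlers` are assumptions made for illustration. The real functions may be async or take arguments.

```typescript
// Sketch only: step names are taken from the v2 column above; bodies are stubs.
type Step = { name: string; run: () => void };

// Runs steps in order and stops at the first failure, reporting how far startup got.
function runStartup(steps: Step[]): string[] {
  const completed: string[] = [];
  for (const step of steps) {
    try {
      step.run();
    } catch (err) {
      throw new Error(
        `startup failed at ${step.name} (completed: ${completed.join(", ")}): ${String(err)}`
      );
    }
    completed.push(step.name);
  }
  return completed;
}

const stub = (): void => {};

// Order follows the row numbers in the table; rows with no v2 step are omitted.
const v2Startup: Step[] = [
  { name: "ensureContainerRuntimeRunning", run: stub },
  { name: "cleanupOrphans", run: stub },
  { name: "initDb", run: stub },
  { name: "runMigrations", run: stub },
  { name: "registerSignalHandlers", run: stub }, // hypothetical name for the SIGTERM/SIGINT row
  { name: "initChannelAdapters", run: stub },
  { name: "startActiveDeliveryPoll", run: stub },
  { name: "startSweepDeliveryPoll", run: stub },
  { name: "startHostSweep", run: stub },
  { name: "startOneCLIApprovalHandler", run: stub },
];

const startupOrder = runStartup(v2Startup);
```

Keeping the sequence as data makes the v1-to-v2 diff easy to audit: a removed step is simply absent from the list, and a failing step names itself in the error.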

Capability map

| v1 behavior | v2 location | Status | Notes |
|-------------|-------------|--------|-------|
| Arg/env parsing | src/config.ts (shared) | kept | |
| Central DB init | src/index.ts:47-50 | kept | + runMigrations() |
| Container runtime bring-up | src/index.ts:52-54 | kept | identical |
| Global cursor + timestamps state | | removed | v2 session-scoped state in outbound.db |
| Periodic message polling loop | | removed | Replaced by event-driven delivery + 60s sweep |
| OneCLI group-wide sync at startup | | removed | Per-wake in container-runner.ts:303 |
| Remote control subsystem | | removed | No equivalent; feature deferred |
| Group message queue (GroupQueue) | | removed | DB-based serialization |
| Channel adapter array + callbacks | src/channels/channel-registry.ts | refactored | ChannelAdapter interface |
| Pending message recovery on startup | | removed | Sweep detects stale containers + resets messages |
| IPC watcher (dynamic group add) | | removed | Static topology at startup; restart to add groups |
| Signal handlers | src/index.ts:339-340 | kept | Simplified teardown |
| Top-level error handling | src/index.ts:342-345 | kept | Same fatal exit |
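
The note that the sweep "detects stale containers + resets messages" can be illustrated with a small in-memory sketch. Everything here is an assumption for illustration: the `MessageRow` shape, the status names, the `claimedAtMs` field, and the threshold are hypothetical, and the real sweep presumably works against the database and container state rather than an array.

```typescript
// Hypothetical row shape; the real schema is not shown in this document.
type MessageStatus = "pending" | "processing" | "done";

interface MessageRow {
  id: number;
  status: MessageStatus;
  claimedAtMs: number | null; // when a container picked the message up
}

// Resets messages that have been "processing" for longer than staleAfterMs so
// that a later delivery pass can pick them up again. Returns the reset ids.
function resetStaleMessages(rows: MessageRow[], nowMs: number, staleAfterMs: number): number[] {
  const reset: number[] = [];
  for (const row of rows) {
    const stale =
      row.status === "processing" &&
      row.claimedAtMs !== null &&
      nowMs - row.claimedAtMs > staleAfterMs;
    if (stale) {
      row.status = "pending";
      row.claimedAtMs = null;
      reset.push(row.id);
    }
  }
  return reset;
}

const sweepRows: MessageRow[] = [
  { id: 1, status: "processing", claimedAtMs: 0 },     // claimed long ago: stale
  { id: 2, status: "processing", claimedAtMs: 55000 }, // claimed recently: left alone
  { id: 3, status: "done", claimedAtMs: 0 },           // finished: left alone
];
const resetIds = resetStaleMessages(sweepRows, 60000, 30000);
```

With this shape, recovery latency is bounded by the staleness threshold plus the sweep interval, which is the figure a post-crash test would need to measure.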

Missing from v2

  1. Polling message loop (v1:370-459) — replaced by event-driven + sweep (net improvement)
  2. GroupQueue state machine — now DB-based
  3. Cross-restart cursor state — no lastAgentTimestamp persisted; recovery implicit via DB scan
  4. Remote control — gone
  5. Explicit recoverPendingMessages() — implicit in sweep; worth verifying via post-crash test
  6. IPC watcher (startIpcWatcher) — cannot add groups dynamically; restart required
  7. Scheduler loop — merged into sweep's due-message wake
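
To make "GroupQueue state machine, now DB-based" concrete, here is a hedged sketch of per-group serialization expressed as a claim over rows instead of an in-memory queue. The `QueuedMessage` shape, the status values, the ordering by `id`, and `claimNextForGroup` are all hypothetical; the actual v2 schema and queries are not shown in this document.

```typescript
// Hypothetical row shape standing in for a messages table.
interface QueuedMessage {
  id: number;
  groupId: string;
  status: "pending" | "processing" | "done";
}

// Claims the oldest pending message for a group, but only when nothing for that
// group is already being processed. Against a real database, this check and
// update would need to run inside a single transaction.
function claimNextForGroup(queue: QueuedMessage[], groupId: string): QueuedMessage | null {
  const forGroup = queue.filter((m) => m.groupId === groupId);
  if (forGroup.some((m) => m.status === "processing")) return null;
  const next = forGroup
    .filter((m) => m.status === "pending")
    .sort((a, b) => a.id - b.id)[0];
  if (!next) return null;
  next.status = "processing";
  return next;
}

const queue: QueuedMessage[] = [
  { id: 1, groupId: "a", status: "pending" },
  { id: 2, groupId: "a", status: "pending" },
  { id: 3, groupId: "b", status: "pending" },
];
const firstClaim = claimNextForGroup(queue, "a");  // claims message 1
const blockedClaim = claimNextForGroup(queue, "a"); // null: group "a" is busy
const otherGroup = claimNextForGroup(queue, "b");  // claims message 3
```

The serialization guarantee then lives in row state, so it survives a host restart in a way an in-memory queue does not.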

Behavioral discrepancies

| Aspect | v1 | v2 |
|--------|----|----|
| Startup time | ~500ms (long loop init) | ~200ms |
| Message fetch | polling every POLL_INTERVAL | event-driven callbacks + 1s delivery poll |
| Container spawn | on-demand via GroupQueue | per-message wake via router/sweep |
| Group topology | dynamic (IPC watcher) | static at startup |
| Error recovery | per-message cursor rollback | implicit via stale detection |
| Shutdown | GroupQueue 10s grace then disconnect | stop handlers/polls/sweep/adapters in order |
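
The v2 shutdown row ("stop handlers/polls/sweep/adapters in order") could look roughly like the sketch below. The `Stoppable` type, `shutdownInOrder`, and the component names are assumptions for illustration; in particular, treating the OneCLI approval handler as the "handlers" entry is a guess, and whether v2 continues past a failing component is not stated in this document.

```typescript
type Stoppable = { name: string; stop: () => void };

// Stops components in the given order. A failure in one component is recorded
// but does not prevent the remaining components from being stopped.
function shutdownInOrder(components: Stoppable[]): { stopped: string[]; failed: string[] } {
  const stopped: string[] = [];
  const failed: string[] = [];
  for (const component of components) {
    try {
      component.stop();
      stopped.push(component.name);
    } catch {
      failed.push(component.name);
    }
  }
  return { stopped, failed };
}

// Hypothetical component list mirroring handlers -> polls -> sweep -> adapters.
const shutdownResult = shutdownInOrder([
  { name: "approvalHandler", stop: () => {} },
  { name: "activeDeliveryPoll", stop: () => {} },
  { name: "sweepDeliveryPoll", stop: () => { throw new Error("timer already cleared"); } },
  { name: "hostSweep", stop: () => {} },
  { name: "channelAdapters", stop: () => {} },
]);
```

Stopping producers of new work (handlers, polls, sweep) before the adapters means nothing tries to deliver through a channel that has already disconnected.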

Worth preserving?

  1. Polling loop: No. Event-driven delivery is superior, but verify under load that delivery poll latency has not regressed relative to the old POLL_INTERVAL
  2. Pending-message recovery: Worth restoring explicitly. Test: kill a container mid-message, restart the host, and verify re-delivery within 5s. If the sweep doesn't cover this, add a startup-phase scan
  3. Remote control: Unknown — either restore as opt-in skill or document removal
  4. Dynamic group add (IPC watcher): Probably not worth restoring. The modern flow is "admin skill adds group to DB, restart", but the restart requirement should be documented
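
If the post-crash test in item 2 shows that the sweep alone is too slow, a startup-phase scan is one option. The sketch below is a minimal illustration under stated assumptions: the `InFlightRow` shape and `recoverOnStartup` are hypothetical, and it assumes no container survives a host restart, which should be confirmed against what cleanupOrphans() actually does.

```typescript
// Hypothetical row shape; the real schema is not shown in this document.
interface InFlightRow {
  id: number;
  status: "pending" | "processing" | "done";
}

// Startup-phase scan: right after a host restart, anything still "processing"
// is assumed to belong to a container that no longer exists, so it is reset
// immediately instead of waiting for the sweep's staleness threshold.
function recoverOnStartup(rows: InFlightRow[]): number[] {
  const recovered: number[] = [];
  for (const row of rows) {
    if (row.status === "processing") {
      row.status = "pending";
      recovered.push(row.id);
    }
  }
  return recovered;
}

const rowsAtBoot: InFlightRow[] = [
  { id: 10, status: "done" },
  { id: 11, status: "processing" }, // was mid-flight when the host died
  { id: 12, status: "pending" },
];
const recoveredIds = recoverOnStartup(rowsAtBoot);
```

Unlike the periodic sweep, this runs once before the delivery polls start, so re-delivery is not delayed by the staleness threshold.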