
gavrielc 47950671fa docs: add v1→v2 action-items analysis + SDK signal probe tool

- docs/v1-vs-v2/: full v1→v2 regression analysis (SUMMARY + 21 per-module
  docs + ACTION-ITEMS rollup with decisions + timezone recreation spec).
- container/agent-runner/scripts/sdk-signal-probe.ts: empirical harness
  used to characterise Claude Agent SDK event/hook/stderr timing for the
  stuck-detection design in item 9.
- src/channels/chat-sdk-bridge.ts: document the conversations Map staleness
  in a code comment; fix deferred to when dynamic group registration lands
  (ACTION-ITEMS item 17).

No runtime behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-20 01:00:04 +03:00


host index: v1 vs v2

Scope

v1: src/v1/index.ts (647 LOC) — monolithic entry: config, DB, state, channels, queues, scheduler, IPC watcher, message loop
v2: src/index.ts (345 LOC) — lean entry: DB+migrations, channels, delivery/sweep polls, OneCLI handler

Startup sequence diff

| # | v1 step | v2 step | Status |
|---|---------|---------|--------|
| 1 | `ensureContainerRuntimeRunning()` + `cleanupOrphans()` | same | kept |
| 2 | `initDatabase()` | `initDb()` + `runMigrations()` | enhanced (explicit migrations) |
| 3 | `loadState()` (cursor, groups, agent timestamps) | — | removed (no global state) |
| 4 | OneCLI `ensureAgent` per group | — | removed (now per-wake in `container-runner.ts`) |
| 5 | `restoreRemoteControl()` | — | removed |
| 6 | SIGTERM/SIGINT handlers | same | kept |
| 7 | `handleRemoteControl` bind | — | removed |
| 8 | Channel options + callbacks | `initChannelAdapters()` | rewritten (adapter API) |
| 9 | Channel discovery + connection | absorbed into adapters | — |
| 10 | `startSchedulerLoop()` | — | removed (folded into `startHostSweep`) |
| 11 | `startIpcWatcher()` | — | removed (no IPC in v2) |
| 12 | `startSessionCleanup()` | — | removed (folded into `startHostSweep`) |
| 13 | `queue.setProcessMessagesFn()` | — | removed (GroupQueue gone) |
| 14 | `recoverPendingMessages()` | — | removed (implicit in sweep) |
| 15 | `startMessageLoop()` (polling) | `startActiveDeliveryPoll()` + `startSweepDeliveryPoll()` | fundamentally changed (event-driven) |
| 16 | — | `startHostSweep()` | new |
| 17 | — | `startOneCLIApprovalHandler()` | new |
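
The v2 column above can be read as a straight-line boot sequence. A minimal sketch, assuming each step is a plain synchronous function: the stubs, the `Step` type, and `runStartup` are hypothetical, and the relative order of DB init and container bring-up is inferred from the `src/index.ts` line ranges cited in the capability map, not confirmed against the source. Signal-handler registration is omitted.

```typescript
// Hypothetical stand-ins: the real functions live in the v2 codebase and
// their signatures are not shown in this document.
type Step = { name: string; run: () => void };

const callLog: string[] = [];
const stub = (name: string): Step => ({
  name,
  run: () => {
    callLog.push(name);
  },
});

// Assumed order: DB first, then container runtime, channels, polls, sweep, OneCLI.
const v2Startup: Step[] = [
  stub("initDb"),
  stub("runMigrations"),
  stub("ensureContainerRuntimeRunning"),
  stub("cleanupOrphans"),
  stub("initChannelAdapters"),
  stub("startActiveDeliveryPoll"),
  stub("startSweepDeliveryPoll"),
  stub("startHostSweep"),
  stub("startOneCLIApprovalHandler"),
];

// Runs each step in order and names the failing step if one throws.
function runStartup(steps: Step[]): string[] {
  for (const step of steps) {
    try {
      step.run();
    } catch (err) {
      throw new Error(`startup failed at ${step.name}: ${String(err)}`);
    }
  }
  return steps.map((s) => s.name);
}
```

Naming the failing step in the error keeps the single top-level fatal exit useful when startup aborts partway through.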

Capability map

| v1 behavior | v2 location | Status | Notes |
|-------------|-------------|--------|-------|
| Arg/env parsing | `src/config.ts` (shared) | kept | |
| Central DB init | `src/index.ts:47-50` | kept | + `runMigrations()` |
| Container runtime bring-up | `src/index.ts:52-54` | kept | identical |
| Global cursor + timestamps state | — | removed | v2 session-scoped state in outbound.db |
| Periodic message polling loop | — | removed | Replaced by event-driven delivery + 60s sweep |
| OneCLI group-wide sync at startup | — | removed | Per-wake in `container-runner.ts:303` |
| Remote control subsystem | — | removed | No equivalent; feature deferred |
| Group message queue (`GroupQueue`) | — | removed | DB-based serialization |
| Channel adapter array + callbacks | `src/channels/channel-registry.ts` | refactored | `ChannelAdapter` interface |
| Pending message recovery on startup | — | removed | Sweep detects stale containers + resets messages |
| IPC watcher (dynamic group add) | — | removed | Static topology at startup; restart to add groups |
| Signal handlers | `src/index.ts:339-340` | kept | Simplified teardown |
| Top-level error handling | `src/index.ts:342-345` | kept | Same fatal exit |

Missing from v2

Polling message loop (v1:370-459) — replaced by event-driven + sweep (net improvement)
GroupQueue state machine — now DB-based
Cross-restart cursor state — no lastAgentTimestamp persisted; recovery implicit via DB scan
Remote control — gone
Explicit recoverPendingMessages() — implicit in sweep; worth verifying via post-crash test
IPC watcher (startIpcWatcher) — cannot add groups dynamically; restart required
Scheduler loop — merged into sweep's due-message wake
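
Two of the items above (implicit recovery and the merged scheduler) both rest on the sweep. A minimal in-memory sketch of what one sweep pass might do, assuming a message row carries a status, a due time, and a claim time: the `Message` shape, `sweepOnce`, and the `STALE_AFTER_MS` threshold are all hypothetical and not taken from the v2 schema.

```typescript
// All names and fields are hypothetical; the real schema is not shown here.
type Status = "pending" | "processing" | "done";

interface Message {
  id: number;
  status: Status;
  dueAt: number; // epoch ms at which the message becomes deliverable
  claimedAt: number | null; // epoch ms when a container picked it up
}

const STALE_AFTER_MS = 60_000; // assumed threshold, not from the codebase

// One sweep pass: reset stale in-flight messages, then collect due messages to wake.
function sweepOnce(
  messages: Message[],
  now: number,
): { reset: number[]; wake: number[] } {
  const reset: number[] = [];
  const wake: number[] = [];
  for (const m of messages) {
    const stale =
      m.status === "processing" &&
      m.claimedAt !== null &&
      now - m.claimedAt > STALE_AFTER_MS;
    if (stale) {
      m.status = "pending";
      m.claimedAt = null;
      reset.push(m.id);
    }
    // A message reset in this pass is eligible to be woken in the same pass.
    if (m.status === "pending" && m.dueAt <= now) {
      wake.push(m.id);
    }
  }
  return { reset, wake };
}
```

Under this model, recovery latency after a crash is bounded by the staleness threshold plus the sweep interval, which is the number the post-crash test would need to measure.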

Behavioral discrepancies

| Aspect | v1 | v2 |
|--------|----|----|
| Startup time | ~500ms (long loop init) | ~200ms |
| Message fetch | polling every POLL_INTERVAL | event-driven callbacks + 1s delivery poll |
| Container spawn | on-demand via GroupQueue | per-message wake via router/sweep |
| Group topology | dynamic (IPC watcher) | static at startup |
| Error recovery | per-message cursor rollback | implicit via stale detection |
| Shutdown | GroupQueue 10s grace then disconnect | stop handlers/polls/sweep/adapters in order |
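
The v2 shutdown row describes an ordered teardown. A minimal sketch of that pattern, assuming each component exposes a synchronous `stop()`: the `Stoppable` interface and `shutdownInOrder` are hypothetical names, and continuing past a failed component is a design choice made for this sketch, not confirmed v2 behavior.

```typescript
// Hypothetical shape: anything with a name and a stop() method.
interface Stoppable {
  name: string;
  stop: () => void;
}

// Stops components in the given order; a failure in one does not skip the rest.
function shutdownInOrder(components: Stoppable[]): {
  stopped: string[];
  failed: string[];
} {
  const stopped: string[] = [];
  const failed: string[] = [];
  for (const c of components) {
    try {
      c.stop();
      stopped.push(c.name);
    } catch {
      failed.push(c.name);
    }
  }
  return { stopped, failed };
}

// Usage with the order given in the table: handlers, polls, sweep, adapters.
const noop = (name: string): Stoppable => ({ name, stop: () => {} });
const teardown = shutdownInOrder([
  noop("handlers"),
  noop("polls"),
  noop("sweep"),
  noop("adapters"),
]);
```

Stopping producers (handlers, polls, sweep) before the adapters means nothing tries to deliver through an adapter that has already disconnected.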

Worth preserving?

Polling loop: No. Event-driven delivery is superior, but verify that delivery-poll latency under load does not regress against the old POLL_INTERVAL.
Pending-message recovery: Worth restoring explicitly if the sweep falls short. Test: kill a container mid-message, restart the host, and verify re-delivery within 5s. If the sweep doesn't cover this, add a startup-phase scan.
Remote control: Unknown. Either restore it as an opt-in skill or document its removal.
Dynamic group add (IPC watcher): Probably not worth preserving. The modern flow is "admin skill adds group to DB, restart", but document that a restart is required.
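
If the post-crash test shows the sweep does not cover recovery, the startup-phase scan mentioned above could look roughly like this. A minimal sketch over in-memory rows, assuming each in-flight message records the container that claimed it: `InFlight`, `recoverOnStartup`, and the field names are hypothetical, and a real version would run as a query against the DB.

```typescript
// Hypothetical row shape; the real v2 tables are not shown in this document.
interface InFlight {
  id: number;
  status: "pending" | "processing";
  containerId: string | null;
}

// Resets every "processing" message whose container is not in the live set,
// so it becomes eligible for delivery again. Returns the recovered ids.
function recoverOnStartup(
  rows: InFlight[],
  liveContainers: Set<string>,
): number[] {
  const recovered: number[] = [];
  for (const row of rows) {
    const orphaned =
      row.containerId === null || !liveContainers.has(row.containerId);
    if (row.status === "processing" && orphaned) {
      row.status = "pending";
      row.containerId = null;
      recovered.push(row.id);
    }
  }
  return recovered;
}
```

Running this once before the delivery polls start would make re-delivery latency depend on the 1s delivery poll instead of on the sweep's stale detection.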
