nanoclaw

Author	SHA1	Message	Date
Ethan Munoz	ec23bd7a7e	fix(host-sweep): parse SQLite timestamps as UTC, not local time SQLite TIMESTAMP columns store UTC without a zone marker. `Date.parse` treats timezoneless ISO strings as local time, so on any non-UTC host every claim and processAfter looks (TZ offset) hours stale. That makes fresh claims trip the kill-claim path on the first sweep tick — every container gets killed within seconds of spawn. Two affected sites in host-sweep.ts: - decideStuckAction reads claim.status_changed and computes claimAge. On a TZ=Europe/Madrid host (UTC+2), a claim made 5s ago looks 7205s old and exceeds CLAIM_STUCK_MS (60s). - The orphan retry loop reads msg.processAfter and skips messages rescheduled into the future. On the same host, future timestamps look (TZ offset) hours in the past, so the skip is missed and tries gets bumped on every tick. Fix: introduce parseSqliteUtc(s) which appends "Z" only when no zone marker is present, then call it from both sites. Behavior under TZ=UTC is unchanged. Verified on a production v2 install on TZ=Europe/Madrid: with the patch applied, an idle container survived 30+ minutes without being killed (previously: killed within 60s of spawn). Tests: 5 new cases covering the bare/Z/+offset/invalid input matrix and a TZ-independence check. All 19 host-sweep tests pass and tsc clears against main.	2026-05-05 23:49:18 +02:00
Ethan	7ce9922cde	fix(host-sweep): clear orphan processing_ack on kill to prevent claim-stuck loop When the host kills a container (absolute-ceiling, claim-stuck, or crashed), resetStuckProcessingRows reset messages_in but left orphan rows in processing_ack. The next sweep tick spawned a fresh container and, on the same tick, ran enforceRunningContainerSla against outbound.db that still contained the previous container's claim with a hours-old status_changed timestamp — instant kill-claim, before the agent-runner could open outbound.db to run its own clearStaleProcessingAcks(). Loop until tries hit MAX_TRIES. Add deleteOrphanProcessingClaims() in session-db and call it at the end of resetStuckProcessingRows. Safe to write outbound.db here because the host only enters this path after killContainer (or when no container is running). Tests in host-sweep.test.ts cover the helper plus the regression: orphan claim from a 2h-old kill is now removed atomically with the messages_in reset, so the next sweep tick sees an empty claims list and the freshly respawned container survives long enough to start its agent-runner. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 12:54:42 +02:00
gavrielc	0105de0257	fix(host-sweep): skip ceiling check when heartbeat file is absent decideStuckAction treated a missing heartbeat file as heartbeatAge = Infinity, which always exceeded the 30-minute ceiling. Result: every freshly-spawned container got killed within seconds of spawn on the first sweep pass because it hadn't produced an SDK event yet (heartbeat is only touched on SDK events inside processQuery, not on boot). Skip the ceiling branch when heartbeatMtimeMs === 0. Containers that genuinely never wrote a heartbeat because they died are caught by the separate "container process not running" cleanup path. Containers that boot, claim a message, but hang at the gate are caught by the claim-stuck check below — which correctly fires regardless of heartbeat presence once claimAge exceeds tolerance. Updates the "absent heartbeat → kill-ceiling" test (which was encoding the bug) and adds a companion that the claim-stuck path still fires for absent-heartbeat containers with aged claims. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 12:15:52 +03:00
gavrielc	6a815190c0	feat(lifecycle): stuck detection + heartbeat lifecycle + SDK tool blocklist Replaces the two overlapping old mechanisms (30-min setTimeout kill in container-runner, 10-min heartbeat STALE_THRESHOLD reset in host-sweep) with message-scoped stuck detection anchored to the processing_ack claim age + an absolute 30-min ceiling that extends for long-declared Bash tools. Old model problems: - IDLE_TIMEOUT setTimeout fired on plain wall-clock time; slow-but-alive agents got killed at 30min regardless of activity - 10-min STALE_THRESHOLD in the sweep was unreliable — the heartbeat is only touched on SDK events, so legitimate silent tool work (sleep 30, long WebFetch, npm install) looked identical to a hung container - Two overlapping sources of truth for "when to let go of a container" New model: - Host sweep is the single source of truth. - Container exposes a new `container_state` single-row table in outbound.db (schema added; container writes, host reads). PreToolUse hook writes current_tool + tool_declared_timeout_ms (read from Bash's tool_input); PostToolUse / PostToolUseFailure clear it. - Sweep decides with a pure helper `decideStuckAction`: * absolute ceiling — kill if heartbeat age > max(30min, bash_timeout) * per-claim stuck — kill if any processing_ack row has claim_age > max(60s, bash_timeout) AND heartbeat hasn't been touched since claim * otherwise ok Kill paths reset leftover processing rows with exponential backoff, reusing the existing retry machinery. Tool blocklist expanded: - AskUserQuestion (SDK placeholder; we have mcp__nanoclaw__ask_user_question) - EnterPlanMode, ExitPlanMode, EnterWorktree, ExitWorktree (Claude Code UI affordances; would hang in headless containers) PreToolUse hook is also defense-in-depth: if a disallowed tool name slips through, it returns `{ decision: 'block' }` so the agent sees a clear error instead of appearing stuck. Removed: - container-runner.ts: IDLE_TIMEOUT setTimeout, resetIdle callback on activeContainers entry, resetContainerIdleTimer export. - delivery.ts: the resetContainerIdleTimer call on successful delivery. - poll-loop.ts: IDLE_END_MS + its setInterval. Keeping the query open is cheaper than close+reopen (no cold prompt cache). Liveness is now a host-side concern. - host-sweep.ts: 10-min STALE_THRESHOLD_MS + getStuckProcessingIds in the stale-detection path (still exported for kill reset). Tests: - src/host-sweep.test.ts — 9 tests for decideStuckAction covering: fresh heartbeat, absolute ceiling, absent heartbeat, Bash-timeout extension (both ceiling and per-claim), claim age below tolerance, heartbeat touched after claim, unparseable timestamps. Ref: docs/v1-vs-v2/ACTION-ITEMS.md items 9, 6a, 10. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 01:16:57 +03:00

4 Commits