nanoclaw

wangjunxiao/nanoclaw

Fork 0

Commit Graph

Author	SHA1	Message	Date
gavrielc	0105de0257	fix(host-sweep): skip ceiling check when heartbeat file is absent decideStuckAction treated a missing heartbeat file as heartbeatAge = Infinity, which always exceeded the 30-minute ceiling. Result: every freshly-spawned container got killed within seconds of spawn on the first sweep pass because it hadn't produced an SDK event yet (heartbeat is only touched on SDK events inside processQuery, not on boot). Skip the ceiling branch when heartbeatMtimeMs === 0. Containers that genuinely never wrote a heartbeat because they died are caught by the separate "container process not running" cleanup path. Containers that boot, claim a message, but hang at the gate are caught by the claim-stuck check below — which correctly fires regardless of heartbeat presence once claimAge exceeds tolerance. Updates the "absent heartbeat → kill-ceiling" test (which was encoding the bug) and adds a companion that the claim-stuck path still fires for absent-heartbeat containers with aged claims. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 12:15:52 +03:00
gavrielc	6a815190c0	feat(lifecycle): stuck detection + heartbeat lifecycle + SDK tool blocklist Replaces the two overlapping old mechanisms (30-min setTimeout kill in container-runner, 10-min heartbeat STALE_THRESHOLD reset in host-sweep) with message-scoped stuck detection anchored to the processing_ack claim age + an absolute 30-min ceiling that extends for long-declared Bash tools. Old model problems: - IDLE_TIMEOUT setTimeout fired on plain wall-clock time; slow-but-alive agents got killed at 30min regardless of activity - 10-min STALE_THRESHOLD in the sweep was unreliable — the heartbeat is only touched on SDK events, so legitimate silent tool work (sleep 30, long WebFetch, npm install) looked identical to a hung container - Two overlapping sources of truth for "when to let go of a container" New model: - Host sweep is the single source of truth. - Container exposes a new `container_state` single-row table in outbound.db (schema added; container writes, host reads). PreToolUse hook writes current_tool + tool_declared_timeout_ms (read from Bash's tool_input); PostToolUse / PostToolUseFailure clear it. - Sweep decides with a pure helper `decideStuckAction`: * absolute ceiling — kill if heartbeat age > max(30min, bash_timeout) * per-claim stuck — kill if any processing_ack row has claim_age > max(60s, bash_timeout) AND heartbeat hasn't been touched since claim * otherwise ok Kill paths reset leftover processing rows with exponential backoff, reusing the existing retry machinery. Tool blocklist expanded: - AskUserQuestion (SDK placeholder; we have mcp__nanoclaw__ask_user_question) - EnterPlanMode, ExitPlanMode, EnterWorktree, ExitWorktree (Claude Code UI affordances; would hang in headless containers) PreToolUse hook is also defense-in-depth: if a disallowed tool name slips through, it returns `{ decision: 'block' }` so the agent sees a clear error instead of appearing stuck. Removed: - container-runner.ts: IDLE_TIMEOUT setTimeout, resetIdle callback on activeContainers entry, resetContainerIdleTimer export. - delivery.ts: the resetContainerIdleTimer call on successful delivery. - poll-loop.ts: IDLE_END_MS + its setInterval. Keeping the query open is cheaper than close+reopen (no cold prompt cache). Liveness is now a host-side concern. - host-sweep.ts: 10-min STALE_THRESHOLD_MS + getStuckProcessingIds in the stale-detection path (still exported for kill reset). Tests: - src/host-sweep.test.ts — 9 tests for decideStuckAction covering: fresh heartbeat, absolute ceiling, absent heartbeat, Bash-timeout extension (both ceiling and per-claim), claim age below tolerance, heartbeat touched after claim, unparseable timestamps. Ref: docs/v1-vs-v2/ACTION-ITEMS.md items 9, 6a, 10. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 01:16:57 +03:00

Author

SHA1

Message

Date

gavrielc

0105de0257

fix(host-sweep): skip ceiling check when heartbeat file is absent

decideStuckAction treated a missing heartbeat file as heartbeatAge =
Infinity, which always exceeded the 30-minute ceiling. Result: every
freshly-spawned container got killed within seconds of spawn on the
first sweep pass because it hadn't produced an SDK event yet (heartbeat
is only touched on SDK events inside processQuery, not on boot).

Skip the ceiling branch when heartbeatMtimeMs === 0. Containers that
genuinely never wrote a heartbeat because they died are caught by the
separate "container process not running" cleanup path. Containers that
boot, claim a message, but hang at the gate are caught by the
claim-stuck check below — which correctly fires regardless of heartbeat
presence once claimAge exceeds tolerance.

Updates the "absent heartbeat → kill-ceiling" test (which was encoding
the bug) and adds a companion that the claim-stuck path still fires for
absent-heartbeat containers with aged claims.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-20 12:15:52 +03:00

gavrielc

6a815190c0

feat(lifecycle): stuck detection + heartbeat lifecycle + SDK tool blocklist

Replaces the two overlapping old mechanisms (30-min setTimeout kill in
container-runner, 10-min heartbeat STALE_THRESHOLD reset in host-sweep)
with message-scoped stuck detection anchored to the processing_ack claim
age + an absolute 30-min ceiling that extends for long-declared Bash
tools.

Old model problems:
- IDLE_TIMEOUT setTimeout fired on plain wall-clock time; slow-but-alive
  agents got killed at 30min regardless of activity
- 10-min STALE_THRESHOLD in the sweep was unreliable — the heartbeat is
  only touched on SDK events, so legitimate silent tool work (sleep 30,
  long WebFetch, npm install) looked identical to a hung container
- Two overlapping sources of truth for "when to let go of a container"

New model:
- Host sweep is the single source of truth.
- Container exposes a new `container_state` single-row table in outbound.db
  (schema added; container writes, host reads). PreToolUse hook writes
  current_tool + tool_declared_timeout_ms (read from Bash's tool_input);
  PostToolUse / PostToolUseFailure clear it.
- Sweep decides with a pure helper `decideStuckAction`:
    * absolute ceiling — kill if heartbeat age > max(30min, bash_timeout)
    * per-claim stuck  — kill if any processing_ack row has claim_age >
      max(60s, bash_timeout) AND heartbeat hasn't been touched since claim
    * otherwise ok
  Kill paths reset leftover processing rows with exponential backoff,
  reusing the existing retry machinery.

Tool blocklist expanded:
- AskUserQuestion (SDK placeholder; we have mcp__nanoclaw__ask_user_question)
- EnterPlanMode, ExitPlanMode, EnterWorktree, ExitWorktree (Claude Code UI
  affordances; would hang in headless containers)
PreToolUse hook is also defense-in-depth: if a disallowed tool name slips
through, it returns `{ decision: 'block' }` so the agent sees a clear
error instead of appearing stuck.

Removed:
- container-runner.ts: IDLE_TIMEOUT setTimeout, resetIdle callback on
  activeContainers entry, resetContainerIdleTimer export.
- delivery.ts: the resetContainerIdleTimer call on successful delivery.
- poll-loop.ts: IDLE_END_MS + its setInterval. Keeping the query open is
  cheaper than close+reopen (no cold prompt cache). Liveness is now a
  host-side concern.
- host-sweep.ts: 10-min STALE_THRESHOLD_MS + getStuckProcessingIds in the
  stale-detection path (still exported for kill reset).

Tests:
- src/host-sweep.test.ts — 9 tests for decideStuckAction covering: fresh
  heartbeat, absolute ceiling, absent heartbeat, Bash-timeout extension
  (both ceiling and per-claim), claim age below tolerance, heartbeat
  touched after claim, unparseable timestamps.

Ref: docs/v1-vs-v2/ACTION-ITEMS.md items 9, 6a, 10.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-20 01:16:57 +03:00

2 Commits