feat(lifecycle): stuck detection + heartbeat lifecycle + SDK tool blocklist
Replaces the two overlapping old mechanisms (30-min setTimeout kill in
container-runner, 10-min heartbeat STALE_THRESHOLD reset in host-sweep)
with message-scoped stuck detection anchored to the processing_ack claim
age + an absolute 30-min ceiling that extends for long-declared Bash
tools.
Old model problems:
- IDLE_TIMEOUT setTimeout fired on plain wall-clock time; slow-but-alive
agents got killed at 30min regardless of activity
- 10-min STALE_THRESHOLD in the sweep was unreliable — the heartbeat is
only touched on SDK events, so legitimate silent tool work (sleep 30,
long WebFetch, npm install) looked identical to a hung container
- Two overlapping sources of truth for "when to let go of a container"
New model:
- Host sweep is the single source of truth.
- Container exposes a new `container_state` single-row table in outbound.db
(schema added; container writes, host reads). PreToolUse hook writes
current_tool + tool_declared_timeout_ms (read from Bash's tool_input);
PostToolUse / PostToolUseFailure clear it.
- Sweep decides with a pure helper `decideStuckAction`:
* absolute ceiling — kill if heartbeat age > max(30min, bash_timeout)
* per-claim stuck — kill if any processing_ack row has claim_age >
max(60s, bash_timeout) AND heartbeat hasn't been touched since claim
* otherwise ok
Kill paths reset leftover processing rows with exponential backoff,
reusing the existing retry machinery.
Tool blocklist expanded:
- AskUserQuestion (SDK placeholder; we have mcp__nanoclaw__ask_user_question)
- EnterPlanMode, ExitPlanMode, EnterWorktree, ExitWorktree (Claude Code UI
affordances; would hang in headless containers)
PreToolUse hook is also defense-in-depth: if a disallowed tool name slips
through, it returns `{ decision: 'block' }` so the agent sees a clear
error instead of appearing stuck.
Removed:
- container-runner.ts: IDLE_TIMEOUT setTimeout, resetIdle callback on
activeContainers entry, resetContainerIdleTimer export.
- delivery.ts: the resetContainerIdleTimer call on successful delivery.
- poll-loop.ts: IDLE_END_MS + its setInterval. Keeping the query open is
cheaper than close+reopen (no cold prompt cache). Liveness is now a
host-side concern.
- host-sweep.ts: 10-min STALE_THRESHOLD_MS + getStuckProcessingIds in the
stale-detection path (still exported for kill reset).
Tests:
- src/host-sweep.test.ts — 9 tests for decideStuckAction covering: fresh
heartbeat, absolute ceiling, absent heartbeat, Bash-timeout extension
(both ceiling and per-claim), claim age below tolerance, heartbeat
touched after claim, unparseable timestamps.
Ref: docs/v1-vs-v2/ACTION-ITEMS.md items 9, 6a, 10.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -213,4 +213,16 @@ CREATE TABLE IF NOT EXISTS session_state (
|
||||
value TEXT NOT NULL,
|
||||
updated_at TEXT NOT NULL
|
||||
);
|
||||
|
||||
-- Current tool-in-flight state. Single-row table (id=1). Container writes on
|
||||
-- PreToolUse and clears on PostToolUse / PostToolUseFailure. Host reads in the
|
||||
-- sweep to extend the stuck-tolerance window when Bash is running with a
|
||||
-- declared timeout > 60s (long-running scripts shouldn't be flagged as stuck).
|
||||
CREATE TABLE IF NOT EXISTS container_state (
|
||||
id INTEGER PRIMARY KEY CHECK (id = 1),
|
||||
current_tool TEXT,
|
||||
tool_declared_timeout_ms INTEGER,
|
||||
tool_started_at TEXT,
|
||||
updated_at TEXT NOT NULL
|
||||
);
|
||||
`;
|
||||
|
||||
@@ -161,6 +161,47 @@ export function getStuckProcessingIds(outDb: Database.Database): string[] {
|
||||
).map((r) => r.message_id);
|
||||
}
|
||||
|
||||
export interface ProcessingClaim {
|
||||
message_id: string;
|
||||
status_changed: string;
|
||||
}
|
||||
|
||||
/** Return processing_ack rows still in 'processing' with their claim timestamps. */
|
||||
export function getProcessingClaims(outDb: Database.Database): ProcessingClaim[] {
|
||||
return outDb
|
||||
.prepare(
|
||||
"SELECT message_id, status_changed FROM processing_ack WHERE status = 'processing'",
|
||||
)
|
||||
.all() as ProcessingClaim[];
|
||||
}
|
||||
|
||||
export interface ContainerState {
|
||||
current_tool: string | null;
|
||||
tool_declared_timeout_ms: number | null;
|
||||
tool_started_at: string | null;
|
||||
}
|
||||
|
||||
/**
|
||||
* Read the container's current tool-in-flight state, if any. Returns null
|
||||
* when either the table doesn't exist yet (older session DB) or no tool is
|
||||
* active. Host sweep reads this to widen stuck-detection tolerance while
|
||||
* Bash is running with a long declared timeout.
|
||||
*/
|
||||
export function getContainerState(outDb: Database.Database): ContainerState | null {
|
||||
try {
|
||||
const row = outDb
|
||||
.prepare(
|
||||
`SELECT current_tool, tool_declared_timeout_ms, tool_started_at
|
||||
FROM container_state WHERE id = 1`,
|
||||
)
|
||||
.get() as ContainerState | undefined;
|
||||
return row ?? null;
|
||||
} catch {
|
||||
// Table not present on older session DBs — treat as "no tool in flight".
|
||||
return null;
|
||||
}
|
||||
}
|
||||
|
||||
// ---------------------------------------------------------------------------
|
||||
// messages_out (read-only from host)
|
||||
// ---------------------------------------------------------------------------
|
||||
|
||||
Reference in New Issue
Block a user