feat(lifecycle): stuck detection + heartbeat lifecycle + SDK tool blocklist
Replaces the two overlapping old mechanisms (30-min setTimeout kill in
container-runner, 10-min heartbeat STALE_THRESHOLD reset in host-sweep)
with message-scoped stuck detection anchored to the processing_ack claim
age + an absolute 30-min ceiling that extends for long-declared Bash
tools.
Old model problems:
- IDLE_TIMEOUT setTimeout fired on plain wall-clock time; slow-but-alive
agents got killed at 30min regardless of activity
- 10-min STALE_THRESHOLD in the sweep was unreliable — the heartbeat is
only touched on SDK events, so legitimate silent tool work (sleep 30,
long WebFetch, npm install) looked identical to a hung container
- Two overlapping sources of truth for "when to let go of a container"
New model:
- Host sweep is the single source of truth.
- Container exposes a new `container_state` single-row table in outbound.db
(schema added; container writes, host reads). PreToolUse hook writes
current_tool + tool_declared_timeout_ms (read from Bash's tool_input);
PostToolUse / PostToolUseFailure clear it.
- Sweep decides with a pure helper `decideStuckAction`:
* absolute ceiling — kill if heartbeat age > max(30min, bash_timeout)
* per-claim stuck — kill if any processing_ack row has claim_age >
max(60s, bash_timeout) AND heartbeat hasn't been touched since claim
* otherwise ok
Kill paths reset leftover processing rows with exponential backoff,
reusing the existing retry machinery.
Tool blocklist expanded:
- AskUserQuestion (SDK placeholder; we have mcp__nanoclaw__ask_user_question)
- EnterPlanMode, ExitPlanMode, EnterWorktree, ExitWorktree (Claude Code UI
affordances; would hang in headless containers)
PreToolUse hook is also defense-in-depth: if a disallowed tool name slips
through, it returns `{ decision: 'block' }` so the agent sees a clear
error instead of appearing stuck.
Removed:
- container-runner.ts: IDLE_TIMEOUT setTimeout, resetIdle callback on
activeContainers entry, resetContainerIdleTimer export.
- delivery.ts: the resetContainerIdleTimer call on successful delivery.
- poll-loop.ts: IDLE_END_MS + its setInterval. Keeping the query open is
cheaper than close+reopen (no cold prompt cache). Liveness is now a
host-side concern.
- host-sweep.ts: 10-min STALE_THRESHOLD_MS + getStuckProcessingIds in the
stale-detection path (still exported for kill reset).
Tests:
- src/host-sweep.test.ts — 9 tests for decideStuckAction covering: fresh
heartbeat, absolute ceiling, absent heartbeat, Bash-timeout extension
(both ceiling and per-claim), claim age below tolerance, heartbeat
touched after claim, unparseable timestamps.
Ref: docs/v1-vs-v2/ACTION-ITEMS.md items 9, 6a, 10.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -3,6 +3,7 @@ import path from 'path';
|
||||
|
||||
import { query as sdkQuery, type HookCallback, type PreCompactHookInput } from '@anthropic-ai/claude-agent-sdk';
|
||||
|
||||
import { clearContainerToolInFlight, setContainerToolInFlight } from '../db/connection.js';
|
||||
import { registerProvider } from './provider-registry.js';
|
||||
import type { AgentProvider, AgentQuery, McpServerConfig, ProviderEvent, ProviderOptions, QueryInput } from './types.js';
|
||||
|
||||
@@ -10,10 +11,28 @@ function log(msg: string): void {
|
||||
console.error(`[claude-provider] ${msg}`);
|
||||
}
|
||||
|
||||
// Deferred SDK builtins that would sidestep nanoclaw's own scheduling.
|
||||
// Scheduling goes through mcp__nanoclaw__schedule_task so that tasks are
|
||||
// durable across sessions/restarts and gated by our pre-task script hook.
|
||||
const SDK_DISALLOWED_TOOLS = ['CronCreate', 'CronDelete', 'CronList', 'ScheduleWakeup'];
|
||||
// Deferred SDK builtins that either sidestep nanoclaw's own scheduling or
|
||||
// don't fit our async message-passing model (they're designed for Claude
|
||||
// Code's interactive UI and would hang here).
|
||||
//
|
||||
// - CronCreate / CronDelete / CronList / ScheduleWakeup: we have durable
|
||||
// scheduling via mcp__nanoclaw__schedule_task.
|
||||
// - AskUserQuestion: SDK returns a placeholder instead of blocking on a
|
||||
// real answer — we have mcp__nanoclaw__ask_user_question that persists
|
||||
// the question and blocks on the real reply.
|
||||
// - EnterPlanMode / ExitPlanMode / EnterWorktree / ExitWorktree: Claude
|
||||
// Code UI affordances; in a headless container they'd appear stuck.
|
||||
const SDK_DISALLOWED_TOOLS = [
|
||||
'CronCreate',
|
||||
'CronDelete',
|
||||
'CronList',
|
||||
'ScheduleWakeup',
|
||||
'AskUserQuestion',
|
||||
'EnterPlanMode',
|
||||
'ExitPlanMode',
|
||||
'EnterWorktree',
|
||||
'ExitWorktree',
|
||||
];
|
||||
|
||||
// Tool allowlist for NanoClaw agent containers
|
||||
const TOOL_ALLOWLIST = [
|
||||
@@ -122,6 +141,43 @@ function formatTranscriptMarkdown(messages: ParsedMessage[], title?: string | nu
|
||||
return lines.join('\n');
|
||||
}
|
||||
|
||||
/**
|
||||
* PreToolUse hook: record the current tool + its declared timeout so the host
|
||||
* sweep can widen its stuck tolerance while Bash is running a long-declared
|
||||
* script. Defense-in-depth: if SDK_DISALLOWED_TOOLS slips through somehow,
|
||||
* block the call here instead of letting the agent hang.
|
||||
*/
|
||||
const preToolUseHook: HookCallback = async (input) => {
|
||||
const i = input as { tool_name?: string; tool_input?: Record<string, unknown> };
|
||||
const toolName = i.tool_name ?? '';
|
||||
if (SDK_DISALLOWED_TOOLS.includes(toolName)) {
|
||||
return {
|
||||
decision: 'block',
|
||||
stopReason: `Tool '${toolName}' is not available in this environment — use the nanoclaw equivalent.`,
|
||||
} as unknown as ReturnType<HookCallback>;
|
||||
}
|
||||
// Bash exposes its timeout via the tool_input.timeout field (ms). Any other
|
||||
// tool: no declared timeout.
|
||||
const declaredTimeoutMs =
|
||||
toolName === 'Bash' && typeof i.tool_input?.timeout === 'number' ? (i.tool_input.timeout as number) : null;
|
||||
try {
|
||||
setContainerToolInFlight(toolName, declaredTimeoutMs);
|
||||
} catch (err) {
|
||||
log(`PreToolUse: failed to record container_state: ${err instanceof Error ? err.message : String(err)}`);
|
||||
}
|
||||
return { continue: true };
|
||||
};
|
||||
|
||||
/** Clear in-flight tool on PostToolUse / PostToolUseFailure. */
|
||||
const postToolUseHook: HookCallback = async () => {
|
||||
try {
|
||||
clearContainerToolInFlight();
|
||||
} catch (err) {
|
||||
log(`PostToolUse: failed to clear container_state: ${err instanceof Error ? err.message : String(err)}`);
|
||||
}
|
||||
return { continue: true };
|
||||
};
|
||||
|
||||
function createPreCompactHook(assistantName?: string): HookCallback {
|
||||
return async (input) => {
|
||||
const preCompact = input as PreCompactHookInput;
|
||||
@@ -224,6 +280,9 @@ export class ClaudeProvider implements AgentProvider {
|
||||
settingSources: ['project', 'user'],
|
||||
mcpServers: this.mcpServers,
|
||||
hooks: {
|
||||
PreToolUse: [{ hooks: [preToolUseHook] }],
|
||||
PostToolUse: [{ hooks: [postToolUseHook] }],
|
||||
PostToolUseFailure: [{ hooks: [postToolUseHook] }],
|
||||
PreCompact: [{ hooks: [createPreCompactHook(this.assistantName)] }],
|
||||
},
|
||||
},
|
||||
|
||||
Reference in New Issue
Block a user