feat(lifecycle): stuck detection + heartbeat lifecycle + SDK tool blocklist

Replaces the two overlapping old mechanisms (30-min setTimeout kill in container-runner, 10-min heartbeat STALE_THRESHOLD reset in host-sweep) with message-scoped stuck detection anchored to the processing_ack claim age + an absolute 30-min ceiling that extends for long-declared Bash tools. Old model problems: - IDLE_TIMEOUT setTimeout fired on plain wall-clock time; slow-but-alive agents got killed at 30min regardless of activity - 10-min STALE_THRESHOLD in the sweep was unreliable — the heartbeat is only touched on SDK events, so legitimate silent tool work (sleep 30, long WebFetch, npm install) looked identical to a hung container - Two overlapping sources of truth for "when to let go of a container" New model: - Host sweep is the single source of truth. - Container exposes a new `container_state` single-row table in outbound.db (schema added; container writes, host reads). PreToolUse hook writes current_tool + tool_declared_timeout_ms (read from Bash's tool_input); PostToolUse / PostToolUseFailure clear it. - Sweep decides with a pure helper `decideStuckAction`: * absolute ceiling — kill if heartbeat age > max(30min, bash_timeout) * per-claim stuck — kill if any processing_ack row has claim_age > max(60s, bash_timeout) AND heartbeat hasn't been touched since claim * otherwise ok Kill paths reset leftover processing rows with exponential backoff, reusing the existing retry machinery. Tool blocklist expanded: - AskUserQuestion (SDK placeholder; we have mcp__nanoclaw__ask_user_question) - EnterPlanMode, ExitPlanMode, EnterWorktree, ExitWorktree (Claude Code UI affordances; would hang in headless containers) PreToolUse hook is also defense-in-depth: if a disallowed tool name slips through, it returns `{ decision: 'block' }` so the agent sees a clear error instead of appearing stuck. Removed: - container-runner.ts: IDLE_TIMEOUT setTimeout, resetIdle callback on activeContainers entry, resetContainerIdleTimer export. - delivery.ts: the resetContainerIdleTimer call on successful delivery. - poll-loop.ts: IDLE_END_MS + its setInterval. Keeping the query open is cheaper than close+reopen (no cold prompt cache). Liveness is now a host-side concern. - host-sweep.ts: 10-min STALE_THRESHOLD_MS + getStuckProcessingIds in the stale-detection path (still exported for kill reset). Tests: - src/host-sweep.test.ts — 9 tests for decideStuckAction covering: fresh heartbeat, absolute ceiling, absent heartbeat, Bash-timeout extension (both ceiling and per-claim), claim age below tolerance, heartbeat touched after claim, unparseable timestamps. Ref: docs/v1-vs-v2/ACTION-ITEMS.md items 9, 6a, 10. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 01:16:57 +03:00
parent dcfa12ea06
commit 6a815190c0
12 changed files with 459 additions and 86 deletions
--- a/src/host-sweep.ts
+++ b/src/host-sweep.ts
@@ -2,10 +2,29 @@
 * Host sweep — periodic maintenance of all session DBs.
 *
 * Two-DB architecture:
- *   - Reads processing_ack from outbound.db to sync message status
- *   - Writes to inbound.db (host-owned) for status updates and recurrence
- *   - Uses heartbeat file mtime for stale container detection (not DB writes)
+ *   - Reads processing_ack + container_state from outbound.db
+ *   - Writes to inbound.db (host-owned) for status updates + recurrence
+ *   - Uses heartbeat file mtime for liveness (never polls DB for it)
 *   - Never writes to outbound.db — preserves single-writer-per-file invariant
+ *
+ * Stuck / idle detection (replaces the old IDLE_TIMEOUT setTimeout + 10-min
+ * heartbeat threshold):
+ *
+ *   If the container isn't running and there are 'processing' rows left over
+ *   (e.g. it crashed mid-turn) → reset them to pending with backoff +
+ *   tries++. Existing retry machinery does the rest.
+ *
+ *   If the container IS running:
+ *     1. Absolute ceiling: heartbeat age > max(30 min, current_bash_timeout)
+ *        → kill. Covers the "alive but silent for 30 min" case. Extended
+ *        only while Bash is declared as running longer, honouring the
+ *        user's own timeout directive. Kill then resets processing rows.
+ *
+ *     2. Message-scoped stuck: for each 'processing' row, tolerance =
+ *        max(60s, current_bash_timeout_ms_if_Bash_running). If
+ *        (claim_age > tolerance) AND (heartbeat_mtime <= status_changed)
+ *        → kill + reset this message + tries++. Semantics: "container
+ *        claimed a message and went quiet past tolerance since the claim."
 */
 import type Database from 'better-sqlite3';
 import fs from 'fs';
@@ -14,22 +33,68 @@ import { getActiveSessions } from './db/sessions.js';
 import { getAgentGroup } from './db/agent-groups.js';
 import {
  countDueMessages,
-  syncProcessingAcks,
-  getStuckProcessingIds,
+  getContainerState,
  getMessageForRetry,
+  getProcessingClaims,
  markMessageFailed,
  retryWithBackoff,
+  syncProcessingAcks,
+  type ContainerState,
 } from './db/session-db.js';
 import { log } from './log.js';
 import { openInboundDb, openOutboundDb, inboundDbPath, heartbeatPath } from './session-manager.js';
-import { wakeContainer, isContainerRunning } from './container-runner.js';
+import { isContainerRunning, killContainer, wakeContainer } from './container-runner.js';
 import type { Session } from './types.js';

 const SWEEP_INTERVAL_MS = 60_000;
-const STALE_THRESHOLD_MS = 10 * 60 * 1000; // 10 minutes
+// Absolute idle ceiling for a running container. If the heartbeat file hasn't
+// been touched in this long, the container is either stuck or doing genuinely
+// nothing — kill and restart on the next inbound.
+export const ABSOLUTE_CEILING_MS = 30 * 60 * 1000;
+// Stuck tolerance window applied per 'processing' claim — "did we see any
+// signs of life since this message was claimed?"
+export const CLAIM_STUCK_MS = 60 * 1000;
 const MAX_TRIES = 5;
 const BACKOFF_BASE_MS = 5000;

+export type StuckDecision =
+  | { action: 'ok' }
+  | { action: 'kill-ceiling'; heartbeatAgeMs: number; ceilingMs: number }
+  | { action: 'kill-claim'; messageId: string; claimAgeMs: number; toleranceMs: number };
+
+/**
+ * Pure decision for whether a running container should be killed this sweep
+ * tick. Inputs are all deterministic; filesystem + DB reads happen in the
+ * caller.
+ */
+export function decideStuckAction(args: {
+  now: number;
+  heartbeatMtimeMs: number; // 0 when heartbeat file absent
+  containerState: ContainerState | null;
+  claims: Array<{ message_id: string; status_changed: string }>;
+}): StuckDecision {
+  const { now, heartbeatMtimeMs, containerState, claims } = args;
+  const declaredBashMs = bashTimeoutMs(containerState);
+  const heartbeatAge = heartbeatMtimeMs === 0 ? Infinity : now - heartbeatMtimeMs;
+
+  const ceiling = Math.max(ABSOLUTE_CEILING_MS, declaredBashMs ?? 0);
+  if (heartbeatAge > ceiling) {
+    return { action: 'kill-ceiling', heartbeatAgeMs: heartbeatAge, ceilingMs: ceiling };
+  }
+
+  const tolerance = Math.max(CLAIM_STUCK_MS, declaredBashMs ?? 0);
+  for (const claim of claims) {
+    const claimedAt = Date.parse(claim.status_changed);
+    if (Number.isNaN(claimedAt)) continue;
+    const claimAge = now - claimedAt;
+    if (claimAge <= tolerance) continue;
+    if (heartbeatMtimeMs > claimedAt) continue;
+    return { action: 'kill-claim', messageId: claim.message_id, claimAgeMs: claimAge, toleranceMs: tolerance };
+  }
+
+  return { action: 'ok' };
+}
+
 let running = false;

 export function startHostSweep(): void {
@@ -84,20 +149,26 @@ async function sweepSession(session: Session): Promise<void> {
      syncProcessingAcks(inDb, outDb);
    }

-    // 2. Check for due pending messages → wake container
-    const dueCount = countDueMessages(inDb);
+    const alive = isContainerRunning(session.id);

+    // 2. Crashed-container cleanup: processing rows left behind get retried.
+    if (!alive && outDb) {
+      resetStuckProcessingRows(inDb, outDb, session, 'container not running');
+    }
+
+    // 3. Running-container SLA: absolute ceiling + per-claim stuck rules.
+    if (alive && outDb) {
+      enforceRunningContainerSla(inDb, outDb, session, agentGroup.id);
+    }
+
+    // 4. Wake a container if new work is due and nothing is running.
+    const dueCount = countDueMessages(inDb);
    if (dueCount > 0 && !isContainerRunning(session.id)) {
      log.info('Waking container for due messages', { sessionId: session.id, count: dueCount });
      await wakeContainer(session);
    }

-    // 3. Detect stale containers via heartbeat file
-    if (outDb) {
-      detectStaleContainers(inDb, outDb, session, agentGroup.id);
-    }
-
-    // 4. Handle recurrence for completed messages.
+    // 5. Recurrence fanout for completed recurring tasks.
    // MODULE-HOOK:scheduling-recurrence:start
    const { handleRecurrence } = await import('./modules/scheduling/recurrence.js');
    await handleRecurrence(inDb, session);
@@ -108,45 +179,84 @@ async function sweepSession(session: Session): Promise<void> {
  }
 }

-/**
- * Detect stale containers using heartbeat file mtime.
- * If the heartbeat is older than STALE_THRESHOLD and processing_ack has
- * 'processing' entries, the container likely crashed — reset with backoff.
- */
-function detectStaleContainers(
+function heartbeatMtimeMs(agentGroupId: string, sessionId: string): number {
+  const hbPath = heartbeatPath(agentGroupId, sessionId);
+  try {
+    return fs.statSync(hbPath).mtimeMs;
+  } catch {
+    return 0;
+  }
+}
+
+function bashTimeoutMs(state: ContainerState | null): number | null {
+  if (!state || state.current_tool !== 'Bash') return null;
+  return typeof state.tool_declared_timeout_ms === 'number' ? state.tool_declared_timeout_ms : null;
+}
+
+function enforceRunningContainerSla(
  inDb: Database.Database,
  outDb: Database.Database,
  session: Session,
  agentGroupId: string,
 ): void {
-  const hbPath = heartbeatPath(agentGroupId, session.id);
-  let heartbeatAge = Infinity;
-  try {
-    const stat = fs.statSync(hbPath);
-    heartbeatAge = Date.now() - stat.mtimeMs;
-  } catch {
-    // No heartbeat file — container may never have started, or it's very old
+  const decision = decideStuckAction({
+    now: Date.now(),
+    heartbeatMtimeMs: heartbeatMtimeMs(agentGroupId, session.id),
+    containerState: getContainerState(outDb),
+    claims: getProcessingClaims(outDb),
+  });
+
+  if (decision.action === 'ok') return;
+
+  if (decision.action === 'kill-ceiling') {
+    log.warn('Killing container past absolute ceiling', {
+      sessionId: session.id,
+      heartbeatAgeMs: decision.heartbeatAgeMs,
+      ceilingMs: decision.ceilingMs,
+    });
+    killContainer(session.id, 'absolute-ceiling');
+    resetStuckProcessingRows(inDb, outDb, session, 'absolute-ceiling');
+    return;
  }

-  if (heartbeatAge < STALE_THRESHOLD_MS) return; // Container is alive
+  log.warn('Killing container — message claimed then silent', {
+    sessionId: session.id,
+    messageId: decision.messageId,
+    claimAgeMs: decision.claimAgeMs,
+    toleranceMs: decision.toleranceMs,
+  });
+  killContainer(session.id, 'claim-stuck');
+  resetStuckProcessingRows(inDb, outDb, session, 'claim-stuck');
+}

-  // Heartbeat is stale — check for stuck processing entries
-  const processingIds = getStuckProcessingIds(outDb);
-  if (processingIds.length === 0) return;
-
-  for (const messageId of processingIds) {
-    const msg = getMessageForRetry(inDb, messageId, 'pending');
+function resetStuckProcessingRows(
+  inDb: Database.Database,
+  outDb: Database.Database,
+  session: Session,
+  reason: string,
+): void {
+  const claims = getProcessingClaims(outDb);
+  for (const { message_id } of claims) {
+    const msg = getMessageForRetry(inDb, message_id, 'pending');
    if (!msg) continue;

    if (msg.tries >= MAX_TRIES) {
      markMessageFailed(inDb, msg.id);
-      log.warn('Message marked as failed after max retries', { messageId: msg.id, sessionId: session.id });
+      log.warn('Message marked as failed after max retries', {
+        messageId: msg.id,
+        sessionId: session.id,
+        reason,
+      });
    } else {
      const backoffMs = BACKOFF_BASE_MS * Math.pow(2, msg.tries);
      const backoffSec = Math.floor(backoffMs / 1000);
      retryWithBackoff(inDb, msg.id, backoffSec);
-      log.info('Reset stale message with backoff', { messageId: msg.id, tries: msg.tries, backoffMs });
+      log.info('Reset stale message with backoff', {
+        messageId: msg.id,
+        tries: msg.tries,
+        backoffMs,
+        reason,
+      });
    }
  }
 }
-