fix(sweep): wake before reset + idempotent retry for orphan claims

When a container exits with an unresolved processing_ack claim, the
sweep's crashed-container cleanup would reset the matching inbound
message with tries++ and a future process_after. dueCount then dropped
to 0, so the wake step never fired, and the next sweep tick found the
same orphan claim, bumped tries again, and pushed process_after further
out. The message reached MAX_TRIES and was marked failed without any
container ever being spawned.

Two changes:

1. Reorder sweep so the wake step runs before crashed-container
   cleanup. A fresh container clears orphan 'processing' rows on its
   own startup (container/agent-runner/src/db/connection.ts), so once
   we get it running the claim resolves itself.

2. Make resetStuckProcessingRows idempotent: if a message already has
   process_after set to a future time, skip the retry bump. The wake
   path will pick it up when the backoff elapses. Requires returning
   process_after from getMessageForRetry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
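
The idempotency check in change 2 can be sketched roughly as below. This is an illustration only, not the code from this commit: the helper name hasPendingBackoff and the RetryRow interface are hypothetical, and it assumes process_after is stored as an ISO 8601 string that Date.parse accepts.

```typescript
// Hypothetical row shape, mirroring what getMessageForRetry returns
// after this change (id, tries, processAfter).
interface RetryRow {
  id: string;
  tries: number;
  processAfter: string | null;
}

// Returns true when the message already has a backoff scheduled in the
// future, meaning the retry bump should be skipped.
// Assumption: processAfter is an ISO 8601 timestamp string.
function hasPendingBackoff(row: RetryRow, now: Date = new Date()): boolean {
  if (row.processAfter === null) {
    return false; // no backoff scheduled yet, so a bump is warranted
  }
  const due = Date.parse(row.processAfter);
  if (Number.isNaN(due)) {
    return false; // unparseable timestamp: treat as no backoff
  }
  return due > now.getTime();
}
```

A caller in the reset path would return early when hasPendingBackoff(row) is true, so repeated sweep ticks over the same orphan claim leave tries and process_after unchanged.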
2026-04-23 15:12:16 +00:00
parent bee80b0072
commit 209061f54f
2 changed files with 28 additions and 14 deletions
--- a/src/db/session-db.ts
+++ b/src/db/session-db.ts
@@ -139,10 +139,10 @@ export function getMessageForRetry(
  db: Database.Database,
  messageId: string,
  status: string,
-): { id: string; tries: number } | undefined {
-  return db.prepare('SELECT id, tries FROM messages_in WHERE id = ? AND status = ?').get(messageId, status) as
-    | { id: string; tries: number }
-    | undefined;
+): { id: string; tries: number; processAfter: string | null } | undefined {
+  return db
+    .prepare('SELECT id, tries, process_after as processAfter FROM messages_in WHERE id = ? AND status = ?')
+    .get(messageId, status) as { id: string; tries: number; processAfter: string | null } | undefined;
 }

 export function syncProcessingAcks(inDb: Database.Database, outDb: Database.Database): void {