fix(host-sweep): clear orphan processing_ack on kill to prevent claim-stuck loop
When the host kills a container (absolute-ceiling, claim-stuck, or crashed), resetStuckProcessingRows reset messages_in but left orphan rows in processing_ack. The next sweep tick spawned a fresh container and, on the same tick, ran enforceRunningContainerSla against outbound.db that still contained the previous container's claim with a hours-old status_changed timestamp — instant kill-claim, before the agent-runner could open outbound.db to run its own clearStaleProcessingAcks(). Loop until tries hit MAX_TRIES. Add deleteOrphanProcessingClaims() in session-db and call it at the end of resetStuckProcessingRows. Safe to write outbound.db here because the host only enters this path after killContainer (or when no container is running). Tests in host-sweep.test.ts cover the helper plus the regression: orphan claim from a 2h-old kill is now removed atomically with the messages_in reset, so the next sweep tick sees an empty claims list and the freshly respawned container survives long enough to start its agent-runner. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -33,6 +33,7 @@ import { getActiveSessions } from './db/sessions.js';
|
||||
import { getAgentGroup } from './db/agent-groups.js';
|
||||
import {
|
||||
countDueMessages,
|
||||
deleteOrphanProcessingClaims,
|
||||
getContainerState,
|
||||
getMessageForRetry,
|
||||
getProcessingClaims,
|
||||
@@ -249,6 +250,15 @@ function enforceRunningContainerSla(
|
||||
resetStuckProcessingRows(inDb, outDb, session, 'claim-stuck');
|
||||
}
|
||||
|
||||
export function _resetStuckProcessingRowsForTesting(
|
||||
inDb: Database.Database,
|
||||
outDb: Database.Database,
|
||||
session: Session,
|
||||
reason: string,
|
||||
): void {
|
||||
resetStuckProcessingRows(inDb, outDb, session, reason);
|
||||
}
|
||||
|
||||
function resetStuckProcessingRows(
|
||||
inDb: Database.Database,
|
||||
outDb: Database.Database,
|
||||
@@ -285,4 +295,15 @@ function resetStuckProcessingRows(
|
||||
});
|
||||
}
|
||||
}
|
||||
|
||||
// Drop the orphan 'processing' rows. Without this, the next sweep tick
|
||||
// would re-read them, see the old status_changed timestamp, conclude the
|
||||
// freshly respawned container is stuck, and SIGKILL it before its
|
||||
// agent-runner has a chance to run clearStaleProcessingAcks() on startup.
|
||||
// We're safe to write outbound.db here because we just killed the container
|
||||
// that owned it (or it crashed and left no writer behind).
|
||||
const cleared = deleteOrphanProcessingClaims(outDb);
|
||||
if (cleared > 0) {
|
||||
log.info('Cleared orphan processing claims', { sessionId: session.id, cleared, reason });
|
||||
}
|
||||
}
|
||||
|
||||
Reference in New Issue
Block a user