Decouple container restart from config updates — config CLI ops now only
write to the DB; restart is a separate `ncl groups restart` command with
--rebuild and --message flags. Add on_wake column to messages_in so wake
messages are only picked up by a fresh container's first poll, preventing
dying containers from stealing them during the SIGTERM grace window.
killContainer accepts an onExit callback for race-free respawn. Agent-
called restart auto-scopes to the calling session.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Squash merge of PR #2267 by ddaniels.
When an agent group has more than one active session, A2A replies landed
in the newest session via findSessionByAgentGroup's ORDER BY created_at
DESC. The session that asked the question never saw the answer.
Adds origin-aware return-path routing with three layers:
1. Direct return-path: if the reply has in_reply_to, look up the
triggering inbound row's source_session_id and route there.
2. Peer-affinity fallback: find the most recent A2A inbound from this
peer and use its source_session_id.
3. Legacy fallback: newest active session (pre-migration compat).
Container-side: MCP send_message/send_file now thread the current
batch's in_reply_to through to outbound rows via current-batch.ts.
Also flips our A2A bug-documenting test (#2332) from asserting the
broken behavior to asserting the fixed behavior.
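The three routing layers above can be sketched roughly as follows. This is a minimal in-memory illustration, not the actual host code: the `InboundRow`, `Session`, and `Reply` shapes and the `resolveReturnSession` name are hypothetical, and the real lookups run against SQLite.

```typescript
// Hypothetical shapes for illustration only; the real rows live in SQLite.
interface InboundRow {
  id: string;
  peerId: string;                 // agent group that sent the A2A message
  sourceSessionId: string | null; // null on rows written before the migration
  createdAt: number;              // epoch ms
}

interface Session {
  id: string;
  createdAt: number; // epoch ms
}

interface Reply {
  peerId: string;
  inReplyTo?: string;
}

function resolveReturnSession(
  reply: Reply,
  inbound: InboundRow[],
  activeSessions: Session[],
): string | undefined {
  // 1. Direct return path: follow in_reply_to back to the triggering row.
  if (reply.inReplyTo !== undefined) {
    const trigger = inbound.find((row) => row.id === reply.inReplyTo);
    if (trigger?.sourceSessionId) return trigger.sourceSessionId;
  }

  // 2. Peer affinity: most recent inbound from this peer that recorded a session.
  const fromPeer = inbound
    .filter((row) => row.peerId === reply.peerId && row.sourceSessionId !== null)
    .sort((a, b) => b.createdAt - a.createdAt)[0];
  if (fromPeer?.sourceSessionId) return fromPeer.sourceSessionId;

  // 3. Legacy fallback: newest active session.
  const newest = [...activeSessions].sort((a, b) => b.createdAt - a.createdAt)[0];
  return newest?.id;
}
```

Each layer only runs when the one before it yields no session, so rows without a source_session_id still fall through to the legacy behavior.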
Co-Authored-By: Doug Daniels <ddaniels888@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PR #2151 added deleteOrphanProcessingClaims() to resetStuckProcessingRows(),
but outDb is always opened readonly (openOutboundDb uses immutable: true).
The write silently failed, leaving orphan processing_ack rows that block
future message delivery for the session.
Fix: add openOutboundDbRw() alongside the existing readonly opener and use
it in resetStuckProcessingRows() to open a short-lived writable handle just
for the delete. The readonly handle is still used for all reads above.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When the host kills a container (absolute-ceiling, claim-stuck, or crashed),
resetStuckProcessingRows reset messages_in but left orphan rows in
processing_ack. The next sweep tick spawned a fresh container and, on the
same tick, ran enforceRunningContainerSla against an outbound.db that still
contained the previous container's claim with an hours-old status_changed
timestamp. The result was an instant kill-claim, before the agent-runner
could open outbound.db to run its own clearStaleProcessingAcks(). This
repeated until tries hit MAX_TRIES.
Add deleteOrphanProcessingClaims() in session-db and call it at the end of
resetStuckProcessingRows. Safe to write outbound.db here because the host
only enters this path after killContainer (or when no container is running).
Tests in host-sweep.test.ts cover the helper plus the regression: orphan
claim from a 2h-old kill is now removed atomically with the messages_in
reset, so the next sweep tick sees an empty claims list and the freshly
respawned container survives long enough to start its agent-runner.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When a container exits with an unresolved processing_ack claim, the
sweep's crashed-container cleanup would reset the matching inbound
message with tries++ and a future process_after. dueCount then dropped
to 0, so the wake step never fired — and the next sweep tick found the
same orphan claim, bumped tries again, and pushed process_after further
out. The message reached MAX_TRIES and was marked failed without any
container ever being spawned.
Two changes:
1. Reorder sweep so the wake step runs before crashed-container
cleanup. A fresh container clears orphan 'processing' rows on its
own startup (container/agent-runner/src/db/connection.ts), so once
we get it running the claim resolves itself.
2. Make resetStuckProcessingRows idempotent: if a message already has
process_after set to a future time, skip the retry bump. The wake
path will pick it up when the backoff elapses. Requires returning
process_after from getMessageForRetry.
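A minimal sketch of the idempotency check in change 2, assuming timestamps are held as epoch milliseconds. The `StuckRow` shape, the `decideReset` name, and both constant values are hypothetical; only the skip-when-backoff-is-pending rule comes from this change.

```typescript
interface StuckRow {
  id: string;
  tries: number;
  processAfterMs: number | null; // epoch ms; null means "due immediately"
}

type ResetDecision =
  | { kind: 'skip' }
  | { kind: 'retry'; tries: number; processAfterMs: number }
  | { kind: 'fail' };

const SKETCH_MAX_TRIES = 5;          // hypothetical value
const SKETCH_BASE_BACKOFF_MS = 5000; // hypothetical value

function decideReset(row: StuckRow, nowMs: number): ResetDecision {
  // A backoff is already pending from an earlier sweep tick: leave the row
  // alone so repeated sweeps do not keep bumping tries.
  if (row.processAfterMs !== null && row.processAfterMs > nowMs) {
    return { kind: 'skip' };
  }

  const tries = row.tries + 1;
  if (tries >= SKETCH_MAX_TRIES) return { kind: 'fail' };

  // Exponential backoff: the delay doubles with each attempt.
  return {
    kind: 'retry',
    tries,
    processAfterMs: nowMs + SKETCH_BASE_BACKOFF_MS * 2 ** tries,
  };
}
```

Running the decision twice on the same row within the backoff window yields one retry followed by a skip, which is the property the sweep needs.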
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the opaque trigger_rules JSON + response_scope enum on
messaging_group_agents with four explicit orthogonal columns:
engage_mode 'pattern' | 'mention' | 'mention-sticky'
engage_pattern regex source; required when mode='pattern';
'.' is the "always" sentinel
sender_scope 'all' | 'known'
ignored_message_policy 'drop' | 'accumulate'
Inbound routing becomes a fan-out — every wired agent is evaluated
independently. A match gets its own session + container wake. A miss
with accumulate keeps the message as context-only (trigger=0) in that
agent's session, so when the agent does eventually engage it sees the
prior chatter.
## Schema
- Migration 010 (`engage-modes`): adds the 4 new columns, backfills
from trigger_rules.pattern + requiresTrigger + response_scope, drops
the legacy columns.
- messages_in gains `trigger INTEGER NOT NULL DEFAULT 1` (session DB
schema + `migrateMessagesInTable` forward-compat).
- countDueMessages gates waking on `trigger = 1`.
## Routing
- `pickAgent` (returns one) → loop over all wired agents. Per agent:
evaluate engage_mode; run access gate + sender-scope gate; on full
match → resolveSession + writeSessionMessage(trigger=1) + wake. On
miss with accumulate → writeSessionMessage(trigger=0), no wake. On
miss with drop → skip.
- New `findSessionForAgent(agentGroupId, mgId, threadId)` scopes
session lookup by agent so fan-out doesn't cross sessions.
- `messageIdForAgent` namespaces inbound message ids by agent_group_id
so PRIMARY KEY doesn't collide across per-agent session DBs.
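The per-agent decision in the fan-out loop might look roughly like this. It is a sketch under assumptions: the access and sender-scope gates are left out, the `Wiring`, `InboundMessage`, and `RouteAction` shapes are hypothetical, `threadSubscribed` stands in for whatever sticky state the adapter tracks, and treating '.' as a literal always-match is one reading of the sentinel.

```typescript
type EngageMode = 'pattern' | 'mention' | 'mention-sticky';
type IgnoredMessagePolicy = 'drop' | 'accumulate';

interface Wiring {
  agentGroupId: string;
  engageMode: EngageMode;
  engagePattern: string | null; // required when engageMode is 'pattern'
  ignoredMessagePolicy: IgnoredMessagePolicy;
}

interface InboundMessage {
  text: string;
  isMention: boolean;
  threadSubscribed: boolean; // hypothetical flag for sticky threads
}

type RouteAction =
  | { agentGroupId: string; action: 'engage'; trigger: 1 }
  | { agentGroupId: string; action: 'accumulate'; trigger: 0 }
  | { agentGroupId: string; action: 'drop' };

function engages(w: Wiring, msg: InboundMessage): boolean {
  switch (w.engageMode) {
    case 'pattern':
      if (w.engagePattern === null) return false;
      // '.' is treated as the "always" sentinel (assumption: matched literally).
      return w.engagePattern === '.' || new RegExp(w.engagePattern).test(msg.text);
    case 'mention':
      return msg.isMention;
    case 'mention-sticky':
      return msg.isMention || msg.threadSubscribed;
  }
}

// Every wired agent is evaluated independently; one message can yield
// a different action per agent.
function fanOut(wirings: Wiring[], msg: InboundMessage): RouteAction[] {
  return wirings.map((w): RouteAction => {
    if (engages(w, msg)) {
      return { agentGroupId: w.agentGroupId, action: 'engage', trigger: 1 };
    }
    return w.ignoredMessagePolicy === 'accumulate'
      ? { agentGroupId: w.agentGroupId, action: 'accumulate', trigger: 0 }
      : { agentGroupId: w.agentGroupId, action: 'drop' };
  });
}
```

Only the 'engage' action would lead to a session message with trigger=1 and a container wake; 'accumulate' writes context without waking.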
## Adapter layer
- `ConversationConfig` replaces `triggerPattern` + `requiresTrigger`
with `engageMode` + `engagePattern`.
- Chat SDK bridge stores `Map<platformId, ConversationConfig[]>` (multiple
agents per conversation) and applies union gating pre-onInbound:
* onSubscribedMessage: engage if any wiring keeps firing in
subscribed state (mention-sticky or pattern)
* onNewMention: engage on mention; only subscribes the thread if
at least one wiring is `mention-sticky`
* onDirectMessage: engage per mode; sticky follows same rule
- Bridge no longer unconditionally calls `thread.subscribe()`.
## Sender scope
- Permissions module registers a second hook `setSenderScopeGate` that
runs per-wiring after the existing access gate. `sender_scope='known'`
requires canAccessAgentGroup(); `'all'` is a no-op. Not installed →
no-op everywhere (default allow).
## Container side
- Host passes `NANOCLAW_MAX_MESSAGES_PER_PROMPT` (reuses existing
MAX_MESSAGES_PER_PROMPT config; was dead code from v1).
- `getPendingMessages` queries `ORDER BY seq DESC LIMIT N`, reverses to
chronological order for the prompt — accumulated context rides along
with trigger rows up to the cap.
- `MessageInRow` gains `trigger: number` so downstream container code can
tell trigger rows from context-only rows (the container still processes
both; only the host uses `trigger=0` to mean don't-wake).
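The newest-N-then-reverse selection in `getPendingMessages` can be illustrated in memory. This sketch assumes a trimmed-down `MessageInRow` and a hypothetical `selectPromptRows` helper; the real query does the ordering and limit in SQL.

```typescript
// Trimmed-down row shape for illustration.
interface MessageInRow {
  seq: number;
  content: string;
  trigger: number; // 1 = wakes the agent, 0 = context only
}

function selectPromptRows(rows: MessageInRow[], maxMessages: number): MessageInRow[] {
  return [...rows]
    .sort((a, b) => b.seq - a.seq) // ORDER BY seq DESC
    .slice(0, maxMessages)         // LIMIT N
    .reverse();                    // back to chronological order for the prompt
}
```

With a cap of 3 and rows at seq 2, 4, 6, 8, the prompt receives 4, 6, 8 in that order, so the most recent context rows ride along with the trigger row and the oldest are dropped first.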
## Defaults (per ACTION-ITEMS item 1 decision)
- DM (is_group=0): `engage_mode='pattern'`, `engage_pattern='.'` (always)
- Threaded group: `engage_mode='mention-sticky'` (seed-discord)
- Non-threaded group / CLI: pattern '.' in bootstrap scripts
## Tests
- src/host-core.test.ts: 3 new cases — fan-out (2 agents, 2 sessions,
2 wakes), accumulate (trigger=0 + no wake), drop (no session created).
- Existing 10 host-core tests still pass.
- Migration 010 verified on an empty DB (0-row path).
Closes: ACTION-ITEMS items 1, 4.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the two overlapping old mechanisms (30-min setTimeout kill in
container-runner, 10-min heartbeat STALE_THRESHOLD reset in host-sweep)
with message-scoped stuck detection anchored to the processing_ack claim
age, plus an absolute 30-min ceiling that extends when a Bash tool call
declares a longer timeout.
Old model problems:
- IDLE_TIMEOUT setTimeout fired on plain wall-clock time; slow-but-alive
agents got killed at 30min regardless of activity
- 10-min STALE_THRESHOLD in the sweep was unreliable — the heartbeat is
only touched on SDK events, so legitimate silent tool work (sleep 30,
long WebFetch, npm install) looked identical to a hung container
- Two overlapping sources of truth for "when to let go of a container"
New model:
- Host sweep is the single source of truth.
- Container exposes a new `container_state` single-row table in outbound.db
(schema added; container writes, host reads). PreToolUse hook writes
current_tool + tool_declared_timeout_ms (read from Bash's tool_input);
PostToolUse / PostToolUseFailure clear it.
- Sweep decides with a pure helper `decideStuckAction`:
* absolute ceiling — kill if heartbeat age > max(30min, bash_timeout)
* per-claim stuck — kill if any processing_ack row has claim_age >
max(60s, bash_timeout) AND heartbeat hasn't been touched since claim
* otherwise ok
Kill paths reset leftover processing rows with exponential backoff,
reusing the existing retry machinery.
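The two kill rules above translate into a small pure function. The following is a sketch, not the shipped `decideStuckAction`: it assumes timestamps are already parsed to epoch milliseconds, a heartbeat is always present, and the `SweepInput` shape is hypothetical. The absent-heartbeat and unparseable-timestamp cases covered by the tests are not modeled here.

```typescript
const ABSOLUTE_CEILING_MS = 30 * 60 * 1000;
const CLAIM_TOLERANCE_MS = 60 * 1000;

interface SweepInput {
  nowMs: number;
  heartbeatMs: number;          // last heartbeat touch
  bashTimeoutMs: number | null; // tool_declared_timeout_ms, if a Bash call is running
  claimTimesMs: number[];       // status_changed of each processing_ack claim
}

type StuckAction =
  | { action: 'ok' }
  | { action: 'kill'; reason: 'absolute-ceiling' | 'claim-stuck' };

function decideStuckActionSketch(input: SweepInput): StuckAction {
  const declared = input.bashTimeoutMs ?? 0;

  // Absolute ceiling: heartbeat age > max(30min, bash_timeout).
  const heartbeatAge = input.nowMs - input.heartbeatMs;
  if (heartbeatAge > Math.max(ABSOLUTE_CEILING_MS, declared)) {
    return { action: 'kill', reason: 'absolute-ceiling' };
  }

  // Per-claim stuck: claim age > max(60s, bash_timeout) AND the heartbeat
  // has not been touched since the claim was made.
  const tolerance = Math.max(CLAIM_TOLERANCE_MS, declared);
  const stuck = input.claimTimesMs.some(
    (claimedAt) => input.nowMs - claimedAt > tolerance && input.heartbeatMs <= claimedAt,
  );
  if (stuck) return { action: 'kill', reason: 'claim-stuck' };

  return { action: 'ok' };
}
```

Keeping the decision free of I/O means the sweep can be unit-tested by feeding it timestamps directly.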
Tool blocklist expanded:
- AskUserQuestion (SDK placeholder; we have mcp__nanoclaw__ask_user_question)
- EnterPlanMode, ExitPlanMode, EnterWorktree, ExitWorktree (Claude Code UI
affordances; would hang in headless containers)
PreToolUse hook is also defense-in-depth: if a disallowed tool name slips
through, it returns `{ decision: 'block' }` so the agent sees a clear
error instead of appearing stuck.
Removed:
- container-runner.ts: IDLE_TIMEOUT setTimeout, resetIdle callback on
activeContainers entry, resetContainerIdleTimer export.
- delivery.ts: the resetContainerIdleTimer call on successful delivery.
- poll-loop.ts: IDLE_END_MS + its setInterval. Keeping the query open is
cheaper than close+reopen (no cold prompt cache). Liveness is now a
host-side concern.
- host-sweep.ts: 10-min STALE_THRESHOLD_MS + getStuckProcessingIds in the
stale-detection path (still exported for kill reset).
Tests:
- src/host-sweep.test.ts — 9 tests for decideStuckAction covering: fresh
heartbeat, absolute ceiling, absent heartbeat, Bash-timeout extension
(both ceiling and per-claim), claim age below tolerance, heartbeat
touched after claim, unparseable timestamps.
Ref: docs/v1-vs-v2/ACTION-ITEMS.md items 9, 6a, 10.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Moves the scheduling surface — 5 delivery actions (schedule_task,
cancel_task, pause_task, resume_task, update_task), handleRecurrence,
applyPreTaskScripts, and task DB helpers — out of core and into
src/modules/scheduling/ (host) and container/agent-runner/src/scheduling/
(container).
First PR to fill the MODULE-HOOK markers introduced in PR #2:
- src/host-sweep.ts MODULE-HOOK:scheduling-recurrence now dynamically
imports handleRecurrence from the module each sweep tick.
- container/agent-runner/src/poll-loop.ts MODULE-HOOK:scheduling-pre-task
dynamically imports applyPreTaskScripts before the provider call.
When the marker block is empty (scheduling uninstalled), `keep`
falls back to `normalMessages` so non-task messages still flow.
The 5 task cases are removed from delivery.ts's handleSystemAction
switch — the registry now routes them. Task DB helpers moved out of
src/db/session-db.ts (which kept `nextEvenSeq` as a named export so
the module can uphold the host-writes-even-seq invariant). Test suite
split to match: scheduling-specific tests live in the module.
No migration — tasks are messages_in rows with kind='task'.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
update_task lets the agent adjust prompt/recurrence/processAfter/script
on a live scheduled task without losing the series id the user already
knows. Empty string clears recurrence/script.
list_tasks now groups by series_id so recurring tasks show as one row
(the live pending/paused occurrence) instead of one per firing — the
id displayed is the stable series handle that update/cancel/pause/resume
all match against.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Recurring tasks spawn a new messages_in row per occurrence. Cancel
only matched the completed row the agent remembered, leaving the
live next occurrence running. Tag every row in a recurrence chain
with the originating task's id (series_id) so cancel/pause/resume
can reach any live row in the series. Cancel also clears recurrence
to prevent the sweep from cloning a cancelled task. Kind-aware id
prefix on recurrences (task- instead of msg-) keeps list_tasks output
consistent across occurrences.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Scheduled tasks stored process_after as ISO-8601 with `T` and `Z`
(e.g. `2026-04-16T14:37:00Z`) but the due-check queries compared it
via raw `<=` against `datetime('now')`, which returns space-separated
format (`2026-04-16 14:37:00`). Since `'T' (0x54) > ' ' (0x20)`,
every ISO-formatted process_after sorted greater than any SQLite-format
`now`, so tasks were never picked up by either the host sweep's
countDueMessages or the container's getPendingMessages.
Wrapping process_after in datetime() normalises both sides before
comparison. Recurrence rows (written by retryWithBackoff using
datetime('now', ...)) already had SQLite format and were unaffected,
which is why the bug only surfaced for agent-scheduled tasks.
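The comparison failure can be reproduced outside SQLite, since it is plain lexicographic string ordering. In this sketch `normalizeUtc` is a hypothetical, rough stand-in for what `datetime()` does to these two UTC formats; it is not a general reimplementation.

```typescript
const isoDue = '2026-04-16T14:37:00Z';   // format written by agent-scheduled tasks
const sqliteNow = '2026-04-16 15:00:00'; // format returned by datetime('now')

// The strings agree for the first 10 characters, then 'T' (0x54) is compared
// with ' ' (0x20), so the ISO value always sorts higher.
const rawIsDue = isoDue <= sqliteNow; // false, although 14:37 is before 15:00

// Hypothetical normaliser: maps both UTC formats to "YYYY-MM-DD HH:MM:SS".
function normalizeUtc(value: string): string {
  return value.replace('T', ' ').replace(/(\.\d+)?Z$/, '');
}

const normalizedIsDue = normalizeUtc(isoDue) <= normalizeUtc(sqliteNow); // true
```

Values already in SQLite format pass through the normaliser unchanged, which mirrors why recurrence rows were unaffected.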
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Route webhook requests through chat.webhooks[name]() instead of calling
adapter.handleWebhook() directly, getting proper auto-initialization and
signature verification. Extract Node↔Web Request/Response conversion
into reusable helpers, parse URL pathname properly for query string
safety, and support all HTTP methods (not just POST).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move all raw SQL out of session-manager, delivery, and host-sweep into
a dedicated DB module. Make session schemas idempotent (IF NOT EXISTS)
so initSessionFolder always applies them. Revert the markdown
plain-text retry from 4c477ac.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>