nanoclaw

Author	SHA1	Message	Date
gavrielc	be3a8a97c6	feat: race-free on-wake messages and explicit restart CLI Decouple container restart from config updates — config CLI ops now only write to the DB; restart is a separate `ncl groups restart` command with --rebuild and --message flags. Add on_wake column to messages_in so wake messages are only picked up by a fresh container's first poll, preventing dying containers from stealing them during the SIGTERM grace window. killContainer accepts an onExit callback for race-free respawn. Agent- called restart auto-scopes to the calling session. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-09 19:02:15 +03:00
gavrielc	107945f10c	fix(agent-to-agent): route A2A replies back to originating session (#2267 ) Squash merge of PR #2267 by ddaniels. When an agent group has more than one active session, A2A replies landed in the newest session via findSessionByAgentGroup's ORDER BY created_at DESC. The session that asked the question never saw the answer. Adds origin-aware return-path routing with three layers: 1. Direct return-path: if the reply has in_reply_to, look up the triggering inbound row's source_session_id and route there. 2. Peer-affinity fallback: find the most recent A2A inbound from this peer and use its source_session_id. 3. Legacy fallback: newest active session (pre-migration compat). Container-side: MCP send_message/send_file now thread the current batch's in_reply_to through to outbound rows via current-batch.ts. Also flips our A2A bug-documenting test (#2332) from asserting the broken behavior to asserting the fixed behavior. Co-Authored-By: Doug Daniels <ddaniels888@gmail.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-08 00:48:10 +03:00
Charlie Savage	8d022fd9da	fix(host-sweep): reopen outbound DB as writable for orphan claim cleanup PR #2151 added deleteOrphanProcessingClaims() to resetStuckProcessingRows(), but outDb is always opened readonly (openOutboundDb uses immutable: true). The write silently failed, leaving orphan processing_ack rows that block future message delivery for the session. Fix: add openOutboundDbRw() alongside the existing readonly opener and use it in resetStuckProcessingRows() to open a short-lived writable handle just for the delete. The readonly handle is still used for all reads above. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-01 23:44:07 -07:00
Ethan	7ce9922cde	fix(host-sweep): clear orphan processing_ack on kill to prevent claim-stuck loop When the host kills a container (absolute-ceiling, claim-stuck, or crashed), resetStuckProcessingRows reset messages_in but left orphan rows in processing_ack. The next sweep tick spawned a fresh container and, on the same tick, ran enforceRunningContainerSla against outbound.db that still contained the previous container's claim with a hours-old status_changed timestamp — instant kill-claim, before the agent-runner could open outbound.db to run its own clearStaleProcessingAcks(). Loop until tries hit MAX_TRIES. Add deleteOrphanProcessingClaims() in session-db and call it at the end of resetStuckProcessingRows. Safe to write outbound.db here because the host only enters this path after killContainer (or when no container is running). Tests in host-sweep.test.ts cover the helper plus the regression: orphan claim from a 2h-old kill is now removed atomically with the messages_in reset, so the next sweep tick sees an empty claims list and the freshly respawned container survives long enough to start its agent-runner. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 12:54:42 +02:00
exe.dev user	209061f54f	fix(sweep): wake before reset + idempotent retry for orphan claims When a container exits with an unresolved processing_ack claim, the sweep's crashed-container cleanup would reset the matching inbound message with tries++ and a future process_after. dueCount then dropped to 0, so the wake step never fired — and the next sweep tick found the same orphan claim, bumped tries again, and pushed process_after further out. The message reached MAX_TRIES and was marked failed without any container ever being spawned. Two changes: 1. Reorder sweep so the wake step runs before crashed-container cleanup. A fresh container clears orphan 'processing' rows on its own startup (container/agent-runner/src/db/connection.ts), so once we get it running the claim resolves itself. 2. Make resetStuckProcessingRows idempotent: if a message already has process_after set to a future time, skip the retry bump. The wake path will pick it up when the backoff elapses. Requires returning process_after from getMessageForRetry. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 15:12:16 +00:00
gavrielc	16b9499532	feat(routing): engage modes + sender scope + accumulate/drop + per-agent fan-out Replaces the opaque trigger_rules JSON + response_scope enum on messaging_group_agents with four explicit orthogonal columns: engage_mode 'pattern' \| 'mention' \| 'mention-sticky' engage_pattern regex source; required when mode='pattern'; '.' is the "always" sentinel sender_scope 'all' \| 'known' ignored_message_policy 'drop' \| 'accumulate' Inbound routing becomes a fan-out — every wired agent is evaluated independently. A match gets its own session + container wake. A miss with accumulate keeps the message as context-only (trigger=0) in that agent's session, so when the agent does eventually engage it sees the prior chatter. ## Schema - Migration 010 (`engage-modes`): adds the 4 new columns, backfills from trigger_rules.pattern + requiresTrigger + response_scope, drops the legacy columns. - messages_in gains `trigger INTEGER NOT NULL DEFAULT 1` (session DB schema + `migrateMessagesInTable` forward-compat). - countDueMessages gates waking on `trigger = 1`. ## Routing - `pickAgent` (returns one) → loop over all wired agents. Per agent: evaluate engage_mode; run access gate + sender-scope gate; on full match → resolveSession + writeSessionMessage(trigger=1) + wake. On miss with accumulate → writeSessionMessage(trigger=0), no wake. On miss with drop → skip. - New `findSessionForAgent(agentGroupId, mgId, threadId)` scopes session lookup by agent so fan-out doesn't cross sessions. - `messageIdForAgent` namespaces inbound message ids by agent_group_id so PRIMARY KEY doesn't collide across per-agent session DBs. ## Adapter layer - `ConversationConfig` replaces `triggerPattern` + `requiresTrigger` with `engageMode` + `engagePattern`. - Chat SDK bridge stores `Map<platformId, ConversationConfig[]>` (multi- agent per conversation) and applies union gating pre-onInbound: * onSubscribedMessage: engage if any wiring keeps firing in subscribed state (mention-sticky or pattern) * onNewMention: engage on mention; only subscribes the thread if at least one wiring is `mention-sticky` * onDirectMessage: engage per mode; sticky follows same rule - Bridge no longer unconditionally calls `thread.subscribe()`. ## Sender scope - Permissions module registers a second hook `setSenderScopeGate` that runs per-wiring after the existing access gate. `sender_scope='known'` requires canAccessAgentGroup(); `'all'` is a no-op. Not installed → no-op everywhere (default allow). ## Container side - Host passes `NANOCLAW_MAX_MESSAGES_PER_PROMPT` (reuses existing MAX_MESSAGES_PER_PROMPT config; was dead code from v1). - `getPendingMessages` queries `ORDER BY seq DESC LIMIT N`, reverses to chronological order for the prompt — accumulated context rides along with trigger rows up to the cap. - `MessageInRow` gains `trigger: number` so the container can tell them apart in downstream code (container still processes both; only the host uses `trigger=0` for don't-wake). ## Defaults (per ACTION-ITEMS item 1 decision) - DM (is_group=0): `engage_mode='pattern'`, `engage_pattern='.'` (always) - Threaded group: `engage_mode='mention-sticky'` (seed-discord) - Non-threaded group / CLI: pattern '.' in bootstrap scripts ## Tests - src/host-core.test.ts: 3 new cases — fan-out (2 agents, 2 sessions, 2 wakes), accumulate (trigger=0 + no wake), drop (no session created). - Existing 10 host-core tests still pass. - Migration 010 runs on an empty DB in 0-row path — verified. Closes: ACTION-ITEMS items 1, 4. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 01:30:04 +03:00
gavrielc	6a815190c0	feat(lifecycle): stuck detection + heartbeat lifecycle + SDK tool blocklist Replaces the two overlapping old mechanisms (30-min setTimeout kill in container-runner, 10-min heartbeat STALE_THRESHOLD reset in host-sweep) with message-scoped stuck detection anchored to the processing_ack claim age + an absolute 30-min ceiling that extends for long-declared Bash tools. Old model problems: - IDLE_TIMEOUT setTimeout fired on plain wall-clock time; slow-but-alive agents got killed at 30min regardless of activity - 10-min STALE_THRESHOLD in the sweep was unreliable — the heartbeat is only touched on SDK events, so legitimate silent tool work (sleep 30, long WebFetch, npm install) looked identical to a hung container - Two overlapping sources of truth for "when to let go of a container" New model: - Host sweep is the single source of truth. - Container exposes a new `container_state` single-row table in outbound.db (schema added; container writes, host reads). PreToolUse hook writes current_tool + tool_declared_timeout_ms (read from Bash's tool_input); PostToolUse / PostToolUseFailure clear it. - Sweep decides with a pure helper `decideStuckAction`: * absolute ceiling — kill if heartbeat age > max(30min, bash_timeout) * per-claim stuck — kill if any processing_ack row has claim_age > max(60s, bash_timeout) AND heartbeat hasn't been touched since claim * otherwise ok Kill paths reset leftover processing rows with exponential backoff, reusing the existing retry machinery. Tool blocklist expanded: - AskUserQuestion (SDK placeholder; we have mcp__nanoclaw__ask_user_question) - EnterPlanMode, ExitPlanMode, EnterWorktree, ExitWorktree (Claude Code UI affordances; would hang in headless containers) PreToolUse hook is also defense-in-depth: if a disallowed tool name slips through, it returns `{ decision: 'block' }` so the agent sees a clear error instead of appearing stuck. Removed: - container-runner.ts: IDLE_TIMEOUT setTimeout, resetIdle callback on activeContainers entry, resetContainerIdleTimer export. - delivery.ts: the resetContainerIdleTimer call on successful delivery. - poll-loop.ts: IDLE_END_MS + its setInterval. Keeping the query open is cheaper than close+reopen (no cold prompt cache). Liveness is now a host-side concern. - host-sweep.ts: 10-min STALE_THRESHOLD_MS + getStuckProcessingIds in the stale-detection path (still exported for kill reset). Tests: - src/host-sweep.test.ts — 9 tests for decideStuckAction covering: fresh heartbeat, absolute ceiling, absent heartbeat, Bash-timeout extension (both ceiling and per-claim), claim age below tolerance, heartbeat touched after claim, unparseable timestamps. Ref: docs/v1-vs-v2/ACTION-ITEMS.md items 9, 6a, 10. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 01:16:57 +03:00
gavrielc	473f766585	refactor(modules): extract scheduling as registry-based module Moves the scheduling surface — 5 delivery actions (schedule_task, cancel_task, pause_task, resume_task, update_task), handleRecurrence, applyPreTaskScripts, and task DB helpers — out of core and into src/modules/scheduling/ (host) and container/agent-runner/src/scheduling/ (container). First PR to fill the MODULE-HOOK markers introduced in PR #2: - src/host-sweep.ts MODULE-HOOK:scheduling-recurrence now dynamically imports handleRecurrence from the module each sweep tick. - container/agent-runner/src/poll-loop.ts MODULE-HOOK:scheduling-pre-task dynamically imports applyPreTaskScripts before the provider call. When the marker block is empty (scheduling uninstalled), `keep` falls back to `normalMessages` so non-task messages still flow. The 5 task cases are removed from delivery.ts's handleSystemAction switch — the registry now routes them. Task DB helpers moved out of src/db/session-db.ts (which kept `nextEvenSeq` as a named export so the module can uphold the host-writes-even-seq invariant). Test suite split to match: scheduling-specific tests live in the module. No migration — tasks are messages_in rows with kind='task'. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 16:17:47 +03:00
exe.dev user	cdf18e608f	feat(v2): add update_task MCP tool, dedup list_tasks by series update_task lets the agent adjust prompt/recurrence/processAfter/script on a live scheduled task without losing the series id the user already knows. Empty string clears recurrence/script. list_tasks now groups by series_id so recurring tasks show as one row (the live pending/paused occurrence) instead of one per firing — the id displayed is the stable series handle that update/cancel/pause/resume all match against. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-16 16:31:29 +00:00
exe.dev user	8ef30ad289	fix(v2): cancel/pause/resume recurring tasks via series_id Recurring tasks spawn a new messages_in row per occurrence. Cancel only matched the completed row the agent remembered, leaving the live next occurrence running. Tag every row in a recurrence chain with the originating task's id (series_id) so cancel/pause/resume can reach any live row in the series. Cancel also clears recurrence to prevent the sweep from cloning a cancelled task. Kind-aware id prefix on recurrences (task- instead of msg-) keeps list_tasks output consistent across occurrences. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-16 16:31:29 +00:00
exe.dev user	9617087c16	fix(v2): compare process_after as datetime, not raw string Scheduled tasks stored process_after as ISO-8601 with `T` and `Z` (e.g. `2026-04-16T14:37:00Z`) but the due-check queries compared it via raw `<=` against `datetime('now')`, which returns space-separated format (`2026-04-16 14:37:00`). Since `'T' (0x54) > ' ' (0x20)`, every ISO-formatted process_after sorted greater than any SQLite-format `now`, so tasks were never picked up by either the host sweep's countDueMessages or the container's getPendingMessages. Wrapping process_after in datetime() normalises both sides before comparison. Recurrence rows (written by retryWithBackoff using datetime('now', ...)) already had SQLite format and were unaffected, which is why the bug only surfaced for agent-scheduled tasks. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 16:31:29 +00:00
gavrielc	5a606a83d4	refactor(v2): use Chat SDK webhooks proxy and clean up webhook server Route webhook requests through chat.webhooks[name]() instead of calling adapter.handleWebhook() directly, getting proper auto-initialization and signature verification. Extract Node↔Web Request/Response conversion into reusable helpers, parse URL pathname properly for query string safety, and support all HTTP methods (not just POST). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 17:36:16 +03:00
gavrielc	669a8444ef	refactor(v2): extract session DB operations into src/db/session-db.ts Move all raw SQL out of session-manager, delivery, and host-sweep into a dedicated DB module. Make session schemas idempotent (IF NOT EXISTS) so initSessionFolder always applies them. Revert the markdown plain-text retry from `4c477ac`. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 17:36:16 +03:00

13 Commits