nanoclaw

Author	SHA1	Message	Date
Ethan Munoz	ec23bd7a7e	fix(host-sweep): parse SQLite timestamps as UTC, not local time SQLite TIMESTAMP columns store UTC without a zone marker. `Date.parse` treats timezoneless ISO strings as local time, so on any non-UTC host every claim and processAfter looks (TZ offset) hours stale. That makes fresh claims trip the kill-claim path on the first sweep tick — every container gets killed within seconds of spawn. Two affected sites in host-sweep.ts: - decideStuckAction reads claim.status_changed and computes claimAge. On a TZ=Europe/Madrid host (UTC+2), a claim made 5s ago looks 7205s old and exceeds CLAIM_STUCK_MS (60s). - The orphan retry loop reads msg.processAfter and skips messages rescheduled into the future. On the same host, future timestamps look (TZ offset) hours in the past, so the skip is missed and tries gets bumped on every tick. Fix: introduce parseSqliteUtc(s) which appends "Z" only when no zone marker is present, then call it from both sites. Behavior under TZ=UTC is unchanged. Verified on a production v2 install on TZ=Europe/Madrid: with the patch applied, an idle container survived 30+ minutes without being killed (previously: killed within 60s of spawn). Tests: 5 new cases covering the bare/Z/+offset/invalid input matrix and a TZ-independence check. All 19 host-sweep tests pass and tsc clears against main.	2026-05-05 23:49:18 +02:00
gavrielc	a870e7ebf2	fix: keep resetStuckProcessingRows private, restore test wrapper The test wrapper forwards the in-memory outDb as the writable handle, avoiding the filesystem reopen that fails in CI. The function stays private — the optional writableOutDb param is an internal detail, not a public API. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-05 15:56:08 +03:00
Charlie Savage	e4181f5451	fix(host-sweep): regression in #2183 — orphan-claim delete missed in tests #2183 added orphan-claim cleanup that reopens `outbound.db` by session path (`openOutboundDbRw(session.agent_group_id, session.id)`) so the delete runs against a writable handle even when callers pass a readonly one. That works for the production caller — there's a real on-disk session DB at the expected path. The test wrapper `_resetStuckProcessingRowsForTesting` (introduced in the same series, #2151) is called with in-memory DBs that have no on-disk path. The reopen creates a fresh empty file at `<DATA_DIR>/v2-sessions/ag-test/sess-test/outbound.db`, runs the delete against that, and leaves the in-memory `outDb` (which the test reads afterward) untouched. The two `resetStuckProcessingRows — orphan claim cleanup` tests assert `getProcessingClaims(outDb).toEqual([])` after the call and fail on the row that's still there. Fix: drop the `_…ForTesting` wrapper, export `resetStuckProcessingRows` directly with an optional `writableOutDb` parameter. When omitted (production), the function reopens `outbound.db` RW by session path — existing behavior, existing safety guarantee. When provided (tests, or any future caller that already holds a writable handle), the function uses it directly and skips the reopen. The optional parameter has a real meaning, not a "for tests" hack. Public API surface change: `_resetStuckProcessingRowsForTesting` is gone, `resetStuckProcessingRows` is now exported. No other callers inside the repo besides the test.	2026-05-02 22:54:08 -07:00
Charlie Savage	8d022fd9da	fix(host-sweep): reopen outbound DB as writable for orphan claim cleanup PR #2151 added deleteOrphanProcessingClaims() to resetStuckProcessingRows(), but outDb is always opened readonly (openOutboundDb uses immutable: true). The write silently failed, leaving orphan processing_ack rows that block future message delivery for the session. Fix: add openOutboundDbRw() alongside the existing readonly opener and use it in resetStuckProcessingRows() to open a short-lived writable handle just for the delete. The readonly handle is still used for all reads above. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-01 23:44:07 -07:00
Ethan	7ce9922cde	fix(host-sweep): clear orphan processing_ack on kill to prevent claim-stuck loop When the host kills a container (absolute-ceiling, claim-stuck, or crashed), resetStuckProcessingRows reset messages_in but left orphan rows in processing_ack. The next sweep tick spawned a fresh container and, on the same tick, ran enforceRunningContainerSla against outbound.db that still contained the previous container's claim with a hours-old status_changed timestamp — instant kill-claim, before the agent-runner could open outbound.db to run its own clearStaleProcessingAcks(). Loop until tries hit MAX_TRIES. Add deleteOrphanProcessingClaims() in session-db and call it at the end of resetStuckProcessingRows. Safe to write outbound.db here because the host only enters this path after killContainer (or when no container is running). Tests in host-sweep.test.ts cover the helper plus the regression: orphan claim from a 2h-old kill is now removed atomically with the messages_in reset, so the next sweep tick sees an empty claims list and the freshly respawned container survives long enough to start its agent-runner. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 12:54:42 +02:00
gavrielc	d5b48e4742	fix(credentials): address review feedback - wakeContainer now never throws — returns Promise<boolean>, catches internally. Closes the regression risk for the 5 awaited callers in agent-to-agent, interactive, and approvals/response-handler that the previous version left unwrapped. Router uses the boolean to stop the typing indicator on transient failure; host-sweep just awaits. - Tighten AUTH_REQUIRED_RE: anchor to start-of-string with the specific `·` (U+00B7) separator the CLI uses, so an agent that quotes the banner mid-sentence in a normal reply doesn't trip the classifier. - Log a one-line note from writeAuthRequiredMessage so substitutions are visible when debugging "user got the credentials message but I don't see why." - Add unit tests for ClaudeProvider.isAuthRequired covering both banner variants, trailing content, mid-sentence quoting, leading-prose quoting, alternate separators, and unrelated text. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 17:51:32 +03:00
gavrielc	5f34e26240	fix(credentials): translate auth errors and require OneCLI for spawn Two related fixes for the case where credentials aren't usable: 1. Replace Claude Code's "Not logged in / Invalid API key · Please run /login" output with a host-aware message. The user can't run /login from chat, so the raw text is unhelpful. Provider gains an optional isAuthRequired() classifier; the poll-loop substitutes the message on both result-text and error paths. 2. Treat OneCLI gateway failure as a transient hard error instead of spawning a credential-less container. The catch in container-runner now propagates; router and host-sweep wrap wakeContainer to log and leave the inbound row pending so the next 60s sweep tick retries. Router also stops the typing indicator on failure. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 17:02:15 +03:00
exe.dev user	209061f54f	fix(sweep): wake before reset + idempotent retry for orphan claims When a container exits with an unresolved processing_ack claim, the sweep's crashed-container cleanup would reset the matching inbound message with tries++ and a future process_after. dueCount then dropped to 0, so the wake step never fired — and the next sweep tick found the same orphan claim, bumped tries again, and pushed process_after further out. The message reached MAX_TRIES and was marked failed without any container ever being spawned. Two changes: 1. Reorder sweep so the wake step runs before crashed-container cleanup. A fresh container clears orphan 'processing' rows on its own startup (container/agent-runner/src/db/connection.ts), so once we get it running the claim resolves itself. 2. Make resetStuckProcessingRows idempotent: if a message already has process_after set to a future time, skip the retry bump. The wake path will pick it up when the backoff elapses. Requires returning process_after from getMessageForRetry. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 15:12:16 +00:00
gavrielc	0105de0257	fix(host-sweep): skip ceiling check when heartbeat file is absent decideStuckAction treated a missing heartbeat file as heartbeatAge = Infinity, which always exceeded the 30-minute ceiling. Result: every freshly-spawned container got killed within seconds of spawn on the first sweep pass because it hadn't produced an SDK event yet (heartbeat is only touched on SDK events inside processQuery, not on boot). Skip the ceiling branch when heartbeatMtimeMs === 0. Containers that genuinely never wrote a heartbeat because they died are caught by the separate "container process not running" cleanup path. Containers that boot, claim a message, but hang at the gate are caught by the claim-stuck check below — which correctly fires regardless of heartbeat presence once claimAge exceeds tolerance. Updates the "absent heartbeat → kill-ceiling" test (which was encoding the bug) and adds a companion that the claim-stuck path still fires for absent-heartbeat containers with aged claims. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 12:15:52 +03:00
gavrielc	6a815190c0	feat(lifecycle): stuck detection + heartbeat lifecycle + SDK tool blocklist Replaces the two overlapping old mechanisms (30-min setTimeout kill in container-runner, 10-min heartbeat STALE_THRESHOLD reset in host-sweep) with message-scoped stuck detection anchored to the processing_ack claim age + an absolute 30-min ceiling that extends for long-declared Bash tools. Old model problems: - IDLE_TIMEOUT setTimeout fired on plain wall-clock time; slow-but-alive agents got killed at 30min regardless of activity - 10-min STALE_THRESHOLD in the sweep was unreliable — the heartbeat is only touched on SDK events, so legitimate silent tool work (sleep 30, long WebFetch, npm install) looked identical to a hung container - Two overlapping sources of truth for "when to let go of a container" New model: - Host sweep is the single source of truth. - Container exposes a new `container_state` single-row table in outbound.db (schema added; container writes, host reads). PreToolUse hook writes current_tool + tool_declared_timeout_ms (read from Bash's tool_input); PostToolUse / PostToolUseFailure clear it. - Sweep decides with a pure helper `decideStuckAction`: * absolute ceiling — kill if heartbeat age > max(30min, bash_timeout) * per-claim stuck — kill if any processing_ack row has claim_age > max(60s, bash_timeout) AND heartbeat hasn't been touched since claim * otherwise ok Kill paths reset leftover processing rows with exponential backoff, reusing the existing retry machinery. Tool blocklist expanded: - AskUserQuestion (SDK placeholder; we have mcp__nanoclaw__ask_user_question) - EnterPlanMode, ExitPlanMode, EnterWorktree, ExitWorktree (Claude Code UI affordances; would hang in headless containers) PreToolUse hook is also defense-in-depth: if a disallowed tool name slips through, it returns `{ decision: 'block' }` so the agent sees a clear error instead of appearing stuck. Removed: - container-runner.ts: IDLE_TIMEOUT setTimeout, resetIdle callback on activeContainers entry, resetContainerIdleTimer export. - delivery.ts: the resetContainerIdleTimer call on successful delivery. - poll-loop.ts: IDLE_END_MS + its setInterval. Keeping the query open is cheaper than close+reopen (no cold prompt cache). Liveness is now a host-side concern. - host-sweep.ts: 10-min STALE_THRESHOLD_MS + getStuckProcessingIds in the stale-detection path (still exported for kill reset). Tests: - src/host-sweep.test.ts — 9 tests for decideStuckAction covering: fresh heartbeat, absolute ceiling, absent heartbeat, Bash-timeout extension (both ceiling and per-claim), claim age below tolerance, heartbeat touched after claim, unparseable timestamps. Ref: docs/v1-vs-v2/ACTION-ITEMS.md items 9, 6a, 10. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 01:16:57 +03:00
gavrielc	7169c25e70	refactor: relocate outbox I/O to session-manager + dead-code sweep ## Outbox extraction (delivery.ts → session-manager.ts) File I/O for outbound attachments now lives in session-manager.ts alongside the symmetric inbound extractAttachmentFiles. delivery.ts no longer touches the filesystem — it hands buffers to the adapter and calls clearOutbox on success. - New `readOutboxFiles(agentGroupId, sessionId, messageId, filenames)` and `clearOutbox(agentGroupId, sessionId, messageId)` in session-manager.ts. - deliverMessage in delivery.ts loses ~35 lines of fs/path code and its `fs`/`path` imports. ## Dead-code sweep TypeScript's --noUnusedLocals surfaced several cruft imports. Fixed: - src/container-runner.ts: drop unused `markContainerIdle` import; drop unused `session` parameter from `buildContainerArgs` signature. - src/delivery.ts: drop unused `getSession`, `writeSessionMessage`, `wakeContainer` imports. - src/host-sweep.ts: drop unused `updateSession`, `outboundDbPath` imports. - container/agent-runner/src/poll-loop.ts: drop unused `config`, `processingIds` params from `processQuery`. - Test files: drop unused imports in channel-registry.test, db-v2.test, host-core.test. Skipped: `conversations` state in chat-sdk-bridge.ts (never read but tangled with public `updateConversations` method; cleaning it risks a merge conflict with the channels branch at the next sync). ## Validation - `pnpm run build` clean - `pnpm test` — 137 host tests pass - `bun test` in container/agent-runner — 17 tests pass - Service boots (`NanoClaw running`, `OneCLI approval handler started`) and shuts down cleanly on SIGTERM Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 21:34:08 +03:00
gavrielc	473f766585	refactor(modules): extract scheduling as registry-based module Moves the scheduling surface — 5 delivery actions (schedule_task, cancel_task, pause_task, resume_task, update_task), handleRecurrence, applyPreTaskScripts, and task DB helpers — out of core and into src/modules/scheduling/ (host) and container/agent-runner/src/scheduling/ (container). First PR to fill the MODULE-HOOK markers introduced in PR #2: - src/host-sweep.ts MODULE-HOOK:scheduling-recurrence now dynamically imports handleRecurrence from the module each sweep tick. - container/agent-runner/src/poll-loop.ts MODULE-HOOK:scheduling-pre-task dynamically imports applyPreTaskScripts before the provider call. When the marker block is empty (scheduling uninstalled), `keep` falls back to `normalMessages` so non-task messages still flow. The 5 task cases are removed from delivery.ts's handleSystemAction switch — the registry now routes them. Task DB helpers moved out of src/db/session-db.ts (which kept `nextEvenSeq` as a named export so the module can uphold the host-writes-even-seq invariant). Test suite split to match: scheduling-specific tests live in the module. No migration — tasks are messages_in rows with kind='task'. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 16:17:47 +03:00
gavrielc	4202041d0b	refactor: scaffold module registries and default-module layout Additive change — existing code paths still run via inline fallbacks. Prepares core for per-module extractions in PR #3 onward. Four registries added with empty defaults: - delivery action handlers (delivery.ts) - router inbound gate (router.ts) - response dispatcher (index.ts) - MCP tool self-registration (container/agent-runner/src/mcp-tools/server.ts) Default modules moved to src/modules/ for signaling: - src/modules/typing/ (extracted from delivery.ts) - src/modules/mount-security/ (moved from src/mount-security.ts) Both are imported directly by core — no hook, no registry. Removal requires editing core imports. Migrator now keys applied rows by name (uniqueness) so module migrations can pick arbitrary version numbers. Stored version column is auto-assigned as an applied-order sequence. sqlite_master guards added around core calls into module-owned tables (user_roles, agent_destinations, pending_questions). No-ops today; load-bearing after the owning modules are extracted. MODULE-HOOK markers placed at scheduling's two skill-edit sites (host-sweep.ts recurrence call, poll-loop.ts pre-task gate). PR #4 replaces the marked blocks when scheduling moves to its module. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 14:46:19 +03:00
exe.dev user	8ef30ad289	fix(v2): cancel/pause/resume recurring tasks via series_id Recurring tasks spawn a new messages_in row per occurrence. Cancel only matched the completed row the agent remembered, leaving the live next occurrence running. Tag every row in a recurrence chain with the originating task's id (series_id) so cancel/pause/resume can reach any live row in the series. Cancel also clears recurrence to prevent the sweep from cloning a cancelled task. Kind-aware id prefix on recurrences (task- instead of msg-) keeps list_tasks output consistent across occurrences. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-16 16:31:29 +00:00
exe.dev user	4c46dd3b39	fix(v2): await handleRecurrence so the DB isn't closed mid-flight sweepSession called handleRecurrence without await, then synchronously closed inDb in its finally block. handleRecurrence is async because it does a dynamic `import('cron-parser')` before the first DB write; that import resolved after the finally had already run, so insertRecurrence hit a closed handle and threw "The database connection is not open". Net effect: every recurring task was correctly marked completed by syncProcessingAcks, but its next occurrence never got scheduled. Single-word fix — `await`. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 16:31:29 +00:00
gavrielc	669a8444ef	refactor(v2): extract session DB operations into src/db/session-db.ts Move all raw SQL out of session-manager, delivery, and host-sweep into a dedicated DB module. Make session schemas idempotent (IF NOT EXISTS) so initSessionFolder always applies them. Revert the markdown plain-text retry from `4c477ac`. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 17:36:16 +03:00
gavrielc	b76fd425c8	style: prettier formatting fixes Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 12:18:31 +03:00
gavrielc	82cb363f84	v2: split session DB into inbound/outbound for write isolation Eliminates SQLite write contention across the host-container mount boundary by splitting the single session.db into two files, each with exactly one writer: inbound.db — host writes (messages_in, delivered tracking) outbound.db — container writes (messages_out, processing_ack) Key changes: - Host uses even seq numbers, container uses odd (collision-free) - Container heartbeat via file touch instead of DB UPDATE - Scheduling MCP tools now emit system actions via messages_out (host applies them to inbound.db during delivery) - Host sweep reads processing_ack + heartbeat file for stale detection - OneCLI ensureAgent() call added (was missing from v2, caused applyContainerConfig to reject unknown agent identifiers) Verified: tsc clean, 327 tests pass, real e2e through Docker works. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 12:17:31 +03:00
gavrielc	9486d56b01	v2: make v2 the main entry point, move v1 to src/v1/ - Move all v1 files (index, router, container-runner, db, ipc, types, logger, channels/registry, and all utilities) to src/v1/ as a fully self-contained archive with no shared dependencies - Rename v2 files to remove -v2 suffix (index-v2.ts → index.ts, etc.) - Update all imports across v2 source, tests, and setup files - Migrate shared utilities (config, env, container-runtime, mount-security, timezone, group-folder) from pino logger to v2 log module - Migrate setup/ files from logger to log with argument order swap - Container agent-runner: move v1 entry to v1/, rename v2 to index.ts - Update setup skill to offer all 13 v2 channels - Install all Chat SDK adapter packages - dist/index.js now runs v2; dist/v1/index.js runs v1 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 11:40:36 +03:00
gavrielc	8a06b01646	v2: SQLite state adapter, admin commands, compact feedback - Replace in-memory Chat SDK state with SqliteStateAdapter — thread subscriptions now persist across restarts - Add migration 002 for chat_sdk_kv, subscriptions, locks, lists tables - Handle /clear in agent-runner (reset sessionId) — SDK has supportsNonInteractive:false for this command - Pass /compact, /context, /cost, /files through to SDK as admin commands - Skip admin commands in follow-up poll so they start fresh queries - Emit compact_boundary events as user-visible feedback messages - Pass NANOCLAW_ADMIN_USER_ID and NANOCLAW_ASSISTANT_NAME to containers Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 03:58:35 +03:00
gavrielc	c348fabf22	v2 phase 5: scheduling fixes, media handling, command processing - Host sweep: fix DELETE journal mode, busy_timeout, seq in recurrence INSERT - Outbound files: delivery reads from outbox dir, passes buffers to adapter, cleans up after delivery. Chat SDK bridge sends files via postMessage. - Inbound attachments: formatter includes attachment info in prompts - Commands: categorize /commands as admin, filtered, or passthrough. Admin commands check sender against NANOCLAW_ADMIN_USER_ID. Filtered commands silently dropped. Passthrough sent raw to agent. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 02:59:33 +03:00
gavrielc	d35386a46e	style: apply prettier formatting to v2 source files Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 23:59:08 +03:00
gavrielc	d7c68e04b1	v2 phase 3: host core — router, session manager, delivery, sweep Host orchestrator connecting channel events to session DBs and delivering responses back through channel adapters. - session-manager.ts: session folder/DB lifecycle, message writing - container-runner-v2.ts: Docker spawn with session + agent group mounts, OneCLI, idle timeout, agent-runner recompilation - router-v2.ts: inbound routing (channel → messaging group → agent group → session → messages_in → wake container) - delivery.ts: two-tier polling (1s active, 60s sweep) for messages_out, channel adapter delivery - host-sweep.ts: stale detection with backoff, recurrence, wake containers for due messages - index-v2.ts: thin entry point wiring everything together - scripts/test-v2-agent.ts: real Claude provider integration test Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 23:43:13 +03:00
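
The timestamp fix described in ec23bd7a7e (append "Z" only when no zone marker is present) could look roughly like the sketch below. This is not the code from host-sweep.ts: the regular expression, the strict format check, and the choice to return NaN for anything that does not look like a timestamp are assumptions made for illustration. Only the name parseSqliteUtc and the bare/Z/+offset/invalid input matrix come from the commit message.

```typescript
// Illustrative sketch only; the real parseSqliteUtc in host-sweep.ts may differ.
// Accepts "YYYY-MM-DD HH:MM:SS" (SQLite style) or the ISO "T" form, with optional
// fractional seconds and an optional zone marker (Z or +hh:mm / -hh:mm).
const SQLITE_TS =
  /^\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}(?:\.\d{1,3})?(Z|[+-]\d{2}:\d{2})?$/;

function parseSqliteUtc(s: string): number {
  const trimmed = s.trim();
  const match = SQLITE_TS.exec(trimmed);
  // Assumption: anything that is not a recognisable timestamp yields NaN.
  if (!match) return NaN;

  // Normalise the SQLite space separator to the ISO "T" separator.
  const iso = trimmed.replace(' ', 'T');

  // match[1] is the zone marker, if any. Without one, Date.parse would read the
  // string as local time, so mark it as UTC explicitly.
  return Date.parse(match[1] ? iso : `${iso}Z`);
}

// Usage: a zone-less SQLite value and its explicit-UTC form parse to the same instant.
const bare = parseSqliteUtc('2026-05-05 21:49:18');
const zulu = parseSqliteUtc('2026-05-05T21:49:18Z');
console.log(bare === zulu);
```

Because the "Z" is appended before parsing, the result does not depend on the host's TZ setting, which is the property the commit's TZ-independence test checks.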

23 Commits
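
The decision rules for `decideStuckAction` listed in 6a815190c0, together with the absent-heartbeat exception from 0105de0257, can be approximated as a pure function. The input shape, the `ABSOLUTE_CEILING_MS` name, and the handling of unparseable claim times (skipped here) are hypothetical. The commit messages supply only the thresholds, the `heartbeatMtimeMs === 0` check, the `CLAIM_STUCK_MS` name, and the kill-ceiling / kill-claim labels.

```typescript
// Illustrative sketch only; types and field names are assumptions, not the repo's API.
type StuckAction = 'ok' | 'kill-ceiling' | 'kill-claim';

interface StuckInput {
  nowMs: number;
  heartbeatMtimeMs: number; // 0 means the heartbeat file is absent
  claimTimesMs: number[];   // parsed status_changed per processing_ack row (NaN if unparseable)
  bashTimeoutMs: number;    // declared timeout of a running Bash tool, or 0 if none
}

const ABSOLUTE_CEILING_MS = 30 * 60 * 1000; // assumed constant name; 30 min per the commit
const CLAIM_STUCK_MS = 60 * 1000;           // 60 s per the commit

function decideStuckAction(input: StuckInput): StuckAction {
  const { nowMs, heartbeatMtimeMs, claimTimesMs, bashTimeoutMs } = input;

  // Absolute ceiling: skipped when no heartbeat has been written yet, so a
  // freshly spawned container is not killed on the first sweep tick.
  if (heartbeatMtimeMs !== 0) {
    const heartbeatAge = nowMs - heartbeatMtimeMs;
    if (heartbeatAge > Math.max(ABSOLUTE_CEILING_MS, bashTimeoutMs)) {
      return 'kill-ceiling';
    }
  }

  // Per-claim stuck: an old claim with no heartbeat touch since it was made.
  const tolerance = Math.max(CLAIM_STUCK_MS, bashTimeoutMs);
  for (const claimedAtMs of claimTimesMs) {
    if (Number.isNaN(claimedAtMs)) continue; // assumption: unparseable rows are ignored
    const claimAge = nowMs - claimedAtMs;
    const touchedSinceClaim = heartbeatMtimeMs > claimedAtMs;
    if (claimAge > tolerance && !touchedSinceClaim) return 'kill-claim';
  }

  return 'ok';
}
```

Keeping the function free of I/O, as the commit describes, means each rule can be tested with plain numbers and no container, database, or clock.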