nanoclaw

Author	SHA1	Message	Date
gavrielc	6a815190c0	feat(lifecycle): stuck detection + heartbeat lifecycle + SDK tool blocklist Replaces the two overlapping old mechanisms (30-min setTimeout kill in container-runner, 10-min heartbeat STALE_THRESHOLD reset in host-sweep) with message-scoped stuck detection anchored to the processing_ack claim age + an absolute 30-min ceiling that extends for long-declared Bash tools. Old model problems: - IDLE_TIMEOUT setTimeout fired on plain wall-clock time; slow-but-alive agents got killed at 30min regardless of activity - 10-min STALE_THRESHOLD in the sweep was unreliable — the heartbeat is only touched on SDK events, so legitimate silent tool work (sleep 30, long WebFetch, npm install) looked identical to a hung container - Two overlapping sources of truth for "when to let go of a container" New model: - Host sweep is the single source of truth. - Container exposes a new `container_state` single-row table in outbound.db (schema added; container writes, host reads). PreToolUse hook writes current_tool + tool_declared_timeout_ms (read from Bash's tool_input); PostToolUse / PostToolUseFailure clear it. - Sweep decides with a pure helper `decideStuckAction`: * absolute ceiling — kill if heartbeat age > max(30min, bash_timeout) * per-claim stuck — kill if any processing_ack row has claim_age > max(60s, bash_timeout) AND heartbeat hasn't been touched since claim * otherwise ok Kill paths reset leftover processing rows with exponential backoff, reusing the existing retry machinery. Tool blocklist expanded: - AskUserQuestion (SDK placeholder; we have mcp__nanoclaw__ask_user_question) - EnterPlanMode, ExitPlanMode, EnterWorktree, ExitWorktree (Claude Code UI affordances; would hang in headless containers) PreToolUse hook is also defense-in-depth: if a disallowed tool name slips through, it returns `{ decision: 'block' }` so the agent sees a clear error instead of appearing stuck. Removed: - container-runner.ts: IDLE_TIMEOUT setTimeout, resetIdle callback on activeContainers entry, resetContainerIdleTimer export. - delivery.ts: the resetContainerIdleTimer call on successful delivery. - poll-loop.ts: IDLE_END_MS + its setInterval. Keeping the query open is cheaper than close+reopen (no cold prompt cache). Liveness is now a host-side concern. - host-sweep.ts: 10-min STALE_THRESHOLD_MS + getStuckProcessingIds in the stale-detection path (still exported for kill reset). Tests: - src/host-sweep.test.ts — 9 tests for decideStuckAction covering: fresh heartbeat, absolute ceiling, absent heartbeat, Bash-timeout extension (both ceiling and per-claim), claim age below tolerance, heartbeat touched after claim, unparseable timestamps. Ref: docs/v1-vs-v2/ACTION-ITEMS.md items 9, 6a, 10. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 01:16:57 +03:00
gavrielc	e0258e8c1b	refactor(v2): move opencode provider off v2 trunk v2 ships with only claude baked in. opencode now lives on the `providers` branch and gets copied in via the /add-opencode skill. Removed: - src/providers/opencode.ts - container/agent-runner/src/providers/{opencode,mcp-to-opencode}.ts + test - @opencode-ai/sdk from agent-runner package.json + bun.lock - opencode-ai global install + OPENCODE_VERSION ARG from Dockerfile - opencode self-registration imports from both provider barrels - opencode test case from factory.test.ts Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-17 14:10:56 +03:00
Tal Moskovich	22150261c5	feat(v2): OpenCode agent provider - Add OpenCodeProvider (SSE, session resume, MCP via mcp-to-opencode) - Register opencode in factory; AGENT_PROVIDER passthrough from DB - Host: XDG mount, NO_PROXY merge, OPENCODE_* env for opencode sessions - Dockerfile: opencode-ai CLI; docs checklist + architecture diagram - Skill add-opencode for v2; AgentProviderName in src/types.ts Made-with: Cursor	2026-04-17 12:20:22 +03:00
gavrielc	1f3b023a5a	refactor(v2/providers): self-registration barrel + host container-config registry Providers now mirror the channels pattern: each module calls registerProvider() at top level, and providers/index.ts is a barrel of side-effect imports. createProvider() becomes a thin registry lookup; the closed ProviderName union is gone (now a string alias, since the env var is a runtime string anyway). Also adds a host-side provider-container-registry so providers can declare their own mounts and env passthrough in src/providers/<name>.ts instead of the container-runner having to know about each one. The resolver runs once per spawn and threads provider + contribution through buildMounts and buildContainerArgs so side effects (mkdir, etc.) fire exactly once. Both barrels are append-only — adding a new provider is a new file + one import line per barrel, no edits to existing files. The built-in providers (claude, mock) don't need host-side config, so src/providers/ ships with an empty barrel; the container-side barrel imports both. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-17 12:17:09 +03:00
exe.dev user	b9f95df340	feat(v2): add pre-task script hook for scheduled tasks Scheduled tasks can now carry a bash script that runs inside the container before the agent is invoked. The script prints `{wakeAgent, data?}` on its last stdout line; if `wakeAgent: false` (or the script errors) the task row is marked completed and the agent is never queried, saving API calls on no-op checks. On wake, the script's `data` is injected into the task prompt. Semantics mirror V1: 30s bash timeout, 1MB buffer, last-line JSON, error == skip. Also blocks the Claude SDK's built-in scheduling tools (CronCreate, CronDelete, CronList, ScheduleWakeup) via `disallowedTools` so tasks actually flow through `mcp__nanoclaw__schedule_task` and get the script gate. CLAUDE.md gains a soft pointer explaining why `schedule_task` is the right path. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 16:31:29 +00:00
gavrielc	b63dd186df	refactor(agent-runner): decouple provider interface from Claude specifics Reshape AgentProvider so provider-specific assumptions stop leaking into the generic layer. No change to what reaches sdkQuery() — same values, different plumbing. - QueryInput: opaque `continuation` replaces `sessionId` + `resumeAt`; `systemContext.instructions` replaces ambiguous `systemPrompt`; `mcpServers`, `env`, `additionalDirectories` move to `ProviderOptions` at construction time. - AgentProvider gains `isSessionInvalid(err)` and `supportsNativeSlashCommands` so the poll-loop stops regex-matching Claude error strings and gates passthrough slash commands per provider. - ClaudeProvider owns `CLAUDE_CODE_AUTO_COMPACT_WINDOW` and the stale-session regex internally. - ProviderEvent.activity kept and documented as the liveness signal (fires on every SDK message so the idle timer stays honest during long tool runs); init carries `continuation` instead of `sessionId`. - poll-loop drops mcpServers/env/systemPrompt from its config; admin user id now passed explicitly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 10:25:29 +03:00
gavrielc	8a06b01646	v2: SQLite state adapter, admin commands, compact feedback - Replace in-memory Chat SDK state with SqliteStateAdapter — thread subscriptions now persist across restarts - Add migration 002 for chat_sdk_kv, subscriptions, locks, lists tables - Handle /clear in agent-runner (reset sessionId) — SDK has supportsNonInteractive:false for this command - Pass /compact, /context, /cost, /files through to SDK as admin commands - Skip admin commands in follow-up poll so they start fresh queries - Emit compact_boundary events as user-visible feedback messages - Pass NANOCLAW_ADMIN_USER_ID and NANOCLAW_ASSISTANT_NAME to containers Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 03:58:35 +03:00
gavrielc	6f2a7314d0	v2: fix agent-runner lifecycle and session DB reliability - Use DELETE journal mode for session DBs instead of WAL. WAL doesn't sync reliably across Docker volume mounts (VirtioFS), causing dropped writes and duplicate deliveries. - Add 20s idle detection to end the query stream. The concurrent poll tracks SDK activity via a new 'activity' provider event. When no SDK events arrive for 20s and no messages are pending, the stream ends and the poll loop continues. - Add touchProcessing heartbeat so the host can distinguish active agents from idle ones by checking status_changed recency. - Catch query errors in the poll loop and write error responses to messages_out instead of crashing the process. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 01:34:59 +03:00
gavrielc	5a0098edc9	v2 phase 2: agent-runner — provider interface, poll loop, formatter AgentProvider abstraction with Claude and Mock implementations. Poll loop reads messages_in, formats by kind, queries provider, writes results to messages_out. Concurrent polling pushes follow-up messages into active queries. - providers/types.ts: AgentProvider, AgentQuery, ProviderEvent - providers/claude.ts: wraps Agent SDK with MessageStream, hooks, transcript archiving - providers/mock.ts: canned responses with push() support - providers/factory.ts: createProvider() - formatter.ts: format by kind (chat/task/webhook/system), XML escaping, routing extraction - poll-loop.ts: poll → format → query → write, concurrent polling - mcp-tools.ts: MCP server with send_message tool - index-v2.ts: new entry point (config from env, enters poll loop) - 11 new tests, all 288 tests pass Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 23:36:55 +03:00

9 Commits