Commit Graph

19 Commits

Author SHA1 Message Date
Ethan
7ce9922cde fix(host-sweep): clear orphan processing_ack on kill to prevent claim-stuck loop
When the host kills a container (absolute-ceiling, claim-stuck, or crashed),
resetStuckProcessingRows reset messages_in but left orphan rows in
processing_ack. The next sweep tick spawned a fresh container and, on the
same tick, ran enforceRunningContainerSla against outbound.db that still
contained the previous container's claim with a hours-old status_changed
timestamp — instant kill-claim, before the agent-runner could open
outbound.db to run its own clearStaleProcessingAcks(). Loop until tries
hit MAX_TRIES.

Add deleteOrphanProcessingClaims() in session-db and call it at the end of
resetStuckProcessingRows. Safe to write outbound.db here because the host
only enters this path after killContainer (or when no container is running).

Tests in host-sweep.test.ts cover the helper plus the regression: orphan
claim from a 2h-old kill is now removed atomically with the messages_in
reset, so the next sweep tick sees an empty claims list and the freshly
respawned container survives long enough to start its agent-runner.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 12:54:42 +02:00
gavrielc
d5b48e4742 fix(credentials): address review feedback
- wakeContainer now never throws — returns Promise<boolean>, catches
  internally. Closes the regression risk for the 5 awaited callers in
  agent-to-agent, interactive, and approvals/response-handler that the
  previous version left unwrapped. Router uses the boolean to stop the
  typing indicator on transient failure; host-sweep just awaits.
- Tighten AUTH_REQUIRED_RE: anchor to start-of-string with the specific
  `·` (U+00B7) separator the CLI uses, so an agent that quotes the
  banner mid-sentence in a normal reply doesn't trip the classifier.
- Log a one-line note from writeAuthRequiredMessage so substitutions
  are visible when debugging "user got the credentials message but I
  don't see why."
- Add unit tests for ClaudeProvider.isAuthRequired covering both banner
  variants, trailing content, mid-sentence quoting, leading-prose
  quoting, alternate separators, and unrelated text.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 17:51:32 +03:00
gavrielc
5f34e26240 fix(credentials): translate auth errors and require OneCLI for spawn
Two related fixes for the case where credentials aren't usable:

1. Replace Claude Code's "Not logged in / Invalid API key · Please run
   /login" output with a host-aware message. The user can't run /login
   from chat, so the raw text is unhelpful. Provider gains an optional
   isAuthRequired() classifier; the poll-loop substitutes the message
   on both result-text and error paths.

2. Treat OneCLI gateway failure as a transient hard error instead of
   spawning a credential-less container. The catch in container-runner
   now propagates; router and host-sweep wrap wakeContainer to log and
   leave the inbound row pending so the next 60s sweep tick retries.
   Router also stops the typing indicator on failure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 17:02:15 +03:00
exe.dev user
209061f54f fix(sweep): wake before reset + idempotent retry for orphan claims
When a container exits with an unresolved processing_ack claim, the
sweep's crashed-container cleanup would reset the matching inbound
message with tries++ and a future process_after. dueCount then dropped
to 0, so the wake step never fired — and the next sweep tick found the
same orphan claim, bumped tries again, and pushed process_after further
out. The message reached MAX_TRIES and was marked failed without any
container ever being spawned.

Two changes:

1. Reorder sweep so the wake step runs before crashed-container
   cleanup. A fresh container clears orphan 'processing' rows on its
   own startup (container/agent-runner/src/db/connection.ts), so once
   we get it running the claim resolves itself.

2. Make resetStuckProcessingRows idempotent: if a message already has
   process_after set to a future time, skip the retry bump. The wake
   path will pick it up when the backoff elapses. Requires returning
   process_after from getMessageForRetry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 15:12:16 +00:00
gavrielc
0105de0257 fix(host-sweep): skip ceiling check when heartbeat file is absent
decideStuckAction treated a missing heartbeat file as heartbeatAge =
Infinity, which always exceeded the 30-minute ceiling. Result: every
freshly-spawned container got killed within seconds of spawn on the
first sweep pass because it hadn't produced an SDK event yet (heartbeat
is only touched on SDK events inside processQuery, not on boot).

Skip the ceiling branch when heartbeatMtimeMs === 0. Containers that
genuinely never wrote a heartbeat because they died are caught by the
separate "container process not running" cleanup path. Containers that
boot, claim a message, but hang at the gate are caught by the
claim-stuck check below — which correctly fires regardless of heartbeat
presence once claimAge exceeds tolerance.

Updates the "absent heartbeat → kill-ceiling" test (which was encoding
the bug) and adds a companion that the claim-stuck path still fires for
absent-heartbeat containers with aged claims.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 12:15:52 +03:00
gavrielc
6a815190c0 feat(lifecycle): stuck detection + heartbeat lifecycle + SDK tool blocklist
Replaces the two overlapping old mechanisms (30-min setTimeout kill in
container-runner, 10-min heartbeat STALE_THRESHOLD reset in host-sweep)
with message-scoped stuck detection anchored to the processing_ack claim
age + an absolute 30-min ceiling that extends for long-declared Bash
tools.

Old model problems:
- IDLE_TIMEOUT setTimeout fired on plain wall-clock time; slow-but-alive
  agents got killed at 30min regardless of activity
- 10-min STALE_THRESHOLD in the sweep was unreliable — the heartbeat is
  only touched on SDK events, so legitimate silent tool work (sleep 30,
  long WebFetch, npm install) looked identical to a hung container
- Two overlapping sources of truth for "when to let go of a container"

New model:
- Host sweep is the single source of truth.
- Container exposes a new `container_state` single-row table in outbound.db
  (schema added; container writes, host reads). PreToolUse hook writes
  current_tool + tool_declared_timeout_ms (read from Bash's tool_input);
  PostToolUse / PostToolUseFailure clear it.
- Sweep decides with a pure helper `decideStuckAction`:
    * absolute ceiling — kill if heartbeat age > max(30min, bash_timeout)
    * per-claim stuck  — kill if any processing_ack row has claim_age >
      max(60s, bash_timeout) AND heartbeat hasn't been touched since claim
    * otherwise ok
  Kill paths reset leftover processing rows with exponential backoff,
  reusing the existing retry machinery.

Tool blocklist expanded:
- AskUserQuestion (SDK placeholder; we have mcp__nanoclaw__ask_user_question)
- EnterPlanMode, ExitPlanMode, EnterWorktree, ExitWorktree (Claude Code UI
  affordances; would hang in headless containers)
PreToolUse hook is also defense-in-depth: if a disallowed tool name slips
through, it returns `{ decision: 'block' }` so the agent sees a clear
error instead of appearing stuck.

Removed:
- container-runner.ts: IDLE_TIMEOUT setTimeout, resetIdle callback on
  activeContainers entry, resetContainerIdleTimer export.
- delivery.ts: the resetContainerIdleTimer call on successful delivery.
- poll-loop.ts: IDLE_END_MS + its setInterval. Keeping the query open is
  cheaper than close+reopen (no cold prompt cache). Liveness is now a
  host-side concern.
- host-sweep.ts: 10-min STALE_THRESHOLD_MS + getStuckProcessingIds in the
  stale-detection path (still exported for kill reset).

Tests:
- src/host-sweep.test.ts — 9 tests for decideStuckAction covering: fresh
  heartbeat, absolute ceiling, absent heartbeat, Bash-timeout extension
  (both ceiling and per-claim), claim age below tolerance, heartbeat
  touched after claim, unparseable timestamps.

Ref: docs/v1-vs-v2/ACTION-ITEMS.md items 9, 6a, 10.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 01:16:57 +03:00
gavrielc
7169c25e70 refactor: relocate outbox I/O to session-manager + dead-code sweep
## Outbox extraction (delivery.ts → session-manager.ts)

File I/O for outbound attachments now lives in session-manager.ts alongside
the symmetric inbound extractAttachmentFiles. delivery.ts no longer touches
the filesystem — it hands buffers to the adapter and calls clearOutbox on
success.

- New `readOutboxFiles(agentGroupId, sessionId, messageId, filenames)` and
  `clearOutbox(agentGroupId, sessionId, messageId)` in session-manager.ts.
- deliverMessage in delivery.ts loses ~35 lines of fs/path code and its
  `fs`/`path` imports.

## Dead-code sweep

TypeScript's --noUnusedLocals surfaced several cruft imports. Fixed:

- src/container-runner.ts: drop unused `markContainerIdle` import; drop
  unused `session` parameter from `buildContainerArgs` signature.
- src/delivery.ts: drop unused `getSession`, `writeSessionMessage`,
  `wakeContainer` imports.
- src/host-sweep.ts: drop unused `updateSession`, `outboundDbPath` imports.
- container/agent-runner/src/poll-loop.ts: drop unused `config`,
  `processingIds` params from `processQuery`.
- Test files: drop unused imports in channel-registry.test, db-v2.test,
  host-core.test.

Skipped: `conversations` state in chat-sdk-bridge.ts (never read but
tangled with public `updateConversations` method; cleaning it risks a
merge conflict with the channels branch at the next sync).

## Validation

- `pnpm run build` clean
- `pnpm test` — 137 host tests pass
- `bun test` in container/agent-runner — 17 tests pass
- Service boots (`NanoClaw running`, `OneCLI approval handler started`)
  and shuts down cleanly on SIGTERM

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 21:34:08 +03:00
gavrielc
473f766585 refactor(modules): extract scheduling as registry-based module
Moves the scheduling surface — 5 delivery actions (schedule_task,
cancel_task, pause_task, resume_task, update_task), handleRecurrence,
applyPreTaskScripts, and task DB helpers — out of core and into
src/modules/scheduling/ (host) and container/agent-runner/src/scheduling/
(container).

First PR to fill the MODULE-HOOK markers introduced in PR #2:
  - src/host-sweep.ts MODULE-HOOK:scheduling-recurrence now dynamically
    imports handleRecurrence from the module each sweep tick.
  - container/agent-runner/src/poll-loop.ts MODULE-HOOK:scheduling-pre-task
    dynamically imports applyPreTaskScripts before the provider call.
    When the marker block is empty (scheduling uninstalled), `keep`
    falls back to `normalMessages` so non-task messages still flow.

The 5 task cases are removed from delivery.ts's handleSystemAction
switch — the registry now routes them. Task DB helpers moved out of
src/db/session-db.ts (which kept `nextEvenSeq` as a named export so
the module can uphold the host-writes-even-seq invariant). Test suite
split to match: scheduling-specific tests live in the module.

No migration — tasks are messages_in rows with kind='task'.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 16:17:47 +03:00
gavrielc
4202041d0b refactor: scaffold module registries and default-module layout
Additive change — existing code paths still run via inline fallbacks.
Prepares core for per-module extractions in PR #3 onward.

Four registries added with empty defaults:
  - delivery action handlers (delivery.ts)
  - router inbound gate (router.ts)
  - response dispatcher (index.ts)
  - MCP tool self-registration (container/agent-runner/src/mcp-tools/server.ts)

Default modules moved to src/modules/ for signaling:
  - src/modules/typing/       (extracted from delivery.ts)
  - src/modules/mount-security/ (moved from src/mount-security.ts)

Both are imported directly by core — no hook, no registry. Removal
requires editing core imports.

Migrator now keys applied rows by name (uniqueness) so module
migrations can pick arbitrary version numbers. Stored version column
is auto-assigned as an applied-order sequence.

sqlite_master guards added around core calls into module-owned tables
(user_roles, agent_destinations, pending_questions). No-ops today;
load-bearing after the owning modules are extracted.

MODULE-HOOK markers placed at scheduling's two skill-edit sites
(host-sweep.ts recurrence call, poll-loop.ts pre-task gate). PR #4
replaces the marked blocks when scheduling moves to its module.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 14:46:19 +03:00
exe.dev user
8ef30ad289 fix(v2): cancel/pause/resume recurring tasks via series_id
Recurring tasks spawn a new messages_in row per occurrence. Cancel
only matched the completed row the agent remembered, leaving the
live next occurrence running. Tag every row in a recurrence chain
with the originating task's id (series_id) so cancel/pause/resume
can reach any live row in the series. Cancel also clears recurrence
to prevent the sweep from cloning a cancelled task. Kind-aware id
prefix on recurrences (task- instead of msg-) keeps list_tasks output
consistent across occurrences.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-16 16:31:29 +00:00
exe.dev user
4c46dd3b39 fix(v2): await handleRecurrence so the DB isn't closed mid-flight
sweepSession called handleRecurrence without await, then synchronously
closed inDb in its finally block. handleRecurrence is async because it
does a dynamic `import('cron-parser')` before the first DB write; that
import resolved after the finally had already run, so insertRecurrence
hit a closed handle and threw "The database connection is not open".

Net effect: every recurring task was correctly marked completed by
syncProcessingAcks, but its next occurrence never got scheduled.
Single-word fix — `await`.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 16:31:29 +00:00
gavrielc
669a8444ef refactor(v2): extract session DB operations into src/db/session-db.ts
Move all raw SQL out of session-manager, delivery, and host-sweep into
a dedicated DB module. Make session schemas idempotent (IF NOT EXISTS)
so initSessionFolder always applies them. Revert the markdown
plain-text retry from 4c477ac.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 17:36:16 +03:00
gavrielc
b76fd425c8 style: prettier formatting fixes
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 12:18:31 +03:00
gavrielc
82cb363f84 v2: split session DB into inbound/outbound for write isolation
Eliminates SQLite write contention across the host-container mount
boundary by splitting the single session.db into two files, each with
exactly one writer:

  inbound.db  — host writes (messages_in, delivered tracking)
  outbound.db — container writes (messages_out, processing_ack)

Key changes:
- Host uses even seq numbers, container uses odd (collision-free)
- Container heartbeat via file touch instead of DB UPDATE
- Scheduling MCP tools now emit system actions via messages_out
  (host applies them to inbound.db during delivery)
- Host sweep reads processing_ack + heartbeat file for stale detection
- OneCLI ensureAgent() call added (was missing from v2, caused
  applyContainerConfig to reject unknown agent identifiers)

Verified: tsc clean, 327 tests pass, real e2e through Docker works.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 12:17:31 +03:00
gavrielc
9486d56b01 v2: make v2 the main entry point, move v1 to src/v1/
- Move all v1 files (index, router, container-runner, db, ipc, types,
  logger, channels/registry, and all utilities) to src/v1/ as a
  fully self-contained archive with no shared dependencies
- Rename v2 files to remove -v2 suffix (index-v2.ts → index.ts, etc.)
- Update all imports across v2 source, tests, and setup files
- Migrate shared utilities (config, env, container-runtime, mount-security,
  timezone, group-folder) from pino logger to v2 log module
- Migrate setup/ files from logger to log with argument order swap
- Container agent-runner: move v1 entry to v1/, rename v2 to index.ts
- Update setup skill to offer all 13 v2 channels
- Install all Chat SDK adapter packages
- dist/index.js now runs v2; dist/v1/index.js runs v1

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 11:40:36 +03:00
gavrielc
8a06b01646 v2: SQLite state adapter, admin commands, compact feedback
- Replace in-memory Chat SDK state with SqliteStateAdapter — thread
  subscriptions now persist across restarts
- Add migration 002 for chat_sdk_kv, subscriptions, locks, lists tables
- Handle /clear in agent-runner (reset sessionId) — SDK has
  supportsNonInteractive:false for this command
- Pass /compact, /context, /cost, /files through to SDK as admin commands
- Skip admin commands in follow-up poll so they start fresh queries
- Emit compact_boundary events as user-visible feedback messages
- Pass NANOCLAW_ADMIN_USER_ID and NANOCLAW_ASSISTANT_NAME to containers

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 03:58:35 +03:00
gavrielc
c348fabf22 v2 phase 5: scheduling fixes, media handling, command processing
- Host sweep: fix DELETE journal mode, busy_timeout, seq in recurrence INSERT
- Outbound files: delivery reads from outbox dir, passes buffers to adapter,
  cleans up after delivery. Chat SDK bridge sends files via postMessage.
- Inbound attachments: formatter includes attachment info in prompts
- Commands: categorize /commands as admin, filtered, or passthrough.
  Admin commands check sender against NANOCLAW_ADMIN_USER_ID.
  Filtered commands silently dropped. Passthrough sent raw to agent.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 02:59:33 +03:00
gavrielc
d35386a46e style: apply prettier formatting to v2 source files
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 23:59:08 +03:00
gavrielc
d7c68e04b1 v2 phase 3: host core — router, session manager, delivery, sweep
Host orchestrator connecting channel events to session DBs and
delivering responses back through channel adapters.

- session-manager.ts: session folder/DB lifecycle, message writing
- container-runner-v2.ts: Docker spawn with session + agent group
  mounts, OneCLI, idle timeout, agent-runner recompilation
- router-v2.ts: inbound routing (channel → messaging group → agent
  group → session → messages_in → wake container)
- delivery.ts: two-tier polling (1s active, 60s sweep) for
  messages_out, channel adapter delivery
- host-sweep.ts: stale detection with backoff, recurrence, wake
  containers for due messages
- index-v2.ts: thin entry point wiring everything together
- scripts/test-v2-agent.ts: real Claude provider integration test

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 23:43:13 +03:00