The follow-up poller filtered /clear out of every tick without acking
the row, and pushed every other slash command through plain
formatMessages() (XML wrapping). On a warm container the outer
while(true) loop never regains control, so:
- /clear sat pending in messages_in forever (no response at all)
- /compact, /cost, /context, /files, /remote-control arrived at the
SDK as XML-wrapped user text and were never dispatched as commands
Both modes are invisible to host monitoring: rows are either left
pending without a processing_ack claim, or marked completed normally;
heartbeat keeps firing inside the SDK event loop.
When the follow-up poller observes any slash command (admin or
passthrough — categorizeMessage decides), end the active query so the
current turn winds down cleanly and the outer loop wakes, re-fetches
the same pending set, and runs them through the canonical path
(/clear handler + formatMessagesWithCommands raw dispatch). Leave the
rows untouched so the outer-loop fetch sees the same set the poller
saw.
Cost: each slash command on a warm container forces close+reopen of
the SDK stream — a few seconds of subprocess startup. The Anthropic
prompt cache is server-side with a 5-min TTL keyed on prefix hash, so
stream lifecycle does not affect cache lifetime; close+reopen within
5 min still gets cache hits.
Also corrects the warm-stream rationale comment on processQuery, which
implied keeping the stream open preserved cache warmth — it doesn't.
Testing evidence — cache stays warm across stream close+reopen:
Turn 1 (warm session):
Usage: in=6 out=245 cache_create=92 cache_read=22996
Full cache hit (22996 tokens).
Turn 2 — /clear arrives:
Pending slash command — ending stream so outer loop can process
Clearing session (resetting continuation)
Usage: in=6 out=95 cache_create=9393 cache_read=13600
System prompt + tool defs (~13600 tokens) still hit cache;
conversation history is gone (continuation reset) so the new turn
writes fresh context.
Turn 3 — /cost arrives:
Pending slash command — ending stream so outer loop can process
Usage: in=0 out=0 cache_create=0 cache_read=0 wall=0.0s api=0.0s
/cost is a CLI built-in: dispatched locally by the SDK, no API
call. Pre-fix this would have arrived as XML-wrapped user text
and never dispatched — confirms the broader fix works.
Turn 4 (next chat after /cost):
Usage: in=6 out=142 cache_create=328 cache_read=22993
Full cache hit again (22993 tokens read, 328 written). Despite the
/cost-induced stream close+reopen, the server-side prompt cache
survived: the new sdkQuery() resumed the same continuation, the
request prefix matched the cached entry.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
OneCLI runs in a Docker container, so Docker must be installed first.
Reordered: Docker (3a) → OneCLI (3b) → Auth (3c) → Skills (3d) →
Build (3e). OneCLI install now skips with a clear message if Docker
isn't available.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
migrate-v2.sh now runs setup/install-docker.sh when Docker isn't
found instead of just printing a message. The container build step
reports failure (not skip) when Docker is unavailable so the skill
can triage it.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The migration is no longer experimental — it's been tested end-to-end
with service switchover, session continuity, and revert. Updated the
changelog entry to reflect the new migrate-v2.sh flow.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Extracted the helpers we use (JID parsing, trigger mapping, channel
auth registry, generateId, v2PlatformId) into setup/migrate-v2/shared.ts.
Deleted setup/migrate-v1/ entirely — no code references it anymore.
Updated README, CLAUDE.md, docs/v1-to-v2-changes.md, and
docs/migration-dev.md to reference the new paths and migrate-v2.sh
entry point.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The old migration flow (detect → validate → db → groups → env →
channel-auth → channels → tasks) ran inside `bash nanoclaw.sh` via
setup/auto.ts. Replaced by the standalone `bash migrate-v2.sh` flow.
Deleted:
- setup/migrate-v1.ts (orchestrator)
- setup/migrate-v1/{detect,validate,db,env,groups,channel-auth,channels,tasks}.ts
Kept:
- setup/migrate-v1/shared.ts (used by new migrate-v2/ steps)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New entry point: `bash migrate-v2.sh` from the v2 checkout.
Replaces the old setup-embedded migration flow with a standalone
4-phase script + rewritten Claude skill for the interactive parts.
Phase 0: Bootstrap (Node/pnpm/deps via setup.sh) + find v1
Phase 1: Core state (env, DB, groups, sessions, tasks)
Phase 2: Channels (clack multiselect, auth copy, code install)
Phase 3: Infrastructure (OneCLI, auth, Docker, skills, container build)
Service switchover: stop v1 → start v2 → test → keep or revert
Phase 4: Handoff → exec claude "/migrate-from-v1"
The skill handles: owner seeding, access policy, CLAUDE.local.md
cleanup, container config validation, fork customization porting.
Key fixes found during testing:
- triggerToEngage: requires_trigger=0 must override non-empty pattern
- unknown_sender_policy defaults to 'public' (strict drops all msgs
before owner is seeded)
- Service revert must stop v2 (parse unit name from step log, not
early tsx one-liner that can fail)
- Session continuity: copy JSONL from -workspace-group/ to
-workspace-agent/ and write continuation:claude into outbound.db
- container_config.additionalMounts written directly to container.json
(same shape in v1 and v2)
- EXIT trap writes handoff.json; explicit write_handoff before exec
Includes migrate-v2-reset.sh for dev iteration and docs/migration-dev.md
for testing/debugging reference.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- verify: remove the CLI ping; cli-agent step earlier in setup already
proved the round-trip works, and the test agent gets cleaned up before
verify runs — so the ping was guaranteed to fail on installs that wired
a messaging app instead of staying CLI-only. Status now collapses to
service-running ∧ credentials ∧ ≥1 wired group.
- agent-ping: catch Claude Code's "Please run /login" / "Not logged in" /
"Invalid API key" banners so a successfully-spawned agent that has no
credentials no longer reports as 'ok'.
- auth paste: validate the full sk-ant-oat…AA shape; when the cleaned
input is under 90 chars, surface a truncation-specific hint pointing at
terminal wrap as the likely cause. Strip internal whitespace at both
validate and assignment so multi-line pastes that survive clack also
go through cleanly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Resolve import conflict in setup/auto.ts — keep runMigrateV1 import,
deduplicate runWindowedStep and getLaunchdLabel/getSystemdUnit imports.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
vercel@53.0.1 declares a dep on @vercel/static-build@2.9.22 which is not
published on npm (only 2.9.21 exists), breaking every fresh container
build that resolves vercel@latest.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors the four defenses on the outbound side onto extractAttachmentFiles:
1. Reject unsafe messageId via isSafeAttachmentName before any inbox path
is built. WhatsApp passes msg.key.id through raw and that field is
client generated, so a peer can craft it; future end to end encrypted
adapters will have the same property.
2. lstatSync on the inbox dir refuses a pre placed symlink before
mkdirSync would silently follow it.
3. realpathSync + isPathInside contains the resolved dir under the
session inbox root.
4. writeFileSync uses the wx flag so a pre placed symlink at the file
path is refused atomically by the kernel; EEXIST surfaces as a
logged skip.
Threat: the session dir is mounted writable into the container at
/workspace, so a compromised agent can pre place inbox/<future msgId>/
as a symlink and wait for a chat message with a matching id to redirect
the host write. The four guards together close that window.
Consolidates with the existing isSafeAttachmentName helper from
attachment-safety.ts rather than introducing a duplicate basename
validator inside session-manager.
Co-Authored-By: Daisuke Tsuji <dim0627@gmail.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two fixes on top of the follow-up pre-task-script work:
1. The void async IIFE inside the interval handler had no catch, so a
throw from the dynamic import or applyPreTaskScripts escaped as an
unhandled rejection — terminating the container. The initial-batch
path is wrapped by processQuery's outer try/catch; the follow-up
path needs its own. Now logs the error and lets the next tick retry.
2. Re-check `done` immediately before query.push. The flag can flip
true while applyPreTaskScripts is awaited (outer stream finishes
during the script execution); without the re-check we'd push into a
closed query. Claimed messages get released by the host's
processing-claim sweep — same recovery posture as the rest of the
poller.
Co-Authored-By: Michael Zazon <mzazon@gmail.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Routes the post-ping `_ping-test` cleanup through `spawnQuiet` +
`setupLog.step` so a non-zero exit from `delete-cli-agent.ts` lands
in `logs/setup-steps/cleanup-cli-agent.log` and the progression log,
and prints a one-line warn to the user. Previously the spawnSync was
fire-and-forget with `stdio: 'ignore'`, leaving an orphan agent group
silently if cleanup failed.
Restores the original copy on the cli-agent step labels, the ping
explainer paragraph, and the post-ping spinner stop line — those
copy changes are out of scope for this PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>