Files

gavrielc c8fc1da719 refactor(claude-md): compose per-group CLAUDE.md from shared base + fragments

Replace the per-group "written once at init, owned by the group" CLAUDE.md
with a host-regenerated entry point that imports:

  - a shared base (`container/CLAUDE.md` mounted RO at `/app/CLAUDE.md`)
  - optional per-skill fragments (skills that ship `instructions.md`)
  - optional per-MCP-server fragments (inline `instructions` field in
    `container.json`)
  - per-group agent memory (`CLAUDE.local.md`, auto-loaded by Claude Code)

Principle: RW = per-group memory, RO = shared content. Source/skills/base
are shared; personality, config, working files, and Claude state stay
per-group.

Key changes:

  - New `src/claude-md-compose.ts` — per-spawn composition +
    `migrateGroupsToClaudeLocal()` one-time cutover.
  - New `container/CLAUDE.md` — shared base, seeded verbatim from the
    former `groups/global/CLAUDE.md`.
  - `src/container-runner.ts` — swap `/workspace/global` mount for RO
    `/app/CLAUDE.md`; call `composeGroupClaudeMd()` after
    `initGroupFilesystem()`.
  - `src/group-init.ts` — drop `.claude-global.md` symlink + initial
    `CLAUDE.md` write; seed `CLAUDE.local.md` from `opts.instructions`.
  - `src/index.ts` — call `migrateGroupsToClaudeLocal()` at startup.
  - `src/container-config.ts` — add optional `instructions` field to
    `McpServerConfig` (inline per-MCP guidance fragment).
  - `container/Dockerfile` — drop dead `/workspace/global` mkdir.
  - Remove obsolete `scripts/migrate-group-claude-md.ts`.

Migration (runs once at host startup, idempotent):

  - Delete `.claude-global.md` symlinks in each group.
  - Rename each `groups/<folder>/CLAUDE.md` → `CLAUDE.local.md`
    (preserves existing per-group content as memory).
  - Delete `groups/global/` directory.

Design docs: `docs/claude-md-composition.md` and `docs/shared-source.md`
(the latter is the sibling design discussion this refactor builds on).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-22 12:58:43 +03:00

14 KiB

Raw Blame History

Shared Source

Replace per-group agent-runner-src copies with a single shared read-only mount.

Problem

Each agent group gets a full copy of container/agent-runner/src/ at creation time. This copy is mounted RW at /app/src in the container. Consequences:

Bug fixes and features don't propagate to existing groups
Owner edits to container/agent-runner/src/ silently don't apply to existing groups
No tooling to diff or detect drift between groups and upstream
The RW mount lets agents write to their own runtime source without approval
Cross-cutting changes (host + container) break down when container code is per-group
Skills have the same copy-and-drift problem

Design

Principle: RW is per-group, RO is shared. Every mount is either read-only and shared across all groups, or read-write and scoped to one group. Source and skills become RO + shared. Personality, config, working files, and Claude state stay RW + per-group. This makes drift impossible by construction — no group can diverge from shared code because no group has write access to it.

Shared source mount

Mount container/agent-runner/src/ into all containers at /app/src as read-only.

container/agent-runner/src/ → /app/src (RO, shared)

Source is never baked into the image. /app/src/ exists only via this mount — running without it is an intentional startup failure (entrypoint bun run /app/src/index.ts → ENOENT). Source-only changes never trigger image rebuilds; edits to .ts files take effect on next container spawn.

Image rebuilds are only needed for:

Agent-runner npm dependency changes (package.json / bun.lock)
System packages, runtime versions, global CLI version bumps
Dockerfile/entrypoint changes

Shared skills mount

Mount container/skills/ into all containers at /app/skills/ as read-only.

Per-group skill selection via container.json:

{
  "skills": ["welcome", "agent-browser", "self-customize"]
  // or "skills": "all" (default)
}

At every spawn, the host syncs symlinks in the group's .claude-shared/skills/ directory to match the selected set. For "all", the set is recomputed from the shared skills dir on each spawn — newly-added upstream skills appear without intervention. Symlinks for skills no longer in the set are removed.

Each symlink points to a container path:

.claude-shared/skills/welcome → /app/skills/welcome
.claude-shared/skills/agent-browser → /app/skills/agent-browser

Claude Code scans /home/node/.claude/skills/, follows the symlinks, loads the selected skills. Same dangling-symlink-on-host pattern as .claude-global.md — host tools don't resolve the target, the container mount makes it valid at read time.

Per-group customization surface

What remains per-group (unchanged):

Resource	Location	Mechanism
Personality / instructions	`groups/<folder>/CLAUDE.md`	Mount at `/workspace/agent` (RW, live)
MCP servers	`groups/<folder>/container.json`	Env var at spawn
apt/npm packages	`groups/<folder>/container.json`	Per-group image layer
Skill selection	`groups/<folder>/container.json`	Symlinks at spawn
Additional mounts	`groups/<folder>/container.json`	Validated bind mounts
Agent provider / model	`groups/<folder>/container.json`	Read by runner at startup
Claude Code settings	`.claude-shared/settings.json`	Mount at `/home/node/.claude` (RW)
Working files	`groups/<folder>/`	Mount at `/workspace/agent` (RW)

Self-modification

Existing config-level self-mod tools (install_packages, add_mcp_server, request_rebuild) mutate container.json and per-group images, not source. Unchanged — stays per-group.

Source-level self-modification (not yet implemented) uses staging: edits happen against a copy of container/agent-runner/src/, reviewed and swapped in on approval. Owner can also edit source directly.

Environment variables

Env is for things read by code we don't own: glibc, Node's http agent, CLIs we shell out to. Everything NanoClaw-specific moves out of env.

Stays in env (read by non-nanoclaw code):

Var	Reader
`TZ`	glibc, child processes
`HTTPS_PROXY`, `NO_PROXY`	Node http agent, curl, git, etc. (OneCLI-injected)
`NODE_EXTRA_CA_CERTS`	Node at startup (OneCLI-injected)

Moves to container.json (read by runner at startup):

Var	Reason
`AGENT_PROVIDER`	Per-group config; runner reads before importing provider module
`NANOCLAW_AGENT_GROUP_NAME`	Per-group identity
`NANOCLAW_ASSISTANT_NAME`	Per-group identity
`NANOCLAW_MAX_MESSAGES_PER_PROMPT`	Config constant; per-group override possible

Deleted (admin gating moves to router):

NANOCLAW_ADMIN_USER_IDS is removed entirely — not moved to a new location. The container no longer makes authorization decisions. See Router command gate below.

Hardcoded as conventions:

Var	Convention
`SESSION_INBOUND_DB_PATH`	`/workspace/inbound.db`
`SESSION_OUTBOUND_DB_PATH`	`/workspace/outbound.db`
`SESSION_HEARTBEAT_PATH`	`/workspace/.heartbeat`
`NANOCLAW_AGENT_GROUP_ID`	Read from `/workspace/agent/container.json` at startup

Runner startup order

The runner can no longer assume DB paths or provider identity are handed to it in env. Revised startup:

Set up logging.
Read /workspace/agent/container.json (mounted RW but read-only here).
Open /workspace/inbound.db and /workspace/outbound.db (fixed paths).
Read bootstrap tables from inbound.db (destinations).
Import the provider module selected by container.json.
Enter the poll loop.

Router command gate

The host router gates slash commands before writing to messages_in. The container still handles whatever reaches it; it just stops making authorization decisions.

Filtered commands (/help, /login, /logout, /doctor, /config, /start, /remote-control) → drop silently. Never reach the container.
Admin commands (/clear, /compact, /context, /cost, /files) → check sender against user_roles (owners + global admins + admins scoped to this agent group).
- Denied: write "Permission denied: <cmd> requires admin access." directly to messages_out in the same thread. Do not write to messages_in.
- Allowed: pass through to container unchanged.
Normal messages → pass through unchanged.

Admin commands that flow through continue to be handled the same way they are today:

/clear — container's existing handler in poll-loop.ts resets session continuation and writes "Session cleared."
/compact, /context, /cost, /files — container forwards them to Claude Code's native slash-command handler.

Container receives only authorized messages. The runner has no admin concept, no adminUserIds field, no admin-gate branch — but it still recognizes /clear to reset session state.

Scope rules

Each channel answers a single scope question:

Channel	Scope	What it holds
Env vars	Process	Things read by code we don't own (`TZ`, `HTTPS_PROXY`)
`container.json`	Per-group	Per-group config (MCP, packages, provider, model, skills, mounts)
`inbound.db` / `outbound.db`	Per-session	Messages, session state, and host-projected views of cross-group state (destinations)
Central DB (`data/v2.db`)	Cross-group	Users, roles, wiring, messaging groups, sessions

The runner reads from env (for external-convention vars), container.json (for its own group's config), and inbound.db (for messages + projected views). It never reads central DB directly — that's always host-projected through inbound.db first.

After this change, the spawn-time -e flags shrink from ~10 to ~3-5 (TZ + OneCLI networking). No NANOCLAW_* env var survives.

Image layer strategy

Single Dockerfile with aggressive layer ordering: stable layers first, frequently-bumped layers last. BuildKit's layer cache handles "upstream layers unchanged" rebuilds efficiently — a separate base image isn't justified.

Two image tags exist at runtime:

nanoclaw-agent:latest          — shared base (rebuild: dep/CLI bumps + Dockerfile changes)
  └── nanoclaw-agent:<group>   — per-group apt/npm packages (rebuild: per-group via install_packages)

Layer order within the base:

FROM node:22-slim

# System deps (apt) — rarely change
RUN apt-get install ...

# Bun — pinned version, rarely changes
RUN ... bun

# Agent-runner deps — cached independently of CLI versions
COPY agent-runner/package.json agent-runner/bun.lock /app/
RUN cd /app && bun install --frozen-lockfile

# Global CLIs — most stable first, most frequently bumped last
RUN pnpm install -g "vercel@${VERCEL_VERSION}"
RUN pnpm install -g "agent-browser@${AGENT_BROWSER_VERSION}"
RUN pnpm install -g "@anthropic-ai/claude-code@${CLAUDE_CODE_VERSION}"

Bumping claude-code (the most common change) only rebuilds one layer. Agent-runner deps and other CLIs stay cached.

Source is never baked into the image — always provided by the shared RO mount at runtime.

Agent-triggered version bumps

Agents can request a claude-code version bump via a new self-mod tool (bump_claude_code). Same fire-and-forget pattern as install_packages: agent requests → owner approves → host rebuilds base image → kill all running containers. Unlike install_packages (per-group image), this rebuilds the shared base image and affects all groups.

Changes

`group-init.ts`

Remove the agent-runner-src copy block (lines 109–117)
Remove the skills/ copy block (lines 100–107)
Skill symlinks are no longer created at init — sync is spawn-owned (see container-runner.ts)

`container-runner.ts` `buildMounts()`

Remove per-group agent-runner-src mount (lines 206–209)
Add shared RO mount: container/agent-runner/src/ → /app/src
Add shared RO mount: container/skills/ → /app/skills
Sync skill symlinks in .claude-shared/skills/ at spawn: write desired set from container.json ("all" = every skill in the shared dir, recomputed per spawn), remove symlinks not in the set

`container-runner.ts` `buildContainerArgs()`

Remove -e SESSION_INBOUND_DB_PATH, -e SESSION_OUTBOUND_DB_PATH, -e SESSION_HEARTBEAT_PATH (hardcoded conventions now)
Remove -e AGENT_PROVIDER (moves to container.json)
Remove -e NANOCLAW_ASSISTANT_NAME, -e NANOCLAW_AGENT_GROUP_ID, -e NANOCLAW_AGENT_GROUP_NAME
Remove -e NANOCLAW_MAX_MESSAGES_PER_PROMPT
Remove the user_roles join + -e NANOCLAW_ADMIN_USER_IDS block (lines 269–287) entirely. Admin gating moves to the router — no admin data passed to the container.
Keep: -e TZ, OneCLI-contributed env (HTTPS_PROXY, NODE_EXTRA_CA_CERTS, NO_PROXY)

`router.ts` (new command gate)

Classify inbound slash commands before writing to messages_in: filtered / admin / normal.
Filtered (/help, /login, /logout, /doctor, /config, /start, /remote-control) → drop silently.
Admin commands (/clear, /compact, /context, /cost, /files) from non-admins → write "Permission denied" directly to messages_out, skip messages_in.
All authorized messages (admin commands from admins, and normal messages) → pass through unchanged to messages_in. Container handles them as today.
The ADMIN_COMMANDS and FILTERED_COMMANDS lists move from container/agent-runner/src/formatter.ts to a host-side module.

`container/agent-runner/src/` (runner)

New config.ts module: loads /workspace/agent/container.json at startup, exposes a typed config singleton. All previous process.env.NANOCLAW_* reads go through this.
db/connection.ts: use hardcoded paths /workspace/inbound.db and /workspace/outbound.db; drop SESSION_*_DB_PATH lookups.
formatter.ts: remove ADMIN_COMMANDS, FILTERED_COMMANDS, and the filtered / admin-gate categorization. Keep enough to recognize /clear so poll-loop.ts can route it (e.g., a narrow isClearCommand(msg) helper).
poll-loop.ts: remove adminUserIds field from config type and the admin-gate branch (lines 113–126). Keep the /clear handler (lines 128–142) — /clear still flows through from the router.
Provider selection (providers/index.ts or equivalent): read provider from config singleton, not env.

`container-config.ts`

Add skills field to ContainerConfig (string[] | "all", default "all")
Add fields: provider, groupName, assistantName, maxMessagesPerPrompt (optional, falls back to code default)

`.env` / `.env.example`

Remove any NANOCLAW_* entries that were documented as tunables. Update .env.example to list only TZ and OneCLI-related vars as valid overrides.

DB migration

Drop agent_groups.agent_provider column and sessions.agent_provider column. Source of truth becomes container.json.provider.
One-time data migration reads existing values and writes them to each group's container.json. Sessions lose any per-session provider override — provider is a per-group property now.

Migration

This is a breaking change. Host restart kills all running containers. No gradual rollout. Any code referencing dropped columns or removed env vars must be updated before the migration runs.

Provider install skills (/add-opencode, /add-ollama-tool) now write to the shared container/agent-runner/src/providers/ tree. The per-group providers/ overlay pattern is removed. Any uncommitted provider overlays must be upstreamed before cutover.
Delete existing data/v2-sessions/<id>/agent-runner-src/ directories on first run after cutover.
Existing .claude-shared/skills/ directories get replaced with symlinks on next spawn.
DB migration (see above) reads agent_provider columns and projects into container.json, then drops the columns.

What triggers what

Change	Action needed	Scope
Agent-runner `.ts` source	Kill running containers	All groups
Agent-runner npm deps	Rebuild `nanoclaw-agent` + kill all	All groups
System deps, Bun, Node	Rebuild `nanoclaw-agent` + kill all	All groups
Claude-code version bump	Rebuild `nanoclaw-agent` + kill all	All groups (agent-triggerable)
Skill content	Kill running containers	All groups
Per-group apt/npm packages	`buildAgentGroupImage()` + kill	One group
Per-group config (MCP, mounts, provider, model, skills)	Kill that group's containers	One group
CLAUDE.md, working files	Nothing (live via RW mount)	One group

14 KiB Raw Blame History Unescape Escape