Source of truth for container runtime config moves from
groups/<folder>/container.json to a new container_configs table.
The file becomes a materialized view written at spawn time.
- New container_configs table with scalar columns (provider, model,
effort, image_tag, assistant_name, max_messages_per_prompt) and
JSON columns (mcp_servers, packages_apt, packages_npm, skills,
additional_mounts)
- Startup backfill seeds DB from existing container.json files
- materializeContainerJson() replaces readContainerConfig + ensureRuntimeFields
- Self-mod handlers (install_packages, add_mcp_server) write to DB
- Provider cascade simplified: session -> container_configs -> 'claude'
- ncl groups config-{get,update,add-mcp-server,remove-mcp-server,
add-package,remove-package} custom operations
- restartAgentGroupContainers() helper for config change propagation
- Container side unchanged (still reads /workspace/agent/container.json)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- genericList now accepts column filters (--flag value) and LIMIT (default 200)
- Remove early inDb.close() in container pollResponse to avoid double-close
- Document filtering and --limit in cli.instructions.md
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When a container agent calls an approval-gated ncl command, dispatch
now sends an approval card to an admin instead of returning a stub
error. On approve, the handler re-dispatches the original command
and notifies the agent with the result.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Rename the CLI binary, socket path, container wrapper, error prefixes,
and all references from `nc` to `ncl`. Add ~/.local/bin symlink during
setup and pnpm script alias.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- host-core.test.ts: add in_reply_to: null to routeAgentMessage calls
(required after #2267 added the field to RoutableAgentMessage)
- agent-route.test.ts: use 'closed' instead of 'archived' (not a valid
Session status)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The skipped coexistence test and the findSessionByAgentGroup
bug-documenting test were written before the A2A return-path fix
(#2267). That fix sidesteps findSessionByAgentGroup entirely —
A2A replies now use source_session_id for routing, so the
"newest session wins" behavior is only a fallback for unsolicited
first-contact A2A where any session will do.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Squash merge of PR #2267 by ddaniels.
When an agent group has more than one active session, A2A replies landed
in the newest session via findSessionByAgentGroup's ORDER BY created_at
DESC. The session that asked the question never saw the answer.
Adds origin-aware return-path routing with three layers:
1. Direct return-path: if the reply has in_reply_to, look up the
triggering inbound row's source_session_id and route there.
2. Peer-affinity fallback: find the most recent A2A inbound from this
peer and use its source_session_id.
3. Legacy fallback: newest active session (pre-migration compat).
Container-side: MCP send_message/send_file now thread the current
batch's in_reply_to through to outbound rows via current-batch.ts.
Also flips our A2A bug-documenting test (#2332) from asserting the
broken behavior to asserting the fixed behavior.
Co-Authored-By: Doug Daniels <ddaniels888@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three tests that exercise agent-to-agent routing and document the broken
behavior that #2332 describes:
1. A2A outbound lands in target session — basic happy path, passes.
2. A2A return path resolves to wrong session when source agent has
multiple channel sessions. Researcher responds to PA, but
findSessionByAgentGroup picks PA's newest session (Discord) instead
of the Slack session that originated the A2A call. Test asserts the
buggy behavior (response in Discord, nothing in Slack).
3. A2A-only session gets null session_routing. writeSessionRouting on a
session with messaging_group_id=NULL writes all nulls — the target
agent has no default routing for replies. Test asserts the nulls.
These tests pass today by asserting the broken state. When #2332 is
fixed (origin-aware return routing), these assertions should flip to
the correct behavior.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Host-side (vitest):
- Routed message preserves platformId/channelType/threadId on messages_in
- Fan-out gives each agent correct per-agent routing
- writeSessionRouting populates session_routing from messaging group
- writeSessionRouting writes null routing for agent-shared sessions
- Per-thread session includes thread_id in session_routing
- Agent-shared resolves to same session on repeated calls
- Agent-shared session has null messaging_group_id
- findSessionByAgentGroup returns channel-bound session (documents #2332)
- Skip: agent-shared/channel-bound coexistence (blocked on #2332 fix)
Container-side (bun:test):
- Internal tags stripped between message blocks
- Mixed task + chat batch with correct routing
The agent-shared tests uncovered the exact bug from #2332:
findSessionByAgentGroup doesn't distinguish agent-shared from
channel-bound sessions, so A2A resolution reuses a channel session
when one exists.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Without this, an unrecoverable failure such as TokenInvalid causes the
gateway listener to restart ~10x/sec, which Discord's Cloudflare layer
treats as abuse and answers with a multi-hour IP block. Both the clean-
expiry path and the error path now share a backoff that doubles up to
1h, with a >5min healthy run resetting the counter.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Add integration test for per-destination thread_id resolution: seeds two
destinations with different thread IDs, verifies each outbound message
carries the correct thread_id (not a global one from the batch routing).
- Add log line in resolveDestinationThread catch block for debuggability.
- Remove stray "(ensurePreCompactHook is defined after the main function.)"
comment from group-init.ts.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The poll loop had a bare-text routing fallback in dispatchResultText: when
the agent produced text without <message to="..."> wrapping, it would auto-
route to the session's originating channel (via a frozen RoutingContext) or
to the single configured destination. This caused three problems:
1. Routing drift: RoutingContext was extracted once from the initial batch
and never refreshed. When the initial batch was a null-routed cron task
and a real chat arrived mid-query, replies were silently dropped to
scratchpad because the frozen routing had all-null fields.
2. Cross-channel thread bleed: sendToDestination applied a single
routing.threadId to every outbound message regardless of destination.
In agent-shared sessions (multiple channels sharing one session), one
channel's thread ID was stamped onto messages to a different channel.
3. Inconsistent formatting: task, webhook, and system messages had no
origin metadata in their formatted output, so the agent couldn't tell
which destination they came from — even when the underlying messages_in
rows carried routing fields.
Changes:
- Remove the bare-text routing fallbacks in dispatchResultText (both the
routing-based and single-destination shortcuts). All agent output must
be wrapped in <message to="name">...</message>. Bare text is scratchpad.
- Update buildDestinationsSection() to require explicit wrapping for all
groups, including single-destination. No more "no special wrapping
needed" shortcut.
- Resolve thread_id per-destination via resolveDestinationThread(), which
queries messages_in for the most recent message matching the target
channel+platform. Falls back to null (top-level channel message) when
no prior inbound exists for that destination.
- Extract originAttr() helper in formatter.ts and apply it to all message
types. Tasks now render as <task from="dest" time="...">, webhooks as
<webhook from="dest" source="..." event="...">, system responses as
<system_response from="dest" ...>. The agent always sees where a
message originated.
- Add a PreCompact shell hook (compact-instructions.ts) that outputs
custom compaction instructions, telling the compactor to preserve
recent message XML structure and routing metadata in the summary.
Wired via settings.json in the .claude-shared scaffold, with a
migration path (ensurePreCompactHook) for existing groups.
Relation to open PRs:
- #2277 (mergeRouting) becomes unnecessary — the routing fallback it
patches no longer exists. Can be closed.
- #2327 (post-compaction destination reminder) is complementary — it
handles the post-compaction push, this handles pre-compaction
instructions. Both can merge independently.
- #2328 (default routing instruction) is complementary — it adds "reply
to the from= destination" guidance to the multi-destination section.
Compatible with the unified instruction format here.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
SQLite TIMESTAMP columns store UTC without a zone marker. `Date.parse`
treats timezoneless ISO strings as local time, so on any non-UTC host
every claim and processAfter looks (TZ offset) hours stale. That makes
fresh claims trip the kill-claim path on the first sweep tick — every
container gets killed within seconds of spawn.
Two affected sites in host-sweep.ts:
- decideStuckAction reads claim.status_changed and computes claimAge.
On a TZ=Europe/Madrid host (UTC+2), a claim made 5s ago looks
7205s old and exceeds CLAIM_STUCK_MS (60s).
- The orphan retry loop reads msg.processAfter and skips messages
rescheduled into the future. On the same host, future timestamps
look (TZ offset) hours in the past, so the skip is missed and
tries gets bumped on every tick.
Fix: introduce parseSqliteUtc(s) which appends "Z" only when no zone
marker is present, then call it from both sites. Behavior under
TZ=UTC is unchanged.
Verified on a production v2 install on TZ=Europe/Madrid: with the
patch applied, an idle container survived 30+ minutes without being
killed (previously: killed within 60s of spawn).
Tests: 5 new cases covering the bare/Z/+offset/invalid input matrix
and a TZ-independence check. All 19 host-sweep tests pass and tsc
clears against main.
Resource-first CLI: `nc groups list`, `nc wirings get <id>`, etc.
Seven resources defined (groups, messaging-groups, wirings, users,
roles, members, sessions) with full column documentation that serves
as the single source of truth for help output and arg validation.
- CRUD helper auto-registers list/get/create/update/delete from
declarative resource definitions with generic SQL
- Custom operations for composite-PK resources (roles grant/revoke,
members add/remove)
- Access model: open (reads) / approval (writes) / hidden
- `nc help` lists resources; `nc <resource> help` shows fields
- Positional target IDs: `nc groups get <id>`
- Removed unused priority column from wirings
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add delivery action handler (cli_request) so the host dispatches CLI
commands arriving from container agents via outbound.db and writes
responses back to inbound.db. Add nc MCP tool in the agent-runner
following the ask_user_question blocking pattern.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Resolve conflict in src/index.ts shutdown sequence — keep both
stopCliServer() from nc-cli and try/finally + resetCircuitBreaker()
from main.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The test wrapper forwards the in-memory outDb as the writable handle,
avoiding the filesystem reopen that fails in CI. The function stays
private — the optional writableOutDb param is an internal detail, not
a public API.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The send_card MCP tool wrote outbound rows with type='card' but the
chat-sdk-bridge deliver() had no branch for them, so the payload fell
through to the text fallback (where text is undefined) and silently
returned without calling the adapter. delivery.ts then marked the
message delivered with platformMsgId=undefined and the user saw nothing.
Add a dedicated card branch mirroring the ask_question structure:
- Build Card from title, description, and string-or-{text} children
- Render only URL actions as LinkButtons (send_card is fire-and-forget
per its docstring, so callback buttons would have nowhere to land)
- Drop empty cards with a warn log instead of posting blank
- Fall back text: content.fallbackText > description > title
Affects every Chat SDK adapter that goes through the bridge: Discord,
Telegram, Slack, Teams, GChat, GitHub, Linear, WhatsApp Cloud, iMessage,
Matrix, Webex, Resend.
Tests: adds five cases covering normal render, action filtering,
link-button rendering, empty-card skip, and a regression check that
non-card chat-sdk payloads still flow through the text branch.
Closes#2263
/./ requires at least one character and silently drops messages with no
text (e.g. Telegram photo/video/file sent without a caption). Switching
to /[\s\S]*/ matches the empty string too, so media-only messages now
reach the router and then the agent.
#2183 added orphan-claim cleanup that reopens `outbound.db` by session
path (`openOutboundDbRw(session.agent_group_id, session.id)`) so the
delete runs against a writable handle even when callers pass a readonly
one. That works for the production caller — there's a real on-disk
session DB at the expected path.
The test wrapper `_resetStuckProcessingRowsForTesting` (introduced in
the same series, #2151) is called with in-memory DBs that have no
on-disk path. The reopen creates a fresh empty file at
`<DATA_DIR>/v2-sessions/ag-test/sess-test/outbound.db`, runs the delete
against that, and leaves the in-memory `outDb` (which the test reads
afterward) untouched. The two `resetStuckProcessingRows — orphan claim
cleanup` tests assert `getProcessingClaims(outDb).toEqual([])` after
the call and fail on the row that's still there.
Fix: drop the `_…ForTesting` wrapper, export `resetStuckProcessingRows`
directly with an optional `writableOutDb` parameter. When omitted
(production), the function reopens `outbound.db` RW by session path —
existing behavior, existing safety guarantee. When provided (tests, or
any future caller that already holds a writable handle), the function
uses it directly and skips the reopen. The optional parameter has a
real meaning, not a "for tests" hack.
Public API surface change: `_resetStuckProcessingRowsForTesting` is
gone, `resetStuckProcessingRows` is now exported. No other callers
inside the repo besides the test.
PR #2151 added deleteOrphanProcessingClaims() to resetStuckProcessingRows(),
but outDb is always opened readonly (openOutboundDb uses immutable: true).
The write silently failed, leaving orphan processing_ack rows that block
future message delivery for the session.
Fix: add openOutboundDbRw() alongside the existing readonly opener and use
it in resetStuckProcessingRows() to open a short-lived writable handle just
for the delete. The readonly handle is still used for all reads above.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Mirrors the four defenses on the outbound side onto extractAttachmentFiles:
1. Reject unsafe messageId via isSafeAttachmentName before any inbox path
is built. WhatsApp passes msg.key.id through raw and that field is
client generated, so a peer can craft it; future end to end encrypted
adapters will have the same property.
2. lstatSync on the inbox dir refuses a pre placed symlink before
mkdirSync would silently follow it.
3. realpathSync + isPathInside contains the resolved dir under the
session inbox root.
4. writeFileSync uses the wx flag so a pre placed symlink at the file
path is refused atomically by the kernel; EEXIST surfaces as a
logged skip.
Threat: the session dir is mounted writable into the container at
/workspace, so a compromised agent can pre place inbox/<future msgId>/
as a symlink and wait for a chat message with a matching id to redirect
the host write. The four guards together close that window.
Consolidates with the existing isSafeAttachmentName helper from
attachment-safety.ts rather than introducing a duplicate basename
validator inside session-manager.
Co-Authored-By: Daisuke Tsuji <dim0627@gmail.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When the host kills a container (absolute-ceiling, claim-stuck, or crashed),
resetStuckProcessingRows reset messages_in but left orphan rows in
processing_ack. The next sweep tick spawned a fresh container and, on the
same tick, ran enforceRunningContainerSla against outbound.db that still
contained the previous container's claim with a hours-old status_changed
timestamp — instant kill-claim, before the agent-runner could open
outbound.db to run its own clearStaleProcessingAcks(). Loop until tries
hit MAX_TRIES.
Add deleteOrphanProcessingClaims() in session-db and call it at the end of
resetStuckProcessingRows. Safe to write outbound.db here because the host
only enters this path after killContainer (or when no container is running).
Tests in host-sweep.test.ts cover the helper plus the regression: orphan
claim from a 2h-old kill is now removed atomically with the messages_in
reset, so the next sweep tick sees an empty claims list and the freshly
respawned container survives long enough to start its agent-runner.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move the MIME/type-to-extension maps and derivation helpers out of
session-manager.ts into a dedicated attachment-naming module — keeps
session-manager focused on session lifecycle and gives the helpers
a natural home for unit tests alongside the existing attachment-safety
module.
Two small fixes alongside the extraction:
- extForMime now guards `typeof mime !== 'string'` before .split, so a
buggy bridge passing `mimeType: { ... }` (object) no longer crashes
the inbound write loop.
- deriveAttachmentName computes Date.now() once per call instead of
twice, and tightens the explicit-name check to a string-and-truthy
guard so non-string values fall through to derivation.
Adds attachment-naming.test.ts with 11 cases covering MIME normalization
(case + parameters), Telegram type fallback, the non-string defensive
guard, and the bare-timestamp fallback.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pre-commit hook ran prettier on the prior commit but left the reformats
unstaged. Folding them in here so the branch is clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a transport-agnostic CLI control plane shared between three eventual
callers (host shell, Claude in project, container agent) — though only the
host-side socket transport is wired in this commit. Container DB transport
and approval flow land alongside the first risky command.
- src/cli/frame.ts: wire format (RequestFrame, ResponseFrame, CallerContext)
- src/cli/registry.ts: command registry with RiskClass
- src/cli/dispatch.ts: transport-agnostic dispatcher
- src/cli/transport.ts: Transport interface
- src/cli/socket-client.ts: SocketTransport against data/nc.sock
- src/cli/socket-server.ts: host-side listener (chmod 0600, line-delimited JSON)
- src/cli/format.ts: human table / --json output modes
- src/cli/client.ts: `nc` argv -> frame -> transport -> stdout
- src/cli/commands/list-groups.ts: first command (riskClass: safe)
- bin/nc: bash launcher (resolves project root via symlink)
- src/index.ts: start/stop server + import command barrel
`data/nc.sock` is intentionally separate from `data/cli.sock` (which the
existing chat-style channel adapter still owns).
Verified end-to-end: `nc list-groups`, `nc list groups`, `--json`,
unknown-command error, host-down ENOENT message with start instructions.
typecheck clean; eslint reports only the same `no-catch-all` warnings the
rest of the codebase has.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When a channel bridge passes an attachment without an explicit `name`,
extractAttachmentFiles fell back to `attachment-<ts>` with no extension.
Agents could not tell whether the file was a JPEG, PDF, or audio clip,
and tools keyed on extension (image viewers, exiftool, etc.) misbehaved.
Two cases are now covered:
1. Channels that set `mimeType` but no `name` (Discord/Slack documents,
Telegram document uploads). A small MIME-to-extension table covers
the common content types — image/*, audio/*, video/*, pdf, zip,
txt, json. Unknown MIMEs fall back to the unsuffixed name.
2. Channels that set `att.type` but no `mimeType` (Telegram photos,
stickers, voice, animations). The chat-sdk bridge sets a coarse
media-class (`photo` / `sticker` / `voice` / `video` /
`animation`) which is reliable enough to derive a canonical
extension. Telegram GIFs are MP4 under the hood.
The existing isSafeAttachmentName security guard is preserved — the
derived name still passes through it before disk I/O. The new lookup
tables emit static values from internal maps and cannot construct a
path-traversal payload; attacker-controlled att.name continues to flow
through the same validator.
- wakeContainer now never throws — returns Promise<boolean>, catches
internally. Closes the regression risk for the 5 awaited callers in
agent-to-agent, interactive, and approvals/response-handler that the
previous version left unwrapped. Router uses the boolean to stop the
typing indicator on transient failure; host-sweep just awaits.
- Tighten AUTH_REQUIRED_RE: anchor to start-of-string with the specific
`·` (U+00B7) separator the CLI uses, so an agent that quotes the
banner mid-sentence in a normal reply doesn't trip the classifier.
- Log a one-line note from writeAuthRequiredMessage so substitutions
are visible when debugging "user got the credentials message but I
don't see why."
- Add unit tests for ClaudeProvider.isAuthRequired covering both banner
variants, trailing content, mid-sentence quoting, leading-prose
quoting, alternate separators, and unrelated text.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>