Covers the gap item 5 left open: request_approval presupposes a wired channel, so unknown-channel cases (new DM, @mention in unwired group, bot added to fresh group) short-circuit at no_agent_wired before the approval flow runs. Design: - Owner-sender auto-wire fast path (exactly one agent group → wire silently; multiple → card) - Card with one button per existing agent group + "Create new" + "Ignore" - New pending_channel_approvals table, UNIQUE(messaging_group_id) - nca- action-id prefix paralleling nsa- / ncq- - Handler lives alongside handleSenderApprovalResponse - "Create new" sub-flow is intentionally open scope Cross-reference added to item 5 so the scope boundary is explicit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
619 lines
41 KiB
Markdown
619 lines
41 KiB
Markdown
# v1 → v2 Action Items
|
||
|
||
Working doc for each finding from [SUMMARY.md](SUMMARY.md). Decisions were made one-by-one; this rollup summarizes the outcome.
|
||
|
||
**Status legend**: `pending` · `discussing` · `decided` · `deferred` · `dropped` · `done`
|
||
|
||
---
|
||
|
||
## Rollup
|
||
|
||
### To implement (~800 LOC total, roughly)
|
||
|
||
| # | Topic | LOC | Notes |
|
||
|---|---|---|---|
|
||
| 1 | Engage modes + sender scope + accumulate/drop + fan-out + tool blocklist | ~315 | DB migration drops `trigger_rules`/`response_scope`, adds `engage_mode`/`engage_pattern`/`sender_scope`/`ignored_message_policy` + `trigger` column on `messages_in`; router `pickAgents` fan-out; adapter-level gating via new hooks |
|
||
| 5 | `request_approval` flow for unknown senders (default policy flips from `strict` to `request_approval`) | ~175 | New `pending_sender_approvals` table; reuses existing `pickApprover` + card infra |
|
||
| 9 | Stuck detection (60s claim-age rule), heartbeat-based lifecycle, `max(30m, bash_timeout)` absolute ceiling, SDK tool blocklist (`AskUserQuestion`, `EnterPlanMode`, `ExitPlanMode`, `EnterWorktree`, `ExitWorktree`), remove `IDLE_TIMEOUT` setTimeout + `IDLE_END_MS` machinery | ~115 | Container state row for Bash timeout tracking |
|
||
| 15 | Delete three dead config constants from `src/config.ts` | 3 | `POLL_INTERVAL`, `SCHEDULER_POLL_INTERVAL`, `IPC_POLL_INTERVAL` |
|
||
| 18 | Timezone + formatting recreation — port v1 bit-for-bit (`formatLocalTime`, `<context timezone="..."/>` header, `reply_to` + `<quoted_message>` XML, `stripInternalTags`) + scheduling tool TZ normalization + cron TZ parsing | ~195 (75 prod + 120 tests) | Full spec in [timezone-formatting-v1-recreation.md](timezone-formatting-v1-recreation.md) |
|
||
|
||
### Deferred (wait for trigger)
|
||
|
||
| # | Topic | Trigger |
|
||
|---|---|---|
|
||
| 2 | `nonMainReadOnly` mount isolation | If multi-tenant / untrusted-group use ever surfaces. In the meantime, mount-declaration skill must explicitly prompt RO/RW when added |
|
||
| 3a | End-to-end recovery test | When next touching `host-sweep.ts` / `index.ts` startup |
|
||
| 14 | Remote control subsystem | When someone needs it. Opt-in skill, provider-specific (Claude SDK only) |
|
||
| 17 | Dynamic group-add (bridge conversations cache refresh) | When implementing dynamic group registration feature. Code comment added at `chat-sdk-bridge.ts:73` |
|
||
|
||
### Dropped (won't implement / not-a-regression)
|
||
|
||
| # | Topic | Why |
|
||
|---|---|---|
|
||
| 3 | Explicit pending-message recovery | Working as designed via sweep's immediate first tick + `cleanupOrphans` |
|
||
| 4 | `response_scope` enforcement | Folded into item 1 migration (column deleted, values backfilled) |
|
||
| 6 | Per-group container timeout | Not a regression — v1's hard-kill was worse than v2's keep-alive-after-idle |
|
||
| 7 | Container streaming output markers | Replaced by `send_message` MCP tool; latency ~1s is fine for chat UX |
|
||
| 8 | Per-exit container log files | Underlying info still recoverable (session DBs, heartbeat mtime, exit code) |
|
||
| 10 | Host-level retry on agent error | Folded into item 9's kill + sweep-reset loop |
|
||
| 11 | Process ID in logger output | Single host process; container stderr already tagged with `agentGroup.folder` |
|
||
| 12 | Task dedup via unique series_id index | Recurrence logic is structurally dedup-safe; not a real issue |
|
||
| 13 | Silent-drop sender mode | Admin can use `unknown_sender_policy='strict'` or remove from members instead |
|
||
| 16 | Configurable retention thresholds | Personal-assistant scale; source constants are fine |
|
||
|
||
### Extras recorded during discussion
|
||
- **1a**: Implementation-ordering plan for item 1
|
||
- **6a**: Remove `IDLE_END_MS` from `poll-loop.ts` (folded into item 9)
|
||
- **3a**: E2E recovery test (deferred)
|
||
|
||
### Follow-up PRs (scoped, not in this branch)
|
||
| # | Topic | Why later |
|
||
|---|---|---|
|
||
| 22 | Unknown-channel wiring approval flow (card to owner when bot receives inbound in an unwired messaging group) | Gap surfaced after item 5 landed — item 5's `request_approval` covers unknown senders but presupposes a wired channel. See item 22 for the full design. |
|
||
|
||
---
|
||
|
||
## HIGH
|
||
|
||
### 1. Trigger-rule matching in `pickAgent`
|
||
**Finding**: `src/router.ts:246` TODO. Confirmed trigger filtering is non-functional end-to-end: `trigger_rules` JSON is parsed into `ConversationConfig` and passed to adapters, but the Chat SDK bridge never reads it, and router's `pickAgent` picks by priority only. `response_scope` on `messaging_group_agents` is stored but never enforced. Chat SDK bridge hard-subscribes on every mention (bridge:173) and every DM (bridge:189).
|
||
|
||
**Status**: decided — design locked; implementation pending
|
||
|
||
**Decision**: replace `trigger_rules` JSON + `response_scope` with four explicit orthogonal columns on `messaging_group_agents`. Fan out inbound messages to all matching agents (N containers for N agents). Adapter-level gating in the bridge. `sender_scope` enforcement moves to the permissions module.
|
||
|
||
**Schema** (`messaging_group_agents`):
|
||
```
|
||
engage_mode TEXT NOT NULL DEFAULT 'mention'
|
||
-- 'pattern' | 'mention' | 'mention-sticky'
|
||
engage_pattern TEXT -- required when mode='pattern'; '.' = always
|
||
sender_scope TEXT NOT NULL DEFAULT 'all' -- 'all' | 'known'
|
||
ignored_message_policy TEXT NOT NULL DEFAULT 'drop' -- 'drop' | 'accumulate'
|
||
```
|
||
Drop `trigger_rules` + `response_scope`. **No per-wiring accumulate cap** — storage is unbounded.
|
||
|
||
**Global wake cap** (not a column): reuse `MAX_MESSAGES_PER_PROMPT` in `src/config.ts` (already defined, default 10, currently dead code from v1). Pass to container via `NANOCLAW_MAX_MESSAGES_PER_PROMPT`. Container applies `ORDER BY seq DESC LIMIT $N` when pulling pending messages on wake.
|
||
|
||
**Session DB** (`messages_in`):
|
||
```
|
||
trigger INTEGER NOT NULL DEFAULT 1 -- 0 = context-only, 1 = wake agent
|
||
```
|
||
Host's `countDueMessages` / wake logic gates on `trigger=1`. Container reads all messages for context regardless.
|
||
|
||
**Decisions locked**:
|
||
- `always` collapses into `pattern` with `engage_pattern='.'` (three modes total)
|
||
- `mention` and `mention-sticky` are separate modes (stickiness is user-visible)
|
||
- `pattern` is a JS regex string — applied as `new RegExp(pattern).test(text)`
|
||
- Accumulate cap = last N messages, default 10
|
||
- Fan-out: each matching agent gets its own session + container
|
||
- Per-channel defaults live in the setup/register flow, not in the schema:
|
||
- DM → `pattern` with `.`
|
||
- Threaded group → `mention-sticky`
|
||
- Non-threaded group → `mention`
|
||
|
||
**Routing flow** (future):
|
||
1. Inbound → resolve messaging_group → group-level `unknown_sender_policy` gate
|
||
2. `pickAgents()` returns all wired agents (not just priority 0)
|
||
3. For each agent:
|
||
a. `sender_scope` check (permissions module)
|
||
b. `engage_mode` check (regex / mention / mention-sticky)
|
||
c. Matched → write with `trigger=1`, wake container
|
||
d. Not matched + `accumulate` → write with `trigger=0`, don't wake (no cap — stored forever)
|
||
e. Not matched + `drop` → skip
|
||
|
||
On wake, container pulls pending messages with `ORDER BY seq DESC LIMIT MAX_MESSAGES_PER_PROMPT` so only the most recent N reach the prompt regardless of accumulation depth.
|
||
|
||
**Adapter bridge**:
|
||
- Read `conversations.get(channelId)` before `setupConfig.onInbound(...)`
|
||
- For `pattern` mode: test regex
|
||
- For `mention` / `mention-sticky`: require bot to be mentioned
|
||
- Only `thread.subscribe()` when mode is `mention-sticky` (today it subscribes unconditionally)
|
||
|
||
**LOC estimate**: ~315 (~255 prod + ~60 test)
|
||
- schema migration + backfill: 40
|
||
- session DB `trigger` column: 25
|
||
- types + adapter contract: 20
|
||
- DB helpers (CRUD): 20
|
||
- host→adapter plumbing (including `NANOCLAW_MAX_MESSAGES_PER_PROMPT` env): 10
|
||
- router fan-out + gating: 70
|
||
- sender-scope in permissions module: 15
|
||
- Chat SDK bridge gating + subscribe control: 40
|
||
- container-side `LIMIT N` on pending-message pull: 5
|
||
- smart defaults in setup/register flow: 15
|
||
- tests: 60
|
||
|
||
(Note: earlier plan's "accumulate prune-to-N in router" is dropped — host doesn't prune. Cap is container-side only.)
|
||
|
||
**Core vs module split**:
|
||
- Core (`src/`): schema, engage_mode enforcement, pickAgents fan-out, bridge gating, `trigger` column, accumulate/drop
|
||
- Permissions module: `sender_scope` enforcement (extends existing access gate). Default `sender_scope='all'` → no-op when permissions module absent
|
||
|
||
**Next step**: new action item for implementation — see item 1a.
|
||
|
||
---
|
||
|
||
### 1a. Implementation plan for engage/sender/ignored columns
|
||
**Status**: pending — ready to implement
|
||
**Order**: (a) migration + backfill, (b) types + DB helpers, (c) router fan-out + gating, (d) bridge gating, (e) permissions sender_scope, (f) setup-flow defaults, (g) tests
|
||
**Next step**: draft the migration + write up the PR plan when ready
|
||
|
||
### 2. `nonMainReadOnly` mount isolation
|
||
**Finding**: `mount-security.ts` moved to `src/modules/mount-security/index.ts` during the refactor. `validateMount(mount)` no longer takes an `isMain` param; `MountAllowlist` has no `nonMainReadOnly` field. Regression is real. But v1's "main vs non-main" concept doesn't map cleanly to v2 — `agent_groups` has no `is_main` flag.
|
||
|
||
**Status**: deferred
|
||
|
||
**Decision**: do not restore the v1 flag. Trust admin-declared `readonly` values in `container.json`. The allowlist's per-root `allowReadWrite` + path gating is sufficient for the current threat model (personal-assistant use, single admin). If multi-tenant / untrusted auxiliary groups become a real use case, prefer framing B (add `agent_groups.mount_access: 'rw' | 'ro'` column) over resurrecting `isMain`.
|
||
|
||
**Rationale**: v2 deliberately dropped the "main" concept. Reintroducing `isMain` to restore a defense-in-depth check that was designed for a different entity model is the wrong trade. Admin already has to opt-in twice (allowlist `allowReadWrite: true` + container.json `readonly: false`) to get RW — that's two deliberate keys. The v1 flag was a triple-check for a rare class of admin mistakes in a shared-infra setup.
|
||
|
||
**Follow-up (required)**: when building the skill / guide / setup flow that lets admins declare additional mounts (e.g. self-customize, manage-mounts, or a new `/add-mount` skill), the flow **must clearly surface the RO vs RW distinction** to the admin — explicit choice, explicit warning when RW is selected, and default to RO. This replaces v1's automatic enforcement with informed consent.
|
||
|
||
**Next step**: when the mount-declaration skill/flow is next touched, add explicit RO/RW prompting. Track as a sub-item if a skill exists yet.
|
||
|
||
### 3. Explicit pending-message recovery on startup
|
||
**Finding**: v1 had a named `recoverPendingMessages()` function at startup. v2 relies on the host sweep. Verified: the recovery path exists and is correct — just renamed/relocated.
|
||
|
||
**Status**: decided — working as designed, no code change
|
||
|
||
**Current mechanism** (verified against tree):
|
||
1. `cleanupOrphans()` at startup kills any leftover container from the previous run (`src/index.ts:69`)
|
||
2. `startHostSweep()` runs its first sweep **immediately** — no 60s delay (`src/host-sweep.ts:38`)
|
||
3. Sweep per session: `syncProcessingAcks` → `countDueMessages` → `wakeContainer` if work pending and no container → `detectStaleContainers` resets stuck `processing` rows with backoff
|
||
|
||
**Scenarios covered**:
|
||
- Host crashed while container idle with pending messages → orphan cleanup + first sweep re-wakes
|
||
- Host crashed mid-processing → stale detection resets `processing → pending`, next sweep wakes
|
||
- Container crashed with host alive → heartbeat mtime catches it inside 10 min `STALE_THRESHOLD_MS`
|
||
|
||
**Rationale**: the function got renamed (recovery → sweep) but the behavior is equivalent or better. Sweep is continuous; recovery used to be one-shot.
|
||
|
||
**Next step**: see item 3a.
|
||
|
||
---
|
||
|
||
### 22. Unknown-channel wiring approval flow
|
||
**Finding** (post-item-5 discussion): item 5's `request_approval` only fires when a messaging group already has agents wired. Three scenarios slip through to the earlier `no_agent_wired` structural-drop branch in `src/router.ts` and get silent-dropped with no signal to the owner:
|
||
|
||
1. A new user DMs the agent directly (the DM's messaging group auto-creates but has no wiring)
|
||
2. The agent is @mentioned in a group the admin hasn't registered
|
||
3. The agent is added to a new group and someone there addresses it
|
||
|
||
In all three, the user sees no response and the owner has no signal anything happened.
|
||
|
||
**Status**: decided — companion PR to item 5, scoped separately
|
||
|
||
**Decision**: when the router hits `no_agent_wired` for a non-public event, **instead of silent-dropping, pick the owner and DM them a wiring card**. Two flavors depending on who triggered it:
|
||
|
||
- **Sender IS an owner/admin** (the common "I just added the bot" case) → auto-wire IF exactly one agent group exists. Silent seamless flow. If multiple agent groups exist, fall through to the card so the owner picks.
|
||
- **Sender is anyone else** (stranger, or owner in a multi-agent install) → deliver a card:
|
||
- Title: `🔌 New channel — wire it?`
|
||
- Body: `<senderName> is trying to reach you in <channelName> on <platform>. Wire to which agent?`
|
||
- Options: one button per existing `agent_groups` row, plus `➕ Create new` and `Ignore`
|
||
|
||
**On approve (existing agent group)**:
|
||
1. `createMessagingGroupAgent(...)` with channel-kind defaults — DM→`pattern` + `'.'`, threaded group→`mention-sticky`, non-threaded group→`mention` (same defaults as `scripts/init-first-agent.ts`)
|
||
2. Replay the stored event via `routeInbound` (sender-approval pattern)
|
||
3. Delete pending row
|
||
|
||
**On approve "Create new"**: [OPEN SCOPE] — needs name/folder input. Options:
|
||
- Follow-up ask_question card asking for a name → auto-derive folder from slug → create group + wire
|
||
- Or: skill-backed flow — the button dispatches to `/init-agent` or similar and the card just links out
|
||
- Punt until implementation; mention in the PR brief that we'll decide when building
|
||
|
||
**On ignore**: delete pending row; future attempts re-prompt fresh (consistent with sender-approval deny; no denial persistence).
|
||
|
||
**Failure cases** (drop silently with log, don't leave a pending row):
|
||
- No owner configured (fresh install) — same behaviour as sender-approval
|
||
- No reachable DM for any owner/admin
|
||
- Delivery adapter missing
|
||
|
||
**New table**:
|
||
```
|
||
pending_channel_approvals (
|
||
id TEXT PRIMARY KEY,
|
||
messaging_group_id TEXT NOT NULL REFERENCES messaging_groups(id),
|
||
sender_identity TEXT, -- NULL when triggered by a non-identifiable event
|
||
sender_name TEXT,
|
||
original_message TEXT NOT NULL, -- JSON InboundEvent for replay
|
||
approver_user_id TEXT NOT NULL,
|
||
created_at TEXT NOT NULL,
|
||
UNIQUE(messaging_group_id) -- one pending wiring per channel
|
||
)
|
||
```
|
||
|
||
Dedup is narrower than sender-approval's `(mg_id, sender_id)` — one pending wiring per channel, period. A second stranger writing into the same unwired channel piggybacks on the existing card instead of spawning a new one. Latest event replaces the stored `original_message` (we only replay one anyway, and latest is most useful).
|
||
|
||
**Card action id prefix**: `nca-<approvalId>:<value>` where value is `agent-group-<id>` / `create` / `ignore`. Response handler lives in `src/modules/permissions/` alongside `handleSenderApprovalResponse`.
|
||
|
||
**Owner-sender auto-wire logic**:
|
||
```
|
||
if sender is owner/admin AND getAllAgentGroups().length === 1:
|
||
auto-wire to that group, replay event, done — no card
|
||
else:
|
||
deliver card
|
||
```
|
||
|
||
Don't auto-create a new agent group silently — always require a prompt for that.
|
||
|
||
**LOC estimate**: ~145
|
||
- Migration + CRUD: 45
|
||
- Router hook before `no_agent_wired` drop → try channel approval: 15
|
||
- Owner-sender auto-wire fast path: 20
|
||
- Card delivery (scope `pickApprover(null)`; build buttons from `getAllAgentGroups()`): 25
|
||
- Response handler: 25
|
||
- Tests: 15
|
||
|
||
**Open scopes (flag at PR time)**:
|
||
- "Create new" sub-flow — pick between follow-up card vs skill link
|
||
- Do we also react to bot-added-to-group platform events? Simpler to stay lazy (first-message-triggered only). Platform lifecycle events are inconsistent across Discord/Slack/Telegram anyway.
|
||
- Worth scanning the `channels` branch for any existing channel-lifecycle handlers that might conflict.
|
||
|
||
**Next step**: open a follow-up PR off this branch once #1869 lands.
|
||
|
||
---
|
||
|
||
### 3a. End-to-end recovery test
|
||
**Finding**: no test confirms the host-crash-restart scenario produces timely re-delivery.
|
||
|
||
**Status**: pending — nice-to-have
|
||
|
||
**Decision**: add an integration test: (1) write a pending message to inbound.db, (2) kill the host simulating crash, (3) start host, (4) assert container is woken and message processed within a bounded time (≤5s? ≤ one sweep interval).
|
||
|
||
**Rationale**: the sweep logic is correct as written, but a regression here would be silent (messages just sit). Worth a safety net.
|
||
|
||
**Next step**: draft test when touching `host-sweep.ts` or `index.ts` startup flow next.
|
||
|
||
---
|
||
|
||
## MEDIUM
|
||
|
||
### 4. `response_scope` enforcement
|
||
**Finding**: `messaging_group_agents.response_scope` stores `'all' | 'triggered' | 'allowlisted'` but nothing reads it.
|
||
|
||
**Status**: decided — folded into item 1
|
||
|
||
**Decision**: delete the `response_scope` column as part of the item-1 migration. Values backfill into the new explicit columns:
|
||
|
||
| Old `response_scope` | New columns |
|
||
|---|---|
|
||
| `all` | `engage_mode='pattern'`, `engage_pattern='.'`, `sender_scope='all'` |
|
||
| `triggered` | `engage_mode='mention'` (or `'pattern'` if legacy row has a pattern), `sender_scope='all'` |
|
||
| `allowlisted` | `engage_mode` derived from `trigger_rules`, `sender_scope='known'` |
|
||
|
||
**Rationale**: `response_scope` conflated two orthogonal axes (engage + sender). Splitting them is the whole point of item 1.
|
||
|
||
**Next step**: ensure the item-1 migration includes the `response_scope` backfill in its UP step.
|
||
|
||
### 5. `request_approval` flow for unknown senders
|
||
**Finding**: `unknown_sender_policy='request_approval'` is scaffolded in `src/modules/permissions/index.ts:100-108` but falls through to log-and-drop (explicit TODO comment). Current default is `'strict'`, which silently drops — user has no signal that their agent isn't responding.
|
||
|
||
**Status**: decided — implement, keep simple
|
||
|
||
**Decision**: implement full approval flow **and** flip the schema default from `'strict'` to `'request_approval'`. UX rationale: users wire their DM during setup; silent drops create a mystery when the agent doesn't respond. Public is unsafe. Approval default → admin sees a card and explicitly decides.
|
||
|
||
**Flow**:
|
||
1. Unknown sender writes to wired messaging group with policy `'request_approval'`
|
||
2. If pending approval for `(messaging_group, sender)` already exists → drop this message silently (in-flight dedup; not persistence)
|
||
3. Otherwise: insert into `pending_sender_approvals` with original message + timestamp
|
||
4. `pickApprover(agent_group_id)` + `pickApprovalDelivery(approverUserId)` — existing machinery in `src/access.ts`
|
||
5. Deliver a card via adapter's `deliver()` with `Card`/`Actions`/`Button` primitives (already in chat-sdk-bridge)
|
||
6. Card action id prefix `nsa:<approval_id>:<allow|deny>` (parallels existing `ncq:` prefix for `ask_user_question`)
|
||
7. On `allow`: upsert `users` row, insert into `agent_group_members`, deliver stored message through normal routing (original timestamp preserved), cleanup pending row
|
||
8. On `deny`: cleanup pending row, drop the message. No denial persistence — next attempt from same sender triggers a fresh card.
|
||
|
||
**No denial persistence** explicit rationale: personal-assistant scale, admin can switch policy to `'strict'` per messaging group if a hostile sender starts spamming. Avoids a new table column and a TTL config.
|
||
|
||
**New table**:
|
||
```
|
||
pending_sender_approvals (
|
||
id TEXT PRIMARY KEY,
|
||
messaging_group_id TEXT NOT NULL,
|
||
agent_group_id TEXT NOT NULL,
|
||
sender_identity TEXT NOT NULL, -- channel_type:handle
|
||
sender_name TEXT,
|
||
original_message TEXT NOT NULL, -- JSON of the InboundEvent
|
||
approver_user_id TEXT NOT NULL,
|
||
created_at TEXT NOT NULL,
|
||
UNIQUE(messaging_group_id, sender_identity) -- enforces in-flight dedup
|
||
)
|
||
```
|
||
Dedicated (not reusing `pending_approvals` which is OneCLI-specific).
|
||
|
||
**Reuse**:
|
||
- `pickApprover` / `pickApprovalDelivery` in `src/access.ts`
|
||
- Card rendering primitives already in `src/channels/chat-sdk-bridge.ts`
|
||
- `onAction` dispatch — add the `nsa:` prefix handler alongside existing `ncq:`
|
||
|
||
**LOC estimate**: ~175
|
||
- Migration + CRUD for `pending_sender_approvals`: 55
|
||
- `handleUnknownSender` request_approval branch + in-flight dedup: 25
|
||
- Host-side card dispatcher (pick approver + deliver card): 25
|
||
- `onAction` handler for `nsa:` prefix (allow/deny): 30
|
||
- Schema default flip + router auto-create update: 5
|
||
- Tests: 35
|
||
|
||
**Module location**: all in `src/modules/permissions/`. Module stays optional; default-allow fallback behavior when not loaded is preserved.
|
||
|
||
**Open gotchas noted**:
|
||
- The router's auto-create at `router.ts:123` currently hardcodes `'strict'` — change to omit the field so schema default applies
|
||
- `pickApprover` may return null if no admin/owner exists (e.g. fresh install before first user registered). In that case: log + drop silently, treat as effectively `'strict'` for safety. Don't block message forever.
|
||
|
||
**Scope boundary** (important): this item covers **unknown sender in a wired channel**. The parallel case — **unknown channel** (new DM / unwired group / bot-added-to-group) — short-circuits at the `no_agent_wired` structural drop before this flow ever runs. Tracked as item 22.
|
||
|
||
**Next step**: implement alongside item 1 or as a follow-up. Same migration window is fine (one migration for engage columns + request_approval default change + new table).
|
||
|
||
### 6. Per-group container timeout
|
||
**Finding**: v1's `containerConfig.timeout` override is gone. All groups share `IDLE_TIMEOUT`. Original framing (slow-but-healthy agents getting killed) was wrong — v1's `timeout` was a hard wall-clock kill on the whole spawn, totally different from v2's `IDLE_TIMEOUT` (keep-alive after last activity). V2's behavior is strictly better for long-running agents.
|
||
|
||
**Status**: dropped — not a regression
|
||
|
||
**Decision**: don't restore per-group timeout override. `IDLE_TIMEOUT=30min` global is the right model. If per-group idle tuning ever becomes useful it's ~15 LOC (new column, env injection at spawn) — small feature add, not a regression to repair.
|
||
|
||
**Rationale**: v2 lets long-running agents finish; v1 would have hard-killed them at 30min. Current behavior is an improvement.
|
||
|
||
**Next step**: see 6a.
|
||
|
||
---
|
||
|
||
### 6a. Remove IDLE_END_MS (container-side query idle termination)
|
||
**Finding**: `container/agent-runner/src/poll-loop.ts:11` defines `IDLE_END_MS = 20_000`. Inside `processQuery`, a concurrent interval ends the active Agent SDK `query()` stream after 20s of SDK silence. Any SDK event (tool use, tool result, streamed text, new pushed message) resets the timer.
|
||
|
||
This is a general "SDK silence detector," not specifically post-result. It fires any time:
|
||
- Mid-work: slow tool call with no intermediate SDK events (`npm install`, `pytest`, long `WebFetch`, etc.)
|
||
- Post-result: agent finished, stream waiting for potential follow-up
|
||
- Any other pause in the SDK stream
|
||
|
||
**Status**: decided — remove, pending SDK verification
|
||
|
||
**Decision**: delete `IDLE_END_MS` and its setInterval check. Let the `query()` stream stay open as long as the container is active. Container's 30-min `IDLE_TIMEOUT` (host-side in `container-runner.ts`) is the single source of truth for "when to let go."
|
||
|
||
**Rationale**:
|
||
- When new messages arrive mid-stream, they push in via `query.push()` with no reconnect — stream-open is cheaper per-message than close-and-reopen
|
||
- Closing early forces a reconnect + cold prompt cache for the next message
|
||
- Container stays alive anyway; ending the stream doesn't save resources at the container level
|
||
- `CLAUDE_CODE_AUTO_COMPACT_WINDOW=165000` already handles context window growth within a long-lived query
|
||
- Anthropic API's own stream timeout will fire if needed; SDK should handle it transparently
|
||
- Avoids the false-positive kill during legitimate slow tool calls (common case: agent running `npm install` gets cut off at 20s)
|
||
|
||
**Caveat (must verify before removal)**: confirm Claude Agent SDK doesn't require explicit `query.end()` for prompt-cache commit or session-state persistence. Expected to be fine (SDK checkpoints per turn) but double-check docs / run a quick test where container idles with stream open, then processes a follow-up.
|
||
|
||
**LOC estimate**: ~−15 (net deletion — remove constant, setInterval idle check, the `done` flag plumbing may also simplify)
|
||
|
||
**Next step**: when implementing item 1's changes (or standalone), verify SDK behavior with stream-open-indefinite, then delete IDLE_END_MS block. Watch for any test assertions on it.
|
||
|
||
### 7. Container streaming output (marker-based pre-delivery)
|
||
**Finding**: v1's `---NANOCLAW_OUTPUT_START/END---` markers enabled pre-completion delivery. v2's two paths (final-result `dispatchResultText` + mid-turn `send_message` MCP tool) both write to outbound.db; host polls every `ACTIVE_POLL_MS = 1000ms`.
|
||
|
||
**Status**: dropped — not a regression
|
||
|
||
**Decision**: v2's `send_message` MCP tool is the correct replacement for v1's marker-based streaming. Latency is ≤1s (poll interval), which is fine for chat UX.
|
||
|
||
**Rationale**: v1's marker model required the agent and host to share a fragile state machine over stdout. v2 uses explicit tool calls and a DB surface — cleaner architecture, comparable latency, and control stays with the agent. If perceived latency ever becomes a real complaint, tune `ACTIVE_POLL_MS` down (500ms / 250ms) — low-cost knob.
|
||
|
||
**Next step**: none.
|
||
|
||
### 8. Per-exit container log files
|
||
**Finding**: v1 wrote timestamped per-exit logs with full I/O + mounts + stderr. v2: stderr → `log.debug` (invisible at default `LOG_LEVEL=info`), container close → `log.info` with exit code, session DBs preserved on disk. Real gap: stderr on abnormal exit isn't auto-surfaced.
|
||
|
||
**Status**: dropped
|
||
|
||
**Decision**: skip — no per-exit file restoration, no stderr-on-crash buffer.
|
||
|
||
**Rationale**: underlying forensic info is still recoverable (session DBs on disk, heartbeat mtime, exit code in log). `LOG_LEVEL=debug` surfaces stderr when needed. The cost of adding buffered crash-log promotion (~15 LOC) isn't justified by the frequency of post-mortem cases.
|
||
|
||
**Next step**: none.
|
||
|
||
### 9. Stuck detection + heartbeat-based container lifecycle
|
||
**Finding**: v2's sweep detects stale heartbeats (10 min) and resets messages with backoff, but doesn't kill the container. Idle timeout is delivery-count-based (30 min since last messages_out). Together these produce a gap where a stuck container holds resources + blocks new wakes for up to 30 min.
|
||
|
||
**Empirical findings from SDK probe** (`container/agent-runner/scripts/sdk-signal-probe.ts`, runs logged in `/tmp/probe-*.jsonl`):
|
||
- Silent Bash tools (e.g. `sleep 30`) produce 30+ seconds of zero SDK events — heartbeat goes stale during legitimate work
|
||
- Natural intra-stream silences up to ~12s observed mid-tool-use JSON streaming
|
||
- `PreToolUse` / `PostToolUse` hook pair is reliable; `PostToolUseFailure` fires on blocked requests
|
||
- `SubagentStart`/`SubagentStop` and `system/task_started`/`system/task_notification` pairs also reliable
|
||
- **Pushing a new message mid-active-turn does NOT fire `UserPromptSubmit`** (fires only at start of a new turn, after `result`)
|
||
- SDK's built-in `AskUserQuestion` doesn't actually block; returns placeholder
|
||
- Bash tool's declared `timeout` param is visible in `tool_use` input — we can read it container-side
|
||
- Stuck tools (hook that never resolves) produce indefinite silence — no SDK-side timeout
|
||
|
||
**Status**: decided
|
||
|
||
**Decision**: replace existing IDLE_TIMEOUT setTimeout + STALE_THRESHOLD=10min combo with message-scoped stuck detection + absolute 30-min ceiling. Reset messages inline when we kill. Blocklist SDK tools that don't fit our async model.
|
||
|
||
**Sweep logic** (per active session):
|
||
|
||
If container isn't running → reset any `'processing'` rows in processing_ack to `'pending'` + tries++ + backoff. Done.
|
||
|
||
If container IS running, apply in order:
|
||
|
||
1. **Absolute ceiling**: if `heartbeat_mtime` older than `max(30 min, current_bash_timeout)` → kill + reset any processing to pending + retry++.
|
||
Rationale: 30 min idle ceiling, extended only if agent is currently inside a Bash tool with longer declared timeout. Agents needing >30 min should use `run_in_background`.
|
||
|
||
2. **Message-scoped stuck**: for each `processing_ack` row with status=`'processing'`:
|
||
- `claim_age = now - status_changed`
|
||
- `tolerance = max(60s, current_bash_timeout)` if Bash in flight, else `60s`
|
||
- If `claim_age > tolerance` AND `heartbeat_mtime <= status_changed` → kill + reset this message + retry++
|
||
|
||
Semantics: "container claimed a message and went silent for >tolerance since claim."
|
||
|
||
No separate idle rule — rule 1 covers it. An idle container hits 30-min stale with no processing rows; kill has nothing to reset.
|
||
|
||
**Container state surface** (for Bash timeout tracking):
|
||
New table in outbound.db (or session_state row):
|
||
```
|
||
container_state (
|
||
session_id TEXT PRIMARY KEY,
|
||
current_tool TEXT, -- null when no tool in flight
|
||
tool_declared_timeout_ms INTEGER,
|
||
tool_started_at TEXT
|
||
)
|
||
```
|
||
Container writes on `PreToolUse` (reads Bash `timeout` from tool input), clears on `PostToolUse` / `PostToolUseFailure`. Host reads in sweep decision.
|
||
|
||
**Tool blocklist** (initial):
|
||
- `AskUserQuestion` — SDK built-in; we have our own DB-backed MCP version
|
||
- `EnterPlanMode` / `ExitPlanMode` — Claude Code UI only
|
||
- `EnterWorktree` / `ExitWorktree` — Claude Code UI only
|
||
|
||
Enforcement:
|
||
- Pass `disallowedTools: [...]` to `query()` options — agent never sees them in its tool list
|
||
- `PreToolUse` hook guard (defense-in-depth): if a blocklisted tool name somehow fires, immediately reset the current message + kill (treat as stuck)
|
||
|
||
**Kill old machinery**:
|
||
- Remove `setTimeout` + `resetIdle` plumbing in `container-runner.ts:128-140`
|
||
- Remove `resetContainerIdleTimer` export + its caller in `delivery.ts:26`
|
||
- Remove `IDLE_END_MS = 20_000` in `poll-loop.ts:11` (item 6a decision) — stream stays open as long as container alive
|
||
- Existing `detectStaleContainers` logic merges into the new sweep rules; the heartbeat-stale-10-min path disappears
|
||
|
||
**LOC estimate**: ~115
|
||
- New sweep decision logic replacing existing detectStaleContainers + IDLE_TIMEOUT path: 50
|
||
- Container state table + PreToolUse/PostToolUse write, host read: 25
|
||
- Tool blocklist (disallowedTools + hook guard): 15
|
||
- Deletions (IDLE_TIMEOUT setTimeout, IDLE_END_MS): −25
|
||
- Tests (kill paths, Bash-timeout grace, blocklist hit): 50
|
||
|
||
**Why this converged here** (rationale summary):
|
||
- Empirical data showed we can't reliably tell stuck from legitimate-silent-work without state. Bash-declared-timeout is the cleanest per-tool signal available.
|
||
- 60s-since-claim is tight enough for normal work (WebSearch/WebFetch finish in ~8s) but generous enough for reasonable delays. Exception for Bash covers agents running scripts with user-declared timeouts.
|
||
- 30-min absolute ceiling prevents infinitely-stuck containers; agents needing longer have `run_in_background`.
|
||
- Pushing messages can't serve as a liveness probe (they're silent mid-turn), so detection is state-driven, not push-driven.
|
||
- Blocklist prevents a whole class of "SDK tool designed for interactive UI" footguns that would appear stuck in our async model.
|
||
|
||
**Next step**: implement as a focused PR. Order: (a) tool blocklist — safe to ship alone, (b) container state table + PreToolUse writes, (c) sweep rewrite + message reset, (d) delete old IDLE_TIMEOUT + IDLE_END_MS machinery, (e) tests.
|
||
|
||
### 10. Host-level retry with backoff on agent error
|
||
**Finding**: v1 had MAX_RETRIES=5 + exp. backoff on `processGroupMessages` failure. v2's equivalent is now covered by item 9's sweep logic — any time the container isn't running with `'processing'` rows present, they get reset to pending with backoff + retry++.
|
||
|
||
**Status**: folded into item 9
|
||
|
||
**Decision**: no separate action. Agent-error retry happens via container-exit → sweep reset. Container errors also surface via provider-side session invalidation check (`poll-loop.ts:200-211` — `provider.isSessionInvalid(err)` → clears stored session id → fresh retry). Both paths preserved.
|
||
|
||
**Next step**: none.
|
||
|
||
---
|
||
|
||
### 11. Process ID in logger output
|
||
**Finding**: v1 emitted `(${process.pid})` after the level tag. v2 dropped it.
|
||
|
||
**Status**: dropped
|
||
|
||
**Decision**: don't restore. Host is single-process (PID is constant). Container stderr already gets tagged with `{ container: agentGroup.folder }` at `container-runner.ts:121`, which is more informative than a PID.
|
||
|
||
**Next step**: none.
|
||
|
||
---
|
||
|
||
## LOW
|
||
|
||
### 11. Process ID in logger output
|
||
**Finding**: v1 emitted `(${process.pid})` after the level tag. v2 dropped it.
|
||
**Status**: pending
|
||
**Decision**:
|
||
**Rationale**:
|
||
**Next step**:
|
||
|
||
### 12. Task dedup via unique `(kind, series_id)` index
|
||
**Finding**: verified — `messages_in.series_id` column exists with a non-unique index. Concern was theoretical: two pending rows with same series could coexist.
|
||
|
||
**Status**: dropped
|
||
|
||
**Decision**: not a real issue. Recurrence logic at `src/modules/scheduling/recurrence.ts` is structurally dedup-safe: only `completed` rows with `recurrence` get cloned, and after cloning `recurrence` is cleared on the original so it can't re-clone. Plus container's atomic `markProcessing` prevents double-execution at claim time.
|
||
|
||
**Next step**: none.
|
||
|
||
### 13. Silent-drop mode for noisy senders
|
||
**Finding**: v1's `mode:'drop'` let you ignore specific users without logging. v2 only has binary allow/deny via access gate.
|
||
|
||
**Status**: dropped — won't implement
|
||
|
||
**Decision**: not worth the table + gate complexity for a personal-assistant scale. If a specific sender becomes a problem, admin can switch the messaging_group's `unknown_sender_policy` to `'strict'` or remove the sender from `agent_group_members`.
|
||
|
||
**Next step**: none.
|
||
|
||
### 14. Remote control subsystem
|
||
**Finding**: v1's `/remote-control` command spawned `claude remote-control` CLI detached, polled stdout for session URL, persisted PID/URL state. Entirely gone in v2.
|
||
|
||
**Status**: deferred — opt-in skill when needed
|
||
|
||
**Decision**: reintroduce as an opt-in install skill (e.g. `/add-remote-control`), not on trunk. Provider-specific: only works with `claude` provider (Claude Agent SDK); not supported by OpenCode or other providers. Skill should check `agent_group.provider` at install time and bail gracefully with an error message if not `'claude'`.
|
||
|
||
**Rationale**: niche feature valuable only for direct agent SDK attachment during dev/debugging. Keeping it off trunk matches v2's "infra-only trunk, features-via-skills" philosophy. Also avoids carrying code for a feature that simply doesn't exist in non-Claude providers.
|
||
|
||
**Next step**: none until someone needs it. When implementing, likely lives on the `providers` branch (since it's provider-specific) or its own branch, installed via skill that copies files + checks provider.
|
||
|
||
### 15. Dead config constants
|
||
**Finding**: verified — `POLL_INTERVAL` (line 13), `SCHEDULER_POLL_INTERVAL` (line 14), and `IPC_POLL_INTERVAL` (line 32) in `src/config.ts` have zero imports elsewhere in v2. Container's `POLL_INTERVAL_MS` in `poll-loop.ts` is a distinct local constant, unrelated.
|
||
|
||
**Status**: decided — delete
|
||
|
||
**Decision**: remove the three constants from `src/config.ts`. Trivial 3-line deletion.
|
||
|
||
**Next step**: do as part of any sweep-touching PR, or standalone.
|
||
|
||
### 16. Configurable retention thresholds
|
||
**Finding**: `STALE_THRESHOLD_MS` (10 min) and `MAX_TRIES` (5) in `host-sweep.ts` are hardcoded. Item 9's redesign replaces `STALE_THRESHOLD_MS` with new constants (60s claim-age, 30 min ceiling).
|
||
|
||
**Status**: dropped — keep as constants
|
||
|
||
**Decision**: leave the new item-9 thresholds + `MAX_TRIES` as source constants. Adding config surface for them isn't worth it at personal-assistant scale. If operational tuning ever becomes a real need, revisit — they're small centralized constants, one-line change each.
|
||
|
||
**Next step**: none.
|
||
|
||
### 17. Dynamic group-add (IPC watcher equivalent)
|
||
**Finding**: not actually a restart requirement — investigation showed:
|
||
- Router reads `messaging_groups` + `messaging_group_agents` fresh per inbound (dynamic by design)
|
||
- Chat SDK bridge has a `conversations: Map<platformId, ConversationConfig>` populated at setup + `updateConversations()` method
|
||
- **Nothing in the bridge currently reads the map**, and no code calls `updateConversations()` after startup
|
||
- Today: stale map has no observable effect (dead state)
|
||
- After item 1 ships (adapter-level gating): stale map would matter; new wirings wouldn't apply in the adapter gate until restart
|
||
|
||
**Status**: deferred — comment added now, implement alongside dynamic group registration feature
|
||
|
||
**Decision**: don't refactor the adapter interface now. Added a NOTE comment at `src/channels/chat-sdk-bridge.ts:73` flagging the staleness issue so the next person touching the bridge or adding dynamic-registration sees it. When dynamic group registration is implemented (admin adds a new messaging_group_agents row while host is running), handle cache refresh then — most likely by calling `adapter.updateConversations(freshConfigs)` after the mutation, keyed off the adapter's `channelType`.
|
||
|
||
**Rationale**: item 1's initial landing can keep the adapter gating responsibilities small or skip adapter-side gating entirely. Refactoring ConversationConfig now would add scope; better to ship item 1 first, see if over-subscription bites, address if it does.
|
||
|
||
**Next step**: when building the admin-skill path for adding messaging_group ↔ agent_group wirings, include a `refreshAdapterConversations(channelType)` call after the INSERT. ~10 LOC when needed.
|
||
|
||
---
|
||
|
||
## Test regressions (v1 `formatting.test.ts` assertions)
|
||
|
||
### 18+19+20+21. Timezone + formatting recreation (merged)
|
||
**Finding**: v1 had a full timezone-aware formatting pipeline. v2 lost most of it, producing real bugs where the agent misinterprets user intent (scheduling for wrong times, suggesting time-inappropriate things).
|
||
|
||
**Scope** — recreate v1 behavior faithfully wherever times touch the agent:
|
||
- Timestamp formatting on inbound messages: `formatLocalTime(utcIso, TIMEZONE)` producing "Jan 1, 2024, 1:30 PM" format via `Intl.DateTimeFormat('en-US', {...})` (v1 `timezone.ts`)
|
||
- `<context timezone="<IANA_NAME>" />` header prepended to message block (v1 `router.ts:20-22`)
|
||
- Reply-to with message ID: `<message ... reply_to="<id>">...<quoted_message from="...">...</quoted_message></message>` (v1 `router.ts:10-18`)
|
||
- `stripInternalTags()`: regex `/<internal>[\s\S]*?<\/internal>/g` applied to outbound text, then `.trim()` (v1 `router.ts:25-27`)
|
||
- Cron expressions parsed with explicit user TZ: `CronExpressionParser.parse(expr, { tz: TIMEZONE })` (v1 `task-scheduler.ts:20-49`)
|
||
- User-specified times normalized via the user's TZ: in v1 this was the host-side task scheduler; in v2 it's the new-in-v2 scheduling MCP tool (`mcp-tools/scheduling.ts`). Same principle — accept user-local times, normalize to UTC for storage, interpret cron in user's TZ.
|
||
|
||
**Status**: decided — recreate with tests
|
||
|
||
**Decision**: port v1's formatter + timezone behavior faithfully. Full recreation spec at [`timezone-formatting-v1-recreation.md`](timezone-formatting-v1-recreation.md) — includes exact v1 code, line numbers at commit `27c5220`, complete test inventory from `src/v1/formatting.test.ts` and `src/v1/task-scheduler.test.ts`.
|
||
|
||
**Core principle** (per Gavriel): the agent operates in the user's timezone. Every timestamp the agent sees is user-local. Every time the agent outputs is interpreted as user-local. This is load-bearing for correctness, not a nice-to-have.
|
||
|
||
**Porting plan** (from recreation spec):
|
||
1. `container/agent-runner/src/formatter.ts` — replace `formatTime` with `formatLocalTime(ts, TIMEZONE)` call; add reply_to attribute + `<quoted_message>` element exactly as v1
|
||
2. Prepend `<context timezone="<IANA>" />\n` to the messages block at formatter entry
|
||
3. Extract `stripInternalTags` as a named function; apply in outbound dispatch path (`poll-loop.ts:389` currently uses inline regex)
|
||
4. `container/agent-runner/src/mcp-tools/scheduling.ts` — clarify `processAfter` description, normalize to UTC ISO in handler
|
||
5. `src/modules/scheduling/recurrence.ts` — pass `{ tz: TIMEZONE }` to `CronExpressionParser.parse()` explicitly
|
||
6. Port all test cases from v1's `formatting.test.ts` and `task-scheduler.test.ts` to v2's test tree
|
||
|
||
**LOC estimate**: ~75 prod + ~120 tests (reproducing v1's 40+ test cases)
|
||
|
||
**Next step**: implement as a focused PR. Order: (a) formatter changes + tests, (b) context header + tests, (c) reply_to + tests, (d) stripInternalTags extraction + tests, (e) scheduling tool + cron TZ + tests.
|
||
|
||
### 19, 20, 21 — merged into 18 above
|
||
See item 18 for the full recreation plan and spec reference.
|
||
|
||
---
|
||
|
||
## Notes
|
||
- `src/v1/` was deleted upstream (commit 86becf8) after this analysis was written. v2 tree has since had a major module extraction (approvals, interactive, scheduling, permissions, agent-to-agent, self-mod) and a new CLI channel. **Verify each item against the current tree before deciding** — some may already be addressed.
|