nanoclaw/docs/v1-vs-v2/task-scheduler.md
gavrielc 47950671fa docs: add v1→v2 action-items analysis + SDK signal probe tool
- docs/v1-vs-v2/: full v1→v2 regression analysis (SUMMARY + 21 per-module
  docs + ACTION-ITEMS rollup with decisions + timezone recreation spec).
- container/agent-runner/scripts/sdk-signal-probe.ts: empirical harness
  used to characterise Claude Agent SDK event/hook/stderr timing for the
  stuck-detection design in item 9.
- src/channels/chat-sdk-bridge.ts: document the conversations Map staleness
  in a code comment; fix deferred to when dynamic group registration lands
  (ACTION-ITEMS item 17).

No runtime behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 01:00:04 +03:00

# task-scheduler: v1 vs v2
## Scope
**v1 task scheduler:**
- Files: `src/v1/task-scheduler.ts` (241 lines), `src/v1/task-scheduler.test.ts` (122 lines)
- Self-contained scheduler loop with DB persistence and container execution
- Stores tasks in central DB table `scheduled_tasks`
- Runs a polling loop at `SCHEDULER_POLL_INTERVAL` (configurable, typically 5-60s)
**v2 task distribution:**
- No central task-scheduler file; tasks spread across host sweep and session DBs
- Core files: `src/host-sweep.ts` (174 lines), `src/delivery.ts` (task handlers ~lines 654-713), `src/db/session-db.ts` (task mutation logic)
- Optional: `container/agent-runner/src/task-script.ts` (pre-task script execution)
- Task rows live in per-session `inbound.db` table `messages_in` (polymorphic message kind)
- Recurrence computed in `host-sweep.ts` (host-sweep.ts:159-173)
---
## Capability map
| v1 Behavior | v2 Location | Status | Notes |
|---|---|---|---|
| **One-shot tasks** (schedule_type='once') | `insertTask()` in `src/db/session-db.ts:103-122`; processAfter field set, recurrence=NULL | ✅ Supported | Task inserted into messages_in with process_after timestamp, processed once, no recurrence |
| **Recurring via cron** (schedule_type='cron') | `insertTask()` with recurrence field; `host-sweep.ts:159-173` parses cron | ✅ Supported | Cron expression stored in messages_in.recurrence, next occurrence computed on completion via CronExpressionParser |
| **Recurring via fixed interval** (schedule_type='interval') | Not directly supported; v2 uses cron for all recurring | ⚠️ Removed | v2 requires cron syntax for recurrence. No interval-based scheduling (e.g., "every 5 minutes") without converting to cron |
| **Timezone handling** | `host-sweep.ts:159-161` uses CronExpressionParser with no explicit TZ param; cron-parser respects system TZ | ⚠️ Degraded | v1's explicit TIMEZONE config (via timezone.ts helpers) is absent in v2. Cron evaluation uses system/Node.js default TZ, not agent/session-level configuration |
| **Persistence** | Per-session `inbound.db` `messages_in` table + `series_id` grouping | ✅ Supported | Tasks persisted as DB rows with status (pending/completed/paused). Series_id backfilled for recurring task groups |
| **Restart recovery** | `host-sweep.ts:85-96` syncs processing_ack on startup to detect stale containers; tasks marked paused if container crashes | ✅ Supported | Stale container detection via heartbeat file mtime (host-sweep.ts:122-131); stuck messages retried with exponential backoff |
| **Due-message wake** | `host-sweep.ts:91-96` queries countDueMessages, wakes container if due tasks exist | ✅ Supported | 60s sweep checks for pending tasks with process_after in the past and wakes container if found |
| **Missed-run catch-up** (interval-based) | `computeNextRun()` skips past missed intervals to prevent cumulative drift; tests verify no infinite loop | ⚠️ Degraded | v2 doesn't handle missed intervals — if a recurring cron task gets skipped, next occurrence is computed from completion time only. No "make up" for missed runs |
| **Cancellation** | `updateTask(id, {status: 'paused'})` prevents retry churn | ✅ Supported | `cancelTask()` in `src/db/session-db.ts:128-132` sets status='completed' and clears recurrence; matches by id OR series_id |
| **Pause/resume** | `updateTask(id, {status: 'paused'})` / resume | ✅ Supported | `pauseTask()` (lines 134-138) and `resumeTask()` (lines 140-144); both match id or series_id |
| **Retry-on-failure** | `updateTaskAfterRun()` on error; no explicit retry logic in scheduler loop | ⚠️ Degraded | v2 uses `retryWithBackoff()` only when container goes stale (host-sweep.ts:147). No automatic retry for task execution errors |
| **Concurrent-run prevention** | Task status 'active' gate (task-scheduler.ts:221); no concurrent-run logic | ⚠️ Degraded | v2 allows multiple pending tasks to wake the container in the same sweep; container processes serially but no host-level concurrency control |
| **Idempotency** | Task ID is primary key; `insertTask()` will fail if re-run with same ID | ✅ Supported | messages_in.id is PRIMARY KEY; insertTask() fails on duplicate (caller must handle or use ON CONFLICT) |
| **Max-age drop** | No explicit max-age field; tasks can remain pending indefinitely | ⚠️ Missing | No max-age or TTL in v2 messages_in schema. A stuck task can remain pending forever unless manually cancelled |
| **Task context mode** (group vs isolated session) | v1: context_mode field drives session reuse (task-scheduler.ts:122) | ⚠️ Removed | v2 doesn't track context_mode; all tasks are processed in the container's default session context; no isolation toggle |
| **Task result logging** | `logTaskRun()` writes to task_runs table; stores error + result summary | ⚠️ Degraded | v2 has no equivalent task_runs table. Task output is written as system messages back to the agent; no persistent audit trail |
| **Task script execution** | v1: prompt + optional script field, passed to container | ✅ Supported | v2: `applyPreTaskScripts()` in `container/agent-runner/src/task-script.ts:79-121` runs scripts pre-prompt, enriches prompt with scriptOutput |
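
The cancellation and pause/resume rows above all match a task by `id` or `series_id`. The following is a minimal in-memory sketch of that matching rule, not the actual `session-db.ts` code (which operates on the `messages_in` table). The `TaskRow` shape, the function names `cancelMatching`, `pauseMatching`, and `resumeMatching`, and the status guards are assumptions made for illustration.

```typescript
// Hypothetical, simplified view of a task row in messages_in.
type TaskStatus = "pending" | "completed" | "paused";

interface TaskRow {
  id: string;
  seriesId: string | null; // shared by all occurrences of one recurring task
  status: TaskStatus;
  recurrence: string | null; // cron expression, or null for one-shot tasks
}

// True when the row is the task itself or belongs to the given series.
function matches(row: TaskRow, idOrSeries: string): boolean {
  return row.id === idOrSeries || row.seriesId === idOrSeries;
}

// Cancel: mark completed and clear recurrence so no next occurrence is scheduled.
// Skipping already-completed rows is an assumption of this sketch.
function cancelMatching(rows: TaskRow[], idOrSeries: string): number {
  let changed = 0;
  for (const row of rows) {
    if (matches(row, idOrSeries) && row.status !== "completed") {
      row.status = "completed";
      row.recurrence = null;
      changed++;
    }
  }
  return changed;
}

// Pause: only pending rows can be paused (assumed guard).
function pauseMatching(rows: TaskRow[], idOrSeries: string): number {
  let changed = 0;
  for (const row of rows) {
    if (matches(row, idOrSeries) && row.status === "pending") {
      row.status = "paused";
      changed++;
    }
  }
  return changed;
}

// Resume: only paused rows go back to pending (assumed guard).
function resumeMatching(rows: TaskRow[], idOrSeries: string): number {
  let changed = 0;
  for (const row of rows) {
    if (matches(row, idOrSeries) && row.status === "paused") {
      row.status = "pending";
      changed++;
    }
  }
  return changed;
}
```

Passing a series identifier affects every occurrence in the series, while passing a row `id` affects only that row, which is the behavior the table attributes to the v2 helpers.
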
---
## Missing from v2
1. **Interval-based recurrence**: v1 `schedule_type='interval'` (e.g., "every 5000ms") is gone. v2 only supports cron expressions. Workaround: convert to an equivalent cron expression (e.g., `*/5 * * * *` for every 5 minutes); sub-minute intervals have no standard five-field cron equivalent.
2. **Timezone awareness**: v1 passed `TIMEZONE` config to the cron parser and had explicit `formatLocalTime()` helpers. v2 has no way to specify a session/agent timezone for cron evaluation; it uses the system/Node.js TZ.
3. **Task context modes**: v1's `context_mode: 'group' | 'isolated'` is removed. There is no way to force a task into a dedicated session instead of the agent group's shared session.
4. **Task result audit trail**: v1 logged every run to `task_runs(task_id, run_at, duration_ms, status, result, error)`. v2 has no persistent task execution history; output is a system message only.
5. **Max-age / task TTL**: v1 tasks could conceivably be aged out by cleanup logic, though no such logic is directly visible in the code. v2 has no TTL; a paused/completed task lingers in messages_in forever.
6. **Task-level concurrency control**: v1 prevented concurrent runs of the same task (single status check per loop iteration). v2 can queue multiple pending tasks in one sweep, though the container processes them serially.
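
To make the timezone gap in item 2 concrete, the sketch below shows how the same instant maps to different wall-clock times depending on the zone, which is why a cron expression such as `0 9 * * *` fires at different instants when the evaluation zone changes. It uses only the standard `Intl.DateTimeFormat` API. The helper name `localHourMinute` is hypothetical, and this is not the v1 `formatLocalTime()` implementation.

```typescript
// Hypothetical helper: wall-clock hour and minute of an instant in an IANA time zone.
// Throws a RangeError if the time zone name is not recognised by the runtime.
function localHourMinute(
  instant: Date,
  timeZone: string,
): { hour: number; minute: number } {
  const parts = new Intl.DateTimeFormat("en-US", {
    timeZone,
    hour: "numeric",
    minute: "numeric",
    hour12: false,
  }).formatToParts(instant);

  const read = (type: string): number => {
    const part = parts.find((p) => p.type === type);
    return part ? Number(part.value) : NaN;
  };

  // Some runtimes render midnight as "24" when hour12 is false; normalise to 0.
  return { hour: read("hour") % 24, minute: read("minute") };
}

const instant = new Date("2026-04-20T06:30:00Z");
const inUtc = localHourMinute(instant, "UTC"); // 06:30
const inTokyo = localHourMinute(instant, "Asia/Tokyo"); // 15:30
```

A session-level timezone setting would need to feed a zone name like this into cron evaluation rather than relying on the process default.
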
---
## Behavioral discrepancies
1. **Missed-interval catch-up** (v1 `computeNextRun()` lines 32-46 vs. v2 absence):
- **v1:** If a task is due at 10:00, 10:05, 10:10 but the scheduler is down during 10:00-10:15, it computes `next_run = 10:20` (skips missed intervals, stays on the grid).
- **v2:** If the same recurring cron task is skipped, the next occurrence is computed from the *completion* time (host-sweep.ts:160-161), not from the original grid. A task meant to run every 5 minutes on the :00/:05/:10 grid might drift if completions are delayed.
2. **Stale-container recovery** (v1 none vs. v2 heartbeat-based):
- **v1:** Tasks remain due if the container crashes; the scheduler will retry on the next poll.
- **v2:** If the heartbeat goes stale (container unresponsive for 10 min), stuck processing messages are retried with exponential backoff. Tasks stuck in 'processing' state are reset.
3. **Task script pre-processing** (v1 prompt + script → container vs. v2 script → output enrichment):
- **v1:** Passes script alongside prompt to container; container execution model unclear from scheduler.ts (likely runs in group-queue).
- **v2:** Host runs script *before* waking container; script output (`scriptOutput`) is merged into prompt JSON via `applyPreTaskScripts()` (task-script.ts:115-117). If script fails or returns `wakeAgent=false`, the task is skipped entirely.
4. **Retry semantics**:
- **v1:** On execution error (runTask throws), `updateTaskAfterRun()` is called with `error`. Next retry relies on scheduler polling the same task again (no backoff).
- **v2:** Execution errors are not retried; container processes the task once. If the container crashes mid-task, the message is retried with exponential backoff only up to `MAX_TRIES=5` (host-sweep.ts:145-150).
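
The difference described in item 1 can be sketched for a fixed-interval schedule as follows. This is an illustration of the two anchoring strategies, not the v1 `computeNextRun()` source or the v2 cron logic; the function names and the millisecond-based signatures are assumptions.

```typescript
// Grid-anchored: advance from the last *scheduled* time and skip any missed slots,
// so the result is always the first slot strictly after `nowMs`.
function nextRunOnGrid(
  lastScheduledMs: number,
  intervalMs: number,
  nowMs: number,
): number {
  if (intervalMs <= 0) throw new RangeError("intervalMs must be positive");
  let next = lastScheduledMs + intervalMs;
  if (next <= nowMs) {
    const missed = Math.floor((nowMs - next) / intervalMs) + 1;
    next += missed * intervalMs;
  }
  return next;
}

// Completion-anchored: advance from whenever the run actually finished.
// Any delay in completion shifts every later run by the same amount.
function nextRunFromCompletion(completedAtMs: number, intervalMs: number): number {
  return completedAtMs + intervalMs;
}

const MINUTE_MS = 60000;
const at = (hhmm: string): number => Date.parse(`2026-04-20T${hhmm}:00Z`);

// Scheduler was down from 10:00 to 10:15 for a task on a 5-minute grid.
const gridNext = nextRunOnGrid(at("10:00"), 5 * MINUTE_MS, at("10:15")); // 10:20
// The same task finishing late at 10:17 under completion anchoring.
const driftedNext = nextRunFromCompletion(at("10:17"), 5 * MINUTE_MS); // 10:22
```

The grid-anchored result reproduces the 10:20 example from the v1 bullet, while the completion-anchored result lands off the :00/:05 grid.
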
---
## Worth preserving?
**Interval-based recurrence** (v1 `schedule_type='interval'`) is a practical feature that v2 trades away. Cron syntax is powerful but less intuitive for simple "every X milliseconds" patterns. Users who want "run every 30 seconds" must learn cron and then discover that standard five-field cron has no seconds field (a six-field form such as `*/30 * * * * *` is a non-standard extension); the workaround is job-level looping in the prompt. Consider a thin adapter layer in agent-facing APIs to accept `{interval: 5000}` and convert to cron, or extend the v2 schema to support an optional `interval_ms` alongside `recurrence`.
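
A thin adapter of the kind suggested above might look like the sketch below. The function name `intervalToCron` and its return convention are hypothetical. It emits only standard five-field expressions and returns `null` when an interval cannot be expressed exactly, which includes every sub-minute interval such as the 5000 ms example.

```typescript
// Hypothetical adapter: turn a fixed interval into a five-field cron expression.
// Returns null when the interval has no exact five-field cron equivalent.
function intervalToCron(intervalMs: number): string | null {
  const MINUTE_MS = 60000;
  if (
    !Number.isInteger(intervalMs) ||
    intervalMs < MINUTE_MS ||
    intervalMs % MINUTE_MS !== 0
  ) {
    return null; // sub-minute or fractional-minute intervals
  }

  const minutes = intervalMs / MINUTE_MS;
  if (minutes < 60) {
    // Only divisors of 60 repeat evenly across hour boundaries.
    if (60 % minutes !== 0) return null;
    return minutes === 1 ? "* * * * *" : `*/${minutes} * * * *`;
  }

  if (minutes % 60 !== 0) return null;
  const hours = minutes / 60;
  if (hours < 24) {
    // Only divisors of 24 repeat evenly across day boundaries.
    if (24 % hours !== 0) return null;
    return hours === 1 ? "0 * * * *" : `0 */${hours} * * *`;
  }

  return hours === 24 ? "0 0 * * *" : null;
}
```

Note that a cron expression anchors runs to the clock (for example the top of the hour) rather than to the task's creation time, so the conversion changes phase even when the period matches. Intervals that return `null` here are the cases where an `interval_ms` column would be the only exact representation.
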
**Task context modes** (`group` vs. `isolated`) were a way to isolate task execution context. v2's removal simplifies the model but loses the ability to run a task in a fresh container state. If a task needs a clean slate (no session history), that's now impossible; workaround is a manual system-action to clear session state before running the task.
**Task result audit trail** is a gap for operational visibility. v2's system messages are ephemeral; there's no way to query "how many times did task X run and what were the outcomes?" Adding a lightweight `task_execution_log` table (optional, populated on task completion) would help without burdening the common case.
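
As a rough illustration of what such a log would need to answer, the sketch below models it in memory. The field names mirror the v1 `task_runs` columns listed earlier. The `TaskExecutionLog` class, the `"success" | "error"` status values, and the `summarize()` method are assumptions, and a real version would persist rows to a table.

```typescript
// Assumed status values; the source does not list the v1 status vocabulary.
type RunStatus = "success" | "error";

interface TaskRun {
  taskId: string;
  runAt: string; // ISO 8601 timestamp
  durationMs: number;
  status: RunStatus;
  result: string | null;
  error: string | null;
}

interface RunSummary {
  total: number;
  succeeded: number;
  failed: number;
  lastError: string | null;
}

// Hypothetical in-memory stand-in for a task_execution_log table.
class TaskExecutionLog {
  private readonly runs: TaskRun[] = [];

  // Append one run record; intended to be called on task completion.
  record(run: TaskRun): void {
    this.runs.push(run);
  }

  // Answer "how many times did task X run and what were the outcomes?"
  summarize(taskId: string): RunSummary {
    const own = this.runs.filter((r) => r.taskId === taskId);
    const failed = own.filter((r) => r.status === "error");
    const lastError =
      failed.length > 0 ? failed[failed.length - 1].error : null;
    return {
      total: own.length,
      succeeded: own.length - failed.length,
      failed: failed.length,
      lastError,
    };
  }
}
```

Recording only on completion keeps the common case cheap, in line with the suggestion that the table be optional.
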
---
## References by line
- v1 task-scheduler: `src/v1/task-scheduler.ts:20-49` (computeNextRun), `:203-235` (startSchedulerLoop)
- v1 test coverage: `src/v1/task-scheduler.test.ts:49-121` (drift, missed-interval, once-task tests)
- v1 timezone: `src/v1/timezone.ts:26-37` (formatLocalTime with explicit TZ)
- v1 types: `src/v1/types.ts:60-74` (ScheduledTask interface with context_mode)
- v2 sweep: `src/host-sweep.ts:154-173` (handleRecurrence, insertRecurrence)
- v2 delivery system actions: `src/delivery.ts:645-713` (handleSystemAction switch on schedule_task/cancel_task/pause_task/resume_task/update_task)
- v2 session-db: `src/db/session-db.ts:103-198` (insertTask, cancelTask, pauseTask, resumeTask, updateTask, all with series_id matching)
- v2 task-script: `container/agent-runner/src/task-script.ts:79-121` (applyPreTaskScripts, wakeAgent logic)
- v2 DB schema: `docs/db-session.md:31-56` (messages_in table with process_after, recurrence, series_id)