Adaptation Engine

The Adaptation Engine is Clawpy's autonomous self-improvement system. It captures runtime outcomes (failures, budget incidents, human approvals), synthesises them into improvement candidates, evaluates each candidate against held-out evidence, and — upon approval — promotes changes into the live system.

This is how Clawpy learns from its own mistakes without human intervention.


The Adaptation Pipeline

Runtime Events (15+ types)
       │
       ▼
┌─────────────────────────┐
│  Reflection Service     │  ← Captures high-signal events
│  (reflection_service.py)│     into structured learning records
└──────────┬──────────────┘
           │
           ▼
┌─────────────────────────┐
│  Learning Digest        │  ← Synthesises learning records
│  (learning_digest.py)   │     into candidate proposals
└──────────┬──────────────┘
           │
           ▼
┌─────────────────────────┐
│  Candidate Evaluation   │  ← Scores candidates against
│                         │     held-out evidence
└──────────┬──────────────┘
           │
           ▼
┌─────────────────────────┐
│  Promotion              │  ← Approved candidates become
│                         │     active system modifications
└─────────────────────────┘

Event Types

The Reflection Service captures 15+ distinct event types that signal learning opportunities:

Event Type               Source           Signal
failed_tdd               Validation Loop  A test-driven repair cycle failed
successful_heal          Validation Loop  A validation failure was self-healed
repeated_retry           Validation Loop  Same error type recurred across multiple runs
task_conflict            Task Board       Two agents attempted conflicting work
budget_incident          Budget Service   An agent exhausted its budget
human_approval           Dashboard        A human approved a pending action
agent_error              Tool Executor    A tool call failed unexpectedly
successful_bubble        Bubble Filter    A worker learning was promoted to a leader
guidance_applied         Wisdom Cascade   Leadership guidance was injected into a prompt
guidance_noisy           Wisdom Cascade   Injected guidance was counterproductive
stale_guidance           Wisdom Cascade   Leadership guidance was outdated
teaching_refresh         Wisdom Teacher   A teaching cycle completed successfully
cloud_sync_failure       Supabase Sync    Cloud memory sync failed
cloud_memory_recall_hit  Context Engine   Cloud memory was successfully recalled
tool_sequence            Tool Executor    A repeated multi-step tool pattern was detected
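
A structured learning record might look like the following sketch. The field names are illustrative, not the actual `reflection_service.py` schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any

@dataclass
class LearningRecord:
    """Illustrative shape of a captured reflection event (field names assumed)."""
    event_type: str   # e.g. "failed_tdd", "budget_incident"
    source: str       # subsystem that emitted the event
    score: float      # 0-100 signal strength, used later for scoring candidates
    payload: dict[str, Any] = field(default_factory=dict)
    captured_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

record = LearningRecord(
    event_type="budget_incident", source="Budget Service", score=72.0
)
```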

Candidate Types

Learning records are synthesised into five candidate types:

1. Prompt Fragments

Injected into the system prompt for specific run kinds. Used to evolve how agents reason about tasks.

Sources: Research summaries, blueprint drafts, auto-reply evaluations, memory fact extraction, guidance events.

2. Fix Templates

Reusable repair playbooks for recurring failure patterns. Each template captures the successful repair strategy so it can be applied automatically next time.

Sources: Failed TDD runs, successful heals.

3. Routing Hints

Adjustments to the semantic router's behaviour. For example, routing ambiguous tasks to cheaper models after detecting budget pressure.

Sources: Budget incidents, task conflicts, successful bubbles.

4. Validator Policy Tweaks

Changes to validation parameters — tighter retry caps, adjusted timeout values, stricter guardrails.

Sources: Repeated retries, stale guidance, cloud sync failures, noisy guidance.

5. Flow Offloads

Proposals to convert repeated tool-call sequences into deterministic flows (see Flow Sequence Detector).

Sources: Detected tool sequence repetitions.
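
The candidate types and their event-type sources above can be sketched as a lookup. The identifiers are illustrative, not Clawpy's actual ones; prompt-fragment sources span several run kinds rather than single event types, so they are omitted from the map:

```python
from enum import Enum

class CandidateType(Enum):
    # Illustrative identifiers for the five candidate types.
    PROMPT_FRAGMENT = "prompt_fragment"
    FIX_TEMPLATE = "fix_template"
    ROUTING_HINT = "routing_hint"
    VALIDATOR_POLICY_TWEAK = "validator_policy_tweak"
    FLOW_OFFLOAD = "flow_offload"

# Event types that feed each candidate type, per the "Sources" lists above.
EVENT_SOURCES = {
    CandidateType.FIX_TEMPLATE: {"failed_tdd", "successful_heal"},
    CandidateType.ROUTING_HINT: {
        "budget_incident", "task_conflict", "successful_bubble",
    },
    CandidateType.VALIDATOR_POLICY_TWEAK: {
        "repeated_retry", "stale_guidance", "cloud_sync_failure", "guidance_noisy",
    },
    CandidateType.FLOW_OFFLOAD: {"tool_sequence"},
}
```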


Candidate Lifecycle

detect → draft → score → corroborate → review → promote/reject

1. Detect

The Learning Digest scans recent learning records and identifies patterns:

def digest_reflection_opportunities(opportunities, *, limit=5, min_score=60.0):
    # Filter: only records scoring above the threshold
    # Deduplicate by (candidate_type, scope, source_label)
    # Return top N candidate drafts
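
The three commented steps might be implemented roughly as follows, assuming records are dicts with `score`, `candidate_type`, `scope`, and `source_label` keys (the real record shape may differ):

```python
def digest_reflection_opportunities(opportunities, *, limit=5, min_score=60.0):
    """Sketch of the digest step: filter, deduplicate, take the top N."""
    # Filter: only records scoring above the threshold.
    strong = [o for o in opportunities if o["score"] >= min_score]
    # Deduplicate by (candidate_type, scope, source_label),
    # keeping the highest-scoring record per key.
    seen, drafts = set(), []
    for o in sorted(strong, key=lambda o: o["score"], reverse=True):
        key = (o["candidate_type"], o["scope"], o["source_label"])
        if key not in seen:
            seen.add(key)
            drafts.append(o)
    # Return the top N candidate drafts.
    return drafts[:limit]
```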

2. Draft

Each candidate is assigned a dedupe key to prevent duplicate proposals:

dedupe_key = "{candidate_type}:{scope_type}:{scope_id}:{source_label}"
# e.g., "routing_hint:agent:cto:budget_incident"
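
As a runnable sketch of the template above:

```python
def dedupe_key(candidate_type, scope_type, scope_id, source_label):
    # Two candidates with the same key are treated as duplicate proposals.
    return f"{candidate_type}:{scope_type}:{scope_id}:{source_label}"

key = dedupe_key("routing_hint", "agent", "cto", "budget_incident")
# → "routing_hint:agent:cto:budget_incident"
```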

3. Score

Initial scoring uses the learning record's own score (0–100).

4. Corroborate

The candidate is evaluated against held-out evidence — learning records that weren't used to generate the candidate:

# Match: same source_label, same scope, different record IDs
avg_score = mean(matching_record_scores)
corroboration_bonus = min(15.0, num_matches * 5.0)
candidate_score = min(100, avg_score * 0.85 + corroboration_bonus)

A candidate with 3 corroborating records gets a +15 bonus. A candidate with zero corroboration is capped at 45 points and automatically fails.
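
Putting the formula and the zero-corroboration rule together (the exact zero-match handling is an assumption based on the cap described above):

```python
from statistics import mean

def corroborate(candidate_score, matching_record_scores):
    """Sketch of corroboration scoring; zero-match handling is assumed."""
    if not matching_record_scores:
        # Assumption: an uncorroborated candidate is capped at 45 points,
        # below the pass score of 60, so it fails automatically.
        return min(candidate_score, 45.0)
    avg_score = mean(matching_record_scores)
    corroboration_bonus = min(15.0, len(matching_record_scores) * 5.0)
    return min(100.0, avg_score * 0.85 + corroboration_bonus)

corroborate(70.0, [80.0, 90.0, 85.0])  # avg 85 → 85 * 0.85 + 15 ≈ 87.25
```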

5. Promote or Reject

Candidates scoring ≥ 60 (pass score) are promoted. The promotion effect depends on the candidate type:

Candidate Type          Promotion Action
Prompt Fragment         Inject guidance into the Adaptation Overlay Store
Fix Template            Create TDD repair playbook entry
Routing Hint            Modify semantic router bias settings
Validator Policy Tweak  Override validation parameters for the run kind
Flow Offload            Register a new deterministic flow definition
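
Promotion is naturally a dispatch on candidate type. The handlers below are stand-ins that return action labels, not real Clawpy APIs:

```python
# Hypothetical dispatch table; each lambda stands in for the real handler.
PROMOTION_ACTIONS = {
    "prompt_fragment": lambda c: f"overlay:{c['scope']}",           # overlay store entry
    "fix_template": lambda c: f"playbook:{c['scope']}",             # TDD repair playbook
    "routing_hint": lambda c: f"router_bias:{c['scope']}",          # router bias settings
    "validator_policy_tweak": lambda c: f"validator:{c['scope']}",  # validation overrides
    "flow_offload": lambda c: f"flow:{c['scope']}",                 # deterministic flow
}

def promote(candidate):
    return PROMOTION_ACTIONS[candidate["candidate_type"]](candidate)
```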

Adaptation Overlay Store

Promoted candidates are stored in the Adaptation Overlay Store (adaptation_overlay_store.py), which provides a persistent registry of active system modifications:

entries = overlay_store.get_prompt_fragments(
    agent_id="introspection_loop",
    run_kind="introspection_evaluation",
)
# Returns: [{"guidance": ["Prefer concrete patterns...", "Only suggest..."]}]

These entries are queried at runtime and injected into the appropriate prompts, creating a feedback loop where past failures inform future reasoning.
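
The injection step might look like the following sketch; the exact prompt format and the "Learned guidance" framing are assumptions:

```python
def apply_overlay(base_prompt, entries):
    """Append overlay guidance lines to a system prompt (format assumed)."""
    guidance = [line for entry in entries for line in entry.get("guidance", [])]
    if not guidance:
        return base_prompt
    bullets = "\n".join(f"- {g}" for g in guidance)
    return f"{base_prompt}\n\nLearned guidance:\n{bullets}"
```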


Autonomous Mode

In Autonomous Mode, the Adaptation Engine auto-approves candidates that score above the pass threshold without waiting for human review. This enables fully self-improving operation where the system evolves its own behaviour based on observed outcomes.

In manual mode, candidates are queued in the dashboard for operator review before promotion.
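
The two review modes reduce to a small gate, sketched here with the pass score of 60 from the scoring step (state names are illustrative):

```python
PASS_SCORE = 60.0  # promotion threshold from the scoring step

def review(candidate, *, autonomous):
    # Below the pass score, a candidate is rejected in either mode.
    if candidate["score"] < PASS_SCORE:
        return "rejected"
    # Autonomous Mode auto-approves; manual mode queues for the dashboard.
    return "promoted" if autonomous else "queued_for_review"
```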