Adaptation Engine

The Adaptation Engine is Clawpy's autonomous self-improvement system. It captures runtime outcomes (failures, budget incidents, human approvals), synthesises them into improvement candidates, evaluates each candidate against held-out evidence, and — upon approval — promotes changes into the live system.

This is how Clawpy learns from its own mistakes without human intervention.


The Adaptation Pipeline

Runtime Events (15+ types)
       │
       ▼
┌─────────────────────────┐
│  Reflection Service     │  ← Captures high-signal events
│  (reflection_service.py)│     into structured learning records
└──────────┬──────────────┘
           │
           ▼
┌─────────────────────────┐
│  Learning Digest        │  ← Synthesises learning records
│  (learning_digest.py)   │     into candidate proposals
└──────────┬──────────────┘
           │
           ▼
┌─────────────────────────┐
│  Candidate Evaluation   │  ← Scores candidates against
│                         │     held-out evidence
└──────────┬──────────────┘
           │
           ▼
┌─────────────────────────┐
│  Promotion              │  ← Approved candidates become
│                         │     active system modifications
└─────────────────────────┘

Event Types

The Reflection Service captures 15+ distinct event types that signal learning opportunities:

Event Type               Source           Signal
failed_tdd               Validation Loop  A test-driven repair cycle failed
successful_heal          Validation Loop  A validation failure was self-healed
repeated_retry           Validation Loop  Same error type recurred across multiple runs
task_conflict            Task Board       Two agents attempted conflicting work
budget_incident          Budget Service   An agent exhausted its budget
human_approval           Dashboard        A human approved a pending action
agent_error              Tool Executor    A tool call failed unexpectedly
successful_bubble        Bubble Filter    A worker learning was promoted to a leader
guidance_applied         Wisdom Cascade   Leadership guidance was injected into a prompt
guidance_noisy           Wisdom Cascade   Injected guidance was counterproductive
stale_guidance           Wisdom Cascade   Leadership guidance was outdated
teaching_refresh         Wisdom Teacher   A teaching cycle completed successfully
cloud_sync_failure       Supabase Sync    Cloud memory sync failed
cloud_memory_recall_hit  Context Engine   Cloud memory was successfully recalled
tool_sequence            Tool Executor    A repeated multi-step tool pattern was detected
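
A structured learning record might look like the following sketch. The field names are illustrative, not the actual `reflection_service.py` schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any

@dataclass
class LearningRecord:
    """Illustrative shape of a captured reflection event (field names assumed)."""
    event_type: str   # e.g. "failed_tdd", "budget_incident"
    source: str       # subsystem that emitted the event
    score: float      # 0-100 signal strength, used later for scoring candidates
    payload: dict[str, Any] = field(default_factory=dict)
    captured_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

record = LearningRecord(
    event_type="budget_incident", source="Budget Service", score=72.0
)
```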

Candidate Types

Learning records are synthesised into five candidate types:

1. Prompt Fragments

Injected into the system prompt for specific run kinds. Used to evolve how agents reason about tasks.

Sources: Research summaries, blueprint drafts, auto-reply evaluations, memory fact extraction, guidance events.

2. Fix Templates

Reusable repair playbooks for recurring failure patterns. Each template captures the successful repair strategy so it can be applied automatically next time.

Sources: Failed TDD runs, successful heals.

3. Routing Hints

Adjustments to the semantic router's behaviour. For example, routing ambiguous tasks to cheaper models after detecting budget pressure.

Sources: Budget incidents, task conflicts, successful bubbles.

4. Validator Policy Tweaks

Changes to validation parameters — tighter retry caps, adjusted timeout values, stricter guardrails.

Sources: Repeated retries, stale guidance, cloud sync failures, noisy guidance.

5. Flow Offloads

Proposals to convert repeated tool-call sequences into deterministic flows (see Flow Sequence Detector).

Sources: Detected tool sequence repetitions.
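
The candidate types and their event-type sources above can be sketched as a lookup. The identifiers are illustrative, not Clawpy's actual ones; prompt-fragment sources span several run kinds rather than single event types, so they are omitted from the map:

```python
from enum import Enum

class CandidateType(Enum):
    # Illustrative identifiers for the five candidate types.
    PROMPT_FRAGMENT = "prompt_fragment"
    FIX_TEMPLATE = "fix_template"
    ROUTING_HINT = "routing_hint"
    VALIDATOR_POLICY_TWEAK = "validator_policy_tweak"
    FLOW_OFFLOAD = "flow_offload"

# Event types that feed each candidate type, per the "Sources" lists above.
EVENT_SOURCES = {
    CandidateType.FIX_TEMPLATE: {"failed_tdd", "successful_heal"},
    CandidateType.ROUTING_HINT: {
        "budget_incident", "task_conflict", "successful_bubble",
    },
    CandidateType.VALIDATOR_POLICY_TWEAK: {
        "repeated_retry", "stale_guidance", "cloud_sync_failure", "guidance_noisy",
    },
    CandidateType.FLOW_OFFLOAD: {"tool_sequence"},
}
```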


Candidate Lifecycle

detect → draft → score → corroborate → review → promote/reject

1. Detect

The Learning Digest scans recent learning records and identifies patterns:

def digest_reflection_opportunities(opportunities, *, limit=5, min_score=60.0):
    # Filter: only records scoring above the threshold
    # Deduplicate by (candidate_type, scope, source_label)
    # Return top N candidate drafts
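
The three commented steps might be implemented roughly as follows, assuming records are dicts with `score`, `candidate_type`, `scope`, and `source_label` keys (the real record shape may differ):

```python
def digest_reflection_opportunities(opportunities, *, limit=5, min_score=60.0):
    """Sketch of the digest step: filter, deduplicate, take the top N."""
    # Filter: only records scoring above the threshold.
    strong = [o for o in opportunities if o["score"] >= min_score]
    # Deduplicate by (candidate_type, scope, source_label),
    # keeping the highest-scoring record per key.
    seen, drafts = set(), []
    for o in sorted(strong, key=lambda o: o["score"], reverse=True):
        key = (o["candidate_type"], o["scope"], o["source_label"])
        if key not in seen:
            seen.add(key)
            drafts.append(o)
    # Return the top N candidate drafts.
    return drafts[:limit]
```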

2. Draft

Each candidate is assigned a dedupe key to prevent duplicate proposals:

dedupe_key = "{candidate_type}:{scope_type}:{scope_id}:{source_label}"
# e.g., "routing_hint:agent:cto:budget_incident"
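
As a runnable sketch of the template above:

```python
def dedupe_key(candidate_type, scope_type, scope_id, source_label):
    # Two candidates with the same key are treated as duplicate proposals.
    return f"{candidate_type}:{scope_type}:{scope_id}:{source_label}"

key = dedupe_key("routing_hint", "agent", "cto", "budget_incident")
# → "routing_hint:agent:cto:budget_incident"
```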

3. Score

Initial scoring uses the learning record's own score (0–100).

4. Corroborate

The candidate is evaluated against held-out evidence — learning records that weren't used to generate the candidate:

# Match: same source_label, same scope, different record IDs
avg_score = mean(matching_record_scores)
corroboration_bonus = min(15.0, num_matches * 5.0)
candidate_score = min(100, avg_score * 0.85 + corroboration_bonus)

A candidate with 3 corroborating records gets a +15 bonus. A candidate with zero corroboration is capped at 45 points and automatically fails.
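
Putting the formula and the zero-corroboration rule together (the exact zero-match handling is an assumption based on the cap described above):

```python
from statistics import mean

def corroborate(candidate_score, matching_record_scores):
    """Sketch of corroboration scoring; zero-match handling is assumed."""
    if not matching_record_scores:
        # Assumption: an uncorroborated candidate is capped at 45 points,
        # below the pass score of 60, so it fails automatically.
        return min(candidate_score, 45.0)
    avg_score = mean(matching_record_scores)
    corroboration_bonus = min(15.0, len(matching_record_scores) * 5.0)
    return min(100.0, avg_score * 0.85 + corroboration_bonus)

corroborate(70.0, [80.0, 90.0, 85.0])  # avg 85 → 85 * 0.85 + 15 ≈ 87.25
```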

5. Promote or Reject

Candidates scoring ≥ 60 (pass score) are promoted. The promotion effect depends on the candidate type:

Candidate Type          Promotion Action
Prompt Fragment         Inject guidance into the Adaptation Overlay Store
Fix Template            Create TDD repair playbook entry
Routing Hint            Modify semantic router bias settings
Validator Policy Tweak  Override validation parameters for the run kind
Flow Offload            Register a new deterministic flow definition
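
Promotion is naturally a dispatch on candidate type. The handlers below are stand-ins that return action labels, not real Clawpy APIs:

```python
# Hypothetical dispatch table; each lambda stands in for the real handler.
PROMOTION_ACTIONS = {
    "prompt_fragment": lambda c: f"overlay:{c['scope']}",           # overlay store entry
    "fix_template": lambda c: f"playbook:{c['scope']}",             # TDD repair playbook
    "routing_hint": lambda c: f"router_bias:{c['scope']}",          # router bias settings
    "validator_policy_tweak": lambda c: f"validator:{c['scope']}",  # validation overrides
    "flow_offload": lambda c: f"flow:{c['scope']}",                 # deterministic flow
}

def promote(candidate):
    return PROMOTION_ACTIONS[candidate["candidate_type"]](candidate)
```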

Adaptation Overlay Store

Promoted candidates are stored in the Adaptation Overlay Store (adaptation_overlay_store.py), which provides a persistent registry of active system modifications:

entries = overlay_store.get_prompt_fragments(
    agent_id="introspection_loop",
    run_kind="introspection_evaluation",
)
# Returns: [{"guidance": ["Prefer concrete patterns...", "Only suggest..."]}]

These entries are queried at runtime and injected into the appropriate prompts, creating a feedback loop where past failures inform future reasoning.
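
The injection step might look like the following sketch; the exact prompt format and the "Learned guidance" framing are assumptions:

```python
def apply_overlay(base_prompt, entries):
    """Append overlay guidance lines to a system prompt (format assumed)."""
    guidance = [line for entry in entries for line in entry.get("guidance", [])]
    if not guidance:
        return base_prompt
    bullets = "\n".join(f"- {g}" for g in guidance)
    return f"{base_prompt}\n\nLearned guidance:\n{bullets}"
```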


Autonomous Mode

In Autonomous Mode, the Adaptation Engine auto-approves candidates that score above the pass threshold without waiting for human review. This enables fully self-improving operation where the system evolves its own behaviour based on observed outcomes.

In manual mode, candidates are queued in the dashboard for operator review before promotion.
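
The two review modes reduce to a small gate, sketched here with the pass score of 60 from the scoring step (state names are illustrative):

```python
PASS_SCORE = 60.0  # promotion threshold from the scoring step

def review(candidate, *, autonomous):
    # Below the pass score, a candidate is rejected in either mode.
    if candidate["score"] < PASS_SCORE:
        return "rejected"
    # Autonomous Mode auto-approves; manual mode queues for the dashboard.
    return "promoted" if autonomous else "queued_for_review"
```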