Q1+Q2 Sprint — Deep Dive

Six priorities (P1–P6) from the harnessed-LLM-agent reference matrix, shipped in a single afternoon by running five worktree agents in parallel and merging their branches into main. Only the P5b sub-item is parked.

TL;DR — what changed

Metric | Before sprint | After sprint
Reference matrix coverage | 13 ✅ + 5 ⚠ + 1 ❌ (~82%) | 18 ✅ + 1 ⚠ + 0 ❌ (~95%)
pytest pass count | 1865 | 2065 (+200)
vitest pass count | 100 | 112 (+12 from RAG UI)
New core abstractions | - | KnowledgeStore, Guardrail, Evaluator, PersonalizedMemory + typed cooperation messages
New REST endpoints | - | /api/knowledge/*, /api/evals/*, /api/user-memory/*
New optional extras | - | [rag], [langfuse], [phoenix]

Status snapshot graph

Green = done before the sprint, blue = shipped during the sprint.

Growth graph — what each priority unlocks

Sprint timeline

Five worktree agents, ~10 minutes each, then 15 minutes of convergence. Total: under one afternoon.

Per-priority cards

Each card lists design notes, code locations, and copy-pasteable try-it commands.

P1 — Semantic Knowledge / RAG   ✅ shipped   ·   Effort: M   ·   Impact: 🔥🔥🔥

Where it lives

  • src/agent_orchestrator/core/knowledge/ — EmbeddingProvider ABC + 3 impls (HashEmbedder, sentence-transformers, OpenAI), Chunker ABC, KnowledgeStore (ISP-split), Ingester, Retriever
  • src/agent_orchestrator/skills/retrieval_skill.py — knowledge_retrieve skill for agents
  • src/agent_orchestrator/dashboard/knowledge_routes.py — /api/knowledge/{ingest,search,namespaces,health}
  • frontend/src/components/chat/ChatInput.tsx — RAG checkbox + namespace input
  • frontend/src/hooks/useWebSocket.ts — handles type: "rag" frame, emits "RAG: namespace · N chunks" system bubble
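
To make the embed-then-retrieve flow concrete, here is a minimal sketch of a dependency-free hash embedder plus cosine-similarity retrieval. The class names echo the list above, but the method signatures, the bucket hashing scheme, and the retrieve() helper are assumptions for illustration, not the project's actual code.

```python
import hashlib
import math
from abc import ABC, abstractmethod


class EmbeddingProvider(ABC):
    """Hypothetical interface: turn a text into a fixed-size vector."""

    @abstractmethod
    def embed(self, text: str) -> list[float]: ...


class HashEmbedder(EmbeddingProvider):
    """Assumed scheme: hash each token into one of `dim` buckets."""

    def __init__(self, dim: int = 256) -> None:
        self.dim = dim

    def embed(self, text: str) -> list[float]:
        vec = [0.0] * self.dim
        for token in text.lower().split():
            digest = hashlib.sha256(token.encode("utf-8")).digest()
            vec[int.from_bytes(digest[:4], "big") % self.dim] += 1.0
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        # Unit-length vectors, so a dot product equals cosine similarity.
        return [v / norm for v in vec]


def retrieve(query: str, chunks: list[str], embedder: EmbeddingProvider, k: int = 3) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embedder.embed(query)
    scored = [(sum(a * b for a, b in zip(q, embedder.embed(c))), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]
```

A hash embedder only matches shared tokens, which is why it suits development and tests rather than production retrieval.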

Try it (60 seconds)

# Ingest
curl -sX POST http://localhost:5005/api/knowledge/ingest \
-H 'Content-Type: application/json' \
-d '{"source_id":"auth-doc","namespace":"shared",
"text":"# Auth\n\nUse JWT tokens. Sessions are stateless."}'

# Search
curl -sX POST http://localhost:5005/api/knowledge/search \
-H 'Content-Type: application/json' \
-d '{"query":"how do tokens work?","namespace":"shared","k":3}'

# Chat with auto-injection (or flip the RAG checkbox in the UI)
curl -sX POST http://localhost:5005/api/prompt \
-H 'Content-Type: application/json' \
-d '{"prompt":"How do auth tokens work?","model":"openai/gpt-4o",
"provider":"openrouter","rag_enabled":true,"rag_namespace":"shared"}'

Production swap-in

Embedder | Switch | Install
Hash (default, dev) | (built-in) | none
sentence-transformers | RAG_EMBEDDING_PROVIDER=local RAG_LOCAL_MODEL=all-MiniLM-L6-v2 | pip install -e ".[rag]"
OpenAI | RAG_EMBEDDING_PROVIDER=openai RAG_OPENAI_MODEL=text-embedding-3-small | pip install -e ".[openai]"
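
The switch logic in the table could be resolved along these lines. This is a sketch only: resolve_embedder() is a hypothetical helper, and using "hash" as the fallback provider name is an assumption, since the table only says the default is built in.

```python
import os
from typing import Mapping, Optional

# Extras come from the swap-in table; None means nothing extra to install.
_REQUIRED_EXTRA: dict[str, Optional[str]] = {"hash": None, "local": "rag", "openai": "openai"}
_MODEL_VAR = {"local": "RAG_LOCAL_MODEL", "openai": "RAG_OPENAI_MODEL"}


def resolve_embedder(env: Mapping[str, str] = os.environ) -> dict:
    """Describe which embedder the environment selects (hypothetical helper)."""
    provider = env.get("RAG_EMBEDDING_PROVIDER", "hash")  # "hash" name is assumed
    if provider not in _REQUIRED_EXTRA:
        raise ValueError(f"unknown RAG_EMBEDDING_PROVIDER: {provider!r}")
    model_var = _MODEL_VAR.get(provider)
    return {
        "provider": provider,
        "model": env.get(model_var) if model_var else None,
        "extra": _REQUIRED_EXTRA[provider],
    }
```

Passing the environment as a mapping keeps the function easy to test without mutating os.environ.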

P2 — Evaluator framework   ✅ shipped   ·   Effort: M   ·   Impact: 🔥🔥🔥

Where it lives

  • src/agent_orchestrator/core/evaluator.py — Evaluator ABC, LLMJudge, RubricEvaluator (regex/contains/JSON-schema/length), EvalSuite, JsonDataset, EvalReport
  • evals/datasets/smoke.json — 5 hand-picked smoke cases
  • evals/runners/cli.py — python -m evals.runners.cli --suite ... --dry-run
  • src/agent_orchestrator/dashboard/evals_routes.py — /api/evals/{run,runs,runs/{id},compare}
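
As an illustration of what a rubric-style check can look like, the sketch below scores one output against contains, regex, and length rules. RubricCheck and run_rubric are hypothetical names, and the real RubricEvaluator may structure its checks and report differently.

```python
import re
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class RubricCheck:
    kind: str                # "contains", "regex", or "max_length"
    value: Union[str, int]   # substring, pattern, or character limit


def run_rubric(output: str, checks: list[RubricCheck]) -> dict:
    """Apply every check to the output and summarise the result."""
    results = []
    for check in checks:
        if check.kind == "contains":
            ok = str(check.value) in output
        elif check.kind == "regex":
            ok = re.search(str(check.value), output) is not None
        elif check.kind == "max_length":
            ok = len(output) <= int(check.value)
        else:
            raise ValueError(f"unknown check kind: {check.kind}")
        results.append(ok)
    score = sum(results) / len(results) if results else 0.0
    # An empty rubric never counts as a pass.
    return {"passed": bool(results) and all(results), "score": score, "results": results}
```

Deterministic checks like these need no LLM call, which is what makes a dry run or a CI smoke suite cheap.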

Try it

# Local dry-run (no LLM call)
python -m evals.runners.cli --suite evals/datasets/smoke.json --dry-run

# REST (background)
curl -sX POST http://localhost:5005/api/evals/run \
-H 'Content-Type: application/json' \
-d '{"suite_path":"evals/datasets/smoke.json","agent":"team-lead",
"model":"openai/gpt-4o","provider":"openrouter"}'

P3 — Guardrails layer   ✅ shipped   ·   Effort: S   ·   Impact: 🔥🔥

Where it lives

  • src/agent_orchestrator/core/guardrails.py — Guardrail ABC + GuardrailManager + PIIScanner, SecretsScanner, PromptInjectionDetector, OutputSchemaGuard, CostGuard
  • src/agent_orchestrator/core/agent.py — Agent.execute() calls run_input pre-LLM and run_output post-LLM
  • orchestrator.yaml.example — YAML config block
  • Events: guardrail.checked / blocked / redacted
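
The pre-LLM and post-LLM hook pattern can be sketched as follows. Everything here is a simplified stand-in: Verdict, EmailRedactor, SecretBlocker, MiniGuardrailManager, guarded_call, and the regexes are assumptions for illustration, not the shipped scanners or the real GuardrailManager API.

```python
import re
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    text: str            # text to pass downstream, possibly redacted
    blocked: bool = False
    reason: str = ""


class Guardrail(ABC):
    @abstractmethod
    def check(self, text: str) -> Verdict: ...


class EmailRedactor(Guardrail):
    _EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

    def check(self, text: str) -> Verdict:
        return Verdict(text=self._EMAIL.sub("[REDACTED_EMAIL]", text))


class SecretBlocker(Guardrail):
    _KEY = re.compile(r"\bsk-[A-Za-z0-9]{16,}\b")  # toy pattern, not exhaustive

    def check(self, text: str) -> Verdict:
        if self._KEY.search(text):
            return Verdict(text=text, blocked=True, reason="secret detected")
        return Verdict(text=text)


class MiniGuardrailManager:
    def __init__(self) -> None:
        self._guards: list[Guardrail] = []

    def register(self, guard: Guardrail) -> None:
        self._guards.append(guard)

    def run(self, text: str) -> Verdict:
        # Guards run in registration order; the first block wins.
        for guard in self._guards:
            verdict = guard.check(text)
            if verdict.blocked:
                return verdict
            text = verdict.text
        return Verdict(text=text)


def guarded_call(mgr: MiniGuardrailManager, prompt: str, llm: Callable[[str], str]) -> str:
    """Check the prompt, call the model, then check the model's output."""
    checked_in = mgr.run(prompt)
    if checked_in.blocked:
        return f"[input blocked: {checked_in.reason}]"
    checked_out = mgr.run(llm(checked_in.text))
    if checked_out.blocked:
        return f"[output blocked: {checked_out.reason}]"
    return checked_out.text
```

Running the same chain on both sides of the model call means a secret echoed back by the LLM is caught as well as one typed by the user.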

Try it (Python)

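from agent_orchestrator.core.agent import Agent
from agent_orchestrator.core.guardrails import GuardrailManager, PIIScanner, SecretsScanner

# Redact PII in place; block any message that contains a secret.
mgr = GuardrailManager()
mgr.register(PIIScanner(action="redact"))
mgr.register(SecretsScanner(action="block"))

# Replace each ... with your existing config, provider, and skill registry.
agent = Agent(config=..., provider=..., skill_registry=..., guardrails=mgr)
# Every Agent.execute() call now runs the input and output checks automatically.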

P4 — Personalized Memory   ✅ shipped   ·   Effort: S   ·   Impact: 🔥🔥

Where it lives

  • src/agent_orchestrator/core/personalized_memory.py — PersonalizedMemory(BaseStore) facade with put/get/list/delete/wipe
  • src/agent_orchestrator/skills/profile_extractor_skill.py — extracts preferences from history
  • src/agent_orchestrator/dashboard/personalized_memory_routes.py — /api/user-memory/users/*
  • src/agent_orchestrator/core/agent.py — <user_profile> block in system prompt when user_id + personalized_memory are set
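
The system-prompt injection described in the last bullet might look roughly like this. The helper names, the plain-dict memory, and the "key: value" line layout inside the block are assumptions; only the <user_profile> tag and the user_id plus memory condition come from the notes above.

```python
from typing import Mapping, Optional
from xml.sax.saxutils import escape


def render_user_profile(profile: Mapping[str, object]) -> str:
    """Render stored preferences as a <user_profile> block (assumed layout)."""
    if not profile:
        return ""
    lines = [
        f"  {escape(str(key))}: {escape(str(value))}"
        for key, value in sorted(profile.items(), key=lambda kv: kv[0])
    ]
    return "<user_profile>\n" + "\n".join(lines) + "\n</user_profile>"


def build_system_prompt(
    base: str,
    user_id: Optional[str],
    memory: Optional[Mapping[str, Mapping[str, object]]],
) -> str:
    """Append the profile block only when both user_id and memory are set."""
    if not user_id or memory is None:
        return base
    block = render_user_profile(memory.get(user_id, {}))
    return f"{base}\n\n{block}" if block else base
```

Escaping the stored values keeps a preference string from closing the tag early and leaking into the rest of the prompt.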

Try it

# Save
curl -sX PUT http://localhost:5005/api/user-memory/users/u-123/style \
-H 'Content-Type: application/json' \
-d '{"value":{"prefers":"concise, code blocks > prose"}}'

# Read
curl -s http://localhost:5005/api/user-memory/users/u-123

# GDPR wipe
curl -sX DELETE http://localhost:5005/api/user-memory/users/u-123

P5a — Cooperation typed messages + spec   ✅ shipped   ·   Effort: S   ·   Impact: 🔥

Where it lives

  • src/agent_orchestrator/core/cooperation_messages.py — frozen dataclasses (DelegateMessage, ResultMessage, CapabilityQueryMessage, CapabilityResponseMessage, ConflictMessage) + parse_message() dispatcher
  • The legacy dict-based callers in core/cooperation.py keep working — typed classes are additive
  • Full sequence + state diagrams: docs/cooperation-protocol.md (top-level, GitHub-rendered)
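
A frozen-dataclass message plus a dispatcher can be sketched as below. Two of the five message classes are shown; their fields, the "type" discriminator key, and its values are assumptions for illustration, so check docs/cooperation-protocol.md for the real schema.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Union


@dataclass(frozen=True)
class DelegateMessage:
    sender: str       # assumed field names
    recipient: str
    task: str


@dataclass(frozen=True)
class ResultMessage:
    sender: str
    recipient: str
    output: str
    success: bool = True


Message = Union[DelegateMessage, ResultMessage]
_MESSAGE_TYPES = {"delegate": DelegateMessage, "result": ResultMessage}


def parse_message(raw: Mapping[str, Any]) -> Message:
    """Turn a legacy dict into its typed equivalent, keyed on raw['type']."""
    payload = dict(raw)  # copy so the caller's dict is left untouched
    kind = payload.pop("type", None)
    cls = _MESSAGE_TYPES.get(kind)
    if cls is None:
        raise ValueError(f"unknown message type: {kind!r}")
    return cls(**payload)
```

Because the dispatcher accepts plain dicts, dict-based callers can keep their current payloads and opt into typed messages one call site at a time.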

P5b status — parked. Google's A2A spec is still moving (April 2026). Re-evaluate in Q3.

P6 — Observability sinks (Langfuse + Phoenix)   ✅ shipped   ·   Effort: S   ·   Impact: 🔥

Where it lives

  • src/agent_orchestrator/core/observability/ — LangfuseSpanExporter + PhoenixSpanExporter, both opt-in
  • src/agent_orchestrator/core/tracing.py — setup_tracing() calls register_optional_exporters()
  • pyproject.toml — new [langfuse] and [phoenix] extras (rolled into [all])
  • Existing Tempo/OTel pipeline keeps working alongside
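
Opt-in registration driven by environment variables could work along these lines. enabled_sinks() is a hypothetical helper; the variable names match the "Turn on" commands in this card, but treating LANGFUSE_HOST as optional is an assumption, and the real register_optional_exporters() may use different conditions.

```python
import os
from typing import Mapping

# A sink counts as enabled only when all of its variables are set and non-empty.
_REQUIRED_ENV = {
    "langfuse": ("LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY"),
    "phoenix": ("PHOENIX_COLLECTOR_ENDPOINT",),
}


def enabled_sinks(env: Mapping[str, str] = os.environ) -> list[str]:
    """List the optional sinks that the environment switches on."""
    return [
        name
        for name, required in _REQUIRED_ENV.items()
        if all(env.get(var) for var in required)
    ]
```

With no variables set the list is empty, so the existing Tempo/OTel pipeline runs unchanged.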

Turn on (env-driven)

# Langfuse
pip install -e ".[langfuse]"
export LANGFUSE_PUBLIC_KEY=pk-… LANGFUSE_SECRET_KEY=sk-… LANGFUSE_HOST=https://cloud.langfuse.com

# Phoenix (local)
pip install -e ".[phoenix]"
docker run -d -p 6006:6006 arizephoenix/phoenix:latest
export PHOENIX_COLLECTOR_ENDPOINT=http://localhost:6006

Why parallel beat sequential

The priorities are mostly disjoint. Where they overlap (core/agent.py, dashboard/app.py, dashboard/events.py, CLAUDE.md, docs/abstractions.md), every edit is additive: each agent appends and none rewrites existing code. SOLID compliance pays off at convergence: the new abstractions plug into existing seams (Agent.__init__ kwargs, app.state.*, EventBus) without colliding.

In practice, convergence was a three-way merge with two short conflict resolutions, one in agent.py (combined kwargs) and one in app.py (combined router includes). It took less than 15 minutes of manual work.

What's next

  1. Hook P3 Guardrails into production agents — pick a default-on safe set (PII redact + Secrets block) for multi-tenant deployments.
  2. Wire P2 Evaluator into CI — add a smoke suite as a GitHub Action gate that fails PRs on regression > 5%.
  3. Swap RAG defaults — production should use LocalEmbeddingProvider (sentence-transformers) or OpenAIEmbeddingProvider, with a PgVector backend instead of InMemoryKnowledgeStore once the corpus grows.
  4. Re-evaluate P5b A2A in Q3 once the Google A2A spec stabilises.
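
For item 2, the gate itself is small. The sketch below assumes "regression > 5%" means an absolute drop of more than five percentage points in pass rate against a stored baseline; pass_rate() and regression_gate() are hypothetical helpers, not part of the shipped evaluator.

```python
def pass_rate(passed: int, total: int) -> float:
    """Fraction of eval cases that passed."""
    if total <= 0:
        raise ValueError("total must be positive")
    return passed / total


def regression_gate(baseline: float, current: float, max_drop: float = 0.05) -> bool:
    """Return True when the PR may merge (assumed absolute-drop rule)."""
    drop = baseline - current
    # Small tolerance so float rounding does not fail a drop of exactly max_drop.
    return drop <= max_drop + 1e-9
```

A CI step would compute both rates from the eval reports and exit non-zero when regression_gate() returns False.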