
Roadmap

Philosophy

AWS infrastructure and monitoring come first — the product must be live and observable before iterating on features. Then validate agent autonomy with real sprints and sandboxed output preview. Only invest in scaling when revenue justifies it.

Latest sprint — Q1+Q2 (May 2026)

All six priorities from the harnessed-LLM-agent reference matrix shipped in a single afternoon, with the work parallelised across 5 worktree agents. Only the P5b sub-item (A2A adapter) is parked.

| Priority | Status | What changed |
|---|---|---|
| P1 — RAG | ✅ shipped | EmbeddingProvider/Chunker/KnowledgeStore ABCs, ingestion + retrieval, UI toggle, log-highlighting |
| P2 — Evaluator | ✅ shipped | LLMJudge + RubricEvaluator + EvalSuite, smoke dataset, CLI runner, REST API |
| P3 — Guardrails | ✅ shipped | PIIScanner / SecretsScanner / PromptInjectionDetector / OutputSchemaGuard / CostGuard |
| P4 — Personalized Memory | ✅ shipped | ("user", id) namespace, profile extractor, GDPR wipe |
| P5a — Cooperation messages | ✅ shipped | typed dataclasses + sequence/state spec |
| P6 — Observability sinks | ✅ shipped | Langfuse + Phoenix optional exporters |
| P5b — A2A adapter | parked | Google A2A spec still moving |
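As an illustration of the P1 row, here is a minimal sketch of how three ABCs named EmbeddingProvider, Chunker and KnowledgeStore could fit together for ingestion and retrieval. Only the three class names come from the table. Every method name, the toy letter-frequency embedding, and the in-memory store are assumptions made for this example and may not match the project's actual interfaces.

```python
import math
from abc import ABC, abstractmethod


class EmbeddingProvider(ABC):
    @abstractmethod
    def embed(self, text: str) -> list[float]: ...


class Chunker(ABC):
    @abstractmethod
    def chunk(self, document: str) -> list[str]: ...


class KnowledgeStore(ABC):
    @abstractmethod
    def add(self, chunk: str, vector: list[float]) -> None: ...

    @abstractmethod
    def search(self, vector: list[float], k: int) -> list[str]: ...


class LetterCountEmbedding(EmbeddingProvider):
    """Toy stand-in for a real model: a 26-dim letter-frequency vector."""

    def embed(self, text: str) -> list[float]:
        vec = [0.0] * 26
        for ch in text.lower():
            if "a" <= ch <= "z":
                vec[ord(ch) - ord("a")] += 1.0
        return vec


class SentenceChunker(Chunker):
    """Hypothetical chunker: split on full stops, drop empty pieces."""

    def chunk(self, document: str) -> list[str]:
        return [s.strip() for s in document.split(".") if s.strip()]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InMemoryStore(KnowledgeStore):
    """Hypothetical store: brute-force cosine ranking over a Python list."""

    def __init__(self) -> None:
        self._rows: list[tuple[str, list[float]]] = []

    def add(self, chunk: str, vector: list[float]) -> None:
        self._rows.append((chunk, vector))

    def search(self, vector: list[float], k: int) -> list[str]:
        ranked = sorted(self._rows, key=lambda row: cosine(vector, row[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:k]]


def ingest(document: str, chunker: Chunker, embedder: EmbeddingProvider, store: KnowledgeStore) -> None:
    for piece in chunker.chunk(document):
        store.add(piece, embedder.embed(piece))


def retrieve(query: str, embedder: EmbeddingProvider, store: KnowledgeStore, k: int = 1) -> list[str]:
    return store.search(embedder.embed(query), k)
```

Because ingestion and retrieval depend only on the abstract types, a real embedding model or vector database could be swapped in without touching `ingest` or `retrieve`.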

Q1+Q2 sprint deep-dive — collapsible per-priority cards with copy-pasteable try-it examples, growth graphs, and SOLID rationale.

Coverage of the harnessed-LLM-agent reference

Coverage rose from 13 ✅ + 5 ⚠ + 1 ❌ at the start of the sprint (~82%) to 18 ✅ + 1 ⚠ + 0 ❌ today (~95%). Tests: 2065 / 2065 pass (up from 1865 at the start of the sprint, +200).
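The page does not state how the percentages are derived from the ✅ / ⚠ / ❌ counts. The sketch below assumes a simple weighted ratio over the 19 reference items, with a hypothetical `partial_weight` parameter for ⚠ items, so the arithmetic can be checked under two possible scoring rules.

```python
def coverage(full: int, partial: int, missing: int, partial_weight: float) -> float:
    """Percentage of reference items covered, counting each partial item as partial_weight."""
    total = full + partial + missing
    return 100.0 * (full + partial_weight * partial) / total


# Assumed rule A: a ⚠ item counts as half an item.
start_half = coverage(13, 5, 1, partial_weight=0.5)
today_half = coverage(18, 1, 0, partial_weight=0.5)

# Assumed rule B: only ✅ items count.
start_strict = coverage(13, 5, 1, partial_weight=0.0)
today_strict = coverage(18, 1, 0, partial_weight=0.0)

print(round(start_half), round(today_half))      # 82 97
print(round(start_strict), round(today_strict))  # 68 95
```

Under these assumptions, half credit reproduces the ~82% starting figure, while counting only ✅ items reproduces the ~95% figure for today.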

Earlier state (still relevant for context)

Done before the Q1+Q2 sprint:

  • Core: Provider, Agent, Skill, Orchestrator, Cooperation, StateGraph engine
  • 5 providers: Anthropic, OpenAI, Google, Ollama, OpenRouter (free models + fallback chains)
  • 23 agents across 5 categories (software-engineering, data-science, finance, marketing, tooling)
  • SkillKit scout agent for marketplace skill discovery (15,000+ skills)
  • Dashboard: streaming, multi-turn chat, presets, file context, agent execution, cost tracking
  • Checkpointing: InMemory, SQLite, PostgreSQL
  • Docker/OrbStack: dashboard, postgres, test, lint, format
  • 1865+ tests at the start of the sprint
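The "fallback chains" mentioned in the provider list can be pictured with a minimal sketch: try each provider in order and return the first successful completion. The function name, the exception class, and the plain-callable provider shape are all assumptions for illustration, not the project's Provider API.

```python
from typing import Callable, Sequence


class AllProvidersFailed(RuntimeError):
    """Raised when every provider in the chain raised an error."""


def complete_with_fallback(
    prompt: str,
    providers: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Return (provider_name, completion) from the first provider that succeeds."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # any failure moves on to the next provider
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))
```

A chain could then be an ordered list such as a free model first and a paid model last, so the more expensive provider is called only when the cheaper ones fail.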

What's missing for production today:

  • Cloud deployment (AWS) — see Phase 0
  • Infrastructure monitoring (Prometheus + Grafana) — see Phase 0
  • Default-on Guardrails set (multi-tenant rollout)
  • CI eval gate (drop-in via python -m evals.runners.cli)
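One way the CI eval gate could be wired, as a minimal sketch: run the eval CLI as a subprocess and fail the build on a non-zero exit code. Only the command `python -m evals.runners.cli` comes from the list above. That the runner signals failure through its exit code is an assumption, and `run_eval_gate` is a hypothetical helper name.

```python
import subprocess
import sys
from typing import Optional, Sequence


def run_eval_gate(command: Optional[Sequence[str]] = None) -> bool:
    """Run the eval command and report whether it exited with status 0."""
    cmd = list(command) if command is not None else [sys.executable, "-m", "evals.runners.cli"]
    # check=False: the exit code is inspected here instead of raising on failure.
    result = subprocess.run(cmd, check=False)
    return result.returncode == 0


# In a CI step, the gate could be applied like this:
#     sys.exit(0 if run_eval_gate() else 1)
```

Passing the command as a parameter keeps the gate testable without the eval suite installed.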

Phases

| Phase | Focus | Timeline | Page |
|---|---|---|---|
| Phase 0 | AWS Infrastructure + Prometheus/Grafana | NOW | Details |
| Phase 1 | Agent Autonomy Lab (sandbox, sprints, observability) | Month 1 | Details |
| Phase 2 | Optimization & First Revenue | Month 2-4 | Details |
| Phase 3 | Platform Maturity | Month 4-6 | Details |
| Phase 4 | Hybrid GPU Scaling | Month 6+ | Details |

Pre-MVP Versions (in progress)

| Version | Focus | Page |
|---|---|---|
| v0.4.0 | Multi-Agent Cooperation | Details |
| v0.5.0 | Smart Routing & Cost Optimization | Details |
| v0.6.0 | Production Hardening | Details |
| v0.7.0 | Advanced Graph Patterns | Details |
| v0.8.0 | External Integrations | Details |
| v1.0.0 | General Availability | Details |
| v1.1 | LangGraph-Inspired Improvements (channels, HITL, caching, conformance) | Details |
| v1.2 | Dynamic Team Routing (team-lead selects agents per task) | Details |

Post-MVP: Scaling

| Version | Trigger | Focus | Page |
|---|---|---|---|
| Scaling | Revenue > 600 EUR/mo x 2 months | GPU infra, fine-tuning, enterprise | Details |
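The scaling trigger can be expressed as a small check. This sketch assumes "x 2 months" means the two most recent consecutive months and that the threshold is strict (greater than 600 EUR). The function and parameter names are hypothetical.

```python
from typing import Sequence


def scaling_triggered(
    monthly_revenue_eur: Sequence[float],
    threshold_eur: float = 600.0,
    months: int = 2,
) -> bool:
    """True when each of the last `months` monthly revenues exceeds the threshold."""
    if len(monthly_revenue_eur) < months:
        return False
    return all(revenue > threshold_eur for revenue in monthly_revenue_eur[-months:])


print(scaling_triggered([500.0, 650.0, 700.0]))  # True
print(scaling_triggered([650.0, 700.0, 600.0]))  # False
```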