Phase 1 — Agent Autonomy Lab (Month 1)

Goal: Understand how agents actually perform, test their output safely, and build confidence in autonomous execution.

1A — Agent Output Sandbox (Preview & Test)

| Task | Detail |
| --- | --- |
| E2B or Docker sandbox | Isolated environment where agents run code before it touches real files |
| Output preview | Agent generates code/changes → preview diff → human approves or rejects |
| Auto-validation pipeline | Lint + test + security scan on every agent output before merge |
| Artifact staging | Agent output goes to a staging branch/directory, not directly to main |
| Dashboard integration | Show preview diffs in the dashboard UI; approve or reject with one click |

Flow:
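The sandbox-preview flow above can be sketched roughly as follows. This is a minimal illustration, not a fixed API: the function names, the Docker image, and the container flags are assumptions. Agent output runs in a throwaway container with no network and a read-only mount, and a unified diff is produced for the human to approve or reject.

```python
# Illustrative sketch of the 1A flow: sandbox run + diff preview.
# All names here are hypothetical; only the shape of the flow matters.
import difflib
import subprocess
import tempfile
from pathlib import Path


def run_in_sandbox(code: str, image: str = "python:3.12-slim",
                   timeout: int = 30) -> subprocess.CompletedProcess:
    """Execute agent-generated code in an isolated Docker container
    (no network, read-only mount) so it never touches real files."""
    with tempfile.TemporaryDirectory() as workdir:
        Path(workdir, "agent_output.py").write_text(code)
        return subprocess.run(
            ["docker", "run", "--rm", "--network=none",
             "-v", f"{workdir}:/work:ro",
             image, "python", "/work/agent_output.py"],
            capture_output=True, text=True, timeout=timeout,
        )


def preview_diff(current: str, proposed: str, filename: str = "file.py") -> str:
    """Unified diff shown in the dashboard before approve/reject."""
    return "".join(difflib.unified_diff(
        current.splitlines(keepends=True),
        proposed.splitlines(keepends=True),
        fromfile=f"a/{filename}", tofile=f"b/{filename}",
    ))


if __name__ == "__main__":
    print(preview_diff("x = 1\n", "x = 2\n"))
```

On approval, the proposed change would move to the staging branch; on rejection, it is discarded and the agent can retry.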

1B — Agile Team Experiment

| Task | Detail |
| --- | --- |
| Sprint simulation | Give agents a backlog of tasks and see what they deliver in a "sprint" |
| Team-lead as Scrum Master | Team-lead decomposes epics into stories and assigns them to agents |
| Velocity tracking | Measure tasks completed, quality score, and rework rate |
| Autonomy levels | L1: human approves everything; L2: auto-merge if tests pass; L3: full autonomy |
| Retrospective data | Which tasks agents handle well vs. where they fail |
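The three autonomy levels reduce to a small gating rule. A minimal sketch, assuming the level names from the table; the enum and return values are illustrative, not an existing framework API:

```python
# Hypothetical gate implementing the L1/L2/L3 autonomy levels.
from enum import Enum


class Autonomy(Enum):
    L1 = 1  # human approves everything
    L2 = 2  # auto-merge if validation passes
    L3 = 3  # full autonomy


def merge_decision(level: Autonomy, tests_passed: bool) -> str:
    """Decide what happens to an agent's output at each autonomy level."""
    if level is Autonomy.L3:
        return "auto-merge"                # full autonomy: no gate
    if level is Autonomy.L2 and tests_passed:
        return "auto-merge"                # green pipeline merges itself
    return "needs-human"                   # L1 always; L2 on failing tests
```

Escalation is one-directional: a failing validation pipeline at L2 always falls back to human review rather than blocking the sprint.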

1C — Agent Behavior Observability

| Task | Detail |
| --- | --- |
| LangFuse integration | Trace every LLM call: prompt, response, latency, tokens, cost |
| Agent decision log | Why did the team-lead route to agent X? Why did the agent choose approach Y? |
| Failure analysis | Categorize failures: wrong approach, hallucination, tool misuse, timeout |
| Quality scoring | Auto-score agent output: does it compile? pass tests? follow conventions? |
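The quality-scoring row maps directly to three checks. A hedged sketch, assuming Python agent output; the weights, the `max_line` convention rule, and the function names are assumptions to be tuned from retrospective data:

```python
# Illustrative auto-scorer: one check per question in the table
# (compiles? passes tests? follows conventions?). Weights are arbitrary.
import subprocess


def compiles(source: str) -> bool:
    """Does the agent output parse as valid Python?"""
    try:
        compile(source, "<agent_output>", "exec")
        return True
    except SyntaxError:
        return False


def passes_tests(test_cmd: list[str]) -> bool:
    """Run the project's test command (e.g. ["pytest", "-q"]) in staging."""
    return subprocess.run(test_cmd, capture_output=True).returncode == 0


def follows_conventions(source: str, max_line: int = 99) -> bool:
    """Stand-in for a real lint pass: here, just a line-length rule."""
    return all(len(line) <= max_line for line in source.splitlines())


def quality_score(source: str, tests_ok: bool) -> float:
    """Weighted 0-1 score per agent output."""
    checks = [(compiles(source), 0.5),
              (tests_ok, 0.3),
              (follows_conventions(source), 0.2)]
    return sum(weight for ok, weight in checks if ok)
```

Each score would be attached to the LangFuse trace for that task, so failure categories and low scores can be correlated with specific prompts and routing decisions.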

KPIs

  • Sandbox preview working end-to-end
  • First "sprint" completed with measurable velocity
  • Agent success rate measured per category
  • Clear data on which tasks agents handle autonomously vs. which need human help
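The last two KPIs are simple aggregations over the sprint's task records. A minimal sketch, assuming a task record with `category` and `status` fields (both names are hypothetical):

```python
# Illustrative KPI math: sprint velocity and per-category success rate.
from collections import defaultdict


def velocity(tasks: list[dict]) -> int:
    """Tasks completed in the sprint (the 1B velocity metric)."""
    return sum(1 for t in tasks if t["status"] == "done")


def success_rate_by_category(tasks: list[dict]) -> dict[str, float]:
    """Fraction of tasks completed per category, e.g. to learn which
    task types agents handle autonomously vs. need human help."""
    done: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for t in tasks:
        total[t["category"]] += 1
        if t["status"] == "done":
            done[t["category"]] += 1
    return {cat: done[cat] / total[cat] for cat in total}
```

Feeding these numbers back into the retrospective closes the loop: categories with low success rates stay at L1 autonomy, high-success categories graduate toward L2/L3.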