Agent Framework Landscape Lab
Overview
The Problem
Agent framework comparisons often stop at simple examples: create an agent, attach a tool, ask a question, show a response.
That is useful for learning syntax, but it does not answer the question engineering leaders actually need answered:
Can this framework support governed software delivery?
For AI-native SDLC work, the important differences go beyond developer ergonomics. They include orchestration control, human approval, structured outputs, MCP and tool integration, observability, deployment, evaluation, and how well the framework supports evidence-backed delivery decisions.
The Solution
The Agent Framework Landscape Lab (AFLL) is a hands-on comparison of agent and multi-agent frameworks using SDLC automation tasks rather than generic demos.
Each framework is assessed against the same reference workflow:
- Capture a messy software change request.
- Produce a structured delivery spec.
- Identify required context, policies, and constraints.
- Generate an implementation and test plan.
- Produce security and governance risk notes.
- Prepare approval-ready evidence.
The aim is not to crown a single winner. The aim is to understand which frameworks fit which parts of governed AI-native software delivery.
Devos-Informed Loop
The lab is designed as a feedback loop between public framework evaluation and private product learning:
- AFLL defines framework-neutral comparison tasks.
- Each framework implements the same governed SDLC contract.
- A devos-informed benchmark harness evaluates runs for quality, evidence, governance, cost, and failure mode.
- AFLL publishes anonymised scorecards, evidence packs, and framework notes.
- Gaps found in the comparison feed back into the control-plane, benchmark, evidence, and governance model.
The public story is Pathfinder. The operating model is informed by real work on governed agent execution, spec gates, evidence packs, benchmark scoring, and production readiness.
Lab Template
Every framework is assessed with the same template so the comparison is evidence-led rather than opinion-led.
1. Framework Snapshot
The snapshot captures the basic facts that affect adoption and comparison:
- name and version tested
- language and runtime
- licence and hosting model
- primary abstraction
- maturity signals
- ecosystem fit
- best-known use case
- current positioning
2. Mental Model
This section explains how the framework wants builders to think about agent systems:
- graph or workflow
- crew or team
- handoff chain
- typed agent
- tool-calling runtime
- autonomous coding worker
- orchestration layer
This matters because the mental model shapes how easy it is to represent real SDLC controls such as gates, evidence, retries, escalation, and approval.
3. Same Task Implementation
Each framework runs the same SDLC scenario:
- messy requirement
- structured spec
- architecture note
- QA and test plan
- implementation or simulated implementation
- review
- approval gate
- evidence pack
The goal is not to produce a perfect application in every framework. The goal is to expose how each framework handles the same governed delivery workflow.
4. Governed SDLC Fit
The first score asks whether the framework supports the operating model:
| Capability | What is assessed |
|---|---|
| Spec gate | Can work be blocked until the requirement is specific enough? |
| Role separation | Can specialist agents have distinct responsibilities, tools, and context? |
| Human approval | Can the workflow pause for explicit review or escalation? |
| Evidence capture | Can outputs, decisions, risks, and gate results be recorded? |
| Provenance | Can the system explain where outputs and decisions came from? |
| Retry and recovery | Can failed steps recover without losing state or hiding failure? |
| Tool permissioning | Can tools be scoped by role, task, or risk class? |
| Eval integration | Can quality checks and benchmark results become part of the workflow? |
| Long-running workflow support | Can the framework support multi-step SDLC work beyond one prompt? |
| Audit-friendly run record | Can a reviewer understand what happened after the run completes? |
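As one illustration of the first capability in the table, a spec gate can be sketched as a check that refuses to pass work until the required parts of a spec are present. This is a minimal sketch, not the lab's implementation: the field names in `REQUIRED_FIELDS`, the `GateResult` type, and the `spec_gate` function are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical required fields; the lab's actual spec schema is not defined here.
REQUIRED_FIELDS = ("summary", "acceptance_criteria", "constraints", "risk_class")


@dataclass
class GateResult:
    passed: bool
    missing: list[str] = field(default_factory=list)


def spec_gate(spec: dict) -> GateResult:
    """Block work until every required field is present and non-empty."""
    missing = [name for name in REQUIRED_FIELDS if not spec.get(name)]
    return GateResult(passed=not missing, missing=missing)


# A vague request fails the gate and reports what is still needed.
vague_request = {"summary": "Make round-ups work"}
result = spec_gate(vague_request)
print(result.passed, result.missing)
```

Returning the list of missing fields, rather than a bare boolean, keeps the gate result usable as evidence for a reviewer.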
5. Software Engineering Performance
The second score is SWE-bench-inspired, but it is not presented as SWE-bench unless the lab is actually running SWE-bench-compatible tasks.
For realistic software engineering tasks, the lab uses an EvoScore-style breakdown inspired by the devos benchmark work:
| Dimension | What is measured |
|---|---|
| Story completion | How much of the requested work was completed, including partial completion. |
| Test coverage | Whether meaningful tests were added and whether relevant test coverage improved. |
| Code quality | Whether lint, typecheck, and targeted test checks pass. |
| Governance | Whether commits, evidence, and required controls satisfy the benchmark rules. |
| Efficiency | Turns, runtime, intervention count, and cost per completed unit of work. |
| Reference parity | How close the result is to a reference implementation or human baseline. |
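The source does not state how these dimensions are combined into one figure. The sketch below assumes a simple weighted mean over per-dimension scores in the range 0 to 1. The weights and the name `composite_score` are illustrative assumptions, not published EvoScore values.

```python
# Illustrative weights only; the source does not publish a weighting.
WEIGHTS = {
    "story_completion": 0.30,
    "test_coverage": 0.15,
    "code_quality": 0.15,
    "governance": 0.15,
    "efficiency": 0.10,
    "reference_parity": 0.15,
}


def composite_score(dimensions: dict[str, float]) -> float:
    """Weighted mean of per-dimension scores, each expected in [0, 1]."""
    missing = sorted(set(WEIGHTS) - set(dimensions))
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    for name in WEIGHTS:
        if not 0.0 <= dimensions[name] <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    return sum(WEIGHTS[name] * dimensions[name] for name in WEIGHTS)


# Example: strong story completion, weaker governance.
example = {
    "story_completion": 0.9,
    "test_coverage": 0.6,
    "code_quality": 0.8,
    "governance": 0.4,
    "efficiency": 0.7,
    "reference_parity": 0.5,
}
print(round(composite_score(example), 3))
```

Rejecting a run with a missing dimension, instead of defaulting it to zero, avoids silently mixing incomplete runs into a comparison.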
The lab also records lower-level task signals:
| Dimension | What is measured |
|---|---|
| Task completion | Did the framework implementation complete the requested work? |
| Spec conformance | Did the output satisfy the structured spec? |
| Test pass rate | Did relevant tests pass? |
| Regression rate | Did it break existing behaviour? |
| First-pass success | Did it work before human correction? |
| Rework required | How much intervention was needed? |
| Time and latency | How long did the run take? |
| Cost | What token, API, or runtime cost was measurable? |
| Change quality | Was the result small, reviewable, and maintainable? |
| Evidence quality | Did it leave enough proof for a reviewer to trust or reject the work? |
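Several of these signals reduce to simple rates across a set of runs. The following is a rough sketch of that aggregation, assuming a hypothetical `TaskSignal` record per run. The field names and the definition of first-pass success as "completed with zero human interventions" are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSignal:
    completed: bool
    tests_passed: int
    tests_total: int
    regressions: int          # existing behaviours broken by the change
    human_interventions: int  # corrections needed before the work was usable


def summarise(signals: list[TaskSignal]) -> dict[str, float]:
    """Aggregate per-run signals into rates between 0 and 1."""
    if not signals:
        raise ValueError("no runs to summarise")
    runs = len(signals)
    total_tests = sum(s.tests_total for s in signals)
    passed_tests = sum(s.tests_passed for s in signals)
    return {
        "task_completion": sum(s.completed for s in signals) / runs,
        "test_pass_rate": passed_tests / total_tests if total_tests else 0.0,
        "regression_rate": sum(s.regressions > 0 for s in signals) / runs,
        "first_pass_success": sum(
            s.completed and s.human_interventions == 0 for s in signals
        ) / runs,
    }
```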
6. Operational Notes
The lab also records what the framework is like to operate:
- setup complexity
- local developer experience
- CI suitability
- observability and debugging
- dependency weight
- deployment model
- state persistence
- failure recovery
- security posture
- maintainability
7. Framework Scorecard
Each framework receives a concise scorecard:
| Category | Score |
|---|---|
| Governed SDLC fit | /10 |
| Software engineering performance | /10 |
| Evidence and observability | /10 |
| Human control | /10 |
| Developer experience | /10 |
| Operational readiness | /10 |
The scores are useful only when paired with the evidence pack and failure notes. A high score without inspectable evidence is not treated as a strong result.
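The rule that a high score needs inspectable evidence can be made mechanical. The sketch below is one possible encoding: `ScorecardEntry`, `is_strong_result`, the threshold of 8, and the example path are hypothetical choices, not values taken from the lab.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScorecardEntry:
    category: str
    score: int                            # out of 10, as in the scorecard table
    evidence_links: tuple[str, ...] = ()  # paths or URLs into the evidence pack


def is_strong_result(entry: ScorecardEntry, threshold: int = 8) -> bool:
    """A score counts as strong only if it is high AND backed by evidence."""
    if not 0 <= entry.score <= 10:
        raise ValueError("score must be between 0 and 10")
    return entry.score >= threshold and len(entry.evidence_links) > 0


unbacked = ScorecardEntry("Human control", 9)
backed = ScorecardEntry("Human control", 9, ("runs/example/approval.json",))
print(is_strong_result(unbacked), is_strong_result(backed))
```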
8. Runner Discipline
Framework comparisons are only useful if failed runs are classified honestly. The lab therefore treats benchmark execution as a stateful process rather than a loose script.
| Run State | Meaning |
|---|---|
| pending | The run has not started. |
| started | The run is active or needs classification. |
| completed | The run finished with valid artifacts and can be scored. |
| excluded_contaminated | Artifacts are preserved, but the result is excluded from clean comparison. |
| failed_final | The run failed and is not retryable. |
| blocked_budget | The run was not launched because the budget or limit was reached. |
The lab keeps campaign isolation, variant filtering, targeted reruns, artifact preservation, and contamination handling as first-class benchmark concerns.
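Treating execution as a stateful process can be sketched as an enum plus an explicit transition table. The state names below come from the table above. The allowed transitions, and the `transition` and `is_scorable` helpers, are assumptions made for illustration, since the source does not define them.

```python
from enum import Enum


class RunState(str, Enum):
    PENDING = "pending"
    STARTED = "started"
    COMPLETED = "completed"
    EXCLUDED_CONTAMINATED = "excluded_contaminated"
    FAILED_FINAL = "failed_final"
    BLOCKED_BUDGET = "blocked_budget"


# Assumed transitions: every state not listed as a key is terminal.
ALLOWED = {
    RunState.PENDING: {RunState.STARTED, RunState.BLOCKED_BUDGET},
    RunState.STARTED: {
        RunState.COMPLETED,
        RunState.EXCLUDED_CONTAMINATED,
        RunState.FAILED_FINAL,
    },
}


def transition(current: RunState, target: RunState) -> RunState:
    """Return the new state, or raise if the move is not allowed."""
    if target not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target


def is_scorable(state: RunState) -> bool:
    """Only completed runs with valid artifacts enter the clean comparison."""
    return state is RunState.COMPLETED
```

Making terminal states explicit means a contaminated or failed run cannot be quietly moved back into the scorable set.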
9. Model API Boundary
The model layer is normalised enough that framework behaviour is not confused with provider behaviour.
Each run records:
- model provider and model name
- temperature and token limits
- retry policy
- timeout policy
- prompt or input hash
- token usage and cost estimate
- tool-call format and normalised tool calls
- provider errors, timeouts, or malformed responses
Failures are classified separately as framework orchestration failure, model or tool-call failure, benchmark infrastructure failure, task quality failure, or contaminated run.
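A per-call record along these lines might look like the sketch below. The failure classes mirror the list above. The `ModelCallRecord` field names, the use of SHA-256 for the prompt hash, and the example provider and model strings are assumptions, not details from the lab.

```python
import hashlib
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FailureClass(str, Enum):
    FRAMEWORK_ORCHESTRATION = "framework_orchestration"
    MODEL_OR_TOOL_CALL = "model_or_tool_call"
    BENCHMARK_INFRASTRUCTURE = "benchmark_infrastructure"
    TASK_QUALITY = "task_quality"
    CONTAMINATED_RUN = "contaminated_run"


def prompt_hash(prompt: str) -> str:
    """Stable fingerprint, so identical inputs can be matched across frameworks."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class ModelCallRecord:
    provider: str
    model: str
    temperature: float
    max_tokens: int
    max_retries: int
    timeout_seconds: float
    prompt_sha256: str
    input_tokens: int
    output_tokens: int
    failure: Optional[FailureClass] = None  # None means the call succeeded


record = ModelCallRecord(
    provider="example-provider",  # placeholder, not a real provider name
    model="example-model",        # placeholder, not a real model name
    temperature=0.0,
    max_tokens=2048,
    max_retries=2,
    timeout_seconds=60.0,
    prompt_sha256=prompt_hash("Produce a structured delivery spec."),
    input_tokens=120,
    output_tokens=480,
)
```

Storing a hash instead of the raw prompt lets two frameworks be checked for identical inputs without copying prompt text into every record.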
10. Best Fit And Weak Fit
The verdict is not a single ranking. Each framework is assessed for where it fits best:
- rapid MAS prototyping
- durable workflow orchestration
- typed and spec-first agents
- enterprise governance
- coding-worker integration
- weak fit cases where the abstraction hides too much state, evidence, or control
11. Failure Modes Observed
The lab records failure modes directly:
- lost state
- weak handoff
- unverifiable output
- hidden retries
- poor tool boundaries
- hard-to-debug agent behaviour
- brittle prompt coupling
- missing approval points
Failure modes are part of the result, not an embarrassment to remove from the comparison.
12. Evidence Pack
Each framework run should link to:
- source implementation
- run logs
- generated spec
- test output
- review notes
- approval decision
- benchmark score
- screenshots or demo output
13. Pathfinder Verdict
Each page ends with one direct judgement:
This framework is or is not a good fit for governed AI-native SDLC automation because…
Frameworks In Scope
The first pass focuses on frameworks that are relevant to agentic SDLC work:
- LangGraph
- OpenAI Agents SDK
- Microsoft Agent Framework
- Google ADK
- CrewAI
- Pydantic AI
- LlamaIndex
- Strands Agents
- OpenHands Software Agent SDK
Additional systems such as Devin, Claude Code, GitHub Copilot coding agent, Rovo Dev, GitLab Duo with Amazon Q, Factory Droid, and Amazon Q Developer are treated as packaged SDLC agents or workflow products rather than general-purpose MAS frameworks.
Evaluation Rubric
The lab compares each framework across criteria that matter for governed software delivery:
| Criterion | Why it matters |
|---|---|
| Orchestration model | Whether the workflow can represent long-running SDLC stages, retries, branching, and state. |
| Multi-agent support | Whether specialist agents can collaborate without becoming chat theatre. |
| Structured outputs | Whether specs, plans, risks, and evidence can be validated rather than parsed loosely. |
| Human-in-the-loop control | Whether approval, escalation, and review points are first-class design concerns. |
| MCP and tool integration | Whether the framework can consume real project context and expose useful tools safely. |
| Observability and tracing | Whether agent decisions and tool calls can be inspected after the fact. |
| Evaluation support | Whether the workflow can be tested for spec conformance, policy compliance, and regression risk. |
| Deployment model | Whether the framework has a credible path from experiment to controlled production use. |
| Governed SDLC fit | Whether it supports requirements, code, tests, governance, evidence, approval, and release readiness. |
| Software engineering performance | Whether the framework implementation can complete realistic software tasks with acceptable quality, cost, and rework. |
| Runner reliability | Whether runs can be supervised, resumed, retried, excluded, and audited without corrupting benchmark results. |
| Model API boundary | Whether model/provider behaviour can be normalised and separated from framework behaviour. |
Reference Tasks
The lab uses two task tracks.
Track 1: Synthetic Explanation Task
The synthetic task is used to explain the methodology clearly and keep public examples easy to understand:
SterlingPay wants to add round-ups into savings pots. The feature must support card transaction events, configurable rounding rules, customer opt-in, auditability, failure handling, and release controls.
This is useful for early workflow design, diagrams, screenshots, and blog explanations.
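To make "configurable rounding rules" concrete, one possible rule rounds each card transaction up to the next multiple of a chosen increment and moves the difference into the savings pot. This is only a sketch of the arithmetic under that assumption: the function name `round_up_amount` and the default increment of 1.00 are hypothetical, and the task statement does not define the actual rules.

```python
from decimal import Decimal, ROUND_CEILING


def round_up_amount(amount: Decimal, increment: Decimal = Decimal("1.00")) -> Decimal:
    """Return the difference between the amount and the next multiple of increment."""
    if amount <= 0 or increment <= 0:
        raise ValueError("amount and increment must be positive")
    # Number of whole increments needed to cover the amount.
    units = (amount / increment).to_integral_value(rounding=ROUND_CEILING)
    return units * increment - amount


# A 3.40 purchase rounded to the next 1.00 moves 0.60 into the pot.
print(round_up_amount(Decimal("3.40")))
```

`Decimal` is used rather than `float` so that currency amounts do not pick up binary rounding error.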
Track 2: Realistic Software Benchmark
The serious benchmark track uses the Flowglad / Series B Fintech profile from the devos benchmark work:
- TypeScript and Next.js payments platform.
- tRPC backend and PostgreSQL database with Drizzle ORM.
- GitHub Actions and Vercel-style delivery environment.
- Eight-story sprint profile covering bug fixes, backend changes, UI work, pricing model logic, and integration work.
- Reference implementation baseline derived from real merged PRs.
This track is used for software engineering performance scoring, EvoScore-style comparison, and failure-mode analysis.
For each framework, the lab will ask:
- Can it turn this request into a structured spec?
- Can it identify missing requirements and risks?
- Can it retrieve or accept relevant policy and architecture context?
- Can it coordinate specialist agents for QA, architecture, security, and implementation planning?
- Can it produce an evidence pack suitable for human review?
- Can the output be evaluated repeatably?
- Can it complete realistic software tasks with clean tests, lint, typecheck, and reviewable changes?
System Design
flowchart LR
A["Reference task"] --> B["Framework adapter"]
B --> C["Spec gate"]
B --> D["Agent run"]
B --> E["Human gate"]
B --> F["Model boundary"]
C --> G["Evidence bundle"]
D --> G
E --> G
F --> G
G --> H["Runner state and classification"]
H --> I["Governed SDLC score"]
H --> J["Software engineering score"]
I --> K["Framework comparison"]
J --> K
Adapter Contract
Each framework adapter maps its native concepts into the same benchmark contract:
| Contract Object | Purpose |
|---|---|
| SpecInput | The structured requirement, constraints, acceptance criteria, and risk class. |
| AgentRun | The framework-specific run, including agents, tools, messages, state, and outputs. |
| GateDecision | Human approval, rejection, escalation, or automated gate result. |
| EvidenceBundle | Logs, traces, specs, tests, review notes, tool calls, and benchmark artifacts. |
| EvalResult | Governed SDLC score, software engineering score, failure class, and run metadata. |
| RunState | The lifecycle state used to decide whether the run is scorable, retryable, excluded, or failed. |
| ModelClient | A thin boundary for invoking models, normalising tool calls, recording usage, and classifying provider failures. |
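Part of this contract can be sketched with dataclasses and a structural `Protocol`. The object names `SpecInput`, `AgentRun`, and `EvidenceBundle` come from the table above. Their fields, the `FrameworkAdapter` protocol, and the `EchoAdapter` stand-in are assumptions made for illustration, and no real framework is called.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass(frozen=True)
class SpecInput:
    requirement: str
    constraints: tuple[str, ...]
    acceptance_criteria: tuple[str, ...]
    risk_class: str


@dataclass
class EvidenceBundle:
    logs: list[str] = field(default_factory=list)
    tool_calls: list[dict] = field(default_factory=list)


@dataclass
class AgentRun:
    framework: str
    outputs: dict[str, str]
    evidence: EvidenceBundle


class FrameworkAdapter(Protocol):
    """Anything with a name and a run() method of this shape is an adapter."""

    name: str

    def run(self, spec: SpecInput) -> AgentRun: ...


class EchoAdapter:
    """Stand-in adapter that calls no framework; useful for testing the harness."""

    name = "echo"

    def run(self, spec: SpecInput) -> AgentRun:
        evidence = EvidenceBundle(logs=[f"received spec, risk class {spec.risk_class}"])
        return AgentRun(self.name, {"spec": spec.requirement}, evidence)


def execute(adapter: FrameworkAdapter, spec: SpecInput) -> AgentRun:
    """The harness depends only on the contract, never on a specific framework."""
    return adapter.run(spec)
```

A stand-in adapter like this lets the harness and scoring be exercised before any real framework is wired in.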
What This Proves
- Practical fluency with current agent and MAS frameworks.
- Ability to evaluate frameworks through engineering operating needs, not hype.
- Understanding of why SDLC automation requires specs, evidence, governance, and evals.
- Judgement about where packaged agents, orchestration frameworks, and custom control planes fit together.
- Ability to connect framework evaluation to production-grade benchmark discipline.
Limitations
This is not a claim that any framework can safely automate the whole SDLC. The lab is deliberately designed to expose limits: weak structured output, poor observability, awkward human approval, fragile tool integration, missing evaluation support, model API mismatches, and benchmark contamination.
The results are also time-sensitive. Agent frameworks are moving quickly, so the lab should be treated as a dated snapshot with repeatable criteria, not a permanent ranking.
Next Iteration
The first implementation pass will compare two or three frameworks against the reference task, then expand the matrix as the workflow and scoring model stabilise.
The first group is:
- LangGraph for stateful, controlled workflows.
- OpenAI Agents SDK for lightweight orchestration and tracing.
- CrewAI for role-based crews, and as a useful comparison point against LangChain and LangGraph.
Related Skills
Multi-agent systems, agent frameworks, AI-native SDLC, LangGraph, OpenAI Agents SDK, CrewAI, MCP, context engineering, structured outputs, evaluation, governance, human-in-the-loop workflows, benchmark design, evidence packs, model API boundaries.