Overview

The Problem

Agent framework comparisons often stop at simple examples: create an agent, attach a tool, ask a question, show a response.

That is useful for learning syntax, but it does not answer the question engineering leaders actually need answered:

Can this framework support governed software delivery?

For AI-native SDLC work, the important differences go beyond developer ergonomics: orchestration control, human approval, structured outputs, MCP and tool integration, observability, deployment, evaluation, and how well the framework supports evidence-backed delivery decisions.

The Solution

The Agent Framework Landscape Lab (AFLL) is a hands-on comparison of agent and multi-agent frameworks using SDLC automation tasks rather than generic demos.

Each framework is assessed against the same reference workflow:

Capture a messy software change request.
Produce a structured delivery spec.
Identify required context, policies, and constraints.
Generate an implementation and test plan.
Produce security and governance risk notes.
Prepare approval-ready evidence.

The aim is not to crown a single winner. The aim is to understand which frameworks fit which parts of governed AI-native software delivery.

Devos-Informed Loop

The lab is designed as a feedback loop between public framework evaluation and private product learning:

AFLL defines framework-neutral comparison tasks.
Each framework implements the same governed SDLC contract.
A devos-informed benchmark harness evaluates runs for quality, evidence, governance, cost, and failure mode.
AFLL publishes anonymised scorecards, evidence packs, and framework notes.
Gaps found in the comparison feed back into the control-plane, benchmark, evidence, and governance model.

The public story is Pathfinder. The operating model is informed by real work on governed agent execution, spec gates, evidence packs, benchmark scoring, and production readiness.

Lab Template

Every framework is assessed with the same template so the comparison is evidence-led rather than opinion-led.

1. Framework Snapshot

The snapshot captures the basic facts that affect adoption and comparison:

name and version tested
language and runtime
licence and hosting model
primary abstraction
maturity signals
ecosystem fit
best-known use case
current positioning

2. Mental Model

This section explains how the framework wants builders to think about agent systems:

graph or workflow
crew or team
handoff chain
typed agent
tool-calling runtime
autonomous coding worker
orchestration layer

This matters because the mental model shapes how easy it is to represent real SDLC controls such as gates, evidence, retries, escalation, and approval.

3. Same Task Implementation

Each framework runs the same SDLC scenario:

messy requirement
structured spec
architecture note
QA and test plan
implementation or simulated implementation
review
approval gate
evidence pack

The goal is not to produce a perfect application in every framework. The goal is to expose how each framework handles the same governed delivery workflow.

4. Governed SDLC Fit

The first score asks whether the framework supports the operating model:

Capability	What is assessed
Spec gate	Can work be blocked until the requirement is specific enough?
Role separation	Can specialist agents have distinct responsibilities, tools, and context?
Human approval	Can the workflow pause for explicit review or escalation?
Evidence capture	Can outputs, decisions, risks, and gate results be recorded?
Provenance	Can the system explain where outputs and decisions came from?
Retry and recovery	Can failed steps recover without losing state or hiding failure?
Tool permissioning	Can tools be scoped by role, task, or risk class?
Eval integration	Can quality checks and benchmark results become part of the workflow?
Long-running workflow support	Can the framework support multi-step SDLC work beyond one prompt?
Audit-friendly run record	Can a reviewer understand what happened after the run completes?
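
The spec gate row can be made concrete with a small sketch. This is an illustration only, not the lab's implementation: `SpecDraft`, its fields, and the three blocking rules are assumptions chosen to show the idea of refusing to proceed until a requirement is specific enough.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SpecDraft:
    """Hypothetical structured spec; the field names are illustrative."""
    summary: str
    acceptance_criteria: list[str] = field(default_factory=list)
    constraints: list[str] = field(default_factory=list)
    risk_class: Optional[str] = None


def spec_gate(spec: SpecDraft) -> list[str]:
    """Return blocking reasons; an empty list means the gate passes."""
    blockers: list[str] = []
    if not spec.summary.strip():
        blockers.append("missing summary")
    if not spec.acceptance_criteria:
        blockers.append("no acceptance criteria")
    if spec.risk_class is None:
        blockers.append("risk class not set")
    return blockers


draft = SpecDraft(summary="Add round-ups into savings pots")
blockers = spec_gate(draft)
if blockers:
    print("BLOCKED:", ", ".join(blockers))
else:
    print("Spec gate passed")
```

Returning the list of reasons rather than a bare boolean keeps the gate result usable as evidence: a reviewer can see why work was blocked, not only that it was.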

5. Software Engineering Performance

The second score is SWE-bench-inspired, but it is not presented as SWE-bench unless the lab is actually running SWE-bench-compatible tasks.

For realistic software engineering tasks, the lab uses an EvoScore-style breakdown inspired by the devos benchmark work:

Dimension	What is measured
Story completion	How much of the requested work was completed, including partial completion.
Test coverage	Whether meaningful tests were added and whether relevant test coverage improved.
Code quality	Whether lint, typecheck, and targeted test checks pass.
Governance	Whether commits, evidence, and required controls satisfy the benchmark rules.
Efficiency	Turns, runtime, intervention count, and cost per completed unit of work.
Reference parity	How close the result is to a reference implementation or human baseline.
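
One way to fold the six dimensions above into a single number is a weighted mean. The sketch below is an assumption, not the EvoScore formula: the source names the dimensions but does not say how they are combined, so the equal default weights, the 0.0 to 1.0 scale, and the function name `combined_score` are all hypothetical.

```python
DIMENSIONS = (
    "story_completion",
    "test_coverage",
    "code_quality",
    "governance",
    "efficiency",
    "reference_parity",
)


def combined_score(scores, weights=None):
    """Weighted mean of per-dimension scores, each expected in [0.0, 1.0]."""
    # Assumption: equal weights unless the caller supplies its own.
    weights = weights or {name: 1.0 for name in DIMENSIONS}
    missing = [name for name in DIMENSIONS if name not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    for name in DIMENSIONS:
        if not 0.0 <= scores[name] <= 1.0:
            raise ValueError(f"{name} must be between 0.0 and 1.0")
    total_weight = sum(weights[name] for name in DIMENSIONS)
    weighted = sum(weights[name] * scores[name] for name in DIMENSIONS)
    return round(weighted / total_weight, 3)


example = {name: 0.5 for name in DIMENSIONS}
print(combined_score(example))
```

Refusing to score when a dimension is missing avoids silently treating an unmeasured dimension as zero or as perfect.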

The lab also records lower-level task signals:

Dimension	What is measured
Task completion	Did the framework implementation complete the requested work?
Spec conformance	Did the output satisfy the structured spec?
Test pass rate	Did relevant tests pass?
Regression rate	Did it break existing behaviour?
First-pass success	Did it work before human correction?
Rework required	How much intervention was needed?
Time and latency	How long did the run take?
Cost	What token, API, or runtime cost was measurable?
Change quality	Was the result small, reviewable, and maintainable?
Evidence quality	Did it leave enough proof for a reviewer to trust or reject the work?

6. Operational Notes

The lab also records what the framework is like to operate:

setup complexity
local developer experience
CI suitability
observability and debugging
dependency weight
deployment model
state persistence
failure recovery
security posture
maintainability

7. Framework Scorecard

Each framework receives a concise scorecard:

Category	Score
Governed SDLC fit	/10
Software engineering performance	/10
Evidence and observability	/10
Human control	/10
Developer experience	/10
Operational readiness	/10

The scores are useful only when paired with the evidence pack and failure notes. A high score without inspectable evidence is not treated as a strong result.
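
The rule that a high score needs inspectable evidence can be expressed directly in the scorecard type. This is a minimal sketch under stated assumptions: the `evidence_links` field, the mean-based check, and the threshold of 8 are hypothetical and are not taken from the lab.

```python
from dataclasses import dataclass, field

CATEGORIES = (
    "Governed SDLC fit",
    "Software engineering performance",
    "Evidence and observability",
    "Human control",
    "Developer experience",
    "Operational readiness",
)


@dataclass
class Scorecard:
    framework: str
    scores: dict[str, int]  # category name -> score out of 10
    evidence_links: list[str] = field(default_factory=list)  # hypothetical field

    def __post_init__(self) -> None:
        # Every category must be present and within the /10 range.
        for category in CATEGORIES:
            value = self.scores.get(category)
            if value is None or not 0 <= value <= 10:
                raise ValueError(f"{category!r} needs a score from 0 to 10")

    def is_strong_result(self, threshold: float = 8.0) -> bool:
        """A high mean score only counts when evidence is attached."""
        mean = sum(self.scores[c] for c in CATEGORIES) / len(CATEGORIES)
        return mean >= threshold and bool(self.evidence_links)


card = Scorecard("example-framework", {c: 9 for c in CATEGORIES})
print(card.is_strong_result())  # False: no evidence links attached
```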

8. Runner Discipline

Framework comparisons are only useful if failed runs are classified honestly. The lab therefore treats benchmark execution as a stateful process rather than a loose script.

Run State	Meaning
`pending`	The run has not started.
`started`	The run is active or needs classification.
`completed`	The run finished with valid artifacts and can be scored.
`excluded_contaminated`	Artifacts are preserved, but the result is excluded from clean comparison.
`failed_final`	The run failed and is not retryable.
`blocked_budget`	The run was not launched because the budget or limit was reached.

The lab keeps campaign isolation, variant filtering, targeted reruns, artifact preservation, and contamination handling as first-class benchmark concerns.
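
The run states above map naturally onto an enum with an explicit transition table. The state names come from the table; the allowed transitions are an assumption, since the source defines the states but not the edges between them.

```python
from enum import Enum


class RunState(Enum):
    PENDING = "pending"
    STARTED = "started"
    COMPLETED = "completed"
    EXCLUDED_CONTAMINATED = "excluded_contaminated"
    FAILED_FINAL = "failed_final"
    BLOCKED_BUDGET = "blocked_budget"


# Assumed transitions; states absent from this map have no outgoing edges here.
ALLOWED_TRANSITIONS = {
    RunState.PENDING: {RunState.STARTED, RunState.BLOCKED_BUDGET},
    RunState.STARTED: {
        RunState.COMPLETED,
        RunState.EXCLUDED_CONTAMINATED,
        RunState.FAILED_FINAL,
    },
}


def transition(current: RunState, target: RunState) -> RunState:
    """Move to the target state, rejecting edges the table does not allow."""
    if target not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target


def is_scorable(state: RunState) -> bool:
    """Only completed runs with valid artifacts enter the clean comparison."""
    return state is RunState.COMPLETED


state = transition(RunState.PENDING, RunState.STARTED)
state = transition(state, RunState.EXCLUDED_CONTAMINATED)
print(state.value, is_scorable(state))  # excluded_contaminated False
```

Making illegal transitions raise means a contaminated or failed run cannot be quietly relabelled as completed by a later script.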

9. Model API Boundary

The model layer is normalised enough that framework behaviour is not confused with provider behaviour.

Each run records:

model provider and model name
temperature and token limits
retry policy
timeout policy
prompt or input hash
token usage and cost estimate
tool-call format and normalised tool calls
provider errors, timeouts, or malformed responses

Failures are classified separately as framework orchestration failure, model or tool-call failure, benchmark infrastructure failure, task quality failure, or contaminated run.
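
A per-call record covering these fields might look like the sketch below. The failure classes mirror the sentence above; everything else is an assumption for illustration, including the field names, the choice of SHA-256 for the prompt hash, and the placeholder provider and model names.

```python
import hashlib
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FailureClass(Enum):
    FRAMEWORK_ORCHESTRATION = "framework_orchestration_failure"
    MODEL_OR_TOOL_CALL = "model_or_tool_call_failure"
    BENCHMARK_INFRASTRUCTURE = "benchmark_infrastructure_failure"
    TASK_QUALITY = "task_quality_failure"
    CONTAMINATED_RUN = "contaminated_run"


def prompt_hash(prompt: str) -> str:
    """Hash the prompt so runs can be compared without storing the text."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class ModelCallRecord:
    provider: str
    model: str
    temperature: float
    max_tokens: int
    max_retries: int
    timeout_seconds: float
    prompt_sha256: str
    input_tokens: int
    output_tokens: int
    cost_estimate: float
    failure: Optional[FailureClass] = None  # None means the call succeeded


record = ModelCallRecord(
    provider="example-provider",  # placeholder, not a real provider
    model="example-model",        # placeholder, not a real model
    temperature=0.0,
    max_tokens=2048,
    max_retries=2,
    timeout_seconds=60.0,
    prompt_sha256=prompt_hash("Produce a structured delivery spec."),
    input_tokens=0,
    output_tokens=0,
    cost_estimate=0.0,
    failure=FailureClass.MODEL_OR_TOOL_CALL,
)
print(record.failure.value)  # model_or_tool_call_failure
```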

10. Best Fit And Weak Fit

The verdict is not a single ranking. Each framework is assessed for where it fits best:

rapid multi-agent system (MAS) prototyping
durable workflow orchestration
typed and spec-first agents
enterprise governance
coding-worker integration
weak fit cases where the abstraction hides too much state, evidence, or control

11. Failure Modes Observed

The lab records failure modes directly:

lost state
weak handoff
unverifiable output
hidden retries
poor tool boundaries
hard-to-debug agent behaviour
brittle prompt coupling
missing approval points

Failure modes are part of the result, not an embarrassment to remove from the comparison.

12. Evidence Pack

Each framework run should link to:

source implementation
run logs
generated spec
test output
review notes
approval decision
benchmark score
screenshots or demo output

13. Pathfinder Verdict

Each page ends with one direct judgement:

This framework is or is not a good fit for governed AI-native SDLC automation because…

Frameworks In Scope

The first pass focuses on frameworks that are relevant to agentic SDLC work:

LangGraph
OpenAI Agents SDK
Microsoft Agent Framework
Google ADK
CrewAI
Pydantic AI
LlamaIndex
Strands Agents
OpenHands Software Agent SDK

Additional systems such as Devin, Claude Code, GitHub Copilot coding agent, Rovo Dev, GitLab Duo with Amazon Q, Factory Droid, and Amazon Q Developer are treated as packaged SDLC agents or workflow products rather than general-purpose MAS frameworks.

Evaluation Rubric

The lab compares each framework across criteria that matter for governed software delivery:

Criterion	Why it matters
Orchestration model	Whether the workflow can represent long-running SDLC stages, retries, branching, and state.
Multi-agent support	Whether specialist agents can collaborate without becoming chat theatre.
Structured outputs	Whether specs, plans, risks, and evidence can be validated rather than parsed loosely.
Human-in-the-loop control	Whether approval, escalation, and review points are first-class design concerns.
MCP and tool integration	Whether the framework can consume real project context and expose useful tools safely.
Observability and tracing	Whether agent decisions and tool calls can be inspected after the fact.
Evaluation support	Whether the workflow can be tested for spec conformance, policy compliance, and regression risk.
Deployment model	Whether the framework has a credible path from experiment to controlled production use.
Governed SDLC fit	Whether it supports requirements, code, tests, governance, evidence, approval, and release readiness.
Software engineering performance	Whether the framework implementation can complete realistic software tasks with acceptable quality, cost, and rework.
Runner reliability	Whether runs can be supervised, resumed, retried, excluded, and audited without corrupting benchmark results.
Model API boundary	Whether model/provider behaviour can be normalised and separated from framework behaviour.

Reference Tasks

The lab uses two task tracks.

Track 1: Synthetic Explanation Task

The synthetic task is used to explain the methodology clearly and keep public examples easy to understand:

SterlingPay wants to add round-ups into savings pots. The feature must support card transaction events, configurable rounding rules, customer opt-in, auditability, failure handling, and release controls.

This is useful for early workflow design, diagrams, screenshots, and blog explanations.

Track 2: Realistic Software Benchmark

The serious benchmark track uses the Flowglad / Series B Fintech profile from the devos benchmark work:

TypeScript and Next.js payments platform.
tRPC backend and PostgreSQL database with Drizzle ORM.
GitHub Actions and Vercel-style delivery environment.
Eight-story sprint profile covering bug fixes, backend changes, UI work, pricing model logic, and integration work.
Reference implementation baseline derived from real merged PRs.

This track is used for software engineering performance scoring, EvoScore-style comparison, and failure-mode analysis.

For each framework, the lab will ask:

Can it turn this request into a structured spec?
Can it identify missing requirements and risks?
Can it retrieve or accept relevant policy and architecture context?
Can it coordinate specialist agents for QA, architecture, security, and implementation planning?
Can it produce an evidence pack suitable for human review?
Can the output be evaluated repeatably?
Can it complete realistic software tasks with clean tests, lint, typecheck, and reviewable changes?

System Design

flowchart LR
  A["Reference task"] --> B["Framework adapter"]
  B --> C["Spec gate"]
  B --> D["Agent run"]
  B --> E["Human gate"]
  B --> F["Model boundary"]
  C --> G["Evidence bundle"]
  D --> G
  E --> G
  F --> G
  G --> H["Runner state and classification"]
  H --> I["Governed SDLC score"]
  H --> J["Software engineering score"]
  I --> K["Framework comparison"]
  J --> K

Adapter Contract

Each framework adapter maps its native concepts into the same benchmark contract:

Contract Object	Purpose
`SpecInput`	The structured requirement, constraints, acceptance criteria, and risk class.
`AgentRun`	The framework-specific run, including agents, tools, messages, state, and outputs.
`GateDecision`	Human approval, rejection, escalation, or automated gate result.
`EvidenceBundle`	Logs, traces, specs, tests, review notes, tool calls, and benchmark artifacts.
`EvalResult`	Governed SDLC score, software engineering score, failure class, and run metadata.
`RunState`	The lifecycle state used to decide whether the run is scorable, retryable, excluded, or failed.
`ModelClient`	A thin boundary for invoking models, normalising tool calls, recording usage, and classifying provider failures.
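
Part of this contract can be sketched as plain types plus an adapter interface. Only the object names `SpecInput` and `AgentRun` come from the table; the fields, the `FrameworkAdapter` protocol, and the `StubAdapter` stand-in are assumptions, and the stub calls no real framework.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class SpecInput:
    requirement: str
    constraints: list[str]
    acceptance_criteria: list[str]
    risk_class: str


@dataclass
class AgentRun:
    framework: str
    outputs: dict[str, str] = field(default_factory=dict)
    tool_calls: list[str] = field(default_factory=list)


class FrameworkAdapter(Protocol):
    """Hypothetical interface each framework adapter would implement."""

    name: str

    def run(self, spec: SpecInput) -> AgentRun: ...


class StubAdapter:
    """Stand-in adapter; it only shows the shape of the mapping."""

    name = "stub"

    def run(self, spec: SpecInput) -> AgentRun:
        plan = "; ".join(spec.acceptance_criteria)
        return AgentRun(framework=self.name, outputs={"test_plan": plan})


def execute(adapter: FrameworkAdapter, spec: SpecInput) -> AgentRun:
    """The harness depends on the contract, never on a specific framework."""
    return adapter.run(spec)


spec = SpecInput(
    requirement="Add round-ups into savings pots",
    constraints=["customer opt-in required"],
    acceptance_criteria=["rounding rules are configurable", "events are auditable"],
    risk_class="medium",
)
run = execute(StubAdapter(), spec)
print(run.framework, run.outputs["test_plan"])
```

Using a structural protocol means an adapter does not need to inherit from a lab base class; it only has to expose the same `name` and `run` shape.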

What This Proves

Practical fluency with current agent and MAS frameworks.
Ability to evaluate frameworks through engineering operating needs, not hype.
Understanding of why SDLC automation requires specs, evidence, governance, and evals.
Judgement about where packaged agents, orchestration frameworks, and custom control planes fit together.
Ability to connect framework evaluation to production-grade benchmark discipline.

Limitations

This is not a claim that any framework can safely automate the whole SDLC. The lab is deliberately designed to expose limits: weak structured output, poor observability, awkward human approval, fragile tool integration, missing evaluation support, model API mismatches, and benchmark contamination.

The results are also time-sensitive. Agent frameworks are moving quickly, so the lab should be treated as a dated snapshot with repeatable criteria, not a permanent ranking.

Next Iteration

The first implementation pass will compare two or three frameworks against the reference task, then expand the matrix as the workflow and scoring model stabilise.

The first group is:

LangGraph for stateful, controlled workflows.
OpenAI Agents SDK for lightweight orchestration and tracing.
CrewAI for role-based crews, and as a useful comparison point against LangChain and LangGraph.

Multi-agent systems, agent frameworks, AI-native SDLC, LangGraph, OpenAI Agents SDK, CrewAI, MCP, context engineering, structured outputs, evaluation, governance, human-in-the-loop workflows, benchmark design, evidence packs, model API boundaries.
