Agent framework comparison matrix for governed SDLC automation.

Agent Framework Landscape Lab

In Progress
Multi-Agent Systems, AI-Native SDLC, Agent Frameworks, Evaluation, Context Engineering

Overview

The Problem

Agent framework comparisons often stop at simple examples: create an agent, attach a tool, ask a question, show a response.

That is useful for learning syntax, but it does not answer the question engineering leaders actually need answered:

Can this framework support governed software delivery?

For AI-native SDLC work, the important differences go beyond developer ergonomics. They include orchestration control, human approval, structured outputs, MCP and tool integration, observability, deployment, evaluation, and how well the framework supports evidence-backed delivery decisions.

The Solution

The Agent Framework Landscape Lab is a hands-on comparison of agent and multi-agent frameworks using SDLC automation tasks rather than generic demos.

Each framework is assessed against the same reference workflow:

  1. Capture a messy software change request.
  2. Produce a structured delivery spec.
  3. Identify required context, policies, and constraints.
  4. Generate an implementation and test plan.
  5. Produce security and governance risk notes.
  6. Prepare approval-ready evidence.

The aim is not to crown a single winner. The aim is to understand which frameworks fit which parts of governed AI-native software delivery.

Devos-Informed Loop

The lab is designed as a feedback loop between public framework evaluation and private product learning:

  1. AFLL defines framework-neutral comparison tasks.
  2. Each framework implements the same governed SDLC contract.
  3. A devos-informed benchmark harness evaluates runs for quality, evidence, governance, cost, and failure mode.
  4. AFLL publishes anonymised scorecards, evidence packs, and framework notes.
  5. Gaps found in the comparison feed back into the control-plane, benchmark, evidence, and governance model.

The public story is Pathfinder. The operating model is informed by real work on governed agent execution, spec gates, evidence packs, benchmark scoring, and production readiness.

Lab Template

Every framework is assessed with the same template so the comparison is evidence-led rather than opinion-led.

1. Framework Snapshot

The snapshot captures the basic facts that affect adoption and comparison:

  • name and version tested
  • language and runtime
  • licence and hosting model
  • primary abstraction
  • maturity signals
  • ecosystem fit
  • best-known use case
  • current positioning

2. Mental Model

This section explains how the framework wants builders to think about agent systems:

  • graph or workflow
  • crew or team
  • handoff chain
  • typed agent
  • tool-calling runtime
  • autonomous coding worker
  • orchestration layer

This matters because the mental model shapes how easy it is to represent real SDLC controls such as gates, evidence, retries, escalation, and approval.

3. Same Task Implementation

Each framework runs the same SDLC scenario:

  • messy requirement
  • structured spec
  • architecture note
  • QA and test plan
  • implementation or simulated implementation
  • review
  • approval gate
  • evidence pack

The goal is not to produce a perfect application in every framework. The goal is to expose how each framework handles the same governed delivery workflow.

4. Governed SDLC Fit

The first score asks whether the framework supports the operating model:

| Capability | What is assessed |
| --- | --- |
| Spec gate | Can work be blocked until the requirement is specific enough? |
| Role separation | Can specialist agents have distinct responsibilities, tools, and context? |
| Human approval | Can the workflow pause for explicit review or escalation? |
| Evidence capture | Can outputs, decisions, risks, and gate results be recorded? |
| Provenance | Can the system explain where outputs and decisions came from? |
| Retry and recovery | Can failed steps recover without losing state or hiding failure? |
| Tool permissioning | Can tools be scoped by role, task, or risk class? |
| Eval integration | Can quality checks and benchmark results become part of the workflow? |
| Long-running workflow support | Can the framework support multi-step SDLC work beyond one prompt? |
| Audit-friendly run record | Can a reviewer understand what happened after the run completes? |
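
To make the spec gate row concrete, here is a minimal sketch of a gate that blocks work until a draft spec carries the fields a reviewer would need. The `DraftSpec` shape, its field names, and the three risk classes are assumptions made for illustration; they are not the lab's actual gate rules.

```python
from dataclasses import dataclass, field


@dataclass
class DraftSpec:
    # Hypothetical spec shape; the lab's real contract may differ.
    title: str = ""
    acceptance_criteria: list[str] = field(default_factory=list)
    risk_class: str = ""


def spec_gate(spec: DraftSpec) -> list[str]:
    """Return the reasons the spec is blocked; an empty list means it passes."""
    reasons = []
    if not spec.title.strip():
        reasons.append("missing title")
    if not spec.acceptance_criteria:
        reasons.append("no acceptance criteria")
    # Assumed risk classes, chosen only for this sketch.
    if spec.risk_class not in {"low", "medium", "high"}:
        reasons.append("risk class not set")
    return reasons
```

Returning the list of reasons, rather than a bare boolean, means the gate result can itself be stored as evidence.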

5. Software Engineering Performance

The second score is SWE-bench-inspired, but it is not presented as SWE-bench unless the lab is actually running SWE-bench-compatible tasks.

For realistic software engineering tasks, the lab uses an EvoScore-style breakdown inspired by the devos benchmark work:

| Dimension | What is measured |
| --- | --- |
| Story completion | How much of the requested work was completed, including partial completion. |
| Test coverage | Whether meaningful tests were added and whether relevant test coverage improved. |
| Code quality | Whether lint, typecheck, and targeted test checks pass. |
| Governance | Whether commits, evidence, and required controls satisfy the benchmark rules. |
| Efficiency | Turns, runtime, intervention count, and cost per completed unit of work. |
| Reference parity | How close the result is to a reference implementation or human baseline. |
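
One way such a breakdown could be folded into a single number is a weighted mean over the six dimensions. The sketch below assumes each dimension has already been normalised to the range 0 to 1. The weights are hypothetical: the dimensions come from the table above, but their relative weighting is not stated here.

```python
# Hypothetical weights for illustration only; they sum to 1.0.
WEIGHTS = {
    "story_completion": 0.30,
    "test_coverage": 0.15,
    "code_quality": 0.20,
    "governance": 0.15,
    "efficiency": 0.10,
    "reference_parity": 0.10,
}


def composite_score(dimensions: dict[str, float]) -> float:
    """Weighted mean of per-dimension scores, each expected in [0, 1]."""
    missing = WEIGHTS.keys() - dimensions.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0.0 <= dimensions[name] <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    return sum(WEIGHTS[name] * dimensions[name] for name in WEIGHTS)
```

Refusing to score when a dimension is missing keeps partial runs from looking like weak but valid results.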

The lab also records lower-level task signals:

| Dimension | What is measured |
| --- | --- |
| Task completion | Did the framework implementation complete the requested work? |
| Spec conformance | Did the output satisfy the structured spec? |
| Test pass rate | Did relevant tests pass? |
| Regression rate | Did it break existing behaviour? |
| First-pass success | Did it work before human correction? |
| Rework required | How much intervention was needed? |
| Time and latency | How long did the run take? |
| Cost | What token, API, or runtime cost was measurable? |
| Change quality | Was the result small, reviewable, and maintainable? |
| Evidence quality | Did it leave enough proof for a reviewer to trust or reject the work? |

6. Operational Notes

The lab also records what the framework is like to operate:

  • setup complexity
  • local developer experience
  • CI suitability
  • observability and debugging
  • dependency weight
  • deployment model
  • state persistence
  • failure recovery
  • security posture
  • maintainability

7. Framework Scorecard

Each framework receives a concise scorecard:

| Category | Score |
| --- | --- |
| Governed SDLC fit | /10 |
| Software engineering performance | /10 |
| Evidence and observability | /10 |
| Human control | /10 |
| Developer experience | /10 |
| Operational readiness | /10 |

The scores are useful only when paired with the evidence pack and failure notes. A high score without inspectable evidence is not treated as a strong result.

8. Runner Discipline

Framework comparisons are only useful if failed runs are classified honestly. The lab therefore treats benchmark execution as a stateful process rather than a loose script.

| Run State | Meaning |
| --- | --- |
| pending | The run has not started. |
| started | The run is active or needs classification. |
| completed | The run finished with valid artifacts and can be scored. |
| excluded_contaminated | Artifacts are preserved, but the result is excluded from clean comparison. |
| failed_final | The run failed and is not retryable. |
| blocked_budget | The run was not launched because the budget or limit was reached. |
The lab keeps campaign isolation, variant filtering, targeted reruns, artifact preservation, and contamination handling as first-class benchmark concerns.
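
The run states above can be sketched as a small state machine. The state names come from the table; the transition table is an assumption made for this sketch (for example, that only a `started` run can be classified as contaminated), not a description of the lab's runner.

```python
from enum import Enum


class RunState(str, Enum):
    PENDING = "pending"
    STARTED = "started"
    COMPLETED = "completed"
    EXCLUDED_CONTAMINATED = "excluded_contaminated"
    FAILED_FINAL = "failed_final"
    BLOCKED_BUDGET = "blocked_budget"


# Assumed transitions; states with no entry are treated as terminal.
ALLOWED = {
    RunState.PENDING: {RunState.STARTED, RunState.BLOCKED_BUDGET},
    RunState.STARTED: {
        RunState.COMPLETED,
        RunState.EXCLUDED_CONTAMINATED,
        RunState.FAILED_FINAL,
    },
}


def transition(current: RunState, target: RunState) -> RunState:
    """Return the new state, or raise if the move is not in ALLOWED."""
    if target not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target


def is_scorable(state: RunState) -> bool:
    """Only completed runs have valid artifacts and can be scored."""
    return state is RunState.COMPLETED
```

Making illegal transitions raise, instead of silently overwriting state, is what keeps a failed or contaminated run from being relabelled as clean.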

9. Model API Boundary

The model layer is normalised enough that framework behaviour is not confused with provider behaviour.

Each run records:

  • model provider and model name
  • temperature and token limits
  • retry policy
  • timeout policy
  • prompt or input hash
  • token usage and cost estimate
  • tool-call format and normalised tool calls
  • provider errors, timeouts, or malformed responses

Failures are classified separately as framework orchestration failure, model or tool-call failure, benchmark infrastructure failure, task quality failure, or contaminated run.
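
A per-call record along these lines might look like the sketch below. The failure classes mirror the sentence above. The record's field names, the use of SHA-256 for the prompt hash, and the per-1,000-token pricing inputs are assumptions for illustration.

```python
import hashlib
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FailureClass(Enum):
    FRAMEWORK_ORCHESTRATION = "framework_orchestration"
    MODEL_OR_TOOL_CALL = "model_or_tool_call"
    BENCHMARK_INFRASTRUCTURE = "benchmark_infrastructure"
    TASK_QUALITY = "task_quality"
    CONTAMINATED_RUN = "contaminated_run"


def prompt_hash(prompt: str) -> str:
    """Stable fingerprint of the prompt (SHA-256 is an assumed choice)."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def estimate_cost(input_tokens: int, output_tokens: int,
                  price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Cost estimate from caller-supplied prices per 1,000 tokens."""
    return (input_tokens / 1000) * price_in_per_1k \
        + (output_tokens / 1000) * price_out_per_1k


@dataclass(frozen=True)
class ModelCallRecord:
    # Hypothetical field names covering the items listed above.
    provider: str
    model: str
    temperature: float
    max_tokens: int
    prompt_sha256: str
    input_tokens: int
    output_tokens: int
    failure: Optional[FailureClass] = None  # None means the call succeeded
```

Storing the hash, not the prompt text, lets two runs be checked for identical inputs without copying prompt content into every record.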

10. Best Fit And Weak Fit

The verdict is not a single ranking. Each framework is assessed for where it fits best:

  • rapid MAS prototyping
  • durable workflow orchestration
  • typed and spec-first agents
  • enterprise governance
  • coding-worker integration
  • weak fit cases where the abstraction hides too much state, evidence, or control

11. Failure Modes Observed

The lab records failure modes directly:

  • lost state
  • weak handoff
  • unverifiable output
  • hidden retries
  • poor tool boundaries
  • hard-to-debug agent behaviour
  • brittle prompt coupling
  • missing approval points

Failure modes are part of the result, not an embarrassment to remove from the comparison.

12. Evidence Pack

Each framework run should link to:

  • source implementation
  • run logs
  • generated spec
  • test output
  • review notes
  • approval decision
  • benchmark score
  • screenshots or demo output
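
The list above can be enforced with a simple completeness check before a run is scored. This is a sketch only: the key names are hypothetical identifiers for the listed artifacts, and the links are assumed to be plain strings.

```python
# Hypothetical keys, one per artifact in the list above.
REQUIRED_ARTIFACTS = (
    "source_implementation",
    "run_logs",
    "generated_spec",
    "test_output",
    "review_notes",
    "approval_decision",
    "benchmark_score",
    "demo_output",
)


def missing_artifacts(links: dict[str, str]) -> list[str]:
    """Return required artifacts that have no link or an empty link."""
    return [name for name in REQUIRED_ARTIFACTS if not links.get(name)]
```

An empty result means every required artifact has a link. It says nothing about whether the linked content is any good, which still needs a reviewer.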

13. Pathfinder Verdict

Each page ends with one direct judgement:

This framework is or is not a good fit for governed AI-native SDLC automation because…

Frameworks In Scope

The first pass focuses on frameworks that are relevant to agentic SDLC work:

  • LangGraph
  • OpenAI Agents SDK
  • Microsoft Agent Framework
  • Google ADK
  • CrewAI
  • Pydantic AI
  • LlamaIndex
  • Strands Agents
  • OpenHands Software Agent SDK

Additional systems such as Devin, Claude Code, GitHub Copilot coding agent, Rovo Dev, GitLab Duo with Amazon Q, Factory Droid, and Amazon Q Developer are treated as packaged SDLC agents or workflow products rather than general-purpose MAS frameworks.

Evaluation Rubric

The lab compares each framework across criteria that matter for governed software delivery:

| Criterion | Why it matters |
| --- | --- |
| Orchestration model | Whether the workflow can represent long-running SDLC stages, retries, branching, and state. |
| Multi-agent support | Whether specialist agents can collaborate without becoming chat theatre. |
| Structured outputs | Whether specs, plans, risks, and evidence can be validated rather than parsed loosely. |
| Human-in-the-loop control | Whether approval, escalation, and review points are first-class design concerns. |
| MCP and tool integration | Whether the framework can consume real project context and expose useful tools safely. |
| Observability and tracing | Whether agent decisions and tool calls can be inspected after the fact. |
| Evaluation support | Whether the workflow can be tested for spec conformance, policy compliance, and regression risk. |
| Deployment model | Whether the framework has a credible path from experiment to controlled production use. |
| Governed SDLC fit | Whether it supports requirements, code, tests, governance, evidence, approval, and release readiness. |
| Software engineering performance | Whether the framework implementation can complete realistic software tasks with acceptable quality, cost, and rework. |
| Runner reliability | Whether runs can be supervised, resumed, retried, excluded, and audited without corrupting benchmark results. |
| Model API boundary | Whether model/provider behaviour can be normalised and separated from framework behaviour. |

Reference Tasks

The lab uses two task tracks.

Track 1: Synthetic Explanation Task

The synthetic task is used to explain the methodology clearly and keep public examples easy to understand:

SterlingPay wants to add round-ups into savings pots. The feature must support card transaction events, configurable rounding rules, customer opt-in, auditability, failure handling, and release controls.

This is useful for early workflow design, diagrams, screenshots, and blog explanations.
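
To show what a "configurable rounding rule" in the synthetic task might mean, here is a minimal sketch that computes the round-up for a card transaction. The function name, the default increment of 1.00, and the choice that an exact multiple produces a zero round-up are all assumptions; the task statement does not fix any of them.

```python
from decimal import Decimal, ROUND_CEILING


def round_up_amount(amount: Decimal, increment: Decimal = Decimal("1.00")) -> Decimal:
    """Difference between the amount and the next multiple of increment."""
    if increment <= 0:
        raise ValueError("increment must be positive")
    # Number of whole increments needed to cover the amount.
    steps = (amount / increment).to_integral_value(rounding=ROUND_CEILING)
    return steps * increment - amount
```

For example, `round_up_amount(Decimal("3.40"))` gives a round-up of 0.60, and passing `Decimal("0.50")` as the increment gives 0.10. `Decimal` is used instead of `float` so currency amounts are not subject to binary rounding error.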

Track 2: Realistic Software Benchmark

The serious benchmark track uses the Flowglad / Series B Fintech profile from the devos benchmark work:

  • TypeScript and Next.js payments platform.
  • tRPC backend and PostgreSQL database with Drizzle ORM.
  • GitHub Actions and Vercel-style delivery environment.
  • Eight-story sprint profile covering bug fixes, backend changes, UI work, pricing model logic, and integration work.
  • Reference implementation baseline derived from real merged PRs.

This track is used for software engineering performance scoring, EvoScore-style comparison, and failure-mode analysis.

For each framework, the lab will ask:

  • Can it turn this request into a structured spec?
  • Can it identify missing requirements and risks?
  • Can it retrieve or accept relevant policy and architecture context?
  • Can it coordinate specialist agents for QA, architecture, security, and implementation planning?
  • Can it produce an evidence pack suitable for human review?
  • Can the output be evaluated repeatably?
  • Can it complete realistic software tasks with clean tests, lint, typecheck, and reviewable changes?

System Design

flowchart LR
  A["Reference task"] --> B["Framework adapter"]
  B --> C["Spec gate"]
  B --> D["Agent run"]
  B --> E["Human gate"]
  B --> F["Model boundary"]
  C --> G["Evidence bundle"]
  D --> G
  E --> G
  F --> G
  G --> H["Runner state and classification"]
  H --> I["Governed SDLC score"]
  H --> J["Software engineering score"]
  I --> K["Framework comparison"]
  J --> K

Adapter Contract

Each framework adapter maps its native concepts into the same benchmark contract:

| Contract Object | Purpose |
| --- | --- |
| SpecInput | The structured requirement, constraints, acceptance criteria, and risk class. |
| AgentRun | The framework-specific run, including agents, tools, messages, state, and outputs. |
| GateDecision | Human approval, rejection, escalation, or automated gate result. |
| EvidenceBundle | Logs, traces, specs, tests, review notes, tool calls, and benchmark artifacts. |
| EvalResult | Governed SDLC score, software engineering score, failure class, and run metadata. |
| RunState | The lifecycle state used to decide whether the run is scorable, retryable, excluded, or failed. |
| ModelClient | A thin boundary for invoking models, normalising tool calls, recording usage, and classifying provider failures. |
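
As a rough illustration of how part of this contract could be expressed in code, the sketch below models three of the objects and a structural interface for adapters. The object names come from the table; every field, the `FrameworkAdapter` protocol, and the `EchoAdapter` stand-in are hypothetical and call no real framework.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class SpecInput:
    requirement: str
    constraints: list[str]
    acceptance_criteria: list[str]
    risk_class: str


@dataclass
class EvidenceBundle:
    logs: list[str] = field(default_factory=list)
    tool_calls: list[dict] = field(default_factory=list)


@dataclass
class AgentRun:
    framework: str
    outputs: dict[str, str]
    evidence: EvidenceBundle


class FrameworkAdapter(Protocol):
    """Assumed interface: each framework maps its run into the same types."""
    name: str

    def run(self, spec: SpecInput) -> AgentRun: ...


class EchoAdapter:
    """Stand-in adapter for the sketch; it invokes no agent framework."""
    name = "echo"

    def run(self, spec: SpecInput) -> AgentRun:
        evidence = EvidenceBundle(logs=[f"received: {spec.requirement}"])
        return AgentRun(framework=self.name,
                        outputs={"spec": spec.requirement},
                        evidence=evidence)
```

Because `FrameworkAdapter` is a `Protocol`, an adapter only needs the right attribute and method shape, so framework-specific classes do not have to inherit from a shared base.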

What This Proves

  • Practical fluency with current agent and MAS frameworks.
  • Ability to evaluate frameworks through engineering operating needs, not hype.
  • Understanding of why SDLC automation requires specs, evidence, governance, and evals.
  • Judgement about where packaged agents, orchestration frameworks, and custom control planes fit together.
  • Ability to connect framework evaluation to production-grade benchmark discipline.

Limitations

This is not a claim that any framework can safely automate the whole SDLC. The lab is deliberately designed to expose limits: weak structured output, poor observability, awkward human approval, fragile tool integration, missing evaluation support, model API mismatches, and benchmark contamination.

The results are also time-sensitive. Agent frameworks are moving quickly, so the lab should be treated as a dated snapshot with repeatable criteria, not a permanent ranking.

Next Iteration

The first implementation pass will compare two or three frameworks against the reference task, then expand the matrix as the workflow and scoring model stabilise.

The first group is:

  • LangGraph for stateful, controlled workflows.
  • OpenAI Agents SDK for lightweight orchestration and tracing.
  • CrewAI for role-based crews, and as a useful comparison point against LangChain and LangGraph.

Multi-agent systems, agent frameworks, AI-native SDLC, LangGraph, OpenAI Agents SDK, CrewAI, MCP, context engineering, structured outputs, evaluation, governance, human-in-the-loop workflows, benchmark design, evidence packs, model API boundaries.