Godfrey Carnegie

Benchmark harness diagram for measuring context quality.

Context Quality Benchmark Harness

May 6, 2026 concept

Context Engineering Evaluation RAG MCP AI-Native SDLC

Overview

The Problem

When software agents fail, the diagnosis is usually too vague to act on: the model was wrong, the prompt was bad, or the task was too hard.

In practice, a large part of agent reliability depends on context quality. Did the agent receive the right policies, prior decisions, codebase facts, constraints, and examples? Did retrieval return relevant material? Did the answer stay faithful to the supplied context? Did performance drift as the knowledge base changed?

Without measurement, teams are guessing.

The Solution

The Context Quality Benchmark Harness is a pattern for evaluating context pipelines used by software delivery agents.

It measures whether context is:

relevant to the task
faithful to source material
complete enough for the requested decision
current relative to the repository and governance rules
usable by the agent in a real SDLC workflow
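
One way to make these five properties concrete is to record them as a scorecard per task. The sketch below is illustrative only: the `ContextScore` class, its field names, and the 0 to 1 scale are assumptions, not part of the harness as described.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ContextScore:
    """Hypothetical scorecard: one value in [0.0, 1.0] per quality dimension."""

    relevance: float     # relevant to the task
    faithfulness: float  # faithful to source material
    completeness: float  # complete enough for the requested decision
    currency: float      # current relative to the repository and governance rules
    usability: float     # usable by the agent in a real SDLC workflow

    def __post_init__(self) -> None:
        # Reject scores outside the assumed [0, 1] scale.
        for f in fields(self):
            value = getattr(self, f.name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{f.name} must be in [0, 1], got {value}")

    def weakest(self) -> str:
        """Return the name of the lowest-scoring dimension."""
        return min(fields(self), key=lambda f: getattr(self, f.name)).name


score = ContextScore(relevance=0.9, faithfulness=0.8, completeness=0.6,
                     currency=0.4, usability=0.7)
print(score.weakest())  # prints "currency"
```

Keeping the dimensions separate, rather than collapsing them into one number, makes it possible to see which property is dragging a task down.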

Benchmark Model

flowchart TD
  A[SDLC task set] --> B[Context retrieval]
  C[Policy and architecture sources] --> B
  D[Repository and delivery metadata] --> B
  B --> E[Agent execution]
  E --> F[Task outcome]
  E --> G[Faithfulness eval]
  E --> H[Relevancy eval]
  E --> I[Drift signal]
  F --> J[Benchmark report]
  G --> J
  H --> J
  I --> J
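
The flow above could be wired together roughly as follows. This is a minimal sketch under stated assumptions: `run_benchmark`, `RunResult`, and the callable signatures are hypothetical, and the retrieval, agent, and evaluator functions are stand-in stubs rather than real integrations.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RunResult:
    task_id: str
    retrieved_ids: list[str]
    answer: str
    scores: dict[str, float]


def run_benchmark(
    tasks: dict[str, str],
    retrieve: Callable[[str], list[str]],
    execute: Callable[[str, list[str]], str],
    evaluators: dict[str, Callable[[str, list[str], str], float]],
) -> list[RunResult]:
    """Run each task through retrieval, agent execution, and every evaluator."""
    results = []
    for task_id, prompt in tasks.items():
        retrieved = retrieve(prompt)          # context retrieval
        answer = execute(prompt, retrieved)   # agent execution
        scores = {name: fn(prompt, retrieved, answer)
                  for name, fn in evaluators.items()}
        results.append(RunResult(task_id, retrieved, answer, scores))
    return results


# Stub components, for illustration only.
def stub_retrieve(prompt: str) -> list[str]:
    return ["policy-1", "adr-7"]


def stub_execute(prompt: str, docs: list[str]) -> str:
    return f"answer based on {len(docs)} documents"


def stub_faithfulness(prompt: str, docs: list[str], answer: str) -> float:
    return 1.0 if docs else 0.0


report = run_benchmark(
    tasks={"t1": "Check whether a release candidate has approval gaps."},
    retrieve=stub_retrieve,
    execute=stub_execute,
    evaluators={"faithfulness": stub_faithfulness},
)
print(report[0].scores)  # prints {'faithfulness': 1.0}
```

Passing the components in as callables keeps the harness independent of any particular retriever, model, or evaluation library.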

Example Task Types

Review a pull request against engineering standards.
Identify missing test evidence for a change.
Summarise architectural constraints for a planned feature.
Check whether a release candidate has approval gaps.
Retrieve the relevant governance rule for an exception request.
Compare agent recommendations against source policy text.
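
To score retrieval against tasks like these, each task needs a known set of sources that a correct answer should draw on. A possible representation is sketched below; the `BenchmarkTask` class, the task type labels, and the source identifiers are all invented for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkTask:
    task_id: str
    task_type: str
    prompt: str
    # Gold labels: sources a correct answer is expected to rely on (hypothetical ids).
    expected_source_ids: frozenset[str]


TASKS = [
    BenchmarkTask("t1", "pr_review",
                  "Review a pull request against engineering standards.",
                  frozenset({"standards/code-review"})),
    BenchmarkTask("t2", "test_evidence",
                  "Identify missing test evidence for a change.",
                  frozenset({"policy/test-evidence"})),
    BenchmarkTask("t3", "pr_review",
                  "Compare agent recommendations against source policy text.",
                  frozenset({"standards/code-review", "policy/exceptions"})),
]


def group_by_type(tasks: list[BenchmarkTask]) -> dict[str, list[str]]:
    """Group task ids by task type so results can be reported per class."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for task in tasks:
        grouped[task.task_type].append(task.task_id)
    return dict(grouped)


print(group_by_type(TASKS))
# prints {'pr_review': ['t1', 't3'], 'test_evidence': ['t2']}
```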

Evaluation

The benchmark should report:

retrieval precision and recall for known tasks
faithfulness against source context
relevancy of retrieved documents
task success rate
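unsupported claims (statements in the agent output with no backing in the supplied context)
stale context usage (retrieved material that no longer matches the current source)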
policy coverage gaps
recurring failure modes
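
Two of these measures are simple enough to sketch directly. The example below computes retrieval precision and recall against a known relevant set, and flags stale context by comparing the revision that was indexed with the revision currently in the source. The function names and the revision-map shape are assumptions made for illustration.

```python
def precision_recall(retrieved: list[str], relevant: set[str]) -> tuple[float, float]:
    """Precision and recall of retrieved document ids against a gold relevant set."""
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & relevant)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def stale_documents(
    retrieved: list[str],
    indexed_revision: dict[str, str],
    current_revision: dict[str, str],
) -> list[str]:
    """Return retrieved ids whose indexed revision differs from the current one."""
    return [doc_id for doc_id in retrieved
            if indexed_revision.get(doc_id) != current_revision.get(doc_id)]


retrieved = ["a", "b", "c", "d"]
relevant = {"a", "b", "e"}
precision, recall = precision_recall(retrieved, relevant)
print(precision)         # prints 0.5 (2 of 4 retrieved are relevant)
print(round(recall, 3))  # prints 0.667 (2 of 3 relevant were retrieved)

stale = stale_documents(
    retrieved,
    indexed_revision={"a": "r1", "b": "r1", "c": "r2", "d": "r1"},
    current_revision={"a": "r1", "b": "r3", "c": "r2", "d": "r1"},
)
print(stale)  # prints ['b']
```

Faithfulness and unsupported-claim detection usually need a judge model or human review, so they are left out of this sketch.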

Leadership Angle

This gives engineering leaders a way to make AI governance empirical.

Instead of debating whether agents are “good enough”, the team can ask:

Which task classes are reliable?
Which context sources are weak?
Which policies are not retrievable?
Which failures are model issues and which are context issues?
What changed between benchmark runs?
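
The last question can be answered mechanically once each run stores a success rate per task class. The sketch below assumes a hypothetical report shape (task class mapped to success rate) and an arbitrary 0.05 tolerance; both would need tuning for a real harness.

```python
def compare_runs(
    previous: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.05,
) -> dict[str, float]:
    """Return the change in success rate for task classes that dropped by more than tolerance."""
    regressions = {}
    for task_class, prev_rate in previous.items():
        cur_rate = current.get(task_class)
        # Skip classes missing from the current run; only report meaningful drops.
        if cur_rate is not None and prev_rate - cur_rate > tolerance:
            regressions[task_class] = round(cur_rate - prev_rate, 4)
    return regressions


previous_run = {"pr_review": 0.90, "release_gaps": 0.80}
current_run = {"pr_review": 0.88, "release_gaps": 0.60}
print(compare_runs(previous_run, current_run))  # prints {'release_gaps': -0.2}
```

A drop that coincides with a policy or repository change points at context; a drop that coincides with a model upgrade points at the model.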

What This Proves

Context engineering needs evaluation, not vibes.
RAG quality should be measured against real SDLC tasks.
Agent governance improves when leaders can see drift, gaps, and failure modes.

What I Would Do Differently

The next iteration would add longitudinal reporting so context quality can be tracked across policy changes, repository changes, and model upgrades.

Context engineering, RAG, pgvector-style retrieval, MCP, AI evaluation, faithfulness, relevancy, SDLC automation, engineering governance.