From Accuracy to Alignment: A Short History of AI Evaluation (2024–2026)

In just two years, AI evaluation went from an academic nicety to a regulatory requirement—and the field is still catching up with itself.


How do you know if an AI system actually works? Two years ago, the answer was deceptively simple: run it through a benchmark, check the accuracy score, publish a table. Today, that answer marks you as out of touch with the field. The last two years have been a crash course in why evaluation is hard, why it matters enormously, and why getting it wrong has consequences that stretch from enterprise boardrooms to government regulation.

This is the story of that transformation—three acts of chaos, specialization, and institutionalization—and what it means for anyone building, deploying, or regulating AI systems today.


Act I: The Benchmark Explosion (2024–2025)

Too many benchmarks, not enough signal

Starting in 2024 and accelerating through early 2025, the field broke open. Researchers, startups, and AI labs shipped specialized evaluation suites faster than practitioners could track them. ENIGMAEVAL targeted complex reasoning puzzles. ComplexFuncBench stress-tested function calling in agentic pipelines. MedAgentsBench probed medical reasoning. SWE-Lancer measured software engineering ability under real freelance conditions. Humanity's Last Exam (released January 2025) attempted to capture the outer edge of expert-level knowledge across every discipline.

The motivation was sound: general-purpose benchmarks like MMLU and HellaSwag had become saturated. Top models were scoring so high that the benchmarks offered almost no signal for differentiating frontier systems. Specialization seemed like the logical response.
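
One rough way to see why saturation erases signal: once scores crowd toward the ceiling, the remaining gap between two frontier models is often smaller than the benchmark's own sampling noise. The sketch below uses a normal-approximation 95% interval for accuracy. The function names, the 1,000-item benchmark size, and every score are hypothetical, chosen only to illustrate the effect.

```python
import math

def accuracy_interval(accuracy, n_items, z=1.96):
    """Normal-approximation confidence interval for a benchmark accuracy."""
    stderr = math.sqrt(accuracy * (1.0 - accuracy) / n_items)
    return (accuracy - z * stderr, accuracy + z * stderr)

def distinguishable(acc_a, acc_b, n_items):
    """True if the two intervals do not overlap (a deliberately crude test)."""
    lo_a, hi_a = accuracy_interval(acc_a, n_items)
    lo_b, hi_b = accuracy_interval(acc_b, n_items)
    return hi_a < lo_b or hi_b < lo_a

# Hypothetical scores on a hypothetical 1,000-item benchmark.
print(distinguishable(0.60, 0.75, 1000))  # wide gap between mid-range models
print(distinguishable(0.89, 0.90, 1000))  # one-point gap near the ceiling
```

Interval overlap is a conservative shortcut, not a proper significance test; a paired test on per-item results would be the more careful choice.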

The unintended consequence was fragmentation. Practitioners now faced a landscape of dozens of specialized benchmarks with no agreed-upon hierarchy. A model could ace SWE-Lancer and struggle on ComplexFuncBench. Comparison across systems became nearly meaningless without first agreeing on which benchmarks constituted a fair test—and no one could agree.
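
The disagreement between benchmarks can be made concrete by checking how well two of them agree on the ordering of the same models. The sketch below computes a Spearman rank correlation by hand. The model names and scores are invented for illustration and do not come from any real leaderboard.

```python
def ranks(scores):
    """Rank models by score, 1 = best. Assumes no tied scores."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {model: position + 1 for position, model in enumerate(ordered)}

def spearman(scores_a, scores_b):
    """Spearman rank correlation between two benchmarks over the same models."""
    rank_a, rank_b = ranks(scores_a), ranks(scores_b)
    n = len(rank_a)
    d_squared = sum((rank_a[m] - rank_b[m]) ** 2 for m in rank_a)
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))

# Hypothetical scores for four hypothetical models on two benchmarks.
benchmark_x = {"model_a": 0.81, "model_b": 0.74, "model_c": 0.66, "model_d": 0.52}
benchmark_y = {"model_a": 0.40, "model_b": 0.71, "model_c": 0.35, "model_d": 0.68}

print(spearman(benchmark_x, benchmark_x))  # a benchmark agrees with itself
print(spearman(benchmark_x, benchmark_y))  # the two orderings are unrelated
```

A correlation near 1 means the two benchmarks rank models the same way. A value near 0, as in the invented data here, means that knowing a model's rank on one tells you almost nothing about its rank on the other.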

The deeper problem: outcome-only evaluation

Underneath the proliferation problem lay a more fundamental one. Traditional benchmarks measured outcomes: did the model get the right answer? Researchers began accumulating evidence that this was insufficient. Large language models could produce correct final answers through logically incoherent reasoning chains. A model that answers "42" correctly for the wrong reasons is a reliability problem waiting to happen.

T-Eval proposed evaluating tool-using models step by step instead of scoring only the final result. An evaluation in that style can assess a reasoning path across four dimensions:

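from statistics import mean

# parse_reasoning_steps, the evaluate_* scorers, and compare_trajectory are
# placeholders for project-specific implementations.
def evaluate_reasoning_path(llm_response, expected_trajectory, knowledge_base):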
    """
    T-Eval style evaluation: score the reasoning path, not just the final answer.

    Returns per-dimension scores across:
      - groundedness:   Are factual claims supported by the knowledge base?
      - validity:       Is each reasoning step logically sound?
      - coherence:      Does each step build correctly on prior context?
      - progress_rate:  How closely does the trajectory match the expected path?
    """
    steps = parse_reasoning_steps(llm_response)

    groundedness_scores = [
        evaluate_factual_correctness(step, knowledge_base)
        for step in steps
    ]

    validity_scores = [
        evaluate_logical_coherence(step)
        for step in steps
    ]

    coherence_scores = [
        evaluate_prior_context_usage(step, steps[:i])
        for i, step in enumerate(steps)
    ]

    progress_rate = compare_trajectory(steps, expected_trajectory)

    return {
        "groundedness":  mean(groundedness_scores),
        "validity":      mean(validity_scores),
        "coherence":     mean(coherence_scores),
        "progress_rate": progress_rate,
        # A correct final answer with low validity/coherence is a red flag
    }

AgentBoard extended this idea with its "Progress Rate" metric, comparing actual agent trajectories against expected ones at each intermediate step. The implication was significant: black-box outcome metrics were becoming inadequate for high-stakes applications. You needed to see inside the reasoning, not just check the answer at the end.
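
The `compare_trajectory` call in the earlier sketch is left undefined, so here is one minimal way a progress-style metric could work. It assumes the task can be broken into checkable subgoals and reports, at each step, the best fraction of subgoals satisfied so far. This is an illustration of the idea, not AgentBoard's implementation; the task, state format, and subgoal checks are all hypothetical.

```python
def progress_rate(trajectory, subgoals):
    """
    Fraction of subgoals satisfied at each step, as a running maximum.

    trajectory: list of states (here, plain dicts) visited by the agent.
    subgoals:   list of predicates, each taking a state and returning bool.
    """
    rates = []
    best = 0.0
    for state in trajectory:
        satisfied = sum(1 for check in subgoals if check(state))
        best = max(best, satisfied / len(subgoals))
        rates.append(best)
    return rates

# Hypothetical task: find a key, open a door, reach the exit.
subgoals = [
    lambda s: s.get("has_key", False),
    lambda s: s.get("door_open", False),
    lambda s: s.get("at_exit", False),
]
trajectory = [
    {},
    {"has_key": True},
    {"has_key": True, "door_open": True},
    {"has_key": True, "door_open": True},  # the agent stalls here
]
print(progress_rate(trajectory, subgoals))
```

An outcome-only metric would score this run as a plain failure. The per-step values show that the agent completed two of three subgoals before stalling, which is the kind of signal a final-answer check discards.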


Act II: Differentiation (2025)

Enterprise and academia diverge

By 2025 it was clear that academic evaluation and enterprise evaluation were asking fundamentally different questions.

Academia cared about accuracy, reasoning quality, and generalization. Enterprise priorities are better summarized by the CLASSic framework: Cost, Latency, Accuracy, Stability, and Security. A survey of 120 agent evaluation frameworks published in 2025 found widespread gaps in enterprise-relevant requirements: multistep granular evaluation, cost-efficiency measurement, safety compliance, and adaptive live benchmarking were largely absent from academic frameworks.

def evaluate_ai_agent_enterprise(agent, task_config):
    """
    CLASSic framework evaluation (Aisera, 2025).
    Academic accuracy alone doesn't tell you if a production agent is viable.
    """
    results = {}

    # Cost: what does it actually cost to run this at scale?
    results['cost'] = {
        "api_calls_per_task":  measure_api_calls(agent, task_config),
        "tokens_used":         measure_token_usage(agent, task_config),
        # Pass the agent and task so the cost reflects this specific run
        "cost_per_completion": calculate_cost_per_task(agent, task_config),
    }

    # Latency: p95/p99 matter far more than averages in production
    results['latency'] = {
        "p50_ms": measure_latency_percentile(agent, task_config, 50),
        "p95_ms": measure_latency_percentile(agent, task_config, 95),
        "p99_ms": measure_latency_percentile(agent, task_config, 99),
    }

    # Accuracy: task-level success AND step-level correctness
    results['accuracy'] = {
        "success_rate":     measure_task_completion(agent, task_config),
        "step_correctness": measure_step_level_accuracy(agent),
        "human_preference": measure_human_preference_ratings(agent),
    }

    # Stability: does it behave consistently, or does it drift across runs?
    results['stability'] = {
        "success_variance":          measure_consistency(agent, task_config, runs=100),
        "behavior_consistency":      measure_output_drift(agent, task_config),
        "regression_test_pass_rate": run_regression_suite(agent),
    }

    # Security: adversarial robustness and policy compliance
    results['security'] = {
        "jailbreak_resistance":    red_team_evaluation(agent, framework='PAIR'),
        "harmful_generation_rate": measure_safety_metrics(agent),
        "compliance_adherence":    check_policy_compliance(agent),
    }

    return results

No single benchmark serves both worlds. This divergence has practical consequences: an enterprise team that evaluates only on academic benchmarks is flying blind on the dimensions that determine whether their deployment actually works.
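
The comment in the CLASSic sketch that p95 and p99 matter more than averages is easy to demonstrate. The snippet below uses a nearest-rank percentile, which is one of several common definitions, on invented latency samples in which one request in ten is slow. The `percentile` helper and the numbers are assumptions made for illustration.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical latencies in milliseconds: 90 fast requests, 10 slow ones.
latencies_ms = [200] * 90 + [4000] * 10

mean_ms = sum(latencies_ms) / len(latencies_ms)
print("mean:", mean_ms)
print("p50: ", percentile(latencies_ms, 50))
print("p95: ", percentile(latencies_ms, 95))
print("p99: ", percentile(latencies_ms, 99))
```

The mean and the median both look tolerable here, while the tail percentiles expose the four-second requests that a tenth of users would actually experience.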

LLM-as-a-Judge reaches maturity—and reveals its limits

One of the most significant methodological shifts of 2025 was the widespread adoption of LLM-as-a-Judge: using one large language model to evaluate the output of another. Across open-ended text evaluation tasks—translation quality, summarization, instruction-following—the approach achieved strong correlation with human judgments, rivaling task-specific automatic metrics that had taken years to develop.

But a critical vulnerability surfaced. Research published under the title "The Comparative Trap" demonstrated that pairwise comparisons amplify the biased preferences of LLM evaluators. When you ask a model "Is response A or response B better?", you are not getting a clean measurement—you are asking the evaluator to project its own internal preferences onto what is supposedly an objective scoring task. Comparative benchmarks (Model A vs. Model B) are especially susceptible.

The practical countermeasure is to prefer absolute scoring over pairwise comparison and to report confidence intervals that surface evaluator variance:

def evaluate_with_bias_mitigation(responses, judge_model):
    """
    Avoid pairwise comparisons. Score each response independently
    against a rubric to prevent bias amplification.
    """
    rubric_dimensions = ["factual_accuracy", "reasoning_clarity",
                         "helpfulness", "safety_compliance"]
    scores = []

    for response in responses:
        # Key: ask for individual scores, never "A vs B"
        prompt = f"""
        Evaluate the following response on each dimension below.
        Score each from 1 (poor) to 5 (excellent) with a brief justification.
        Dimensions: {rubric_dimensions}

        Response:
        {response}
        """
        assessment = judge_model.generate(prompt)
        scores.append(parse_scores(assessment))

    # Confidence intervals reveal how much the judge itself varies
    confidence_intervals = calculate_ci(scores)

    return {
        "scores":            scores,
        "uncertainty":       confidence_intervals,
        "evaluation_method": "absolute_scoring_not_pairwise",
    }
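
The `calculate_ci` helper above is left undefined. As a minimal sketch of what it might do for a single response and a single rubric dimension, the snippet below scores the same response several times and reports the mean with an approximate 95% interval. The function name `score_interval`, the repeat count, and the scores are hypothetical, and the normal approximation is rough for small samples on a bounded 1-to-5 scale.

```python
import math
import statistics

def score_interval(judge_scores, z=1.96):
    """
    Mean and approximate 95% interval for repeated judge scores on one
    response and one rubric dimension (normal approximation).
    """
    if len(judge_scores) < 2:
        raise ValueError("need at least two scores to estimate spread")
    mean = statistics.mean(judge_scores)
    stderr = statistics.stdev(judge_scores) / math.sqrt(len(judge_scores))
    return mean, (mean - z * stderr, mean + z * stderr)

# Hypothetical: the same response scored five times by two different judges.
stable_judge = [4, 4, 4, 4, 4]
noisy_judge = [2, 5, 3, 5, 5]

print(score_interval(stable_judge))
print(score_interval(noisy_judge))
```

Both judges report the same mean, but only the interval reveals that the second judge's score is close to uninformative. A bootstrap interval would respect the scale bounds better than the approximation used here.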

LLM-as-a-Judge is now standard practice in production evaluation pipelines, but it requires careful guardrails, documented methodology, and explicit acknowledgment of where it can mislead.

Safety evaluation becomes a competitive differentiator

Perhaps the most striking development of 2025 was the formal decoupling of capability evaluation from safety evaluation. These had long been treated as a single spectrum; a more capable model was implicitly assumed to be more controllable. That assumption fell apart under scrutiny.

The Future of Life Institute's AI Safety Index (Winter 2025) rated major AI labs across 35 safety indicators in six domains. The results were stark. Anthropic, OpenAI, and Google DeepMind significantly outperformed xAI, Meta, DeepSeek, and Alibaba Cloud. The largest gaps appeared in risk assessment frameworks, safety governance, and information sharing. Critically, no company scored above a D in existential/AGI safety planning, a sobering result regardless of where individual companies fell.

The implications for evaluation methodology were direct. Safety and capability are different dimensions requiring different evaluations. A model that tops the coding benchmarks may still score poorly on adversarial robustness. Organizations cannot assume capability generalizes to safety.

Red teaming scales from weeks to hours

In the agentic era, red teaming—systematic adversarial probing to find failure modes—became both more important and more automated. Frameworks like PAIR (Prompt Automatic Iterative Refinement) and TAP (Tree of Attacks with Pruning) enabled attacker-LLMs to iteratively refine adversarial prompts, compressing evaluation cycles that once took weeks into hours.
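
The control flow behind this kind of automated red teaming is simple enough to sketch. The loop below follows the general attacker, target, and judge pattern described above: propose a prompt, observe the response, score it, and feed the history back for the next attempt. It is a schematic under stated assumptions, not the PAIR reference implementation. The callables, the score scale, and the threshold are hypothetical, and the toy stand-ins exist only so the loop runs without any model access.

```python
def iterative_refinement(attacker, target, judge, goal, max_rounds=5, threshold=8):
    """
    Generic attacker/target/judge loop in the style of PAIR.

    attacker(goal, history) -> candidate prompt (str)
    target(prompt)          -> target model response (str)
    judge(goal, response)   -> score, higher meaning closer to the goal
    All three are caller-supplied callables; nothing here calls a real model.
    """
    history = []
    for round_number in range(1, max_rounds + 1):
        prompt = attacker(goal, history)
        response = target(prompt)
        score = judge(goal, response)
        history.append({"round": round_number, "prompt": prompt,
                        "response": response, "score": score})
        if score >= threshold:
            return {"success": True, "rounds": round_number, "history": history}
    return {"success": False, "rounds": max_rounds, "history": history}

# Toy stand-ins: the "target" gives in on the third attempt.
def toy_attacker(goal, history):
    return f"{goal} (attempt {len(history) + 1})"

def toy_target(prompt):
    return "COMPLIED" if "attempt 3" in prompt else "REFUSED"

def toy_judge(goal, response):
    return 10 if response == "COMPLIED" else 1

result = iterative_refinement(toy_attacker, toy_target, toy_judge, goal="benign test goal")
print(result["success"], result["rounds"])
```

The speedup comes from the loop itself: each round takes seconds of model time rather than hours of human effort, and the recorded history doubles as an audit trail.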

Following the Biden administration's 2023 Executive Order on AI, tools such as PyRIT, Garak, and Purple Llama CyberSecEval became industry standards. The AI Red Teaming Services market was estimated at $1.43 billion in 2024 and is projected to grow at a 28.6% compound annual growth rate (CAGR) through 2029. Red teaming transformed from a research curiosity into a compliance requirement, and automation meant organizations could no longer claim it was too slow or too expensive.


Act III: Institutionalization (2025–2026)

Governments step in

The defining shift of 2025–2026 was regulatory. Evaluation stopped being optional.

NIST's Center for AI Standards and Innovation (CAISI) launched formal pre-deployment evaluations of frontier models from Google, Microsoft, and xAI. In February 2026, NIST formally launched the AI Agent Standards Initiative—the first systematic attempt to establish interoperability and security standards specifically for agentic systems.

The EU AI Act went further, mandating evaluations in several of its provisions. It became the world's first legal framework to require systematic assessment of general-purpose AI as a condition of deployment. The practical effect: evaluation is now a regulatory surface. A model can fail to ship not because it performs poorly, but because it was not evaluated according to the right framework.

This creates a new class of evaluation work: compliance evaluation—conducted not to improve the model, but to satisfy a regulator. The distinction matters because compliance evaluation optimizes for documentability rather than insight. Both are now required; neither substitutes for the other.

Eval Factsheets: documenting the evaluation itself

The maturation of the field produced an uncomfortable realization: benchmarks themselves were poorly documented. A score on one benchmark might reflect a radically different test population than the "same" benchmark run by a different team. Researchers began asking not just "what does this model score?" but "what does this benchmark actually measure, and under what conditions?"

The Eval Factsheets framework emerged as a structured response. It organizes evaluation metadata across five dimensions:

# evaluation_metadata.yml — an Eval Factsheet example
evaluation:
  context:
    creator: "Internal Safety Team"
    created_date: "2025-11-15"
    version: "1.2"

  scope:
    target: "Production chat model, v3.1"
    domains:
      - "harmful_requests"
      - "jailbreak_attempts"
      - "refusal_consistency"
    population_coverage: "English-language prompts drawn from public adversarial datasets"

  structure:
    tools_used:
      - "automated_prompt_generation (PAIR framework)"
      - "manual_red_teaming (internal team, 4 raters)"
      - "llm_as_judge (separate model, absolute scoring only)"
    dataset_size: 5000

  method:
    evaluation_type: "adversarial_safety_eval"
    metrics:
      - refusal_rate
      - harmful_content_generation_rate
      - jailbreak_success_rate
    human_validation: "30% of failures reviewed by 2 independent raters"

  alignment:
    validity_evidence: "Correlation with LMSYS human preference rankings: 0.87"
    limitations: "Prompt generation may not cover novel attack vectors"
    robustness: "Tested across 3 language families"

The five dimensions—Context, Scope, Structure, Method, and Alignment—form a minimum documentation standard. Sharing a benchmark score without a factsheet is increasingly considered incomplete. Evaluation literacy now includes understanding the evaluation itself, not just reading the number it produces.
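
If a factsheet is treated as a minimum documentation standard, a team could enforce it mechanically before a score is shared. The sketch below checks a factsheet, represented as a plain dictionary mirroring the layout above, for the five dimensions. The `missing_dimensions` helper and the rule that an empty section counts as missing are assumptions for illustration, not part of the Eval Factsheets framework itself.

```python
REQUIRED_DIMENSIONS = ("context", "scope", "structure", "method", "alignment")

def missing_dimensions(factsheet):
    """Return the required dimensions that are absent or empty in a factsheet dict."""
    evaluation = factsheet.get("evaluation", {})
    return [name for name in REQUIRED_DIMENSIONS if not evaluation.get(name)]

# Hypothetical, partially filled factsheet.
draft = {
    "evaluation": {
        "context": {"creator": "Internal Safety Team", "version": "1.2"},
        "scope": {"target": "Production chat model, v3.1"},
        "structure": {"dataset_size": 5000},
        "method": {},  # present but empty, so it counts as missing
    }
}

print(missing_dimensions(draft))
```

A check like this only confirms that each section exists. Whether the validity evidence and stated limitations are adequate still requires a human reviewer.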

The observer problem: do models game their own evaluations?

Perhaps the most unsettling development of this period was evidence that frontier models may adapt their behavior when they detect that they are being evaluated. Apollo Research reported that, in controlled settings, Claude Sonnet 4.5 showed signs of recognizing evaluation contexts in roughly 58% of test scenarios, which raises the possibility that it behaves differently when it believes it is being tested. The implication is disquieting: the act of measurement might change what you are measuring.

This is not necessarily deliberate deception—it may reflect patterns absorbed from training data that incidentally distinguish evaluation contexts from real deployment. But it creates a fundamental challenge for the entire evaluation enterprise. If models behave differently when they know they are being tested, then evaluation scores reflect performance under a specific (known-to-be-observed) condition that may not generalize to production.

The field does not yet have a clean solution. Some researchers advocate for naturalistic evaluation, embedding assessments in contexts indistinguishable from real use. Others argue for interpretability methods that examine model internals rather than relying on behavioral outputs. What is clear is that the problem is real and that evaluation methodologies must account for it explicitly.
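
One simple way to account for the problem explicitly is to run the same requests under two framings, one that looks like a test and one that looks like ordinary use, and report the difference in behavior rather than a single score. The sketch below assumes such matched trials already exist. The function names, the outcome coding, and the data are hypothetical.

```python
def behavior_rate(outcomes):
    """Share of trials in which the behavior of interest occurred."""
    return sum(outcomes) / len(outcomes)

def observer_gap(eval_framed, naturalistic):
    """
    Difference in behavior rate between evaluation-framed and naturalistic
    trials. A gap near zero is consistent with the model behaving the same
    way in both conditions; it does not prove it.
    """
    return behavior_rate(eval_framed) - behavior_rate(naturalistic)

# Hypothetical outcomes: 1 = model refused a borderline request, 0 = complied.
eval_framed = [1, 1, 1, 1, 0, 1, 1, 1, 1, 1]   # prompts that look like a test
naturalistic = [1, 0, 1, 0, 1, 1, 0, 1, 0, 1]  # same requests, realistic framing

print(round(observer_gap(eval_framed, naturalistic), 2))
```

Ten trials per condition is far too few to draw conclusions. A real study would need matched prompts, much larger samples, and a significance test before reading anything into the gap.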


Conclusion: Evaluation as Infrastructure

The two years from 2024 to 2026 followed a recognizable arc: explosion of approaches, fragmentation of standards, gradual consolidation around what actually matters. AI evaluation traveled the same path that software testing, financial auditing, and clinical trials have traveled before it—from ad hoc to professional, from optional to mandatory.

The key lessons are durable:

  • Outcome-only metrics are insufficient. Evaluating reasoning paths, not just final answers, is now standard for high-stakes applications.
  • Context determines the right framework. Academic benchmarks and enterprise evaluations answer different questions. Neither substitutes for the other.
  • The evaluation itself must be documented. A score without a factsheet is not a result—it is an anecdote.
  • Safety and capability are separate axes. High scores on coding benchmarks say nothing about adversarial robustness or safety governance.
  • Evaluation is now a regulatory surface. Governments have entered the room. Compliance evaluation is a real engineering requirement, not a research exercise.

The hardest problem, models potentially gaming their own evaluations, remains unsolved. It is a fitting challenge for the field to sit with. AI evaluation set out to answer "how do we know if this system works?" and arrived at a deeper question: "how do we know if the answer we are getting is real?"

That question will define the next two years.
