Friday, June 19, 2026

The AA-Briefcase Benchmark: When Frontier AI Meets Real Knowledge Work

On June 18, 2026, Artificial Analysis released the AA-Briefcase benchmark — an evaluation that tests AI models on realistic, multi-week knowledge work projects built from fragmented source files by industry experts. The results are a sobering reality check for anyone building agent-powered workflows for knowledge-intensive domains.

Claude Fable 5, the current frontier leader, achieves a perfect score on just 3% of tasks. On 31 of the 91 tasks, no model scores above 50%. The cost gap between the cheapest option and the best performer is roughly 800x.

This piece breaks down the methodology, the model rankings, the cost-performance landscape, and what it all means for teams deploying AI in knowledge work contexts.

What AA-Briefcase Actually Tests

Most AI benchmarks evaluate narrow, well-scoped tasks: answer a math question, write a function, translate a sentence. The model gets a clean prompt and produces a clean output. AA-Briefcase is built to be the opposite of that.

Each of the 91 tasks simulates a multi-week knowledge work project. The model is dropped into an agentic workspace containing a deliberately messy collection of inputs:

  • Company documents and policy manuals
  • Meeting transcripts spanning weeks
  • Large-scale data exports (spreadsheets, databases, CSVs)
  • Over 25,000 Slack messages and 3,500+ emails

The sources are fragmented, contradictory, and full of realistic ambiguity. A memo from week one might directly conflict with an email from week three. The relevant signal is buried under irrelevant noise. Tasks require models to synthesize across these sources, produce structured deliverables (reports, presentations, analyses), and handle the kind of open-ended direction that real knowledge workers deal with daily.

Tasks span four project scenarios, each designed by industry experts who collectively invested over 4,000 hours in task construction. The scenarios and their tasks remain private to prevent contamination. A public fifth scenario (AA-Briefcase Lite) is available on Hugging Face for researchers who want to study the structure, but it doesn't count toward official results.

How Scoring Works

AA-Briefcase uses a three-axis scoring system:

| Metric | What It Measures |
| --- | --- |
| Rubric Pass Rate | Does the model's deliverable satisfy all criteria for a task? Binary pass/fail per criterion, aggregated. |
| Analytical Quality Elo | Pairwise comparisons between model outputs on reasoning depth, accuracy, and insight quality. |
| Presentation Elo | Pairwise comparisons on formatting, clarity, visual design, and professional polish. |

These three scores are combined into a single AA-Briefcase Elo rating. The rubric score gives a hard measure of task completion, while the two Elo scores capture the qualitative dimensions that matter in professional knowledge work.
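To make the rubric axis concrete, here is a minimal sketch of how binary per-criterion results could be aggregated, and how a "perfect" task differs from a high pass rate. The data layout, the function name `rubric_summary`, and the choice to average per-task pass rates are assumptions for illustration; the benchmark's actual aggregation is not described here.

```python
from typing import Dict, List


def rubric_summary(results: Dict[str, List[bool]]) -> Dict[str, float]:
    """Aggregate binary per-criterion results (hypothetical data layout).

    `results` maps a task id to the pass/fail outcome of each rubric criterion.
    """
    # Share of criteria passed within each task.
    per_task = {task: sum(crit) / len(crit) for task, crit in results.items()}
    # A task counts as "perfect" only if every criterion passed.
    perfect_tasks = sum(1 for crit in results.values() if all(crit))
    return {
        "mean_criterion_pass_rate": sum(per_task.values()) / len(per_task),
        "perfect_task_share": perfect_tasks / len(results),
    }


# Made-up example data, not benchmark results.
example = {
    "task_a": [True, True, True],
    "task_b": [True, False, False, True],
    "task_c": [False, False],
}
summary = rubric_summary(example)
print(summary)
```

In this toy data the average criterion pass rate is 0.5 while only one task in three is perfect, which is why a model can look reasonable on criteria passed and still post very few perfect tasks.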

The Results: Frontier Models Still Can't Do the Job

The headline numbers tell a stark story.

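| Model | AA-Briefcase Elo | Rubric Pass Rate | Median Turns/Task | Cost/Task |
| --- | --- | --- | --- | --- |
| Claude Fable 5 | 1587 | 3% (perfect) | ~63 | ~$31.00 |
| Claude Opus 4.8 (max) | 1356 | | ~52 | |
| GLM-5.2 (max) | | | | |
| GPT-5.5 (xhigh) | | | | |
| MiniMax M3 | 1116 | | ~26 min avg (wall-clock time) | |
| DeepSeek V4 Flash (Max) | | | | ~$0.04 |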

Claude Fable 5 leads the overall leaderboard with an AA-Briefcase Elo of 1587, followed by Claude Opus 4.8 (max) at 1356, GLM-5.2 (max), and GPT-5.5 (xhigh) in fourth. Opus 4.8 ties Fable 5 for the lead on Presentation Elo.
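One way to get a feel for these gaps is the standard Elo expected-score formula. The sketch below assumes the conventional logistic form with a 400-point scale; whether the composite AA-Briefcase Elo follows that convention is an assumption, since the combined rating also folds in the rubric score.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the standard Elo model.

    The 400-point scale is the chess convention, assumed here for illustration.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


# Ratings quoted in the article.
FABLE_5 = 1587
OPUS_4_8 = 1356
MINIMAX_M3 = 1116

p_fable_vs_opus = elo_expected_score(FABLE_5, OPUS_4_8)
p_opus_vs_minimax = elo_expected_score(OPUS_4_8, MINIMAX_M3)

print(f"Fable 5 vs Opus 4.8: {p_fable_vs_opus:.2f}")
print(f"Opus 4.8 vs MiniMax M3: {p_opus_vs_minimax:.2f}")
```

Under those assumptions, the 231-point gap between Fable 5 and Opus 4.8 corresponds to an expected score of roughly 0.79 in a head-to-head comparison, and the 240-point gap between Opus 4.8 and MiniMax M3 to roughly 0.80.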

But "leading" is a relative term. Under the strict rubric scoring scheme that grades whether a model satisfies all criteria correctly per task, Claude Fable 5 — the most capable publicly available model as of June 2026 — posts a perfect score on just 3% of tasks. On 31 of the 91 tasks, no model cleared 50% of the rubric criteria. These are not trick questions or edge cases; they're realistic projects where the information is scattered, contradictory, and voluminous.

The failure modes break down along predictable lines:

  • Weaker models miss relevant files entirely. They don't identify which of the thousands of documents contain the information needed for a given task, so they work with incomplete context from the start.
  • Stronger models find the right files but miss subtle multi-source dependencies. They extract the right information from one document but fail to notice that a later email contradicts or qualifies it. They produce well-written answers built on incomplete synthesis.
  • Turn volume alone doesn't predict quality. Gemini 3.5 Flash averages about 88 turns per task, more than any frontier model, yet scores well below the leaders. More work doesn't mean better work.

The 800x Cost Spread

The cost data is where AA-Briefcase delivers its most practical punch.

Running Claude Fable 5 costs over $31 per task on average, largely because knowledge work requires long contexts and many reasoning turns (a median of ~63 turns per task, with models allowed up to 500). At the other end of the spectrum, DeepSeek V4 Flash (Max) runs at approximately $0.04 per task, a difference of roughly 800x.
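The arithmetic behind the spread is easy to check. The sketch below uses the two per-task figures quoted above, both of which are approximate, and extrapolates to a full 91-task run on the assumption that every task costs the average.

```python
# Approximate per-task costs quoted in the article (USD).
FABLE_COST_PER_TASK = 31.00
DEEPSEEK_COST_PER_TASK = 0.04
NUM_TASKS = 91  # size of the benchmark

# Ratio between the most and least expensive models mentioned.
ratio = FABLE_COST_PER_TASK / DEEPSEEK_COST_PER_TASK

# Cost of one full pass, assuming each task costs the quoted average.
fable_total = FABLE_COST_PER_TASK * NUM_TASKS
deepseek_total = DEEPSEEK_COST_PER_TASK * NUM_TASKS

print(f"cost ratio: {ratio:.0f}x")
print(f"full run, Fable 5: ${fable_total:,.2f}")
print(f"full run, DeepSeek V4 Flash (Max): ${deepseek_total:,.2f}")
```

At exactly $31.00 versus $0.04 the ratio comes to about 775x; because the frontier figure is "over $31", the spread rounds to roughly 800x. A full pass works out to about $2,821 against about $3.64.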

No model in the lowest-cost tier reaches frontier-level AA-Briefcase performance. But for teams whose workloads don't require frontier capability, the smaller models offer a dramatically better cost-to-value ratio. The question is not "which model is best" but "how much capability do you actually need, and what are you willing to pay for incremental gains?"

This is the kind of data that should inform procurement decisions. A team processing routine knowledge work tasks might capture most of the utility at a small fraction of the cost by choosing a mid-tier model with good retrieval-augmented generation (RAG) scaffolding and focused prompt strategies, though that trade-off has to be measured on the team's own tasks rather than read off the leaderboard.

Time and Work Patterns

Frontier models take roughly 20 minutes per task on AA-Briefcase. Claude Opus 4.8 (max) averages ~24 minutes of wall-clock time per task, GLM-5.2 (max) ~19 minutes. MiniMax M3 actually takes longer (~26 minutes on average) but scores 240 Elo points behind Opus — demonstrating that raw runtime is not a proxy for quality.

Tool-use behavior varies significantly by model family:

  • Anthropic models (Claude Fable 5, Opus 4.8) make heavy use of visual inspection, averaging 21 and 12 view-image calls per task respectively. These models lead in Presentation Elo, suggesting that repeatedly inspecting rendered outputs is integral to producing polished deliverables.
  • Google models (Gemini 3.5 Flash, Gemma 4 31B) are compute-heavy, averaging ~60 and ~43 script execution calls per task respectively. They run more code, not better code.
  • MiniMax models follow Anthropic's pattern of frequent visual inspection, while Google's approach tilts toward computation rather than review.

What This Means for Practitioners

AA-Briefcase is not a gotcha exercise. It's a well-constructed evaluation that reveals genuine capability boundaries. Here are the practical takeaways:

No model handles realistic knowledge work autonomously. Not Claude Fable 5, not GPT-5.5, not any of them. If your deployment involves messy, multi-source knowledge worker tasks, you need human-in-the-loop validation, not blind automation.

Cost-performance optimization is not optional. With a roughly 800x spread between the cheapest and best models, the default choice of "use the frontier model" is economically indefensible for most workloads. Profile your tasks, measure required capability thresholds, and select the cheapest model that clears the bar.
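The selection rule in that last sentence can be sketched in a few lines. Everything here is hypothetical: the `Candidate` type, the model names, the costs, and the pass rates are placeholders standing in for measurements on your own task set, not benchmark results.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    name: str             # hypothetical model label
    cost_per_task: float  # USD, measured on your own workload
    pass_rate: float      # share of your own tasks that met your quality bar


def cheapest_clearing_bar(
    candidates: List[Candidate], min_pass_rate: float
) -> Optional[Candidate]:
    """Return the cheapest candidate whose pass rate meets the threshold."""
    eligible = [c for c in candidates if c.pass_rate >= min_pass_rate]
    if not eligible:
        return None  # nothing clears the bar: keep a human in the loop
    return min(eligible, key=lambda c: c.cost_per_task)


# Placeholder numbers for illustration only.
candidates = [
    Candidate("frontier-model", cost_per_task=31.00, pass_rate=0.90),
    Candidate("mid-tier-model", cost_per_task=2.50, pass_rate=0.82),
    Candidate("small-model", cost_per_task=0.04, pass_rate=0.55),
]

choice = cheapest_clearing_bar(candidates, min_pass_rate=0.80)
print(choice.name if choice else "no model clears the bar")
```

The point of the design is that the threshold comes first and cost is only a tiebreaker among models that meet it; raising `min_pass_rate` above every candidate's measured rate returns `None` rather than silently picking the best available model.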

Retrieval and scaffolding matter as much as the model. The failure mode where weak models miss relevant files is addressable with good RAG architecture. The failure mode where strong models miss multi-source contradictions is harder — it requires structured context management, explicit cross-referencing, and iterative verification loops that most current agent frameworks don't provide.
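As a rough illustration of what explicit cross-referencing could look like, the sketch below groups extracted claims by topic and flags any topic where sources disagree, ordered by date so the most recent statement is easy to spot. The `Claim` structure, file names, and values are all hypothetical, and the hard part (extracting claims reliably from documents, emails, and chat logs) is assumed to have happened upstream.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date
from typing import Dict, List


@dataclass(frozen=True)
class Claim:
    topic: str   # what the claim is about, e.g. "launch_date"
    value: str   # what the source says
    source: str  # hypothetical file name
    dated: date  # when the source was written


def find_conflicts(claims: List[Claim]) -> Dict[str, List[Claim]]:
    """Return topics with more than one distinct value, oldest claim first."""
    by_topic: Dict[str, List[Claim]] = defaultdict(list)
    for claim in claims:
        by_topic[claim.topic].append(claim)

    conflicts: Dict[str, List[Claim]] = {}
    for topic, group in by_topic.items():
        if len({c.value for c in group}) > 1:
            conflicts[topic] = sorted(group, key=lambda c: c.dated)
    return conflicts


# Made-up claims mirroring the "week one memo vs. week three email" case.
claims = [
    Claim("launch_date", "2026-03-01", "week1_memo.docx", date(2026, 1, 5)),
    Claim("launch_date", "2026-04-15", "week3_email.eml", date(2026, 1, 19)),
    Claim("budget", "250k", "kickoff_transcript.txt", date(2026, 1, 6)),
]

conflicts = find_conflicts(claims)
for topic, group in conflicts.items():
    latest = group[-1]
    print(f"{topic}: {len(group)} conflicting sources, latest is {latest.source}")
```

Surfacing the conflict is only the first step; whether the later source actually supersedes the earlier one is a judgment the agent, or a human reviewer, still has to make.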

Turn count is not a quality signal. Models that take more steps don't necessarily produce better results. When evaluating agent frameworks, measure output quality against task complexity, not step count.

The Bottom Line

AA-Briefcase is the most grounded evaluation of AI knowledge-work capability to date, and the reason is the same one that makes its results uncomfortable: it tests models on the kind of work knowledge professionals actually do, not on carefully curated benchmark items. The finding that the leading frontier model achieves a perfect score on only 3% of realistic tasks should give any team pause before marketing an agent as capable of replacing knowledge workers.

The benchmark's open-source release of AA-Briefcase Lite on Hugging Face and its implementation in the Stirrup agent framework mean that teams can run their own evaluations. If you're building agent-powered tools for knowledge-intensive domains, you should.

The gap between benchmark hype and real-world capability just got measurable. Now the question is what we build to close it.