Back to blog

Wednesday, June 17, 2026

GLM-5.2 — The New Leading Open Weights Model Is Built for Long-Horizon Agentic Tasks

cover

The Headline

On June 13, Z.ai (formerly Zhipu AI) released GLM-5.2 — a 753-billion-parameter Mixture-of-Experts model that immediately became the highest-scoring open-weights model on the Artificial Analysis Intelligence Index, with a score of 51 (v4.1). It sits on the Pareto frontier of Intelligence vs Cost per Task, meaning no other model delivers more capability per dollar at this intelligence level.

The model is already topping LMSys frontend coding leaderboards, beating GPT-5.5 on multiple long-horizon coding benchmarks for roughly 1/6th the cost, and shipping with a usable 1-million-token context window under an MIT open-weights license.

This isn't just a GLM-5.1 refresh with bigger numbers. The architecture changes — IndexShare for sparse attention, improved Multi-Token Prediction, agentic RL training — are specifically designed for long-horizon agentic tasks: the kind where an agent iterates on code, makes dozens of tool calls, and maintains coherence across 100K+ token contexts. If you're building coding agents, this is the open model to watch.

Where GLM-5.2 Lands on the Intelligence Index

The Artificial Analysis Intelligence Index aggregates nine standardized benchmarks covering reasoning, mathematics, coding, and knowledge. GLM-5.2 (max mode) scores 51, a leap of 11 points over GLM-5.1 (40) and comfortably ahead of every other open-weights model:

ModelAA Intelligence Index Score
GPT-5.5 (xhigh)67
Claude Opus 4.763
Gemini 3.1 Pro (reasoning)60
GLM-5.2 (max)51
MiniMax-M344
DeepSeek V4 Pro (max)44
Kimi K2.643
GLM-5.140

GLM-5.2 leads the open-weights pack by a seven-point margin over the nearest competitors (MiniMax-M3 and DeepSeek V4 Pro at 44). The jump from 40 to 51 is the largest single-generation improvement on the Index among open models. It places 4th overall, behind only the proprietary frontier models.

On the cost side, GLM-5.2 is on the Pareto frontier: at ~$0.46 per task, it delivers the best intelligence-to-cost ratio among models at its capability level. For context, MiniMax-M3 costs ~$0.18 per task and DeepSeek V4 Pro max ~$0.05, but both score seven points lower.

Third-Party Coding Benchmarks

Beyond the Artificial Analysis composite, independent results are coming in:

  • LMSys Frontend Coding: GLM-5.2 ranks #1 among all models (including closed ones) on the LMSys frontend coding arena.
  • Long-horizon coding tasks: Multiple reports show GLM-5.2 (max mode) matching or exceeding GPT-5.5 on tasks requiring 50-100+ iterative tool calls with repository-scale context.
  • Coded Arena and Agent Arena: Strong showings that place GLM-5.2 in the top tier of agent-capable models.

Third-party SWE-bench verified, LiveCodeBench, and HumanEval runs are expected within 1-2 weeks of the weight release, which happened on June 17.

The "Long-Horizon Task" Architecture — What Makes GLM-5.2 Different from GLM-4 and GLM-5.1

Most model releases focus on raw benchmark scores. GLM-5.2's story is different: the architecture is purpose-built for sustained autonomous work — the kind where an agent reads an entire repository, plans a change, writes tests, implements code, and iterates based on test failures, all within a single context window.

IndexShare: Sparse Attention at 1M Context

The most significant architectural innovation is IndexShare, Z.ai's approach to sparse attention at scale. GLM-5.2 uses Dynamic Sparse Attention (DSA), where a lightweight indexer predicts which tokens each attention head should attend to, avoiding the quadratic cost of full attention. The problem: at 1M-token context, even the indexer becomes expensive.

IndexShare solves this by reusing the same lightweight indexer across every four transformer layers, rather than maintaining one per layer. At maximum 1M-token context length, this single change reduces per-token FLOPs by 2.9×. Combined with KVShare for efficient KV-cache management, this is what makes a usable 1M-token context window practical for real inference workloads.

Improved Multi-Token Prediction (MTP) Speculative Decoding

GLM-5.2 extends its MTP-based speculative decoding from 3 draft tokens (GLM-5/5.1) to 5 draft tokens. The acceptance length — the average number of draft tokens the model agrees with — increases by roughly 20% compared to GLM-5.1. This directly translates to higher end-to-end throughput on reasoning and coding tasks, where the bottleneck is often generation speed rather than prompt processing.

Agentic RL with Anti-Hacking

The post-training recipe for GLM-5.2 involves reinforcement learning specifically designed for long-horizon agentic tasks — code editing, tool use, multi-step problem solving — with what Z.ai calls "anti-hacking" to prevent reward hacking during RL training. This is a notable departure from the standard instruction-tuning + RLHF pipeline. The model is explicitly trained to maintain coherence over long trajectories, not just produce good single-turn responses.

Two Thinking Effort Modes

GLM-5.2 ships with two reasoning modes:

  • High: Fast generation for everyday tasks, comparable to GLM-5.1's standard mode.
  • Max: Deep reasoning with extended chain-of-thought, optimized for complex coding, multi-step tool use, and agentic workflows. This is where the model's 51 Intelligence Index score was measured.

The Max mode is non-negotiable for long-horizon agent work. Running GLM-5.2 in High mode will be faster and cheaper, but for the tasks the model was designed for, you want Max.

Practical Implications for Developers Building Agentic Workflows

GLM-5.2 is not a general-purpose chatbot model. It's an agent-first foundation model, and the practical implications flow from that design choice.

1. Repository-Scale Context in a Single Window

The 1M-token context window can hold roughly 750,000 words — enough to load an entire mid-sized codebase, including documentation, source files, test suites, and build configuration, all within a single inference call. For agent frameworks like Claude Code, Codex, or OpenCode that rely on large context windows for task planning and execution, this means fewer context-window truncations and less need for sliding-window hacks.

Pricing at $1.40/1M input tokens and $4.40/1M output tokens means a full 1M-token prompt costs about $1.40 in input and scales predictably.

2. Sustained Tool-Use Chains

GLM-5.2 supports chains with 100+ tool calls in Max mode, according to early reports. For agent developers, this is the critical metric: not single-turn accuracy, but the model's ability to maintain a coherent plan across dozens of read-file, edit-file, run-command, and observe-output cycles without losing track of the original goal. This is where GLM-5.2's agentic RL training shows its value.

3. Open Weights Change the Economics

The MIT-licensed weight release (available from June 17 on Hugging Face) means you can run GLM-5.2 on your own hardware or through any inference provider. Ollama support was available day one. vLLM already has recipes for serving it. The 40B active parameters mean it's inference-addressable on high-end consumer GPUs with quantization, though the full 753B weights require multi-GPU setups.

For teams that can't or won't send code to US-based API providers, GLM-5.2's open-weights MIT license removes that constraint entirely.

4. Availability

GLM-5.2 is available through:

  • GLM Coding Plan: Subscription tiers at ~$10/month (Lite), ~$30/month (Pro), and ~$80/month (Max), with a Team tier for organizations. Includes access inside Claude Code, Cursor, and other IDEs.
  • Direct API: $1.40/1M input tokens, $4.40/1M output tokens (same pricing as GLM-5.1).
  • Third-party providers: DeepInfra, Novita, Nebius, Baseten, Fireworks, Parasail, SiliconFlow, GMI Cloud.
  • Self-hosted: MIT-licensed weights on Hugging Face, Ollama support, vLLM recipes.

How It Compares to Other Open Models

vs. MiniMax-M3

MiniMax-M3 (score 44) launched two weeks before GLM-5.2 and was briefly the top open model. It's a strong general-purpose model with competitive coding benchmarks, but GLM-5.2's seven-point lead on the Intelligence Index and purpose-built agentic training give it a clear edge for long-horizon tasks. MiniMax-M3 is cheaper per task (~$0.18 vs ~$0.46), so for simpler workloads it may be the better price-performance choice.

vs. DeepSeek V4 Pro

DeepSeek V4 Pro (max mode, score 44) matches MiniMax-M3 on the Index and is dramatically cheaper (~$0.05 per task). DeepSeek's ecosystem is mature with excellent vLLM and SGLang support. However, GLM-5.2's 1M-token context (vs DeepSeek's 128K) and agentic RL training make it the better choice for agent workflows that need sustained context.

vs. Kimi K2.6

Kimi K2.6 (score 43) is Moonshot's contender in the open-weights space. It's competitive on general benchmarks but doesn't have the long-context or agentic focus that defines GLM-5.2.

vs. Proprietary Models (GPT-5.5, Claude Opus 4.7)

GLM-5.2 trails proprietary leaders by 10-16 points on the Intelligence Index. This is expected — closed models benefit from larger training budgets, proprietary data, and more aggressive RL pipelines. What narrows the gap is cost and control: at $0.46 per task with MIT weights, GLM-5.2 offers roughly 60-70% of the intelligence of GPT-5.5 for about 1/6th the cost, with zero vendor lock-in.

MetricGLM-5.2 (max)GPT-5.5 (xhigh)Claude Opus 4.7
AA Intelligence Index516763
Context window1M tokens1M tokens*200K tokens
API price (input)$1.40/M~$10/M**$15/M
Open weightsMITNoNo
Self-hostableYesNoNo

*GPT-5.5's effective context varies by mode. **GPT-5.5 pricing not final at time of writing; estimate based on GPT-5.2 pricing trajectory.

Pitfalls

1. Max mode is required for the reported benchmarks. The 51 Intelligence Index score was achieved in Max mode, which is slower and more expensive than High mode. In High mode, GLM-5.2 is comparable to GLM-5.1. If you're comparing benchmark numbers, make sure you know which mode was used.

2. Open-source benchmarks are still rolling in. Z.ai shipped GLM-5.2 without publishing SWE-bench Verified, LiveCodeBench, or HumanEval numbers at launch. Third-party results will take 1-2 weeks post-weight-release. The Artificial Analysis Index and LMSys results are reliable, but you should wait for independent coding benchmarks before making procurement decisions.

3. 1M-token context is usable but expensive. A full 1M-token prompt costs ~$1.40 in input tokens alone. Output at that context length also runs slower due to the sparse attention mechanics. Plan your context budgets carefully — most agentic tasks can be done in 100K-200K tokens, which is where GLM-5.2 hits its efficiency sweet spot.

4. Self-hosting requires significant hardware. 40B active parameters is manageable on a single H100 with 4-bit quantization, but the full 753B weights need multi-GPU setups (8×H100 minimum). The MIT license removes software restrictions, but the hardware cost is real.

5. The geopolitical context matters. GLM-5.2 is a Chinese AI model released under MIT license. For teams with export control compliance requirements or that operate in restricted jurisdictions, verify that self-hosting and API usage are permitted under your legal framework. Z.ai has stated the weights ship with "no regional limits," but self-hosted deployment decisions remain your responsibility.

The Bottom Line

GLM-5.2 is the strongest open-weights model for long-horizon agentic coding tasks available today. The IndexShare architecture makes 1M-token context practical, the agentic RL training produces coherent multi-step behavior, and the MIT open-weights license gives developers deployment flexibility that closed models can't match.

It's not a GPT-5.5 or Claude Opus killer — it trails those models on raw intelligence by a comfortable margin. But for the specific use case of building autonomous coding agents that operate at repository scale, GLM-5.2 is currently the best option in the open-weights ecosystem, and by a meaningful margin.

If you're running open-source coding agents and need a context window that actually fits a real codebase, start with GLM-5.2 in Max mode. Watch the third-party benchmarks, plan for the hardware requirements if self-hosting, and treat the High/Max mode distinction as a critical configuration decision rather than an afterthought.