What Changed

On April 29, 2026, DeepSeek began limited testing of a Vision mode in its flagship chat product — the first time multimodal capabilities appeared in DeepSeek's consumer-facing chat interface. Users with access see a third mode selector alongside Fast Mode and Expert Mode, enabling image upload and analysis directly in the conversation.

This is significant because DeepSeek was the last major frontier lab without vision in its chat product. GPT-4V arrived in late 2023. Gemini has been multimodal from day one. Claude added vision in March 2024. DeepSeek's text-only chat was a conspicuous gap given the strength of its model family — and the gap is now closed.

Before Vision mode, DeepSeek's multimodal capabilities existed only as separate model families (DeepSeek-VL, DeepSeek-VL2, and Janus) that required dedicated endpoints and different integration patterns. The new Vision mode brings image understanding into the same chat interface that serves the V4 models. It is not yet reachable through the API, so for now the integration is limited to the chat product.

The Rollout: Grayscale Testing, Not a Full Launch

DeepSeek is using a grayscale rollout (灰度测试, huidu ceshi in Chinese tech media) — a controlled release to a subset of users. This is the same pattern DeepSeek used for V4's early preview access.

What's available:

Web UI: Vision mode appears as a toggle or mode selector alongside Fast/Expert
Mobile app: Available after updating, same mode selector pattern
Access: Some users get it immediately; others see the option missing entirely

What's not yet available:

No API endpoint for vision input — the mode is chat-only for now
No disclosure of the underlying model name, parameter count, or training methodology powering the vision mode
No benchmarks or technical report specific to the chat vision feature

Chinese tech media describes the rollout as "gray-scale testing" rather than a beta, which signals two things: DeepSeek is collecting feedback under production conditions rather than validating a prototype, and the company is expanding inference capacity gradually. Vision models are compute-heavy, and DeepSeek's reported infrastructure ramp on domestic AI chips (Huawei Ascend, Cambricon) suggests capacity is a real constraint.

The DeepSeek API documentation as of late April still showed no vision-related endpoints — only V4-Pro, V4-Flash, 1M context, and tool calling. API-based vision capabilities are expected to follow once the grayscale test stabilizes.

What Vision Mode Can Do

Based on user reports from the grayscale test and DeepSeek's own demonstrations, Vision mode handles the standard multimodal tasks.

Image Understanding

Object recognition: Identify and describe objects in photographs, screenshots, and diagrams
Document analysis: Extract and reason over text from scanned documents, screenshots, and PDF pages
Chart and graph interpretation: Read numerical data from charts, graphs, and tables embedded in images
Visual Q&A: Answer questions about image content — "how many people are in this photo?", "what's the license plate number?"
Handwritten text recognition: OCR of handwritten notes and forms

What It Doesn't Do

No video understanding — unlike Gemini's native video support
No image generation — DeepSeek's Janus models cover text-to-image, but Vision mode is understanding-only
No multi-image workflows — initial reports suggest single-image-per-conversation mode, not multi-image batch processing
No tool-augmented vision — you can't, for example, search the web for an image and then analyze it in one chain (yet)

Supported Image Types

Based on the V4 model family's underlying capabilities, expect standard image format support (JPEG, PNG, WebP). Per-image size limits have not been published; image tokens will presumably count against the same 1M-token context as text. Image resolution handling details haven't been published either: earlier DeepSeek-VL-series models handled inputs of up to 1024×1024, but Vision mode may use a different image preprocessing pipeline.
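Since the accepted formats are expected rather than confirmed, it can help to check files locally before uploading. The following is a minimal sketch that identifies JPEG, PNG, and WebP by their leading magic bytes. The function names and the 10 MB size cap are hypothetical placeholders, not DeepSeek limits.

```python
from typing import Optional

# Formats the article expects Vision mode to accept; not a confirmed list.
EXPECTED_FORMATS = {"jpeg", "png", "webp"}


def sniff_image_format(data: bytes) -> Optional[str]:
    """Identify an image format from its leading magic bytes."""
    if data.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    # WebP is a RIFF container with the tag "WEBP" at byte offset 8.
    if len(data) >= 12 and data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "webp"
    return None


def is_uploadable(data: bytes, max_bytes: int = 10 * 1024 * 1024) -> bool:
    """Check the format and a placeholder size cap (the default is hypothetical)."""
    return sniff_image_format(data) in EXPECTED_FORMATS and len(data) <= max_bytes
```

Sniffing the bytes rather than trusting the file extension catches renamed files, which are a common cause of silent upload failures.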

Under the Hood: How DeepSeek's Vision Works

The most interesting technical detail comes from analysis of DeepSeek V4's vision pipeline: the model processes images natively without requiring a separate vision encoder pipeline.

Native Multimodal Architecture

Most multimodal models bolt a vision encoder (CLIP, SigLIP, etc.) onto a language model. The encoder converts images into embeddings, and those embeddings are fed into the LLM as prepended or interleaved tokens. GPT-4V, Gemini, and Claude are widely assumed to follow some version of this design, although none of the three vendors has published full architectural details.

DeepSeek V4's architecture appears to differ. The model directly ingests image tokens into its MoE transformer layers without a separate encoder step. This means:

Fewer KV cache entries per image: DeepSeek V4 uses approximately 90 KV cache entries per image vs 870 for Claude, roughly 10x less per-image overhead
Lower vision processing cost: Less compute per image means lower inference cost, which feeds directly into DeepSeek's aggressive pricing
Tighter text-image integration: No encoder bottleneck between vision and language processing — the MoE layers route vision tokens to relevant experts just like text tokens
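To make the KV cache comparison concrete, here is a small arithmetic sketch using the per-image figures quoted above. Both numbers are approximate and come from third-party analysis, so treat the output as illustrative. The function names and the 100-image batch are hypothetical.

```python
# Per-image KV cache entry counts as quoted in the article (approximate).
DEEPSEEK_V4_ENTRIES_PER_IMAGE = 90
CLAUDE_ENTRIES_PER_IMAGE = 870


def kv_entries(images: int, entries_per_image: int) -> int:
    """Total KV cache entries consumed by a batch of images."""
    return images * entries_per_image


def reduction_factor(baseline: int, candidate: int) -> float:
    """How many times fewer entries the candidate uses than the baseline."""
    return baseline / candidate


ratio = reduction_factor(CLAUDE_ENTRIES_PER_IMAGE, DEEPSEEK_V4_ENTRIES_PER_IMAGE)
saved = (kv_entries(100, CLAUDE_ENTRIES_PER_IMAGE)
         - kv_entries(100, DEEPSEEK_V4_ENTRIES_PER_IMAGE))

print(f"{ratio:.1f}x fewer entries per image")
print(f"{saved} fewer entries across 100 images")
```

The exact ratio is 870 / 90, or about 9.7, which is where the "roughly 10x" figure comes from.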

MoE Efficiency at Scale

DeepSeek V4 is a Mixture-of-Experts architecture — 1.6 trillion total parameters with 49 billion active per token for V4-Pro, and 284B total / 13B active for V4-Flash. Vision tokens are routed through the same expert selection mechanism as text tokens. This is architecturally different from models like GPT-4V where the vision encoder and language model are separate components with different routing logic.
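The idea of routing vision and text tokens through one expert-selection mechanism can be illustrated with a toy top-k router. This is a generic sketch of MoE routing, not DeepSeek's implementation: the expert count, top-k value, and embedding width are hypothetical toy values, and the router weights are random.

```python
import math
import random
from typing import List

NUM_EXPERTS = 8  # toy value; V4's real expert count is not stated here
TOP_K = 2        # toy value, hypothetical
DIM = 4          # toy embedding width


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route(token: List[float], router: List[List[float]], k: int = TOP_K) -> List[int]:
    """Return the indices of the k experts with the highest router probability."""
    logits = [sum(w * x for w, x in zip(row, token)) for row in router]
    probs = softmax(logits)
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]


rng = random.Random(0)
router = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]

# A mixed sequence: the router sees only vectors, never a modality label.
sequence = [("text", [rng.uniform(-1, 1) for _ in range(DIM)]) for _ in range(3)]
sequence += [("image", [rng.uniform(-1, 1) for _ in range(DIM)]) for _ in range(3)]

for modality, vec in sequence:
    print(modality, route(vec, router))

# Share of parameters active per token, from the figures quoted above.
print(f"V4-Pro active share: {49e9 / 1.6e12:.1%}")
print(f"V4-Flash active share: {13e9 / 284e9:.1%}")
```

The point of the sketch is that `route` takes no modality argument: in a shared-router design, an image token and a text token are selected for experts by the same function.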

The practical impact: DeepSeek's vision capabilities are likely to be significantly cheaper to deploy than competitors' at scale. The KV cache savings alone translate to lower per-query costs, which matter for high-volume vision workloads like receipt processing, document classification, and visual data extraction.

How It Compares

DeepSeek Vision vs GPT-4V

Dimension	DeepSeek Vision	GPT-4V
Model architecture	Native multimodal MoE (1.6T total, 49B active)	Unknown sparse architecture (speculated 8×220B MoE)
Context window	1M tokens (shared with V4)	256K tokens
Image processing	Native — no separate vision encoder	Separate vision encoder (CLIP-derived)
Input pricing	$0.435/M tokens (V4-Pro) / $0.09/M (V4-Flash)	$5.00/M tokens
Output pricing	$0.87/M tokens (V4-Pro) / $0.18/M (V4-Flash)	$15.00/M tokens
API availability	Chat-only (grayscale beta)	Full API + ChatGPT
Video support	No	Yes (limited)
Image generation	No (Janus models separate)	Yes (DALL-E integration)
Tool use with vision	Not yet	Yes (function calling + vision)

DeepSeek wins on: Pricing (roughly 11x cheaper on input and 17x cheaper on output at V4-Pro rates), context window (4x larger), native multimodal architecture.

GPT-4V wins on: API availability (mature, production-ready), video support, tool-augmented vision pipelines, image generation.

DeepSeek Vision vs Gemini Vision

Dimension	DeepSeek Vision	Gemini Vision
Model	V4-based (MoE)	Gemini 2.5 Pro (dense MoE hybrid)
Context window	1M tokens	2M tokens
Image pricing	Same as text ($0.435/M input)	$1.25/M input
Video understanding	No	Yes (native, full-length)
Audio understanding	No	Yes
MCP integration	No	Yes (Vertex AI)
API availability	Chat-only beta	Full API + AI Studio
Deployment	Open weights (self-host)	Google Cloud only

DeepSeek wins on: Pricing (~3x cheaper), open-weight deployment, MIT license.

Gemini wins on: Context window (2x), video and audio natively, MCP integration, production API readiness.

DeepSeek Vision vs Claude Vision (Opus 4.6 / Sonnet 4.6)

Dimension	DeepSeek Vision	Claude Vision
Architecture	1.6T MoE, 49B active	Undisclosed (dense or MoE, likely 2T+)
Context window	1M tokens	200K tokens
Image pricing	$0.435/M input	$3.00/M input
Document analysis	Yes	Yes (excellent — Claude leads on document understanding)
Tool use with vision	Not yet	Yes (mature, reliable)
MCP support	No	Yes (first-class)
Self-hostable	Yes (MIT, open weights)	No
API maturity	Beta (chat-only)	GA (full API)

DeepSeek wins on: Pricing (~7x cheaper input), context window (5x), open-weight deployment.

Claude wins on: Document understanding reliability, tool use with vision (production-tested), MCP integration, API maturity.

Pricing Deep Dive

DeepSeek's vision pricing is notable because there's no separate image-pricing tier — images are billed as input tokens at the standard V4 rate.

Per-Image Cost Estimates

A typical vision query involves:

1-2K tokens of image encoding (at the V4 native rate)
~500 tokens of text prompt
Variable output tokens

Model	1 image + 500-token prompt	10 images + 1K prompt	50 images batch
DeepSeek V4-Flash	~$0.001	~$0.005	~$0.02
DeepSeek V4-Pro	~$0.002	~$0.01	~$0.04
GPT-4V	~$0.02	~$0.10	~$0.45
Gemini 2.5 Pro	~$0.005	~$0.03	~$0.12
Claude Opus 4.6	~$0.01	~$0.06	~$0.25

The absolute gap widens with volume because DeepSeek doesn't charge a premium for image tokens. For high-volume vision workloads (document processing, receipt scanning, visual data extraction pipelines), the estimates above put DeepSeek at V4-Flash pricing roughly 20x cheaper than GPT-4V.
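The per-query estimates can be reproduced with a short calculator. This is a sketch under stated assumptions: images bill as ordinary input tokens, each image costs 1,500 tokens (the midpoint of the 1-2K range above), and each response is 500 output tokens. The output length is a hypothetical figure, so results will not match the rounded table values exactly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    input_per_m: float   # USD per million input tokens
    output_per_m: float  # USD per million output tokens


# List prices quoted in this article for text tokens.
V4_FLASH = Rates(input_per_m=0.09, output_per_m=0.18)
V4_PRO = Rates(input_per_m=0.435, output_per_m=0.87)


def query_cost(rates: Rates, images: int, prompt_tokens: int,
               output_tokens: int, tokens_per_image: int = 1500) -> float:
    """Estimated USD cost of one query, treating image tokens as input tokens."""
    input_tokens = images * tokens_per_image + prompt_tokens
    total = input_tokens * rates.input_per_m + output_tokens * rates.output_per_m
    return total / 1_000_000


for name, rates in (("V4-Flash", V4_FLASH), ("V4-Pro", V4_PRO)):
    single = query_cost(rates, images=1, prompt_tokens=500, output_tokens=500)
    batch = query_cost(rates, images=10, prompt_tokens=1000, output_tokens=500)
    print(f"{name}: 1 image ${single:.5f}, 10 images ${batch:.5f}")
```

Changing `tokens_per_image` or `output_tokens` shifts the totals noticeably, which is a reminder of how sensitive per-image estimates are to details DeepSeek has not yet published.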

The trade-off: You're betting on a beta feature with no API endpoint, no SLA, and no guarantee of the pricing structure when it reaches GA. If you're building a vision pipeline today, the cost savings are real but the availability risk is real too.

What This Means for the Open-Weight Model Landscape

DeepSeek's decision to add vision to its chat product matters beyond the feature itself. Here's why:

The V4 Family Already Sets the Open-Weight Benchmark

DeepSeek V4-Pro and V4-Flash, released April 24, 2026 under MIT license with open weights on Hugging Face, set new standards for what open-weight models can do — 1M-token context, 1.6T parameter MoE, competitive pricing. Adding vision to that stack means the most capable open-weight model family now has multimodal capabilities included.

Pressure on Proprietary Models

The pricing gap shown above creates real pressure. If an open-weight model can deliver most of GPT-4V's image understanding at roughly a tenth of the cost (a claim that still needs published benchmarks to support it), enterprise procurement teams will notice. For deployment scenarios where self-hosting is acceptable (data residency, air-gapped environments, high-throughput batch processing), DeepSeek's vision capabilities make the open-weight option more attractive than ever.

The Landscape Shifts

Before Vision mode, the open-weight multimodal landscape had options like LLaVA, Idefics, and DeepSeek's own VL/VL2 models, but none integrated into a chat product with production-grade infrastructure and pricing. DeepSeek Vision changes this by offering vision capabilities from an open-weight model family through a chat UI and (eventually) an API, though confirmed pricing and documentation for image input are still pending.

The next competitive milestones to watch:

API endpoint launch for vision input (expected Q3 2026)
Video understanding (likely later in the V4 lifecycle)
Benchmark scores on standard multimodal evals (MMMU, MathVista, ChartQA)

Practical Implications for Developers

What You Can Do Today

If you have Vision mode access on chat.deepseek.com:

Upload screenshots, photos, and documents for analysis
Ask questions about image content in natural language
Use it alongside V4's 1M-token context for document-heavy workflows

The feature integrates with V4's existing chat interface — upload an image, type a prompt, and the model processes both. There's no separate tooling required.

What You Should Wait For

For production use, wait for these before building on DeepSeek Vision:

API endpoint — The chat-only access doesn't support programmatic use. Without an API, you can't build pipelines.
Pricing confirmation — Current pricing is inferred from text token rates. Image pricing may differ when the API launches.
Benchmark data — Without published scores on MMMU, MathVista, or ChartQA, you're evaluating based on early impressions, not data.
Rate limits and SLA — Grayscale testing means no guarantees on availability or throughput.
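Until official benchmark scores arrive, one way to move past early impressions is to keep a small test set of your own and score the answers you copy out of the chat UI. The sketch below tallies exact-match accuracy per task category. The record layout, category names, and normalization rule are all hypothetical choices, and exact match is a crude metric that suits short answers such as numbers or plate strings.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Each record: (task category, expected answer, answer copied from the chat UI).
Record = Tuple[str, str, str]


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial differences don't count as misses."""
    return " ".join(answer.lower().split())


def accuracy_by_category(records: Iterable[Record]) -> Dict[str, float]:
    """Exact-match accuracy for each task category."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for category, expected, actual in records:
        totals[category] += 1
        if normalize(expected) == normalize(actual):
            hits[category] += 1
    return {category: hits[category] / totals[category] for category in totals}


# Hypothetical sample records for illustration only.
sample = [
    ("ocr", "ABC 123", "abc  123"),
    ("ocr", "Invoice 42", "Invoice 24"),
    ("chart", "17.5", "17.5"),
]
print(accuracy_by_category(sample))
```

A few dozen records per category, drawn from the images you actually plan to process, will say more about fit for your workload than a public leaderboard score.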

Integration Path (When Available)

DeepSeek's API already supports OpenAI ChatCompletions format for text. When vision is added to the API, the integration path should be straightforward:

# Expected integration pattern (once API vision is available)
from openai import OpenAI

client = OpenAI(
    base_url="https://api.deepseek.com",
    api_key="your-deepseek-key",
)

response = client.chat.completions.create(
    model="deepseek-v4-pro",  # or deepseek-v4-flash
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "data:image/jpeg;base64,..."
                    }
                }
            ]
        }
    ]
)

print(response.choices[0].message.content)

This is the standard OpenAI multimodal format. If DeepSeek maintains API compatibility — which they've done for text — migrating existing GPT-4V pipelines to DeepSeek Vision will be a model-name change and a base_url change.
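If that compatibility holds, the message payload can be built once and reused across providers. The following sketch encodes a local image as a base64 data URL and wraps it in the OpenAI-style message structure shown above. It makes no network call. The helper names are hypothetical, and whether DeepSeek's API will accept `image_url` content parts is an assumption until vision support ships.

```python
import base64
import mimetypes
from pathlib import Path
from typing import Any, Dict, List


def image_to_data_url(path: str) -> str:
    """Encode a local image file as a base64 data URL."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognized image type: {path}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def build_vision_messages(prompt: str, image_path: str) -> List[Dict[str, Any]]:
    """Build an OpenAI-style chat message holding one text part and one image part."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_to_data_url(image_path)}},
            ],
        }
    ]
```

Keeping payload construction separate from the client means a later provider switch touches only the `base_url` and `model` arguments, not the code that prepares images.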

Pitfalls

1. It's a Beta — Treat It Like One

Vision mode is in grayscale testing. Not all users have access. The feature may break, change, or disappear between sessions. Do not build production dependencies on a feature that hasn't reached API GA.

2. No API Access Yet

The most limiting factor: you cannot use Vision mode programmatically. Every interaction goes through the chat UI. This means no automation, no batch processing, no integration into existing pipelines. If your need is "build a vision pipeline today," DeepSeek Vision is not the answer yet.

3. Unknown Reliability at Scale

DeepSeek's text V4 models have been rigorously evaluated (SWE-bench, GPQA, AIME, etc.). Vision mode has no published benchmarks. Early user reports suggest competent image understanding, but there's no data on failure modes — how it handles adversarial images, low-resolution inputs, complex diagrams, or multilingual visual text. Until benchmark scores are published, treat its vision capabilities as unvalidated.

4. Image Pricing Is Assumed, Not Confirmed

The cost estimates above assume DeepSeek bills image tokens at the same rate as text tokens. This is a reasonable assumption given the native architecture (no separate encoder cost), but it's not confirmed. DeepSeek may introduce vision-specific pricing tiers when the API launches. Build your business case on the assumption that image pricing could be 2-3x text pricing — if it comes in lower, that's upside.
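A quick sensitivity check shows what that conservative assumption does to a budget. This sketch applies a 1x, 2x, and 3x multiplier to the V4-Pro input rate quoted earlier. The volume of one million images per month and the 1,500 tokens per image are hypothetical planning inputs.

```python
TEXT_INPUT_RATE = 0.435  # USD per million input tokens, V4-Pro rate quoted above


def monthly_image_cost(images_per_month: int, tokens_per_image: int = 1500,
                       multiplier: float = 1.0,
                       text_rate: float = TEXT_INPUT_RATE) -> float:
    """Monthly input cost for images if image tokens bill at multiplier x the text rate."""
    tokens = images_per_month * tokens_per_image
    return tokens * text_rate * multiplier / 1_000_000


for m in (1.0, 2.0, 3.0):
    cost = monthly_image_cost(1_000_000, multiplier=m)
    print(f"{m:.0f}x text rate: ${cost:,.2f} per month")
```

If the business case still works at the 3x line, a later pricing announcement is unlikely to break it.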

5. No Tool-Augmented Vision

A powerful pattern with GPT-4V and Claude is tool-calling on visual input — upload a screenshot of a bug and ask the model to generate a fix, or show a chart and ask the model to query a database for more data. DeepSeek Vision doesn't support this yet. The V4 model family supports tool calling, but not in combination with vision input.

6. AI Chip Constraints May Limit Rollout

DeepSeek's reported infrastructure buildout on domestic AI chips (Huawei Ascend, Cambricon) means capacity expansion follows a different trajectory than labs using NVIDIA hardware. If Vision mode proves popular, rollout delays due to infrastructure constraints are likely — the grayscale approach itself signals this.

7. Compliance and Data Privacy Considerations

DeepSeek is a Chinese company subject to Chinese AI regulations. For enterprise deployments in regulated industries, using DeepSeek's cloud API for vision workloads means sending image data to servers in China with Chinese data sovereignty rules. The open-weight models mitigate this for self-hosted deployments, but the chat product routes through DeepSeek's infrastructure. Check your compliance requirements before uploading sensitive images.

The Bottom Line

DeepSeek Vision closes the multimodal gap in the DeepSeek ecosystem. The native architecture is technically interesting (fewer KV cache entries, no separate encoder, MoE routing for vision tokens), and if the pricing holds at text rates once an API exists, it could become the cheapest production-scale vision API by a wide margin.

But as of June 2026, Vision mode is a beta feature with no API, no published benchmarks, and uncertain availability. The value proposition for developers today is limited to evaluating the feature manually through the chat interface. The real competitive impact will come when DeepSeek launches API-based vision access with confirmed pricing and published benchmark scores.

For the open-weight ecosystem, this is a significant step. The most capable open-weight model family now has vision capabilities, and those capabilities are built into the same architecture that delivers 1M-token context and $0.09/M input tokens. When the API launches, the multimodal AI pricing landscape shifts.

For a deeper look at DeepSeek V4's architecture, benchmarks, and pricing, see the DeepSeek V4 Preview documentation. For the broader open-weight model landscape, the DeepSeek V4 comparison on MorphLlm offers a useful benchmark reference against Kimi K2.6 and other open-weight contenders.

DeepSeek Introduces Vision — What It Adds to the Chat Experience