Thursday, June 18, 2026
DeepSeek Introduces Vision — What It Adds to the Chat Experience

What Changed
On April 29, 2026, DeepSeek began limited testing of a Vision mode in its flagship chat product — the first time multimodal capabilities appeared in DeepSeek's consumer-facing chat interface. Users with access see a third mode selector alongside Fast Mode and Expert Mode, enabling image upload and analysis directly in the conversation.
This is significant because DeepSeek was the last major frontier lab without vision in its chat product. GPT-4V arrived in late 2023. Gemini has been multimodal from day one. Claude added vision in March 2024. DeepSeek's text-only chat was a conspicuous gap given the strength of its model family — and the gap is now closed.
Before Vision mode, DeepSeek's multimodal capabilities existed only as separate model families (DeepSeek-VL, DeepSeek-VL2, and Janus) that required dedicated API endpoints and different integration patterns. The new Vision mode brings image understanding into the same chat interface and the same V4 model family, although for now it is reachable only through the chat product, not through the API.
The Rollout: Grayscale Testing, Not a Full Launch
DeepSeek is using a grayscale rollout (灰度测试, huidu ceshi in Chinese tech media) — a controlled release to a subset of users. This is the same pattern DeepSeek used for V4's early preview access.
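For readers unfamiliar with the pattern, a grayscale rollout is often implemented by hashing each user into a stable bucket and enabling the feature for buckets below a threshold. The sketch below is a generic illustration only. DeepSeek has not described how it selects users, and the function name `in_rollout`, the feature key `vision-mode`, and the 5% figure are all assumptions made for the example.

```python
import hashlib


def in_rollout(user_id: str, feature: str, percent: float) -> bool:
    """Return True if this user falls inside the rollout percentage."""
    # Hash feature + user so each feature gets its own stable bucketing.
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a bucket in [0, 10000).
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    # percent=5.0 enables buckets 0..499, i.e. about 5% of users.
    return bucket < percent * 100


# Hypothetical user IDs and rollout percentage, for illustration only.
users = [f"user-{i}" for i in range(10_000)]
enabled = [u for u in users if in_rollout(u, "vision-mode", 5.0)]
print(f"{len(enabled)} of {len(users)} users see Vision mode")
```

Because the bucket depends only on the user ID and feature name, a given user gets the same answer on every request. That would be consistent with the reports that some users see the option immediately while others don't see it at all.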
What's available:
- Web UI: Vision mode appears as a toggle or mode selector alongside Fast/Expert
- Mobile app: Available after updating, same mode selector pattern
- Access: Some users get it immediately; others see the option missing entirely
What's not yet available:
- No API endpoint for vision input — the mode is chat-only for now
- No disclosure of the underlying model name, parameter count, or training methodology powering the vision mode
- No benchmarks or technical report specific to the chat vision feature
Chinese tech media describes the rollout as "grayscale testing" rather than a beta, which signals two things: DeepSeek is gathering feedback under production conditions rather than validating a prototype, and the company is building out inference capacity gradually. Vision models are compute-heavy, and DeepSeek's reported infrastructure ramp on domestic AI chips (Huawei Ascend, Cambricon) suggests capacity is a real constraint.
The DeepSeek API documentation as of late April still showed no vision-related endpoints — only V4-Pro, V4-Flash, 1M context, and tool calling. API-based vision capabilities are expected to follow once the grayscale test stabilizes.
What Vision Mode Can Do
Based on user reports from the grayscale test and DeepSeek's own demonstrations, Vision mode handles the standard multimodal tasks.
Image Understanding
- Object recognition: Identify and describe objects in photographs, screenshots, and diagrams
- Document analysis: Extract and reason over text from scanned documents, screenshots, and PDF pages
- Chart and graph interpretation: Read numerical data from charts, graphs, and tables embedded in images
- Visual Q&A: Answer questions about image content — "how many people are in this photo?", "what's the license plate number?"
- Handwritten text recognition: OCR of handwritten notes and forms
What It Doesn't Do
- No video understanding — unlike Gemini's native video support
- No image generation — DeepSeek's Janus models cover text-to-image, but Vision mode is understanding-only
- No multi-image workflows — initial reports suggest single-image-per-conversation mode, not multi-image batch processing
- No tool-augmented vision — you can't, for example, search the web for an image and then analyze it in one chain (yet)
Supported Image Types
DeepSeek has not documented supported formats or size limits. Based on the V4 model family's underlying capabilities, expect standard image format support (JPEG, PNG, WebP), with image tokens presumably counting against V4's 1M-token context rather than a separate quota. Image resolution handling details haven't been published either: the DeepSeek-VL2 models supported 1024×1024 inputs, but Vision mode may use a different image preprocessing pipeline.
Under the Hood: How DeepSeek's Vision Works
The most interesting technical detail comes from analysis of DeepSeek V4's vision pipeline: the model processes images natively without requiring a separate vision encoder pipeline.
Native Multimodal Architecture
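Most multimodal models bolt a vision encoder (CLIP, SigLIP, etc.) onto a language model. The encoder converts images into embeddings, and those embeddings are fed into the LLM as prepended or interleaved tokens. GPT-4V and Claude are widely assumed to work this way, though neither vendor has published architecture details. Google describes Gemini as natively multimodal, but its image path is likewise undisclosed.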
DeepSeek V4's architecture appears to differ. The model directly ingests image tokens into its MoE transformer layers without a separate encoder step. This means:
- Fewer KV cache entries per image: by that analysis, DeepSeek V4 uses approximately 90 KV cache entries per image vs 870 for Claude, roughly 10x less per-image overhead
- Lower vision processing cost: Less compute per image means lower inference cost, which feeds directly into DeepSeek's aggressive pricing
- Tighter text-image integration: No encoder bottleneck between vision and language processing — the MoE layers route vision tokens to relevant experts just like text tokens
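To put the KV cache figures in perspective, the short calculation below scales the two per-image numbers quoted above to a few batch sizes. It is arithmetic on the quoted figures only. The batch sizes are arbitrary, and the calculation says nothing about bytes per entry, which depend on details that have not been published.

```python
# Per-image KV cache entries as quoted in the list above.
DEEPSEEK_KV_PER_IMAGE = 90
CLAUDE_KV_PER_IMAGE = 870


def kv_entries(images: int, per_image: int) -> int:
    """Total KV cache entries consumed by a batch of images."""
    return images * per_image


ratio = CLAUDE_KV_PER_IMAGE / DEEPSEEK_KV_PER_IMAGE
print(f"per-image ratio: {ratio:.1f}x")

# Arbitrary batch sizes, chosen only to show how the gap scales.
for n in (1, 10, 50):
    ds = kv_entries(n, DEEPSEEK_KV_PER_IMAGE)
    cl = kv_entries(n, CLAUDE_KV_PER_IMAGE)
    print(f"{n:>3} images: {ds:>6} vs {cl:>6} entries")
```

The exact ratio is closer to 9.7x than 10x. More to the point, the saving grows linearly with the number of images held in context.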
MoE Efficiency at Scale
DeepSeek V4 is a Mixture-of-Experts architecture — 1.6 trillion total parameters with 49 billion active per token for V4-Pro, and 284B total / 13B active for V4-Flash. Vision tokens are routed through the same expert selection mechanism as text tokens. This is architecturally different from models like GPT-4V where the vision encoder and language model are separate components with different routing logic.
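As a rough illustration of what "the same expert selection mechanism" means, here is a toy top-k router in plain Python. It is a generic MoE gating sketch, not DeepSeek's implementation: the dimensions, the expert count, `k=2`, and the random gate weights are all made up, and V4's actual routing for vision tokens has not been published.

```python
import math
import random


def top_k_route(token, gate, k=2):
    """Return [(expert_index, weight)] for the k highest-scoring experts."""
    # One score per expert: dot product of the token with that expert's gate row.
    scores = [sum(t * g for t, g in zip(token, row)) for row in gate]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax over the selected experts only, so their weights sum to 1.
    peak = max(scores[i] for i in top)
    exps = [math.exp(scores[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


random.seed(0)
DIM, EXPERTS = 8, 16  # toy sizes, far smaller than any real model
gate = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(EXPERTS)]
text_token = [random.gauss(0, 1) for _ in range(DIM)]
image_token = [random.gauss(0, 1) for _ in range(DIM)]

# The router sees only a vector; it has no notion of modality.
for name, tok in (("text", text_token), ("image", image_token)):
    print(name, top_k_route(tok, gate))
```

The router never checks whether a vector came from text or from an image, which is the property the paragraph above describes. The quoted parameter counts also imply that only about 3% of V4-Pro's weights (49B of 1.6T) and under 5% of V4-Flash's (13B of 284B) are active for any given token.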
The practical impact: DeepSeek's vision capabilities are likely to be significantly cheaper to deploy than competitors' at scale. The KV cache savings alone translate to lower per-query costs, which matter for high-volume vision workloads like receipt processing, document classification, and visual data extraction.
How It Compares
DeepSeek Vision vs GPT-4V
| Dimension | DeepSeek Vision | GPT-4V |
|---|---|---|
| Model architecture | Native multimodal MoE (1.6T total, 49B active) | Unknown sparse architecture (speculated 8×220B MoE) |
| Context window | 1M tokens (shared with V4) | 256K tokens |
| Image processing | Native — no separate vision encoder | Separate vision encoder (CLIP-derived) |
| Input pricing | $0.435/M tokens (V4-Pro) / $0.09/M (V4-Flash) | $5.00/M tokens |
| Output pricing | $0.87/M tokens (V4-Pro) / $0.18/M (V4-Flash) | $15.00/M tokens |
| API availability | Chat-only (grayscale beta) | Full API + ChatGPT |
| Video support | No | Yes (limited) |
| Image generation | No (Janus models separate) | Yes (DALL-E integration) |
| Tool use with vision | Not yet | Yes (function calling + vision) |
DeepSeek wins on: Pricing (roughly 11x cheaper on input and 17x on output at V4-Pro rates), context window (4x larger), native multimodal architecture.
GPT-4V wins on: API availability (mature, production-ready), video support, tool-augmented vision pipelines, image generation.
DeepSeek Vision vs Gemini Vision
| Dimension | DeepSeek Vision | Gemini Vision |
|---|---|---|
| Model | V4-based (MoE) | Gemini 2.5 Pro (sparse MoE) |
| Context window | 1M tokens | 2M tokens |
| Image pricing | Same as text ($0.435/M input) | $1.25/M input |
| Video understanding | No | Yes (native, full-length) |
| Audio understanding | No | Yes |
| MCP integration | No | Yes (Vertex AI) |
| API availability | Chat-only beta | Full API + AI Studio |
| Deployment | Open weights (self-host) | Google Cloud only |
DeepSeek wins on: Pricing (~3x cheaper), open-weight deployment, MIT license.
Gemini wins on: Context window (2x), video and audio natively, MCP integration, production API readiness.
DeepSeek Vision vs Claude Vision (Opus 4.6 / Sonnet 4.6)
| Dimension | DeepSeek Vision | Claude Vision |
|---|---|---|
| Architecture | 1.6T MoE, 49B active | Undisclosed (dense or MoE, likely 2T+) |
| Context window | 1M tokens | 200K tokens |
| Image pricing | $0.435/M input | $3.00/M input |
| Document analysis | Yes | Yes (excellent — Claude leads on document understanding) |
| Tool use with vision | Not yet | Yes (mature, reliable) |
| MCP support | No | Yes (first-class) |
| Self-hostable | Yes (MIT, open weights) | No |
| API maturity | Beta (chat-only) | GA (full API) |
DeepSeek wins on: Pricing (~7x cheaper input), context window (5x), open-weight deployment.
Claude wins on: Document understanding reliability, tool use with vision (production-tested), MCP integration, API maturity.
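The input-price multiples in the three comparisons can be checked directly from the table rows. The snippet below does only that, using the V4-Pro input rate as the baseline. The dictionary keys are labels chosen for this example, not API model names.

```python
# Input prices in USD per million tokens, copied from the tables above.
INPUT_PRICE = {
    "DeepSeek V4-Pro": 0.435,
    "GPT-4V": 5.00,
    "Gemini 2.5 Pro": 1.25,
    "Claude": 3.00,
}


def price_multiple(competitor: str, baseline: str = "DeepSeek V4-Pro") -> float:
    """How many times more expensive the competitor's input tokens are."""
    return INPUT_PRICE[competitor] / INPUT_PRICE[baseline]


for name in ("GPT-4V", "Gemini 2.5 Pro", "Claude"):
    print(f"{name}: {price_multiple(name):.1f}x the V4-Pro input price")
```

These multiples cover input tokens only. Output-heavy workloads will see different ratios, and V4-Flash rates would widen every gap further.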
Pricing Deep Dive
DeepSeek's vision pricing is notable because, as far as can be inferred today, there's no separate image-pricing tier: images appear to be billed as input tokens at the standard V4 rate.
Per-Image Cost Estimates
A typical vision query involves:
- 1-2K tokens of image encoding (at the V4 native rate)
- ~500 tokens of text prompt
- Variable output tokens
| Model | 1 image + 500-token prompt | 10 images + 1K prompt | 50 images batch |
|---|---|---|---|
| DeepSeek V4-Flash | ~$0.001 | ~$0.005 | ~$0.02 |
| DeepSeek V4-Pro | ~$0.002 | ~$0.01 | ~$0.04 |
| GPT-4V | ~$0.02 | ~$0.10 | ~$0.45 |
| Gemini 2.5 Pro | ~$0.005 | ~$0.03 | ~$0.12 |
| Claude Opus 4.6 | ~$0.01 | ~$0.06 | ~$0.25 |
In absolute terms the gap widens with volume, while the ratio stays roughly constant because DeepSeek doesn't charge a premium for image tokens. For high-volume vision workloads (document processing, receipt scanning, visual data extraction pipelines), the estimates above put DeepSeek at V4-Flash pricing roughly 20x cheaper than GPT-4V.
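For readers who want to rerun the estimate with their own numbers, here is a minimal per-query cost sketch. It uses the token rates from the comparison tables and the "1-2K tokens per image" figure above. The 1,500-token midpoint, the 500-token output length, and the dictionary keys are assumptions made for this example, since the article does not state what output length its table assumes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rate:
    input_per_m: float   # USD per million input tokens
    output_per_m: float  # USD per million output tokens


# Rates copied from the comparison tables; keys are labels for this example.
RATES = {
    "deepseek-v4-flash": Rate(0.09, 0.18),
    "deepseek-v4-pro": Rate(0.435, 0.87),
    "gpt-4v": Rate(5.00, 15.00),
}


def query_cost(rate, images, prompt_tokens, output_tokens=500, tokens_per_image=1500):
    """Estimated USD cost of one query; the two defaults are assumptions."""
    input_tokens = images * tokens_per_image + prompt_tokens
    total = input_tokens * rate.input_per_m + output_tokens * rate.output_per_m
    return total / 1_000_000


for name, rate in RATES.items():
    print(f"{name}: ${query_cost(rate, images=1, prompt_tokens=500):.5f}")
```

Treat the output as an order-of-magnitude check rather than a reproduction of the table. The results move with the assumed output length, and for the cheapest models the output tokens make up a large share of the total.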
The trade-off: You're betting on a beta feature with no API endpoint, no SLA, and no guarantee of the pricing structure when it reaches GA. If you're building a vision pipeline today, the cost savings are real but the availability risk is real too.
What This Means for the Open-Weight Model Landscape
DeepSeek's decision to add vision to its chat product matters beyond the feature itself. Here's why:
The V4 Family Already Sets the Open-Weight Benchmark
DeepSeek V4-Pro and V4-Flash, released April 24, 2026 under MIT license with open weights on Hugging Face, set new standards for what open-weight models can do — 1M-token context, 1.6T parameter MoE, competitive pricing. Adding vision to that stack means the most capable open-weight model family now has multimodal capabilities included.
Pressure on Proprietary Models
The pricing gap shown above creates real pressure. If an open-weight model can deliver most of GPT-4V's image understanding at 10-15x lower cost (a claim that still awaits published benchmarks), enterprise procurement teams notice. For deployment scenarios where self-hosting is acceptable (data residency, air-gapped environments, high-throughput batch processing), DeepSeek's vision capabilities make the open-weight option more attractive than ever.
The Landscape Shifts
Before Vision mode, the open-weight multimodal landscape had options like LLaVA, Idefics, and DeepSeek's own VL/VL2 models, but none integrated into a chat product with production-grade infrastructure and pricing. DeepSeek Vision changes this by offering open-weight vision capabilities that are accessible through a chat UI today and, eventually, through an API once pricing and documentation are published.
The next competitive milestones to watch:
- API endpoint launch for vision input (expected Q3 2026)
- Video understanding (likely later in the V4 lifecycle)
- Benchmark scores on standard multimodal evals (MMMU, MathVista, ChartQA)
Practical Implications for Developers
What You Can Do Today
If you have Vision mode access on chat.deepseek.com:
- Upload screenshots, photos, and documents for analysis
- Ask questions about image content in natural language
- Use it alongside V4's 1M-token context for document-heavy workflows
The feature integrates with V4's existing chat interface — upload an image, type a prompt, and the model processes both. There's no separate tooling required.
What You Should Wait For
For production use, wait for these before building on DeepSeek Vision:
- API endpoint — The chat-only access doesn't support programmatic use. Without an API, you can't build pipelines.
- Pricing confirmation — Current pricing is inferred from text token rates. Image pricing may differ when the API launches.
- Benchmark data — Without published scores on MMMU, MathVista, or ChartQA, you're evaluating based on early impressions, not data.
- Rate limits and SLA — Grayscale testing means no guarantees on availability or throughput.
Integration Path (When Available)
DeepSeek's API already supports OpenAI ChatCompletions format for text. When vision is added to the API, the integration path should be straightforward:
```python
# Expected integration pattern (once API vision is available)
from openai import OpenAI

client = OpenAI(
    base_url="https://api.deepseek.com",
    api_key="your-deepseek-key",
)

response = client.chat.completions.create(
    model="deepseek-v4-pro",  # or deepseek-v4-flash
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        # Base64 payload elided
                        "url": "data:image/jpeg;base64,..."
                    },
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)
```
This is the standard OpenAI multimodal format. If DeepSeek maintains API compatibility — which they've done for text — migrating existing GPT-4V pipelines to DeepSeek Vision will be a model-name change and a base_url change.
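The `data:image/jpeg;base64,...` URL in that format is built by base64-encoding the image bytes. A small helper for that step might look like the sketch below. The function names `image_part` and `vision_message` are made up for this example, and whether DeepSeek will accept data URLs at all is an assumption until the API ships.

```python
import base64
import mimetypes
from pathlib import Path


def image_part(path: str) -> dict:
    """Build an OpenAI-style image_url content part from a local image file."""
    # Guess the MIME type from the file extension (e.g. .png -> image/png).
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognized image type: {path}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}}


def vision_message(prompt: str, path: str) -> dict:
    """Pair a text prompt with one image in a single user message."""
    return {
        "role": "user",
        "content": [{"type": "text", "text": prompt}, image_part(path)],
    }
```

A call such as `vision_message("What's in this image?", "receipt.jpg")` returns a dictionary that can be passed in the `messages` list of a ChatCompletions-style request. Base64 inflates the payload by about a third, so large images are usually resized before encoding.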
Pitfalls
1. It's a Beta — Treat It Like One
Vision mode is in grayscale testing. Not all users have access. The feature may break, change, or disappear between sessions. Do not build production dependencies on a feature that hasn't reached API GA.
2. No API Access Yet
The most limiting factor: you cannot use Vision mode programmatically. Every interaction goes through the chat UI. This means no automation, no batch processing, no integration into existing pipelines. If your need is "build a vision pipeline today," DeepSeek Vision is not the answer yet.
3. Unknown Reliability at Scale
DeepSeek's text V4 models have been rigorously evaluated (SWE-bench, GPQA, AIME, etc.). Vision mode has no published benchmarks. Early user reports suggest competent image understanding, but there's no data on failure modes — how it handles adversarial images, low-resolution inputs, complex diagrams, or multilingual visual text. Until benchmark scores are published, treat its vision capabilities as unvalidated.
4. Image Pricing Is Assumed, Not Confirmed
The cost estimates above assume DeepSeek bills image tokens at the same rate as text tokens. This is a reasonable assumption given the native architecture (no separate encoder cost), but it's not confirmed. DeepSeek may introduce vision-specific pricing tiers when the API launches. Build your business case on the assumption that image pricing could be 2-3x text pricing — if it comes in lower, that's upside.
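One way to build that margin into a business case is to carry an explicit image-price multiplier through the estimate. The sketch below applies 1x, 2x, and 3x multipliers to the image portion of the input at the V4-Pro text rate. The 1,500 tokens per image and the one-million-query volume are assumptions chosen for illustration.

```python
V4_PRO_INPUT_PER_M = 0.435  # USD per million input tokens (text rate)


def input_cost(images, prompt_tokens, image_multiplier=1.0, tokens_per_image=1500):
    """Input-side USD cost of one query, with a premium applied to image tokens only."""
    image_cost = images * tokens_per_image * V4_PRO_INPUT_PER_M * image_multiplier
    text_cost = prompt_tokens * V4_PRO_INPUT_PER_M
    return (image_cost + text_cost) / 1_000_000


# Hypothetical volume: one million single-image queries with a 500-token prompt.
for multiplier in (1.0, 2.0, 3.0):
    total = 1_000_000 * input_cost(1, 500, multiplier)
    print(f"image tokens at {multiplier:.0f}x text rate: ${total:,.2f} input cost")
```

Under these assumptions even the 3x case comes to about $2,175 in input cost per million queries. The budget risk lies less in the multiplier itself than in whether the per-image token count turns out to match the estimate.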
5. No Tool-Augmented Vision
A powerful pattern with GPT-4V and Claude is tool-calling on visual input — upload a screenshot of a bug and ask the model to generate a fix, or show a chart and ask the model to query a database for more data. DeepSeek Vision doesn't support this yet. The V4 model family supports tool calling, but not in combination with vision input.
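For reference, this is roughly what a combined vision and tool-calling request looks like in the OpenAI ChatCompletions format. The sketch only builds the request body and sends nothing. The tool name `query_sales_db`, its schema, the prompt, and the model placeholder are hypothetical, and nothing here implies DeepSeek accepts such a request today.

```python
import json


def vision_tool_request(model: str, image_data_url: str) -> dict:
    """Build a ChatCompletions-style body that pairs one image with one tool."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Which quarter dips in this chart? Fetch its rows."},
                    {"type": "image_url", "image_url": {"url": image_data_url}},
                ],
            }
        ],
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "query_sales_db",  # hypothetical tool for this example
                    "description": "Fetch sales rows for one quarter.",
                    "parameters": {
                        "type": "object",
                        "properties": {"quarter": {"type": "string"}},
                        "required": ["quarter"],
                    },
                },
            }
        ],
    }


request = vision_tool_request("your-vision-model", "data:image/png;base64,...")
print(json.dumps(request, indent=2))
```

In a provider that supports the pattern, the model reads the chart, decides which quarter to look up, and responds with a tool call instead of plain text. That round trip is the piece missing from Vision mode.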
6. AI Chip Constraints May Limit Rollout
DeepSeek's reported infrastructure buildout on domestic AI chips (Huawei Ascend, Cambricon) means capacity expansion follows a different trajectory than labs using NVIDIA hardware. If Vision mode proves popular, rollout delays due to infrastructure constraints are likely — the grayscale approach itself signals this.
7. Compliance and Data Privacy Considerations
DeepSeek is a Chinese company subject to Chinese AI regulations. For enterprise deployments in regulated industries, using DeepSeek's hosted services for vision workloads means sending image data to servers in China, where Chinese data sovereignty rules apply. The open-weight models mitigate this for self-hosted deployments, but the chat product routes through DeepSeek's infrastructure. Check your compliance requirements before uploading sensitive images.
The Bottom Line
DeepSeek Vision closes the multimodal gap in the DeepSeek ecosystem. The native architecture is technically interesting (fewer KV cache entries, no separate encoder, MoE routing for vision tokens), and if the pricing holds at text rates once an API ships, it would undercut the vision APIs compared here by a wide margin.
But as of June 2026, Vision mode is a beta feature with no API, no published benchmarks, and uncertain availability. The value proposition for developers today is limited to evaluating the feature manually through the chat interface. The real competitive impact will come when DeepSeek launches API-based vision access with confirmed pricing and published benchmark scores.
For the open-weight ecosystem, this is a significant step. The most capable open-weight model family now has vision capabilities, and those capabilities are built into the same architecture that delivers 1M-token context and $0.09/M input tokens. When the API launches, the multimodal AI pricing landscape shifts.
For a deeper look at DeepSeek V4's architecture, benchmarks, and pricing, see the DeepSeek V4 Preview documentation. For the broader open-weight model landscape, the DeepSeek V4 comparison on MorphLlm offers a useful benchmark reference against Kimi K2.6 and other open-weight contenders.