upd

June 15, 2026

min

Qwen3.7-Plus: The Complete Guide to Alibaba's New Multimodal Agent Model

This guide covers how it compares to Qwen3.7-Max, benchmark results, strengths, weaknesses, and practical ways to use it today.

What Is Qwen3.7-Plus?

Qwen3.7-Plus is Alibaba's multimodal agent model built on top of the Qwen3.7 text backbone. Think of it as Qwen3.7-Max's sibling that learned to read screens, watch video, and look at images, while keeping most of the coding, reasoning, and tool-use abilities that made the Max version popular with developers.

There's one important distinction worth getting right from the start: Qwen3.7-Plus is a perception model, not a generation model. It accepts text, images, and video as input, but it only outputs text. So if you're hoping to use it to generate pictures or video clips, that's not what it's built for, Alibaba has separate image and video generation tools for that. What Qwen3.7-Plus is built for is understanding what's on a screen, reasoning about a scene, reading a diagram, watching a short clip, and then acting on that information — writing code, clicking through an interface, or answering questions about what it just saw.

In Alibaba's own framing, the model is positioned as a "multimodal interactive hybrid agent", one that can perceive real-world scenes, read screens, operate graphical interfaces, and generate code directly from visual references like mockups or screenshots.

Qwen3.7-Plus Specifications at a Glance

Here's the technical snapshot that most developers care about first:

Modalities: text, image, and video input → text output only
Context window: up to 1,000,000 tokens
Maximum output: 65,536 tokens
Internal reasoning budget: reportedly up to 256,000 tokens reserved for chain-of-thought
Base architecture: built on the Qwen3.7 text backbone (same lineage as Qwen3.7-Max)
Licensing: proprietary, API-only — no open-weight release at launch
API availability: Alibaba Cloud Model Studio (DashScope), plus a growing list of third-party model marketplaces

Pricing

List pricing on Alibaba Cloud Model Studio sits around $0.40 per million input tokens and $1.60 per million output tokens, with cached-input pricing reportedly somewhere in the $0.04–$0.08 per million range (different sources cite slightly different cached rates, so it's worth double-checking the live pricing page before you build a cost model).

To put that in perspective: Qwen3.7-Max, the text-only flagship released just weeks before, lists at roughly $2.50 input / $7.50 output per million tokens. That makes Qwen3.7-Plus somewhere around five to six times cheaper, while adding vision and video support on top. A handful of API aggregators list it at slightly different (often discounted) rates, sometimes closer to $0.32 input / $1.28 output, so shop around if you're cost-sensitive.

For agentic workloads specifically, this pricing structure matters a lot. Agents that take screenshots, re-read a growing scratchpad, and loop through tool calls burn through input tokens far faster than output tokens. Cutting the input price by roughly six times is the difference between an agent loop that's economically viable to run at scale and one that's only good for demos.

What Makes Qwen3.7-Plus Different

A Genuine Jump in Visual Understanding

The headline story for this release isn't really about raw intelligence scores, it's about how dramatically the visual reasoning side of the model improved compared to its predecessor. On internal visual-understanding evaluations, Alibaba reports scores roughly 75% higher than the previous-generation Plus model on tasks involving complex scene understanding, and similarly large jumps on multimodal benchmark suites that test combined image-and-text reasoning.

That kind of leap suggests the vision encoder and the language backbone are working together much more tightly than in earlier Qwen multimodal releases, rather than vision being bolted on as an afterthought.

GUI Grounding: Reading and Operating Screens

One of the more practically useful capabilities is GUI grounding, the ability to look at a screenshot and identify the exact pixel location of a button, field, or icon described in natural language. This is the foundation of any agent that needs to "click" things autonomously, whether that's testing a mobile app, automating a repetitive web workflow, or navigating a cloud console.

On the ScreenSpot Pro benchmark, which specifically measures this kind of grounding accuracy, Qwen reports a score in the high 70s, which according to the company's own comparison table puts it ahead of several frontier-tier competitors on this particular task. We'll dig into how to interpret these numbers responsibly in the benchmarks section below.

preserve_thinking: Reasoning That Survives Across Turns

A smaller but technically significant addition is the preserve_thinking API parameter. In long agent loops, a model normally has to "re-derive" its reasoning context every time it's called again — which wastes tokens and can cause it to lose track of a multi-step plan. The preserve_thinking parameter lets Qwen3.7-Plus retain its internal <think> reasoning blocks across conversation turns, so a long-horizon task doesn't reset the model's train of thought every time a tool finishes running.

This mirrors a broader trend across the industry — other major labs have shipped similar mechanisms for carrying reasoning state between turns — but Qwen3.7-Plus offers it at a noticeably lower price point than comparable options.

Code From Screenshots and Mockups

Because the model can interpret images directly, it's been demonstrated generating working code from a screenshot of a UI, a hand-drawn wireframe, or a design mockup. In Alibaba's demo material, the model was shown exploring an unfamiliar codebase on its own, writing a technical specification for it, and separately reconstructing a working app interface purely from a visual reference — all without step-by-step human guidance.

Qwen3.7-Plus Benchmark Results

Before diving into numbers, it's worth setting expectations honestly: most of the benchmark figures circulating right now come directly from Alibaba's own launch materials. That doesn't make them meaningless, but it does mean they should be read as directional signals, not settled facts, especially since some were run with the model's "thinking" mode disabled, which can change results meaningfully.

Where Qwen3.7-Plus Leads (Vendor-Reported)

According to Alibaba's own comparison tables, Qwen3.7-Plus performs particularly well on:

Agentic coding in a terminal environment — outperforming several competing models on Terminal-Bench style evaluations
Mobile navigation tasks — leading on AndroidWorld-style benchmarks that test whether an agent can complete real tasks inside a mobile app
GUI grounding — the ScreenSpot Pro score mentioned earlier, where it's reported ahead of multiple frontier competitors
Long-context retrieval — strong results on multi-needle retrieval tests across a 128K-token context, supporting the model's claims around its million-token window
Tool-server integration — leading scores on benchmarks that test how well a model can work with external tool/MCP-style servers

Where Qwen3.7-Plus Trails

To its credit, Alibaba's own materials don't pretend the model wins everything. On pure-text software engineering benchmarks like SWE-Bench style evaluations, Qwen3.7-Plus sits a few points behind its own text-only sibling, Qwen3.7-Max, and behind a couple of competing reasoning models. On desktop GUI tasks measured by OSWorld-style benchmarks, at least one frontier competitor edges it out as well.

That's a coherent trade-off, honestly: you're getting vision, video, and a roughly six-times lower price tag, and in exchange you give up a small amount of pure-text coding depth compared to the flagship.

Independent Verification

The two data points that don't come from Alibaba itself are worth paying close attention to:

Artificial Analysis placed Qwen3.7-Plus in roughly the top third of all models on its Intelligence Index — described as "well above average" for its price tier, though not at the absolute frontier. The same evaluation flagged the model as comparatively slow (around 50 tokens per second) and notably verbose, generating substantially more output tokens during testing than the typical model in its class.
LM Arena community rankings placed the model in the mid-teens for both text and coding categories, and similarly placed for vision tasks — solid, competitive positioning, but again not class-leading.

The honest summary: independent data suggests Qwen3.7-Plus is a genuinely capable, cost-effective, somewhat slow mid-tier model, not the across-the-board leader that a quick skim of the launch announcement might suggest. For budget-sensitive, vision-heavy, agentic workloads, that combination can still be extremely compelling.

Qwen3.7-Plus vs. Qwen3.7-Max: Which One Should You Use?

Feature	Qwen3.7-Plus	Qwen3.7-Max
Input modalities	Text, image, video	Text only
Output	Text	Text
Context window	~1M tokens	~1M tokens
Input price (per 1M tokens)	~$0.40	~$2.50
Output price (per 1M tokens)	~$1.60	~$7.50
Pure-text coding depth	Slightly behind	Strongest in the family
GUI / visual grounding	Strong	Not applicable
Open weights	No	No

The decision is fairly straightforward once you frame it around your actual workload:

If your agent needs to see anything — a screenshot, a video frame, a UI mockup — Qwen3.7-Plus is the only option in the family that supports that.
If your workload is pure text and code, and squeezing out the last few points of coding benchmark performance matters more than cost, Qwen3.7-Max is the stronger (and pricier) choice.
If you're running high-volume agent loops where input tokens dominate your bill — think browser automation, RPA, or long-document processing — the roughly six-times price gap in Qwen3.7-Plus's favor is hard to ignore.

Practical Use Cases for Qwen3.7-Plus

Browser and Desktop Automation

Combine GUI grounding with the model's agentic tool-use abilities, and you get a foundation for agents that can navigate web apps, fill out forms, click through multi-step workflows, and verify the results — all by looking at screenshots the same way a human would.

Long-Document and Research Agents

With a million-token context window and strong long-context retrieval scores, Qwen3.7-Plus is well-suited to ingesting entire reports, contracts, or technical manuals in a single pass — and because it can also read embedded charts, scanned pages, or diagrams, it can work across mixed text-and-image documents without a separate OCR pipeline.

Mobile App Testing and Navigation

The strong mobile-navigation benchmark results suggest real promise for QA automation — agents that can be handed a task description ("add this item to the cart and check out") and complete it inside a real mobile app by reasoning over what's on screen.

UI-to-Code and Design Handoff

For teams that want to go from a Figma export, a hand-sketched wireframe, or a competitor's screenshot straight to working front-end code, the model's visual-reference coding ability can meaningfully shorten that loop.

Limitations Worth Keeping in Mind

No model is a universal answer, and a few caveats are worth flagging clearly before you commit to Qwen3.7-Plus for a production workload:

It's proprietary and API-only. Unlike many earlier Qwen releases, there are no open weights available at launch. If self-hosting or air-gapped deployment is a hard requirement, this isn't currently an option — and there's no confirmed roadmap for that to change.
It's on the slower side. Independent throughput testing put it noticeably behind faster models in its price class, which matters for latency-sensitive applications.
It can be verbose. The same independent testing found it generates considerably more output tokens than average for equivalent tasks — and since output tokens are the pricier side of the bill, that verbosity can quietly eat into the cost advantage if you don't add response-length constraints.
Vendor benchmarks need independent validation. The standout GUI-grounding numbers were measured on Alibaba's own harness with reasoning disabled. Before wiring this into a critical automation pipeline, it's worth running your own evaluation on your own screenshots and workflows.

How to Start Using Qwen3.7-Plus Today

If you've read this far, you're probably ready to actually try the model rather than read about it. The fastest way to get production-ready access — without setting up a separate Alibaba Cloud account, navigating regional endpoints, or managing a second billing relationship — is through a unified AI API platform.

AI/ML API gives you access to Qwen3.7-Plus alongside hundreds of other leading models through a single API key and a familiar, OpenAI-compatible request format. That means you can drop Qwen3.7-Plus into an existing integration in minutes, compare it side-by-side against other models for your specific use case, and switch between models without rewriting your application logic. If the pricing, multimodal capabilities, and agentic features described above sound like a fit for what you're building, head over to aimlapi.com and start experimenting with Qwen3.7-Plus today.

Frequently Asked Questions

Is Qwen3.7-Plus free to use?

A preview version is available to try for free through Alibaba's web chat interface. For API access at scale, you'll be billed per token through Alibaba Cloud Model Studio or a third-party API provider.

Can Qwen3.7-Plus generate images or video?

No. Despite accepting image and video as input, it only produces text output. It's a perception and reasoning model, not an image or video generator.

Is Qwen3.7-Plus open source?

Not at launch. It ships as a proprietary, API-only model — a notable shift from some of Alibaba's earlier open-weight Qwen releases.

How does Qwen3.7-Plus compare to GPT and Claude models on vision tasks?

On Alibaba's own GUI-grounding and visual-understanding benchmarks, Qwen3.7-Plus is reported ahead of several comparably-priced competitors. Independent rankings place it as a solid, competitive mid-tier performer rather than an outright leader — strong value for the price, but not necessarily the top score on every test.

Example H2

Share with friends

Ready to get started? Get Your API Key Now!

Get API Key