June 13, 2024

Stable Diffusion 3 Medium is out. The most compact image generation AI

Stable Diffusion 3 is here! Discover new image quality, text understanding, and challenges faced by this AI model.

Stable Diffusion 3 Medium Overview

Innovation in Image Quality

Stable Diffusion 3 Medium continues the trend toward lighter AI models, with 2 billion parameters. It is the "Medium" option in the Stable Diffusion 3 lineup, offering notable improvements in color, lighting, and photorealism. Common pitfalls found in other models, particularly in rendering realistic hands and faces, were addressed with new tech such as a 16-channel VAE (Variational Autoencoder), which significantly enhances image detail. What else does it address, and where does it fail? Let's dive in.
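To get a feel for why a 16-channel VAE can retain more detail, it helps to compare latent sizes. The sketch below is back-of-the-envelope arithmetic, not Stability AI's code. It assumes the VAE downsamples each spatial side by a factor of 8 and that earlier Stable Diffusion VAEs use 4 latent channels; neither figure comes from this article, and the function names are made up for illustration.

```python
# Rough latent-size arithmetic for a VAE-based image model.
# Assumptions (not stated in the article): 8x spatial downsampling,
# RGB input, and a 4-channel VAE as the point of comparison.

def latent_shape(height, width, channels, downsample=8):
    """Return (channels, latent_height, latent_width) for an image size."""
    if height % downsample or width % downsample:
        raise ValueError("image size must be divisible by the downsampling factor")
    return (channels, height // downsample, width // downsample)


def compression_ratio(height, width, channels, downsample=8):
    """How many RGB pixel values are squeezed into each latent value."""
    c, h, w = latent_shape(height, width, channels, downsample)
    return (height * width * 3) / (c * h * w)


if __name__ == "__main__":
    for name, ch in (("4-channel VAE", 4), ("16-channel VAE", 16)):
        shape = latent_shape(1024, 1024, ch)
        ratio = compression_ratio(1024, 1024, ch)
        print(f"{name}: latent {shape}, {ratio:.0f} pixel values per latent value")
```

Under these assumptions, a 1024x1024 image is squeezed 48 to 1 with 4 channels but only 12 to 1 with 16 channels, so the latent has more room to carry fine detail.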

Prompt Comprehension Capabilities

The prompt understanding capabilities of Stable Diffusion 3 Medium are impressive. Stability AI's model excels at comprehending long and complex prompts that involve spatial reasoning, compositional elements, actions, and styles. This is achieved by using all three text encoders or a subset of them, allowing users to trade off performance for efficiency based on their specific needs.
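For readers who want to try the encoder trade-off themselves, here is a minimal sketch using the Hugging Face diffusers library. It assumes the gated `stabilityai/stable-diffusion-3-medium-diffusers` checkpoint, a GPU, and that the third encoder is the large T5 model, which diffusers lets you skip by passing `text_encoder_3=None` and `tokenizer_3=None`. None of this is taken from the article; `encoder_overrides` and `load_pipeline` are hypothetical helper names.

```python
# Sketch: load SD3 Medium with or without its third (T5) text encoder.
# Assumptions: diffusers and torch are installed, and you have accepted
# the model license on Hugging Face. Helper names are illustrative only.

def encoder_overrides(use_t5: bool) -> dict:
    """Keyword arguments that keep or drop the third text encoder."""
    if use_t5:
        return {}  # load all three encoders as shipped
    # Passing None tells the pipeline not to load the T5 encoder or tokenizer.
    return {"text_encoder_3": None, "tokenizer_3": None}


def load_pipeline(use_t5: bool = True,
                  model_id: str = "stabilityai/stable-diffusion-3-medium-diffusers"):
    """Build the pipeline; heavy imports stay inside so the helper above is cheap."""
    import torch
    from diffusers import StableDiffusion3Pipeline

    return StableDiffusion3Pipeline.from_pretrained(
        model_id,
        torch_dtype=torch.float16,
        **encoder_overrides(use_t5),
    )


if __name__ == "__main__":
    print(encoder_overrides(use_t5=False))
    # pipe = load_pipeline(use_t5=False).to("cuda")
    # image = pipe("a futuristic street market at twilight").images[0]
```

Dropping the third encoder should cut memory use, likely at some cost to how well long, detailed prompts are followed, so it is worth comparing both settings on your own prompts.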


Create a panoramic illustration of a bustling futuristic cityscape at twilight. 
In the foreground, depict a busy street market filled with diverse vendors selling exotic goods from different alien cultures under floating, holographic canopies. The market is illuminated by bioluminescent plants and neon signs in various alien languages.
To the left, show a towering, spiral skyscraper with transparent walls, revealing multiple floors of activity, including a high-tech lab on one floor and a luxurious dining area on another. Position a sleek, hovering tram emerging from a tunnel at the base of the skyscraper.


The output for a complex spatial prompt of a futuristic city

Stable Diffusion 3 Medium leverages the Diffusion Transformer architecture to achieve unprecedented text quality. This results in fewer errors in spelling, kerning, letter formation, and spacing, making it a robust tool for generating images that contain text with high precision.

In practice, its text generation capabilities are pretty good, which is good to see in a lightweight model like this one.


Generate a high-resolution image of a bustling city street at night, illuminated by neon signs. Prominently display a large, multi-colored neon sign on a building that reads 'Welcome to the Future' in a stylized font. Ensure that the text is sharp, legible, and seamlessly integrated into the environment. Additionally, include smaller signs and advertisements in the background with varied fonts and languages, adding to the complexity and authenticity of the urban setting.

We deliberately made the prompt long and detailed to push the model to its limits. Out of 4 images, this is the best result, and we are satisfied with it:

"Welcome to the future" neon sign in a busy city

These capabilities make Stable Diffusion 3 Medium a powerful tool for users who need to generate images from complex text prompts, offering a level of detail and accuracy that sets it apart from other models in the market.

Criticisms and Challenges

Rendering Human Anatomy

Stable Diffusion 3 Medium, despite its impressive advancements in image quality, has faced significant criticism for its handling of human anatomy. Users have reported that the AI model often generates images of humans with distorted limbs and unnatural body parts. These issues are particularly noticeable in the rendering of hands, feet, and entire human bodies.

Image for prompt "girl lying on the grass" by Reddit user Weak_Ad4569

The root cause of these anatomical failures is attributed to Stability AI's decision to filter out NSFW content from the training data. This heavy censorship reportedly led to a lack of accurate human anatomy depiction in the generated images, causing significant issues in rendering human features.

This problem is not new to the Stable Diffusion series. An earlier release, Stable Diffusion 2.0, faced similar issues with rendering human anatomy due to the removal of adult content from the training dataset. Stability AI had to address these challenges with subsequent releases like SD 2.1 and SDXL.

In conclusion, Stable Diffusion 3 Medium stands as a notable advancement in the realm of AI-driven image generation, offering substantial improvements in color fidelity, lighting, and photorealism through its innovative 16-channel VAE technology. Its commendable prompt comprehension capabilities make it a versatile tool for generating complex and detailed images, setting it apart in terms of precision and text quality. However, the model continues to grapple with significant challenges in accurately rendering human anatomy, a persistent issue stemming from the exclusion of NSFW content in its training data. While it excels in many areas, these anatomical distortions remain a significant hurdle to overcome, and we'll have to see whether the Stability dev team can tweak the model to solve this in the coming weeks.

