Z-Image-Turbo: open-source, fast, photorealistic text-to-image
Built by the Alibaba Tongyi team: a 6B single-stream DiT distilled with Decoupled-DMD + DMDR. Delivers high-fidelity images in 8 steps, renders Chinese/English text reliably, Apache-2.0 for commercial use, and runs comfortably on 16GB GPUs.
Overview & model family
Z-Image focuses on “high quality + low friction”: Turbo delivers fast 8-step generation; Edit handles instruction-driven edits on images; Base leaves headroom for deep finetuning.
Z-Image-Turbo
Distilled edition · 8-step sampler · bilingual. Strong at photorealism, portraits, products, and Chinese/English text rendering; suited to daily use on a single 16GB GPU.
Z-Image-Edit
Editing edition with continued training: supports img2img, inpainting, background swaps while keeping subjects consistent. Shares the same backbone as Turbo, so deployment stays identical.
In-browser playground
Try the 8-step pipeline instantly—no installs or setup. Tweak the prompt, width, and height to match your use case and preview results in the live UI.
Install in minutes (CUDA or Apple Silicon)
CLI quickstart in 5 steps
conda create -n zimage python=3.11 -y
conda activate zimage
pip install torch torchvision torchaudio # pick CUDA or MPS wheels
pip install "modelscope>=1.18.0" pillow
python - <<'PY'
import torch
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# Pick the best available device: CUDA, then Apple MPS, then CPU
dev = "cuda" if torch.cuda.is_available() else (
    "mps" if torch.backends.mps.is_available() else "cpu")

pipe = pipeline(Tasks.text_to_image_synthesis,
                model="Tongyi-MAI/Z-Image-Turbo",
                device=dev)

img = pipe({"text": "a photorealistic cat on MacBook, 4k",
            "width": 768, "height": 768,
            "num_inference_steps": 8})["output_imgs"][0]
img.save("zimage_test.png")
PY
ComfyUI drop-in
- Download z_image_turbo_bf16.safetensors and drop it into ComfyUI/models/checkpoints.
- Add the Qwen text encoder & Flux VAE as in the official workflow.
- Use 8–9 steps with guidance_scale=0. Start at 1024×1024; on 8–12GB cards use 768×768.
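The sampler guidance above fits in a small helper. This is a sketch, not part of any official API: `pick_settings` is a hypothetical name, and the VRAM threshold simply restates the rule of thumb (768×768 on 8–12GB cards, 1024×1024 above that).

```python
def pick_settings(vram_gb: float) -> dict:
    """Illustrative sampler settings for Z-Image-Turbo per the guidance above."""
    side = 1024 if vram_gb > 12 else 768  # 8-12GB cards: start at 768x768
    return {
        "steps": 8,             # Turbo is distilled for 8-9 steps
        "guidance_scale": 0.0,  # CFG is baked in by distillation
        "width": side,
        "height": side,
    }

print(pick_settings(24))  # e.g. a 24GB card -> 1024x1024
print(pick_settings(10))  # e.g. a 10GB card -> 768x768
```

Feed the resulting values into whichever frontend you use (ComfyUI sampler node, modelscope pipeline, etc.).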
Quick APIs
- fal.ai: /fal-ai/z-image/turbo, around $0.005 per megapixel.
- Replicate: prunaai/z-image-turbo, roughly 4s on H100.
- HF / ModelScope Spaces: try it free in the browser.
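At the quoted fal.ai rate of roughly $0.005 per megapixel, per-image cost is easy to estimate. The helper below is purely illustrative arithmetic, not an official pricing client:

```python
def fal_cost_usd(width: int, height: int, rate_per_mp: float = 0.005) -> float:
    """Estimated cost of one image at the quoted ~$0.005/megapixel rate."""
    megapixels = width * height / 1_000_000
    return megapixels * rate_per_mp

print(fal_cost_usd(1024, 1024))  # cost for one 1024x1024 image
print(fal_cost_usd(768, 768))    # cost for one 768x768 image
```

So a 1024×1024 render costs about half a cent; verify current rates on the provider's pricing page before budgeting a batch job.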
On Apple Silicon, confirm that torch.backends.mps.is_available() returns True; set PYTORCH_ENABLE_MPS_FALLBACK=1 if an operator is unsupported on MPS. Stay plugged in and keep the machine cool; short bursts of full load are normal.
What it does
Photorealism
Natural portraits, products, and scenes; skin texture and lighting hold up.
Chinese/English text
Signs, posters, road labels, and UI text stay legible—great for Chinese layouts and e-commerce KV.
Prompt fidelity
Understands prompts in both languages; layouts, elements, and colors are controllable.
Fast at low steps
High quality in 8 steps; sub-second on H800, so iteration feels instant.
Stable structure
Single-stream DiT keeps composition and perspective steady, reducing distortion.
Editing ready
Pair with Z-Image-Edit for inpainting or background swaps while keeping the subject consistent.
Use cases
E-commerce & branding
Batch product hero shots, campaign KV, Chinese copy out of the box.
Content creation
Covers, illustrations, thumbnails for Bilibili/WeChat, fast multi-version iterations.
Design & prototyping
Race from sketch to polished render; supports mixed Chinese/English mockups.
Teaching & research
Examples for low-step distillation, single-stream DiT, RL + DMD studies.
Private deployment
Apache-2.0 license; run on your own network so assets stay in-house.
ComfyUI workflows
Drag-and-drop nodes, 8-step quality, quick reference drafts.
FAQ (quick)
Is it OK for commercial use?
Yes—Apache-2.0. Keep the notices and follow local laws and platform rules.
Do I need negative prompts?
Turbo distillation bakes CFG in; we recommend guidance_scale=0 so you can focus on the positive prompt.
Can it run on 8GB VRAM?
Drop resolution to 640×640 or 768×768 and enable offload/low precision. For high-res batches, 16GB is more comfortable.
Best prompt recipe?
Subject + key details + lighting + lens/style + text content (if needed). Example:
“Night street, a vintage sign reading ‘Old Teahouse’ in gold 3D Chinese characters, neon glow, photoreal 4K.”
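The recipe can be expressed as a tiny prompt builder. The function name and field order are only an illustration of the formula above (subject + key details + lighting + lens/style + optional text):

```python
def build_prompt(subject, details, lighting, style, text=None):
    """Assemble a prompt as: subject + key details + lighting + lens/style (+ text)."""
    parts = [subject, details, lighting, style]
    if text:
        # Quoting the exact string tends to help text rendering
        parts.append(f"a sign reading '{text}'")
    return ", ".join(p for p in parts if p)

print(build_prompt("night street", "vintage gold 3D Chinese sign",
                   "neon glow", "photoreal 4K", text="Old Teahouse"))
```

Keeping the fields in one fixed order makes A/B iterations easier: swap one slot at a time and compare results.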
How does it compare to Flux / SDXL?
Z-Image-Turbo is faster, steadier for Chinese text, and smaller; Flux/SDXL have larger ecosystems and richer style options.
Long prompts?
The online demo defaults to ~512 tokens; locally you can set max_sequence_length=1024 for longer descriptions.