What Are CLIP and T5xxl? How Text Encoders Can Make Illustrations Stunning!

Anime illustration of a sparrow eating bread on a girl's hand 5.png (1600×1600)
  • CLIP is the foundational technology for image generation.
  • T5xxl enhances prompt comprehension.
  • Enhanced text encoders are publicly available and ready to use.

Introduction

Hello, I’m Easygoing.

Today, let’s explore the role of text encoders in image generation AI.

Anime illustration of a sparrow eating bread on a girl's hand 7.png (2576×2576)

Text Encoders as Dictionaries

AI translates the text we input into a machine-readable format.


flowchart LR
subgraph Prompt
A1(Text)
end
subgraph Text Encoder
B1(Words)
B2(Tokens)
B3(Vectors)
end
subgraph Transformer / UNET
C1(Generate Image)
end
A1-->B1
B1-->B2
B2-->B3
B3-->C1

Text encoders act as dictionaries that translate human language into machine language.
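To make this concrete, here is a minimal sketch of the text → tokens → vectors steps using the Hugging Face transformers library. This is only a stand-in for what ComfyUI and other UIs do internally; the model ID and prompt are examples, not the article's actual setup.

```python
# Minimal sketch: prompt -> tokens -> vectors with a CLIP-L text encoder.
# The model ID and prompt are illustrative examples.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

model_id = "openai/clip-vit-large-patch14"
tokenizer = CLIPTokenizer.from_pretrained(model_id)
text_encoder = CLIPTextModel.from_pretrained(model_id)

prompt = "a sparrow eating bread on a girl's hand"

# Text -> tokens: the prompt is split into integer token IDs
tokens = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt")
print(tokens.input_ids.shape)        # torch.Size([1, 77]) -- CLIP-L's fixed 77-token window

# Tokens -> vectors: one embedding per token, which the UNET / Transformer consumes
with torch.no_grad():
    vectors = text_encoder(**tokens).last_hidden_state
print(vectors.shape)                 # torch.Size([1, 77, 768])
```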

But how much do they affect image quality in image generation AI?

Let’s Compare Actual Images!

Let’s see how changing the text encoders affects the results in the Flux.1 image generation AI.

Flux.1 uses two types of text encoders:


flowchart LR
subgraph Input
X1(Prompt)
end
subgraph Flux.1
D1(CLIP-L)
D2(T5xxl)
D3(Transformer)
end
X1-->D1
X1-->D2
D2-->D1
D1-->D3

  • T5xxl: Understands the context of prompts
  • CLIP-L: Converts words into vectors

For this experiment, we’ll replace T5xxl and CLIP-L with more precise versions.
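For readers who work in the diffusers library rather than ComfyUI, the swap looks roughly like the sketch below. The repository IDs are placeholders and assumptions; point them at the enhanced checkpoints introduced later in this article.

```python
# Rough sketch of swapping Flux.1's text encoders in diffusers (the article itself uses ComfyUI).
# Repo IDs are placeholders -- substitute the enhanced checkpoints you actually downloaded.
import torch
from transformers import CLIPTextModel, T5EncoderModel
from diffusers import FluxPipeline

clip_l = CLIPTextModel.from_pretrained(
    "openai/clip-vit-large-patch14", torch_dtype=torch.bfloat16   # replace with an enhanced CLIP-L
)
t5xxl = T5EncoderModel.from_pretrained(
    "google/flan-t5-xxl", torch_dtype=torch.bfloat16              # Flan-T5xxl's encoder half
)

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    text_encoder=clip_l,      # the CLIP-L slot
    text_encoder_2=t5xxl,     # the T5xxl slot
    torch_dtype=torch.bfloat16,
)
```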

T5xxl-FP16 + CLIP-L-FP16

T5xxl-FP16.png (1440×1440)
Original

Flan-T5xxl-FP16 + CLIP-L-FP16

Flan-FP16.png (1440×1440)
Improved prompt effectiveness

Flan-T5xxl-FP32 + CLIP-L-FP32

Flan_FP32_CLIP-L_FP32.png (1440×1440)
Enhanced image quality

Flan-T5xxl-FP32 + CLIP-GmP-ViT-L-14-FP32

Flan_FP32_CLIP-GmP_L_FP32.png (1440×1440)
Better background detail

Flan-T5xxl-FP32 + Long-CLIP-GmP-ViT-L-14-FP32

Flan_FP32_LongCLIP-GmP_L_FP32.png (1440×1440)
Even more detailed background

Note:

  • The Long-CLIP-L model can only be used with ComfyUI, not Stable Diffusion WebUI Forge.
  • To use the FP32 text encoder, the --fp32-text-enc setting mentioned later is required.

The lower you go in this list, the higher the performance of the text encoders.

Changing the text encoders clearly improves image quality, particularly in the detailed rendering of buildings on the right-hand side of the images.

Understanding Text Encoders in Depth

Now, let’s dive deeper into the text encoders.

Here’s how major image generation AIs integrate their text encoders:


flowchart LR
subgraph Input
X1(Prompt)
end
subgraph Image Generation
Y1(UNET)
Y2(Transformer)
end
subgraph Flux.1
D1(CLIP-L)
D2(T5xxl)
end
subgraph Stable Diffusion 3
C1(CLIP-L)
C2(CLIP-G)
C3(T5xxl)
end
subgraph Stable Diffusion XL
B1(CLIP-L)
B2(CLIP-G)
end
subgraph Stable Diffusion 1
A1(CLIP-L)
end
X1-->A1
X1-->B1
X1-->B2
X1-->C1
X1-->C2
X1-->C3
X1-->D1
X1-->D2
C3-->C1
C3-->C2
D2-->D1
A1-->Y1
B1-->Y1
B2-->Y1
C1-->Y2
C2-->Y2
D1-->Y2

T5xxl and CLIP act as the text encoders, while UNET and Transformer handle the actual image generation.

CLIP: The Foundation of Everything

CLIP, developed by OpenAI, is the foundational method for associating images and text.

CLIP comes in different types based on performance:

| Model Name | Release | Parameters | Max Tokens | Comprehensible Text |
| --- | --- | --- | --- | --- |
| CLIP-B (Base) | November 2021 | 149 million | 77 | Words & short sentences |
| CLIP-L (Large) | January 2022 | 355 million | 77 | Words & short sentences |
| Long-CLIP-L | April 2024 | 355 million | 248 | Long sentences |
| CLIP-G (Giant) | January 2023 | 750 million | 77 | Long sentences |

The LAION-5B dataset, which contains roughly 5 billion captioned images, was filtered using CLIP-B to match images with their captions.

Most image generation AIs today use CLIP-L, while Long-CLIP-L extends its capabilities for longer prompts.

CLIP-G improves overall performance and handles longer, more natural prompts.
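As a quick illustration of the 77-token ceiling shown in the table above, the sketch below (model ID and prompt are just examples) shows CLIP-L's tokenizer silently truncating anything beyond 77 tokens; many UIs work around this by splitting long prompts into multiple 77-token chunks.

```python
# Sketch of CLIP-L's 77-token window: anything past it is simply cut off.
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
print(tokenizer.model_max_length)      # 77

long_prompt = "anime illustration of a sparrow " + "with finely detailed feathers " * 30
ids = tokenizer(long_prompt, truncation=True, return_tensors="pt").input_ids
print(ids.shape[1])                    # 77 -- the rest of the prompt was dropped
```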

T5xxl: Understanding Context

T5xxl, developed by Google, is a text-to-text model: it takes text as input and generates text as output.
Models in this family underpin many language AI services, such as chatbots and translation systems.

T5xxl can theoretically handle very long sentences, though its accuracy decreases with length.

| Model Name | Release | Parameters | Max Tokens | Comprehensible Text |
| --- | --- | --- | --- | --- |
| T5xxl | October 2020 | 11 billion | 32,000 | Long sentences & context |
| T5xxl v1.1 | June 2021 | 11 billion | 32,000 | Long sentences & context |
| Flan-T5xxl | October 2022 | 11 billion | 32,000 | Long sentences & context |

T5xxl v1.1 and Flan-T5xxl are successive refinements of the original, with Flan-T5xxl in particular gaining accuracy from additional instruction fine-tuning.
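Because image generators only need the encoder half of T5, they load it as a standalone encoder. The sketch below uses the small Flan-T5 variant purely to keep the example light; the XXL version works the same way.

```python
# Sketch: using Flan-T5 as a pure text encoder (only the encoder half is needed).
# flan-t5-small stands in for Flan-T5xxl here to keep the example lightweight.
import torch
from transformers import AutoTokenizer, T5EncoderModel

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")
encoder = T5EncoderModel.from_pretrained("google/flan-t5-small")

prompt = "a sparrow perched on a girl's hand, eating bread, soft morning light"
tokens = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    embeddings = encoder(**tokens).last_hidden_state
print(embeddings.shape)    # (1, sequence_length, hidden_size) -- context-aware vectors
```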

Text Encoders Are Multiplying

Newer image generation AIs now incorporate multiple text encoders to enhance accuracy.

Stable Diffusion 1: Word-Based Understanding


flowchart LR
subgraph Input
X1(Prompt)
end
subgraph Stable Diffusion 1
A1(CLIP-L)
Y1(UNET)
end
X1-->A1
A1-->Y1

Released in July 2022, Stable Diffusion 1 used CLIP-L as its sole text encoder.
Due to CLIP-L’s limited token capacity, users had to structure prompts as short keywords and place important ones at the start.

Stable Diffusion XL: Understanding Longer Prompts


flowchart LR
subgraph Input
X1(Prompt)
end
subgraph Stable Diffusion XL
B1(CLIP-L)
B2(CLIP-G)
Y1(UNET)
end
X1-->B1
X1-->B2
B1-->Y1
B2-->Y1

Stable Diffusion XL, launched in July 2023, added CLIP-G alongside CLIP-L.
CLIP-G improved prompt comprehension, enabling users to write longer natural-language prompts.

Of the model's roughly 7 GB total size, about 1.8 GB is devoted to the text encoders, which highlights how significant they have become.

Stable Diffusion 3: Contextual Understanding


flowchart LR
subgraph Input
X1(Prompt)
end
subgraph Stable Diffusion 3
C1(CLIP-L)
C2(CLIP-G)
C3(T5xxl)
Y2(Transformer)
end
X1-->C1
X1-->C2
X1-->C3
C3-->C1
C3-->C2
C1-->Y2
C2-->Y2

In June 2024, Stable Diffusion 3 introduced three text encoders: CLIP-L, CLIP-G, and T5xxl.
This setup improved its ability to understand context.

T5xxl is powerful but large, requiring 9GB even in compressed FP16 format.

Anime illustration of a sparrow flying over the suburbs at dusk

The growing size of these encoders is why Stable Diffusion 3 began distributing the text encoders separately from the main model.

Flux.1: Lacking CLIP-G


flowchart LR
subgraph Input
X1(Prompt)
end
subgraph Flux.1
D1(CLIP-L)
D2(T5xxl)
D3(Transformer)
end
X1-->D1
X1-->D2
D2-->D1
D1-->D3

Released in August 2024, Flux.1 uses CLIP-L and T5xxl but does not include CLIP-G.
This may be because T5xxl covers much of CLIP-G’s functionality.

SD3.5 prompt_adherence graph.png (2500×1473)
https://stability.ai/news/introducing-stable-diffusion-3-5

Stability AI claims that Stable Diffusion 3.5 surpasses Flux.1 in language comprehension.

However, in practice, Flux.1 rarely struggles with prompt comprehension, even without CLIP-G.

Even older CLIP-G models handle long prompts adequately.

Enhanced Text Encoders!

Now, let’s explore the links to the enhanced text encoders used in this experiment.

Enhanced CLIP-L

CLIP-GmP-ViT-L-14

CLIP-GmP-ViT-L-14, developed by Zer0int, is a fine-tuned version of CLIP-L, released free of charge.

The developer says they built it simply because they love CLIP, and trained it on a single RTX 4090 in their home PC.

CLIP-GmP-ViT-L-14 improves the accuracy of CLIP-L using Geometric Parametrization (GmP), reaching roughly 90% accuracy on ImageNet/ObjectNet benchmarks compared with about 85% for the original CLIP-L.

Improvements to CLIP-GmP-ViT-L-14.png (1820×1731)
zer0int/CLIP-fine-tune: Fine-tuning code for CLIP models

According to Zer0int, CLIP-GmP-ViT-L-14 addresses CLIP-L's tendency to focus excessively on certain features when interpreting an image.

The Hugging Face page for CLIP-GmP-ViT-L-14 includes both the original FP32 version and an improved FP16 version, ViT-L-14-BEST-smooth-GmP-TE-only-HF-format.safetensors.

Screenshot of CLIP-GmP-ViT-L-14 download page with comment.png (3780×2260)

If you’re unsure which to choose, the FP16 version is a good starting point.

Long-CLIP-GmP-ViT-L-14 (ComfyUI Only)

This model extends the standard CLIP-L’s token limit from 77 to 248, allowing it to handle longer prompts.

Currently, it is only compatible with ComfyUI and cannot be used in Stable Diffusion WebUI Forge.

Screenshot of Long-CLIP-GmP-ViT-L-14 download page with comment.png (3780×2232)

The download page offers the original FP32 version, as well as the performance-enhanced FP16 version, Long-ViT-L-14-BEST-GmP-smooth-ft.safetensors.


Flan-T5xxl (Enhanced T5xxl)

Next, we have Flan-T5xxl, a fine-tuned version of the standard T5xxl.

Flan-T5xxl Original (Segmented Version)

Google’s original Flan-T5xxl is divided into segments due to its large size (44GB for FP32).

Flan-T5xxl Fused Version

This version consolidates the segments for easier use in image generation AI.

FP32 and FP16 formats are available.

Flan-T5xxl GGUF Version (Lightweight)

For instructions on using the GGUF format, refer to the previous article.

Screenshot from flan t5xxl gguf download site English Comment.png (3777×2295)
Screenshot of download page of flan t5xxl gguf download site English comment.png (3795×2260)

When downloading the GGUF version, select a model suitable for your PC’s specifications from the available options on the right-hand side of the Hugging Face model download page.

flan-t5-xxl-gguf Model Selection Page

The Flan-T5xxl models introduced here are non-distilled, making them significantly larger in size.

However, in my environment with ComfyUI, 64GB RAM, and 16GB VRAM, I was able to use the FP32 version of Flan-T5xxl without any issues.

File Placement

Place the downloaded files in the following directory:

InstallFolder/Models/CLIP

Anime illustration of a sparrow flying over the suburbs at dusk 6.png (2576×2576)

When using these models, select them as replacements for T5xxl and CLIP-L during setup.
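If you prefer to fetch the files from a script rather than through the browser, a minimal sketch with huggingface_hub is shown below. The repository ID, file name, and install path are placeholders; replace them with the actual entries from the download pages above and your own install folder.

```python
# Sketch: downloading a text encoder straight into the ComfyUI models folder.
# repo_id, filename, and the target directory are placeholders -- adjust them
# to the actual entries on the Hugging Face pages linked above.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="your-namespace/flan-t5-xxl-gguf",      # placeholder repo ID
    filename="flan-t5-xxl-Q8_0.gguf",               # placeholder file name
    local_dir="InstallFolder/Models/CLIP",          # the folder mentioned above
)
print("Saved to:", path)
```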

Using FP32 Text Encoders

Text encoders run in FP16 format by default.

To process them in FP32 instead, add the --fp32-text-enc option when launching ComfyUI.

Screenshot of enabling --fp32-text-enc in Stability Matrix with comment.png (1144×1866)

Example of the setting in Stability Matrix

With this option enabled, all text encoders are processed in FP32, even those whose weights are stored in FP16. Since the encoding step usually finishes within a few seconds, the extra cost is not a significant issue.
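For reference, outside ComfyUI the same idea is simply a matter of loading the text encoder weights in FP32. A minimal sketch (the model ID is only an example) might look like this:

```python
# Sketch: keeping a text encoder in FP32 instead of the usual FP16.
# In ComfyUI, the --fp32-text-enc launch option achieves the same effect.
import torch
from transformers import CLIPTextModel

text_encoder = CLIPTextModel.from_pretrained(
    "openai/clip-vit-large-patch14",    # example ID -- substitute an enhanced FP32 checkpoint
    torch_dtype=torch.float32,          # keep full precision
)
print(next(text_encoder.parameters()).dtype)   # torch.float32
```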

SDXL and the Difficulty of Changing Text Encoders

In this experiment, we upgraded the text encoders in Flux.1.
Since Flux.1 and SD 3.5 separate text encoders from other components, upgrading is straightforward.

Anime illustration of a sparrow flying over the suburbs at dusk 1.png (2579×2579)

However, SDXL and SD 1.5 integrate the text encoder with the model, making upgrades significantly more challenging.
We’ll explore this topic in detail in a future article.

Conclusion: Upgrade Your Text Encoder!

  • CLIP is the foundational technology for image generation.
  • T5xxl enhances prompt comprehension.
  • Enhanced text encoders are publicly available and ready to use.
Sparrows on a wet road

When thinking about improving the quality of image generation, we often focus on the transformer that generates the images and neglect the text encoder.

However, this experiment demonstrated that text encoders significantly affect image quality.

Anime illustration of a sparrow flying over the suburbs at dusk 3.png (2579×2579)

The enhanced CLIP-L introduced in this article offers notable image quality improvements despite its relatively small size.

We highly recommend giving it a try!

Thank you for reading until the end!