Is Your Illustration High-Quality? Comparing T5xxl and CLIP-L with Real Data!

Anime illustration of a girl with white hair and blue eyes wearing an orange ski outfit with a blackboard with “Flan T5xxl” written on her chest looking at us with a smile on a ski slope.png (1600×1600)
  • A new image comparison tool is available.
  • Improved CLIP-L in FP32 format is highly recommended.
  • Choose Flan-T5xxl based on your system RAM capacity.

Introduction

Hello, this is Easygoing.

Today, we’ll explore methods to compare illustrations in terms of quality.

Theme: Winter Sports

The theme this time is winter sports.

Anime illustration of a girl with brown hair and blue eyes looking at you with a smile on a ski slope at sunset.png (2568×2568)

Let’s create illustrations that depict the joy of spending time with friends at a ski resort.

Image Difference Checker

To objectively evaluate differences between images, I created a new web tool.

You can find it here:

This tool allows you to input two images and provides the following outputs:

T5xxl_FP16_vs_Flan-T5xxl-FP32.png (2065×2710)
  • Difference Map (in color and black-and-white)
  • Mean Absolute Error (MAE): Measures the average pixel-wise difference.
  • Structural Similarity Index (SSIM): Evaluates structural similarity.

By comparing these outputs, you can objectively assess the differences between two images.

However, note that this tool only detects "differences," so determining which illustration is better ultimately requires human judgment.
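
For reference, here is a minimal Python sketch of the two metrics the checker reports, using Pillow, NumPy, and scikit-image (0.19 or later). The web tool's own implementation may differ, and the file names and the similarity formulas in the comments are assumptions on my part.

import numpy as np
from PIL import Image
from skimage.metrics import structural_similarity

def compare_images(path_a: str, path_b: str) -> None:
    a = np.asarray(Image.open(path_a).convert("RGB"), dtype=np.float32)
    b = np.asarray(Image.open(path_b).convert("RGB"), dtype=np.float32)

    mae = np.abs(a - b).mean()                                # average per-pixel difference, 0-255 scale
    ssim = structural_similarity(a, b, channel_axis=-1, data_range=255)

    diff = np.abs(a - b).astype(np.uint8)
    Image.fromarray(diff).save("diff_color.png")              # color difference map
    Image.fromarray(diff.max(axis=-1)).save("diff_gray.png")  # black-and-white difference map

    # The similarity percentages in the tables below look consistent with
    # 1 - MAE/255 and SSIM x 100, but treat that as an assumption.
    print(f"MAE : {mae:.2f}  (similarity ~ {100 * (1 - mae / 255):.1f} %)")
    print(f"SSIM: {ssim:.2f}  (similarity ~ {100 * ssim:.1f} %)")

compare_images("image_A.png", "image_B.png")                  # placeholder file names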

A Practical Comparison

Let’s use this tool to compare illustrations. Since we’ve been discussing text encoders recently, we’ll continue along that theme.


flowchart LR
subgraph Input
X1(Prompt)
end
subgraph Flux.1
D1(CLIP-L)
D2(T5xxl)
D3(Transformer)
end
X1-->D1
X1-->D2
D2-->D1
D1-->D3

Flan-T5xxl: The Latest T5xxl Version

  • An upgraded version of T5xxl_v1.1.
  • Enhanced through instruction-tuning with prompts and answers.
  • Expected to improve prompt comprehension.

Our first comparison involves Flan-T5xxl. Flan-T5xxl builds upon T5xxl by adding instruction-based training to further enhance its performance.

Given that image-generation prompts are essentially commands, this upgrade is expected to improve prompt fidelity.

Flan-T5xxl-FP32 (32-bit)

Flan-T5xxl-FP32_LongCLIP-SAE-ViT-L-14-FP32.png (1440×1440)

Flan-T5xxl-FP32 is the highest-precision version available online, capable of producing incredibly detailed illustrations.

This serves as the baseline for our comparisons. The black image on the right is the color difference map; since the baseline is compared against itself, it shows no differences.

Flan-T5xxl-FP16 (16-bit)

Flan-T5xxl-FP16_LongCLIP-SAE-ViT-L-14-FP32.png (1440×1440) Color-Diff_Flan-T5xxl-FP16_LongCLIP-SAE-ViT-L-14-FP32_Flan-T5xxl-FP32_LongCLIP-SAE-ViT-L-14-FP32.png (1440×1440)

There are minimal differences compared to FP32.

Flan-T5xxl-Q8_0.gguf (8-bit)

Flan-T5xxl-Q8_0.gguf_LongCLIP-SAE-ViT-L-14-FP32.png (1440×1440) Color_Diff_Flan-T5xxl-Q8_0.gguf_LongCLIP-SAE-ViT-L-14-FP32_Flan-T5xxl-FP32_LongCLIP-SAE-ViT-L-14-FP32.png (1440×1440)

The illustration is quite similar, but there’s a subtle difference in the depiction of the right hand's fingers.

Flan-T5xxl-Q5_K_M (5-bit)

Flan-T5xxl-Q5_K_M.gguf_LongCLIP-SAE-ViT-L-14-FP32.png (1440×1440) Color_Diff_Flan-T5xxl-Q5_K_M.gguf_LongCLIP-SAE-ViT-L-14-FP32_Flan-T5xxl-FP32_LongCLIP-SAE-ViT-L-14-FP32.png (1440×1440)

Q5_K_M shows noticeable differences across the whole image, with visible degradation.

Flan-T5xxl-Q3_K_L (3-bit)

Flan-T5xxl-Q3_K_L.gguf_LongCLIP-SAE-ViT-L-14-FP32.png (1440×1440) Color_Diff_Flan-T5xxl-Q3_K_L.gguf_LongCLIP-SAE-ViT-L-14-FP32_Flan-T5xxl-FP32_LongCLIP-SAE-ViT-L-14-FP32.png (1440×1440)

At Q3_K_L compression, the text and other details change significantly.

Results: Graph Representation

Flan-T5xxl MAE and SSIM Similarity.png (1200×848)
Flan-T5xxl   Size (GB)   MAE     MAE Similarity   SSIM   SSIM Similarity
FP32         45.2         0.00   100.0 %          1.00   100.0 %
FP16         22.6         0.96    99.6 %          1.00    99.9 %
Q8_0         11.8         1.26    99.5 %          1.00    99.8 %
Q6_K          9.2         1.57    99.4 %          1.00    99.7 %
Q5_K_M        8.0         4.62    98.2 %          0.98    98.4 %
Q4_K_M        6.9         9.08    96.5 %          0.95    95.2 %
Q3_K_L        5.7        17.11    93.3 %          0.85    84.9 %
Q2_K          4.1        11.93    95.3 %          0.94    93.6 %

Flan-T5xxl shows performance corresponding to its model size.

Degradation becomes evident below Q6_K, so it’s advisable to use Q6_K or higher for optimal results.

T5xxl_v1.1: Flux.1’s Default T5xxl

  • A large-scale language model developed by Google.
  • Standard in Flux.1 and SD 3.5.
  • Designed to comprehend prompt contexts.

Next, let’s evaluate T5xxl_v1.1. Although an older version compared to Flan-T5xxl, it’s still widely used and warrants a closer look.

T5xxl_v1.1-FP32

T5xxl-v1_1_original-FP32_LongCLIP-SAE-ViT-L-14-FP32.png (1440×1440)

The original FP32 version is publicly available on Google’s Hugging Face page.

T5xxl_v1.1-FP16

T5xxl-v1_1_original_FP16_LongCLIP-SAE-ViT-L-14-FP32.png (1440×1440) Color_Diff_T5xxl-v1_1_FP16_LongCLIP-SAE-ViT-L-14-FP32_T5xxl-v1_1_original-FP32_LongCLIP-SAE-ViT-L-14-FP32.png (1440×1440)

The FP16 version, distributed via ComfyUI’s Hugging Face page, is likely the most commonly used in Flux.1.

Compared to FP32, there are minor differences.

T5xxl_v1.1-FP8e4m3fn_scaled (8-bit)

T5xxl-v1_1_FP8e4mefn_scaled_LongCLIP-SAE-ViT-L-14-FP32.png (1440×1440) Color_Diff_T5xxl-v1_1_FP8e4mefn_scaled_LongCLIP-SAE-ViT-L-14-FP32_T5xxl-v1_1_original-FP32_LongCLIP-SAE-ViT-L-14-FP32.png.png (1440×1440)

This version, also from ComfyUI’s page, refines the FP8e4m3 format for improved performance.

T5xxl_v1.1-FP8e4m3fn (8-bit)

T5xxl-v1_1_FP8e4mefn_LongCLIP-SAE-ViT-L-14-FP32.png (1440×1440) Color_DiffT5xxl-v1_1_FP8e4mefn_LongCLIP-SAE-ViT-L-14-FP32_T5xxl-v1_1_original-FP32_LongCLIP-SAE-ViT-L-14-FP32.png (1440×1440)

The simpler FP8e4m3 format shows noticeable degradation.

T5xxl_v1.1-Q8_0.gguf (8-bit)

T5xxl-v1_1_Q8_0_gguf_LongCLIP-SAE-ViT-L-14-FP32.png (1440×1440) Color_Diff_T5xxl-v1_1_Q8_0_gguf_LongCLIP-SAE-ViT-L-14-FP32_T5xxl-v1_1_original-FP32_LongCLIP-SAE-ViT-L-14-FP32.png (1440×1440)

This lightweight GGUF version, shared by city96, is also widely used.

T5xxl_v1.1-Q5_K_M.gguf (5-bit)

T5xxl-v1_1_Q5_K_M_gguf_LongCLIP-SAE-ViT-L-14-FP32.png (1440×1440) Color_Diff_T5xxl-v1_1_Q5_K_M_gguf_LongCLIP-SAE-ViT-L-14-FP32_T5xxl-v1_1_original-FP32_LongCLIP-SAE-ViT-L-14-FP32.png (1440×1440)

The 5-bit GGUF format, recommended by city96, represents the lowest acceptable quality for many users.

T5xxl_v1.1-Q3_K_L.gguf (3-bit)

T5xxl-v1_1_Q3_K_L_gguf_LongCLIP-SAE-ViT-L-14-FP32.png (1440×1440) Color_Diff_T5xxl-v1_1_Q3_K_L_gguf_LongCLIP-SAE-ViT-L-14-FP32_T5xxl-v1_1_original-FP32_LongCLIP-SAE-ViT-L-14-FP32.png (1440×1440)

The 3-bit GGUF format shows significant performance degradation for image-generation purposes but may still be usable for large-scale language model tasks.

Results: Graph Representation

T5xxl_v1.1 MAE and SSIM Similarity.png (1200×848)
T5xxl_v1.1          Size (GB)   MAE    MAE Similarity   SSIM   SSIM Similarity
FP32                44.6        0.00   100.0 %          1.00   100.0 %
FP16                 9.8        1.87    99.3 %          1.00    99.7 %
FP8_e4m3fn_scaled    5.2        3.82    98.5 %          0.99    99.1 %
FP8_e4m3fn           4.9        6.31    97.5 %          0.98    97.6 %
Q8_0                 5.1        4.88    98.1 %          0.99    98.8 %
Q5_K_M               3.4        5.12    98.0 %          0.98    98.1 %
Q3_K_L               2.5        5.25    97.9 %          0.98    98.5 %

Actual Performance Ranking

  1. FP32
  2. FP16
  3. FP8e4m3fn_scaled
  4. Q8_0
  5. Q5_K_M
  6. Q3_K_L
  7. FP8e4m3fn

These results are somewhat surprising. GGUF quantizations of lightweight models are generally more precise than FP8 formats of similar size, but in this case the refined FP8_e4m3fn_scaled outperformed the Q8_0.gguf format thanks to the adjustments applied after quantization.

This highlights the importance of fine-tuning lightweight models after compression.
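
The exact recipe behind the _scaled checkpoint isn't spelled out here, but scaling weights into FP8's representable range before casting them is a common approach. Below is a minimal Python sketch, assuming PyTorch 2.1 or later for the float8_e4m3fn dtype, of why such scaling reduces casting error; it illustrates the general idea, not the actual conversion script.

import torch

def fp8_roundtrip(x: torch.Tensor) -> torch.Tensor:
    """Cast to FP8 (e4m3) and back, losing precision along the way."""
    return x.to(torch.float8_e4m3fn).to(torch.float32)

def fp8_scaled_roundtrip(x: torch.Tensor) -> torch.Tensor:
    """Scale into FP8's range before casting, then undo the scale (per-tensor absmax scaling)."""
    scale = 448.0 / x.abs().max()   # 448 is the largest finite e4m3 value
    return (x * scale).to(torch.float8_e4m3fn).to(torch.float32) / scale

w = torch.randn(4096, 4096) * 0.02  # small weights, typical of transformer layers
print("plain FP8 MAE :", (w - fp8_roundtrip(w)).abs().mean().item())
print("scaled FP8 MAE:", (w - fp8_scaled_roundtrip(w)).abs().mean().item())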

CLIP-L Comparison

Next, let’s compare CLIP-L. Unlike T5xxl, which processes prompts as text, CLIP-L directly connects text with images.

As a result, changes in CLIP-L have a more significant impact on the illustrations.

Here, we compare each CLIP model's FP32 and FP16 formats side by side.

Note: Long-CLIP-L models are currently exclusive to ComfyUI, and using FP32 models requires specific launch settings.

Long-CLIP-ViT-L-14-GmP-SAE (Released on 2024.12.19!)

Long-CLIP-ViT-L-14-GmP-SAE_Flan-T5xxl-FP32.png (2065×2710)
Left: FP32 format, Right: FP16 format

The Long-CLIP-ViT-L-14-GmP-SAE model is the latest release in the Long-CLIP-L series, debuting on 2024.12.19.

  • SAE: Sparse Autoencoder

This ambitious model prioritizes creativity, despite showing slightly lower benchmark scores than previous iterations.

Both FP32 and FP16 formats produce clear and beautiful illustrations, though noticeable differences exist between the two.

CLIP-SAE-GmP-ViT-L-14 (Released on 2024.12.8!)

CLIP-ViT-L-14-GmP-SAE_Flan-T5xxl-FP32.png (2065×2710)

This GmP-SAE model, released on 2024.12.8, is compatible with both ComfyUI and Stable Diffusion webUI Forge.

Similar differences are evident between its FP32 and FP16 formats. Compared to standard CLIP-L models, it stands out for its clarity and superior quality.

Standard CLIP-L

CLIP-L_Flan-T5xxl-FP32.png (2065×2645)

The standard CLIP-L model simplifies details, particularly around elements like skis, compared to the improved versions above.

As with the other models, the differences between FP32 and FP16 formats are noticeable.

Improved CLIP-L vs Standard CLIP-L

Finally, let's directly compare the improved CLIP-L with the standard CLIP-L.

CLIP-L-FP32_Flan-T5xxl-FP32_CLIP-ViT-L-14-GmP-SAE-FP32_Flan-T5xxl-FP32.png (1219×1600)

Comparing the two illustrations, the one on the left, generated with the improved CLIP-L, is overall clearer and more detailed. The difference is particularly noticeable in the depiction of the skis.

Of the two metrics calculated in this test, SSIM is said to correlate better with human perception, and its values differ significantly between the two models.

Although CLIP-L is far smaller than T5xxl, it connects directly to image generation, and this test shows it has a significant impact on image quality.

Why FP32 Format for CLIP-L?

In this test, all text encoders showed noticeable differences between their FP32 and FP16 formats.

For CLIP-L specifically, upgrading to FP32 format significantly enhances image quality, making it a worthwhile improvement.

Anime illustration of a girl with orange hair and blue eyes looking at you with a smile on a ski slope at sunset.png (2568×2568)

T5xxl: Large but Powerful

On the other hand, T5xxl is a larger model, with the complete FP32 version requiring 45GB of storage.

Managing such a large model can be challenging. However, ComfyUI efficiently handles text encoders by loading only the necessary parts during processing and using system RAM when VRAM capacity is insufficient.

Even when system RAM is used, text encoding typically completes within a few seconds, so the time impact is minimal.

The Advantage of Caching

When generating images using the same prompt, encoded prompts are stored in the cache, eliminating the need for re-encoding in subsequent generations.
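
As a toy illustration of this idea (not ComfyUI's actual implementation), the sketch below caches the encoder output keyed by the prompt string; the expensive encoder call is stubbed out so the caching behavior can be observed on its own.

import time
from functools import lru_cache

def slow_text_encode(prompt: str) -> list[float]:
    """Stand-in for an expensive T5xxl / CLIP-L forward pass."""
    time.sleep(2)                               # simulate the encoding cost
    return [float(ord(c)) for c in prompt[:8]]  # toy "embedding"

@lru_cache(maxsize=64)
def cached_encode(prompt: str) -> tuple[float, ...]:
    return tuple(slow_text_encode(prompt))

start = time.time(); cached_encode("girl on a ski slope at sunset")
print(f"first call : {time.time() - start:.2f} s")   # pays the full encoding cost
start = time.time(); cached_encode("girl on a ski slope at sunset")
print(f"second call: {time.time() - start:.2f} s")   # answered from the cache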

Anime illustration of a girl with brown hair and blue eyes walking down a ski slope holding a large blackboard with “Flan T5xxl” written on it High Contrast.png (2568×2568)

However, if using tools like auto-prompt generators that input a new prompt each time, encoding must be performed repeatedly.

While text encoding itself takes only a few seconds, loading and unloading large models (10 GB or more) can take tens of seconds, even with a fast M.2 SSD.

If your system has 128 GB of RAM, models never need to be unloaded. For systems with less than 64 GB, however, it's advisable to choose a T5xxl variant that fits your available RAM.
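
If you want to automate that choice, a rough helper like the one below can pick a variant based on currently free RAM. It assumes the psutil package, and the thresholds are loose rules of thumb derived from the file sizes measured above, not official requirements.

import psutil

def suggest_t5xxl_variant() -> str:
    """Suggest a Flan-T5xxl variant that comfortably fits in free system RAM."""
    free_gb = psutil.virtual_memory().available / 1024**3
    if free_gb >= 64:
        return "Flan-T5xxl-FP32 (45.2 GB)"
    if free_gb >= 32:
        return "Flan-T5xxl-FP16 (22.6 GB)"
    if free_gb >= 16:
        return "Flan-T5xxl-Q8_0 (11.8 GB)"
    return "Flan-T5xxl-Q6_K (9.2 GB)"

print(suggest_t5xxl_variant())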

Anime illustration of a girl with brown hair and blue eyes holding a ski board with the word “Flan” written on it, looking surprised at a ski slope at sunset.png (2568×2568)

Conclusion: Mastering Text Encoders!

  • A new image comparison tool is available.
  • Improved CLIP-L in FP32 format is highly recommended.
  • Choose Flan-T5xxl based on your system RAM capacity.

By quantifying image accuracy, we’ve made it easier to evaluate and compare image quality.

While text encoders have often been overlooked in discussions about high-quality images, this test highlights their significant impact on results.

Anime illustration of a girl with orange hair and blue eyes smiling and looking at you with her friends in front of a snowy cabin.png (2568×2568)

Text encoders are at the very start of the image generation process, so any inaccuracies at this stage are likely to amplify downstream.

With free access to models like Flan-T5xxl and improved CLIP-L, I’m grateful for the opportunities they provide and look forward to continuing my exploration of image generation.

Thank you for reading to the end!