Comparing Flux.1 Compression Formats: The Impact on Composition!

Flux1_Redraw_ a young woman with short brown hair and blue eyes wears a white coat and green scarf standing in front of a traditional japanese torii gate surrounded by snow - covered trees and red tor.png (1600×1600)
  • Base model accuracy impacts composition quality.
  • Choose the base model based on available VRAM capacity.
  • Select the text encoder according to system RAM capacity.

Happy New Year!

Happy New Year! Wishing you all the best for the year ahead!

Theme: Hatsumode (New Year's Shrine Visit)

For the first post of the year, the theme is "Hatsumode."

SDXL_Refine_ a young woman in a traditional japanese kimono stands before a red torii gate in a snowy landscape with a traditional japanese structure and a red torii gate in in the background and a se.png (2456×2456)

Although the New Year holiday is coming to an end, let’s dive into this theme for the first post of 2024!

Does Model Compression Affect Image Quality?

In the previous post, we explored how upgrading text encoders can improve illustration quality.

This time, we'll examine how the compression of the transformer—the core of image generation—affects the quality of illustrations.

Actual Comparisons

Let’s start by comparing actual illustrations.

FP16 (16-bit)

blue_pencil-flux1_v0.0.1-FP16.png (1440×1440)

We begin with the original FP16 format, which produces extremely clear and beautiful illustrations.

The black images shown on the right in the comparisons below are color difference maps relative to this FP16 baseline.
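A difference map like these can be generated with a few lines of NumPy. The sketch below averages the absolute per-channel differences into one grayscale map where black means identical pixels; the exact averaging method is an assumption, not necessarily the one used for the images above.

```python
import numpy as np

def color_diff_map(img_a: np.ndarray, img_b: np.ndarray) -> np.ndarray:
    """Per-pixel color difference between two uint8 RGB images.

    Returns a uint8 grayscale map: black (0) where the pixels are
    identical, brighter where they diverge.
    """
    a = img_a.astype(np.int16)
    b = img_b.astype(np.int16)
    # Mean absolute difference across the RGB channels of each pixel.
    diff = np.abs(a - b).mean(axis=-1)
    return diff.astype(np.uint8)
```

Loading the two renders with Pillow (`np.asarray(Image.open(...))`) and saving the result back out reproduces the side-by-side comparison format.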

Q8_0.gguf (8-bit)

blue_pencil-flux1_v0.0.1-Q8_0.gguf.png (1440×1440) Color_Diff_blue_pencil-flux1_v0.0.1-Q8_0.gguf_vs_blue_pencil-flux1_v0.0.1-FP16.png (1440×1440)

The Q8_0.gguf format is an 8-bit GGUF format with half the file size of FP16. While there aren’t many noticeable differences in appearance, the difference map reveals some areas of change.
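For context on why Q8_0 holds up so well: ggml's Q8_0 stores weights in blocks of 32 values that share one scale, so relative precision inside each block stays fine. Below is a simplified NumPy sketch of the quantize/dequantize round trip (real GGUF files store the scale as FP16 and pack the bytes; this is an illustration of the scheme, not the actual kernel).

```python
import numpy as np

QK8_0 = 32  # ggml's Q8_0 block size

def q8_0_roundtrip(x: np.ndarray) -> np.ndarray:
    """Quantize a 1-D float array to Q8_0 blocks and dequantize it again.

    Each block of 32 values shares one scale (max|x| / 127) and stores
    the values as signed 8-bit integers, roughly halving FP16 storage.
    """
    out = np.empty(len(x), dtype=np.float32)
    for i in range(0, len(x), QK8_0):
        block = x[i:i + QK8_0].astype(np.float32)
        amax = np.abs(block).max()
        d = amax / 127.0 if amax > 0 else 0.0
        if d:
            q = np.round(block / d).astype(np.int8)
            out[i:i + QK8_0] = q.astype(np.float32) * d
        else:
            out[i:i + QK8_0] = 0.0
    return out
```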

Q5_K_M.gguf (5-bit)

blue_pencil-flux1_v0.0.1-Q5_K_M.gguf.png (1440×1440) Color_Diff_blue_pencil-flux1_v0.0.1-Q5_K_M.gguf_vs_blue_pencil-flux1_v0.0.1-FP16.png (1440×1440)

The Q5_K_M.gguf format, at 5-bit compression, begins to show simpler overall compositions.

For example, lanterns become larger, and smaller figures are omitted.

FP8e4m3 (8-bit)

blue_pencil-flux1_v0.0.1-FP8e4m3.png (1440×1440) Color_Diff_blue_pencil-flux1_v0.0.1-FP8e4m3_vs_blue_pencil-flux1_v0.0.1-FP16.png (1440×1440)

The 8-bit FP8e4m3 format results in simplified background structures, turning the composition into a more symmetrical layout.

Q4_K_M.gguf (4-bit)

blue_pencil-flux1_v0.0.1-Q4_K_M.gguf.png (1440×1440) Color_Diff_blue_pencil-flux1_v0.0.1-Q4_K_M.gguf_vs_blue_pencil-flux1_v0.0.1-FP16.png (1440×1440)

At 4-bit compression, the gates are reduced to plain torii (traditional Japanese shrine gates), with noticeably less detail.

Q2_K.gguf (2-bit)

blue_pencil-flux1_v0.0.1-Q2_K.gguf.png (1440×1440) Color_Diff_blue_pencil-flux1_v0.0.1-Q2_K.gguf_vs_blue_pencil-flux1_v0.0.1-FP16.png (1440×1440)

At 2-bit compression, the composition becomes extremely simplified, omitting even the central figure.

Results in Graph Form

blue_pencil-flux1 MAE and SSIM Similarity adjust scale.png (1200×848)
Format    Size (GB)   MAE    MAE Similarity   SSIM   SSIM Similarity
FP16      22.1         0.00  100.0 %          1.00   100.0 %
Q8_0      11.8         5.10   98.0 %          0.99    98.6 %
Q6_K       9.2        11.03   95.7 %          0.95    95.4 %
Q5_K_M     7.9        19.10   92.5 %          0.90    90.2 %
FP8e4m3   11.5        38.94   84.8 %          0.76    76.1 %
Q4_K_M     6.5        41.90   83.6 %          0.73    72.9 %
Q3_K_L     5.0        33.25   87.0 %          0.80    80.1 %
Q2_K       3.8        61.23   76.1 %          0.54    54.2 %

The comparison results are visualized in a graph. As expected, image quality decreases as compression increases for the base model.
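The MAE similarity column appears consistent with linearly mapping an MAE of 255 (the maximum possible error for 8-bit channels) to 0 %; that interpretation is an inference from the numbers, not a stated formula. A minimal sketch:

```python
import numpy as np

def mae(img_a: np.ndarray, img_b: np.ndarray) -> float:
    """Mean absolute error between two uint8 images (0 = identical)."""
    a = img_a.astype(np.float64)
    b = img_b.astype(np.float64)
    return float(np.abs(a - b).mean())

def mae_similarity(img_a: np.ndarray, img_b: np.ndarray) -> float:
    """MAE as a percentage: 0 error -> 100 %, maximum error (255) -> 0 %."""
    return (1.0 - mae(img_a, img_b) / 255.0) * 100.0
```

For the SSIM column, `skimage.metrics.structural_similarity` computes the standard structural similarity index between two images.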

Notably, the FP8e4m3 format demonstrates significantly lower accuracy than the Q8_0.gguf format, despite both being 8-bit.

FP8e4m3’s performance aligns more closely with Q4_K_M, underscoring the superior balance of GGUF formats between file size and accuracy.
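One plausible explanation for the gap: E4M3 keeps only 3 mantissa bits, so its relative step size is coarse (about 1/8 between neighbors within each power of two), whereas Q8_0's per-block int8 steps are much finer. The sketch below enumerates the finite non-negative values of the E4M3FN format (no infinities, maximum 448) purely to illustrate that coarseness; it is not the kernel any inference engine actually uses.

```python
import numpy as np

def e4m3_values() -> np.ndarray:
    """All finite non-negative values representable in FP8 E4M3FN."""
    vals = []
    for e in range(16):          # 4 exponent bits, bias 7
        for m in range(8):       # 3 mantissa bits
            if e == 15 and m == 7:
                continue         # NaN bit pattern; E4M3FN has no infinities
            if e == 0:
                vals.append(m / 8.0 * 2.0 ** -6)          # subnormals
            else:
                vals.append((1 + m / 8.0) * 2.0 ** (e - 7))
    return np.array(sorted(vals))

def e4m3_quantize(x: float) -> float:
    """Round a non-negative float to the nearest E4M3FN value."""
    grid = e4m3_values()
    return float(grid[np.abs(grid - x).argmin()])
```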

Composition Changes with Compression

One clear takeaway is that compressing the transformer results in simplified compositions.

As compression progresses, smaller figures are omitted, and compositions trend towards greater symmetry.

Flux1_hires_ a young female character with blue eyes and brown hair wears a white and red outfit standing in front of a traditional japanese building with red torii gates surrounded by snow - covered .png (2407×2407)
Dynamic composition SDXL -> Flux.1

In contrast to text encoders, where differences primarily manifest in fine details, transformers have a broader influence on the overall structure of the illustration.

The tendency towards symmetry in compressed models may be because asymmetric compositions require higher numerical precision to represent, making them harder for heavily quantized models to produce.

Prioritizing the Base Model or Text Encoder?

Let’s compare the results of the base model (transformer) and the text encoder from the previous post.

Flan-T5xxl MAE and SSIM Similarity Adjust scale.png (1200×848) blue_pencil-flux1 MAE and SSIM Similarity adjust scale.png (1200×848)

When comparing the two, it becomes apparent that the base model deteriorates more significantly with increased compression.

Therefore, if you need to prioritize, focus on optimizing the base model first.

Does VAE Impact Image Quality?

Lastly, we examine the Variational Autoencoder (VAE).

In both ComfyUI and Stable Diffusion webUI Forge, the VAE is typically processed in BF16 format.

Flux1_ae_FP32_vs_Flux1_ae_BF16.png (2065×2645)

When compared against the higher-precision FP32 format, the resulting images show almost identical MAE and SSIM metrics.

The default BF16 format is sufficient for VAE processing.
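This matches BF16's design: it keeps float32's full exponent range and truncates the mantissa to 7 explicit bits, so the relative error stays under 2^-7 (below 1 %). The sketch below demonstrates this with simple bit truncation (real hardware rounds to nearest, which is slightly more accurate still).

```python
import numpy as np

def to_bf16(x: np.ndarray) -> np.ndarray:
    """Reduce float32 values to bfloat16 precision by zeroing the low
    16 mantissa bits (truncation; hardware uses round-to-nearest)."""
    bits = x.astype(np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)
```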

Separating the Model

Many Flux.1 models are distributed as combined checkpoints that bundle the text encoders and transformer into a single file.

If you supply your own Flan-T5xxl and refined CLIP-L, keeping only the transformer avoids storing duplicate text encoders and saves disk space.

In ComfyUI, the transformer can be separated using the following workflow:

Screenshot of ComfyUI workflow to separate Transformer.png (3567×900)

Connect the ModelSave node to the model you wish to separate and execute!
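Under the hood, the separation amounts to filtering the checkpoint's state dict by key prefix. The sketch below uses hypothetical prefixes for illustration; actual key names vary between checkpoints, so inspect your file's keys before relying on them.

```python
def split_transformer(state_dict: dict) -> dict:
    """Keep only the transformer weights from a combined checkpoint.

    The prefixes below are assumptions about the combined layout,
    not guaranteed key names; check `state_dict.keys()` first.
    """
    ENCODER_PREFIXES = ("text_encoders.", "vae.")  # assumed layout
    return {k: v for k, v in state_dict.items()
            if not k.startswith(ENCODER_PREFIXES)}
```

Applied to a real file, `safetensors.torch.load_file` and `save_file` handle the reading and writing around this filter.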

Base Model and Text Encoder Can Coexist!

Transformer calculations in image-generation AI are computationally intensive and are typically performed in VRAM.

In contrast, text encoder processing in ComfyUI depends on system conditions:

The text encoder is placed in VRAM only when both of the following hold:

  • Free VRAM is at least 50% of system RAM.
  • Free VRAM is at least 1.2× the size of the text encoder model.
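The two conditions above can be sketched as a single predicate. This is a simplification of ComfyUI's actual memory-management logic, using the thresholds as described here.

```python
def text_encoder_uses_vram(free_vram_gb: float,
                           system_ram_gb: float,
                           model_size_gb: float) -> bool:
    """Simplified device-selection heuristic for the text encoder:
    it goes to VRAM only if both threshold conditions are met."""
    return (free_vram_gb >= 0.5 * system_ram_gb      # >= 50% of system RAM
            and free_vram_gb >= 1.2 * model_size_gb)  # >= 1.2x model size
```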

Given Flux.1’s large text encoder size, processing almost always utilizes system RAM.

With sufficient RAM, text encoder and transformer computations can therefore coexist without competing for resources.

Flux1_Refine_ a young woman with short brown hair and blue eyes dressed in a white coat and red tie stands in a traditional japanese setting with a red torii gate pointing upwards surrounded by snow -.png (2400×2400)
RAM Matters Too!

Since adding RAM is far easier than upgrading a GPU, consider increasing RAM capacity as a practical way to improve image quality.

Conclusion: Start with the Base Model!

  • Base model accuracy impacts composition quality.
  • Choose the base model based on available VRAM capacity.
  • Select the text encoder according to system RAM capacity.

Flux.1 lets you choose the base model and text encoder separately, which can make it hard to decide where to invest.

However, this comparison clarifies the optimal approach to prioritization.

Flux1_Redraw_ a young woman in traditional japanese attire stands before a red torii gate in a snowy landscape with a traditional japanese building in the background and people walking in the distance.png (2407×2407)

There’s no end to the pursuit of better image quality, but fine-tuning settings for the best output is one of the joys of local image generation.

Why not dive into the settings and aim for your perfect illustration?

Thank you for reading to the end!


Models

Flan-T5xxl

Modified Long-CLIP-L

blue_pencil-flux1-v0.0.1