Evaluating Image Quality is Tough! Exploring WaveSpeed's Dynamic Caching for Faster Generation

Flux1_heunpp2_ Anime-style drawing of a female pilot with blonde hair and purple eyes wearing a blue cap with a gold emblem a black vest with gold epaulettes and a white shirt with a black tie She has.png (1600×1600)
  • MAE and SSIM metrics have their limitations.
  • While start is the most critical factor, other parameters also influence the results.
  • Dynamic Caching refines large structures but omits fine details.
Anime manga illustration with an airplane and pilot drawn by blue_pencil-flux1 No1.png (2480×1753)

Introduction

Hello, this is Easygoing.

Continuing from the previous post, this article focuses on further testing WaveSpeed.

Once again, this post dives deep into technical details, so thank you for bearing with me.

The Importance of Early Stages in Image Generation

While testing WaveSpeed in the previous experiment, I noticed that adjusting the start value significantly impacts the illustrations.

Image generation starts with a noisy image and progressively removes noise to produce the final illustration. It seems that the accuracy of the earliest denoising steps has a substantial effect on the resulting image quality.

Four Configuration Parameters

WaveSpeed's Apply First Block Cache node provides the following four adjustable parameters:

Apply First Block Cache Node.png (2272×1722)
  • residual_diff_threshold (RDT): The threshold for applying cached results.
  • start: Determines when caching begins.
  • end: Determines when caching ends.
  • max_consecutive_cache_hits (max hits): Limits the number of consecutive cache applications (default is -1 for unlimited).
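
To make the four parameters more concrete, here is a minimal Python sketch of how a cache decision of this kind could work. It is not WaveSpeed's actual code. The function names (relative_diff, can_reuse_cache, count_full_steps), the treatment of start and end as fractions of the step count, and the per-step difference values are all assumptions made for illustration.

```python
def relative_diff(current, previous):
    """Mean absolute change of the first block's output, relative to the previous step."""
    num = sum(abs(c - p) for c, p in zip(current, previous)) / len(current)
    den = sum(abs(p) for p in previous) / len(previous)
    return num / den if den > 0 else float("inf")


def can_reuse_cache(progress, rel_diff, consecutive_hits,
                    residual_diff_threshold,
                    start=0.0, end=1.0, max_consecutive_cache_hits=-1):
    """Decide whether one step may reuse the cached result (hypothetical logic)."""
    # Assumption: progress is the fraction of steps already done (0.0 to 1.0).
    if not (start <= progress <= end):
        return False
    # A negative max_consecutive_cache_hits means "unlimited".
    if 0 <= max_consecutive_cache_hits <= consecutive_hits:
        return False
    # Reuse the cache only when the first block changed very little.
    return rel_diff < residual_diff_threshold


def count_full_steps(rel_diffs, **params):
    """Count how many steps still need a full (uncached) computation."""
    hits = 0
    full = 0
    total = len(rel_diffs)
    for i, diff in enumerate(rel_diffs):
        if can_reuse_cache(i / total, diff, hits, **params):
            hits += 1
        else:
            hits = 0
            full += 1
    return full


# Made-up differences for a 30-step run: the first step has nothing to compare to.
diffs = [float("inf")] + [0.05] * 29
print(count_full_steps(diffs, residual_diff_threshold=0.08))
print(count_full_steps(diffs, residual_diff_threshold=0.08, start=0.5))
print(count_full_steps(diffs, residual_diff_threshold=0.08, max_consecutive_cache_hits=1))
```

With these made-up inputs, raising start or lowering max hits forces more steps to be computed in full, which is the speed versus quality trade-off explored in the graphs that follow.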

This time, I'll adjust each parameter individually and compare the resulting impacts.

Results: Graphs at a Glance!

Here are the summarized results in graph form:

  • X-axis: Time required for image generation (seconds).
  • Y-axis: Image quality metrics (MAE/SSIM).

The red and blue points closest to the upper left corner indicate the best balance between quality and speed.
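
One way to read "closest to the upper left" is as a Pareto comparison: a setting is worth keeping if no other setting is both at least as fast and at least as good. The sketch below assumes the plotted score is "higher is better", and the labels and numbers are placeholders, not measurements from this article.

```python
def pareto_front(points):
    """Return the labels of points that no other point beats on both time and quality.

    Each point is (label, seconds, quality); lower seconds and higher quality are better.
    """
    front = []
    for label, seconds, quality in points:
        dominated = any(
            (s2 <= seconds and q2 >= quality) and (s2 < seconds or q2 > quality)
            for _, s2, q2 in points
        )
        if not dominated:
            front.append(label)
    return front


# Placeholder values for illustration only.
candidates = [
    ("A", 100, 0.90),
    ("B", 120, 0.95),
    ("C", 130, 0.93),  # slower and lower quality than B
    ("D", 200, 1.00),
]
print(pareto_front(candidates))
```

Settings that survive this filter form the upper-left edge of the scatter plot; choosing among them is still a matter of how much time you are willing to spend.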

1. RDT (residual_diff_threshold)

Dynamic Cashing RDT (start=0, end=1, max hits=-1).png (1200×848)

As RDT increases, generation speed improves, but image quality decreases.

2. start

Graph showing speed and similarity between MAE and SSIM under Dynamic Cashing start (RDT=1, end=1, max hits=-1).png (1200×848)

The start parameter shows a similar trend to RDT, but its points sit higher overall, indicating better quality at comparable generation times.

3. end

Graph showing speed and similarity between MAE and SSIM under Dynamic Cashing end (RDT=1, start=0, max hits=-1).png (1200×848)

For the end parameter, the points sit lower overall, indicating poorer quality at comparable generation times.

4. max hits (max_consecutive_cache_hits)

Graph showing speed and similarity between MAE and SSIM under Dynamic Cashing max hits (RDT=1, start=0, end=1).png (1200×848)

The distribution for max hits closely resembles that of the initial RDT graph.

Start is the Most Critical Factor!

Among the four graphs, the second one for start shows results skewed towards the upper left corner.

Order of Influence on Quality:

  1. start > RDT ≒ max hits > end

In other words, start has the most significant impact on image quality.

Anime manga illustration with an airplane and pilot drawn by blue_pencil-flux1 No4.png (2480×1753)

The graph indicates that adjusting start between 0.2 and 0.3 results in substantial quality changes, making it a reasonable range to focus on when tuning.

Is Start All That Matters?

While start is clearly the most critical factor, do the other three parameters also affect image quality?

Let’s compare two illustrations generated in roughly half the usual processing time, one by adjusting RDT and the other by adjusting start.

RDT = 0.08 (102 seconds)

residual_diff_threshold_0.08_start_0_end_1_diff.png (2065×1535)

start = 0.3 (108 seconds)

RDT1_start_0.3_end_1_max_hits_-1_diff.png (2065×1467)

Both illustrations were generated in nearly the same time, yet they look markedly different.

Looking at the MAE and SSIM values, the second illustration (start = 0.3) has the higher scores, meaning its overall structure is more accurate.

However, upon closer inspection, the second illustration clearly lacks fine details.

In image generation, the early stages define the major structure, while finer details are filled in later. Since start focuses on the early stages, the resulting image has a stable structure but fewer intricate details.
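
A toy example can show why the scores stay high even when details go missing. The sketch below computes MAE and a simplified, single-window SSIM on tiny made-up "images" (flat lists of pixel values from 0 to 1). This is an assumption-laden illustration: real SSIM implementations slide a window over the image, and these are not the images or the exact metric code used for the graphs.

```python
def mae(a, b):
    """Mean absolute error between two equally sized pixel lists (lower is closer)."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def global_ssim(a, b, data_range=1.0):
    """SSIM computed once over the whole image (simplified; higher is closer)."""
    n = len(a)
    mu_a = sum(a) / n
    mu_b = sum(b) / n
    var_a = sum((x - mu_a) ** 2 for x in a) / n
    var_b = sum((y - mu_b) ** 2 for y in b) / n
    cov = sum((x - mu_a) * (y - mu_b) for x, y in zip(a, b)) / n
    c1 = (0.01 * data_range) ** 2  # standard stabilizing constants
    c2 = (0.03 * data_range) ** 2
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    )


# Reference: a dark half, a bright half, and one "fine detail" pixel.
reference = [0.0] * 8 + [1.0] * 8
reference[2] = 0.3

# Same structure, but the detail pixel is gone.
no_detail = [0.0] * 8 + [1.0] * 8

# Detail kept, but the dark/bright boundary has moved by two pixels.
shifted = [0.0] * 6 + [1.0] * 10
shifted[2] = 0.3

for name, image in [("no_detail", no_detail), ("shifted", shifted)]:
    print(name, mae(reference, image), global_ssim(reference, image))
```

In this toy case the image that lost its detail scores better on both metrics than the image whose structure shifted, which mirrors the comparison above: both metrics mostly reward a stable overall structure.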

Balance is Everything

Next, let’s revisit the settings previously recommended by ChatGPT for balancing quality and speed:

  • start = 0.2
  • end = 0.8
  • max hits = 5
Graph showing speed and similarity between MAE and SSIM under Dynamic Cashing RDT (start=0.2, end=0.8, max hits=5).png (1200×848)

This configuration balances the start, end, and max hits parameters, and it produced the best results of all the tests conducted so far.

Dynamic Caching likely places the greatest emphasis on start, but achieving optimal results seems to require balancing all the settings to some extent.

What Happens Without Dynamic Caching?

Next, let’s see what happens without Dynamic Caching, when only the total number of steps for image generation is varied.

Graph showing speed and similarity between MAE and SSIM under No Dynamic Cashing Steps.png (1200×848)

Reducing the number of steps predictably lowers image quality.

Compared to the earlier graph with ChatGPT’s recommended settings, the results here are positioned lower overall.

This shows that Dynamic Caching, when appropriately configured, can maintain better image quality than simply reducing the step count.

Dynamic Caching Omits Fine Details

Lastly, let’s examine how Dynamic Caching affects rendering. To investigate, we compare two illustrations:

Regular 15 Steps (106 sec)

No_Dynamic_Cash_step_15_diff.png (2065×1467)

Regular 15 Steps + 15 Steps with Dynamic Caching (Total: 30 steps, 147 sec)

RDT1_start_0.5_end_1_max_hits_-1_diff.png (2065×1467)

The illustration with Dynamic Caching added to the regular 15 steps retains a solid overall structure but omits fine details.

Inference with Dynamic Caching broadly matches the expected result, but it introduces errors in the finer details, and the size of those errors depends on the RDT setting.

As a result, while the overall structure becomes clearer, the fine details are gradually omitted. Judging the superiority of either image is subjective, but depending on the context, the second image might be preferred for its simplicity and clarity.

Anime manga illustration with an airplane and pilot drawn by blue_pencil-flux1 No2.png (2480×1753)

Conclusion: Evaluating Image Quality Is Complex

  • MAE and SSIM metrics have their limitations.
  • While start is the most critical factor, other parameters also influence the results.
  • Dynamic Caching refines large structures but omits fine details.

This investigation proved to be quite nuanced.

Up until now, I’ve used MAE and SSIM to evaluate image differences. However, this analysis revealed their limitations.

Dynamic Caching performs a kind of broad-strokes inference, similar to human intuition, which makes its results intriguingly distinct.

Anime manga illustration with an airplane and pilot drawn by blue_pencil-flux1 No3.png (2480×1753)
Created by Manga Editor DESU!

Dynamic Caching is a breakthrough technology that could find applications beyond image and video generation, in other AI domains.

It’s exciting to think that the global development efforts in image generation AI are exposing us to the cutting edge of technology.

Thank you for reading to the end!

Reference Images

Here are the illustrations generated with different settings when the image generation time was roughly halved. These comparisons clarify the intention behind each parameter setting.

1. RDT = 0.08 (102 sec)

residual_diff_threshold_0.08_start_0_end_1_diff.png (2065×1535)

The rendering is somewhat incomplete.

2. start = 0.3 (108 sec)

RDT1_start_0.3_end_1_max_hits_-1_diff.png (2065×1467)

The structure is complete, but fine details are missing.

3. end = 0.4 (102 sec)

RDT_1_start_0_end_0.4_max_hits_-1_diff.png (2065×1467)

The structure is inaccurate, but there is more detailed rendering.

4. max hits (max_consecutive_cache_hits) = 1

RDT_1_start_0_end_1_max_hits_1_diff.png (2065×1467)

The rendering is generally insufficient.

5. No Dynamic Caching, steps = 15 (106 sec)

No_Dynamic_Cash_step_15_diff.png (2065×1467)

The rendering is generally insufficient.


Model

blue_pencil-flux1-v0.0.1