Can Anime Otakus Get Smarter with a Dictionary? The Genealogy of the Latest noob_v_pencil-XL

SDXL #CLIP #SDXL

2024-12-13 / 2025-2-14

Anime illustration of a champagne gold night view of an amusement park SDXL 29.png (1600×1600)

Anime models have modified CLIP
Upgrading SDXL's CLIP is difficult
The noob_v_pencil-XL series features vibrant colors

Introduction

Hello, this is Easygoing.

Today, we'll look at how Stable Diffusion XL understands prompts.

The Theme: Amusement Park Illuminations

The theme for today is the illuminations at an amusement park.

Anime illustration of a champagne gold night view of an amusement park SDXL 27.png (2456×2456) — anima_pencil-XL -> AuraFlow -> blue_pencil-flux1

We will recreate champagne gold lights, which feel a little more mature than the usual sparkling LEDs.

Text Encoder as a Dictionary

In the previous article, I introduced the role of the text encoder as a dictionary that helps the AI understand the prompts we input.

In the new image generation AI models Flux.1 and Stable Diffusion 3.5, the text encoder can be selected separately from the base model, making it easy to upgrade.

On the other hand, the previous generation model SDXL has its text encoder integrated into the model.


flowchart LR
subgraph Input
X1(Prompt)
end
subgraph Stable Diffusion XL
B1(CLIP-L)
B2(CLIP-G)
Y1(UNET)
end
X1-->B1
X1-->B2
B1-->Y1
B2-->Y1
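
Because everything ships in one file, the three components in the diagram above can only be told apart by the names of their tensors. The sketch below groups tensor names by prefix. The prefixes are the ones commonly seen in single-file SDXL checkpoints and are an assumption here, as are the `group_keys` helper and the tiny sample list, so check them against your own file.

```python
# Key prefixes commonly seen in single-file SDXL checkpoints.
# They are an assumption in this sketch; verify them against your own file.
PREFIXES = {
    "CLIP-L": "conditioner.embedders.0.transformer.",
    "CLIP-G": "conditioner.embedders.1.model.",
    "UNET": "model.diffusion_model.",
}


def group_keys(keys):
    """Count how many tensor names belong to each SDXL component."""
    counts = {name: 0 for name in PREFIXES}
    counts["other"] = 0  # e.g. the VAE
    for key in keys:
        for name, prefix in PREFIXES.items():
            if key.startswith(prefix):
                counts[name] += 1
                break
        else:
            # No prefix matched, so the tensor is not CLIP or UNET.
            counts["other"] += 1
    return counts


# Hand-written sample; a real checkpoint has thousands of keys.
sample_keys = [
    "conditioner.embedders.0.transformer.text_model.embeddings.token_embedding.weight",
    "conditioner.embedders.1.model.token_embedding.weight",
    "model.diffusion_model.input_blocks.0.0.weight",
    "first_stage_model.decoder.conv_in.weight",
]
print(group_keys(sample_keys))
```

With a real checkpoint, the same function could be fed the key names of the loaded state dict instead of the sample list.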

Now, can we upgrade CLIP in SDXL like we did in Flux.1?

What Happens if We Revert CLIP to the Original...

First, let's examine how the CLIP in the representative anime models has changed from the original CLIP.

Left: Anime model CLIP-L + CLIP-G
Right: SDXL_Base CLIP-L + CLIP-G

Animagine-XL 3.1

blue_pencil-XL_v7.0.0

anima_pencil-XL_v5.0.0

anime_pencil-XL_v7.0.0_original — Also close to the original

anime_pencil-XL_v7.0.0_SDXL_Base_CLIP — Also close to the original

Pony Diffusion V6 XL

Pony_Diffusion-XL_V6_original — Completely broken

Pony_Diffusion-XL_V6_SDXL_Base_CLIP — Completely broken

Illustrious-XL_v0.1

Anime Model CLIP has been Modified!

Judging from the generated images, Pony Diffusion V6 XL and Illustrious-XL_v0.1 have had their CLIP modified so extensively that they are no longer compatible with the original CLIP, and the illustrations break down.

Animagine-XL 3.1 still generates images, but with distorted structure and reduced saturation.

On the other hand, blue_pencil-XL_v7.0.0 and anima_pencil-XL_v5.0.0 generate illustrations that are much closer to the original.
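
Conceptually, the comparison above replaces only the text-encoder weights of an anime model with those of SDXL_Base and leaves the UNET alone. Here is a minimal sketch of that swap, assuming both checkpoints are plain dictionaries that share key names. The prefixes, the `revert_clip` helper, and the float values standing in for tensors are all illustrative assumptions, not the exact procedure used for the images.

```python
# Assumed prefixes for the two text encoders inside an SDXL checkpoint.
TEXT_ENCODER_PREFIXES = (
    "conditioner.embedders.0.transformer.",  # CLIP-L
    "conditioner.embedders.1.model.",        # CLIP-G
)


def revert_clip(anime_ckpt, base_ckpt):
    """Return a copy of anime_ckpt whose text-encoder entries come from base_ckpt."""
    swapped = dict(anime_ckpt)
    for key, value in base_ckpt.items():
        # str.startswith accepts a tuple and matches any of its prefixes.
        if key.startswith(TEXT_ENCODER_PREFIXES):
            swapped[key] = value
    return swapped


# Plain floats stand in for weight tensors in this toy example.
anime = {
    "conditioner.embedders.0.transformer.w": 0.8,
    "conditioner.embedders.1.model.w": 0.3,
    "model.diffusion_model.w": 0.5,
}
base = {
    "conditioner.embedders.0.transformer.w": 1.0,
    "conditioner.embedders.1.model.w": 1.0,
    "model.diffusion_model.w": 1.0,
}

reverted = revert_clip(anime, base)
print(reverted)
```

The UNET entry keeps the anime model's value, which is why a UNET trained against a heavily modified CLIP can break when it suddenly receives embeddings from the original one.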

Natural Language Input Becomes More Challenging

The original CLIP in SDXL has a wide understanding ability, including support for natural language input.

Anime illustration of a champagne gold night view of an amusement park SDXL 1.png (2456×2456)

However, anime models have been repeatedly fine-tuned on tag-based data such as Danbooru, so they respond less well to natural language input.

This is similar to a student who, after studying only one subject, begins to forget other subjects.

Once something has been forgotten, it is hard to recall, even with a high-performance dictionary at hand.

The blue_pencil-XL Series Rewinds CLIP!

In the previous examples, both blue_pencil-XL and anima_pencil-XL produced illustrations quite close to the original CLIP. Here’s why:

The blue_pencil-XL series is a merged model made by combining a vast number of models. The merging recipe for the early blue_pencil-XL_v1.0.0 is publicly available.

Merge history of blue_pencil-XL_v1.0.0.png (1740×2320) — https://blue-pen5805.github.io/models/blue_pencil-XL-v1.0.0.html

What is surprising is that more than 50 models, including realistic models, were merged in the early stages. The key point, though, is that the original SDXL_Base (red) was frequently mixed back in during the merging process.

Each time SDXL_Base is mixed in, the CLIP weights are pulled back toward the original. In effect, CLIP is rewound as far as possible.

By maintaining the original CLIP, the blue_pencil-XL series has preserved the prompt understanding ability inherent to SDXL, allowing it to comprehend natural language prompts.
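
The rewind effect can be seen with simple arithmetic. In a linear merge, mixing in the base at ratio r leaves (1 - r) of the remaining difference, so repeated mixing shrinks the drift geometrically. The sketch below assumes a plain linear merge. The ratio of 0.25, the toy weights, and the helper names are hypothetical, and the real recipe also merges many other models in between.

```python
def merge(model, base, base_ratio):
    """Linear merge: every weight moves base_ratio of the way toward the base."""
    return {
        key: (1.0 - base_ratio) * model[key] + base_ratio * base[key]
        for key in model
    }


def drift(model, base):
    """Largest absolute difference between the model and the base weights."""
    return max(abs(model[key] - base[key]) for key in model)


# Toy CLIP weights; plain floats stand in for tensors.
base = {"clip.w0": 0.0, "clip.w1": 1.0}
tuned = {"clip.w0": 0.8, "clip.w1": 0.2}  # drifted away from the base by 0.8

history = [drift(tuned, base)]
current = tuned
for _ in range(3):
    current = merge(current, base, base_ratio=0.25)  # hypothetical ratio
    history.append(drift(current, base))

# Each entry is about 0.75 times the previous one.
print(history)
```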

Why is Natural Language Input So Good?

Personally, I always input prompts as short stories in natural language.

This method allows character expressions to change depending on the scene being captured.

Anime illustration of a champagne gold night view of an amusement park SDXL 16.png (2456×2456)

Anime illustration of a champagne gold night view of an amusement park SDXL 3.png (2456×2456)

Anime illustration of a champagne gold night view of an amusement park SDXL 22.png (2456×2456)

Tags can also specify expressions, but creating a subtle expression requires combining several keywords, each at a reduced weight.
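
For comparison, this is roughly what the tag-based approach looks like. The sketch formats tags in the `(tag:weight)` emphasis syntax accepted by UIs such as AUTOMATIC1111 and ComfyUI. The `weighted_tags` helper, the tags, and the weights are made up for illustration.

```python
def weighted_tags(weights):
    """Format a {tag: weight} mapping as '(tag:weight)' terms joined by commas."""
    return ", ".join(f"({tag}:{weight:.1f})" for tag, weight in weights.items())


# Hypothetical blend of expressions, each deliberately weakened below 1.0.
prompt = weighted_tags({"smile": 0.6, "embarrassed": 0.5, "looking away": 0.4})
print(prompt)
```

Every nuance has to be chosen and tuned by hand, whereas a natural language prompt leaves the expression to follow from the scene.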

With natural language input, various expressions can be generated from the same prompt, and most importantly, it’s a fun experience for the person generating the images.

What is the blue_pencil-XL Family Aiming For?

Whenever a notable new anime model is released, the blue_pencil-XL family adds a new series that merges it with the existing models.

Anime illustration of a champagne gold night view of an amusement park SDXL 7.png (2456×2456)

The anima_pencil-XL, used in this illustration, is a popular anime model that is also adopted by the image generation platform Fooocus.

The pony_pencil-XL, Illustrious_pencil-XL, and noob_v_pencil-XL (in development) series aim to absorb the strengths of each anime model while improving usability.


flowchart TB
A1(SDXL_Base)
B1(Anime Models)
B2(Realistic Models)
C1(Pony Diffusion V6 XL<br>2024.1.8)
C2(Animagine-XL 3.1<br>2024.3.21)
C3(Illustrious-XL_v0.1<br>2024.9.25)
C4(NoobAI-XL<br>2024.10.8)
C5(NoobAI-XL_V-pred-0.75s<br>2024.12.8)
D1([blue_pencil-XL_v7.0.0<br>2024.6.23])
D2([anima_pencil-XL_v5.0.0<br>2024.6.25])
D3([pony_pencil-XL_v2.0.0<br>2024.6.30])
D4([illustrious_pencil-XL_v2.0.0<br>2024.11.3])
D5([noob_v_pencil-XL_v0.5.1<br>2024.12.10])
A1-->B1
A1-->B2
B1-->C1
B2-->C1
B1---->D1
B2---->D1
B1--->C2
C2--->D2
D1-->D2
D2-->D3
C1-->D3
B1------->C3
C3-->C4
C4--->C5
C3-->D4
D2-->D4
C5-->D5
D4-->D5

Later models, such as Illustrious-XL and NoobAI-XL, have significantly altered CLIP, making natural language input more challenging.

However, the Illustrious_pencil-XL and noob_v_pencil-XL series still follow prompts better than the original models they are based on.

The noob_v_pencil-XL Series!

The latest release in the blue_pencil-XL family is the noob_v_pencil-XL series.

noob_v_pencil-XL_v0.5.1_00006_.png (2456×2456) — noob_v_pencil-XL_v0.5.1

The noob_v_pencil-XL series is a merged model based on the continuously updated NoobAI-XL_V-pred series.

E-pred: Epsilon prediction (Noise prediction)
V-pred: Velocity Prediction (State change prediction)

| | E-pred | V-pred |
|---|---|---|
| Features | Predicts noise components (ε) | Predicts image state changes (v) |
| Stability | Stabilizes with more steps | Stabilizes with fewer steps |
| Diversity | High diversity due to initial noise impact | Slightly lower diversity |

V-pred was proposed by Salimans and Ho in 2022 as a prediction target for diffusion models. It was used in Stable Diffusion 2 and was later adopted by NovelAI for anime image generation.

Compared to regular ε-prediction, it converges faster and generates images with fewer steps.
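
To make the difference concrete, here is a minimal sketch of the two prediction targets, using the common variance-preserving convention in which x_t = alpha * x0 + sigma * eps and alpha² + sigma² = 1. Single floats stand in for image tensors, and the function names and sample values are illustrative assumptions rather than the code of any particular model.

```python
import math


def make_noisy(x0, eps, alpha, sigma):
    """Forward process: blend the clean value x0 with the noise eps."""
    return alpha * x0 + sigma * eps


def v_target(x0, eps, alpha, sigma):
    """Velocity target that a V-pred model is trained to output."""
    return alpha * eps - sigma * x0


def x0_from_v(x_t, v, alpha, sigma):
    """Recover the clean value from a velocity prediction."""
    return alpha * x_t - sigma * v


def eps_from_v(x_t, v, alpha, sigma):
    """Recover the noise (what an E-pred model outputs) from a velocity prediction."""
    return sigma * x_t + alpha * v


# Hypothetical noise level; cos/sin guarantee alpha^2 + sigma^2 = 1.
angle = 0.3
alpha, sigma = math.cos(angle), math.sin(angle)

x0, eps = 0.7, -1.2
x_t = make_noisy(x0, eps, alpha, sigma)
v = v_target(x0, eps, alpha, sigma)

print(x0_from_v(x_t, v, alpha, sigma), eps_from_v(x_t, v, alpha, sigma))
```

Both quantities can be recovered from v, so the two formulations describe the same process and differ only in what the network is asked to output. At pure noise (alpha = 0, sigma = 1) the velocity target equals -x0 and still carries information about the image, while the ε target at that point is just the input noise itself.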

noob_v_pencil-XL_v0.5.CLIP_change_Hires_00011_.png (2456×2456) — noob_v_pencil-XL_v0.5.1_refined_r1

In actual use, the standout feature of V-pred is its vibrant color. It can express vivid colors that were previously difficult to achieve with anime models.

Prototype CLIP Merges

Now, back to the theme of this post: upgrading CLIP in SDXL.

Here’s the merging recipe I used for my personal model.

noov_v_pencil-XL-v0.5.1_change_clip — Left: Merged version; Right: noob_v_pencil-XL_v0.5.1

noov_v_pencil-XL-v0.5.1_original — Left: Merged version; Right: noob_v_pencil-XL_v0.5.1

CLIP-G
- noob_v_pencil-XL_v0.5.1 x 0.9
- anima_pencil-XL_v5.0.0 x 0.1
CLIP-L
- noob_v_pencil-XL_v0.5.1 x 0.9
- ViT-L-14-BEST-smooth-GmP-TE-only-HF-format x 0.1
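
Read as a weighted sum, the recipe above can be sketched as follows. This assumes a plain linear merge per text encoder, with the key names of all sources already aligned. The `weighted_merge` helper and the float values standing in for tensors are hypothetical, and the actual merge may have been done with a different tool.

```python
# Merge ratios copied from the recipe above.
RECIPE = {
    "CLIP-G": {
        "noob_v_pencil-XL_v0.5.1": 0.9,
        "anima_pencil-XL_v5.0.0": 0.1,
    },
    "CLIP-L": {
        "noob_v_pencil-XL_v0.5.1": 0.9,
        "ViT-L-14-BEST-smooth-GmP-TE-only-HF-format": 0.1,
    },
}


def weighted_merge(sources, weights):
    """Weighted sum of weight dictionaries that share the same keys."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("merge weights should sum to 1")
    keys = next(iter(sources.values())).keys()
    return {
        key: sum(weights[name] * sources[name][key] for name in weights)
        for key in keys
    }


# Toy CLIP-L weights; plain floats stand in for tensors.
clip_l_sources = {
    "noob_v_pencil-XL_v0.5.1": {"w": 1.0},
    "ViT-L-14-BEST-smooth-GmP-TE-only-HF-format": {"w": 0.0},
}

merged_clip_l = weighted_merge(clip_l_sources, RECIPE["CLIP-L"])
print(merged_clip_l)
```

Keeping 90% of the existing CLIP means the UNET still receives embeddings close to the ones it was trained on, while the remaining 10% nudges the encoder toward the donor model.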

The merged model is brighter and has improved details, though in some illustrations, the overall image may appear somewhat washed out.