Enhance Flux.1's Expression! How to Incorporate SDXL's Composition Techniques

Four-frame comic strip about a new female fighter pilot.png (3400×4800)
  • Flux.1 struggles with composition
  • SDXL anime models excel in composition
  • ComfyUI allows flexible combinations

Introduction

Hello, this is Easygoing.

Today, I’ll be exploring ways to enhance the expressive power of the highly discussed AI image generator, Flux.1.

Theme: Fighter Jet and New Female Pilot

The theme this time is a new female pilot taking on her first flight in a fighter jet.

I’ll try to capture both the tension of flying for the first time and the heavy presence of the fighter jet.

Flux.1 Has Amazing Textures

Flux.1 is a new AI image generator that was introduced in August 2024.

Compared to previous AIs, Flux.1 excels in texture quality.

Realistic illustration of a cat.png (1440×1440)

Its overwhelmingly realistic textures are so lifelike that they can be mistaken for the real thing.

Even a Newcomer Has Weaknesses

Although Flux.1 is a promising newcomer that has dramatically improved image generation quality, it also has weaknesses at this stage.

Simply put, it lacks experience.

When comparing Flux.1 to the previous generation model, SDXL, certain things become clear.

Superior Specs, But...

First, let’s compare the text encoders (language understanding models) of Flux.1 and SDXL.

Both have two text encoders that enhance prompt comprehension.

Flux.1

  • T5-XXL (Text-to-Text Transfer Transformer Extra Extra Large text encoder): 9.6 GB
  • CLIP-L (Contrastive Language-Image Pre-training Large text encoder): 0.3 GB

SDXL

  • OpenCLIP-ViT/G (Open Source Contrastive Language-Image Pre-training Vision Transformer Gigantic): 3.5 GB
  • CLIP-ViT/L (Contrastive Language-Image Pre-training Vision Transformer Large): 1.5 GB

What's the Difference?

Both sound impressive, but when comparing the two, there is a significant difference in size.

Broadly speaking, a text encoder's comprehension scales with its size.

This means that Flux.1, with its larger, newer text encoders, should theoretically understand prompts better.

Illustration of a female rookie pilot flying a fighter jet for the first time, looking nervous2.png (2576×2576)

However, when actually used, the older SDXL reproduces prompts more accurately.

This difference comes down to experience: SDXL has benefited from extensive additional training by its users, which has improved its practical prompt understanding.

While Flux.1 is a genius rookie, SDXL is a veteran with extensive practical experience.

Is Flux.1 Boring?

Where does the difference in understanding between Flux.1 and SDXL show up?

First, let's take a look at some images generated by Flux.1.

Realistic illustration of a front-facing cat.png (1440×1440)
flux1-dev-Q8_0.gguf
Photo-like illustration of fighter jets flying in formation in the sky.png (1440×1440)
Realistic illustration of a female pilot on the ground in front of a fighter jet, looking at the camera.png (2592×2592)
FluxesCore-Dev - V1.0 - fp16

While the images generated by Flux.1 are excellent individually, when viewed consecutively, there’s a sense of monotony.

The Problem Lies in the Composition

The photos shown above all have simple compositions.

They place the subject dead center with the camera held level, a framing known as "center composition" (Hinomaru composition).

Japanese national flag (Hinomaru).png (1920×1282)
Japanese national flag (Hinomaru)

Center composition is effective for drawing the eye to the subject, but when every image uses it, the results start to feel dull.

Most Photos Use Center Composition

When we take photos, it’s often considered a mistake if the horizon tilts.

For commemorative photos, unless you’re a professional photographer, you tend to capture the subject almost in the center.

Because much of the training data on the internet uses this composition, Flux.1 tends to replicate it.

SDXL Can Capture Bold Shots

Now let’s look at images generated by SDXL.

Anime-style illustration of a cute cat’s face captured with a tilted camera.png (2048×2048)
anima_pencil-XL-v5.0.0
Anime-style illustration of steaming coffee captured from a slightly tilted angle.png (2048×2048)

While Flux.1 is superior in texture and dimensionality, SDXL creates interesting compositions by tilting the camera or cropping part of the subject.

This sense of playfulness brings movement to the image.

SDXL models, especially those trained on anime-style images, such as Animagine-XL 3.0 and its derivatives, excel at such bold expressions.

Flux.1 Can't Reproduce Prompts

To bring movement to the image, I used the following prompts:

  • dutch angle: shot with the camera tilted
  • close up: the subject framed tightly, filling the frame

SDXL reproduced these prompts faithfully, producing varied compositions, while Flux.1, which lacks that practical additional training, could not follow them accurately.
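For illustration, tags like these are combined with the rest of the prompt in the tag style that anime SDXL models such as Animagine-XL expect. The example below is hypothetical, not the exact prompt used for the images in this article:

```text
dutch angle, close up, 1girl, pilot suit, cockpit, fighter jet,
nervous expression, looking at viewer, masterpiece, best quality
```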

The Best of Both Worlds?

I wondered if SDXL could complement Flux.1’s weaknesses.


flowchart LR
subgraph S1[SDXL]
A(Original<br>Sketch)
end
subgraph F[Flux.1]
B(Redraw)
C(Upscaling)
end
subgraph S2[SDXL]
D(Final<br>Touch)
end
A-->B
B-->C
C-->D

The idea is to take advantage of Flux.1’s superior textures while letting SDXL handle composition and final touches.

I Tried It Out

Now, let’s look at the actual images.

Rough sketch of a female pilot flying a fighter jet for the first time, looking nervous.png (1024×1024)
SDXL original sketch
High-resolution version of the female pilot flying a fighter jet for the first time, looking nervous2.png (2576×2576)
Flux.1 redraw
Finished illustration of the female pilot flying a fighter jet for the first time2.png (2576×2576)
SDXL final touch

First, I created a rough sketch using SDXL.

Next, I redrew and upscaled the image with Flux.1, enhancing the textures and fixing details like the hands.

Since the textures became too intense for the character, I finished by toning them down with SDXL.

It took a few tries to find the right balance between the softness of the character and the texture of the aircraft.
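The three-stage hand-off can be sketched in code. This is a minimal Python sketch of the orchestration only, not the actual ComfyUI graph: the stage functions are stubs standing in for real SDXL and Flux.1 img2img passes, and the denoise strengths are illustrative assumptions, not the exact settings I used.

```python
# Minimal sketch of the SDXL -> Flux.1 -> SDXL hand-off described above.
# The stage functions are stubs standing in for real img2img passes;
# the denoise strengths are illustrative assumptions, not measured settings.

def run_pipeline(stages, image):
    """Run each (name, denoise, fn) stage in order and log the hand-offs."""
    log = []
    for name, denoise, fn in stages:
        image = fn(image, denoise)
        log.append((name, denoise))
    return image, log

# Stubs: a real workflow would call the SDXL / Flux.1 samplers here.
def sdxl_sketch(img, denoise):
    return img + ["SDXL sketch"]

def flux_redraw(img, denoise):
    return img + ["Flux.1 redraw + upscale"]

def sdxl_finish(img, denoise):
    return img + ["SDXL final touch"]

stages = [
    # Stage 1: full denoise -- SDXL composes the image from scratch.
    ("SDXL original sketch", 1.00, sdxl_sketch),
    # Stage 2: partial denoise -- Flux.1 keeps the composition, rebuilds texture.
    ("Flux.1 redraw + upscale", 0.55, flux_redraw),
    # Stage 3: light denoise -- SDXL softens the texture on the character.
    ("SDXL final touch", 0.30, sdxl_finish),
]

image, log = run_pipeline(stages, [])
```

The point of the structure is that each stage only hands the previous stage's image forward, so the balance between character softness and aircraft texture is tuned entirely through the per-stage denoise values.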

Challenging ComfyUI!

Since this process could not be automated in Stable Diffusion WebUI Forge, I took the opportunity to try out ComfyUI.

Screenshot of ComfyUI workflow.png (2780×1863)

Learning ComfyUI for the first time was difficult, and I spent about three days battling error messages, but I finally managed to output images.

Workflow

Below are sample illustrations along with download links for the workflows I used.

Since Google Blogger strips the embedded ComfyUI workflow metadata from uploaded images, I am distributing the workflows as text files instead.

SDXL-Flux1-SDXL for Anime

rabbit in dark.png (2576×2576)

SDXL-Flux1-SDXL_Anime.7z

Tested models:

This is the workflow used for the anime illustrations.

SDXL-Flux1 for Semi-Realistic Images

Semi realistic cat.png (2576×2576)

SDXL-Flux1-semi-realistic.7z

Tested models:

This is the workflow used for semi-realistic images incorporating anime-style compositions.

Summary

  • Flux.1 struggles with composition
  • SDXL anime models excel in composition
  • ComfyUI allows flexible combinations

I see great potential in Flux.1.

As Flux.1 continues to improve through additional training, techniques like the ones introduced today, where we rely on SDXL, may no longer be necessary.

Until then, it seems that this partnership between the rookie and the veteran will continue.

Thank you for reading to the end!

Bonus

When firing weapons in a fighter jet, you say "Fire!" but...

Illustration of a female rookie pilot in a fighter jet cockpit lighting a fire.png (2304×2304)

No! That’s not what I meant!