The Best GPUs for Image Generation: Comfortable Use of Flux.1 / SD 3.5 / AuraFlow!

  • Choose NVIDIA GPUs for AI tasks.
  • Aim for 16GB VRAM for models like Flux.1, SD 3.5 large, and AuraFlow.
  • Recommended GPUs: RTX 4060 Ti 16GB or RTX 3060 12GB.

Introduction

Hello, this is Easygoing.

In this article, we'll explore the best GPUs for utilizing image generation AI!

Useful Reference Sites!

First, let me introduce a very informative website for comparing GPUs used in image generation AI.

This blog, authored by staff from an AI research institute in the US, provides detailed explanations about the hardware required for AI computations.

Performance Chart Overview!

Let's start by referencing the performance chart of GPUs from the site mentioned above.

GPUS_Ada_raw_performance3.png (1600×1572)
  • RTX/GTX Series: For general use.
  • A Series / H Series: For professional use.

When comparing GPU performance, focus on the red 16-bit Inference bar in the chart.

Larger model numbers generally indicate newer, faster GPUs. Compared with the GTX 1060 released in 2016, for example, current GPUs deliver several times the performance, showing how quickly GPUs have improved in just a few years.

GPUs Are Manufactured by Three Companies

As of November 2024, GPUs are produced by three main companies:

  • NVIDIA (RTX / GTX Series)
  • AMD (Radeon Series)
  • Intel (Arc Series)

Among these, NVIDIA GPUs are optimized for AI applications, making them the top choice for image generation.

Key Features of GPU Generations

Let’s examine the characteristics of NVIDIA GPU generations. AI computations rely heavily on floating-point (FP) calculations.

Generational Summary Table

Series           Released  FP32  FP16  BF16  FP8
RTX 4000 Series  2022      ✓     ✓     ✓     ✓
RTX 3000 Series  2020      ✓     ✓     ✓     –
RTX 2000 Series  2018      ✓     ✓     –     –
GTX 1000 Series  2016      ✓     –     –     –

(✓ = hardware-accelerated support in that generation.)

NVIDIA introduces new GPU generations approximately every two years, and each generation brings optimizations for newer formats.

FP32, FP16, BF16, FP8 Explained

Image generation involves a vast amount of computation. Floating-point formats used in these calculations include:

Format      Sign   Exponent  Mantissa  Precision   Accuracy
FP32        1 bit  8 bits    23 bits   6–7 digits  Excellent
FP16        1 bit  5 bits    10 bits   3–4 digits  Good
BF16        1 bit  8 bits    7 bits    3 digits    Good
FP8 (e4m3)  1 bit  4 bits    3 bits    1–2 digits  Fair
FP8 (e5m2)  1 bit  5 bits    2 bits    1–2 digits  Fair
  • Sign: Indicates positive or negative.
  • Exponent: Handles large numbers for stability.
  • Mantissa: Determines precision.

While FP32 offers the highest precision, it requires significant computational resources. Reducing precision (FP16 → FP8) decreases VRAM usage and computational load but also impacts accuracy.
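You can check the precision column of the table directly in Python: the standard struct module can round-trip a value through 32-bit and 16-bit storage, and the digit counts follow from each format's mantissa width. A minimal sketch:

```python
import math
import struct

def to_fp32(x: float) -> float:
    # Round-trip a Python float (FP64) through 32-bit storage
    return struct.unpack('f', struct.pack('f', x))[0]

def to_fp16(x: float) -> float:
    # Round-trip through 16-bit half-precision storage
    return struct.unpack('e', struct.pack('e', x))[0]

v = 1 / 3
print(f"FP64: {v:.12f}")
print(f"FP32: {to_fp32(v):.12f}")   # accurate to ~7 digits
print(f"FP16: {to_fp16(v):.12f}")   # accurate to ~3-4 digits

# Decimal digits of precision from mantissa bits: (m + 1) * log10(2)
for fmt, m in [("FP32", 23), ("FP16", 10), ("BF16", 7), ("FP8 e4m3", 3)]:
    print(f"{fmt:8s}: ~{(m + 1) * math.log10(2):.1f} digits")
```

The computed digit counts (~7.2, ~3.3, ~2.4, ~1.2) line up with the table above.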

GPU Memory (VRAM)

VRAM is critical for handling the heavy memory requirements of AI computations. Here's a comparison of VRAM types:

Memory Type  Bandwidth         Example GPUs
GDDR7        Up to 1.5 TB/sec  RTX 5000 Series (upcoming)
GDDR6X       Up to 1 TB/sec    RTX 4090, 4080, 4070, etc.
GDDR6        Up to 750 GB/sec  RTX 4000, 3000, 2000, GTX 1600 Series
GDDR5X       Up to 550 GB/sec  GTX 1080
GDDR5        Up to 335 GB/sec  GTX 1070, 1060, 1050
DDR5         Up to 50 GB/sec   System RAM and integrated GPUs
DDR4         Up to 25 GB/sec   System RAM and integrated GPUs

The bandwidth figures in the table represent the memory transfer speed.

GDDR is a high-performance memory optimized for VRAM, offering much wider bandwidth compared to general-purpose DDR memory. This allows for significantly faster processing.
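To see why bandwidth matters, consider the time needed just to stream a model's weights once through memory. The ~23 GB size assumed for Flux.1's FP16 weights below is an illustrative figure, and the bandwidths are rounded from the table above:

```python
# Time to read a model's weights once from memory at each bandwidth.
# The ~23 GB model size is an assumption for illustration only.
model_gb = 23
bandwidth_gbps = {            # GB/s, rounded from the table above
    "GDDR6X (RTX 4090)": 1000,
    "GDDR6":             750,
    "DDR5 system RAM":   50,
    "DDR4 system RAM":   25,
}
for name, bw in bandwidth_gbps.items():
    ms = model_gb / bw * 1000
    print(f"{name:18s}: {ms:7.1f} ms per full pass over the weights")
```

The same weights that GDDR6X can stream in tens of milliseconds take nearly a full second from DDR4, which is why spilling into system RAM is so costly.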

What Happens When VRAM Runs Out?

When VRAM is insufficient during processing, the system will rely on system RAM to handle the overflow.

However, system RAM has narrower bandwidth and introduces latency due to PCIe connections, resulting in a significant slowdown.


If VRAM usage exceeds capacity, generation times can increase several-fold, severely limiting the practicality of image generation.

How Much VRAM Do You Need?

So, how much VRAM is necessary for image generation AI?

Let’s measure the VRAM usage for various models.

Test Environment

  • RTX 4060 Ti 16GB
  • System RAM: 64GB
  • 1024 x 1024 resolution, Euler, 20 steps

ComfyUI FP16 Format

ComfyUI FP16 VRAM usage.png (800×1131)

First, let’s look at the VRAM usage in FP16 format (blue graph).

Newer models like Flux.1, AuraFlow, and SD 3.5 (large) use 14–15GB of VRAM, meaning you’ll need at least 16GB of VRAM for smooth performance.

Meanwhile, SD 3.5 (medium) and SDXL require just 6GB of VRAM to run.
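These measurements can be turned into a rough fit check. The VRAM figures below are approximations from this article's tests, and the 1 GB headroom is an assumption rather than an exact requirement:

```python
# Rough VRAM fit check using the FP16 figures measured in this article.
# Values are approximate; the headroom for the OS, desktop compositor,
# and activations is an assumed rule of thumb.
FP16_VRAM_GB = {
    "Flux.1": 15, "SD 3.5 large": 15, "AuraFlow": 14,
    "SD 3.5 medium": 6, "SDXL": 6,
}

def fits_in_vram(model: str, gpu_vram_gb: float, headroom_gb: float = 1.0) -> bool:
    return FP16_VRAM_GB[model] + headroom_gb <= gpu_vram_gb

print(fits_in_vram("Flux.1", 16))        # True  -> runs comfortably
print(fits_in_vram("Flux.1", 12))        # False -> will spill to system RAM
print(fits_in_vram("SD 3.5 medium", 8))  # True
```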

Using the --lowvram and --novram Options to Save VRAM

ComfyUI provides two options for saving VRAM when running low: --lowvram and --novram.

In the purple graph, you can see that the --lowvram option does not significantly reduce VRAM usage.


In contrast, the red --novram option processes data entirely in system RAM, greatly reducing VRAM usage.

Although --novram processing increases generation time by 1.5–2x, it is still significantly faster than exceeding VRAM capacity.

ComfyUI FP8 Format

ComfyUI FP8 VRAM usage.png (800×1131)

Next, let’s examine VRAM usage in FP8 format, which is less demanding than FP16.

While FP8 format (green graph) generally uses less VRAM, Flux.1 still exceeds 12GB of VRAM, even in FP8 format.

For systems with only 12GB of VRAM, FP8 compression remains insufficient for Flux.1.

Stable Diffusion WebUI Forge

Forge VRAM usage.png (800×1131)

Finally, let’s measure VRAM usage in Stable Diffusion WebUI Forge.

Even here, Flux.1 exceeds 12GB of VRAM usage in both FP16 and FP8 formats.

Using GGUF Format with Flux.1

For systems with 12GB of VRAM or less, the GGUF format, introduced in August 2024, is recommended.


We’ll cover GGUF format in detail in the next article.

Comparing Actual GPUs

Now, let’s dive into the performance and pricing of specific GPUs.

High-End Options if Budget Allows

Here is a lineup of the latest RTX 4000 series GPUs:

GPU Model          VRAM Capacity  Tensor Cores  Power Usage  Price
RTX 4090           24GB           336           ~450W        ¥350,000
RTX 4080 SUPER     16GB           320           ~320W        ¥200,000
RTX 4080           16GB           304           ~320W        ¥180,000
RTX 4070 Ti SUPER  16GB           264           ~285W        ¥150,000
RTX 4070 Ti        12GB           240           ~285W        ¥140,000
RTX 4070           12GB           192           ~285W        ¥90,000
RTX 4060 Ti 16GB   16GB           136           ~160W        ¥80,000

If budget is not a concern, high-end GPUs are ideal.

Among the RTX 4000 series, the following offer the best VRAM value for their price:

  • RTX 4090 (24GB)
  • RTX 4070 Ti SUPER (16GB)
  • RTX 4060 Ti (16GB)

Mid-Range Options with Ample VRAM

High-end GPUs offer exceptional performance, but if you're looking to save costs, mid-range GPUs are a viable option.

NVIDIA has introduced large VRAM models in the mid-range lineup, including the RTX 4060 Ti and RTX 3060.

GPU Model         Release Date   Tensor Cores  Power Usage  Price    Used Price
RTX 4060 Ti 16GB  July 2023      136 cores     160W         ¥80,000  ¥65,000
RTX 3060 12GB     February 2021  112 cores     170W         ¥55,000  ¥30,000

If you're looking for an affordable GPU for AI applications, these two models are strong contenders.

Consider Certified Used GPUs

GPUs are robust components, and pre-owned units generally work without issues.

Many PC shops sell pre-owned GPUs with return guarantees. Based on my experience, if no issues appear within the first week, the GPU is likely reliable.

Common issues with pre-owned GPUs include cooling fan malfunctions, but replacing the fan with a larger aftermarket unit often improves cooling efficiency.

For the RTX 4060 Ti, pre-owned options are still relatively expensive due to its recent release. However, RTX 3060 units can be found at very affordable prices in the pre-owned market.


If you're comfortable with more risk, platforms like Mercari can offer even cheaper deals, albeit without warranties.

What About the RTX 5000 Series?

New RTX-series GPUs are released approximately every two years. Originally expected in 2024, the RTX 5000 series has been delayed to 2025.

The VRAM lineup may include Samsung’s new GDDR7 3GB chips, which could provide 50% more VRAM capacity compared to the current 2GB chips.

The top-tier RTX 5090 is expected to feature 32GB of VRAM using the current 2GB chips. Products exceeding this capacity are unlikely in this generation.
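As a rough sketch, VRAM capacity follows from the memory bus width (each GDDR chip occupies a 32-bit channel) multiplied by chip density. The 512-bit bus assumed for the RTX 5090 below is a rumor, not a confirmed specification:

```python
# VRAM capacity = (bus width / 32 bits per GDDR chip) x chip density.
# The 512-bit bus for the RTX 5090 is an assumption based on rumors.
def vram_gb(bus_width_bits: int, chip_gb: int) -> int:
    return (bus_width_bits // 32) * chip_gb

print(vram_gb(512, 2))  # 32 GB with today's 2 GB chips
print(vram_gb(512, 3))  # 48 GB with 3 GB GDDR7 chips (+50%)
```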

Expected Pricing of the RTX 5060

NVIDIA is enjoying record-breaking success, particularly in the AI GPU market, where it holds a near monopoly. With little competitive pressure, the RTX 5000 series is unlikely to be priced cheaply.

Currency devaluation has driven up GPU prices in Japan. For example, the RTX 4060 Ti 16GB is priced near what was historically considered high-end.

Model             Release Date   Price (USD)  Price (JPY)  USD/JPY Rate
RTX 4060 Ti 16GB  July 2023      $499         ¥71,000      142
RTX 3060 12GB     February 2021  $329         ¥34,800      106
RTX 2060          January 2019   $349         ¥37,700      108

Given the current economic conditions, waiting for the RTX 5060 might not guarantee better pricing.

How Does AMD’s Radeon Compare?

AMD's Radeon series is the second most widely used GPU line after NVIDIA.

While Radeon GPUs excel at general computation, they lack hardware equivalent to Tensor cores, lagging behind NVIDIA for AI applications.


AMD has no immediate plans to implement Tensor core-like hardware in consumer GPUs, so NVIDIA’s dominance will likely continue for at least 1–2 years.

Challenges with Intel Arc GPUs

Intel Arc GPUs debuted in April 2022 as newcomers to the market.

However, their software ecosystem, particularly drivers, remains underdeveloped. This results in instability, especially for AI workloads.


When I tested an Intel Arc GPU on a borrowed PC, frequent memory leaks rendered it nearly unusable for image generation.

For AI applications, NVIDIA GPUs remain the clear recommendation.

Online Services vs Local PCs: Which Is More Affordable?

Online services offer access to high-performance computational resources but require ongoing rental fees.

In general, local GPUs become more cost-effective after about 300 days of use.
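That 300-day figure is easy to sanity-check. The rental fee and electricity cost below are assumptions for illustration; only the GPU price comes from this article's tables:

```python
# Break-even between buying a GPU and renting online (illustrative numbers).
# Rental fee and electricity cost are assumptions, not quoted prices.
gpu_price_jpy = 80_000       # RTX 4060 Ti 16GB, from the table above
rental_jpy_per_day = 300     # assumed online service fee
power_jpy_per_day = 30       # assumed electricity for a few hours of use

breakeven_days = gpu_price_jpy / (rental_jpy_per_day - power_jpy_per_day)
print(f"Local GPU pays for itself after ~{breakeven_days:.0f} days")
```

With these assumptions the purchase pays off in roughly 300 days, consistent with the estimate above.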


Additionally, many online platforms, including SeaArt AI, optimize for FP8 execution and limit FP16 utilization.

For high-quality AI illustrations, FP16 capability gives local setups an edge.

Trivia from the Blog

  • PCIe 3.0 x4 bandwidth is sufficient for GPU connectivity.
  • System RAM should equal or exceed your VRAM capacity; add more if necessary.
  • The FP8 performance of a single RTX 4090 surpasses the combined computational power of all the world's supercomputers in 2007.

Summary: Prioritize VRAM in Your GPU Purchase

  • Choose NVIDIA GPUs for AI tasks.
  • Aim for 16GB VRAM for models like Flux.1, SD 3.5 large, and AuraFlow.
  • Recommended GPUs: RTX 4060 Ti 16GB or RTX 3060 12GB.

I hope this guide assists you in selecting the right GPU for your AI image generation needs.

Thank you for reading!