feat: add RMSE-guided quantization, AIO GGUF bundling, and lazy-load flag by shikaku2 · Pull Request #1573 · leejet/stable-diffusion.cpp

shikaku2 · 2026-05-28T00:18:50Z

Summary

Three related improvements to the --convert workflow and inference startup, focused on reducing disk footprint, RAM usage during conversion, and VRAM requirements at inference time.

1. RMSE-guided mixed-precision quantization (`--rmse <threshold>`)

Adds a --rmse <pct> flag to convert mode that automatically selects per-tensor quantization types by running a two-pass sweep.

How it works:

Pass 1: for each tensor, loads it as f32, tests candidate quant types (f16, Q8_0, Q6_K, Q5_1, Q5_0, Q4_K, Q4_0, IQ4_NL, Q3_K, Q2_K) from highest to lowest quality, and records the lowest-quality type that keeps RMSE within the threshold.
Pass 2: quantizes and writes each tensor at its assigned type.

Peak RAM during conversion = f32 size of the single largest tensor (not the full model). The two-pass design is streaming — no full model is held in memory at once.
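
The pass-1 selection described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from this PR: `fake_round_trip` is a hypothetical stand-in for a real quantize/dequantize round trip (it uses a plain symmetric uniform quantizer rather than any ggml kernel), `Candidate` and `pick_type` are made-up names, and the threshold is assumed to mean RMS error divided by the RMS of the original tensor, expressed in percent. The sketch also stops at the first failing candidate, which assumes error grows as quality drops.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a quantization type: a label plus a bit width.
struct Candidate {
    std::string name;
    int bits;  // must be >= 2 for the quantizer below
};

// Stand-in for quantize -> dequantize (NOT a ggml kernel): symmetric uniform
// quantization of the whole tensor with a single scale factor.
std::vector<float> fake_round_trip(const std::vector<float>& src, int bits) {
    float max_abs = 0.0f;
    for (float v : src) max_abs = std::max(max_abs, std::fabs(v));
    std::vector<float> out(src.size(), 0.0f);
    if (max_abs == 0.0f) return out;
    const float levels = static_cast<float>((1 << (bits - 1)) - 1);
    const float scale = max_abs / levels;
    for (size_t i = 0; i < src.size(); ++i)
        out[i] = std::round(src[i] / scale) * scale;
    return out;
}

// Assumed metric: RMS of the error relative to RMS of the reference, in percent.
double relative_rmse_pct(const std::vector<float>& ref, const std::vector<float>& test) {
    assert(ref.size() == test.size());
    double err = 0.0, energy = 0.0;
    for (size_t i = 0; i < ref.size(); ++i) {
        const double d = static_cast<double>(ref[i]) - static_cast<double>(test[i]);
        err += d * d;
        energy += static_cast<double>(ref[i]) * static_cast<double>(ref[i]);
    }
    if (energy == 0.0) return 0.0;
    return 100.0 * std::sqrt(err / energy);
}

// Pass 1 for a single tensor. `cands` is ordered from highest to lowest quality.
// Returns the index of the lowest-quality candidate that stays within the
// threshold, or 0 (the highest-quality candidate) if none does.
size_t pick_type(const std::vector<float>& tensor,
                 const std::vector<Candidate>& cands,
                 double threshold_pct) {
    assert(!cands.empty());
    size_t chosen = 0;
    for (size_t i = 0; i < cands.size(); ++i) {
        const double e = relative_rmse_pct(tensor, fake_round_trip(tensor, cands[i].bits));
        if (e > threshold_pct) break;  // assumes lower quality only gets worse
        chosen = i;
    }
    return chosen;
}
```

Because `pick_type` only ever holds one tensor and one round-tripped copy, running it tensor by tensor keeps peak memory tied to the largest tensor, which is the property the streaming design relies on. Pass 2 would then quantize each tensor once more at the recorded index and write it out.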

Results on SD3.5 Large:

The original model files are all F16/BF16 — no F32 in distribution:

| File | Format | Size |
|---|---|---|
| `sd3.5_large.safetensors` | F16 + BF16 | 16.5 GB |
| `t5xxl_fp16.safetensors` | F16 | 9.8 GB |
| `clip_g.safetensors` | F16 | 1.4 GB |
| `clip_l.safetensors` | F16 | 0.25 GB |
| Total (4 files) | | ~27.9 GB |

RMSE quantization results (all bundled into a single AIO GGUF):

| Target | Size | Reduction |
|---|---|---|
| F16 baseline (4 files) | ~27.9 GB | n/a |
| 1% RMSE | ~14 GB | −50% |
| 3% RMSE | ~13 GB | −53% |
| 6% RMSE | ~12 GB | −57% |

At 1% RMSE, most tensors land on Q4_K or Q5_K. RMSE is a tensor-level metric, not a perceptual one — see visual comparison below.

2. All-in-one GGUF bundling (`--convert` with multiple component flags)

--convert now accepts separate component files (--clip_l, --clip_g, --t5xxl, --diffusion-model, --llm, --vae) and writes them all into a single output GGUF, including metadata that allows the loader to identify each component.

Before this, distributing a quantized model required shipping 4–6 separate files and passing each as a CLI flag. After, a single .gguf is self-contained and loadable with just -m.

# Before
./sd -m sd3.5_large.safetensors --clip_l clip_l.safetensors --clip_g clip_g.safetensors \
    --t5xxl t5xxl_fp16.safetensors ...

# After (convert once)
./sd --convert --diffusion-model sd3.5_large.safetensors \
    --clip_l clip_l.safetensors --clip_g clip_g.safetensors \
    --t5xxl t5xxl_fp16.safetensors -o sd3.5_large_aio.gguf

# Then run with a single file
./sd -m sd3.5_large_aio.gguf ...

This is convenience packaging — no quality or performance change.
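
The description does not spell out what the identifying metadata looks like. One plausible approach, sketched here purely as an assumption, is to tag each tensor by a name prefix so that a loader can route it to the right component. The `ComponentRule` type, the function names, and every prefix string below are hypothetical and are not taken from this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical rule: tensors whose name starts with `prefix` belong to `component`.
struct ComponentRule {
    std::string component;
    std::string prefix;
};

// Illustrative prefixes only; the real naming scheme may differ.
std::vector<ComponentRule> example_rules() {
    return {
        {"clip_l", "text_encoders.clip_l."},
        {"clip_g", "text_encoders.clip_g."},
        {"t5xxl", "text_encoders.t5xxl."},
        {"diffusion_model", "model.diffusion_model."},
        {"vae", "first_stage_model."},
    };
}

// Returns the component for a tensor name, or "unknown" if no rule matches.
std::string classify_tensor(const std::string& name, const std::vector<ComponentRule>& rules) {
    assert(!rules.empty());
    for (const auto& r : rules) {
        if (name.compare(0, r.prefix.size(), r.prefix) == 0) return r.component;
    }
    return "unknown";
}

// Counts tensors per component, e.g. to report which components a bundle contains.
std::map<std::string, size_t> count_by_component(const std::vector<std::string>& names,
                                                 const std::vector<ComponentRule>& rules) {
    std::map<std::string, size_t> counts;
    for (const auto& n : names) ++counts[classify_tensor(n, rules)];
    return counts;
}
```

A check such as `count_by_component(names, example_rules())` would also be a natural place to notice a bundle that is missing an expected component, although this sketch does not attempt any compatibility validation.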

3. Lazy-load / staged VRAM eviction (`-ll` / `--lazy-load`)

Adds a -ll/--lazy-load flag that enables mmap-backed model loading and staged RAM eviction across the inference pipeline.

Problem: Systems with limited VRAM cannot run the full pipeline when all components (text encoders + diffusion model + VAE) are loaded simultaneously.

How it works:

Model is loaded via mmap. Pages are only read from disk when accessed.
After each pipeline stage completes, madvise(MADV_DONTNEED) is called on that component's tensors, releasing physical pages without invalidating pointers.
Eviction is sequential: text encoders → diffusion model → VAE. Since these stages don't overlap, peak VRAM/RAM is the max of any single component rather than the sum.
Auto-enables VAE tiling when -ll is active, to avoid a single large allocation. (SD3.5 VAE decode at 1024×1024 would require a ~4.6 GiB VkBuffer which exceeds the Vulkan maxMemoryAllocationSize = 4 GiB hard limit on many GPUs.)

Also applies to --convert: lazy-load + threading reduces peak RAM during quantization significantly — useful for generating quants on machines without large RAM.

This feature is architecture-agnostic (UNet, DiT, Flux, WAN, etc.) and works with both AIO GGUFs and separately-loaded component files.

Visual comparison

SD3.5 Large, Seed 42, 1024×1024, 20 steps. Baseline = F16 safetensors with -ll. All RMSE variants are AIO GGUF with -ll. Hardware: RX 9060 XT 16 GB, Vulkan.

"a cute cat"

[Image row: F16 baseline, 1% RMSE, 3% RMSE, 6% RMSE]

"a vintage photograph of an old phonograph sitting on a table"

[Image row: F16 baseline, 1% RMSE, 3% RMSE, 6% RMSE]

"a serene japanese garden with cherry blossoms at sunset"

[Image row: F16 baseline, 1% RMSE, 3% RMSE, 6% RMSE]

Testing

SD3.5 Large at 1024×1024, 20 steps, Vulkan backend (RX 9060 XT 16GB)
5 diverse prompts tested with -ll enabled; all succeeded
VAE tiling: ~3.5s (49 tiles) vs ~65s CPU fallback without tiling
RMSE conversion tested at 1%, 3%, 6% thresholds on SD3.5 Large

Known limitations / future work

RMSE threshold tuning is model-dependent; 1% is a reasonable starting point but systematic perceptual quality evaluation would help establish better defaults
-ll eviction is currently Linux-only (madvise path); Windows/macOS gracefully skip eviction but still benefit from mmap loading
AIO bundling does not yet validate that bundled components are compatible with each other

…flag

- --rmse <pct>: streaming two-pass mixed-precision quantization; peak RAM = f32 size of single largest tensor, not the full model
- --convert with multiple component flags (--clip_l, --clip_g, --t5xxl, --diffusion-model, --llm, --vae) bundles into a single AIO GGUF
- -ll/--lazy-load: mmap-backed loading with staged madvise(MADV_DONTNEED) eviction after each pipeline stage; auto-enables VAE tiling to avoid 4+ GiB single allocations that exceed Vulkan maxMemoryAllocationSize

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Seed 42, 1024x1024, 20 steps. Baseline = F16 safetensors with -ll. Variants: 1%/3%/6% RMSE AIO GGUF. Prompts: cat, phonograph, garden. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

wbruna

Please take a look at CONTRIBUTING.md. A few immediate issues I see:

It is doing too much at once: the lazy loading, RMSE conversion and AIO target are independent features.
AIs should not be credited as co-authors.

Looking at the implementation itself, I see a few other issues:

you take f16 as a "ceiling" type, but many models are provided as bf16, or could even have f32 for critical tensors. It would be more sensible to use the current type as a ceiling, and only use f16 if --type f16 was explicitly requested.
the streaming conversion is a nice feature, but I see no reason for it to be restricted to the RMSE conversion mode. In fact, since RMSE is per-tensor, it should be able to use almost exactly the same code path (maybe even as a per-tensor-path target: say, --tensor-type-rules=^model.diffuser_model.*=rmse:0.01).
if I understand the code correctly, you are testing from the lowest to the highest quant, sequentially, and picking the first quant that passes the threshold. That wastes a ton of CPU for low threshold values: I believe a binary search would easily take half the time on average. Plus, you are counting on the optimizer to detect that the baseline computation can be reused: if that's not the case, it's wasting CPU, too.
the #if defined(__linux__) is at the wrong code layer: platform-dependent code should be in util.cpp.
MADV_DONTNEED is advisory, and the OS is already free to evict memory-mapped pages. It's also purely performance-oriented, so maybe it could be controlled by a SD_MMAP_FLAGS flag (assume lazy-loading, call an evict_memory util.cpp function to discard memory when appropriate, and internally that function would check for both __linux__ and the flag).
as I've commented above, the eviction should also be called for normal mmap-to-VRAM code paths.
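
The binary-search idea from the review can be sketched as below. This is only an illustration: `pick_by_binary_search` is a hypothetical name, and the approach assumes the error never decreases as candidates move from highest to lowest quality, which real quantization types do not strictly guarantee for every tensor. The error callback is where a single cached f32 baseline would be reused across candidates instead of being recomputed.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Candidates are indexed 0..n-1 from highest to lowest quality.
// error_of(i) returns the measured error of candidate i and is assumed to be
// non-decreasing in i. Returns the largest index whose error is within the
// threshold, or 0 if even the highest-quality candidate exceeds it.
// If `evaluations` is non-null, it receives the number of error_of calls.
size_t pick_by_binary_search(size_t n,
                             double threshold,
                             const std::function<double(size_t)>& error_of,
                             size_t* evaluations = nullptr) {
    assert(n > 0);
    size_t lo = 0;   // every candidate below lo is known to pass
    size_t hi = n;   // every candidate at or above hi is known to fail
    size_t evals = 0;
    while (lo < hi) {
        const size_t mid = lo + (hi - lo) / 2;
        ++evals;
        if (error_of(mid) <= threshold) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    if (evaluations != nullptr) *evaluations = evals;
    return lo == 0 ? 0 : lo - 1;
}
```

With ten candidates this needs at most four trial quantizations per tensor, whereas a sequential sweep may need up to ten. If monotonicity cannot be assumed, a short linear check around the returned index would be a cheap safeguard.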

shikaku2 and others added 2 commits May 27, 2026 20:17

docs: add RMSE quantization comparison images

ce95e91

wbruna suggested changes May 28, 2026

This was referenced May 28, 2026

feat: add lazy load eager eviction #1574

Draft

feat: stream model conversion #1581

Draft

shikaku2 wants to merge 2 commits into leejet:master from shikaku2:feat/rmse-aio-lazyload
