feat: add RMSE-guided quantization, AIO GGUF bundling, and lazy-load flag #1573
Draft
shikaku2 wants to merge 2 commits into
…flag

- --rmse <pct>: streaming two-pass mixed-precision quantization; peak RAM = f32 size of single largest tensor, not the full model
- --convert with multiple component flags (--clip_l, --clip_g, --t5xxl, --diffusion-model, --llm, --vae) bundles into a single AIO GGUF
- -ll/--lazy-load: mmap-backed loading with staged madvise(MADV_DONTNEED) eviction after each pipeline stage; auto-enables VAE tiling to avoid 4+ GiB single allocations that exceed Vulkan maxMemoryAllocationSize

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Seed 42, 1024x1024, 20 steps. Baseline = F16 safetensors with -ll. Variants: 1%/3%/6% RMSE AIO GGUF. Prompts: cat, phonograph, garden.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
wbruna (Contributor) suggested changes on May 28, 2026 and left a comment:
Please take a look at CONTRIBUTING.md. A few immediate issues I see:
- It is doing too much at once: the lazy loading, RMSE conversion and AIO target are independent features.
- AIs should not be credited as co-authors.
Looking at the implementation itself, I see a few other issues:
- you take f16 as a "ceiling" type, but many models are provided as bf16, or could even have f32 for critical tensors. It would be more sensible to use the current type as a ceiling, and only use f16 if `--type f16` was explicitly requested.
- the streaming conversion is a nice feature, but I see no reason for it to be restricted to the RMSE conversion mode. In fact, since RMSE is per-tensor, it should be able to use almost exactly the same code path (maybe even as a per-tensor-path target: say, `--tensor-type-rules=^model.diffuser_model.*=rmse:0.01`).
- if I understand the code correctly, you are testing from the lowest to the highest quant, sequentially, and picking the first quant that passes the threshold. That wastes a ton of CPU for low threshold values: I believe a binary search would easily take half the time on average. Plus, you are counting on the optimizer to detect that the baseline computation can be reused: if that's not the case, it's wasting CPU, too.
- the `#if defined(__linux__)` is at the wrong code layer: platform-dependent code should be in `util.cpp`. `MADV_DONTNEED` is advisory, and the OS is already free to evict memory-mapped pages. It's also purely performance-oriented, so maybe it could be controlled by a `SD_MMAP_FLAGS` flag: assume lazy-loading, call an `evict_memory` util.cpp function to discard memory when appropriate, and internally that function would check for both `__linux__` and the flag.
- as I've commented above, the eviction should also be called for normal mmap-to-VRAM code paths.
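The binary search suggested in the review can be sketched as follows. This is only an illustration, not code from the PR: `first_passing` and its `evals` counter are hypothetical names, and the sketch assumes the candidate types are ordered from lowest to highest precision and that once one candidate meets the threshold, every higher one does too. Quantization error is not guaranteed to be monotonic across real quant types, so that assumption would need checking.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Hypothetical helper: returns the first index in [0, n) for which
// passes(i) is true, or n if no candidate passes.
// Assumes monotonicity: if passes(i) is true, passes(j) is true for all j > i.
// If `evals` is non-null, it is incremented once per call to passes().
std::size_t first_passing(std::size_t n,
                          const std::function<bool(std::size_t)>& passes,
                          std::size_t* evals = nullptr) {
    std::size_t lo = 0, hi = n;  // the answer always lies in [lo, hi]
    while (lo < hi) {
        const std::size_t mid = lo + (hi - lo) / 2;
        if (evals) ++*evals;
        if (passes(mid)) {
            hi = mid;       // mid passes: the answer is mid or something lower
        } else {
            lo = mid + 1;   // mid fails: the answer is strictly higher
        }
    }
    return lo;
}
```

With six candidates this evaluates at most three of them, whereas a low-to-high sequential scan may evaluate all six when the threshold is tight. Each evaluation here stands in for a full quantize, dequantize and compare pass over the tensor.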
Summary
Three related improvements to the `--convert` workflow and inference startup, focused on reducing disk footprint, RAM usage during conversion, and VRAM requirements at inference time.
1. RMSE-guided mixed-precision quantization (`--rmse <threshold>`)
Adds a `--rmse <pct>` flag to convert mode that automatically selects per-tensor quantization types by running a two-pass sweep.
How it works:
Peak RAM during conversion = f32 size of the single largest tensor (not the full model). The two-pass design is streaming: no full model is held in memory at once.
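To make the per-tensor selection concrete, here is a minimal sketch under stated assumptions. It is not the PR's implementation: `relative_rmse`, `fake_quantize` and `pick_bits` are hypothetical names, the error is assumed to be RMSE normalized by the tensor's own RMS value, and a toy uniform quantizer stands in for the real GGUF quant types. Candidates are tried from lowest to highest precision and the first one under the threshold wins.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Relative RMSE: sqrt(sum((ref - approx)^2) / sum(ref^2)).
// Normalizing by the tensor's own energy is an assumption of this sketch.
double relative_rmse(const std::vector<float>& ref, const std::vector<float>& approx) {
    assert(ref.size() == approx.size());
    double err = 0.0, norm = 0.0;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        const double d = double(ref[i]) - double(approx[i]);
        err += d * d;
        norm += double(ref[i]) * double(ref[i]);
    }
    return norm > 0.0 ? std::sqrt(err / norm) : 0.0;
}

// Toy stand-in for a real quantizer: symmetric uniform quantize/dequantize
// round trip with a single scale for the whole tensor.
std::vector<float> fake_quantize(const std::vector<float>& x, int bits) {
    assert(bits >= 2 && bits <= 16);
    float max_abs = 0.0f;
    for (float v : x) max_abs = std::max(max_abs, std::fabs(v));
    if (max_abs == 0.0f) return x;  // all zeros: nothing to quantize
    const float levels = float((1 << (bits - 1)) - 1);
    const float scale = max_abs / levels;
    std::vector<float> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        out[i] = std::round(x[i] / scale) * scale;
    }
    return out;
}

// Returns the first bit width in `candidates` (sorted ascending) whose
// relative RMSE is <= threshold, or -1 if none qualifies.
int pick_bits(const std::vector<float>& tensor,
              const std::vector<int>& candidates,
              double threshold) {
    for (int bits : candidates) {
        if (relative_rmse(tensor, fake_quantize(tensor, bits)) <= threshold) {
            return bits;
        }
    }
    return -1;
}
```

Because each call works on one tensor at a time, only that tensor and its round-tripped copy need to be resident, which is consistent with the peak-RAM claim above. A `-1` result would correspond to keeping the tensor at its original precision.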
Results on SD3.5 Large:
The original model files are all F16/BF16, with no F32 in distribution:
- `sd3.5_large.safetensors`
- `t5xxl_fp16.safetensors`
- `clip_g.safetensors`
- `clip_l.safetensors`
RMSE quantization results (all bundled into a single AIO GGUF):
At 1% RMSE, most tensors land on Q4_K or Q5_K. RMSE is a tensor-level metric, not a perceptual one; see visual comparison below.
2. All-in-one GGUF bundling (`--convert` with multiple component flags)
`--convert` now accepts separate component files (`--clip_l`, `--clip_g`, `--t5xxl`, `--diffusion-model`, `--llm`, `--vae`) and writes them all into a single output GGUF, including metadata that allows the loader to identify each component.
Before this, distributing a quantized model required shipping 4–6 separate files and passing each as a CLI flag. After, a single `.gguf` is self-contained and loadable with just `-m`.
This is convenience packaging, with no quality or performance change.
3. Lazy-load / staged VRAM eviction (`-ll` / `--lazy-load`)
Adds a `-ll` / `--lazy-load` flag that enables mmap-backed model loading and staged RAM eviction across the inference pipeline.
Problem: Systems with limited VRAM cannot run the full pipeline when all components (text encoders + diffusion model + VAE) are loaded simultaneously.
How it works:
- After each pipeline stage, `madvise(MADV_DONTNEED)` is called on that component's tensors, releasing physical pages without invalidating pointers.
- VAE tiling is auto-enabled when `-ll` is active, to avoid a single large allocation. (SD3.5 VAE decode at 1024×1024 would require a ~4.6 GiB VkBuffer, which exceeds the Vulkan `maxMemoryAllocationSize = 4 GiB` hard limit on many GPUs.)
Also applies to `--convert`: lazy-load + threading reduces peak RAM during quantization significantly, which is useful for generating quants on machines without large RAM.
This feature is architecture-agnostic (UNet, DiT, Flux, WAN, etc.) and works with both AIO GGUFs and separately-loaded component files.
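The eviction step can be illustrated with a small helper. This is a sketch, not the PR's code: `evict_range` is a hypothetical name, and the choice to shrink the range inward to whole pages is an assumption made here so that a neighbouring tensor sharing a boundary page is never discarded. On platforms other than Linux the helper does nothing, mirroring the "gracefully skip" behaviour described under known limitations.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__linux__)
#include <sys/mman.h>
#include <unistd.h>
#endif

// Hypothetical helper: tells the kernel that the pages lying entirely inside
// [addr, addr + len) are no longer needed. The mapping and all pointers into
// it stay valid; for a file-backed mapping the data is re-read on next access.
// Returns true only if madvise() was issued and succeeded.
bool evict_range(void* addr, std::size_t len) {
#if defined(__linux__)
    const std::uintptr_t page = static_cast<std::uintptr_t>(sysconf(_SC_PAGESIZE));
    const std::uintptr_t start = reinterpret_cast<std::uintptr_t>(addr);
    // madvise() needs a page-aligned start: round the start up and the end
    // down so only whole pages inside the range are affected.
    const std::uintptr_t begin = (start + page - 1) & ~(page - 1);
    const std::uintptr_t end = (start + len) & ~(page - 1);
    if (end <= begin) {
        return false;  // the range does not contain a whole page
    }
    return madvise(reinterpret_cast<void*>(begin), end - begin, MADV_DONTNEED) == 0;
#else
    (void)addr;
    (void)len;
    return false;  // no eviction on other platforms; mmap loading still works
#endif
}
```

Note that the behaviour after eviction depends on the mapping type: a read-only file-backed mapping (the model-file case) faults the same bytes back in from disk, while a private anonymous mapping would come back zero-filled, so a helper like this should only be pointed at memory whose contents can be recovered from the file.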
Visual comparison
SD3.5 Large, Seed 42, 1024×1024, 20 steps. Baseline = F16 safetensors with `-ll`. All RMSE variants are AIO GGUF with `-ll`. Hardware: RX 9060 XT 16 GB, Vulkan.
"a cute cat"
"a vintage photograph of an old phonograph sitting on a table"
"a serene japanese garden with cherry blossoms at sunset"
Testing
- `-ll` enabled; all succeeded
Known limitations / future work
- `-ll` eviction is currently Linux-only (`madvise` path); Windows/macOS gracefully skip eviction but still benefit from mmap loading