QuantLLM v2.0 - Ultra-fast, Pure Python Quantization & Training
We are thrilled to announce the official release of QuantLLM v2.0! This major release brings a completely redesigned API, enhanced performance, and a beautiful, professional user experience.
✨ Highlights
🚀 TurboModel: The Unified API
We've unified model loading, quantization, finetuning, and export into a single class: `TurboModel`.
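For illustration, a unified workflow might look like the sketch below. The method names (`from_pretrained`, `finetune`, `save_gguf`) and their arguments are hypothetical placeholders, not QuantLLM's documented API; see the project README for the real signatures.

```python
# Hypothetical usage sketch. The method names are placeholders,
# not QuantLLM's documented API.
from quantllm import TurboModel

# Load a model (Hugging Face repo id shown for illustration).
model = TurboModel.from_pretrained("meta-llama/Llama-3.2-1B")

# Finetune and export through the same object.
model.finetune(dataset="my_dataset")                        # hypothetical signature
model.save_gguf("model.Q4_K_M.gguf", quant_type="Q4_K_M")   # hypothetical signature
```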
📦 Pure Python GGUF Export (No Binaries!)
Forget compiling `llama.cpp` or dealing with complex C++ toolchains. QuantLLM v2.0 includes a native Python GGUF writer.
- Supports the standard quantization types (Q4_K_M, Q8_0, Q5_K_M).
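To make the "no binaries" claim concrete, here is a minimal sketch of how a GGUF container header can be written with nothing but Python's standard library. It follows the public GGUF v3 layout (magic, version, tensor count, metadata key/value pairs) and illustrates the idea only; it is not QuantLLM's actual writer.

```python
# Minimal GGUF v3 header writer using only the standard library.
# Illustrates the file layout; this is not QuantLLM's implementation.
import struct

GGUF_MAGIC = 0x46554747   # b"GGUF" read as a little-endian uint32
GGUF_VERSION = 3
GGUF_TYPE_STRING = 8      # metadata value-type id for UTF-8 strings

def write_string(f, s: str) -> None:
    data = s.encode("utf-8")
    f.write(struct.pack("<Q", len(data)))   # uint64 length prefix
    f.write(data)

def write_header(f, metadata: dict, tensor_count: int = 0) -> None:
    # magic, version, tensor count, metadata kv count
    f.write(struct.pack("<IIQQ", GGUF_MAGIC, GGUF_VERSION,
                        tensor_count, len(metadata)))
    for key, value in metadata.items():
        write_string(f, key)
        f.write(struct.pack("<I", GGUF_TYPE_STRING))
        write_string(f, value)

with open("header_demo.gguf", "wb") as f:
    write_header(f, {"general.architecture": "llama",
                     "general.name": "demo"})
```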
🎨 Beautiful UI
We've overhauled the logging system to provide clear, actionable feedback:
- A consistent theme color (orange1) for all spinners, progress bars, and success messages.
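As an aside, orange1 is a color name from the rich library's palette, which suggests (though the release notes do not confirm) that the console output is built on rich. A standalone rich progress bar themed this way looks like:

```python
# Standalone rich example themed with "orange1"; not QuantLLM code.
import time
from rich.progress import Progress, SpinnerColumn, BarColumn, TextColumn

with Progress(
    SpinnerColumn(style="orange1"),
    TextColumn("[orange1]{task.description}"),
    BarColumn(complete_style="orange1"),
) as progress:
    task = progress.add_task("Quantizing", total=100)
    for _ in range(100):
        progress.advance(task)
        time.sleep(0.01)
```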
⚡ Performance Optimizations
- `torch.compile` enabled: automatically compiles training graphs for up to 2x faster training on modern GPUs.
- Sets `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` to prevent fragmentation crashes.
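A minimal sketch of applying both settings by hand, which QuantLLM handles for you. Note that the allocator option must be in the environment before PyTorch initializes CUDA, so it is set before the import:

```python
# Sketch of wiring up both optimizations manually.
# The allocator option must be set before torch initializes CUDA.
import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch

model = torch.nn.Linear(1024, 1024)   # stand-in for a real model
model = torch.compile(model)          # compiles the forward/backward graph
loss = model(torch.randn(8, 1024)).sum()
loss.backward()
```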
🛠️ Technical Improvements & Bug Fixes
- Fixed `TypeError: object of type 'generator' has no len()` during GGUF tensor processing.
- Fixed `ValueError: model did not return a loss` during finetuning by integrating `DataCollatorForLanguageModeling`.
- Fixed an `AttributeError` when passing `SmartConfig` objects as overrides (`torch.dtype` objects are now preserved via `asdict`).
- Disabled WandB logging by default to keep the console clean (re-enable via `WANDB_DISABLED="false"`).

Installation:
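The package name below is an assumption based on the project name; check the repository README if it differs:

```bash
pip install quantllm
```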
Made with ❤️ by Dark Coder
⭐ Star us on GitHub • 💖 Sponsor