QuantLLM v2.0 - Ultra-fast, Pure Python Quantization & Training

@codewithdark-git released this on 16 Dec 13:28

We are thrilled to announce the official release of QuantLLM v2.0! This major release brings a completely redesigned API, enhanced performance, and a beautiful, professional user experience.

✨ Highlights

🚀 TurboModel: The Unified API

We've unified model loading, quantization, finetuning, and export into a single class: TurboModel.

from quantllm import turbo

# 1. Load: Auto-detects memory & capabilities
#    (Automatically enables Flash Attention 2 & 4-bit loading)
model = turbo("meta-llama/Llama-3-8B")

# 2. Chat: Simple completion interface
print(model.generate("What is the future of AI?"))

# 3. Finetune: one-line training with LoRA
#    (Automatically handles DataCollators, Gradient Checkpointing)
model.finetune(my_dataset, epochs=3)

# 4. Export: Convert directly to GGUF
model.export("gguf", "llama3-finetuned.gguf")

📦 Pure Python GGUF Export (No Binaries!)

Forget compiling llama.cpp or dealing with complex C++ toolchains: QuantLLM v2.0 includes a native Python GGUF writer. A usage sketch follows the list below.

  • Works on Windows, Linux, and Mac natively.
  • Supports all major quantization types (Q4_K_M, Q8_0, Q5_K_M).
  • Zero external dependencies.
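
Here is a minimal usage sketch built on the turbo() example above. The release confirms model.export("gguf", path) and the supported quantization types, but it does not name a parameter for choosing the type, so quant_type below is an assumed keyword for illustration only:

from quantllm import turbo

# Load the model via the unified entry point shown earlier in these notes.
model = turbo("meta-llama/Llama-3-8B")

# Export to GGUF at a specific quantization type.
# NOTE: `quant_type` is a hypothetical keyword; the release confirms that
# Q4_K_M, Q8_0, and Q5_K_M are supported, but not the exact parameter name.
model.export("gguf", "llama3-q4_k_m.gguf", quant_type="Q4_K_M")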

🎨 Beautiful UI

We've overhauled the logging system to provide clear, actionable feedback:

  • SmartConfig Panel: Displays the exact parameter count (e.g., "7.24B") and memory compression stats ("14GB ➔ 4.5GB (Saved 68%)") before loading.
  • Themed Logging: A cohesive orange theme (orange1) for all spinners, progress bars, and success messages (illustrated below).
  • Clean Output: Suppressed noise from Hugging Face/Datasets libraries.
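
The orange1 token is a Rich color name, which suggests the UI is built on the Rich library; that is an assumption, and none of the identifiers below come from QuantLLM itself, so treat this as an illustration of the theme rather than the library's actual logger:

from rich.console import Console
from rich.progress import Progress, SpinnerColumn, BarColumn, TextColumn

console = Console()

# An orange-themed status line (orange1 is a standard Rich color name).
console.print("✓ Model loaded", style="bold orange1")

# An orange-themed spinner and progress bar, in the spirit of the release's UI.
with Progress(
    SpinnerColumn(style="orange1"),
    TextColumn("[orange1]{task.description}"),
    BarColumn(complete_style="orange1"),
) as progress:
    task = progress.add_task("Quantizing layers", total=32)
    for _ in range(32):
        progress.advance(task)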

⚡ Performance Optimizations

  • torch.compile Enabled: Automatically compiles training graphs for up to 2x faster training on modern GPUs.
  • Dynamic Padding: Batches are padded dynamically, significantly reducing VRAM usage compared to static padding.
  • OOM Prevention: Automatically sets PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to prevent fragmentation crashes (see the sketch below for the equivalent manual setup).
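
These are standard PyTorch/Transformers mechanisms, so the same three optimizations can be sketched by hand; the model name below is reused from the example above, not prescribed by the release:

import os

# OOM prevention: QuantLLM sets this automatically; when done manually, it
# must happen before CUDA is first initialized.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DataCollatorWithPadding

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3-8B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3-8B")

# torch.compile: traces and compiles the model's graph, which is what gives
# the training speedup on modern GPUs.
model = torch.compile(model)

# Dynamic padding: each batch is padded only to its own longest sequence,
# not to a global maximum length, which reduces wasted VRAM.
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token
collator = DataCollatorWithPadding(tokenizer)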

🛠️ Technical Improvements & Bug Fixes

  • FIXED: Resolved TypeError: object of type 'generator' has no len() during GGUF tensor processing.
  • FIXED: Solved ValueError: model did not return a loss during finetuning by integrating DataCollatorForLanguageModeling (see the sketch below).
  • FIXED: Resolved AttributeError when passing SmartConfig objects as overrides (preserved torch.dtype objects via asdict).
  • CHANGED: Disabled WandB logging by default to keep the console clean (enable via WANDB_DISABLED="false").
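
For context on the loss fix: a causal LM only returns a loss when the batch includes labels, and DataCollatorForLanguageModeling with mlm=False builds those labels from input_ids. A minimal sketch in plain Transformers, reusing the model name from the example above:

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3-8B")
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token

# mlm=False selects causal-LM mode: the collator copies input_ids into labels
# (masking padded positions with -100), so the model can compute a loss.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

batch = collator([tokenizer("Hello world"), tokenizer("Hi")])
print(batch["labels"])  # labels are now present, so model(**batch) yields a loss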

Installation:

pip install git+https://github.com/codewithdark-git/QuantLLM.git