hqq-quantization

Community

Quantize LLMs fast, no calibration needed.

Author: DoanNgocCuong
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

This Skill enables rapid and efficient quantization of Large Language Models (LLMs) to lower bit precisions (e.g., 4-bit, 3-bit, 2-bit) without requiring calibration datasets, significantly reducing memory footprint and speeding up inference.

Core Features & Use Cases

  • Calibration-Free Quantization: Quantize models instantly without needing a representative dataset.
  • Flexible Precision: Supports 8-bit, 4-bit, 3-bit, and 2-bit quantization with configurable group sizes.
  • Optimized Backends: Integrates with various high-performance backends like Marlin, TorchAO, and BitBlas for accelerated inference.
  • Framework Integration: Seamlessly works with HuggingFace Transformers and vLLM for easy deployment.
  • PEFT Compatible: Enables fine-tuning of quantized models using LoRA and other Parameter-Efficient Fine-Tuning techniques.
  • Use Case: You have a large LLM that consumes too much VRAM. Use this Skill to quantize it to 4-bit precision, allowing it to run on hardware with less memory while maintaining good performance.
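To make the "configurable group sizes" and bit-width knobs above concrete, here is a minimal NumPy sketch of plain group-wise round-to-nearest quantization. It is illustrative only: HQQ itself goes further by optimizing the zero-point/scale with a half-quadratic solver, which this sketch does not implement.

```python
import numpy as np

def quantize_groupwise(w, nbits=4, group_size=64):
    """Asymmetric round-to-nearest quantization per group (illustrative sketch).

    HQQ additionally refines zero/scale via half-quadratic optimization;
    this shows only the group-size / nbits mechanics."""
    qmax = 2 ** nbits - 1
    g = w.reshape(-1, group_size)             # split flat weights into groups
    wmin = g.min(axis=1, keepdims=True)
    wmax = g.max(axis=1, keepdims=True)
    scale = (wmax - wmin) / qmax              # per-group scale
    scale = np.where(scale == 0, 1.0, scale)  # guard against constant groups
    zero = -wmin / scale                      # per-group zero point
    q = np.clip(np.round(g / scale + zero), 0, qmax).astype(np.uint8)
    return q, scale, zero

def dequantize(q, scale, zero, shape):
    """Map integer codes back to approximate float weights."""
    return ((q.astype(np.float32) - zero) * scale).reshape(shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((128, 128)).astype(np.float32)
q, s, z = quantize_groupwise(w, nbits=4, group_size=64)
w_hat = dequantize(q, s, z, w.shape)
err = np.abs(w - w_hat).max()  # reconstruction error, bounded by ~scale/2
```

Stored as 4-bit codes plus one scale and zero per 64 weights, this layout is roughly a 4x memory reduction versus float16, which is the effect the skill exploits for large models.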

Quick Start

Use the hqq-quantization skill to quantize the 'meta-llama/Llama-3.1-8B' model to 4-bit precision.
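For orientation, the kind of code the skill would produce looks roughly like the following sketch, which uses the `HqqConfig` integration in HuggingFace Transformers. The exact parameters (`nbits`, `group_size`) and device setup are illustrative assumptions, not the skill's guaranteed output; running it requires `pip install hqq`, a GPU, and access to the gated Llama weights.

```python
# Hedged sketch: 4-bit HQQ quantization through the Transformers integration.
# No calibration dataset is passed anywhere -- HQQ is calibration-free.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig

quant_config = HqqConfig(nbits=4, group_size=64)  # illustrative settings

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    torch_dtype=torch.float16,
    device_map="cuda",
    quantization_config=quant_config,  # weights are quantized at load time
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
```

Because quantization happens at load time with no calibration pass, the whole process takes minutes rather than the hours some calibration-based methods need.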

Dependency Matrix

Required Modules

hqq
torch

Components

scripts
references
assets

💻 Claude Code Installation

Recommended: let Claude install it automatically. Simply copy and paste the text below into Claude Code.

Please help me install this Skill:
Name: hqq-quantization
Download link: https://github.com/DoanNgocCuong/continuous-training-pipeline_T3_2026/archive/main.zip#hqq-quantization

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
View Source Repository

Agent Skills Search Helper

Install a tiny helper for your Agent to search and equip skills on demand from a library of 223,000+ vetted skills.