gptq
Community Skill — Compress LLMs for consumer GPUs.
Author: DoanNgocCuong
Version: 1.0.0
Installs: 0
System Documentation
What problem does it solve?
This Skill enables the deployment of large language models (LLMs) on hardware with limited memory, such as consumer GPUs, by quantizing their weights and thereby significantly reducing their memory footprint.
Core Features & Use Cases
- 4-Bit Quantization: Compresses LLMs to 4-bit precision using the GPTQ algorithm, drastically cutting memory requirements.
- Minimal Accuracy Loss: Achieves this compression with typically less than a 2% increase in perplexity relative to the full-precision model.
- Faster Inference: Can deliver up to a 3-4x inference speedup over standard FP16 models on memory-bandwidth-bound workloads.
- Use Case: Running a 30B-class LLM for real-time chat on a single RTX 4090 GPU (24 GB VRAM), which would be impossible at FP16 (roughly 60 GB of weight memory alone). Note that a 70B model still needs about 35 GB even at 4-bit, so it exceeds a single 24 GB card.
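To see where these figures come from: weight-only memory scales linearly with bits per parameter. A minimal sketch, ignoring activation, KV-cache, and quantization-metadata overhead (the helper name is illustrative, not part of the Skill):

```python
def weight_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate weight-only memory in gigabytes (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

# A 30B-class model: FP16 vs. 4-bit GPTQ
print(weight_gb(30, 16))  # 60.0 GB -- far beyond a 24 GB RTX 4090
print(weight_gb(30, 4))   # 15.0 GB -- fits, with headroom for the KV cache
print(weight_gb(70, 4))   # 35.0 GB -- why even 4-bit 70B needs >24 GB
```

The same formula explains the "drastically cutting memory requirements" claim above: going from 16-bit to 4-bit is a fixed 4x reduction in weight storage, whatever the model size.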
Quick Start
Use the gptq skill to load the quantized model 'TheBloke/Llama-2-7B-Chat-GPTQ' onto your CUDA device.
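In code, the Quick Start prompt corresponds to something like the sketch below. It assumes a recent transformers (4.32+, which loads GPTQ checkpoints through optimum and auto-gptq) and an available CUDA device; the prompt text and generation settings are illustrative assumptions, not part of the Skill.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Llama-2-7B-Chat-GPTQ"

# device_map="auto" places the quantized weights on the available GPU(s)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Explain GPTQ quantization in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Because the checkpoint is already quantized, no calibration step runs at load time; the 4-bit weights are simply mapped onto the device.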
Dependency Matrix
Required Modules
- auto-gptq
- transformers
- optimum
- peft
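The modules above can be installed in one step with pip. This is an environment-setup fragment with versions left unpinned; pinning exact versions for your CUDA toolkit is an assumption left to the reader:

```shell
# Install the Skill's required modules (unpinned; pin versions as needed)
pip install auto-gptq transformers optimum peft
```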
Components
- scripts
- references
💻 Claude Code Installation
Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.
Please help me install this Skill:
Name: gptq
Download link: https://github.com/DoanNgocCuong/continuous-training-pipeline_T3_2026/archive/main.zip#gptq
Please download this .zip file, extract it, and install it in the .claude/skills/ directory.