gptq

Community

Compress LLMs for consumer GPUs.

Author: DoanNgocCuong
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

This Skill enables the deployment of large language models (LLMs) on hardware with limited memory, such as consumer GPUs, by significantly reducing their size.

Core Features & Use Cases

  • 4-Bit Quantization: Compresses LLMs to 4-bit precision using the GPTQ algorithm, drastically cutting memory requirements.
  • Minimal Accuracy Loss: Achieves this compression with less than a 2% increase in perplexity.
  • Faster Inference: Provides a 3-4x speedup in inference compared to standard FP16 models.
  • Use Case: Serving a 30B-class LLM on a single RTX 4090 GPU (24 GB VRAM) for real-time chat, which is impossible at FP16 (roughly 60 GB of weights alone; note that even at 4-bit, a 70B model still needs about 35 GB and will not fit in 24 GB).
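To make the memory arithmetic behind the feature list concrete, here is a minimal sketch of round-to-nearest 4-bit group quantization in plain Python, with made-up weights. This is only an illustration of the storage format and the 4x size reduction; GPTQ proper additionally minimizes layer-wise reconstruction error using second-order (Hessian) information, which this sketch omits.

```python
# Illustrative 4-bit per-group quantization (round-to-nearest absmax).
# GPTQ proper adds Hessian-based error compensation; this sketch only
# shows the storage format and the memory arithmetic.

def quantize_group(weights, n_bits=4):
    """Quantize one group of float weights to signed n-bit integers."""
    qmax = 2 ** (n_bits - 1) - 1                  # 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize_group(q, scale):
    """Reconstruct approximate float weights from the 4-bit codes."""
    return [v * scale for v in q]

weights = [0.12, -0.53, 0.07, 0.91, -0.88, 0.33, -0.02, 0.44]
q, scale = quantize_group(weights)
recon = dequantize_group(q, scale)

# 4-bit ints pack two per byte: these 8 weights take 4 bytes instead of
# 16 bytes in FP16 -- roughly 4x smaller, plus one FP16 scale per group.
max_err = max(abs(w - r) for w, r in zip(weights, recon))
print(q, round(scale, 4), round(max_err, 4))
```

In real GPTQ checkpoints the group size is typically 128, so the per-group scale adds only a small overhead on top of the packed 4-bit weights.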

Quick Start

Use the gptq skill to load the quantized model 'TheBloke/Llama-2-7B-Chat-GPTQ' onto your CUDA device.

Dependency Matrix

Required Modules

  • auto-gptq
  • transformers
  • optimum
  • peft

Components

  • scripts
  • references

💻 Claude Code Installation

Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.

Please help me install this Skill:
Name: gptq
Download link: https://github.com/DoanNgocCuong/continuous-training-pipeline_T3_2026/archive/main.zip#gptq

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
