model-pruning

Community

Compress LLMs, accelerate inference, save costs.

Author: zechenzhangAGI
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

This Skill tackles the challenges of deploying large language models, which are often slow and memory-intensive. It enables you to significantly reduce model size and accelerate inference, making LLMs more practical for edge devices and cost-effective serving.

Core Features & Use Cases

  • Model Compression: Reduce LLM size by 40-60% with minimal accuracy loss (typically <1%), making models lighter and easier to store.
  • Inference Acceleration: Achieve 2-4× speedup in inference on hardware accelerators by leveraging structured sparsity patterns such as N:M pruning (a 2:4 example is sketched after this list).
  • One-Shot Pruning: Compress models without the need for extensive and costly retraining, using efficient methods like Wanda and SparseGPT.
  • Efficient Deployment: Enable deployment on resource-constrained hardware (e.g., mobile, edge devices) and reduce the memory footprint for serving.
  • Use Case: Deploy a Llama-2-7b model on a mobile device or a low-cost GPU server, achieving faster response times and lower operational costs without sacrificing much performance.
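
The structured patterns mentioned above follow an N:M rule: within every group of M consecutive weights, at most N remain nonzero, which is what hardware accelerators exploit for speedup. As a minimal illustration (not part of this Skill's own scripts), the sketch below enforces the 2:4 pattern supported by recent NVIDIA GPUs using plain PyTorch; the function name and toy tensor are hypothetical.

```python
import torch

def apply_2_4_sparsity(weight: torch.Tensor) -> torch.Tensor:
    """Zero the 2 smallest-magnitude weights in every group of 4 (2:4 pattern)."""
    out_features, in_features = weight.shape
    assert in_features % 4 == 0, "2:4 sparsity expects in_features divisible by 4"
    w = weight.reshape(out_features, in_features // 4, 4)
    # Locate the 2 smallest-magnitude entries in each group of 4 and mask them out.
    _, drop_idx = torch.topk(w.abs(), k=2, dim=-1, largest=False)
    mask = torch.ones_like(w)
    mask.scatter_(-1, drop_idx, 0.0)
    return (w * mask).reshape(out_features, in_features)

# Toy check: every group of 4 input weights keeps at most 2 nonzeros.
W = torch.randn(8, 16)
W_sparse = apply_2_4_sparsity(W)
assert (W_sparse.reshape(8, -1, 4) != 0).sum(dim=-1).max() <= 2
```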

Quick Start

Apply Wanda pruning to a Llama-2-7b-hf model to achieve 50% sparsity using a small calibration dataset, without any retraining.
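
A minimal sketch of that workflow is shown below, assuming you have access to the gated meta-llama/Llama-2-7b-hf checkpoint on the Hugging Face Hub and a GPU with enough memory. It implements the Wanda importance score (|weight| × input-activation L2 norm) directly with forward hooks rather than calling the Skill's own scripts, and it uses two hard-coded calibration sentences where the usual recipe draws on the order of 128 sequences from C4; treat it as an illustration of the idea, not the reference implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-hf"  # gated checkpoint; assumes you have Hub access
SPARSITY = 0.5

model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL)

# 1. Collect per-feature input activation norms with forward hooks on each Linear layer.
norms = {}
def make_hook(name):
    def hook(module, inputs, output):
        x = inputs[0].detach().float().reshape(-1, inputs[0].shape[-1])
        norms[name] = norms.get(name, 0) + x.pow(2).sum(dim=0)  # accumulate squared L2 norms
    return hook

handles = []
for name, module in model.named_modules():
    if isinstance(module, torch.nn.Linear) and "lm_head" not in name:
        handles.append(module.register_forward_hook(make_hook(name)))

# 2. Run a tiny calibration set through the model (illustrative texts only).
calib_texts = [
    "Large language models can be pruned after training.",
    "Calibration data only needs to cover typical activations.",
]
for text in calib_texts:
    ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        model(ids)
for h in handles:
    h.remove()

# 3. Prune: score = |W| * ||x||_2, dropping the lowest-scoring 50% within each output row.
for name, module in model.named_modules():
    if name in norms:
        W = module.weight.data
        score = W.abs() * norms[name].sqrt().to(W.device, W.dtype)  # Wanda importance
        k = int(W.shape[1] * SPARSITY)
        _, drop = torch.topk(score, k, dim=1, largest=False)
        W.scatter_(1, drop, 0)
```

Because the scores depend only on weight magnitudes and calibration activations, no gradient computation or retraining is involved; the pruned model can be saved with model.save_pretrained and evaluated immediately.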

Dependency Matrix

Required Modules

  • torch
  • transformers
  • accelerate

Components

references

💻 Claude Code Installation

Recommended: Let Claude install automatically. Simply copy and paste the text below into Claude Code.

Please help me install this Skill:
Name: model-pruning
Download link: https://github.com/zechenzhangAGI/AI-research-SKILLs/archive/main.zip#model-pruning

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
View Source Repository