model-pruning
Community
Compress LLMs, accelerate inference, save costs.
Category: Software Engineering
Tags: model compression, SparseGPT, LLM pruning, Wanda, model optimization, N:M sparsity, inference acceleration
Author: zechenzhangAGI
Version: 1.0.0
Installs: 0
System Documentation
What problem does it solve?
This Skill tackles the challenges of deploying large language models, which are often slow and memory-intensive. It enables you to significantly reduce model size and accelerate inference, making LLMs more practical for edge devices and cost-effective serving.
Core Features & Use Cases
- Model Compression: Reduce LLM size by 40-60% with minimal accuracy loss (typically <1%), making models lighter and easier to store.
- Inference Acceleration: Achieve 2-4× inference speedup on hardware accelerators by leveraging structured sparsity patterns such as N:M pruning (see the 2:4 sparsity sketch after this list).
- One-Shot Pruning: Compress models without the need for extensive and costly retraining, using efficient methods like Wanda and SparseGPT.
- Efficient Deployment: Enable deployment on resource-constrained hardware (e.g., mobile, edge devices) and reduce the memory footprint for serving.
- Use Case: Deploy a Llama-2-7b model on a mobile device or a low-cost GPU server, achieving faster response times and lower operational costs without sacrificing much performance.
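Structured N:M sparsity keeps at most N nonzero weights in every group of M consecutive weights, a pattern that sparse tensor cores (2:4 on NVIDIA Ampere and later) can exploit for speedups. The sketch below is only an illustration of that pattern, not part of this Skill: it enforces 2:4 sparsity by magnitude on a single weight matrix, with made-up shapes and a hypothetical function name.

```python
import torch

def prune_2_4(weight: torch.Tensor) -> torch.Tensor:
    """Zero the 2 smallest-magnitude weights in every group of 4 (2:4 sparsity).

    Assumes the input dimension is divisible by 4; real deployments also need
    a kernel/layout that actually hits the sparse tensor-core path.
    """
    out_features, in_features = weight.shape
    groups = weight.reshape(out_features, in_features // 4, 4)
    # Keep the 2 largest-magnitude entries per group of 4, zero the rest.
    _, keep_idx = groups.abs().topk(2, dim=-1)
    mask = torch.zeros_like(groups, dtype=torch.bool)
    mask.scatter_(-1, keep_idx, True)
    return (groups * mask).reshape(out_features, in_features)

w = torch.randn(8, 16)
w_sparse = prune_2_4(w)
# Every group of 4 consecutive weights now has at most 2 nonzeros.
assert (w_sparse.reshape(8, -1, 4) != 0).sum(-1).max() <= 2
```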
Quick Start
Apply Wanda pruning to a Llama-2-7b-hf model to achieve 50% sparsity using a small calibration dataset, without any retraining.
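The sketch below shows one way such a quick start could look, assuming the torch/transformers/accelerate dependencies listed under the Dependency Matrix. It scores weights Wanda-style (|weight| × per-input-feature activation norm, gathered from a tiny calibration pass) and zeroes the lowest-scoring 50% in each row of every linear layer. The calibration text, hook logic, and thresholding are simplified assumptions for illustration, not the Skill's reference implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # gated model; requires Hugging Face access
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

# Tiny placeholder calibration pass (real runs use a few hundred samples, e.g. C4).
calib_text = "Pruning removes redundant weights from large language models."
batch = tok(calib_text, return_tensors="pt").to(model.device)

# Record the l2 norm of the inputs to every linear layer via forward hooks.
norms, hooks = {}, []
def make_hook(name):
    def hook(module, inputs, output):
        x = inputs[0].detach().float().reshape(-1, inputs[0].shape[-1])
        norms[name] = torch.linalg.vector_norm(x, dim=0)  # per input feature
    return hook

for name, module in model.named_modules():
    if isinstance(module, torch.nn.Linear):
        hooks.append(module.register_forward_hook(make_hook(name)))

with torch.no_grad():
    model(**batch)
for h in hooks:
    h.remove()

# Wanda importance = |W| * ||X||_2; prune the lowest 50% of scores per row.
sparsity = 0.5
for name, module in model.named_modules():
    if isinstance(module, torch.nn.Linear) and name in norms:
        W = module.weight.data
        score = W.abs().float() * norms[name].to(W.device)
        k = int(W.shape[1] * sparsity)
        threshold = torch.kthvalue(score, k, dim=1, keepdim=True).values
        module.weight.data = torch.where(score <= threshold, torch.zeros_like(W), W)
```

The pruned weights stay dense tensors with zeros; measuring perplexity on a held-out set afterwards is the usual check that the <1% accuracy-loss claim holds for your model.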
Dependency Matrix
Required Modules
- torch
- transformers
- accelerate
Components
references
💻 Claude Code Installation
Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.
Please help me install this Skill:
Name: model-pruning
Download link: https://github.com/zechenzhangAGI/AI-research-SKILLs/archive/main.zip#model-pruning
Please download this .zip file, extract it, and install it in the .claude/skills/ directory.