optimize-pa-decode-gluon
Boost paged attention decode performance.
Author: fsx950223
Version: 1.0.0
Installs: 0
Category: Community
System Documentation
What problem does it solve?
This Skill addresses performance bottlenecks in the paged attention decode implementation for AMD GPUs, aiming to significantly speed up AI model inference.
Core Features & Use Cases
- Kernel Optimization: Analyzes and optimizes Triton/Gluon kernels for memory access, compute efficiency, and instruction selection.
- API-Level Tuning: Identifies and rectifies inefficiencies in the Python API wrapper for kernel dispatch and tensor management.
- Use Case: When running large language models on AMD MI300X or MI350 hardware, this Skill can be used to fine-tune the paged attention kernels, leading to faster response times and higher throughput.
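The kernel-tuning workflow described above boils down to sweeping launch parameters and keeping the fastest configuration. The sketch below is purely illustrative: `run_kernel` is a pure-Python stand-in for a real Triton/Gluon paged-attention decode launch, and the candidate block sizes are assumptions, not the skill's actual search space.

```python
import time

def run_kernel(block_size: int, seq_len: int = 4096) -> float:
    # Stand-in workload (NOT a real GPU kernel): walk the sequence in
    # block_size chunks, mimicking how a decode kernel tiles KV-cache pages.
    acc = 0.0
    for start in range(0, seq_len, block_size):
        end = min(start + block_size, seq_len)
        acc += sum(range(start, end)) / seq_len
    return acc

def autotune(candidates, repeats: int = 5):
    """Time each candidate block size and return the fastest, plus all timings."""
    timings = {}
    for block_size in candidates:
        t0 = time.perf_counter()
        for _ in range(repeats):
            run_kernel(block_size)
        timings[block_size] = (time.perf_counter() - t0) / repeats
    best = min(timings, key=timings.get)
    return best, timings

best, timings = autotune([64, 128, 256, 512])
print(f"best block size: {best}")
```

A real tuning pass would additionally vary `num_warps`, vectorized load widths, and instruction scheduling hints, and would benchmark on the target MI300X/MI350 hardware rather than on the host.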
Quick Start
Use the optimize-pa-decode-gluon skill to analyze and optimize the paged attention decode implementation in aiter/ops/triton/gluon/pa_decode_gluon.py.
Dependency Matrix
Required Modules
None required
Components
references, scripts
💻 Claude Code Installation
Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.
Please help me install this Skill:
Name: optimize-pa-decode-gluon
Download link: https://github.com/fsx950223/claude-stuff/archive/main.zip#optimize-pa-decode-gluon
Please download this .zip file, extract it, and install it in the .claude/skills/ directory.