optimize-pa-decode-gluon


Boost paged attention decode performance.

Author: fsx950223
Version: 1.0.0

System Documentation

What problem does it solve?

This Skill addresses performance bottlenecks in the paged attention decode implementation for AMD GPUs, aiming to reduce per-token decode latency during AI model inference.

Core Features & Use Cases

  • Kernel Optimization: Analyzes and optimizes Triton/Gluon kernels for memory access, compute efficiency, and instruction selection.
  • API-Level Tuning: Identifies and rectifies inefficiencies in the Python API wrapper for kernel dispatch and tensor management.
  • Use Case: When running large language models on AMD MI300X or MI350 hardware, this Skill can be used to tune the paged attention kernels, leading to faster response times and higher throughput.
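To make the optimization target concrete, here is a minimal pure-Python sketch of the computation a paged attention decode kernel performs: a single query attends over a KV cache stored in fixed-size blocks, with a block table mapping logical block indices to physical blocks. All names (`kv_cache`, `block_table`) are illustrative, not the actual aiter API; the real kernel does this in parallel on the GPU.

```python
import math

def paged_attention_decode(q, kv_cache, block_table, seq_len, block_size):
    """Reference (unoptimized) paged attention for one decode step.

    q: query vector [head_dim]
    kv_cache: dict physical_block_id -> list of (k, v) pairs (one per slot)
    block_table: logical block index -> physical block id
    """
    scale = 1.0 / math.sqrt(len(q))
    scores, values = [], []
    for pos in range(seq_len):
        # Paged lookup: translate token position to a physical cache slot.
        block = block_table[pos // block_size]
        k, v = kv_cache[block][pos % block_size]
        scores.append(scale * sum(qi * ki for qi, ki in zip(q, k)))
        values.append(v)
    # Numerically stable softmax over the attention scores.
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    return [sum(w[i] * values[i][j] for i in range(seq_len)) / z
            for j in range(len(q))]
```

The optimized Gluon kernel must produce the same result while coalescing the scattered block loads and fusing the softmax with the value reduction.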

Quick Start

Use the optimize-pa-decode-gluon skill to analyze and optimize the paged attention decode implementation in aiter/ops/triton/gluon/pa_decode_gluon.py.
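Optimization work of this kind is typically validated by timing the kernel before and after each change. A minimal latency harness sketch (generic, not part of the skill itself) might look like:

```python
import statistics
import time

def median_latency_us(fn, *args, warmup=3, iters=20):
    """Median wall-clock latency of fn(*args) in microseconds.

    Warmup iterations absorb one-time costs (e.g. JIT compilation of a
    Triton/Gluon kernel) before measurement begins.
    """
    for _ in range(warmup):
        fn(*args)
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn(*args)
        samples.append((time.perf_counter() - t0) * 1e6)
    return statistics.median(samples)
```

On GPU workloads you would additionally synchronize the device (e.g. `torch.cuda.synchronize()`) before reading the clock, since kernel launches are asynchronous.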

Dependency Matrix

Required Modules

None required

Components

references, scripts

💻 Claude Code Installation

Recommended: Let Claude install automatically. Simply copy and paste the text below into Claude Code.

Please help me install this Skill:
Name: optimize-pa-decode-gluon
Download link: https://github.com/fsx950223/claude-stuff/archive/main.zip#optimize-pa-decode-gluon

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
