optimize-pa-decode-gluon


Boost paged attention decode performance.

Author: fsx950223
Version: 1.0.0

System Documentation

What problem does it solve?

This Skill addresses performance bottlenecks in the paged attention decode implementation for AMD GPUs, aiming to reduce per-token decode latency during AI model inference.

Core Features & Use Cases

  • Kernel Optimization: Analyzes and optimizes Triton/Gluon kernels for memory access, compute efficiency, and instruction selection.
  • API-Level Tuning: Identifies and rectifies inefficiencies in the Python API wrapper for kernel dispatch and tensor management.
  • Use Case: When running large language models on AMD MI300X or MI350 hardware, this Skill can be used to tune the paged attention kernels, leading to faster response times and higher throughput.
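To make the optimization target concrete, here is a minimal pure-Python sketch of the computation a paged attention decode kernel performs: a single query attends over a KV cache stored in fixed-size blocks, with a block table mapping logical block indices to physical blocks. All names (`kv_cache`, `block_table`) are illustrative, not the actual aiter API; the real kernel does this in parallel on the GPU.

```python
import math

def paged_attention_decode(q, kv_cache, block_table, seq_len, block_size):
    """Reference (unoptimized) paged attention for one decode step.

    q: query vector [head_dim]
    kv_cache: dict physical_block_id -> list of (k, v) pairs (one per slot)
    block_table: logical block index -> physical block id
    """
    scale = 1.0 / math.sqrt(len(q))
    scores, values = [], []
    for pos in range(seq_len):
        # Paged lookup: translate token position to a physical cache slot.
        block = block_table[pos // block_size]
        k, v = kv_cache[block][pos % block_size]
        scores.append(scale * sum(qi * ki for qi, ki in zip(q, k)))
        values.append(v)
    # Numerically stable softmax over the attention scores.
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    return [sum(w[i] * values[i][j] for i in range(seq_len)) / z
            for j in range(len(q))]
```

The optimized Gluon kernel must produce the same result while coalescing the scattered block loads and fusing the softmax with the value reduction.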

Quick Start

Use the optimize-pa-decode-gluon skill to analyze and optimize the paged attention decode implementation in aiter/ops/triton/gluon/pa_decode_gluon.py.
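Optimization work of this kind is typically validated by timing the kernel before and after each change. A minimal latency harness sketch (generic, not part of the skill itself) might look like:

```python
import statistics
import time

def median_latency_us(fn, *args, warmup=3, iters=20):
    """Median wall-clock latency of fn(*args) in microseconds.

    Warmup iterations absorb one-time costs (e.g. JIT compilation of a
    Triton/Gluon kernel) before measurement begins.
    """
    for _ in range(warmup):
        fn(*args)
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn(*args)
        samples.append((time.perf_counter() - t0) * 1e6)
    return statistics.median(samples)
```

On GPU workloads you would additionally synchronize the device (e.g. `torch.cuda.synchronize()`) before reading the clock, since kernel launches are asynchronous.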

Dependency Matrix

Required Modules

None required

Components

references, scripts

💻 Claude Code Installation

Recommended: Let Claude install automatically. Simply copy and paste the text below into Claude Code.

Please help me install this Skill:
Name: optimize-pa-decode-gluon
Download link: https://github.com/fsx950223/claude-stuff/archive/main.zip#optimize-pa-decode-gluon

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
