prefetch-data-load

Community

Overlap GPU compute with data loads.

Author: fsx950223
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

This Skill optimizes GPU kernel performance by overlapping data-load latency with computation, which can substantially reduce execution time for memory-bound loops.
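To see why overlapping helps, here is an idealized timing model. It assumes N loop iterations, each with a load time t_L and a compute time t_C, and that a prefetched load runs fully in parallel with the current compute. It also ignores instruction-issue overhead and extra register or shared-memory pressure, so treat it as an upper bound on the benefit:

```latex
\begin{aligned}
T_{\text{serial}}   &= N\,(t_L + t_C) \\
T_{\text{prefetch}} &= t_L + (N-1)\,\max(t_L,\ t_C) + t_C
\end{aligned}
```

The first t_L is the prologue load, the middle term is the steady state where loading iteration i+1 overlaps computing iteration i, and the final t_C is the last compute with nothing left to prefetch. For example, with N = 64 and t_L = t_C = 1 (arbitrary units), T_serial = 128 and T_prefetch = 1 + 63 + 1 = 65, roughly a 2x reduction. When one stage dominates, the steady state is bounded by the slower stage, so prefetching hides at most the shorter of the two latencies.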

Core Features & Use Cases

  • Software Prefetching: Implements double-buffering, preloading data for the next iteration while the current iteration computes.
  • Latency Hiding: Hides global memory load latency behind compute instructions in Triton/Gluon kernels.
  • Use Case: Accelerate deep learning inference by optimizing the data-loading pipeline of matrix multiplication kernels that use MFMA (Matrix Fused Multiply-Add) instructions, for example in attention mechanisms or other compute-intensive operations.
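The double-buffering pattern described above can be sketched on the CPU in plain Python. This is an analogy, not Triton or Gluon code and not this Skill's implementation: `load_tile` and `compute_tile` are hypothetical stand-ins for a global-memory load and a compute stage, and a single background thread plays the role of the asynchronous load path. The loop has the usual prologue (load tile 0), steady state (issue the load for tile i+1, then compute tile i), and epilogue (compute the last tile with no further prefetch).

```python
from concurrent.futures import ThreadPoolExecutor
import time


def load_tile(data, i, tile):
    """Hypothetical stand-in for a global-memory load of tile i."""
    time.sleep(0.001)  # simulated load latency; sleep releases the GIL so it can overlap
    return data[i * tile:(i + 1) * tile]


def compute_tile(buf):
    """Hypothetical stand-in for the compute stage: sum of squares of one tile."""
    return sum(x * x for x in buf)


def pipelined_sum_squares(data, tile):
    """Double-buffered loop: the next tile loads while the current tile computes."""
    if tile <= 0:
        raise ValueError("tile must be positive")
    n_tiles = (len(data) + tile - 1) // tile
    if n_tiles == 0:
        return 0
    acc = 0
    with ThreadPoolExecutor(max_workers=1) as loader:
        # Prologue: issue the load for tile 0 before entering the loop.
        pending = loader.submit(load_tile, data, 0, tile)
        for i in range(n_tiles):
            # Wait for tile i; in steady state it was requested one iteration ago.
            current = pending.result()
            if i + 1 < n_tiles:
                # Prefetch: start filling the second buffer with tile i+1 ...
                pending = loader.submit(load_tile, data, i + 1, tile)
            # ... while this thread computes on tile i (the first buffer).
            acc += compute_tile(current)
    return acc


if __name__ == "__main__":
    values = list(range(10))
    print(pipelined_sum_squares(values, 3), sum(x * x for x in values))
```

In a Triton/Gluon kernel the same shape applies: hoist the first `tl.load` above the loop, issue the next iteration's load at the top of the loop body, and compute on the previously loaded tile. Depending on the backend and version, Triton's own `num_stages` software pipelining may already cover some loops, so it is worth checking the generated code before and after the rewrite.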

Quick Start

Apply the prefetch optimization to the loop of a provided Triton/Gluon kernel.

Dependency Matrix

Required Modules

None required

Components

references

💻 Claude Code Installation

Recommended: let Claude install the Skill automatically. Copy the text below and paste it into Claude Code.

Please help me install this Skill:
Name: prefetch-data-load
Download link: https://github.com/fsx950223/claude-stuff/archive/main.zip#prefetch-data-load

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.