benchmark-kernel

Community

Benchmark FlashInfer kernels accurately.

Author: ariusewy
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

This Skill provides a structured and accurate method for benchmarking the performance of FlashInfer's GPU kernels, enabling direct comparison between different backends and configurations.

Core Features & Use Cases

  • Accurate Timing: Uses CUPTI to capture precise GPU kernel execution times, falling back to CUDA events when CUPTI is unavailable.
  • Backend Comparison: Allows benchmarking across multiple backends like FlashAttention-2/3, cuDNN, CUTLASS, and TensorRT-LLM.
  • Reproducible Results: Supports generating reproducible benchmark commands and saving results to CSV for analysis.
  • Use Case: A performance engineer needs to determine the fastest attention kernel for their specific hardware and workload. They can use this Skill to run benchmarks for various attention routines with different batch sizes and sequence lengths, comparing the median execution time and achieved TFLOPS across multiple backends.
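The post-processing described above (median execution time, achieved TFLOPS, CSV export) can be sketched as follows. This is a minimal illustration, not the Skill's actual implementation; the function names and the FLOP count are assumptions for the example.

```python
import csv
import statistics

def summarize(times_ms, flop_count):
    """Reduce per-iteration kernel timings (in ms) to the median latency
    and the achieved TFLOPS implied by a known FLOP count per call."""
    median_ms = statistics.median(times_ms)
    tflops = flop_count / (median_ms * 1e-3) / 1e12  # FLOPs / seconds / 1e12
    return median_ms, tflops

def save_csv(path, rows):
    """Write benchmark result rows (list of dicts) to CSV for later analysis."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)

# Example: a kernel assumed to perform 2e12 FLOPs per call,
# timed over five iterations.
median_ms, tflops = summarize([1.25, 1.18, 1.31, 1.22, 1.27], flop_count=2e12)
# median_ms == 1.25, tflops == 1600.0
```

Comparing backends then reduces to running the same routine under each backend and comparing the resulting median/TFLOPS rows.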

Quick Start

Run the benchmark for the `BatchDecodeWithPagedKVCacheWrapper` routine using the FlashAttention-2 and cuDNN backends.
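A command along these lines would express that benchmark. The script name and every flag below are assumptions for illustration, not confirmed against the repository; check the Skill's scripts for the actual entry point and options.

```sh
# Hypothetical invocation -- script path and flag names are assumptions:
python flashinfer_benchmark.py \
    --routine BatchDecodeWithPagedKVCacheWrapper \
    --backends fa2 cudnn \
    --batch_size 16 \
    --s_kv 4096 \
    --output_path results.csv
```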

Dependency Matrix

Required Modules

cupti-python
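The CUPTI-with-fallback behavior implied by this dependency might be detected like so. This is a sketch; the import name `cupti` is an assumption based on the `cupti-python` package name, and the actual Skill may probe availability differently.

```python
def pick_timer():
    """Return which timing mechanism is available.

    Prefers CUPTI's activity-based timing; falls back to CUDA events
    when the CUPTI bindings cannot be imported. The module name
    `cupti` is an assumption -- adjust if the package exposes a
    different import path."""
    try:
        import cupti  # noqa: F401  (assumed import name)
        return "cupti"
    except ImportError:
        return "cuda_events"
```

On a machine without `cupti-python` installed, `pick_timer()` returns `"cuda_events"`, matching the fallback described under Core Features.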

Components

scripts, references

💻 Claude Code Installation

Recommended: Let Claude install automatically. Simply copy and paste the text below into Claude Code.

Please help me install this Skill:
Name: benchmark-kernel
Download link: https://github.com/ariusewy/flashinfer_dev/archive/main.zip#benchmark-kernel

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
View Source Repository
