benchmark-kernel
Benchmark FlashInfer kernels accurately.
Category: Community
Author: ariusewy
Version: 1.0.0
Installs: 0
System Documentation
What problem does it solve?
This Skill provides a structured and accurate method for benchmarking the performance of FlashInfer's GPU kernels, enabling direct comparison between different backends and configurations.
Core Features & Use Cases
- Accurate Timing: Utilizes CUPTI for precise GPU kernel execution times, falling back to CUDA events when CUPTI is unavailable.
- Backend Comparison: Allows benchmarking across multiple backends like FlashAttention-2/3, cuDNN, CUTLASS, and TensorRT-LLM.
- Reproducible Results: Supports generating reproducible benchmark commands and saving results to CSV for analysis.
- Use Case: A performance engineer needs to determine the fastest attention kernel for their specific hardware and workload. They can use this Skill to run benchmarks for various attention routines with different batch sizes and sequence lengths, comparing the median execution time and achieved TFLOPS across multiple backends.
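The timing strategy above can be sketched as follows. This is a minimal illustration of the CUDA-event path with a wall-clock fallback, not the Skill's actual implementation: the function name `time_kernel` and its parameters are hypothetical, and the CUPTI path (which the Skill prefers when available) is not modeled here.

```python
import statistics
import time

def time_kernel(fn, warmup=10, iters=100):
    """Return the median execution time of `fn` in milliseconds.

    Uses CUDA events when a GPU is available, otherwise falls back
    to wall-clock timing (hypothetical helper, for illustration only).
    """
    try:
        import torch
        use_cuda = torch.cuda.is_available()
    except ImportError:
        torch = None
        use_cuda = False

    # Warm-up iterations are excluded from the measurement.
    for _ in range(warmup):
        fn()

    times_ms = []
    if use_cuda:
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        for _ in range(iters):
            start.record()
            fn()
            end.record()
            torch.cuda.synchronize()
            times_ms.append(start.elapsed_time(end))
    else:
        for _ in range(iters):
            t0 = time.perf_counter()
            fn()
            times_ms.append((time.perf_counter() - t0) * 1e3)

    # Median is robust to scheduling outliers, which is why the
    # Skill reports it rather than the mean.
    return statistics.median(times_ms)
```

Reporting the median rather than the mean keeps a single slow iteration (e.g. a clock-frequency ramp) from skewing the comparison between backends.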
Quick Start
Run the benchmark for the BatchDecodeWithPagedKVCacheWrapper routine using the FlashAttention-2 and cuDNN backends, then compare the reported median times and achieved TFLOPS.
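A comparison like the one above could be post-processed as in this sketch. The backend labels, timing values, FLOP count, and CSV layout are all illustrative assumptions; in a real run the timings would come from the benchmark harness itself.

```python
import csv
import statistics

def achieved_tflops(flops, time_ms):
    """Convert a FLOP count and a kernel time in milliseconds to TFLOPS."""
    return flops / (time_ms * 1e-3) / 1e12

# Hypothetical per-backend measurements (ms) for one workload.
results = {
    "fa2":   [0.212, 0.208, 0.210],
    "cudnn": [0.190, 0.195, 0.192],
}
# Illustrative FLOP count for one attention workload (not a real measurement).
flops = 2 * 32 * 8 * 1024 * 128 * 128

# Save results to CSV for later analysis, mirroring the Skill's
# "reproducible results" feature.
with open("bench_results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["backend", "median_ms", "tflops"])
    for backend, times in results.items():
        med = statistics.median(times)
        writer.writerow([backend, f"{med:.3f}", f"{achieved_tflops(flops, med):.2f}"])
```

Persisting the median time alongside the derived TFLOPS lets later runs on different hardware be compared from the CSV alone.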
Dependency Matrix
Required Modules
cupti-python
Components
scripts, references
💻 Claude Code Installation
Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.
Please help me install this Skill:
Name: benchmark-kernel
Download link: https://github.com/ariusewy/flashinfer_dev/archive/main.zip#benchmark-kernel
Please download this .zip file, extract it, and install it in the .claude/skills/ directory.