llm-inference-batching-scheduler
Category: Community
Description: Optimize LLM inference batching.
Author: Zurybr
Version: 1.0.0
Installs: 0
System Documentation
What problem does it solve?
This Skill tackles the problem of designing batch schedulers for LLM inference on compilation-based accelerators (such as TPUs), where each unique input shape triggers a costly recompilation. The goal is to minimize serving cost while meeting strict latency requirements.
Core Features & Use Cases
- Cost Optimization: Reduces compilation costs by minimizing unique shapes and padding overhead.
- Latency Management: Balances batching strategies to meet P95 and P99 latency thresholds.
- Use Case: When deploying LLMs on TPUs, this skill helps design a scheduler that groups incoming requests efficiently, reducing expensive shape compilations and minimizing computation wasted on padding while keeping tail latency within target.
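The trade-off described above can be sketched in a few lines. This is a minimal illustration, not part of the skill itself: the bucket sizes and function names are hypothetical. Each entry in `SEQ_BUCKETS` stands for one pre-compiled shape; requests are rounded up to the smallest bucket that fits, and the difference is padding waste.

```python
import bisect

# Hypothetical pre-compiled sequence-length buckets (one compilation each).
# Fewer buckets -> fewer compilations; more buckets -> less padding waste.
SEQ_BUCKETS = [128, 256, 512, 1024, 2048]

def pick_bucket(seq_len: int) -> int:
    """Return the smallest compiled bucket that fits the request."""
    i = bisect.bisect_left(SEQ_BUCKETS, seq_len)
    if i == len(SEQ_BUCKETS):
        raise ValueError(f"sequence length {seq_len} exceeds largest bucket")
    return SEQ_BUCKETS[i]

def padding_overhead(seq_lens: list[int]) -> float:
    """Fraction of computed tokens that are padding under this bucketing."""
    real = sum(seq_lens)
    padded = sum(pick_bucket(n) for n in seq_lens)
    return 1.0 - real / padded

print(pick_bucket(300))                                  # -> 512
print(round(padding_overhead([100, 130, 600, 900]), 3))  # -> 0.289
```

With this toy bucket set, roughly 29% of the computed tokens in the example batch are padding; the scheduler's job is to pick buckets that drive this number down without multiplying the count of compiled shapes.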
Quick Start
Analyze the request distribution and cost model to derive optimal generation bucket sizes and shape configurations for LLM inference batching.
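One simple way to "derive bucket sizes from the request distribution," as the Quick Start suggests, is to place bucket boundaries at evenly spaced quantiles of observed request lengths so each compiled shape serves roughly the same share of traffic. The sketch below is illustrative only; all names and the sample data are assumptions, and a real cost model would also weigh compilation cost against padding cost.

```python
# Sketch: derive k sequence-length buckets from observed request lengths
# by placing boundaries at evenly spaced quantiles (nearest-rank).

def quantile(sorted_xs: list[int], q: float) -> int:
    """Nearest-rank quantile of a sorted list, q in (0, 1]."""
    idx = min(int(q * len(sorted_xs)), len(sorted_xs) - 1)
    return sorted_xs[idx]

def derive_buckets(request_lens: list[int], k: int = 4) -> list[int]:
    xs = sorted(request_lens)
    buckets = sorted({quantile(xs, (i + 1) / k) for i in range(k)})
    buckets[-1] = xs[-1]  # ensure the largest observed request always fits
    return buckets

# Hypothetical observed request lengths (tokens):
lens = [90, 120, 130, 200, 220, 480, 500, 610, 900, 1800]
print(derive_buckets(lens, k=4))  # -> [130, 480, 610, 1800]
```

Because the boundaries track the empirical distribution, heavy traffic regions get tighter buckets (less padding) while rare long requests share one large bucket, keeping the number of compilations bounded at `k`.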
Dependency Matrix
Required Modules
None required
Components
references
💻 Claude Code Installation
Recommended: let Claude install it automatically. Simply copy and paste the text below into Claude Code.
Please help me install this Skill:
Name: llm-inference-batching-scheduler
Download link: https://github.com/Zurybr/lefarma-skills/archive/main.zip#llm-inference-batching-scheduler
Please download this .zip file, extract it, and install it in the .claude/skills/ directory.