llm-inference-batching-scheduler

Community

Optimize LLM inference batching.

Author: Zurybr
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

This Skill addresses the challenge of designing batch schedulers for LLM inference on compilation-based accelerators, minimizing serving cost while adhering to strict latency requirements.

Core Features & Use Cases

  • Cost Optimization: Reduces compilation costs by minimizing unique shapes and padding overhead.
  • Latency Management: Balances batching strategies to meet P95 and P99 latency thresholds.
  • Use Case: When deploying LLMs on TPUs, this skill helps design a scheduler that groups incoming requests so as to reduce expensive shape compilations and cut computation wasted on padding, while keeping response times fast.
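The shape-reduction idea above can be sketched as bucketed batching: round each request's sequence length up to one of a small, fixed set of bucket sizes, so the accelerator compiles at most one kernel per bucket and padding waste stays bounded. This is a minimal illustrative sketch; the bucket sizes and helper names are assumptions, not part of the Skill itself.

```python
# Hypothetical bucket sizes; a real deployment would derive these
# from the observed request-length distribution.
BUCKETS = [128, 256, 512, 1024, 2048]

def pick_bucket(seq_len: int, buckets=BUCKETS) -> int:
    """Return the smallest bucket that fits seq_len.

    Padding every request up to a bucket boundary caps the number
    of unique shapes the compiler ever sees at len(buckets).
    """
    for b in buckets:
        if seq_len <= b:
            return b
    raise ValueError(f"sequence length {seq_len} exceeds max bucket {buckets[-1]}")

def batch_by_bucket(requests):
    """Group (request_id, seq_len) pairs by padded bucket.

    Each resulting group shares one padded shape, so it can be
    batched and dispatched without triggering a new compilation.
    """
    groups: dict[int, list] = {}
    for req_id, seq_len in requests:
        groups.setdefault(pick_bucket(seq_len), []).append(req_id)
    return groups
```

The trade-off is explicit: fewer buckets means fewer compilations but more padding per request; more buckets means the reverse.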

Quick Start

Analyze the request distribution and cost model to derive optimal generation bucket sizes and shape configurations for LLM inference batching.
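One simple way to carry out this analysis, sketched below under assumed inputs (the function names and quantile strategy are illustrative, not the Skill's actual method), is to place bucket boundaries at evenly spaced quantiles of the observed length distribution and then measure the padding waste they induce:

```python
def quantile_buckets(lengths, n_buckets: int):
    """Pick bucket boundaries at evenly spaced quantiles of the
    sorted length distribution; the last bucket covers the maximum."""
    xs = sorted(lengths)
    bounds = []
    for i in range(1, n_buckets + 1):
        idx = min(len(xs) - 1, round(i * len(xs) / n_buckets) - 1)
        bounds.append(xs[idx])
    # Deduplicate while preserving order, in case quantiles collide.
    out = []
    for b in bounds:
        if not out or b > out[-1]:
            out.append(b)
    return out

def padding_waste(lengths, bounds):
    """Total padded tokens minus real tokens under these buckets."""
    waste = 0
    for length in lengths:
        bucket = next(x for x in bounds if length <= x)
        waste += bucket - length
    return waste
```

Comparing `padding_waste` across candidate bucket counts makes the compilation-vs-padding trade-off quantitative: each extra bucket adds one compiled shape but reduces total padded tokens.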

Dependency Matrix

Required Modules

None required

Components

references

💻 Claude Code Installation

Recommended: Let Claude install it automatically. Simply copy and paste the text below into Claude Code.

Please help me install this Skill:
Name: llm-inference-batching-scheduler
Download link: https://github.com/Zurybr/lefarma-skills/archive/main.zip#llm-inference-batching-scheduler

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
