llm-inference-scaling
Category: Community
Description: Scale LLM inference on Kubernetes
Author: BagelHole
Version: 1.0.0
Installs: 0
System Documentation
What problem does it solve?
This Skill addresses the challenge of efficiently scaling Large Language Model (LLM) inference workloads on Kubernetes, ensuring high availability and cost-effectiveness.
Core Features & Use Cases
- Automated Scaling: Dynamically adjusts the number of inference pods based on real-time traffic and resource utilization using KEDA and Prometheus metrics.
- GPU-Aware Autoscaling: Leverages custom GPU metrics to scale inference clusters effectively.
- Cost Optimization: Integrates strategies for using spot instances to reduce GPU compute costs.
- Use Case: A rapidly growing AI startup experiences unpredictable spikes in user requests to their LLM API. This Skill ensures their Kubernetes cluster automatically scales the vLLM inference pods up during peak hours and down during lulls, preventing service degradation and optimizing cloud spend.
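The cost-optimization feature above amounts to steering inference pods toward interruptible capacity. As a sketch only (not the Skill's shipped manifest): the `karpenter.sh/capacity-type` label and taint assume a Karpenter-provisioned cluster, and other provisioners or clouds use different keys, but the affinity/toleration pattern is the same.

```yaml
# Sketch: prefer (but do not require) spot GPU nodes for vLLM pods.
# The karpenter.sh/capacity-type label/taint is an assumption; substitute
# your provisioner's spot label (e.g. a GKE or EKS capacity-type label).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama-8b
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm-llama-8b
  template:
    metadata:
      labels:
        app: vllm-llama-8b
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100              # prefer spot, fall back to on-demand
            preference:
              matchExpressions:
              - key: karpenter.sh/capacity-type
                operator: In
                values: ["spot"]
      tolerations:
      - key: karpenter.sh/capacity-type
        operator: Equal
        value: spot
        effect: NoSchedule
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        resources:
          limits:
            nvidia.com/gpu: 1       # one GPU per inference pod
```

Using a preferred (soft) affinity rather than a required one lets the scheduler fall back to on-demand nodes when spot capacity is reclaimed, which matters for keeping the API available during spot interruptions.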
Quick Start
Configure KEDA to scale the 'vllm-llama-8b' deployment based on waiting requests and GPU cache usage.
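One way this quick-start step could look in practice is a KEDA ScaledObject with two Prometheus triggers. This is a hedged sketch, not the Skill's bundled config: the Prometheus `serverAddress` and the thresholds are assumptions for illustration, while `vllm:num_requests_waiting` and `vllm:gpu_cache_usage_perc` are the metric names vLLM exports by default.

```yaml
# Assumes KEDA is installed and a Prometheus server scrapes vLLM's /metrics.
# serverAddress and threshold values are illustrative placeholders.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-llama-8b-scaler
spec:
  scaleTargetRef:
    name: vllm-llama-8b        # the Deployment from this Skill's quick start
  minReplicaCount: 1
  maxReplicaCount: 8
  triggers:
  # Scale out when requests queue up in vLLM's scheduler
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring:9090
      query: sum(vllm:num_requests_waiting)
      threshold: "10"
  # Scale out when the GPU KV cache is nearly full
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring:9090
      query: avg(vllm:gpu_cache_usage_perc)
      threshold: "0.8"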
Dependency Matrix
Required Modules
None required
Components
scripts, references
💻 Claude Code Installation
Recommended: Let Claude install it automatically. Simply copy and paste the text below into Claude Code.
Please help me install this Skill:
Name: llm-inference-scaling
Download link: https://github.com/BagelHole/DevOps-Security-Agent-Skills/archive/main.zip#llm-inference-scaling
Please download this .zip file, extract it, and install it in the .claude/skills/ directory.