llm-inference-scaling

Community

Scale LLM inference on Kubernetes

Author: BagelHole
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

This Skill addresses the challenge of efficiently scaling Large Language Model (LLM) inference workloads on Kubernetes, keeping them highly available while controlling cost.

Core Features & Use Cases

  • Automated Scaling: Dynamically adjusts the number of inference pods based on real-time traffic and resource utilization using KEDA and Prometheus metrics.
  • GPU-Aware Autoscaling: Leverages custom GPU metrics to scale inference clusters effectively.
  • Cost Optimization: Integrates strategies for using spot instances to reduce GPU compute costs.
  • Use Case: A rapidly growing AI startup experiences unpredictable spikes in user requests to their LLM API. This Skill ensures their Kubernetes cluster automatically scales the vLLM inference pods up during peak hours and down during lulls, preventing service degradation and optimizing cloud spend.
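As an illustration of the spot-instance strategy above, a GPU inference Deployment can be pinned to spot-backed node pools with a node selector and a matching toleration. The sketch below assumes GKE's spot label and taint (`cloud.google.com/gke-spot`); other clouds use different keys, and the deployment details are placeholders rather than the Skill's shipped manifests.

```yaml
# Sketch: schedule vLLM inference pods onto spot GPU nodes (GKE labels assumed).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama-8b            # matches the deployment named in Quick Start
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm-llama-8b
  template:
    metadata:
      labels:
        app: vllm-llama-8b
    spec:
      nodeSelector:
        cloud.google.com/gke-spot: "true"   # GKE-specific spot node label
      tolerations:
        - key: cloud.google.com/gke-spot    # tolerate the spot-node taint
          operator: Equal
          value: "true"
          effect: NoSchedule
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          resources:
            limits:
              nvidia.com/gpu: 1             # one GPU per inference pod
```

Spot capacity can be reclaimed at any time, so pair this with autoscaling (below) and sensible replica minimums rather than a single pod.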

Quick Start

Configure KEDA to scale the 'vllm-llama-8b' deployment based on waiting requests and GPU cache usage.
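A minimal KEDA configuration for that quick-start setup might look like the following. This is a sketch, not the Skill's shipped manifest: the Prometheus address and thresholds are placeholders, and the metric names (`vllm:num_requests_waiting`, `vllm:gpu_cache_usage_perc`) assume vLLM's built-in Prometheus exporter is being scraped.

```yaml
# Sketch: KEDA ScaledObject scaling vllm-llama-8b on vLLM Prometheus metrics.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-llama-8b-scaler
spec:
  scaleTargetRef:
    name: vllm-llama-8b          # the Deployment to scale
  minReplicaCount: 1
  maxReplicaCount: 8
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090  # placeholder
        query: sum(vllm:num_requests_waiting)                 # queued requests
        threshold: "10"          # scale out when >10 requests are waiting
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        query: avg(vllm:gpu_cache_usage_perc)                 # KV-cache fill ratio
        threshold: "0.8"         # scale out above ~80% cache usage
```

Using two triggers lets KEDA scale on whichever signal saturates first: request queue depth or GPU KV-cache pressure.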

Dependency Matrix

Required Modules

None required

Components

  • scripts
  • references

💻 Claude Code Installation

Recommended: let Claude install it automatically. Copy and paste the text below into Claude Code.

Please help me install this Skill:
Name: llm-inference-scaling
Download link: https://github.com/BagelHole/DevOps-Security-Agent-Skills/archive/main.zip#llm-inference-scaling

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
