llm-inference-scaling

Community

Scale LLM inference on Kubernetes

Author: BagelHole
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

This Skill addresses the challenge of efficiently scaling Large Language Model (LLM) inference workloads on Kubernetes, keeping them highly available while controlling cost.

Core Features & Use Cases

  • Automated Scaling: Dynamically adjusts the number of inference pods based on real-time traffic and resource utilization using KEDA and Prometheus metrics.
  • GPU-Aware Autoscaling: Leverages custom GPU metrics to scale inference clusters effectively.
  • Cost Optimization: Integrates strategies for using spot instances to reduce GPU compute costs.
  • Use Case: A rapidly growing AI startup experiences unpredictable spikes in user requests to their LLM API. This Skill ensures their Kubernetes cluster automatically scales the vLLM inference pods up during peak hours and down during lulls, preventing service degradation and optimizing cloud spend.
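As an illustration of the spot-instance strategy above, a GPU inference Deployment can be pinned to spot-backed node pools with a node selector and a matching toleration. The sketch below assumes GKE's spot label and taint (`cloud.google.com/gke-spot`); other clouds use different keys, and the deployment details are placeholders rather than the Skill's shipped manifests.

```yaml
# Sketch: schedule vLLM inference pods onto spot GPU nodes (GKE labels assumed).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama-8b            # matches the deployment named in Quick Start
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm-llama-8b
  template:
    metadata:
      labels:
        app: vllm-llama-8b
    spec:
      nodeSelector:
        cloud.google.com/gke-spot: "true"   # GKE-specific spot node label
      tolerations:
        - key: cloud.google.com/gke-spot    # tolerate the spot-node taint
          operator: Equal
          value: "true"
          effect: NoSchedule
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          resources:
            limits:
              nvidia.com/gpu: 1             # one GPU per inference pod
```

Spot capacity can be reclaimed at any time, so pair this with autoscaling (below) and sensible replica minimums rather than a single pod.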

Quick Start

Configure KEDA to scale the 'vllm-llama-8b' deployment based on waiting requests and GPU cache usage.
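A minimal KEDA configuration for that quick-start setup might look like the following. This is a sketch, not the Skill's shipped manifest: the Prometheus address and thresholds are placeholders, and the metric names (`vllm:num_requests_waiting`, `vllm:gpu_cache_usage_perc`) assume vLLM's built-in Prometheus exporter is being scraped.

```yaml
# Sketch: KEDA ScaledObject scaling vllm-llama-8b on vLLM Prometheus metrics.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-llama-8b-scaler
spec:
  scaleTargetRef:
    name: vllm-llama-8b          # the Deployment to scale
  minReplicaCount: 1
  maxReplicaCount: 8
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090  # placeholder
        query: sum(vllm:num_requests_waiting)                 # queued requests
        threshold: "10"          # scale out when >10 requests are waiting
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        query: avg(vllm:gpu_cache_usage_perc)                 # KV-cache fill ratio
        threshold: "0.8"         # scale out above ~80% cache usage
```

Using two triggers lets KEDA scale on whichever signal saturates first: request queue depth or GPU KV-cache pressure.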

Dependency Matrix

Required Modules

None required

Components

  • scripts
  • references

💻 Claude Code Installation

Recommended: let Claude install it automatically. Copy and paste the text below into Claude Code.

Please help me install this Skill:
Name: llm-inference-scaling
Download link: https://github.com/BagelHole/DevOps-Security-Agent-Skills/archive/main.zip#llm-inference-scaling

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
