serving-llms-vllm
Community
High-throughput LLM serving with vLLM.
Author: ovachiever
Version: 1.0.0
Installs: 0
System Documentation
What problem does it solve?
This Skill introduces vLLM-based serving for LLMs, delivering scalable, low-latency inference through PagedAttention, continuous batching, and OpenAI-compatible endpoints, with support for quantization and tensor parallelism.
Core Features & Use Cases
- Production-grade serving: Deploy OpenAI-compatible LLM APIs with high throughput.
- Memory-efficient inference: Leverage PagedAttention, quantization, and tensor parallelism to fit large models on available GPUs (see the sketch after this list).
- Operational excellence: Tune latency and throughput in production and monitor deployments with metrics and health checks.
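As a rough illustration of the memory-efficiency knobs, the sketch below uses vLLM's offline Python API. The model checkpoint, quantization method, and GPU count are illustrative assumptions, not part of this Skill.

```python
# Minimal sketch: memory-efficient inference with vLLM's offline API.
# Assumptions: an AWQ-quantized 7B checkpoint and 2 GPUs are placeholders;
# swap in your own model and hardware settings.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",  # assumed quantized checkpoint
    quantization="awq",               # load AWQ weights to reduce memory use
    tensor_parallel_size=2,           # shard the model across 2 GPUs
    gpu_memory_utilization=0.90,      # fraction of each GPU reserved for vLLM
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Summarize PagedAttention in one sentence."], params)
print(outputs[0].outputs[0].text)
```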
Quick Start
Start a local server for a 7B model on a single GPU and test a completion request.
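A minimal sketch of that flow, assuming a single-GPU host, the default port 8000, and a placeholder 7B model; the exact launch command can vary across vLLM versions.

```python
# Start the OpenAI-compatible server first (shell, in a separate terminal):
#   python -m vllm.entrypoints.openai.api_server \
#       --model mistralai/Mistral-7B-Instruct-v0.2 --port 8000
# The model name and port are illustrative assumptions.

# Then test a completion request against the local endpoint.
from openai import OpenAI

# The API key is unused unless the server was started with --api-key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # must match the served model
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```

Recent vLLM versions also expose /health for liveness checks and a Prometheus-style /metrics endpoint on the same server, which is what the monitoring bullet above refers to.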
Dependency Matrix
Required Modules
None required
Components
references
💻 Claude Code Installation
Recommended: Let Claude install automatically. Simply copy and paste the text below into Claude Code.
Please help me install this Skill:
Name: serving-llms-vllm
Download link: https://github.com/ovachiever/droid-tings/archive/main.zip#serving-llms-vllm
Please download this .zip file, extract it, and install it in the .claude/skills/ directory.