serving-llms-vllm

Community

High-throughput LLM serving with vLLM.

Author: ovachiever
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

This Skill covers vLLM-based serving for LLMs, delivering scalable, low-latency inference through PagedAttention, continuous batching, and OpenAI-compatible endpoints, with support for quantization and tensor parallelism.
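
To illustrate how quantization and tensor parallelism are exposed, here is a minimal sketch using vLLM's offline Python API; the model name and parameter values are placeholders, not something prescribed by this Skill.

```python
from vllm import LLM, SamplingParams

# Placeholder AWQ-quantized 7B checkpoint; substitute any model you have access to.
llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",
    quantization="awq",           # load pre-quantized weights to reduce GPU memory
    tensor_parallel_size=1,       # shard the model across this many GPUs
    gpu_memory_utilization=0.90,  # fraction of GPU memory PagedAttention may manage
)

params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["Explain PagedAttention in one sentence."], params)
print(outputs[0].outputs[0].text)
```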

Core Features & Use Cases

  • Production-grade serving: Deploy OpenAI-compatible LLM APIs with high throughput.
  • Memory-efficient inference: Leverage PagedAttention and quantization to fit large models on available GPUs.
  • Operational excellence: Tune latency and throughput in production, with built-in metrics and health checks (see the sketch after this list).
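
As a small sketch of the monitoring side, assuming vLLM's OpenAI-compatible server is already running on localhost:8000 (its default address), a health probe and a Prometheus metrics scrape look like this:

```python
import requests

BASE = "http://localhost:8000"  # assumed local vLLM server address

# Liveness probe: the OpenAI-compatible server answers 200 on /health when ready.
health = requests.get(f"{BASE}/health", timeout=5)
print("healthy" if health.status_code == 200 else f"unhealthy ({health.status_code})")

# Prometheus-format metrics (request counts, token throughput, queue depth, ...).
metrics = requests.get(f"{BASE}/metrics", timeout=5)
for line in metrics.text.splitlines():
    if line.startswith("vllm:"):
        print(line)
```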

Quick Start

Start a local server for a 7B model with 1 GPU and test a completion request.
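
A minimal version of that flow, assuming a single-GPU host and a placeholder 7B model name, might look like the following: the server command runs in one terminal, and the client script tests a completion request from another.

```python
# In a shell, start the OpenAI-compatible server (placeholder model name):
#   vllm serve mistralai/Mistral-7B-Instruct-v0.2 --tensor-parallel-size 1
#
# Then send a completion request to it:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM ignores the key

resp = client.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # must match the served model name
    prompt="Write a haiku about GPUs.",
    max_tokens=48,
    temperature=0.8,
)
print(resp.choices[0].text)
```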

Dependency Matrix

Required Modules

None required

Components

references

💻 Claude Code Installation

Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.

Please help me install this Skill:
Name: serving-llms-vllm
Download link: https://github.com/ovachiever/droid-tings/archive/main.zip#serving-llms-vllm

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.