vllm-omni-distributed


Scale distributed inference across GPUs.

Author: hsliuustc0106
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

Distributes large inference workloads across multiple GPUs or machines, enabling scalable deployments and efficient resource utilization.

Core Features & Use Cases

  • Tensor Parallelism (TP): Split model weights across GPUs to reduce latency and improve throughput.
  • Pipeline Parallelism (PP): Divide the model across sequential GPU groups to boost overall throughput.
  • Disaggregation / OmniConnector: Run Encode, Prefill, Decode, and Generate stages on separate GPU pools for independent scaling.
  • Multi-node with Ray: Orchestrate distributed serving across a Ray cluster for larger deployments.
  • Sequence Parallelism: Accelerate diffusion-based generation by splitting the token sequence across GPUs, so each device processes only a slice of it.
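
To make the relationship between tensor and pipeline parallelism concrete, here is a minimal Python sketch of how a TP x PP deployment could lay out its GPUs. It assumes the common convention that tensor-parallel ranks are contiguous within a pipeline stage and that layers are divided as evenly as possible. The function names parallel_layout and split_layers are hypothetical, used only for illustration, and are not part of vLLM-Omni.

```python
from typing import Dict, List, Tuple


def parallel_layout(tp_size: int, pp_size: int) -> Dict[int, Tuple[int, int]]:
    """Map each global GPU rank to (pipeline_stage, tensor_rank).

    Assumption: tensor-parallel ranks are contiguous inside a stage,
    so the total number of GPUs needed is tp_size * pp_size.
    """
    if tp_size < 1 or pp_size < 1:
        raise ValueError("tp_size and pp_size must be positive")
    layout = {}
    for rank in range(tp_size * pp_size):
        pp_stage, tp_rank = divmod(rank, tp_size)
        layout[rank] = (pp_stage, tp_rank)
    return layout


def split_layers(num_layers: int, pp_size: int) -> List[range]:
    """Assign consecutive layer indices to each pipeline stage.

    Earlier stages absorb the remainder when the split is uneven.
    """
    base, extra = divmod(num_layers, pp_size)
    stages = []
    start = 0
    for stage in range(pp_size):
        size = base + (1 if stage < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages


if __name__ == "__main__":
    # 2-way tensor parallel x 2-stage pipeline needs 4 GPUs.
    for rank, (stage, tp_rank) in parallel_layout(2, 2).items():
        print(f"rank {rank}: pipeline stage {stage}, tensor rank {tp_rank}")
    print(split_layers(10, 3))
```

In this layout, the GPUs that share a pipeline stage each hold a shard of the same layers, while different stages hold different layers and pass activations forward in sequence.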

Quick Start

Launch a multi-node Ray cluster and start the vLLM-Omni server with your model and desired tensor- and pipeline-parallel settings.
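
As a rough illustration of those steps, the Python sketch below assembles the commands one might run. The `ray start` options and the `--tensor-parallel-size`, `--pipeline-parallel-size`, and `--distributed-executor-backend` flags come from upstream Ray and vLLM. Whether the vLLM-Omni server accepts them unchanged is an assumption here, so check the skill's references before relying on it. The helper names, the head address, and the model name "your-org/your-model" are placeholders.

```python
import shlex
from typing import List


def ray_head_cmd(port: int = 6379) -> List[str]:
    """Command to start the Ray head node (run on the first machine)."""
    return ["ray", "start", "--head", f"--port={port}"]


def ray_worker_cmd(head_address: str) -> List[str]:
    """Command to join an existing Ray cluster (run on every other machine)."""
    return ["ray", "start", f"--address={head_address}"]


def serve_cmd(model: str, tp_size: int, pp_size: int) -> List[str]:
    """Server launch command, assuming the upstream `vllm serve` CLI applies."""
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tp_size),
        "--pipeline-parallel-size", str(pp_size),
        "--distributed-executor-backend", "ray",
    ]


def required_gpus(tp_size: int, pp_size: int) -> int:
    """Total GPUs the cluster must provide for this configuration."""
    return tp_size * pp_size


if __name__ == "__main__":
    # Placeholder values: two nodes with 4 GPUs each.
    steps = [
        ray_head_cmd(),
        ray_worker_cmd("10.0.0.1:6379"),
        serve_cmd("your-org/your-model", tp_size=4, pp_size=2),
    ]
    for step in steps:
        print(shlex.join(step))
    print("GPUs required:", required_gpus(4, 2))
```

A common starting point is to set the tensor-parallel size to the number of GPUs per node and the pipeline-parallel size to the number of nodes, so that the heavier tensor-parallel communication stays inside a machine.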

Dependency Matrix

Required Modules

None required

Components

references

💻 Claude Code Installation

Recommended: Let Claude install the Skill automatically. Copy the text below and paste it into Claude Code.

Please help me install this Skill:
Name: vllm-omni-distributed
Download link: https://github.com/hsliuustc0106/vllm-omni-skills/archive/main.zip#vllm-omni-distributed

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.