vlm-run/autovllm

Optimized vLLM for production VLM inference.

A patched vLLM image that accelerates vision-language serving end-to-end, from JPEG and PNG decode to fused Triton kernels. Up to 225x faster on cached image batches versus vanilla vLLM.

Every bottleneck in VLM serving, addressed.

autovllm patches both the CPU image pipeline and the GPU Triton kernels of vllm/vllm-openai. The same vLLM server you already know, now built for real multi-modal traffic.

TurboJPEG and Rust PNG

SIMD-accelerated JPEG decode and a Rust/PyO3 PNG decoder replace PIL. About 2.2x faster on cold 4K JPEG decode and about 4.6x faster on 1K PNG decode.

LRU and base64 caching

A 512-entry decode cache with O(1) fingerprint lookup, plus a base64-level short-circuit that skips b64decode on cache hits.
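
To make the caching idea concrete, here is a minimal sketch of an LRU decode cache keyed on the base64 text itself. It is not autovllm's actual code: the `DecodeCache` class, the BLAKE2 fingerprint, and the `decode` callable are all assumptions chosen for illustration. The point it shows is that a hit returns the decoded image without ever calling `b64decode`.

```python
import base64
import hashlib
from collections import OrderedDict


class DecodeCache:
    """Hypothetical LRU cache keyed by a fingerprint of the base64 payload."""

    def __init__(self, decode, capacity=512):
        self._decode = decode          # callable: raw image bytes -> decoded image
        self._capacity = capacity      # 512 entries, matching the figure above
        self._entries = OrderedDict()  # fingerprint -> decoded image
        self.hits = 0
        self.misses = 0

    @staticmethod
    def fingerprint(b64_payload):
        # Assumption: hash the base64 text directly, so a hit skips b64decode.
        data = b64_payload.encode("ascii")
        return hashlib.blake2b(data, digest_size=16).hexdigest()

    def get(self, b64_payload):
        key = self.fingerprint(b64_payload)
        if key in self._entries:
            self._entries.move_to_end(key)     # mark as most recently used
            self.hits += 1
            return self._entries[key]
        self.misses += 1
        image = self._decode(base64.b64decode(b64_payload))
        self._entries[key] = image
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # evict the least recently used entry
        return image
```

The dictionary lookup is O(1) on average. This sketch hashes the whole payload to build the key, which a production fingerprint might avoid by sampling.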

Fused Triton kernels

Eight fused kernels: RMSNorm, residual+norm, SiLU MLP, QK-norm+RoPE, LM head+top-k, DeltaNet recurrent, and more.
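
As a reference for what "residual+norm" fusion computes, the sketch below writes out the math in plain Python. It is not the Triton kernel: the function name `fused_add_rmsnorm` and the default `eps` are assumptions, and a real fused kernel would do the same arithmetic in a single GPU pass rather than in Python lists.

```python
import math


def fused_add_rmsnorm(x, residual, weight, eps=1e-6):
    """Reference semantics only: add the residual, then RMS-normalize and scale."""
    # Step 1: residual add. The sum is also returned as the next residual.
    hidden = [a + b for a, b in zip(x, residual)]
    # Step 2: RMSNorm over the hidden vector.
    mean_sq = sum(h * h for h in hidden) / len(hidden)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    normed = [h * inv_rms * w for h, w in zip(hidden, weight)]
    return normed, hidden
```

Fusing the two steps avoids writing `hidden` to memory and reading it back between the add and the norm, which is where the saving comes from.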

Liger Kernel built in

Optimized RMSNorm, SwiGLU, and RoPE from Liger with backward-pass support. Drop in without any config changes.

Pre-IPC image cap

Resolution is capped at 4 MP before inter-process transfer. No more shipping full-res pixel buffers between processes.
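
The cap can be sketched as a size calculation that runs before any pixels cross a process boundary. This is an illustration under assumptions, not autovllm's code: `capped_size` is a hypothetical helper, "4 MP" is read as 4,000,000 pixels, and aspect ratio is assumed to be preserved.

```python
import math

MAX_PIXELS = 4_000_000  # assumption: "4 MP" taken as 4,000,000 pixels


def capped_size(width, height, max_pixels=MAX_PIXELS):
    """Return (width, height) scaled down so the area fits within max_pixels."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    pixels = width * height
    if pixels <= max_pixels:
        return width, height  # already under the cap, leave untouched
    # Scale both sides by the same factor to keep the aspect ratio.
    scale = math.sqrt(max_pixels / pixels)
    new_w = max(1, math.floor(width * scale))
    new_h = max(1, math.floor(height * scale))
    return new_w, new_h
```

Rounding down on both sides keeps the result at or below the cap, at the cost of a slightly smaller image than the limit strictly allows.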

Drop-in Docker image

Ships as vlmrun/vlmrun-vllm-openai:v0.16.0-* on top of official vLLM. Swap one image, keep everything else.

# 1. Build the patched image
cd vllm/
make build
# → vlmrun/vlmrun-vllm-openai:v0.16.0-<PATCH_VERSION>

# 2. Run the server (drop-in replacement for vllm/vllm-openai);
#    substitute the tag from step 1 if you built the image locally
docker run --gpus all --rm -p 8000:8000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    vlmrun/vlmrun-vllm-openai:latest \
    --model Qwen/Qwen3-VL-8B-Instruct

# 3. Call it like any OpenAI endpoint
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"Qwen/Qwen3-VL-8B-Instruct",
         "messages":[{"role":"user","content":"hi"}]}'

Quick Start

Swap one Docker image. Everything else stays the same.

autovllm ships as a patched Docker image on top of vllm/vllm-openai. Your OpenAI-compatible API surface, schedulers, and configs are unchanged. You just go faster.
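
Since the API surface is unchanged, an image request looks like any other OpenAI-style chat call. The sketch below uses only the Python standard library. The helper names `build_image_request` and `post_chat` are made up for this example, and the host, port, and model name are whatever you started the server with.

```python
import base64
import json
import urllib.request


def build_image_request(model, prompt, image_bytes, mime="image/jpeg"):
    """Build a chat-completions payload with one inline base64 image."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    data_url = f"data:{mime};base64,{encoded}"
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }],
    }


def post_chat(payload, base_url="http://localhost:8000"):
    """POST the payload to the server's /v1/chat/completions endpoint."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With a server running, `post_chat(build_image_request("Qwen/Qwen3-VL-8B-Instruct", "Describe this image.", open("photo.jpg", "rb").read()))` would send the request; `photo.jpg` is a placeholder path.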

  • Up to 225x faster on cached image batches vs. vanilla vLLM
  • 2.24x faster on cold 4K JPEG decode, 4.6x faster on 1K PNG decode
  • Zero changes to your OpenAI-compatible client
  • Apache 2.0: audit, patch, and self-host

Ship faster VLM inference in production.