Last modified: Jun 11, 2026

How to Install vLLM in Python

vLLM is a fast and easy-to-use library for LLM inference and serving. It supports popular models and runs efficiently on GPUs. This guide shows you how to install vLLM in Python quickly and correctly.

We will cover system requirements, installation steps, common issues, and a simple test. By the end, you will have vLLM ready to use.

Prerequisites for vLLM Installation

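Before installing vLLM, check your system. vLLM is developed primarily for Linux, and the supported Python range depends on the release: this guide assumes Python 3.8 or later, which matches older releases such as 0.4.0, while newer releases may require a newer interpreter. macOS support is limited, and Windows users need WSL2 or Docker.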

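You also need a CUDA-compatible NVIDIA GPU. About 8 GB of VRAM is a reasonable starting point for small models; larger models need more. vLLM relies on GPU acceleration for fast inference. Without a GPU you can use the CPU option described in Step 5, but expect much slower inference.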
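
The checks above can be scripted before you install anything. The following is a minimal sketch that uses only the standard library; the helper name check_prerequisites and the MIN_PYTHON constant are hypothetical, and the minimum version is simply the one this guide assumes, not an official limit for every vLLM release.

```python
import platform
import sys

# Assumed minimum, taken from this guide; newer vLLM releases may require more.
MIN_PYTHON = (3, 8)


def check_prerequisites(min_python=MIN_PYTHON):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if sys.version_info[:2] < min_python:
        found = ".".join(map(str, sys.version_info[:2]))
        wanted = ".".join(map(str, min_python))
        problems.append(f"Python {found} found, {wanted} or later expected")
    if platform.system() == "Windows":
        problems.append("native Windows detected; use WSL2 or Docker")
    return problems


for problem in check_prerequisites():
    print("WARNING:", problem)
```

If the script prints nothing, the Python version and operating system pass these two basic checks. It does not inspect the GPU.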

Make sure pip is up to date. Run this command:


pip install --upgrade pip

Step 1: Install vLLM via pip

The easiest way to install vLLM is using pip. Open your terminal and run:


pip install vllm

This installs the latest stable version together with its dependencies, including PyTorch, transformers, and the CUDA libraries that ship with PyTorch. The download is large, so the process may take several minutes.

If you want a specific version, use pip install vllm==0.4.0. Check the official vLLM release page for version numbers.

Step 2: Verify Installation

After installation, verify that vLLM is working. Open a Python shell or create a script. Run this code:


import vllm
print(vllm.__version__)

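If you see a version number like "0.4.0", the installation succeeded. If you get an ImportError or ModuleNotFoundError, either the package is missing from the active Python environment or one of its compiled dependencies failed to load. The common issues section below covers both cases.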
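
You can also query the installed version without importing vLLM, which is handy when the import itself is what fails. This is a small sketch based on the standard importlib.metadata module; the helper name installed_version is hypothetical.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("vllm")
if version is None:
    print("vllm is not installed in this environment")
else:
    print(f"vllm {version} is installed")
```

If this reports a version but import vllm still fails, the package is present and the problem is more likely a broken dependency such as a mismatched CUDA library.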

You can also check GPU support. Use the torch.cuda.is_available() function to confirm CUDA is available:


import torch
print(torch.cuda.is_available())  # Should print True
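
A True or False answer does not tell you which GPU was found or how much memory it has. The sketch below lists each visible device with its memory and compute capability. The function name describe_cuda_devices is hypothetical, and the import is guarded so the script still runs on a machine without PyTorch.

```python
def describe_cuda_devices():
    """Return one summary line per visible CUDA device (empty if none)."""
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    lines = []
    for index in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(index)
        major, minor = torch.cuda.get_device_capability(index)
        memory_gib = props.total_memory / 1024**3
        lines.append(
            f"GPU {index}: {props.name}, {memory_gib:.1f} GiB, "
            f"compute capability {major}.{minor}"
        )
    return lines


devices = describe_cuda_devices()
print("\n".join(devices) if devices else "No usable CUDA device found")
```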

Step 3: Install with CUDA Support (If Needed)

If you have a GPU but vLLM doesn't detect it, the installed PyTorch build may not match your CUDA driver. In that case, install a PyTorch build for a specific CUDA version first. For CUDA 11.8, run this command:


pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

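Then reinstall vLLM. Keep in mind that each vLLM release is built against a specific PyTorch and CUDA version, so pip may replace the PyTorch build you just installed; check the vLLM installation documentation for the combination your release expects. For newer CUDA versions, adjust the index URL (e.g., cu121 for CUDA 12.1).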
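
The index URL follows a simple pattern: the CUDA version with the dot removed, prefixed by "cu". Here is a small sketch of that mapping. The helper pytorch_index_url is hypothetical, and it only builds the string; PyTorch publishes wheels for some CUDA versions only, so a generated URL is not guaranteed to exist.

```python
PYTORCH_WHEEL_BASE = "https://download.pytorch.org/whl"


def pytorch_index_url(cuda_version):
    """Map a CUDA version such as "12.1" to a PyTorch wheel index URL."""
    parts = cuda_version.split(".")
    if len(parts) < 2 or not all(p.isdigit() for p in parts[:2]):
        raise ValueError(f"expected a version like '12.1', got {cuda_version!r}")
    major, minor = parts[0], parts[1]
    return f"{PYTORCH_WHEEL_BASE}/cu{major}{minor}"


print(pytorch_index_url("11.8"))  # https://download.pytorch.org/whl/cu118
print(pytorch_index_url("12.1"))  # https://download.pytorch.org/whl/cu121
```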

Step 4: Test vLLM with a Simple Model

Now, test vLLM with a small model. Use the LLM class from vLLM. Here is an example:


from vllm import LLM, SamplingParams

# Load a small model (e.g., "facebook/opt-125m")
model = LLM(model="facebook/opt-125m")

# Define sampling parameters
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Generate text
prompts = ["What is the capital of France?"]
outputs = model.generate(prompts, sampling_params)

# Print output
for output in outputs:
    print(output.outputs[0].text)

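This code loads the OPT-125M model, generates a completion, and prints it. OPT-125M is a very small base model and the example samples with temperature 0.8, so the text varies between runs and may not answer the question correctly. For this test, any generated text means the installation works.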

Note that the first run downloads the model weights. This may take a while depending on your internet speed.
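
If you want a test whose output is easier to compare between runs, you can switch to greedy decoding and cap the output length. The following is a sketch under that assumption: the function names run_smoke_test and format_results are hypothetical, while LLM, SamplingParams, temperature, and max_tokens are the same vLLM names used above. vLLM is imported inside the function so the file can be loaded on a machine where it is not installed yet.

```python
def run_smoke_test(prompts, model_name="facebook/opt-125m", max_tokens=32):
    """Generate one greedy completion per prompt and return (prompt, text) pairs."""
    # Imported lazily so this file can be loaded on machines without vLLM.
    from vllm import LLM, SamplingParams

    llm = LLM(model=model_name)
    # temperature=0.0 selects greedy decoding instead of random sampling.
    params = SamplingParams(temperature=0.0, max_tokens=max_tokens)
    outputs = llm.generate(prompts, params)
    return [(out.prompt, out.outputs[0].text) for out in outputs]


def format_results(pairs):
    """Render (prompt, completion) pairs as a readable report."""
    return "\n".join(f"{prompt!r} -> {text!r}" for prompt, text in pairs)


# Example call (needs vLLM and a working GPU setup):
# print(format_results(run_smoke_test(["Hello, my name is"])))
```

Passing several prompts in one list lets vLLM batch them in a single generate call.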

Step 5: Install vLLM for CPU (Optional)

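If you don't have a GPU, you can run vLLM on the CPU. This is much slower but useful for testing. The standard pip wheel targets NVIDIA GPUs, so the CPU backend is installed separately, typically by building from source with the VLLM_TARGET_DEVICE environment variable set to cpu. The required packages and build steps vary between releases, so follow the CPU installation page of the official documentation. The final build step, run inside a clone of the vLLM repository, typically looks like this:


VLLM_TARGET_DEVICE=cpu python setup.py install

Then test with the same code above. Expect much slower inference.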

Common Installation Issues and Fixes

Issue 1: "CUDA out of memory" error. This happens when the model weights and the KV cache do not fit in GPU memory. Use a smaller model, or shrink the cache and batch size by passing a lower max_model_len or max_num_seqs to LLM.

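Issue 2: "No module named 'vllm'". This usually means vLLM was installed into a different Python environment than the one running your code, or the installation failed. Activate the right environment and install with python -m pip install --no-cache-dir vllm so that pip and Python refer to the same interpreter.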

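Issue 3: "ImportError: libcudart.so.11.0: cannot open shared object file". This indicates that the CUDA 11 runtime library cannot be found, usually because the installed wheel was built for a different CUDA version than the one on your system. Install the matching CUDA toolkit from NVIDIA or use the PyTorch CUDA installation above.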

Issue 4: Windows compatibility. vLLM does not natively support Windows. Use WSL2 (Windows Subsystem for Linux) or Docker. Install Ubuntu in WSL2, then follow the Linux steps.
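
For Issue 1, a rough estimate of the weight memory helps you decide whether a model can fit at all. The sketch below multiplies the parameter count by the bytes per parameter. The helper weight_memory_gib and the parameter counts are illustrative assumptions, and the result covers the weights only; the KV cache and activations need additional memory on top.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}


def weight_memory_gib(num_params, dtype="float16"):
    """Estimate memory for model weights only, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


# Approximate parameter counts, used here only for illustration.
for name, params in [("OPT-125M", 125e6), ("7B model", 7e9)]:
    print(f"{name}: about {weight_memory_gib(params):.2f} GiB in float16")
```

By this estimate a 7B model needs roughly 13 GiB for its float16 weights alone, which is why it does not fit on an 8 GB GPU without quantization.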

Best Practices for vLLM Installation

Always install in a virtual environment. This prevents conflicts with other packages. Create one with python -m venv vllm_env and activate it before installing (on Linux, source vllm_env/bin/activate).
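
If you prefer to set up the environment from a script, the standard venv module can do the same thing. This is a minimal sketch; the helper name create_env is hypothetical, and the directory name vllm_env in the usage comment is the one used in this guide.

```python
import sys
import venv
from pathlib import Path


def create_env(path, with_pip=True):
    """Create a virtual environment and return the path to its interpreter."""
    env_dir = Path(path)
    venv.EnvBuilder(with_pip=with_pip).create(env_dir)
    if sys.platform == "win32":
        return env_dir / "Scripts" / "python.exe"
    return env_dir / "bin" / "python"


# Example usage:
# python_path = create_env("vllm_env")
# Then install into it with: <python_path> -m pip install vllm
```

Calling the environment's own interpreter with -m pip installs into that environment even if you forget to activate it.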

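Keep pip up to date, and use a Python version that your vLLM release supports. Versions that are too old or too new may lack prebuilt wheels and cause installation errors. Also, ensure your GPU drivers are up to date.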

For production, consider using Docker. The official vLLM Docker image includes all dependencies. Pull it with docker pull vllm/vllm-openai.
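
The vllm/vllm-openai image runs an OpenAI-compatible HTTP server, so you can test it from Python with the standard library alone. The sketch below only builds the request. It assumes the container is running with port 8000 published on localhost and is serving facebook/opt-125m; both are assumptions about your setup, and the helper name build_completion_request is hypothetical.

```python
import json
import urllib.request


def build_completion_request(prompt, model, base_url="http://localhost:8000"):
    """Build a POST request for an OpenAI-compatible /v1/completions endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": 32}
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_completion_request("Hello, my name is", "facebook/opt-125m")
print(request.full_url)

# To send it once the container is running:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["choices"][0]["text"])
```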

Conclusion

Installing vLLM in Python is straightforward. You just need pip, a compatible system, and a GPU for best performance. Follow the steps above and test with the example code.

vLLM makes LLM inference fast and efficient. Once installed, you can run large models like Llama, Mistral, or GPT-like models with ease. Start with a small model to verify your setup, then scale up.

If you encounter issues, check the official vLLM documentation or community forums. Happy coding!