vLLM is a fast, easy-to-use library for LLM inference and serving, built for high-throughput, memory-efficient model serving.

Prerequisites

Install vLLM and start serving a model:

# install vLLM
pip install vllm

# start the vLLM server
vllm serve Qwen/Qwen2.5-7B-Instruct \
    --enable-auto-tool-choice \
    --tool-call-parser hermes \
    --dtype float16 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.9

This spins up a vLLM server that exposes an OpenAI-compatible API.

By default the server listens on http://localhost:8000, with the OpenAI-compatible endpoints served under http://localhost:8000/v1/.
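
To confirm the server is reachable before wiring it into an agent, you can list the served models through the OpenAI-compatible endpoint. This is a minimal sketch using the openai Python client; the API key is a placeholder, since vLLM does not check it unless the server was started with one.

from openai import OpenAI

# Point the OpenAI client at the local vLLM server; the key is a placeholder.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Print the IDs of the models the server is currently serving.
print([model.id for model in client.models.list()])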

Example

Basic Agent

from agno.agent import Agent
from agno.models.vllm import vLLM

agent = Agent(
    model=vLLM(
        id="Qwen/Qwen2.5-7B-Instruct",
        base_url="http://localhost:8000/v1/",
    ),
    markdown=True
)

agent.print_response("Share a 2 sentence horror story.")
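
For longer outputs you can stream the response as it is generated; print_response accepts a stream flag (shown here with the agent defined above):

agent.print_response("Share a 2 sentence horror story.", stream=True)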

Advanced Usage

With Tools

vLLM models work seamlessly with Agno tools:

with_tools.py
from agno.agent import Agent
from agno.models.vllm import vLLM
from agno.tools.duckduckgo import DuckDuckGoTools

agent = Agent(
    model=vLLM(id="Qwen/Qwen2.5-7B-Instruct"),
    tools=[DuckDuckGoTools()],
    show_tool_calls=True,
    markdown=True
)

agent.print_response("What's the latest news about AI?")
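
If you want the result programmatically rather than printed to the terminal, call agent.run() and read the content from the returned run output (a minimal sketch reusing the agent defined above):

# Capture the run result instead of printing it.
response = agent.run("Summarize the latest AI news in one sentence.")
print(response.content)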

View more examples here.

For the full list of supported models, see the vLLM documentation.

Params

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| id | str | Required | The ID of the model to use (e.g. "Qwen/Qwen2.5-7B-Instruct"). |
| name | str | "vLLM" | Name of this model instance. |
| provider | str | "vLLM" | Provider name. |
| api_key | Optional[str] | "EMPTY" | API key (sent for OpenAI-compat compliance; usually not needed). |
| base_url | str | "http://localhost:8000/v1/" | URL of the vLLM server (OpenAI-compatible endpoint). |
| max_tokens | Optional[int] | None | The maximum number of tokens to generate. |
| temperature | float | 0.7 | Sampling temperature. |
| top_p | float | 0.8 | Nucleus sampling probability. |
| top_k | Optional[int] | None | Restrict sampling to the top-K tokens. |
| frequency_penalty | Optional[float] | None | Penalizes new tokens based on their frequency in the text so far. |
| presence_penalty | float | 1.5 | Repetition penalty. |
| stop | Optional[Union[str, List[str]]] | None | Up to 4 sequences where the API will stop generating further tokens. |
| seed | Optional[int] | None | A seed for deterministic sampling. |
| request_params | Optional[Dict[str, Any]] | None | Extra keyword args merged into the request. |
| client_params | Optional[Dict[str, Any]] | None | Additional parameters to pass to the client. |
| timeout | Optional[float] | None | Timeout for the HTTP request. |
| max_retries | Optional[int] | None | Maximum number of request retries. |
| enable_thinking | Optional[bool] | None | Enables vLLM "thinking" mode (passes enable_thinking in chat_template_kwargs). |

In addition to the parameters above, vLLM also supports the params of the OpenAI model, which are passed through to the OpenAI-compatible endpoint.

vLLM is a subclass of the Model class and has access to the same params.
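
As a quick illustration of how these params fit together, here is a hedged sketch that tunes sampling and generation limits on the model; the specific values are arbitrary examples rather than recommendations.

from agno.agent import Agent
from agno.models.vllm import vLLM

agent = Agent(
    model=vLLM(
        id="Qwen/Qwen2.5-7B-Instruct",
        base_url="http://localhost:8000/v1/",
        temperature=0.2,   # lower temperature for more deterministic output
        top_p=0.9,         # nucleus sampling cutoff
        max_tokens=512,    # cap the number of generated tokens
        seed=42,           # seed for reproducible sampling
    ),
    markdown=True,
)

agent.print_response("Explain continuous batching in one paragraph.")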