vLLM is a fast, easy-to-use library for LLM inference and serving, built for high-throughput, memory-efficient model serving.

Prerequisites

Install vLLM and start serving a model:

# install vLLM
pip install vllm

# start the vLLM server
vllm serve Qwen/Qwen2.5-7B-Instruct \
    --enable-auto-tool-choice \
    --tool-call-parser hermes \
    --dtype float16 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.9

This spins up a vLLM server that exposes an OpenAI-compatible API.

By default the server listens on http://localhost:8000, with the OpenAI-compatible endpoints served under http://localhost:8000/v1/.
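
To confirm the server is reachable before wiring it into an agent, you can list the served models through the OpenAI-compatible endpoint. This is a minimal sketch using the openai Python client; the API key is a placeholder, since vLLM does not check it unless the server was started with one.

from openai import OpenAI

# Point the OpenAI client at the local vLLM server; the key is a placeholder.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Print the IDs of the models the server is currently serving.
print([model.id for model in client.models.list()])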

Example

Basic Agent

from agno.agent import Agent
from agno.models.vllm import vLLM

agent = Agent(
    model=vLLM(
        id="Qwen/Qwen2.5-7B-Instruct",
        base_url="http://localhost:8000/v1/",
    ),
    markdown=True
)

agent.print_response("Share a 2 sentence horror story.")
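
For longer outputs you can stream the response as it is generated; print_response accepts a stream flag (shown here with the agent defined above):

agent.print_response("Share a 2 sentence horror story.", stream=True)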

Advanced Usage

With Tools

vLLM models work seamlessly with Agno tools:

with_tools.py
from agno.agent import Agent
from agno.models.vllm import vLLM
from agno.tools.duckduckgo import DuckDuckGoTools

agent = Agent(
    model=vLLM(id="Qwen/Qwen2.5-7B-Instruct"),
    tools=[DuckDuckGoTools()],
    show_tool_calls=True,
    markdown=True
)

agent.print_response("What's the latest news about AI?")
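
If you want the result programmatically rather than printed to the terminal, call agent.run() and read the content from the returned run output (a minimal sketch reusing the agent defined above):

# Capture the run result instead of printing it.
response = agent.run("Summarize the latest AI news in one sentence.")
print(response.content)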

View more examples here.

For the full list of supported models, see the vLLM documentation.

Params

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| id | str | Required | The ID of the model to use (e.g. "Qwen/Qwen2.5-7B-Instruct"). |
| name | str | "vLLM" | Name of this model instance. |
| provider | str | "vLLM" | Provider name. |
| api_key | Optional[str] | "EMPTY" | API key (sent for OpenAI-compat compliance; usually not needed). |
| base_url | str | "http://localhost:8000/v1/" | URL of the vLLM server (OpenAI-compatible endpoint). |
| max_tokens | Optional[int] | None | The maximum number of tokens to generate. |
| temperature | float | 0.7 | Sampling temperature. |
| top_p | float | 0.8 | Nucleus sampling probability. |
| top_k | Optional[int] | None | Restrict sampling to the top-K tokens. |
| frequency_penalty | Optional[float] | None | Penalizes new tokens based on their frequency in the text so far. |
| presence_penalty | float | 1.5 | Repetition penalty. |
| stop | Optional[Union[str, List[str]]] | None | Up to 4 sequences where the API will stop generating further tokens. |
| seed | Optional[int] | None | A seed for deterministic sampling. |
| request_params | Optional[Dict[str, Any]] | None | Extra keyword args merged into the request. |
| client_params | Optional[Dict[str, Any]] | None | Additional parameters to pass to the client. |
| timeout | Optional[float] | None | Timeout for the HTTP request. |
| max_retries | Optional[int] | None | Maximum number of request retries. |
| enable_thinking | Optional[bool] | None | Enables vLLM "thinking" mode (passes enable_thinking in chat_template_kwargs). |

In addition to the parameters above, vLLM also supports the params of the OpenAI model, which are passed through to the OpenAI-compatible endpoint.

vLLM is a subclass of the Model class and has access to the same params.
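
As a quick illustration of how these params fit together, here is a hedged sketch that tunes sampling and generation limits on the model; the specific values are arbitrary examples rather than recommendations.

from agno.agent import Agent
from agno.models.vllm import vLLM

agent = Agent(
    model=vLLM(
        id="Qwen/Qwen2.5-7B-Instruct",
        base_url="http://localhost:8000/v1/",
        temperature=0.2,   # lower temperature for more deterministic output
        top_p=0.9,         # nucleus sampling cutoff
        max_tokens=512,    # cap the number of generated tokens
        seed=42,           # seed for reproducible sampling
    ),
    markdown=True,
)

agent.print_response("Explain continuous batching in one paragraph.")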