vLLM
vLLM is a fast and easy-to-use library for LLM inference, designed for high-throughput and memory-efficient model serving.
Prerequisites
Install vLLM and start serving a model:
install vLLM
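For example, using pip:

```bash
# installs vLLM and its dependencies
pip install vllm
```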
start vLLM server
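The OpenAI-compatible server can be started with the `vllm serve` CLI; the model ID below is just an example:

```bash
# serves an OpenAI-compatible API on http://localhost:8000 by default
vllm serve Qwen/Qwen2.5-7B-Instruct
```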
This spins up the vLLM server with an OpenAI-compatible API.
The default vLLM server URL is http://localhost:8000/.
Example
Basic Agent
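A minimal sketch of a basic agent; it assumes Agno exposes the vLLM model class as `VLLM` in `agno.models.vllm` (adjust the import if your version differs):

```python
from agno.agent import Agent
from agno.models.vllm import VLLM  # assumed import path for Agno's vLLM model class

# Point the agent at the locally running vLLM server
# (default base_url is http://localhost:8000/v1/, matching the server started above).
agent = Agent(
    model=VLLM(id="Qwen/Qwen2.5-7B-Instruct"),
    markdown=True,
)

agent.print_response("Share a 2 sentence horror story.")
```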
Advanced Usage
With Tools
vLLM models work seamlessly with Agno tools:
with_tools.py
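A sketch of a tool-using agent, again assuming the `VLLM` class above and Agno's DuckDuckGo toolkit (`agno.tools.duckduckgo`):

```python
from agno.agent import Agent
from agno.models.vllm import VLLM                   # assumed import path
from agno.tools.duckduckgo import DuckDuckGoTools   # assumed toolkit; any Agno toolkit works

agent = Agent(
    model=VLLM(id="Qwen/Qwen2.5-7B-Instruct"),
    tools=[DuckDuckGoTools()],   # the model can now call web-search tools
    show_tool_calls=True,        # print tool invocations alongside the response
    markdown=True,
)

agent.print_response("What's the latest news about open-source LLMs?")
```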
View more examples here.
For the full list of supported models, see the vLLM documentation.
Params
| Parameter | Type | Default | Description |
|---|---|---|---|
| id | str | Required | The ID of the model to use (e.g. "Qwen/Qwen2.5-7B-Instruct"). |
| name | str | "vLLM" | Name of this model instance. |
| provider | str | "vLLM" | Provider name. |
| api_key | Optional[str] | "EMPTY" | API key (sent for OpenAI-API compatibility; usually not required). |
| base_url | str | "http://localhost:8000/v1/" | URL of the vLLM server (OpenAI-compatible endpoint). |
| max_tokens | Optional[int] | None | The maximum number of tokens to generate. |
| temperature | float | 0.7 | Sampling temperature. |
| top_p | float | 0.8 | Nucleus sampling probability. |
| top_k | Optional[int] | None | Restrict sampling to the top-K tokens. |
| frequency_penalty | Optional[float] | None | Penalizes new tokens based on their frequency in the text so far. |
| presence_penalty | float | 1.5 | Repetition penalty. |
| stop | Optional[Union[str, List[str]]] | None | Up to 4 sequences where the API will stop generating further tokens. |
| seed | Optional[int] | None | A seed for deterministic sampling. |
| request_params | Optional[Dict[str, Any]] | None | Extra keyword arguments merged into the request. |
| client_params | Optional[Dict[str, Any]] | None | Additional parameters to pass to the client. |
| timeout | Optional[float] | None | Timeout for the HTTP request. |
| max_retries | Optional[int] | None | Maximum number of request retries. |
| enable_thinking | Optional[bool] | None | Enables vLLM "thinking" mode (passes enable_thinking in chat_template_kwargs). |
vLLM also supports the params of OpenAI. vLLM is a subclass of the Model class and has access to the same params.
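These parameters are typically passed when constructing the model. As an illustrative sketch (the values are arbitrary and the class name is assumed as above):

```python
from agno.agent import Agent
from agno.models.vllm import VLLM  # assumed import path

model = VLLM(
    id="Qwen/Qwen2.5-7B-Instruct",
    base_url="http://localhost:8000/v1/",  # change if your server runs elsewhere
    temperature=0.3,
    top_k=40,
    max_tokens=512,
    enable_thinking=True,  # forwarded via chat_template_kwargs per the table above
)

agent = Agent(model=model, markdown=True)
agent.print_response("Explain nucleus sampling in one short paragraph.")
```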