The `ggml-org/gpt-oss-20b-GGUF` model is used in the examples below.
We recommend experimenting to find the best model for your use case. Here are some popular model recommendations:
Google Gemma Models
- `google/gemma-2b-it-GGUF` - Lightweight 2B parameter model, great for resource-constrained environments
- `google/gemma-7b-it-GGUF` - Balanced 7B model with strong performance for general tasks
- `ggml-org/gemma-3-1b-it-GGUF` - Latest Gemma 3 series, efficient for everyday use
Meta Llama Models
- `Meta-Llama-3-8B-Instruct` - Popular 8B parameter model with excellent instruction following
- `Meta-Llama-3.1-8B-Instruct` - Enhanced version with improved capabilities and 128K context
- `Meta-Llama-3.2-3B-Instruct` - Compact 3B model for faster inference
Default Options
- `ggml-org/gpt-oss-20b-GGUF` - Default model for general use cases
- Models with different quantizations (Q4_K_M, Q8_0, etc.) for different speed/quality tradeoffs
- Choose models based on your hardware constraints and performance requirements
Set up LlamaCpp
Install LlamaCpp
First, install LlamaCpp by following the official installation guide. On macOS, it is available via Homebrew: `brew install llama.cpp`.
Download a Model
Download a model in GGUF format following the llama.cpp model download guide. For the examples below, we use `ggml-org/gpt-oss-20b-GGUF`.
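If you prefer to fetch the GGUF files yourself rather than letting the server pull them, the sketch below uses the `huggingface_hub` package; the local directory is an arbitrary choice for this example, and the `--hf-repo` server flag described later can download the model for you instead.

```python
from huggingface_hub import snapshot_download

# Download only the GGUF files from the model repository.
# The local_dir path is arbitrary and chosen for this example.
model_dir = snapshot_download(
    repo_id="ggml-org/gpt-oss-20b-GGUF",
    allow_patterns=["*.gguf"],
    local_dir="./models/gpt-oss-20b-GGUF",
)
print(f"Model files downloaded to: {model_dir}")
```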
Start the Server
Start the LlamaCpp server with your model, for example: `llama-server --hf-repo ggml-org/gpt-oss-20b-GGUF --jinja`. The server runs at `http://127.0.0.1:8080` and exposes OpenAI Chat compatible endpoints.
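Because the endpoints are OpenAI-compatible, any OpenAI client can talk to the server directly. Below is a minimal sketch using the official `openai` Python package; the API key is a placeholder, since the local server does not require authentication unless you start it with an API key.

```python
from openai import OpenAI

# Point an OpenAI client at the local LlamaCpp server.
client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="ggml-org/gpt-oss-20b-GGUF",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)
```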
Example
After starting the LlamaCpp server, use the `LlamaCpp` model class to access it:
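The snippet below is a minimal sketch; it assumes this page documents the Agno framework's `LlamaCpp` model class and that it is importable from `agno.models.llama_cpp`. Adjust the import path if your installation differs.

```python
from agno.agent import Agent
from agno.models.llama_cpp import LlamaCpp  # assumed import path

# Create an agent backed by the local LlamaCpp server.
agent = Agent(
    model=LlamaCpp(id="ggml-org/gpt-oss-20b-GGUF"),
    markdown=True,
)

agent.print_response("Share a two-sentence horror story.")
```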
Configuration
The `LlamaCpp` model supports customizing the server URL and model ID:
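For example, to target a server on a non-default address or serve a different model, set `base_url` and `id` explicitly (same assumptions about the imports as above):

```python
from agno.agent import Agent
from agno.models.llama_cpp import LlamaCpp  # assumed import path

agent = Agent(
    model=LlamaCpp(
        id="ggml-org/gemma-3-1b-it-GGUF",     # must match a model your server is serving
        base_url="http://127.0.0.1:8080/v1",  # must match the server's host and port
    ),
)

agent.print_response("Summarize what a GGUF file is in one sentence.")
```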
View more examples here.
Params
| Name | Type | Default | Description |
| --- | --- | --- | --- |
| `id` | `str` | `"ggml-org/gpt-oss-20b-GGUF"` | The model identifier |
| `name` | `str` | `"LlamaCpp"` | The name of this chat model instance |
| `provider` | `str` | `"LlamaCpp"` | The provider of the model |
| `base_url` | `str` | `"http://127.0.0.1:8080/v1"` | The base URL for the LlamaCpp server |
`LlamaCpp` is a subclass of the `OpenAILike` class and has access to the same params.
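Because the params are inherited from `OpenAILike`, the usual OpenAI-style options should be available on the constructor as well. The sketch below assumes `temperature` and `max_tokens` are among those shared params; they are not confirmed by the table above.

```python
from agno.agent import Agent
from agno.models.llama_cpp import LlamaCpp  # assumed import path

agent = Agent(
    model=LlamaCpp(
        id="ggml-org/gpt-oss-20b-GGUF",
        temperature=0.2,  # assumed shared OpenAILike param
        max_tokens=512,   # assumed shared OpenAILike param
    ),
)
```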
Server Configuration
The LlamaCpp server supports many configuration options.
Common Server Options
- `--ctx-size`: Context size (0 for unlimited)
- `--batch-size`, `-b`: Batch size for prompt processing
- `--ubatch-size`, `-ub`: Physical batch size for prompt processing
- `--threads`, `-t`: Number of threads to use
- `--host`: IP address to listen on (default: 127.0.0.1)
- `--port`: Port to listen on (default: 8080)
Model Options
- `--model`, `-m`: Model file path
- `--hf-repo`: HuggingFace model repository
- `--jinja`: Use Jinja templating for chat formatting
For a full list of options, run `llama-server --help`.
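To show how these options fit together, here is a sketch that launches `llama-server` from Python with `subprocess`; the flag values are illustrative only, and `llama-server` must already be on your PATH.

```python
import subprocess

# Launch llama-server with a few of the options described above.
# Tune the values for your hardware.
server = subprocess.Popen([
    "llama-server",
    "--hf-repo", "ggml-org/gpt-oss-20b-GGUF",  # pull the model from Hugging Face
    "--jinja",                                 # use Jinja templating for chat formatting
    "--ctx-size", "0",                         # 0 for unlimited context
    "--host", "127.0.0.1",
    "--port", "8080",
])

# Interact with the server at http://127.0.0.1:8080/v1, then stop it:
# server.terminate()
```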
Performance Optimization
Hardware Acceleration
LlamaCpp supports various acceleration backends (for example, Metal on Apple silicon and CUDA on NVIDIA GPUs); pass `--n-gpu-layers` (`-ngl`) to `llama-server` to offload model layers to the GPU.
Model Quantization
Use quantized models for better performance:
- `Q4_K_M`: Balanced size and quality
- `Q8_0`: Higher quality, larger size
- `Q2_K`: Smallest size, lower quality
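The quantization tag is usually part of the GGUF filename, so you can check which variants a repository offers by listing its files. A small sketch using `huggingface_hub`:

```python
from huggingface_hub import list_repo_files

# List the GGUF files in a repository to see the available quantizations.
for filename in list_repo_files("ggml-org/gemma-3-1b-it-GGUF"):
    if filename.endswith(".gguf"):
        print(filename)
```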
Troubleshooting
Server Connection Issues
Ensure the LlamaCpp server is running and accessible, for example by requesting `http://127.0.0.1:8080/health`.
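A quick programmatic check, assuming the `requests` package is installed and the server is on its default host and port:

```python
import requests

# llama-server exposes a /health endpoint alongside the OpenAI-compatible API.
try:
    resp = requests.get("http://127.0.0.1:8080/health", timeout=5)
    print("Server status:", resp.status_code, resp.text)
except requests.ConnectionError:
    print("Could not reach the server; is llama-server running on port 8080?")
```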
Model Loading Problems
- Verify the model file exists and is in GGUF format
- Check available memory for large models
- Ensure the model is compatible with your LlamaCpp version
Performance Issues
- Adjust batch sizes (`-b`, `-ub`) based on your hardware
- Use GPU acceleration if available
- Consider using quantized models for faster inference
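To quantify the effect of these adjustments, one rough approach is to time a generation request through the OpenAI-compatible endpoint. A sketch, assuming the `openai` package and the default server address:

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="not-needed")

start = time.time()
response = client.chat.completions.create(
    model="ggml-org/gpt-oss-20b-GGUF",
    messages=[{"role": "user", "content": "Write a haiku about llamas."}],
)
elapsed = time.time() - start

# usage may be absent depending on the server version
tokens = response.usage.completion_tokens if response.usage else None
if tokens:
    print(f"Generated {tokens} tokens in {elapsed:.1f}s ({tokens / elapsed:.1f} tok/s)")
else:
    print(f"Request completed in {elapsed:.1f}s")
```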