Run Large Language Models locally with LlamaCpp

LlamaCpp is a powerful tool for running large language models locally with efficient inference. It supports a wide variety of open-source models in GGUF format and provides an OpenAI-compatible API server. You can find models on HuggingFace, including the default ggml-org/gpt-oss-20b-GGUF used in the examples below. We recommend experimenting to find the best model for your use case. Here are some popular model recommendations (a short sketch after the lists shows how to point an Agno agent at one of them):

Google Gemma Models

  • google/gemma-2b-it-GGUF - Lightweight 2B parameter model, great for resource-constrained environments
  • google/gemma-7b-it-GGUF - Balanced 7B model with strong performance for general tasks
  • ggml-org/gemma-3-1b-it-GGUF - Latest Gemma 3 series, efficient for everyday use

Meta Llama Models

  • Meta-Llama-3-8B-Instruct - Popular 8B parameter model with excellent instruction following
  • Meta-Llama-3.1-8B-Instruct - Enhanced version with improved capabilities and 128K context
  • Meta-Llama-3.2-3B-Instruct - Compact 3B model for faster inference

Default Options

  • ggml-org/gpt-oss-20b-GGUF - Default model for general use cases
  • Models with different quantizations (Q4_K_M, Q8_0, etc.) for different speed/quality tradeoffs
  • Choose models based on your hardware constraints and performance requirements
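Jumping ahead slightly: once the server is running with one of these models (setup steps follow below), you only need to pass a matching id to Agno's LlamaCpp model class. A minimal sketch, assuming the server was started with the Gemma 3 repository from the list above:
from agno.agent import Agent
from agno.models.llama_cpp import LlamaCpp

# Assumes llama-server was started with the matching repository, e.g.
#   llama-server -hf ggml-org/gemma-3-1b-it-GGUF --jinja
agent = Agent(
    model=LlamaCpp(id="ggml-org/gemma-3-1b-it-GGUF"),
    markdown=True,
)

agent.print_response("Summarize the benefits of running models locally.")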

Set up LlamaCpp

Install LlamaCpp

First, install LlamaCpp following the official installation guide:
install
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
make
Or using package managers:
brew install
# macOS with Homebrew
brew install llama.cpp
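As a quick sanity check, you can confirm that the llama-server binary is on your PATH before moving on (a minimal sketch; it only inspects your PATH, so adjust it if you built llama.cpp in-tree without installing):
import shutil

# Confirm the llama-server binary built or installed above is discoverable
path = shutil.which("llama-server")
print(f"llama-server found at: {path}" if path else "llama-server not found on PATH")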

Download a Model

Download a model in GGUF format following the llama.cpp model download guide. For the examples below, we use ggml-org/gpt-oss-20b-GGUF.
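If you prefer to download the weights yourself instead of letting llama-server fetch them, a minimal sketch using the huggingface_hub package looks like the following. The .gguf filename is an assumption for illustration; check the repository's file listing for the actual name:
from huggingface_hub import hf_hub_download

# Download a single GGUF file from the repository used in the examples.
# The filename is illustrative; replace it with a real file from the repo.
model_path = hf_hub_download(
    repo_id="ggml-org/gpt-oss-20b-GGUF",
    filename="gpt-oss-20b-mxfp4.gguf",  # assumed filename
)
print(model_path)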

Start the Server

Start the LlamaCpp server with your model:
start server
llama-server -hf ggml-org/gpt-oss-20b-GGUF --ctx-size 0 --jinja -ub 2048 -b 2048
This starts the server at http://127.0.0.1:8080 with OpenAI-compatible chat endpoints.
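Large models can take a while to download and load, so it can help to wait until the server answers before running an agent. A minimal sketch that polls the OpenAI-compatible /v1/models endpoint:
import time
import urllib.request

# Poll the OpenAI-compatible endpoint until the server is ready
url = "http://127.0.0.1:8080/v1/models"
for _ in range(60):
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            if resp.status == 200:
                print("llama-server is ready")
                break
    except OSError:
        time.sleep(2)
else:
    raise SystemExit("llama-server did not become ready; check the server logs")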

Example

After starting the LlamaCpp server, use the LlamaCpp model class to access it:
from agno.agent import Agent
from agno.models.llama_cpp import LlamaCpp

agent = Agent(
    model=LlamaCpp(id="ggml-org/gpt-oss-20b-GGUF"),
    markdown=True
)

# Print the response in the terminal
agent.print_response("Share a 2 sentence horror story.")
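To see tokens as they appear rather than waiting for the full reply, print_response can stream the output (assumed to work as with other Agno model providers):
# Stream the response instead of printing it all at once
agent.print_response("Share a 2 sentence horror story.", stream=True)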

Configuration

The LlamaCpp model supports customizing the server URL and model ID:
from agno.agent import Agent
from agno.models.llama_cpp import LlamaCpp

# Custom server configuration
agent = Agent(
    model=LlamaCpp(
        id="your-custom-model",
        base_url="http://localhost:8080/v1",  # Custom server URL
    ),
    markdown=True
)
You can view more LlamaCpp examples in the Agno documentation.

Params

  • id (str, default: "ggml-org/gpt-oss-20b-GGUF") - The model identifier
  • name (str, default: "LlamaCpp") - The name of this chat model instance
  • provider (str, default: "LlamaCpp") - The provider of the model
  • base_url (str, default: "http://127.0.0.1:8080/v1") - The base URL for the LlamaCpp server
LlamaCpp is a subclass of the OpenAILike class and has access to the same params.
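Because of this, OpenAI-style options such as temperature and max_tokens can be set on the model and are assumed to be forwarded to the server's chat endpoint. A minimal sketch:
from agno.agent import Agent
from agno.models.llama_cpp import LlamaCpp

# Sampling options inherited from OpenAILike; assumed to be passed through
# to the server's OpenAI-compatible chat endpoint
agent = Agent(
    model=LlamaCpp(
        id="ggml-org/gpt-oss-20b-GGUF",
        temperature=0.7,
        max_tokens=512,
    ),
    markdown=True,
)

agent.print_response("Give me three ideas for a weekend project.")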

Server Configuration

The LlamaCpp server supports many configuration options:

Common Server Options

  • --ctx-size: Context size in tokens (0 uses the context size from the model)
  • --batch-size, -b: Batch size for prompt processing
  • --ubatch-size, -ub: Physical batch size for prompt processing
  • --threads, -t: Number of threads to use
  • --host: IP address to listen on (default: 127.0.0.1)
  • --port: Port to listen on (default: 8080)

Model Options

  • --model, -m: Model file path
  • --hf-repo: HuggingFace model repository
  • --jinja: Use Jinja templating for chat formatting
For a complete list of server options, run llama-server --help.
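If you would rather manage the server from Python than from a shell, a minimal sketch using subprocess and a few of the options above (the values mirror the start command from earlier and are illustrative, not tuned recommendations):
import subprocess

# Launch llama-server with a few of the options listed above
server = subprocess.Popen([
    "llama-server",
    "-hf", "ggml-org/gpt-oss-20b-GGUF",
    "--ctx-size", "0",
    "--jinja",
    "-ub", "2048",
    "-b", "2048",
    "--host", "127.0.0.1",
    "--port", "8080",
])

# ... run your agent against http://127.0.0.1:8080/v1 ...

# Stop the server when you are done
server.terminate()
server.wait()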

Performance Optimization

Hardware Acceleration

LlamaCpp supports various acceleration backends:
gpu acceleration
# NVIDIA GPU (CUDA)
make LLAMA_CUDA=1

# Apple Metal (macOS)
make LLAMA_METAL=1

# OpenCL
make LLAMA_CLBLAST=1

Model Quantization

Use quantized models for better performance (a sketch for picking a quantized file follows the list):
  • Q4_K_M: Balanced size and quality
  • Q8_0: Higher quality, larger size
  • Q2_K: Smallest size, lower quality
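One way to see which quantizations a repository offers is to list its files and filter on the quantization tag. A minimal sketch using huggingface_hub, with a repository name taken from the recommendations above:
from huggingface_hub import list_repo_files

# List the GGUF files in a repository and look for a specific quantization
repo_id = "ggml-org/gemma-3-1b-it-GGUF"
gguf_files = [f for f in list_repo_files(repo_id) if f.endswith(".gguf")]

for name in gguf_files:
    print(name)

# Prefer the balanced Q4_K_M variant if the repository provides one
q4_candidates = [f for f in gguf_files if "Q4_K_M" in f]
print("Q4_K_M candidates:", q4_candidates)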

Troubleshooting

Server Connection Issues

Ensure the LlamaCpp server is running and accessible:
check server
curl http://127.0.0.1:8080/v1/models

Model Loading Problems

  • Verify the model file exists and is in GGUF format (a quick check sketch follows this list)
  • Check available memory for large models
  • Ensure the model is compatible with your LlamaCpp version
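A quick way to rule out a corrupted or mislabeled file is to check the GGUF magic bytes and the file size (a minimal sketch; the path is a placeholder):
import os

# Placeholder path: point this at your downloaded model file
model_path = "models/your-model.gguf"

# GGUF files begin with the ASCII magic "GGUF"
with open(model_path, "rb") as f:
    magic = f.read(4)
print("GGUF magic OK" if magic == b"GGUF" else f"Unexpected magic: {magic!r}")

# Rough sense of how much memory the weights alone will need
size_gib = os.path.getsize(model_path) / 1024**3
print(f"Model file size: {size_gib:.1f} GiB")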

Performance Issues

  • Adjust batch sizes (-b, -ub) based on your hardware
  • Use GPU acceleration if available
  • Consider using quantized models for faster inference (the timing sketch below gives a quick way to compare settings)
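To compare settings such as batch sizes or quantizations, time a single request before and after a change. A minimal sketch, assuming Agent.run returns the response object as in other Agno examples:
import time

from agno.agent import Agent
from agno.models.llama_cpp import LlamaCpp

agent = Agent(model=LlamaCpp(id="ggml-org/gpt-oss-20b-GGUF"), markdown=True)

# Time a single request; rerun after changing batch sizes or quantization
start = time.perf_counter()
response = agent.run("Explain what a context window is in one paragraph.")
elapsed = time.perf_counter() - start
print(f"Response received in {elapsed:.1f}s")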