Why use embedders?
- Better recall than keywords: They understand meaning, so “How do I reset my passcode?” finds docs mentioning “change PIN”.
- Ground LLMs in your data: Provide the model with trusted, domain-specific context at answer time.
- Scale to large knowledge bases: Vectors enable fast similarity search across thousands or millions of chunks.
- Multilingual retrieval: Many embedders map different languages to the same semantic space.
When to use embedders
Use embedders when you need any of the following:
- RAG and context injection: Supply relevant snippets to your agent before responding.
- Semantic search: Let users query by meaning across product docs, wikis, tickets, or chats.
- Deduplication and clustering: Group similar content or avoid repeating the same info.
- Personal and team memory: Store summaries and facts for later recall by agents.
How it works in Agno
Agno uses `OpenAIEmbedder` as the default, but you can swap in any supported embedder. When you add content to a knowledge base, the embedder converts each chunk into a vector and stores it in your vector database. Later, when an agent searches, it embeds the query and finds the most similar vectors.
Here’s a basic setup:
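The sketch below assumes a `TextKnowledgeBase` pointed at a local folder of text files, a Postgres instance with pgvector running locally, and an `OPENAI_API_KEY` in the environment; swap in the knowledge base, vector DB, and embedder classes that match your setup.

```python
from agno.agent import Agent
from agno.embedder.openai import OpenAIEmbedder
from agno.knowledge.text import TextKnowledgeBase
from agno.vectordb.pgvector import PgVector

# The embedder turns each chunk into a vector; the vector DB stores and searches them.
knowledge = TextKnowledgeBase(
    path="docs/",  # hypothetical folder of text files to index
    vector_db=PgVector(
        table_name="docs_embeddings",
        db_url="postgresql+psycopg://ai:ai@localhost:5532/ai",  # example connection string
        embedder=OpenAIEmbedder(id="text-embedding-3-small"),
    ),
)

# Embed and store the content (run once, or whenever the docs change)
knowledge.load(recreate=False)

# The agent embeds each query and retrieves the most similar chunks
agent = Agent(knowledge=knowledge, search_knowledge=True)
agent.print_response("How do I reset my passcode?")
```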
Choosing an embedder
Pick based on your constraints:
- Hosted vs local: Prefer local (e.g., Ollama, FastEmbed) for offline use or strict data residency; hosted (OpenAI, Gemini, Voyage) for best quality and convenience.
- Latency and cost: Smaller models are cheaper/faster; larger models often retrieve better.
- Language support: Ensure your embedder supports the languages you expect.
- Dimension compatibility: Match your vector DB’s expected embedding size if it’s fixed.
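Switching between a hosted and a local embedder is usually a one-line change. The sketch below is illustrative only: it assumes the `agno.embedder.ollama` import path, the `id` parameter name, the `get_embedding` method, and a locally pulled `nomic-embed-text` model; check the embedder reference for your version.

```python
from agno.embedder.ollama import OllamaEmbedder  # assumed import path

# Local embedder: no API costs, and content never leaves your machine.
# "nomic-embed-text" is an example model pulled with `ollama pull nomic-embed-text`.
embedder = OllamaEmbedder(id="nomic-embed-text")

vector = embedder.get_embedding("How do I reset my passcode?")
print(len(vector))  # check this matches the dimension your vector DB expects
```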
Quick Comparison
| Embedder | Type | Best For | Cost | Performance |
| --- | --- | --- | --- | --- |
| OpenAI | Hosted | General use, proven quality | $$ | Excellent |
| Ollama | Local | Privacy, offline, no API costs | Free | Good |
| Voyage AI | Hosted | Specialized retrieval tasks | $$$ | Excellent |
| Gemini | Hosted | Google ecosystem, multilingual | $$ | Excellent |
| FastEmbed | Local | Fast local embeddings | Free | Good |
| HuggingFace | Local/Hosted | Open source models, customization | Free/$ | Variable |
Supported embedders
The following embedders are supported:
- OpenAI
- Cohere
- Gemini
- AWS Bedrock
- Azure OpenAI
- Fireworks
- HuggingFace
- Jina
- Mistral
- Ollama
- Qdrant FastEmbed
- Together
- Voyage AI
Best Practices
Chunk your content wisely: Split long docs into 300–1,000 token chunks with 10–20% overlap. This balances context preservation with retrieval precision; a simple chunking sketch follows these practices.
Store rich metadata: Include titles, source URLs, timestamps, and permissions with each chunk. This enables filtering and better context in responses.
Test your retrieval quality: Use a small set of test queries to evaluate if you’re finding the right chunks. Adjust chunking strategy and embedder if needed.
Re-embed when you change models: If you switch embedders, you must re-embed all your content. Vectors from different models aren’t compatible.
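The chunking guidance above can be sketched in plain Python. This is an illustrative sketch only: token counts are approximated by whitespace-split words, and the `chunk_size` and `overlap` values are picked from the ranges mentioned above; real pipelines typically use a tokenizer-aware splitter.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 75) -> list[str]:
    """Split text into overlapping chunks of roughly `chunk_size` words."""
    words = text.split()
    step = chunk_size - overlap  # ~15% overlap preserves context across boundaries
    chunks = []
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + chunk_size])
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(words):
            break
    return chunks
```

Each chunk would then be stored alongside its metadata (title, source URL, timestamp) before embedding.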
Batch Embeddings
Many embedding providers support processing multiple texts in a single API call, known as batch embedding. This approach offers several advantages: it reduces the number of API requests, helps avoid rate limits, and significantly improves performance when processing large amounts of text. To enable batch processing, set the `enable_batch` flag to `True` when configuring your embedder. The `batch_size` parameter controls how many texts are sent per batch.
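A minimal sketch of these batch settings, reusing the `OpenAIEmbedder` from earlier; the defaults and maximum batch sizes depend on the embedder and provider you use.

```python
from agno.embedder.openai import OpenAIEmbedder

embedder = OpenAIEmbedder(
    id="text-embedding-3-small",
    enable_batch=True,  # embed multiple chunks per API call
    batch_size=100,     # example value; tune to your provider's request limits
)
```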