Code chunking splits code based on its structure, leveraging Abstract Syntax Trees (ASTs) to create contextually relevant segments. It uses the Chonkie library to identify natural code boundaries such as functions, classes, and blocks. This preserves code semantics better than fixed-size chunking: related code stays together in the same chunk, and splits occur only at meaningful structural boundaries. Code chunking supports several built-in tokenizers as well as a custom Tokenizer instance.
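
To see this boundary-aware splitting in isolation, here is a minimal sketch that calls Chonkie's CodeChunker directly. It is a sketch only: the sample source and the deliberately small chunk_size are illustrative, and it assumes chonkie is installed with the [code] extra, as in the install step below.

from chonkie import CodeChunker

# Tiny sample source; in practice this would be a full file.
code = '''
def add(a, b):
    return a + b

class Greeter:
    def greet(self, name):
        return f"Hello, {name}!"
'''

# A deliberately small chunk_size forces splits, which land on the
# function/class boundaries rather than mid-statement.
chunker = CodeChunker(language="python", chunk_size=80)
for chunk in chunker.chunk(code):
    print(chunk.token_count, repr(chunk.text))
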
1. Create a Python file named code_chunking.py

from agno.agent import Agent
from agno.knowledge.chunking.code import CodeChunking
from agno.knowledge.knowledge import Knowledge
from agno.knowledge.reader.text_reader import TextReader
from agno.vectordb.pgvector import PgVector

# Connection string for the PgVector container started in the "Run PgVector" step
db_url = "postgresql+psycopg://ai:ai@localhost:5532/ai"

knowledge = Knowledge(
    vector_db=PgVector(table_name="python_code_chunking", db_url=db_url),
)

# Load a Python source file and chunk it along AST boundaries
knowledge.insert(
    url="https://raw.githubusercontent.com/agno-agi/agno/main/libs/agno/agno/session/workflow.py",
    reader=TextReader(
        chunking_strategy=CodeChunking(
            tokenizer="gpt2",   # measure chunk sizes in GPT-2 tokens
            chunk_size=500,     # maximum tokens per chunk
            language="python",  # parse with the Python grammar
        ),
    ),
)

agent = Agent(knowledge=knowledge, search_knowledge=True)
agent.print_response("How does the Workflow class work?", markdown=True)
2. Set up your virtual environment

uv venv --python 3.12
source .venv/bin/activate
3. Install dependencies

uv pip install -U agno sqlalchemy psycopg pgvector "chonkie[code]" openai
4. Set OpenAI Key

Set your OPENAI_API_KEY as an environment variable. You can get one from OpenAI.

Mac
export OPENAI_API_KEY=sk-***
Windows
setx OPENAI_API_KEY sk-***
5. Run PgVector

docker run -d \
  -e POSTGRES_DB=ai \
  -e POSTGRES_USER=ai \
  -e POSTGRES_PASSWORD=ai \
  -e PGDATA=/var/lib/postgresql/data/pgdata \
  -v pgvolume:/var/lib/postgresql/data \
  -p 5532:5432 \
  --name pgvector \
  agno/pgvector:16
6. Run the script

python code_chunking.py

Code Chunking Params

| Parameter | Type | Default | Description |
| --------- | ---- | ------- | ----------- |
| tokenizer | Union[str, TokenizerProtocol] | "character" | The tokenizer used to measure chunk sizes. Supports several built-in tokenizers or a custom Tokenizer instance. |
| chunk_size | int | 2048 | Maximum size of each chunk in tokens, as measured by the selected tokenizer. |
| language | Union[Literal["auto"], Any] | "auto" | The programming language to parse. Use "auto" for automatic detection, or specify a tree-sitter language name (e.g., "python", "javascript", "go", "rust"). |
| include_nodes | bool | False | Whether to include AST nodes. Note: Chonkie's base Chunk type does not store node information. |
| chunker_params | Optional[Dict[str, Any]] | None | Additional parameters passed directly to Chonkie's CodeChunker. |
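
As an illustration of these parameters, the sketch below wires up a hypothetical non-default configuration. It assumes a Hugging Face tokenizers.Tokenizer satisfies the TokenizerProtocol accepted by the tokenizer parameter; the model name and the other values are placeholders, not recommendations.

from tokenizers import Tokenizer
from agno.knowledge.chunking.code import CodeChunking

# Hypothetical settings; every value below is illustrative.
chunking = CodeChunking(
    tokenizer=Tokenizer.from_pretrained("bert-base-uncased"),  # a custom Tokenizer instance
    chunk_size=1024,        # max tokens per chunk, measured by the tokenizer above
    language="javascript",  # explicit tree-sitter language instead of "auto"
    include_nodes=False,    # Chonkie's base Chunk type does not store AST nodes
    chunker_params={},      # extra kwargs forwarded to Chonkie's CodeChunker
)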