Readers

Readers are the first step in turning content into Knowledge. They transform raw content from various sources into structured Document objects that can be chunked, embedded, and stored in vector databases.

What are Readers?

A Reader is a specialized component that knows how to parse and extract content from specific data sources or file formats. Think of readers as translators that convert different content formats into a standardized format that Agno can work with. Every piece of content that enters your knowledge base must pass through a reader first. The reader’s job is to:

Parse the raw content from its original format
Extract the meaningful text and metadata
Structure the content into Document objects
Apply chunking strategies to break large content into manageable pieces

How Readers Work

All readers inherit from the base Reader class and follow a consistent pattern:

# Every reader implements these core methods
class Reader:
    def read(self, obj, name=None) -> List[Document]:
        """Synchronously read and process content"""
        pass

    async def async_read(self, obj, name=None) -> List[Document]:
        """Asynchronously read and process content"""
        pass

The Reading Process

When a reader processes content, it follows these steps:

Content Ingestion: The reader receives raw content (file, URL, text, etc.)
Parsing: Extract text and metadata using format-specific logic
Document Creation: Convert parsed content into Document objects
Chunking: Apply chunking strategies to break content into smaller pieces
Return: Provide a list of processed documents ready for embedding
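
The five steps above can be sketched in a few lines. This is a minimal illustration, not Agno's implementation: `Document` here is a stand-in dataclass, and `SketchTextReader` and its fixed-size character chunking are hypothetical names and behavior chosen to keep the example self-contained.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Document:
    """Stand-in for Agno's Document; only the fields used in this sketch."""
    content: str
    name: Optional[str] = None
    meta_data: Dict[str, Any] = field(default_factory=dict)


class SketchTextReader:
    """Hypothetical reader that walks through the five steps for plain text."""

    def __init__(self, chunk: bool = True, chunk_size: int = 1000) -> None:
        self.chunk = chunk
        self.chunk_size = chunk_size

    def read(self, raw: bytes, name: Optional[str] = None) -> List[Document]:
        # 1. Content ingestion: raw bytes arrive from a file, URL, etc.
        # 2. Parsing: format-specific logic; for plain text, decoding is enough.
        text = raw.decode("utf-8").strip()
        if not text:
            return []
        # 3. Document creation.
        doc = Document(content=text, name=name, meta_data={"source": name})
        if not self.chunk:
            return [doc]
        # 4. Chunking: split into fixed-size character windows.
        chunks: List[Document] = []
        for i, start in enumerate(range(0, len(text), self.chunk_size), start=1):
            chunks.append(
                Document(
                    content=text[start:start + self.chunk_size],
                    name=name,
                    meta_data={**doc.meta_data, "chunk": i},
                )
            )
        # 5. Return: documents ready for embedding.
        return chunks


docs = SketchTextReader(chunk_size=10).read(b"abcdefghijklmnopqrstuvwxy", name="demo")
print([d.content for d in docs])  # ['abcdefghij', 'klmnopqrst', 'uvwxy']
```

Real readers replace step 2 with format-specific parsing and delegate step 4 to a chunking strategy rather than slicing characters directly.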

Content Types and Specialization

Each reader specializes in handling specific content types:

@classmethod
def get_supported_content_types(cls) -> List[ContentType]:
    """Returns the content types this reader can handle"""
    return [ContentType.PDF]  # Example for PDFReader

This specialization allows each reader to:

Use format-specific parsing libraries
Extract relevant metadata
Handle format-specific challenges (encryption, encoding, etc.)
Optimize processing for that content type
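
To illustrate how declared content types can drive reader selection, here is a small sketch. The `ContentType` enum, its values, the `Sketch*` classes, and `pick_reader` are all stand-ins invented for this example; only the `get_supported_content_types` method name comes from the snippet above.

```python
from enum import Enum
from typing import List, Optional, Sequence, Type


class ContentType(Enum):
    """Stand-in enum; the member values are placeholders for this sketch."""
    PDF = "pdf"
    CSV = "csv"


class SketchReader:
    @classmethod
    def get_supported_content_types(cls) -> List[ContentType]:
        return []


class SketchPDFReader(SketchReader):
    @classmethod
    def get_supported_content_types(cls) -> List[ContentType]:
        return [ContentType.PDF]


class SketchCSVReader(SketchReader):
    @classmethod
    def get_supported_content_types(cls) -> List[ContentType]:
        return [ContentType.CSV]


def pick_reader(
    content_type: ContentType,
    candidates: Sequence[Type[SketchReader]],
) -> Optional[Type[SketchReader]]:
    """Return the first candidate that declares support for content_type."""
    for reader_cls in candidates:
        if content_type in reader_cls.get_supported_content_types():
            return reader_cls
    return None


chosen = pick_reader(ContentType.CSV, [SketchPDFReader, SketchCSVReader])
print(chosen.__name__)  # SketchCSVReader
```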

Reader Configuration

Readers expose options for chunking, content processing, encoding, and naming:

Chunking Control

reader = PDFReader(
    chunk=True,                    # Enable/disable chunking
    chunk_size=1000,              # Size of each chunk
    chunking_strategy=MyStrategy() # Custom chunking logic
)

Content Processing Options

reader = PDFReader(
    split_on_pages=True,          # Create separate documents per page
    password="secret123",         # Handle encrypted PDFs
    read_images=True             # Extract text from images via OCR
)

Encoding Control

For text-based readers, you can override the file encoding:

reader = TextReader(
    encoding="utf-8"              # Override default encoding
)

reader = CSVReader(
    encoding="latin-1"            # Handle files with specific encodings
)

reader = MarkdownReader(
    encoding="cp1252"             # Windows-specific encoding
)
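
Why the encoding override matters can be shown with the standard library alone. The sketch below does not use Agno; it writes a small Latin-1 file and shows that reading it as UTF-8 fails while the matching encoding succeeds.

```python
import os
import tempfile

# Write "café" using Latin-1, where "é" is the single byte 0xE9.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write("café".encode("latin-1"))
    path = tmp.name

try:
    # Decoding with the wrong encoding fails: a lone 0xE9 is not valid UTF-8.
    try:
        with open(path, encoding="utf-8") as f:
            f.read()
        utf8_ok = True
    except UnicodeDecodeError:
        utf8_ok = False

    # Decoding with the encoding the file was written in round-trips cleanly.
    with open(path, encoding="latin-1") as f:
        latin1_text = f.read()
finally:
    os.remove(path)

print(utf8_ok, latin1_text)  # False café
```

When a reader returns garbled or missing text for a file, a mismatched encoding is one of the first things worth checking.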

Metadata and Naming

documents = reader.read(
    file_path,
    name="custom_document_name",  # Override default naming
    password="file_password"      # Runtime password override
)

The Document Output

Readers convert raw content into Document objects with this structure:

Document(
    content="The extracted text content...",
    id="unique_document_identifier",
    name="document_name",
    meta_data={
        "page": 1,                # Page number for PDFs
        "url": "https://...",     # Source URL for web content
        "author": "...",          # Document metadata
    },
    size=len(content)             # Content size in characters
)

Chunking Integration

One of the most important features of readers is their integration with chunking strategies:

Automatic Chunking

When chunk=True, readers automatically apply chunking strategies to break large documents into smaller, more manageable pieces:

# Large PDF gets broken into multiple documents
pdf_reader = PDFReader(chunk=True, chunk_size=1000)
documents = pdf_reader.read("large_document.pdf")
# Returns: [Document(chunk1), Document(chunk2), Document(chunk3), ...]
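
To make the effect of chunk_size concrete, the following sketch packs whole words into chunks up to a character limit. It is an assumption-laden illustration: `fixed_size_chunks` is a hypothetical helper, and Agno's own chunking strategies may split differently (for example with overlap or by tokens).

```python
from typing import List


def fixed_size_chunks(text: str, chunk_size: int) -> List[str]:
    """Greedily pack whole words into chunks of up to chunk_size characters.

    A single word longer than chunk_size is kept whole as its own chunk.
    """
    chunks: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


text = "one two three four five six"
print(fixed_size_chunks(text, 10))   # ['one two', 'three four', 'five six']
print(fixed_size_chunks(text, 100))  # ['one two three four five six']
```

A smaller limit yields more, narrower chunks; a larger one keeps more surrounding context in each chunk.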

Chunking Strategy Support

Different readers support different chunking strategies based on their content type:

@classmethod
def get_supported_chunking_strategies(cls) -> List[ChunkingStrategyType]:
    return [
        ChunkingStrategyType.DOCUMENT_CHUNKING,  # Respect document structure
        ChunkingStrategyType.FIXED_SIZE_CHUNKING, # Fixed character/token limits
        ChunkingStrategyType.SEMANTIC_CHUNKING,   # Semantic boundaries
        ChunkingStrategyType.AGENTIC_CHUNKING,    # AI-powered chunking
    ]
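
One way to use this list is to validate a requested strategy before configuring a reader. The sketch below is hypothetical: the enum is a stand-in that mirrors the four members shown above with placeholder values, and `SketchCSVReader` and `ensure_supported` are names invented for the example.

```python
from enum import Enum
from typing import List


class ChunkingStrategyType(Enum):
    """Stand-in mirroring the members shown above; values are placeholders."""
    DOCUMENT_CHUNKING = "document"
    FIXED_SIZE_CHUNKING = "fixed_size"
    SEMANTIC_CHUNKING = "semantic"
    AGENTIC_CHUNKING = "agentic"


class SketchCSVReader:
    @classmethod
    def get_supported_chunking_strategies(cls) -> List[ChunkingStrategyType]:
        # Assumption for this sketch: only fixed-size chunking is supported.
        return [ChunkingStrategyType.FIXED_SIZE_CHUNKING]


def ensure_supported(reader_cls, requested: ChunkingStrategyType) -> ChunkingStrategyType:
    """Return requested if the reader supports it, otherwise raise ValueError."""
    supported = reader_cls.get_supported_chunking_strategies()
    if requested not in supported:
        names = ", ".join(s.name for s in supported)
        raise ValueError(
            f"{reader_cls.__name__} does not support {requested.name}; "
            f"supported: {names}"
        )
    return requested


print(ensure_supported(SketchCSVReader, ChunkingStrategyType.FIXED_SIZE_CHUNKING).name)
```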

Reader Factory and Auto-Selection

The ReaderFactory selects a suitable reader for you based on a file extension or a URL:

# Automatic reader selection based on file extension
reader = ReaderFactory.get_reader_for_extension(".pdf")  # Returns PDFReader
reader = ReaderFactory.get_reader_for_extension(".csv")  # Returns CSVReader

# URL-based reader selection
reader = ReaderFactory.get_reader_for_url("https://youtube.com/watch?v=...")  # YouTubeReader
reader = ReaderFactory.get_reader_for_url("https://example.com/doc.pdf")     # PDFReader
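
The selection logic can be pictured as a lookup on the URL's host and path suffix. This is only a sketch of the idea, not the ReaderFactory source: the registry contents, the host list, the `reader_name_for_url` helper, and the WebsiteReader default are assumptions, and reader names are plain strings to keep the example self-contained.

```python
from pathlib import PurePosixPath
from typing import Dict
from urllib.parse import urlparse

# Hypothetical registry mapping file extensions to reader names.
EXTENSION_TO_READER: Dict[str, str] = {
    ".pdf": "PDFReader",
    ".csv": "CSVReader",
}
# Hypothetical set of hosts treated as YouTube.
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "youtu.be"}


def reader_name_for_url(url: str, default: str = "WebsiteReader") -> str:
    """Pick a reader name from the URL host first, then from the path suffix."""
    parsed = urlparse(url)
    if parsed.hostname in YOUTUBE_HOSTS:
        return "YouTubeReader"
    suffix = PurePosixPath(parsed.path).suffix.lower()
    return EXTENSION_TO_READER.get(suffix, default)


print(reader_name_for_url("https://youtube.com/watch?v=abc"))  # YouTubeReader
print(reader_name_for_url("https://example.com/doc.pdf"))      # PDFReader
print(reader_name_for_url("https://example.com/about"))        # WebsiteReader
```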

Supported Readers

The following readers are currently supported:

Reader Name	Description
ArxivReader	Fetches and processes academic papers from arXiv
CSVReader	Parses CSV files and converts rows to documents
FieldLabeledCSVReader	Converts CSV rows to field-labeled text documents
FirecrawlReader	Uses Firecrawl API to scrape and crawl web content
JSONReader	Processes JSON files and converts them into documents
MarkdownReader	Reads and parses Markdown files
PDFReader	Reads and extracts text from PDF files
PPTXReader	Reads and extracts text from PowerPoint (.pptx) files
TextReader	Handles plain text files
WebsiteReader	Crawls entire websites following links recursively
WebSearchReader	Searches and reads web search results
WikipediaReader	Searches and reads Wikipedia articles
YouTubeReader	Extracts transcripts and metadata from YouTube videos

Async Processing

All readers support asynchronous processing, which helps most when reading is I/O-bound (network fetches, many files):

import asyncio

# Synchronous reading
documents = reader.read("file.pdf")


async def main():
    # Asynchronous reading - better for I/O intensive operations
    documents = await reader.async_read("file.pdf")

    # Batch processing with async: gather() returns one list of documents per file
    tasks = [reader.async_read(file) for file in file_list]
    all_documents = await asyncio.gather(*tasks)


asyncio.run(main())
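
The batch pattern can be tried without any files using a stub. In the sketch below, `StubReader` and `read_all` are hypothetical; the stub simulates I/O with a short sleep and returns strings in place of Document objects. It also shows flattening the per-file lists that `asyncio.gather` returns.

```python
import asyncio
from typing import List


class StubReader:
    """Hypothetical reader whose async_read simulates I/O with a short sleep."""

    async def async_read(self, path: str) -> List[str]:
        await asyncio.sleep(0.01)  # Stand-in for file or network I/O
        return [f"{path}#chunk1", f"{path}#chunk2"]


async def read_all(reader: StubReader, paths: List[str]) -> List[str]:
    # gather() runs the reads concurrently and preserves input order.
    per_file = await asyncio.gather(*(reader.async_read(p) for p in paths))
    # Each read returns its own list, so flatten into one list of documents.
    return [doc for docs in per_file for doc in docs]


all_documents = asyncio.run(read_all(StubReader(), ["a.pdf", "b.pdf"]))
print(all_documents)
# ['a.pdf#chunk1', 'a.pdf#chunk2', 'b.pdf#chunk1', 'b.pdf#chunk2']
```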

Usage in Knowledge

Pass a configured reader to Knowledge when adding content to override the default reader:

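from agno.knowledge.chunking.semantic import SemanticChunking
from agno.knowledge.knowledge import Knowledge
from agno.knowledge.reader.pdf_reader import PDFReader

# Custom reader configuration
reader = PDFReader(
    chunk_size=1000,
    chunking_strategy=SemanticChunking(),
)

# vector_db is assumed to be an already configured vector database instance
knowledge_base = Knowledge(
    vector_db=vector_db,
)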

# Use custom reader
knowledge_base.add_content(
    path="data/documents",
    reader=reader  # Override default reader
)

Best Practices

Choose the Right Reader

Use specialized readers for better extraction quality
Consider format-specific features (PDF encryption, CSV delimiters, etc.)

Configure Chunking Appropriately

Smaller chunks for precise retrieval
Larger chunks for maintaining context
Use semantic chunking for structured documents

Optimize for Performance

Use async readers for I/O-heavy operations
Batch process multiple files when possible
Cache readers through ReaderFactory when processing many files

Handle Errors Gracefully

Readers return an empty list when processing fails, so check for empty results
Check reader logs for debugging information
Provide fallback readers for unknown formats
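
A fallback chain along these lines could look like the sketch below. It assumes only what is stated above (a failed read yields an empty list); `read_with_fallback` and the two stub readers are hypothetical, and plain callables returning strings stand in for real readers and Document objects.

```python
import logging
from typing import Callable, List, Sequence

logger = logging.getLogger(__name__)

ReadFn = Callable[[str], List[str]]


def read_with_fallback(path: str, readers: Sequence[ReadFn]) -> List[str]:
    """Try each reader in order; an empty result counts as a failure."""
    for read in readers:
        try:
            documents = read(path)
        except Exception:
            # Defensive: log and move on if a reader raises instead.
            logger.exception("Reader %r raised on %s", read, path)
            continue
        if documents:
            return documents
        logger.warning("Reader %r returned no documents for %s", read, path)
    return []


def failing_reader(path: str) -> List[str]:
    return []  # Simulates a reader that could not process the file


def plain_text_reader(path: str) -> List[str]:
    return [f"text of {path}"]


print(read_with_fallback("notes.xyz", [failing_reader, plain_text_reader]))
# ['text of notes.xyz']
```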

Next Steps

Chunking Strategies

Learn how to optimize content chunking for better search results

Content Types

Understand different ways to add information to your knowledge base

Vector Databases

Choose the right storage solution for your processed content

Examples

See readers in action with practical examples
