Readers are the first step in creating Knowledge from content. They transform raw content from various sources into structured Document objects that can be chunked, embedded, and stored in vector databases.

What are Readers?

A Reader is a specialized component that knows how to parse and extract content from specific data sources or file formats. Think of readers as translators that convert different content formats into a standardized format that Agno can work with. Every piece of content that enters your knowledge base must pass through a reader first. The reader’s job is to:
  1. Parse the raw content from its original format
  2. Extract the meaningful text and metadata
  3. Structure the content into Document objects
  4. Apply chunking strategies to break large content into manageable pieces

How Readers Work

All readers inherit from the base Reader class and follow a consistent pattern:
# Every reader implements these core methods
class Reader:
    def read(self, obj, name=None) -> List[Document]:
        """Synchronously read and process content"""
        pass

    async def async_read(self, obj, name=None) -> List[Document]:
        """Asynchronously read and process content"""
        pass
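
To make the pattern concrete, here is a minimal, self-contained sketch. The `Document` dataclass and `PlainTextReader` below are hypothetical stand-ins rather than Agno's own classes, and delegating the async method to the sync one through `asyncio.to_thread` is only one possible approach, not necessarily how Agno implements it.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Document:
    """Illustrative stand-in for a document object (not Agno's class)."""
    content: str
    name: Optional[str] = None
    meta_data: Dict[str, Any] = field(default_factory=dict)


class PlainTextReader:
    """Hypothetical reader that turns a string into a single Document."""

    def read(self, obj: str, name: Optional[str] = None) -> List[Document]:
        text = obj.strip()
        if not text:
            return []  # nothing to extract
        return [Document(content=text, name=name or "untitled")]

    async def async_read(self, obj: str, name: Optional[str] = None) -> List[Document]:
        # Run the synchronous implementation in a worker thread so the
        # event loop is not blocked while the content is processed.
        return await asyncio.to_thread(self.read, obj, name)


docs = PlainTextReader().read("  hello readers  ", name="greeting")
async_docs = asyncio.run(PlainTextReader().async_read("hello again"))
```

Both methods return the same shape of result, so calling code can switch between them without changing how it handles the documents.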

The Reading Process

When a reader processes content, it follows these steps:
  1. Content Ingestion: The reader receives raw content (file, URL, text, etc.)
  2. Parsing: Extract text and metadata using format-specific logic
  3. Document Creation: Convert parsed content into Document objects
  4. Chunking: Apply chunking strategies to break content into smaller pieces
  5. Return: Provide a list of processed documents ready for embedding
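
The five steps above can be sketched as a small pipeline. Everything here is an illustrative assumption: the `Document` dataclass, the function names, and the fixed-size character chunking are placeholders chosen to mirror the steps, not Agno's internals.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Document:
    """Illustrative stand-in, not Agno's Document class."""
    content: str
    name: str
    meta_data: Dict[str, Any] = field(default_factory=dict)


def parse(raw: bytes) -> str:
    """Step 2: format-specific parsing (here: decode UTF-8, normalize whitespace)."""
    return " ".join(raw.decode("utf-8").split())


def to_document(text: str, name: str) -> Document:
    """Step 3: wrap the parsed text in a Document."""
    return Document(content=text, name=name, meta_data={"chars": len(text)})


def chunk(doc: Document, chunk_size: int) -> List[Document]:
    """Step 4: split the content into fixed-size character chunks."""
    pieces = [doc.content[i:i + chunk_size] for i in range(0, len(doc.content), chunk_size)]
    return [
        Document(content=p, name=f"{doc.name}_{n}", meta_data={**doc.meta_data, "chunk": n})
        for n, p in enumerate(pieces, start=1)
    ]


def read(raw: bytes, name: str, chunk_size: int = 20) -> List[Document]:
    """Steps 1 and 5: accept raw content and return the processed documents."""
    return chunk(to_document(parse(raw), name), chunk_size)


documents = read(b"Readers turn raw\ncontent   into documents.", name="intro")
```

Keeping each step in its own function makes it easy to swap the parsing or chunking logic for a different format without touching the rest of the pipeline.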

Content Types and Specialization

Each reader specializes in handling specific content types:
@classmethod
def get_supported_content_types(cls) -> List[ContentType]:
    """Returns the content types this reader can handle"""
    return [ContentType.PDF]  # Example for PDFReader

This specialization allows each reader to:
  • Use format-specific parsing libraries
  • Extract relevant metadata
  • Handle format-specific challenges (encryption, encoding, etc.)
  • Optimize processing for that content type
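
One way the declared content types could be used is to build a lookup from content type to reader class. The sketch below is an assumption for illustration only: `ContentType`, `BaseReader`, the two toy reader classes, and `build_registry` are hypothetical and do not reproduce Agno's definitions.

```python
from enum import Enum
from typing import Dict, List, Type


class ContentType(str, Enum):
    """Hypothetical subset of content types, for illustration only."""
    PDF = ".pdf"
    CSV = ".csv"


class BaseReader:
    @classmethod
    def get_supported_content_types(cls) -> List[ContentType]:
        return []


class ToyPDFReader(BaseReader):
    @classmethod
    def get_supported_content_types(cls) -> List[ContentType]:
        return [ContentType.PDF]


class ToyCSVReader(BaseReader):
    @classmethod
    def get_supported_content_types(cls) -> List[ContentType]:
        return [ContentType.CSV]


def build_registry(readers: List[Type[BaseReader]]) -> Dict[ContentType, Type[BaseReader]]:
    """Map each declared content type to the reader class that handles it."""
    registry: Dict[ContentType, Type[BaseReader]] = {}
    for reader_cls in readers:
        for content_type in reader_cls.get_supported_content_types():
            registry[content_type] = reader_cls
    return registry


registry = build_registry([ToyPDFReader, ToyCSVReader])
```

Because each reader declares its own types, adding support for a new format only requires registering one more class.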

Reader Configuration

Readers expose options for chunking, content processing, and naming:

Chunking Control

reader = PDFReader(
    chunk=True,                    # Enable/disable chunking
    chunk_size=1000,              # Size of each chunk
    chunking_strategy=MyStrategy() # Custom chunking logic
)

Content Processing Options

reader = PDFReader(
    split_on_pages=True,          # Create separate documents per page
    password="secret123",         # Handle encrypted PDFs
    read_images=True             # Extract text from images via OCR
)

Metadata and Naming

documents = reader.read(
    file_path,
    name="custom_document_name",  # Override default naming
    password="file_password"      # Runtime password override
)

The Document Output

Readers convert raw content into Document objects with this structure:
Document(
    content="The extracted text content...",
    id="unique_document_identifier",
    name="document_name",
    meta_data={
        "page": 1,                # Page number for PDFs
        "url": "https://...",     # Source URL for web content
        "author": "...",          # Document metadata
    },
    size=len(content)             # Content size in characters
)

Chunking Integration

Readers apply chunking strategies as part of reading, so the documents they return can already be split into retrieval-sized pieces:

Automatic Chunking

When chunk=True, readers automatically apply chunking strategies to break large documents into smaller, more manageable pieces:
# Large PDF gets broken into multiple documents
pdf_reader = PDFReader(chunk=True, chunk_size=1000)
documents = pdf_reader.read("large_document.pdf")
# Returns: [Document(chunk1), Document(chunk2), Document(chunk3), ...]
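
The effect of the `chunk` flag and `chunk_size` can be illustrated with a plain-Python sketch. The `split_text` helper is hypothetical, and its preference for breaking on whitespace is an assumption made for the example; Agno's chunking strategies may split differently.

```python
from typing import List


def split_text(text: str, chunk: bool = True, chunk_size: int = 1000) -> List[str]:
    """Return the whole text when chunking is off; otherwise return pieces of
    at most chunk_size characters, breaking on whitespace where possible."""
    if not chunk or len(text) <= chunk_size:
        return [text]
    chunks: List[str] = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Prefer to cut at the last space inside the current window.
            cut = text.rfind(" ", start, end)
            if cut > start:
                end = cut
        chunks.append(text[start:end].strip())
        start = end
    # Drop pieces that were whitespace only.
    return [c for c in chunks if c]


text = "word " * 500  # 2,500 characters of sample content
pieces = split_text(text, chunk=True, chunk_size=1000)
whole = split_text(text, chunk=False)
```

With chunking disabled the content comes back as a single piece, which matches the idea that `chunk=False` yields one document per source unit.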

Chunking Strategy Support

Different readers support different chunking strategies based on their content type:
@classmethod
def get_supported_chunking_strategies(cls) -> List[ChunkingStrategyType]:
    return [
        ChunkingStrategyType.DOCUMENT_CHUNKING,  # Respect document structure
        ChunkingStrategyType.FIXED_SIZE_CHUNKING, # Fixed character/token limits
        ChunkingStrategyType.SEMANTIC_CHUNKING,   # Semantic boundaries
        ChunkingStrategyType.AGENTIC_CHUNKING,    # AI-powered chunking
    ]
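
A caller might use such a list to validate a requested strategy before reading. The sketch below assumes a hypothetical `pick_strategy` helper and a local enum that mirrors the names above; neither is part of Agno, and the fallback to fixed-size chunking is an illustrative choice.

```python
from enum import Enum
from typing import List


class ChunkingStrategyType(Enum):
    """Hypothetical mirror of the strategy names listed above."""
    DOCUMENT_CHUNKING = "document"
    FIXED_SIZE_CHUNKING = "fixed_size"
    SEMANTIC_CHUNKING = "semantic"
    AGENTIC_CHUNKING = "agentic"


def pick_strategy(
    requested: ChunkingStrategyType,
    supported: List[ChunkingStrategyType],
    default: ChunkingStrategyType = ChunkingStrategyType.FIXED_SIZE_CHUNKING,
) -> ChunkingStrategyType:
    """Use the requested strategy if supported, else the default if supported."""
    if requested in supported:
        return requested
    if default in supported:
        return default
    raise ValueError(f"No usable chunking strategy among {supported}")


# Hypothetical support list for some reader, used only for this example.
supported_by_some_reader = [
    ChunkingStrategyType.FIXED_SIZE_CHUNKING,
    ChunkingStrategyType.DOCUMENT_CHUNKING,
]
chosen = pick_strategy(ChunkingStrategyType.SEMANTIC_CHUNKING, supported_by_some_reader)
```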

Reader Factory and Auto-Selection

Agno's ReaderFactory selects a reader automatically from a file extension or a URL:
# Automatic reader selection based on file extension
reader = ReaderFactory.get_reader_for_extension(".pdf")  # Returns PDFReader
reader = ReaderFactory.get_reader_for_extension(".csv")  # Returns CSVReader

# URL-based reader selection
reader = ReaderFactory.get_reader_for_url("https://youtube.com/watch?v=...")  # YouTubeReader
reader = ReaderFactory.get_reader_for_url("https://example.com/doc.pdf")     # PDFReader
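
The selection logic can be pictured as two lookups: one by host name and one by file extension. The following is a rough sketch under stated assumptions, not the ReaderFactory implementation: the lookup tables, the function names, and the `TextReader` and `WebPageReader` fallbacks are all hypothetical, and the functions return reader names as strings rather than reader instances.

```python
from pathlib import PurePosixPath
from typing import Dict
from urllib.parse import urlparse

# Hypothetical lookup tables; values are reader names, not real classes.
EXTENSION_READERS: Dict[str, str] = {".pdf": "PDFReader", ".csv": "CSVReader"}
HOST_READERS: Dict[str, str] = {"youtube.com": "YouTubeReader", "youtu.be": "YouTubeReader"}


def reader_for_extension(extension: str) -> str:
    """Pick a reader name by extension; the plain-text fallback is an assumption."""
    return EXTENSION_READERS.get(extension.lower(), "TextReader")


def reader_for_url(url: str) -> str:
    """Pick a reader name by host first, then by the extension in the URL path."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").removeprefix("www.")
    if host in HOST_READERS:
        return HOST_READERS[host]
    suffix = PurePosixPath(parsed.path).suffix.lower()
    if suffix in EXTENSION_READERS:
        return EXTENSION_READERS[suffix]
    return "WebPageReader"  # assumed generic fallback for ordinary pages
```

Checking the host before the path means a site-specific reader wins even when the URL happens to end in a known extension.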

Supported Readers

The following readers are currently supported:
Reader Name       Description
ArxivReader       Fetches and processes academic papers from arXiv
CSVReader         Parses CSV files and converts rows to documents
FirecrawlReader   Uses Firecrawl API to scrape and crawl web content
JSONReader        Processes JSON files and converts them into documents
MarkdownReader    Reads and parses Markdown files
PDFReader         Reads and extracts text from PDF files
TextReader        Handles plain text files
WebPageReader     Scrapes and processes content from web pages
WebsiteReader     Crawls entire websites following links recursively
WikipediaReader   Searches and reads Wikipedia articles
YouTubeReader     Extracts transcripts and metadata from YouTube videos

Async Processing

All readers support asynchronous processing for better performance:
import asyncio

# Synchronous reading
documents = reader.read("file.pdf")

# Asynchronous reading (call from inside an async function) - better for I/O-intensive operations
documents = await reader.async_read("file.pdf")

# Batch processing with async: gather returns one list of documents per file
tasks = [reader.async_read(file) for file in file_list]
all_documents = await asyncio.gather(*tasks)
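
When the file list is long, it can help to cap how many reads run at once. The sketch below uses a stand-in `FakeReader` (hypothetical, it only simulates I/O with a short sleep) and a hypothetical `read_all` helper; the concurrency limit of 4 is an arbitrary example value.

```python
import asyncio
from typing import List


class FakeReader:
    """Stand-in reader; async_read simulates I/O with a short sleep."""

    async def async_read(self, path: str) -> List[str]:
        await asyncio.sleep(0.01)
        return [f"document from {path}"]


async def read_all(reader: FakeReader, paths: List[str], max_concurrency: int = 4) -> List[str]:
    semaphore = asyncio.Semaphore(max_concurrency)

    async def read_one(path: str) -> List[str]:
        async with semaphore:  # at most max_concurrency reads in flight
            return await reader.async_read(path)

    per_file = await asyncio.gather(*(read_one(p) for p in paths))
    # gather returns one list per file, in input order; flatten them.
    return [doc for docs in per_file for doc in docs]


all_documents = asyncio.run(read_all(FakeReader(), ["a.pdf", "b.pdf", "c.pdf"]))
```

Note that `asyncio.gather` returns a list of per-file results, so the flattening step is needed to end up with a single list of documents.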

Usage in Knowledge

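A configured reader can be passed to Knowledge to override the default reader for that content:
# Import paths may differ between Agno versions
from agno.knowledge.chunking.semantic import SemanticChunking
from agno.knowledge.knowledge import Knowledge
from agno.knowledge.reader.pdf_reader import PDFReader

# Custom reader configuration
reader = PDFReader(
    chunk_size=1000,
    chunking_strategy=SemanticChunking(),
)

# vector_db is assumed to be a vector database instance configured elsewhere
knowledge_base = Knowledge(
    vector_db=vector_db,
)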

# Use custom reader
knowledge_base.add_content(
    path="data/documents",
    reader=reader  # Override default reader
)

Best Practices

Choose the Right Reader

  • Use specialized readers for better extraction quality
  • Consider format-specific features (PDF encryption, CSV delimiters, etc.)

Configure Chunking Appropriately

  • Use smaller chunks for precise retrieval
  • Use larger chunks to preserve context
  • Use document chunking to respect document structure, and semantic chunking to split on semantic boundaries

Optimize for Performance

  • Use async readers for I/O-heavy operations
  • Batch process multiple files when possible
  • Cache readers through ReaderFactory when processing many files
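
Reusing reader instances can be as simple as memoizing the function that creates them. This is a generic sketch with `functools.lru_cache`, not a description of how ReaderFactory caches; `ExpensiveReader` and `get_reader` are hypothetical names.

```python
from functools import lru_cache


class ExpensiveReader:
    """Hypothetical reader whose construction is costly (e.g. loads a model)."""
    instances_created = 0

    def __init__(self, extension: str) -> None:
        ExpensiveReader.instances_created += 1
        self.extension = extension


@lru_cache(maxsize=None)
def get_reader(extension: str) -> ExpensiveReader:
    """Build one reader per extension and reuse it on later calls."""
    return ExpensiveReader(extension)


readers = [get_reader(ext) for ext in [".pdf", ".pdf", ".csv", ".pdf"]]
```

A shared instance is only safe when the reader holds no per-file state, which is worth confirming before caching a real reader.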

Handle Errors Gracefully

  • Readers return an empty list when processing fails, so check for empty results
  • Check reader logs for debugging information
  • Provide fallback readers for unknown formats
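
A fallback chain can combine these points: try readers in order, log failures and empty results, and move on to the next candidate. The sketch treats a reader as any callable returning a list; `read_with_fallback` and the two toy readers are hypothetical and stand in for real reader objects.

```python
import logging
from typing import Callable, List, Sequence

logger = logging.getLogger("reader_fallback")

# A "reader" here is any callable that maps a source to a list of documents.
ReadFn = Callable[[str], List[str]]


def read_with_fallback(source: str, readers: Sequence[ReadFn]) -> List[str]:
    """Try each reader in order and return the first non-empty result."""
    for read in readers:
        name = getattr(read, "__name__", repr(read))
        try:
            documents = read(source)
        except Exception:
            logger.exception("%s raised while reading %s", name, source)
            continue
        if documents:
            return documents
        logger.warning("%s returned no documents for %s", name, source)
    return []


def pdf_only_reader(source: str) -> List[str]:
    # Hypothetical reader: returns an empty list for anything but .pdf
    return [f"pdf:{source}"] if source.endswith(".pdf") else []


def plain_text_reader(source: str) -> List[str]:
    # Hypothetical catch-all reader
    return [f"text:{source}"]


result = read_with_fallback("notes.unknown", [pdf_only_reader, plain_text_reader])
```

Returning an empty list when every reader fails keeps the helper consistent with the empty-list convention described above.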

Next Steps

Now that you understand how readers work, check out the Examples Section.