What You’ll Learn
By building this agent, you’ll understand:
- How to integrate Firecrawl for reliable web scraping and content extraction
- How to define structured output schemas using Pydantic models (sketched below)
- How to create nested data structures for complex web content
- How to handle optional fields and varied page structures
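The schema below is a minimal sketch of what such a nested Pydantic model might look like; the field names and types are illustrative assumptions based on the output described under What to Expect, not the exact definitions from web_extraction_agent.py:

```python
from typing import Optional

from pydantic import BaseModel, Field


class ContentSection(BaseModel):
    """One logical section of the page, e.g. a 'Features' or 'FAQ' block."""
    heading: Optional[str] = Field(None, description="Section heading, if present")
    content: str = Field(..., description="Text content of the section")


class PageInformation(BaseModel):
    """Structured snapshot of a scraped web page."""
    title: str = Field(..., description="Page title")
    description: Optional[str] = Field(None, description="Meta description or summary")
    features: list[str] = Field(default_factory=list, description="Key features mentioned on the page")
    sections: list[ContentSection] = Field(default_factory=list, description="Content sections in page order")
    links: dict[str, str] = Field(default_factory=dict, description="Important links as label -> URL")
    contact: Optional[str] = Field(None, description="Contact information, if the page exposes any")
```

Optional fields default to None or an empty collection, which is what lets one schema absorb pages with very different structures.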
Use Cases
Build competitive intelligence tools, content aggregation systems, knowledge base constructors, or automated documentation generators.
How It Works
The agent extracts structured data from web pages in a systematic process:
- Fetch: Uses Firecrawl to retrieve and parse the target webpage
- Analyze: Identifies key sections, elements, and hierarchical structure
- Extract: Pulls information according to the Pydantic output schema
- Structure: Organizes content into nested models (sections, metadata, links, contact info)
Code
web_extraction_agent.py
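The full file isn’t reproduced here. As a rough sketch, the wiring might look like the following, assuming the Agno Agent API with FirecrawlTools, an OpenAI model, and the PageInformation schema sketched above (the model id and instruction text are illustrative):

```python
from agno.agent import Agent
from agno.models.openai import OpenAIChat
from agno.tools.firecrawl import FirecrawlTools

# PageInformation is the nested Pydantic schema sketched earlier.
agent = Agent(
    model=OpenAIChat(id="gpt-4o"),        # illustrative model choice
    tools=[FirecrawlTools(scrape=True)],  # Firecrawl fetches and parses the page
    instructions="Extract the page title, description, features, sections, links, and contact information.",
    response_model=PageInformation,       # constrains the agent to structured output
)

response = agent.run("Extract all information from https://www.example.com")
print(response.content)  # a PageInformation instance, not free-form text
```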
What to Expect
The agent will scrape the target URL using Firecrawl and extract all information into a structured PageInformation object. The output includes the page title, description, features, organized content sections with headings, important links, contact information, and additional metadata. The structured output ensures consistency and makes the extracted data easy to process, store, or display programmatically. Optional fields handle pages with varying structures gracefully.
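Because the result is a typed Pydantic object rather than free text, downstream handling needs no string parsing; a brief illustration, continuing the sketch above:

```python
page = response.content  # PageInformation, as returned by agent.run(...)

# Typed field access
print(page.title)
for section in page.sections:
    print(section.heading or "(untitled)", "-", len(section.content), "characters")

# Serialize for storage or an API response (Pydantic v2)
print(page.model_dump_json(indent=2))
```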
Usage
1. Create a virtual environment: open the terminal and create a Python virtual environment.
2. Set your API key.
3. Install libraries.
4. Run the agent.
Next Steps
- Change the target URL to extract data from different websites
- Modify the PageInformation Pydantic model to capture additional fields (see the sketch after this list)
- Adjust the agent’s instructions to focus on specific content types
- Explore Firecrawl Tools for advanced scraping options
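For instance, capturing an extra field is a small schema change; the social_links field below is a hypothetical addition, not part of the original model:

```python
class PageInformationExtended(PageInformation):
    # Hypothetical extra field: social profile URLs found on the page
    social_links: list[str] = Field(default_factory=list, description="Social media profile URLs")
```

Pass the extended model as response_model and the agent will populate the new field on its next run.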