TrafilaturaTools provides web scraping and text extraction capabilities, with support for website crawling, batch extraction, and metadata extraction.

The toolkit accepts the following parameters:

Parameter | Type | Default | Description |
---|---|---|---|
output_format | str | "txt" | Default output format (txt, json, xml, markdown, csv, html). |
include_comments | bool | False | Whether to extract comments along with main text. |
include_tables | bool | False | Whether to include table content. |
include_images | bool | False | Whether to include image information (experimental). |
include_formatting | bool | False | Whether to preserve text formatting. |
include_links | bool | False | Whether to preserve links (experimental). |
with_metadata | bool | False | Whether to include metadata in extractions. |
favor_precision | bool | False | Whether to prefer precision over recall. |
favor_recall | bool | False | Whether to prefer recall over precision. |
target_language | Optional[str] | None | Target language filter (ISO 639-1 format). |
deduplicate | bool | True | Whether to remove duplicate segments. |
max_crawl_urls | int | 100 | Maximum number of URLs to crawl per website. |
max_known_urls | int | 1000 | Maximum number of known URLs during crawling. |
enable_extract_text | bool | True | Whether to extract text content. |
enable_extract_metadata | bool | True | Whether to extract metadata information. |
enable_html_to_text | bool | True | Whether to convert HTML content to clean text. |
enable_batch_extract | bool | True | Whether to extract content from multiple URLs in batch. |
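
To make the defaults above easier to scan, here is a minimal sketch that mirrors the parameter table as a plain dataclass. `TrafilaturaSettings` and `enabled_functions` are hypothetical names used only for illustration and are not part of the toolkit. The sketch also assumes that each `enable_*` flag switches the function of the same name on or off, which the naming suggests but the table does not state outright.

```python
from dataclasses import dataclass
from typing import List, Optional

# Output formats listed in the parameter table.
ALLOWED_FORMATS = ("txt", "json", "xml", "markdown", "csv", "html")


@dataclass
class TrafilaturaSettings:  # hypothetical name, not the toolkit's own class
    output_format: str = "txt"
    include_comments: bool = False
    include_tables: bool = False
    include_images: bool = False      # experimental
    include_formatting: bool = False
    include_links: bool = False       # experimental
    with_metadata: bool = False
    favor_precision: bool = False
    favor_recall: bool = False
    target_language: Optional[str] = None  # ISO 639-1 code, e.g. "en"
    deduplicate: bool = True
    max_crawl_urls: int = 100
    max_known_urls: int = 1000
    enable_extract_text: bool = True
    enable_extract_metadata: bool = True
    enable_html_to_text: bool = True
    enable_batch_extract: bool = True

    def __post_init__(self) -> None:
        # Reject formats that the table does not list.
        if self.output_format not in ALLOWED_FORMATS:
            raise ValueError(f"unsupported output_format: {self.output_format!r}")

    def enabled_functions(self) -> List[str]:
        # Assumption: enable_<name> gates the function called <name>.
        prefix = "enable_"
        return [
            name[len(prefix):]
            for name, value in vars(self).items()
            if name.startswith(prefix) and value
        ]


settings = TrafilaturaSettings(
    output_format="markdown",
    include_tables=True,
    enable_batch_extract=False,
)
print(settings.enabled_functions())
# ['extract_text', 'extract_metadata', 'html_to_text']
```

With the real toolkit, the same keyword arguments would presumably be passed to the `TrafilaturaTools` constructor; check the installed version for the exact signature.
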

The toolkit exposes the following functions:

Function | Description |
---|---|
extract_text | Extract clean text content from a URL or HTML. |
extract_metadata | Extract metadata information from web pages. |
html_to_text | Convert HTML content to clean text. |
crawl_website | Crawl a website and extract content from multiple pages. |
batch_extract | Extract content from multiple URLs in batch. |
get_page_info | Get comprehensive page information including metadata. |
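
To give a rough idea of what "convert HTML content to clean text" means in practice, the sketch below strips tags and skips `script` and `style` content using only the Python standard library. It is a simplified stand-in and not how Trafilatura works internally, since the library also removes boilerplate such as navigation and footers. The name `html_to_text_sketch` is hypothetical.

```python
from html.parser import HTMLParser
from typing import List


class _TextCollector(HTMLParser):
    """Collects visible text, ignoring <script> and <style> content."""

    SKIPPED = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text_sketch(html: str) -> str:
    """Hypothetical helper: return the visible text of an HTML string."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return " ".join(parser.chunks)


sample = (
    "<html><head><script>x=1</script></head>"
    "<body><h1>Title</h1><p>Hello <b>world</b></p></body></html>"
)
print(html_to_text_sketch(sample))
# Title Hello world
```
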