Capabilities
The service supports a wide range of input formats:
PDF Documents
High-fidelity extraction using pymupdf (fitz) with PyPDF2 fallback.
Microsoft Office
Native support for Word (.docx), Excel (.xlsx), and PowerPoint (.pptx).
Text & Code
Parses .txt, .md, .json, .csv, .html and other plain text formats.
Multimedia
Integrates with the OpenAI Whisper API to transcribe audio and video files.
Web Scraping
Extracts clean content from URLs using playwright and beautifulsoup4.
E-Books & Email
Processes .epub books and .mbox email archives.
Standardized Output
Regardless of the input format, the collector outputs a consistent JSON structure. This normalization is crucial for downstream embedding and indexing.
Getting Started
To learn more about the internals and how to integrate with the collector:
Architecture
Review the internal processing pipeline and tech stack.
Extensions
Learn about external integrations like YouTube and GitHub.
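
To make the idea of standardized output more concrete, here is a minimal sketch of how a few plain text formats might be normalized into one dict shape before being serialized to JSON. It is an illustration only: the `normalize` function and the field names (`source`, `format`, `content`, `char_count`) are hypothetical and are not the collector's documented schema. It uses only the Python standard library.

```python
import csv
import io
import json
from pathlib import Path


def normalize(filename: str, raw: bytes) -> dict:
    """Convert a plain text input into one consistent dict shape.

    Hypothetical schema: the keys below are assumptions made for
    illustration, not the collector's actual output fields.
    """
    suffix = Path(filename).suffix.lower()
    text = raw.decode("utf-8", errors="replace")

    if suffix == ".json":
        # Re-serialize so the body is always a predictably formatted string.
        content = json.dumps(json.loads(text), indent=2, sort_keys=True)
    elif suffix == ".csv":
        # Flatten each row into a tab-separated line.
        rows = csv.reader(io.StringIO(text))
        content = "\n".join("\t".join(row) for row in rows)
    else:
        # .txt, .md, and other plain text pass through unchanged.
        content = text

    return {
        "source": filename,
        "format": suffix.lstrip(".") or "unknown",
        "content": content,
        "char_count": len(content),
    }


if __name__ == "__main__":
    samples = {
        "notes.md": b"# Title\nSome text.",
        "table.csv": b"a,b\n1,2\n",
    }
    for name, data in samples.items():
        print(json.dumps(normalize(name, data)))
```

Whatever the real field names are, the point of the pattern is that every input yields the same set of keys, so downstream embedding and indexing code does not need to branch on the original file type.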