I just parsed 500 research papers in under 10 minutes. On CPU. No GPU. No cloud.
OpenDataLoader PDF, the open-source PDF parser that just hit #1 on every benchmark that matters for AI.
Here's what blew me away:
1️⃣ 0.907 overall accuracy (beats docling, marker, pymupdf4llm)
2️⃣ 0.928 table extraction, the hardest thing to get right in PDFs
3️⃣ 60+ pages/second in local mode. 100+ pages/sec with batch on 8-core machines
4️⃣ Bounding boxes on every element, exact pixel coordinates for RAG citations
5️⃣ Built-in prompt injection filtering (most parsers don't even think about this)
And it's 3 lines of Python:
pip install opendataloader-pdf
import opendataloader_pdf
opendataloader_pdf.convert(
    input_path=["folder/"],
    output_dir="output/",
    format="markdown,json"
)
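If you want to feed the results into a RAG pipeline, a minimal sketch like the one below could pick up the converted files. It assumes the converter writes one `.md` file per input PDF somewhere under `output/`; that layout and the helper name `load_markdown_outputs` are my assumptions for illustration, not part of the library.

```python
from pathlib import Path


def load_markdown_outputs(output_dir: str) -> dict[str, str]:
    """Map each Markdown file under output_dir (relative path) to its text.

    Assumption: the converter has already written .md files into output_dir.
    This helper is hypothetical and uses only the standard library.
    """
    root = Path(output_dir)
    docs = {}
    # Search subfolders too, in case the output mirrors the input tree.
    for md_file in sorted(root.rglob("*.md")):
        key = md_file.relative_to(root).as_posix()
        docs[key] = md_file.read_text(encoding="utf-8")
    return docs
```

Calling `load_markdown_outputs("output/")` after the conversion would give you a dict you can hand to whatever chunker or embedder you already use.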
For complex tables and scanned PDFs, hybrid mode routes pages to an AI backend, still running 100% locally.
Also: it's the first open-source tool to auto-generate Tagged PDFs for accessibility compliance (the European Accessibility Act deadline was June 2025, and a lot of teams are scrambling).
19.9k stars. Apache 2.0. Worth 30 seconds of your time.
What PDF parser are you using for your RAG pipeline right now?