philippschaefer shared this post · May 5
Hao Hoang

I just parsed 500 research papers in under 10 minutes. On CPU. No GPU. No cloud.

OpenDataLoader PDF, the open-source PDF parser that just hit #1 on every benchmark that matters for AI.

Here's what blew me away:
1️⃣ 0.907 overall accuracy (beats docling, marker, pymupdf4llm)
2️⃣ 0.928 table extraction, the hardest thing to get right in PDFs
3️⃣ 60+ pages/second in local mode. 100+ pages/sec with batch on 8-core machines
4️⃣ Bounding boxes on every element, exact pixel coordinates for RAG citations
5️⃣ Built-in prompt injection filtering (most parsers don't even think about this)
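A rough sanity check on those throughput numbers, as a minimal sketch. The 60 and 100 pages/second rates and the 500-papers-in-10-minutes figure come from the post. The average of 12 pages per paper is an assumption made here for illustration, and the helper names are hypothetical, not part of any library.

```python
def required_rate(total_pages: int, budget_seconds: float) -> float:
    """Pages per second needed to finish total_pages within budget_seconds."""
    if budget_seconds <= 0:
        raise ValueError("budget_seconds must be positive")
    return total_pages / budget_seconds


def seconds_needed(total_pages: int, pages_per_second: float) -> float:
    """Wall-clock seconds to parse total_pages at a steady pages_per_second."""
    if pages_per_second <= 0:
        raise ValueError("pages_per_second must be positive")
    return total_pages / pages_per_second


PAPERS = 500               # from the post
PAGES_PER_PAPER = 12       # ASSUMPTION: average paper length, not stated in the post
BUDGET_SECONDS = 10 * 60   # "under 10 minutes", from the post

total_pages = PAPERS * PAGES_PER_PAPER

print(f"Total pages: {total_pages}")
print(f"Rate needed to fit the budget: {required_rate(total_pages, BUDGET_SECONDS):.1f} pages/s")
print(f"Time at 60 pages/s (local mode): {seconds_needed(total_pages, 60):.0f} s")
print(f"Time at 100 pages/s (batch mode): {seconds_needed(total_pages, 100):.0f} s")
```

Under that page-count assumption, the 10-minute figure needs far less than the quoted peak rates, so the two claims are at least consistent with each other.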

And it's 3 lines of Python:

pip install opendataloader-pdf

import opendataloader_pdf

opendataloader_pdf.convert(
    input_path=["folder/"],
    output_dir="output/",
    format="markdown,json"
)

For complex tables and scanned PDFs, hybrid mode routes pages to an AI backend, still running 100% locally.

Also: it's the first open-source tool to auto-generate Tagged PDFs for accessibility compliance (EAA deadline was June 2025, a lot of teams are scrambling).

19.9k stars. Apache 2.0. Worth 30 seconds of your time.

What PDF parser are you using for your RAG pipeline right now? 👇

Carmelo Juanes Rodríguez: 0.928 on tables is impressive, but the real win is CPU-only parsing at scale. That's the constraint we hit constantly in production workflows where latency matters as much as accuracy. (May 3 · 6 likes)
Prab Singh: Parsing is one part of the puzzle. The next bottleneck is memory. With #fastmemory, you can achieve 100% accuracy in RAG. All benchmarks prove that. (May 3 · 1 like)