Hao Hoang on LinkedIn

philippschaefer shared this post · May 5

I just parsed 500 research papers in under 10 minutes. On CPU. No GPU. No cloud.

The tool: OpenDataLoader PDF, the open-source PDF parser that just hit #1 on every benchmark that matters for AI.

Here's what blew me away:
1️⃣ 0.907 overall accuracy (beats docling, marker, pymupdf4llm)
2️⃣ 0.928 table extraction, the hardest thing to get right in PDFs
3️⃣ 60+ pages/second in local mode, 100+ pages/second with batch processing on 8-core machines
4️⃣ Bounding boxes on every element, exact pixel coordinates for RAG citations
5️⃣ Built-in prompt injection filtering (most parsers don't even think about this)

And it's 3 lines of Python:

pip install opendataloader-pdf

import opendataloader_pdf

# Convert every PDF in folder/ and write Markdown and JSON to output/
opendataloader_pdf.convert(
    input_path=["folder/"],
    output_dir="output/",
    format="markdown,json"
)
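
If you want to feed the results into a RAG pipeline, here is a minimal sketch of the next step. It assumes the convert call above leaves one .md file per PDF somewhere under output/ (I have not verified the exact file naming). The helper load_markdown_chunks and the 1000-character chunk size are hypothetical and not part of opendataloader-pdf.

```python
from pathlib import Path


def load_markdown_chunks(output_dir, max_chars=1000):
    """Yield (file_name, chunk_text) pairs for every .md file under output_dir.

    Hypothetical helper, not part of opendataloader-pdf. Paragraphs are split
    on blank lines and packed greedily into chunks of at most max_chars
    characters; a single oversized paragraph becomes its own chunk.
    """
    for md_file in sorted(Path(output_dir).rglob("*.md")):
        text = md_file.read_text(encoding="utf-8")
        paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
        chunk = ""
        for para in paragraphs:
            candidate = f"{chunk}\n\n{para}" if chunk else para
            if chunk and len(candidate) > max_chars:
                # Current chunk is full: emit it and start a new one.
                yield md_file.name, chunk
                chunk = para
            else:
                chunk = candidate
        if chunk:
            yield md_file.name, chunk


if __name__ == "__main__":
    # Assumes output/ was produced by the convert call in the post.
    for name, chunk in load_markdown_chunks("output/"):
        print(name, len(chunk))
```

Each chunk keeps its source file name, so retrieved passages can be traced back to the paper they came from.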

For complex tables and scanned PDFs, hybrid mode routes pages to an AI backend, still running 100% locally.

Also: it's the first open-source tool to auto-generate Tagged PDFs for accessibility compliance (EAA deadline was June 2025, a lot of teams are scrambling).

19.9k stars. Apache 2.0. Worth 30 seconds of your time.

What PDF parser are you using for your RAG pipeline right now? 👇

Carmelo Juanes Rodríguez: 0.928 on tables is impressive, but the real win is CPU-only parsing at scale. That's the constraint we hit constantly in production workflows where latency matters as much as accuracy. (May 3 · 6 likes)

Prab Singh: Parsing is one part of the puzzle. The next bottleneck is memory. With #fastmemory, you can achieve 100% accuracy in RAG. All benchmarks prove that. (May 3 · 1 like)