Open Source · Self-Hosted · MIT License

PDF parsers give you text. PaperFlow gives you knowledge.

Open-source PDF->Markdown engine with a local Web UI. Choose your parser - run free locally or connect a cloud API. Get structured Markdown with footnotes, LaTeX, figure links, and metadata.

Run it on your machine, your server, or your own document stack.

Core idea

raw parser output
  -> normalize
  -> enrich
  -> structured markdown

Use it with

PyMuPDF for fast digital PDFs, PaddleOCR-VL for local AI parsing, Marker API for easiest premium quality, or self-hosted Marker for private infrastructure.

Local Web UI - runs on your machine

Use the quick decision guide, pick the parser that fits your document and privacy needs, then download structured Markdown. Your files stay local unless you explicitly choose the cloud API.

Local-first workflow
+---------------------------------------------+
|  PaperFlow Local UI                         |
|                                             |
|  Quick decision guide                       |
|  Standard digital PDF? -> PyMuPDF           |
|  Scan / formulas / OCR? -> PaddleOCR-VL     |
|  Easiest premium path? -> Marker API        |
|  Private enterprise setup? -> Self-hosted   |
|                                             |
|  1. Choose parser                           |
|     (*) PyMuPDF Local                       |
|     ( ) PaddleOCR-VL-0.9B                   |
|     ( ) Marker API (Datalab.to)             |
|     ( ) Enterprise Marker Self-Hosted       |
|                                             |
|  2. Drop PDF here                           |
|     +-------------------------------+       |
|     |  paper.pdf  /  2.4 MB         |       |
|     +-------------------------------+       |
|                                             |
|  Setup, tradeoffs, and exact commands       |
|  appear for the selected parser             |
|                                             |
|  [Convert]  [Batch Process]                 |
|                                             |
|  3. Download                                |
|      paper.md     paperflow.zip             |
+---------------------------------------------+
Local by default - files leave your machine only if you choose the cloud API
Batch processing - convert multiple PDFs at once
Scenario-driven parser chooser with setup instructions

Before / After

The parser is not the moat. Post-processing is where raw text becomes something you can actually use.

Raw parser output
[1] dead text -- can't click
\[ E = mc^2 \] wrong delimiters
[Fig. 3] plain string
No metadata

References
[1] Vaswani et al. Attention Is All You Need.
After PaperFlow
[^1] hover to preview reference
$$ E = mc^2 $$ renders everywhere
[[#^fig-3|Fig. 3]] click to jump
---
title: "Attention Is All You Need"
authors:
  - "Vaswani"
  - "Shazeer"
---

Works with any Markdown editor: Obsidian, Notion, Logseq, VS Code, or your RAG pipeline.
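The before/after pairs above can be approximated with a few regular expressions. This is a minimal sketch, not PaperFlow's actual implementation: the function names are hypothetical, and it assumes citations look like `[1]`, figure mentions look like `[Fig. 3]`, and reference-list entries start a line with `[n]`.

```python
import re


def unify_display_math(text: str) -> str:
    r"""Rewrite \[ ... \] display math as $$ ... $$."""
    return re.sub(r"\\\[\s*(.+?)\s*\\\]", r"$$ \1 $$", text, flags=re.DOTALL)


def citations_to_footnotes(text: str) -> str:
    """Turn numeric citations into Markdown footnotes."""
    # Reference-list entries at the start of a line become footnote definitions.
    text = re.sub(r"^\[(\d+)\]\s+", r"[^\1]: ", text, flags=re.MULTILINE)
    # Every remaining [n] marker becomes a footnote reference.
    return re.sub(r"\[(\d+)\]", r"[^\1]", text)


def link_figures(text: str) -> str:
    """Turn plain figure mentions into wiki-style block links."""
    return re.sub(r"\[Fig\. (\d+)\]", r"[[#^fig-\1|Fig. \1]]", text)


raw = (
    "Energy \\[ E = mc^2 \\] follows [1], see [Fig. 3].\n"
    "\n"
    "References\n"
    "[1] Vaswani et al. Attention Is All You Need."
)
cleaned = link_figures(citations_to_footnotes(unify_display_math(raw)))
print(cleaned)
```

A sentence that happens to begin with `[1]` would be misread as a reference entry, so a real pipeline would need to locate the reference section first.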

[Comparison images: source PDF, generic converter output, PaperFlow output]

Get started in 60 seconds

Run the local Web UI, then choose the parser that matches your document type, privacy needs, and setup tolerance.

Local Web UI first
Quick Start
git clone https://github.com/TylerMorrison21/paperflow
cd paperflow
pip install -r requirements.txt
uvicorn api.main:app --port 8000
# Open http://localhost:8000

That's it. The Web UI checks which parsers are actually ready, then shows setup steps and exact commands for each option.

Install
pip install paperflow-postprocess
Python
from paperflow_postprocess import enhance

# Read the raw Markdown your parser produced.
with open("parser_output.md", encoding="utf-8") as f:
    raw_md = f.read()

result = enhance(raw_md, images={}, metadata={"title": "My Paper"})
Install
npm install -g paperflow-mcp

Use PaperFlow in Claude Desktop with the MCP server package.

macOS / Linux
{
  "mcpServers": {
    "paperflow": {
      "command": "npx",
      "args": ["-y", "paperflow-mcp"]
    }
  }
}
Windows
{
  "mcpServers": {
    "paperflow": {
      "command": "cmd",
      "args": ["/c", "npx", "-y", "paperflow-mcp"]
    }
  }
}
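If you already have other MCP servers configured, the entry has to be merged into the existing `mcpServers` object rather than pasted over it. The sketch below shows one way to do that merge on a config that has already been loaded as a dict. The helper names are hypothetical, and finding and writing the Claude Desktop config file is left out on purpose.

```python
import json
import sys


def paperflow_server_entry(platform: str = sys.platform) -> dict:
    """Return the launcher entry shown above for the given platform."""
    if platform.startswith("win"):
        # Windows needs cmd /c to resolve the npx launcher.
        return {"command": "cmd", "args": ["/c", "npx", "-y", "paperflow-mcp"]}
    return {"command": "npx", "args": ["-y", "paperflow-mcp"]}


def add_paperflow(config: dict, platform: str = sys.platform) -> dict:
    """Return a copy of config with the paperflow server added; other servers are kept."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers["paperflow"] = paperflow_server_entry(platform)
    merged["mcpServers"] = servers
    return merged


existing = {"mcpServers": {"other": {"command": "node", "args": ["server.js"]}}}
print(json.dumps(add_paperflow(existing, platform="linux"), indent=2))
```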
Submit PDF
curl -X POST http://localhost:8000/api/submit \
  -F "file=@paper.pdf"

curl http://localhost:8000/api/jobs/<job_id>/result -o paper.md
curl http://localhost:8000/api/jobs/<job_id>/package -o paperflow.zip

Run the local API and send PDFs to it directly from scripts, curl, or your own document workflow.

  • Use `parser=pymupdf` for standard digital PDFs and the fastest local path.
  • Use `parser=paddleocr_vl` for scans, formulas, and more complex layouts.
  • Use `parser=marker_api` for the easiest premium-quality cloud path.
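From a script, the same calls can be prepared with the Python standard library alone. This is a sketch under assumptions: it guesses that `parser` is sent as a form field next to `file`, which the curl example above does not show, and `build_submit_request` and `result_urls` are hypothetical helpers, not part of PaperFlow.

```python
import uuid
from urllib.request import Request

BASE_URL = "http://localhost:8000"


def build_submit_request(pdf_bytes: bytes, filename: str, parser: str = "pymupdf") -> Request:
    """Build a multipart/form-data POST for /api/submit.

    ASSUMPTION: the server reads `parser` as a form field alongside `file`.
    """
    boundary = uuid.uuid4().hex
    parts = [
        f'--{boundary}\r\nContent-Disposition: form-data; name="parser"\r\n\r\n{parser}\r\n'.encode(),
        (
            f'--{boundary}\r\nContent-Disposition: form-data; name="file"; filename="{filename}"\r\n'
            "Content-Type: application/pdf\r\n\r\n"
        ).encode(),
        pdf_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ]
    return Request(
        f"{BASE_URL}/api/submit",
        data=b"".join(parts),
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def result_urls(job_id: str) -> dict:
    """Return the download URLs used in the curl example for a given job id."""
    return {
        "markdown": f"{BASE_URL}/api/jobs/{job_id}/result",
        "package": f"{BASE_URL}/api/jobs/{job_id}/package",
    }


request = build_submit_request(b"%PDF-1.4", "paper.pdf", parser="paddleocr_vl")
print(request.get_method(), request.full_url)
```

With the local server running, `urllib.request.urlopen(request)` sends the upload. The shape of the submit response is not documented here, so take the job id from whatever the server actually returns before calling `result_urls`.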

Parser Options

Choose the parser by workflow, not by raw specs. PaperFlow applies the same post-processing after extraction.

Quick decision guide
| Parser | Use it when | Positioning |
|---|---|---|
| PyMuPDF Local | You have a standard digital PDF and want the fastest free local path | Default choice for contracts, reports, invoices, and high-volume text-layer PDFs |
| PaddleOCR-VL-0.9B | You need local AI for scans, equations, tables, or complex academic layouts | Best local option when quality matters and cloud upload is off the table |
| Marker API (Datalab.to) | You want the easiest premium-quality setup and cloud processing is acceptable | Fastest route to premium extraction with your own API key |
| Enterprise Marker Self-Hosted | You need private infrastructure, compliance, and enterprise control | Top-tier quality on your own servers or private cloud |
| Docling | You want to experiment with another upstream parser | Supported as an external parser input into the PaperFlow post-processing pipeline |
| LlamaParse | You already use it and want to normalize downstream Markdown | Useful as an upstream source, but output shape may vary |
| Others | You have a parser that can already emit Markdown | PaperFlow is parser-agnostic on the post-processing side |
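For batch scripts, the first four rows of the guide can be written as a small function. This is only an illustration of the decision order: the ids `pymupdf`, `paddleocr_vl`, and `marker_api` are the parser values documented for the local API, while `marker_self_hosted` is a made-up placeholder for the self-hosted option.

```python
def choose_parser(
    *,
    scanned_or_complex: bool = False,
    prefer_cloud_premium: bool = False,
    private_infra: bool = False,
) -> str:
    """Map document and privacy needs to a parser id, following the guide above."""
    if private_infra:
        # Hypothetical id: the source does not name the self-hosted parser value.
        return "marker_self_hosted"
    if prefer_cloud_premium:
        return "marker_api"
    if scanned_or_complex:
        return "paddleocr_vl"
    # Standard digital PDF: fastest free local path.
    return "pymupdf"


print(choose_parser())
print(choose_parser(scanned_or_complex=True))
```

Privacy takes priority here: a request for private infrastructure wins over the cloud option even when both flags are set.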

Parsing is commoditized. Post-processing is the product.

The Pipeline

Four steps from parser output to workflow-ready Markdown.

Extract -> Normalize -> Enrich -> Output
Step 1: Extract
Your chosen parser outputs raw Markdown.

Step 2: Normalize
LaTeX delimiters unified, broken formatting cleaned, repeated noise stripped.

Step 3: Enrich
Citations become footnotes, figures become links, YAML metadata gets injected.

Step 4: Output
Structured Markdown ready for editors, note systems, search indexes, and agents.
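One way to picture the pipeline is as a list of text-to-text stages applied in order. The sketch below is an assumption about structure, not PaperFlow's code: it includes one Normalize-style stage (dropping lines that repeat like running page headers) and one Enrich-style stage (prepending YAML front matter), and every name in it is hypothetical.

```python
import json
from collections import Counter
from typing import Callable, Iterable

Stage = Callable[[str], str]


def strip_repeated_lines(text: str, min_repeats: int = 3) -> str:
    """Drop non-empty lines that occur min_repeats times or more (a crude header filter)."""
    lines = text.split("\n")
    counts = Counter(line.strip() for line in lines if line.strip())
    return "\n".join(line for line in lines if counts[line.strip()] < min_repeats)


def with_front_matter(metadata: dict) -> Stage:
    """Build a stage that prepends a YAML front matter block."""
    def stage(text: str) -> str:
        out = ["---"]
        for key, value in metadata.items():
            if isinstance(value, list):
                out.append(f"{key}:")
                # json.dumps yields a double-quoted scalar that YAML also accepts.
                out.extend(f"  - {json.dumps(item, ensure_ascii=False)}" for item in value)
            else:
                out.append(f"{key}: {json.dumps(value, ensure_ascii=False)}")
        out.append("---")
        return "\n".join(out) + "\n\n" + text
    return stage


def run_pipeline(text: str, stages: Iterable[Stage]) -> str:
    """Apply each stage to the output of the previous one."""
    for stage in stages:
        text = stage(text)
    return text


raw = "Journal of X\nIntro text\nJournal of X\nMore text\nJournal of X"
meta = {"title": "Attention Is All You Need", "authors": ["Vaswani", "Shazeer"]}
output = run_pipeline(raw, [strip_repeated_lines, with_front_matter(meta)])
print(output)
```

Keeping each stage a plain function makes it easy to reorder or skip steps per parser. The repeated-line filter is a heuristic and would also remove legitimate lines that recur often, so a real normalizer needs stronger signals.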

Need PaperFlow in your organization?

Deploy it privately, use a hosted API, or wire it into the systems your team already works in.

Enterprise options

Private Deployment

Docker image on your servers. Air-gapped. Your data never touches the internet.

Commercial API

Hosted endpoint with SLA. Pay per page.

Custom Integration

Connect to iManage, SharePoint, NetDocuments, or your document management system.

PaperFlow is already used by researchers and developers in 40+ countries. MIT licensed for all use cases.

Stats & Social Proof

PaperFlow already sees real use among researchers and developers, and in document-heavy workflows.

Momentum exists
77K

Reddit views

Organic attention from researchers and builders.

810

Upvotes

Strong validation for the problem and the demo.

10,000+

Pages processed

Enough real-world PDF messiness to shape the pipeline.

40+

Countries

Used in developer, research, and document-heavy workflows worldwide.