Open Source · Self-Hosted · MIT License

PDF parsers give you text. PaperFlow gives you knowledge.

Open-source PDF-to-Markdown engine with a local Web UI. Choose your parser: run one locally for free or connect a cloud API. Get structured Markdown with footnotes, LaTeX, figure links, and metadata.


Run it on your machine, your server, or your own document stack.

Core idea

raw parser output
  -> normalize
  -> enrich
  -> structured markdown

Use it with

PyMuPDF for fast digital PDFs, PaddleOCR-VL for local AI parsing, Marker API for the easiest path to premium quality, or self-hosted Marker for private infrastructure.

Local Web UI - runs on your machine

Use the quick decision guide, pick the parser that fits your document and privacy needs, then download structured Markdown. Your files stay local unless you explicitly choose the cloud API.

Local-first workflow

+---------------------------------------------+
|  PaperFlow Local UI                         |
|                                             |
|  Quick decision guide                       |
|  Standard digital PDF? -> PyMuPDF           |
|  Scan / formulas / OCR? -> PaddleOCR-VL     |
|  Easiest premium path? -> Marker API        |
|  Private enterprise setup? -> Self-hosted   |
|                                             |
|  1. Choose parser                           |
|     (*) PyMuPDF Local                       |
|     ( ) PaddleOCR-VL-0.9B                   |
|     ( ) Marker API (Datalab.to)             |
|     ( ) Enterprise Marker Self-Hosted       |
|                                             |
|  2. Drop PDF here                           |
|     +-------------------------------+       |
|     |  paper.pdf  /  2.4 MB         |       |
|     +-------------------------------+       |
|                                             |
|  Setup, tradeoffs, and exact commands       |
|  appear for the selected parser             |
|                                             |
|  [Convert]  [Batch Process]                 |
|                                             |
|  3. Download                                |
|      paper.md     paperflow.zip             |
+---------------------------------------------+

Local by default - files leave your machine only if you choose the cloud API

Batch processing - convert multiple PDFs at once

Scenario-driven parser chooser with setup instructions

Before / After

The parser is not the moat. Post-processing is where raw text becomes something you can actually use.


Raw parser output

[1] dead text -- can't click
\[ E = mc^2 \] wrong delimiters
[Fig. 3] plain string
No metadata

References
[1] Vaswani et al. Attention Is All You Need.

After PaperFlow

[^1] hover to preview reference
$$ E = mc^2 $$ renders everywhere
[[#^fig-3|Fig. 3]] click to jump
---
title: "Attention Is All You Need"
authors:
  - "Vaswani"
  - "Shazeer"
---

Works with any Markdown editor: Obsidian, Notion, Logseq, VS Code, or your RAG pipeline.
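To make the metadata part of the After sample concrete, here is a minimal Python sketch of building and prepending a YAML front-matter block. The function names `build_front_matter` and `inject_front_matter` are hypothetical and are not PaperFlow's API; the sketch only reproduces the shape shown above (a quoted `title` and an `authors` list).

```python
import json


def build_front_matter(title, authors):
    """Return a YAML front-matter block with a title and an authors list."""
    # json.dumps emits double-quoted, escaped strings, which are also
    # valid YAML double-quoted scalars.
    lines = ["---", f"title: {json.dumps(title, ensure_ascii=False)}"]
    if authors:
        lines.append("authors:")
        lines += [f"  - {json.dumps(name, ensure_ascii=False)}" for name in authors]
    lines.append("---")
    return "\n".join(lines) + "\n"


def inject_front_matter(markdown, title, authors):
    """Prepend front matter unless the document already starts with a block."""
    if markdown.startswith("---\n"):
        return markdown
    return build_front_matter(title, authors) + "\n" + markdown


print(build_front_matter("Attention Is All You Need", ["Vaswani", "Shazeer"]))
```

Quoting every scalar keeps titles that contain colons or hash characters from being misread by YAML parsers.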

[Comparison figure: source PDF, generic converter output, PaperFlow output]

Get started in 60 seconds

Run the local Web UI, then choose the parser that matches your document type, privacy needs, and setup tolerance.

Local Web UI first

Quick Start

git clone https://github.com/TylerMorrison21/paperflow
cd paperflow
pip install -r requirements.txt
uvicorn api.main:app --port 8000
# Open http://localhost:8000

That's it. The Web UI checks which parsers are actually ready, then shows setup steps and exact commands for each option.

Install

pip install paperflow-postprocess

Python

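from pathlib import Path

from paperflow_postprocess import enhance

# Read the raw Markdown emitted by your parser (UTF-8, file closed automatically).
raw_md = Path("parser_output.md").read_text(encoding="utf-8")
result = enhance(raw_md, images={}, metadata={"title": "My Paper"})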

Install

npm install -g paperflow-mcp

Use PaperFlow in Claude Desktop with the MCP server package.

macOS / Linux

{
  "mcpServers": {
    "paperflow": {
      "command": "npx",
      "args": ["-y", "paperflow-mcp"]
    }
  }
}

Windows

{
  "mcpServers": {
    "paperflow": {
      "command": "cmd",
      "args": ["/c", "npx", "-y", "paperflow-mcp"]
    }
  }
}
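If you already have other MCP servers configured, you may prefer to merge the entry rather than paste over the file. The sketch below is one way to do that in Python. The helper name `add_paperflow_server` is hypothetical, and the config path is an assumption you must supply yourself (Claude Desktop's file is commonly named `claude_desktop_config.json`, but check your install).

```python
import json
from pathlib import Path


def add_paperflow_server(config_path, windows=False):
    """Merge the paperflow MCP entry into a JSON config file and return the result."""
    path = Path(config_path)
    text = path.read_text(encoding="utf-8") if path.exists() else ""
    config = json.loads(text) if text.strip() else {}

    # Keep any servers that are already configured.
    servers = config.setdefault("mcpServers", {})
    if windows:
        # Same entry as the Windows config above: npx is launched through cmd.
        servers["paperflow"] = {"command": "cmd", "args": ["/c", "npx", "-y", "paperflow-mcp"]}
    else:
        servers["paperflow"] = {"command": "npx", "args": ["-y", "paperflow-mcp"]}

    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

Call it as `add_paperflow_server("path/to/claude_desktop_config.json")`, or pass `windows=True` on Windows, then restart Claude Desktop so it rereads the file.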

Submit PDF

curl -X POST http://localhost:8000/api/submit \
  -F "file=@paper.pdf"

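# Replace <job_id> with the ID of your submitted job. The quotes keep the shell
# from treating < and > as redirections.
curl "http://localhost:8000/api/jobs/<job_id>/result" -o paper.md
curl "http://localhost:8000/api/jobs/<job_id>/package" -o paperflow.zip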

Run the local API and send PDFs to it directly from scripts, curl, or your own document workflow.

Use `parser=pymupdf` for standard digital PDFs and the fastest local path.
Use `parser=paddleocr_vl` for scans, formulas, and more complex layouts.
Use `parser=marker_api` for the easiest premium-quality cloud path.
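For scripted use, the calls above can be assembled programmatically. The following Python sketch builds the curl argument lists without running them. It assumes, and this is not confirmed by the text above, that `parser` is sent as an extra multipart form field next to `file`; the helper names are hypothetical.

```python
import shlex

BASE_URL = "http://localhost:8000"
KNOWN_PARSERS = ("pymupdf", "paddleocr_vl", "marker_api")


def submit_command(pdf_path, parser="pymupdf", base_url=BASE_URL):
    """Build the curl argv that submits one PDF with the chosen parser."""
    if parser not in KNOWN_PARSERS:
        raise ValueError(f"unknown parser: {parser!r}")
    return [
        "curl", "-X", "POST", f"{base_url}/api/submit",
        "-F", f"file=@{pdf_path}",
        # ASSUMPTION: the parser is passed as a form field named "parser".
        "-F", f"parser={parser}",
    ]


def download_commands(job_id, base_url=BASE_URL):
    """Build the curl argvs that fetch the Markdown result and the zip package."""
    job_url = f"{base_url}/api/jobs/{job_id}"
    return [
        ["curl", f"{job_url}/result", "-o", "paper.md"],
        ["curl", f"{job_url}/package", "-o", "paperflow.zip"],
    ]


# Print a shell-safe command line for inspection before running it.
print(shlex.join(submit_command("paper.pdf", parser="paddleocr_vl")))
```

Each list can be passed to `subprocess.run(...)` once you have confirmed how your PaperFlow version expects the parser to be specified.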

Parser Options

Choose the parser by workflow, not by raw specs. PaperFlow applies the same post-processing after extraction.

Quick decision guide

Parser	Use it when	Positioning
PyMuPDF Local	You have a standard digital PDF and want the fastest free local path	Default choice for contracts, reports, invoices, and high-volume text-layer PDFs
PaddleOCR-VL-0.9B	You need local AI for scans, equations, tables, or complex academic layouts	Best local option when quality matters and cloud upload is off the table
Marker API (Datalab.to)	You want the easiest premium-quality setup and cloud processing is acceptable	Fastest route to premium extraction with your own API key
Enterprise Marker Self-Hosted	You need private infrastructure, compliance, and enterprise control	Top-tier quality on your own servers or private cloud
Docling	You want to experiment with another upstream parser	Supported as an external parser input into the PaperFlow post-processing pipeline
LlamaParse	You already use it and want to normalize downstream markdown	Useful as an upstream source, but output shape may vary
Others	You have a parser that can already emit markdown	PaperFlow is parser-agnostic on the post-processing side
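The first four rows of the guide can be read as a small decision function. The Python sketch below is one possible encoding; the parameter names and the order in which the conditions are checked are assumptions made for illustration, not logic taken from PaperFlow itself.

```python
def choose_parser(scanned_or_complex=False, wants_premium=False,
                  cloud_ok=False, needs_private_infra=False):
    """Map document and privacy needs to a parser name from the table above."""
    # Compliance and private infrastructure take priority over everything else.
    if needs_private_infra:
        return "Enterprise Marker Self-Hosted"
    # The cloud path is only offered when uploading is explicitly acceptable.
    if wants_premium and cloud_ok:
        return "Marker API (Datalab.to)"
    # Scans, equations, tables, or complex layouts stay local with an AI parser.
    if scanned_or_complex:
        return "PaddleOCR-VL-0.9B"
    # Standard digital PDFs take the fastest free local path.
    return "PyMuPDF Local"


print(choose_parser(scanned_or_complex=True))
```

Checking the privacy constraints first means a document is never routed to the cloud by accident, which matches the local-first default described earlier.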

Parsing is commoditized. Post-processing is the product.

The Pipeline

Four steps from parser output to workflow-ready Markdown.

Extract -> Normalize -> Enrich -> Output

Step 1: Extract

Your chosen parser outputs raw Markdown.

Step 2: Normalize

LaTeX delimiters unified, broken formatting cleaned, repeated noise stripped.
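As an illustration of delimiter unification, the Python sketch below rewrites `\[ ... \]` as `$$ ... $$` and `\( ... \)` as `$...$`. It is a naive regex approach written for this page, not PaperFlow's implementation: it does not skip code blocks and can be confused by LaTeX line breaks such as `\\[2pt]`.

```python
import re

# \[ ... \] is display math; \( ... \) is inline math.
_DISPLAY = re.compile(r"\\\[\s*(.+?)\s*\\\]", re.DOTALL)
_INLINE = re.compile(r"\\\(\s*(.+?)\s*\\\)", re.DOTALL)


def normalize_math(markdown):
    """Convert bracket-style LaTeX delimiters to dollar-style delimiters."""
    # Callables are used as replacements so that backslashes inside the
    # formula are copied verbatim instead of being read as escapes.
    markdown = _DISPLAY.sub(lambda m: f"$$ {m.group(1)} $$", markdown)
    markdown = _INLINE.sub(lambda m: f"${m.group(1)}$", markdown)
    return markdown


print(normalize_math(r"Energy: \[ E = mc^2 \] where \(m\) is mass."))
```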

Step 3: Enrich

Citations become footnotes, figures become links, YAML metadata gets injected.
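The citation and figure rewrites can be sketched with a few regular expressions. This Python example mirrors the Before / After sample (`[1]` to `[^1]`, `[Fig. 3]` to `[[#^fig-3|Fig. 3]]`); the function name is hypothetical, and the rule that a line starting with `[n]` is a reference-list entry is a simplifying assumption that real documents may violate.

```python
import re

# A line that starts with "[n] " is treated as a reference-list entry.
_REF_DEF = re.compile(r"^\[(\d+)\][ \t]+(.+)$", re.MULTILINE)
# Any other "[n]" is an in-text citation, unless it is a link like [1](url).
_CITE = re.compile(r"\[(\d+)\](?!\()")
_FIG = re.compile(r"\[Fig\.\s*(\d+)\]")


def linkify_citations_and_figures(markdown):
    """Turn numeric citations into footnotes and figure mentions into links."""
    # Definitions are rewritten first, so the citation pass cannot touch them.
    markdown = _REF_DEF.sub(r"[^\1]: \2", markdown)
    markdown = _CITE.sub(r"[^\1]", markdown)
    markdown = _FIG.sub(
        lambda m: f"[[#^fig-{m.group(1)}|Fig. {m.group(1)}]]", markdown
    )
    return markdown


sample = (
    "Transformers [1] scale well, see [Fig. 3].\n\n"
    "References\n"
    "[1] Vaswani et al. Attention Is All You Need."
)
print(linkify_citations_and_figures(sample))
```

Running the definition pass before the citation pass matters: once an entry has become `[^1]: ...`, it no longer matches the plain `[1]` pattern.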

Step 4: Output

Structured Markdown ready for editors, note systems, search indexes, and agents.

Need PaperFlow in your organization?

Deploy it privately, use a hosted API, or wire it into the systems your team already works in.

Enterprise options

Private Deployment

Docker image on your servers. Air-gapped. Your data never touches the internet.

Commercial API

Hosted endpoint with SLA. Pay per page.

Custom Integration

Connect to iManage, SharePoint, NetDocuments, or your document management system.

PaperFlow is already used by researchers and developers in 40+ countries. MIT licensed for all use cases.

Stats & Social Proof

PaperFlow is already in real use among researchers, developers, and teams with document-heavy workflows.

Momentum exists

77K Reddit views

Organic attention from researchers and builders.

810 upvotes

Strong validation for the problem and the demo.

10,000+ pages processed

Enough real-world PDF messiness to shape the pipeline.

40+ countries

Used by developers, researchers, and teams with document-heavy workflows worldwide.