paper2epub

An academic PDF-to-EPUB converter with AI-powered layout detection and LaTeX math support.

Features

Academic-First Design: Optimized for scientific papers, research documents, and technical publications
LaTeX Math Support: Preserves mathematical equations using Nougat's neural OCR
Complex Layout Handling: AI-powered detection of multi-column layouts, tables, and figures
GPU Acceleration: Optional CUDA/MPS (Apple Silicon) support for faster processing
Figure Extraction: Automatic extraction and embedding of figures using PyMuPDF
Multiple Output Formats: EPUB3 with optional intermediate Markdown
Easy to Use: Both CLI and Python API available

Installation

Basic Installation

pip install paper2epub

From Source

git clone https://github.com/MAXNORM8650/paper2epub.git
cd paper2epub
pip install -e .

Development Installation

pip install -e ".[dev]"

Requirements

Python 3.9+
PyTorch 2.0+
For GPU acceleration:
- NVIDIA GPU: CUDA-enabled PyTorch
- Apple Silicon (M1/M2/M3): MPS-enabled PyTorch (included by default)
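
To see which of these backends your PyTorch install can use, a small probe like the one below may help. It is only a sketch: the priority order (CUDA, then MPS, then CPU) is an assumption about what a sensible "auto" choice looks like, not a description of paper2epub's internals, and `pick_device` / `detect_device` are hypothetical helper names.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Return a device name, preferring CUDA, then MPS, then CPU (assumed order)."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Ask PyTorch which accelerators are usable; report "cpu" if torch is missing."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # torch.backends.mps is absent in old PyTorch builds, so look it up defensively.
    mps_backend = getattr(torch.backends, "mps", None)
    mps_ok = bool(mps_backend is not None and mps_backend.is_available())
    return pick_device(torch.cuda.is_available(), mps_ok)


if __name__ == "__main__":
    print(f"Detected device: {detect_device()}")
```

The printed name can be passed to the `-d` option or the `device` argument shown in Quick Start.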

Quick Start

Command Line

# Basic conversion
paper2epub paper.pdf

# Specify output and metadata
paper2epub paper.pdf -o output.epub -t "My Paper" -a "John Doe"

# Use larger model with GPU
paper2epub paper.pdf -m base -d cuda

# Save intermediate markdown
paper2epub paper.pdf --save-markdown

# Skip figure extraction
paper2epub paper.pdf --no-figures

# Set minimum figure size (filter small images)
paper2epub paper.pdf --figure-min-size 150

Python API

from paper2epub import Paper2EpubConverter

# Initialize converter
converter = Paper2EpubConverter(
    model_tag="0.1.0-small",  # or "0.1.0-base" for better quality
    device="auto",             # auto-detect GPU/CPU
    extract_figures=True,      # enable figure extraction
    figure_min_size=100,       # minimum figure size in pixels
)

# Convert PDF to EPUB
output_path = converter.convert(
    pdf_path="paper.pdf",
    title="My Academic Paper",
    author="John Doe",
    save_markdown=True,        # optionally save .md file
)

print(f"Created: {output_path}")
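
After a conversion you may want a quick sanity check on the result. An EPUB is a ZIP archive containing a `mimetype` entry with the value `application/epub+zip`, so the standard library is enough for a basic test. The helper name `looks_like_epub` is made up for this sketch, and the check says nothing about whether the content rendered well.

```python
import zipfile
from pathlib import Path


def looks_like_epub(path) -> bool:
    """Return True if path is a ZIP archive whose mimetype entry is the EPUB media type."""
    path = Path(path)
    if not path.is_file() or not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as archive:
        try:
            mimetype = archive.read("mimetype")
        except KeyError:
            # A ZIP without a mimetype entry is not a valid EPUB container.
            return False
    return mimetype.strip() == b"application/epub+zip"
```

For example, `looks_like_epub(output_path)` could be called right after `converter.convert(...)` in the snippet above.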

CLI Options

Usage: paper2epub [OPTIONS] PDF_PATH

Options:
  -o, --output PATH          Output EPUB file path
  -t, --title TEXT           Book title
  -a, --author TEXT          Author name
  -l, --language TEXT        Language code (default: en)
  -m, --model [small|base]   Nougat model size (default: small)
  -d, --device [auto|cuda|mps|cpu]  Device to use
  -b, --batch-size INT       Batch size for processing
  --save-markdown            Save intermediate markdown file
  --no-figures               Skip figure extraction from PDF
  --figure-min-size INT      Minimum figure size in pixels (default: 100)
  -v, --verbose              Enable verbose logging
  --version                  Show version
  --help                     Show this message and exit
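
If you drive the CLI from a script, building the argument list in one place keeps the options consistent. The sketch below uses only flags from the listing above; `build_command` and `run` are hypothetical helpers, and the allowed model and device values are copied from the option choices.

```python
import subprocess
from typing import List, Optional

MODELS = ("small", "base")
DEVICES = ("auto", "cuda", "mps", "cpu")


def build_command(
    pdf_path: str,
    output: Optional[str] = None,
    title: Optional[str] = None,
    author: Optional[str] = None,
    model: Optional[str] = None,
    device: Optional[str] = None,
    batch_size: Optional[int] = None,
    save_markdown: bool = False,
    no_figures: bool = False,
) -> List[str]:
    """Translate keyword arguments into a paper2epub argument list."""
    if model is not None and model not in MODELS:
        raise ValueError(f"model must be one of {MODELS}")
    if device is not None and device not in DEVICES:
        raise ValueError(f"device must be one of {DEVICES}")
    cmd = ["paper2epub", pdf_path]
    # Only emit an option when the caller set it, so CLI defaults still apply.
    for flag, value in (
        ("--output", output),
        ("--title", title),
        ("--author", author),
        ("--model", model),
        ("--device", device),
    ):
        if value is not None:
            cmd += [flag, value]
    if batch_size is not None:
        cmd += ["--batch-size", str(batch_size)]
    if save_markdown:
        cmd.append("--save-markdown")
    if no_figures:
        cmd.append("--no-figures")
    return cmd


def run(cmd: List[str]) -> None:
    """Run the command and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(cmd, check=True)
```

For example, `run(build_command("paper.pdf", model="base", device="cuda"))` is the scripted form of `paper2epub paper.pdf -m base -d cuda`.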

How It Works

paper2epub uses a multi-stage pipeline:

1. PDF Extraction: Nougat (Meta's neural OCR) extracts text, tables, and LaTeX equations
2. Figure Extraction: PyMuPDF extracts embedded images from the PDF
3. Markdown Generation: Content is converted to Markdown with preserved structure
4. EPUB Creation: Markdown and images are transformed into EPUB3 with MathML/MathJax support
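
To make the figure-extraction stage more concrete, here is a minimal sketch of pulling embedded images out of a PDF with PyMuPDF. It is not the project's actual code: `keep_image` and `extract_figures` are hypothetical names, the file-naming scheme is invented, and the rule that both width and height must reach the minimum size is an assumption about how a minimum figure size could be applied.

```python
from pathlib import Path
from typing import List


def keep_image(width: int, height: int, min_size: int = 100) -> bool:
    """Assumed rule: keep an image only if both sides are at least min_size pixels."""
    return width >= min_size and height >= min_size


def extract_figures(pdf_path: str, out_dir: str, min_size: int = 100) -> List[Path]:
    """Save embedded images that pass keep_image() and return the written paths."""
    import fitz  # PyMuPDF; imported lazily so keep_image() works without it

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved: List[Path] = []
    seen = set()
    with fitz.open(pdf_path) as doc:
        for page_index, page in enumerate(doc):
            for image in page.get_images(full=True):
                xref = image[0]  # first field is the image's cross-reference number
                if xref in seen:
                    continue  # the same image can be referenced from several pages
                seen.add(xref)
                info = doc.extract_image(xref)
                if not keep_image(info["width"], info["height"], min_size):
                    continue
                target = out / f"page{page_index + 1}_img{xref}.{info['ext']}"
                target.write_bytes(info["image"])
                saved.append(target)
    return saved
```

Small icons and rule lines tend to fall below the threshold, which is the kind of noise a minimum figure size is meant to filter out.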

Why Nougat?

Nougat (Neural Optical Understanding for Academic Documents) is Meta's state-of-the-art model specifically designed for academic papers. It excels at:

Recognizing complex mathematical notation
Handling multi-column layouts
Preserving table structures
Extracting figures and captions

Model Sizes

Model   Size     Speed     Quality  Use Case
small   ~350MB   Fast      Good     Quick conversions, testing
base    ~1.2GB   Moderate  Better   Production use, complex papers
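
The CLI takes the short names `small` and `base`, while the Python API example uses the tags `"0.1.0-small"` and `"0.1.0-base"`. Assuming the short names correspond to those tags one-to-one (a reasonable guess that this README does not state outright), a script that accepts the short form could translate it like this; `resolve_model_tag` is a hypothetical helper.

```python
# Assumed mapping from CLI model names to the model_tag values used by the Python API.
MODEL_TAGS = {
    "small": "0.1.0-small",
    "base": "0.1.0-base",
}


def resolve_model_tag(name: str) -> str:
    """Return the model tag for a short model name, or raise ValueError if unknown."""
    try:
        return MODEL_TAGS[name]
    except KeyError:
        raise ValueError(f"unknown model {name!r}; expected one of {sorted(MODEL_TAGS)}")
```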

Performance

CPU: 1-3 pages/minute (small model)
GPU (CUDA): 10-20 pages/minute
Apple Silicon (MPS): 5-15 pages/minute
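
These rates translate directly into a rough time budget: minutes are pages divided by pages per minute. The sketch below just does that arithmetic with the ranges quoted above. Note that only the CPU figure is stated for the small model, so treat the result as a ballpark; `estimate_minutes` is a hypothetical helper.

```python
from typing import Tuple

# (slowest, fastest) pages per minute, copied from the ranges listed above.
PAGES_PER_MINUTE = {
    "cpu": (1, 3),
    "cuda": (10, 20),
    "mps": (5, 15),
}


def estimate_minutes(pages: int, device: str) -> Tuple[float, float]:
    """Return (best_case, worst_case) conversion time in minutes for a page count."""
    slowest, fastest = PAGES_PER_MINUTE[device]
    return pages / fastest, pages / slowest


if __name__ == "__main__":
    for device in PAGES_PER_MINUTE:
        best, worst = estimate_minutes(30, device)
        print(f"{device}: {best:.1f} to {worst:.1f} minutes for 30 pages")
```

For a 30-page paper this gives 10 to 30 minutes on CPU, 1.5 to 3 minutes with CUDA, and 2 to 6 minutes with MPS.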

Examples

Convert Multiple PDFs

for pdf in *.pdf; do
    paper2epub "$pdf" -a "Author Name"
done

Batch Processing in Python

from pathlib import Path
from paper2epub import Paper2EpubConverter

converter = Paper2EpubConverter()

pdf_dir = Path("papers")
for pdf_file in pdf_dir.glob("*.pdf"):
    print(f"Converting {pdf_file.name}...")
    converter.convert(pdf_file)
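
In a long batch, one broken PDF should not stop the rest. A variant of the loop above that records failures instead of aborting might look like this. `convert_all` is a hypothetical helper; it only assumes the converter object has the `convert(path)` method shown earlier, and it catches every exception because this README does not document which ones the converter raises.

```python
from pathlib import Path
from typing import List, Tuple


def convert_all(converter, pdf_dir) -> Tuple[List[Path], List[Tuple[Path, str]]]:
    """Convert every *.pdf in pdf_dir; return (converted, failed_with_reason)."""
    converted: List[Path] = []
    failed: List[Tuple[Path, str]] = []
    # Sort for a stable order so reruns and logs are easy to compare.
    for pdf_file in sorted(Path(pdf_dir).glob("*.pdf")):
        try:
            converter.convert(pdf_file)
        except Exception as exc:  # broad on purpose: keep the batch going
            failed.append((pdf_file, str(exc)))
        else:
            converted.append(pdf_file)
    return converted, failed
```

Calling `convert_all(Paper2EpubConverter(), "papers")` would then return both lists, so the failed files can be retried, for instance with the base model.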

Limitations

Scanned PDFs may need the larger base model (-m base) for acceptable OCR quality
Very complex equations might need manual review
Image quality depends on source PDF resolution
EPUB readers vary in math rendering support (MathJax recommended)

Troubleshooting

Dependency Conflicts

Issue 1: albumentations

If you get an error about albumentations or ImageCompression:

# Install compatible version
pip install 'albumentations<1.4.0'

Issue 2: pypdfium2 (PdfDocument has no attribute 'render')

If you get an error about 'PdfDocument' object has no attribute 'render':

# Install compatible version
pip install 'pypdfium2>=4.0.0,<5.0.0'

Or reinstall with all fixes:

pip install --upgrade paper2epub
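
To check whether your environment already satisfies the two pins above, you can compare installed versions with the standard library. This is a sketch with hypothetical helper names; the version parsing is deliberately simple (it reads only the leading numeric parts and ignores pre-release suffixes), so treat its answer as a hint.

```python
import re
from importlib import metadata

# (inclusive lower bound, exclusive upper bound), taken from the pip commands above.
CONSTRAINTS = {
    "albumentations": (None, (1, 4, 0)),
    "pypdfium2": ((4, 0, 0), (5, 0, 0)),
}


def parse_version(text: str) -> tuple:
    """Turn '1.3.1' into (1, 3, 1); pad short versions such as '1.4' with zeros."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if not match:
            break
        parts.append(int(match.group()))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def in_range(version: str, lower, upper) -> bool:
    """Check lower <= version < upper, where either bound may be None."""
    parsed = parse_version(version)
    if lower is not None and parsed < lower:
        return False
    if upper is not None and parsed >= upper:
        return False
    return True


def check_installed() -> dict:
    """Report 'ok', 'not installed', or the offending version for each pinned package."""
    report = {}
    for name, (lower, upper) in CONSTRAINTS.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
            continue
        ok = in_range(installed, lower, upper)
        report[name] = "ok" if ok else f"{installed} is outside the pinned range"
    return report


if __name__ == "__main__":
    for package, status in check_installed().items():
        print(f"{package}: {status}")
```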

Out of Memory

# Reduce batch size
paper2epub paper.pdf -b 1

# Use CPU instead of GPU
paper2epub paper.pdf -d cpu
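
The two remedies can be combined into a fallback order: shrink the batch first, and move to the CPU only as a last resort. The sketch below only computes that order and the matching command lines. The halving strategy and the helper names are assumptions for illustration, and detecting an out-of-memory failure is left to the caller.

```python
from typing import List, Tuple


def fallback_plan(batch_size: int, device: str) -> List[Tuple[int, str]]:
    """Return (batch_size, device) settings to try, most aggressive first."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    attempts = []
    size = batch_size
    while True:
        attempts.append((size, device))
        if size == 1:
            break
        size //= 2  # halve the batch on each retry
    if device != "cpu":
        attempts.append((1, "cpu"))  # last resort: no GPU memory involved
    return attempts


def to_command(pdf_path: str, batch_size: int, device: str) -> List[str]:
    """Build the CLI call for one attempt using the -b and -d options."""
    return ["paper2epub", pdf_path, "-b", str(batch_size), "-d", device]
```

Starting from a batch size of 4 on CUDA, the plan is 4, 2, and 1 on the GPU, followed by a final attempt with batch size 1 on the CPU.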

Poor Quality Output

# Use larger model
paper2epub paper.pdf -m base

# Enable verbose logging to debug
paper2epub paper.pdf -v

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

MIT License - see LICENSE file for details.

Acknowledgments

Nougat by Meta AI Research
ebooklib for EPUB creation
PyMuPDF for PDF handling

Citation

If you use paper2epub in academic work, please cite:

@software{paper2epub,
  title = {paper2epub: Academic PDF to EPUB Converter},
  author = {Komal Kumar},
  year = {2026},
  url = {https://github.com/MAXNORM8650/paper2epub}
}

For Nougat:

@article{blecher2023nougat,
  title={Nougat: Neural Optical Understanding for Academic Documents},
  author={Blecher, Lukas and Cucurull, Guillem and Scialom, Thomas and Stojnic, Robert},
  journal={arXiv preprint arXiv:2308.13418},
  year={2023}
}

Support

Issues: GitHub Issues
Discussions: GitHub Discussions

Roadmap

GROBID integration for better metadata extraction
Support for more input formats (DOCX, LaTeX)
Batch processing UI
Cloud/API deployment option
Enhanced equation rendering options
Custom styling templates
