paperTrans 📄➡️📝

paperTrans is an academic tool designed to convert scholarly papers from PDF format to Markdown, incorporating layout detection, optical character recognition (OCR), and translation capabilities.

Features 🚀

PDF to Markdown conversion
Automated layout detection for text, titles, figures, and tables
Image extraction and preservation
Text translation (English and Traditional Chinese)
Web-based user interface

Methodology 🔬

paperTrans employs a multi-step process to transform PDF documents:

PDF Processing 📸: Convert PDF pages to images using pdf2image.
Layout Detection 🔍:
- Utilize layoutparser with a pre-trained Detectron2 model.
- Model: PubLayNet/faster_rcnn_R_50_FPN_3x
- Configuration: lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config
- Element Classification: Text, Title, List, Table, Figure
Text Extraction 🔤:
- Apply OCR to detected text regions using pytesseract.
- Merge overlapping text blocks for coherence.
- Sort elements based on their spatial position within the page.
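The merge-and-sort step above can be sketched in plain Python. This is a minimal illustration, not the project's code: the `Box` type and the helpers `merge_overlapping` and `reading_order` are hypothetical names, and the top-to-bottom, left-to-right ordering assumes a single-column page.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    """Axis-aligned region in page pixels; (x1, y1) is the top-left corner."""
    x1: float
    y1: float
    x2: float
    y2: float


def overlaps(a: Box, b: Box) -> bool:
    """True when the two rectangles share a region of positive area."""
    return a.x1 < b.x2 and b.x1 < a.x2 and a.y1 < b.y2 and b.y1 < a.y2


def union(a: Box, b: Box) -> Box:
    """Smallest box that covers both inputs."""
    return Box(min(a.x1, b.x1), min(a.y1, b.y1),
               max(a.x2, b.x2), max(a.y2, b.y2))


def merge_overlapping(boxes: List[Box]) -> List[Box]:
    """Repeatedly fuse overlapping boxes until no pair overlaps."""
    merged = list(boxes)
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if overlaps(merged[i], merged[j]):
                    # Replace the pair with its union and rescan from the start,
                    # since the larger box may now touch other boxes.
                    merged[i] = union(merged[i], merged[j])
                    del merged[j]
                    changed = True
                    break
            if changed:
                break
    return merged


def reading_order(boxes: List[Box]) -> List[Box]:
    """Sort top-to-bottom, then left-to-right (single-column assumption)."""
    return sorted(boxes, key=lambda b: (b.y1, b.x1))
```

A two-column paper would need the columns separated before sorting, otherwise lines from both columns interleave.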
Figure and Table Handling 📊:
- Extract and save figures and tables as separate image files.
- Generate references to these images in the output Markdown.
Translation 🌐:
- Translate extracted text using pygtrans.
- Current language support: English and Traditional Chinese.
Markdown Generation ✍️:
- Construct a structured Markdown document using mdutils.
- Incorporate translated text, headers, and image references.
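As a rough illustration of the final step, the sketch below turns an ordered list of elements into Markdown. The project itself builds the document with mdutils; this version uses plain string handling so that it does not depend on any mdutils call. The element dictionaries (keys `type`, `text`, `path`) and the function name `to_markdown` are assumptions made for the example.

```python
from typing import Dict, List


def to_markdown(elements: List[Dict[str, str]]) -> str:
    """Render ordered page elements as Markdown text."""
    parts = []
    for el in elements:
        kind = el["type"]
        if kind == "Title":
            # Detected titles become second-level headers.
            parts.append("## " + el["text"].strip())
        elif kind in ("Figure", "Table"):
            # Figures and tables are referenced by the path of the saved image.
            parts.append("![{}]({})".format(kind, el["path"]))
        else:
            # Text and List regions are emitted as plain paragraphs.
            parts.append(el["text"].strip())
    # A blank line separates consecutive Markdown blocks.
    return "\n\n".join(parts) + "\n"
```

For example, a title, a paragraph, and a figure saved under `output/` would produce a header, a paragraph, and an image reference, each separated by a blank line.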

Technical Stack 🛠️

PDF Processing: pdf2image
Layout Detection: layoutparser with Detectron2
OCR: pytesseract
Translation: pygtrans
Markdown Generation: mdutils
Web Interface: streamlit

Usage 🖥️

To run the application:

streamlit run app.py

Upload a PDF file, select the target language, and download the resulting Markdown file.

Core Functionality 🧠

paperTrans Class

The paperTrans class in paperTrans.py is the core of the project. Key methods include:

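__init__(self, path: str, leng: str = 'zh-tw'): Initializes the conversion process for the PDF at path. The leng argument appears to select the target language (default 'zh-tw', Traditional Chinese).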
toElements(self, image, first=False, path='output'): Converts a page image to structured elements.
detialVirtualize(self, element, image_): Performs OCR on a specific element.
saveImage(self, image, path): Saves extracted figures and tables.
elementsTrans(self, elements): Translates text elements.
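The README does not document the internals of these methods. As one possible shape for a step like elementsTrans, the sketch below translates only the text-bearing elements and leaves figures and tables untouched. Everything here is hypothetical: the function name `translate_elements`, the element dictionaries, and the injected `translate` callable, which stands in for whatever translation client is used (the project uses pygtrans, which this sketch does not call).

```python
from typing import Callable, Dict, List

# Layout classes that carry text worth translating (assumed for this sketch).
TEXT_TYPES = {"Text", "Title", "List"}


def translate_elements(
    elements: List[Dict[str, str]],
    translate: Callable[[List[str]], List[str]],
) -> List[Dict[str, str]]:
    """Return a copy of `elements` with text-bearing entries translated."""
    # Positions of the elements whose text should be sent to the translator.
    indices = [i for i, el in enumerate(elements) if el["type"] in TEXT_TYPES]
    # One batched call keeps the number of translation requests low.
    translated = translate([elements[i]["text"] for i in indices])
    # Shallow-copy each element so the caller's input is left unmodified.
    result = [dict(el) for el in elements]
    for i, text in zip(indices, translated):
        result[i]["text"] = text
    return result
```

Passing the translator in as a function keeps the step easy to test offline with a stub and makes it simple to swap translation back ends.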

Repository: YuamLu/paperTrans
