paperTrans is an academic tool designed to convert scholarly papers from PDF format to Markdown, incorporating layout detection, optical character recognition (OCR), and translation capabilities.
- PDF to Markdown conversion
- Automated layout detection for text, titles, figures, and tables
- Image extraction and preservation
- Text translation (English and Traditional Chinese)
- Web-based user interface
paperTrans employs a multi-step process to transform PDF documents:
- PDF Processing 📸: Convert PDF pages to images using `pdf2image`.
- Layout Detection 🔍:
  - Utilize `layoutparser` with a pre-trained Detectron2 model.
  - Model: PubLayNet/faster_rcnn_R_50_FPN_3x
  - Configuration: `lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config`
  - Element Classification: Text, Title, List, Table, Figure
- Text Extraction 🔤:
  - Apply OCR to detected text regions using `pytesseract`.
  - Merge overlapping text blocks for coherence.
  - Sort elements based on their spatial position within the page.
- Figure and Table Handling 📊:
  - Extract and save figures and tables as separate image files.
  - Generate references to these images in the output Markdown.
- Translation 🌐:
  - Translate extracted text using `pygtrans`.
  - Current language support: English and Traditional Chinese.
- Markdown Generation ✍️:
  - Construct a structured Markdown document using `mdutils`.
  - Incorporate translated text, headers, and image references.
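
The merge and sort steps under Text Extraction can be illustrated with a small, dependency-free sketch. This is not the project's actual code: the `Box` type, the function names, and the two-column reading-order heuristic are all assumptions made for illustration, and the real implementation may merge and order blocks differently.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    """Axis-aligned region on a page image, in pixel coordinates (hypothetical type)."""
    x1: float
    y1: float
    x2: float
    y2: float
    kind: str = "Text"


def overlaps(a: Box, b: Box) -> bool:
    """True when the two rectangles share a non-empty area."""
    return a.x1 < b.x2 and b.x1 < a.x2 and a.y1 < b.y2 and b.y1 < a.y2


def merge_overlapping(boxes: List[Box]) -> List[Box]:
    """Union overlapping boxes of the same kind until no pair overlaps."""
    merged = list(boxes)
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                a, b = merged[i], merged[j]
                if a.kind == b.kind and overlaps(a, b):
                    # Replace the pair with its bounding rectangle, then rescan.
                    merged[i] = Box(min(a.x1, b.x1), min(a.y1, b.y1),
                                    max(a.x2, b.x2), max(a.y2, b.y2), a.kind)
                    del merged[j]
                    changed = True
                    break
            if changed:
                break
    return merged


def reading_order(boxes: List[Box], page_width: float) -> List[Box]:
    """Assumed two-column layout: left column first, top to bottom in each column."""
    def key(b: Box):
        column = 0 if (b.x1 + b.x2) / 2 < page_width / 2 else 1
        return (column, b.y1)
    return sorted(boxes, key=key)


boxes = [
    Box(600, 100, 1000, 300),   # right column
    Box(50, 400, 500, 600),     # left column, lower block
    Box(50, 100, 500, 250),     # left column, upper block
    Box(60, 240, 500, 380),     # overlaps the upper block
]
merged = merge_overlapping(boxes)
ordered = reading_order(merged, page_width=1100)
```

Here the two overlapping left-column boxes collapse into one, and the result is ordered left column first. A single-column paper would only need the top-to-bottom part of the sort key.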
- PDF Processing: `pdf2image`
- Layout Detection: `layoutparser` with Detectron2
- OCR: `pytesseract`
- Translation: `pygtrans`
- Markdown Generation: `mdutils`
- Web Interface: `streamlit`
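
As a rough illustration of how the first three libraries might be chained, the sketch below rasterizes a PDF, runs the PubLayNet model over each page, and sends text-like regions to OCR. The function name `detect_and_ocr`, the `OCR_LABELS` set, and the lazy imports are assumptions for this example rather than the project's own code; running it also requires Poppler, Tesseract, and Detectron2 to be installed.

```python
from typing import List, Tuple

# PubLayNet class ids, matching the element types listed for this model.
LABEL_MAP = {0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"}
# Regions sent to OCR in this sketch (assumption); tables and figures are skipped.
OCR_LABELS = {"Text", "Title", "List"}


def detect_and_ocr(pdf_path: str) -> List[Tuple[int, str, str]]:
    """Return (page_number, label, text) for each text-like region in the PDF."""
    # Imported lazily so this module loads even without the heavy dependencies.
    import numpy as np
    import layoutparser as lp
    import pytesseract
    from pdf2image import convert_from_path

    model = lp.Detectron2LayoutModel(
        "lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config",
        label_map=LABEL_MAP,
    )
    results = []
    for page_number, page in enumerate(convert_from_path(pdf_path), start=1):
        image = np.asarray(page)  # PIL image -> RGB array
        for block in model.detect(image):
            if block.type in OCR_LABELS:
                crop = block.crop_image(image)
                text = pytesseract.image_to_string(crop).strip()
                results.append((page_number, block.type, text))
    return results
```

Calling `detect_and_ocr("paper.pdf")` (a placeholder path) would return one tuple per detected text region; merging, sorting, translation, and Markdown output would follow as separate steps.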
To run the application:
streamlit run app.py
Upload a PDF file, select the target language, and download the resulting Markdown file.
The `paperTrans` class in `paperTrans.py` is the core of the project. Key methods include:
- `__init__(self, path: str, leng: str = 'zh-tw')`: Initializes the conversion process.
- `toElements(self, image, first=False, path='output')`: Converts a page image to structured elements.
- `detialVirtualize(self, element, image_)`: Performs OCR on a specific element.
- `saveImage(self, image, path)`: Saves extracted figures and tables.
- `elementsTrans(self, elements)`: Translates text elements.
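
To make the translation step more concrete, here is a minimal sketch of translating only the text-like elements while leaving figure and table references untouched. It does not reproduce `elementsTrans`: the dictionary layout (`kind`, `content`), the `translate_elements` name, and the pluggable `translate` callable are hypothetical, chosen so the example runs without a network connection.

```python
from typing import Callable, Dict, List

# Element kinds whose content is natural-language text (assumption).
TEXT_KINDS = {"Text", "Title", "List"}


def translate_elements(elements: List[Dict[str, str]],
                       translate: Callable[[str], str]) -> List[Dict[str, str]]:
    """Return copies of the elements with text-like content translated."""
    out = []
    for element in elements:
        item = dict(element)  # copy so the caller's data is not mutated
        if item.get("kind") in TEXT_KINDS and item.get("content", "").strip():
            item["content"] = translate(item["content"])
        out.append(item)
    return out


elements = [
    {"kind": "Title", "content": "Introduction"},
    {"kind": "Figure", "content": "output/figure_1.png"},
]
# str.upper stands in for a real translator such as a pygtrans client call.
result = translate_elements(elements, str.upper)
```

Passing the translator in as a callable keeps the step easy to test and makes it straightforward to swap the translation backend.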