YuamLu/paperTrans

paperTrans 📄➡️📝

paperTrans is an academic tool designed to convert scholarly papers from PDF format to Markdown, incorporating layout detection, optical character recognition (OCR), and translation capabilities.

Features 🚀

  • PDF to Markdown conversion
  • Automated layout detection for text, titles, figures, and tables
  • Image extraction and preservation
  • Text translation (English and Traditional Chinese)
  • Web-based user interface

Methodology 🔬

paperTrans employs a multi-step process to transform PDF documents:

  1. PDF Processing 📸: Convert PDF pages to images using pdf2image.

  2. Layout Detection 🔍:

    • Utilize layoutparser with a pre-trained Detectron2 model.
    • Model: PubLayNet/faster_rcnn_R_50_FPN_3x
    • Configuration: lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config
    • Element Classification: Text, Title, List, Table, Figure
  3. Text Extraction 🔤:

    • Apply OCR to detected text regions using pytesseract.
    • Merge overlapping text blocks for coherence.
    • Sort elements based on their spatial position within the page.
  4. Figure and Table Handling 📊:

    • Extract and save figures and tables as separate image files.
    • Generate references to these images in the output Markdown.
  5. Translation 🌐:

    • Translate extracted text using pygtrans.
    • Current language support: English and Traditional Chinese.
  6. Markdown Generation ✍️:

    • Construct a structured Markdown document using mdutils.
    • Incorporate translated text, headers, and image references.
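The merge and sort operations in step 3 can be sketched in plain Python. This is a minimal illustration, not the project's code: the `Box` class, the overlap test, and the single-column reading order (top to bottom, then left to right) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical axis-aligned box in page pixels; (x1, y1) is top-left."""
    x1: float
    y1: float
    x2: float
    y2: float


def overlaps(a: Box, b: Box) -> bool:
    """True when the two boxes share a region of non-zero area."""
    return a.x1 < b.x2 and b.x1 < a.x2 and a.y1 < b.y2 and b.y1 < a.y2


def union(a: Box, b: Box) -> Box:
    """Smallest box that contains both inputs."""
    return Box(min(a.x1, b.x1), min(a.y1, b.y1), max(a.x2, b.x2), max(a.y2, b.y2))


def merge_overlapping(boxes: list[Box]) -> list[Box]:
    """Fuse overlapping boxes until no two remaining boxes overlap."""
    merged: list[Box] = []
    for box in boxes:
        # A box that grows by absorbing a neighbour may reach further
        # neighbours, so keep scanning until nothing else overlaps it.
        changed = True
        while changed:
            changed = False
            for other in merged:
                if overlaps(box, other):
                    box = union(box, other)
                    merged.remove(other)
                    changed = True
                    break
        merged.append(box)
    return merged


def reading_order(boxes: list[Box]) -> list[Box]:
    """Sort top-to-bottom, then left-to-right (single-column assumption)."""
    return sorted(boxes, key=lambda b: (b.y1, b.x1))
```

Calling `reading_order(merge_overlapping(boxes))` yields non-overlapping regions in the order they would be passed to OCR. Two-column papers would need a column-aware sort key, which this sketch does not attempt.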

Technical Stack 🛠️

  • PDF Processing: pdf2image
  • Layout Detection: layoutparser with Detectron2
  • OCR: pytesseract
  • Translation: pygtrans
  • Markdown Generation: mdutils
  • Web Interface: streamlit
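As a rough sketch of how the first two entries of the stack fit together, the snippet below renders each PDF page with pdf2image and runs the PubLayNet model named in the Methodology section. The score threshold, the `detect_blocks` and `needs_ocr` helper names, and the choice of which labels go to OCR are assumptions for illustration; the project's own code may be organized differently.

```python
# Class ids of the PubLayNet model mapped to the element names
# listed in the Methodology section.
LABEL_MAP = {0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"}
MODEL_CONFIG = "lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config"
SCORE_THRESHOLD = 0.8  # assumed cut-off, not taken from the project


def detect_blocks(pdf_path: str):
    """Yield (page_number, label, (x1, y1, x2, y2)) for each detected block."""
    # Imported lazily so this module loads without the heavy dependencies.
    import layoutparser as lp
    import numpy as np
    from pdf2image import convert_from_path

    model = lp.Detectron2LayoutModel(
        MODEL_CONFIG,
        extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", SCORE_THRESHOLD],
        label_map=LABEL_MAP,
    )
    for page_number, page in enumerate(convert_from_path(pdf_path), start=1):
        layout = model.detect(np.array(page))  # PIL image -> RGB array
        for block in layout:
            yield page_number, block.type, block.coordinates


def needs_ocr(label: str) -> bool:
    """Assumed routing: text-like blocks go to OCR, the rest become images."""
    return label in {"Text", "Title", "List"}
```

Running `detect_blocks` requires layoutparser with Detectron2, pdf2image, and a Poppler installation; the constants and `needs_ocr` can be used without them.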

Usage 🖥️

To run the application:

streamlit run app.py

Upload a PDF file, select the target language, and download the resulting Markdown file.

Core Functionality 🧠

paperTrans Class

The paperTrans class in paperTrans.py is the core of the project. Key methods include:

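  • __init__(self, path: str, leng: str = 'zh-tw'): Initializes the conversion process for the PDF at path; leng is presumably the target language code, defaulting to Traditional Chinese ('zh-tw').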
  • toElements(self, image, first=False, path='output'): Converts a page image to structured elements.
  • detialVirtualize(self, element, image_): Performs OCR on a specific element.
  • saveImage(self, image, path): Saves extracted figures and tables.
  • elementsTrans(self, elements): Translates text elements.
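To make the last two responsibilities concrete, here is a hedged sketch of translating text elements and then emitting Markdown with image references. The element dictionaries, the `translate_elements` and `to_markdown` names, and the heading level are hypothetical; the translator is passed in as a plain function so the example does not depend on any particular translation client, and simple string assembly stands in for mdutils.

```python
from typing import Callable

# Hypothetical element shape: {"type": ..., "text": ...} for text-like blocks,
# {"type": ..., "path": ...} for figures and tables saved as image files.
Element = dict[str, str]


def translate_elements(
    elements: list[Element], translate: Callable[[str], str]
) -> list[Element]:
    """Return copies of the elements with every text field translated."""
    translated = []
    for element in elements:
        element = dict(element)  # copy so the caller's data is untouched
        if "text" in element:
            element["text"] = translate(element["text"])
        translated.append(element)
    return translated


def to_markdown(elements: list[Element]) -> str:
    """Render elements, in order, as headings, paragraphs, and image links."""
    parts = []
    for element in elements:
        if element["type"] == "Title":
            parts.append(f"## {element['text']}")
        elif element["type"] in {"Figure", "Table"}:
            parts.append(f"![{element['type']}]({element['path']})")
        else:
            parts.append(element["text"])
    return "\n\n".join(parts) + "\n"
```

A real run would pass a function that wraps the translation client in place of the stand-in; figures and tables pass through untranslated because they carry only a file path.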
