Extraction of data from table is not accurate #2081

@harrier1290

Bug

I am trying to extract data from a table in a research article, and the results are not accurate. I have seen the same problems with other tables. Often some values are missing, and special characters such as '=' and '~' are added. Sometimes the digit '0' is misread as the letter 'o'.

To rule out the EasyOCR engine as the cause, I also tried Tesseract OCR, but the results were worse.

Is this a known bug in Docling when extracting values from tables?
...
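To make the symptoms above easier to quantify, here is a minimal sketch of a check that flags extracted cells showing the artifacts described (a letter 'o' where a digit 0 is expected, stray '=' or '~', empty cells). The helper name `suspicious_cell` is hypothetical and not part of Docling. It only flags candidates for manual review, since an empty cell or a leading '~' may be legitimate in the source table.

```python
def suspicious_cell(cell: str) -> list[str]:
    """Return reasons why an extracted table cell looks like an OCR artifact.

    Hypothetical helper for illustration only; not part of Docling.
    """
    text = cell.strip()
    if not text:
        # Could be a dropped value, or a cell that is empty in the source.
        return ["empty"]

    compact = text.replace(" ", "")
    # Remove the special characters reported as spurious additions.
    cleaned = compact.replace("=", "").replace("~", "")
    # Undo the reported 0 -> o confusion.
    normalized = cleaned.replace("o", "0").replace("O", "0")

    try:
        float(normalized)
    except ValueError:
        # Not numeric even after normalization: treat as ordinary text.
        return []

    reasons = []
    if normalized != cleaned:
        reasons.append("letter o for digit 0")
    if cleaned != compact:
        reasons.append("stray = or ~")
    return reasons


if __name__ == "__main__":
    for value in ["0.25", "1.o5", "=0.25", "Sample", ""]:
        print(repr(value), suspicious_cell(value))
```

Running this over every cell of an extracted table gives a rough count of affected values per OCR engine, which can help when comparing EasyOCR and Tesseract output.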

Steps to reproduce

The table was taken from this paper: https://doi.org/10.1016/j.jpcs.2024.112412
...
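The exact script used is not shown in this report. As a rough sketch of what a reproduction could look like, the following converts a PDF with Docling and exports each detected table to a pandas DataFrame, with a switch between EasyOCR and Tesseract. The function name `extract_tables` and the file name `paper.pdf` are assumptions for illustration; the import paths and option classes follow the Docling 2.x documentation as I understand it and may differ between versions.

```python
def extract_tables(pdf_path: str, engine: str = "easyocr"):
    """Convert a PDF with Docling and return its tables as DataFrames.

    Illustrative sketch only; not the reporter's actual script.
    """
    if engine not in ("easyocr", "tesseract"):
        raise ValueError(f"unsupported OCR engine: {engine!r}")

    # Imported lazily so the argument check above works without Docling.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import (
        EasyOcrOptions,
        PdfPipelineOptions,
        TesseractOcrOptions,
    )
    from docling.document_converter import DocumentConverter, PdfFormatOption

    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True
    pipeline_options.do_table_structure = True
    pipeline_options.ocr_options = (
        EasyOcrOptions() if engine == "easyocr" else TesseractOcrOptions()
    )

    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )
    result = converter.convert(pdf_path)

    # One DataFrame per table detected in the document.
    return [table.export_to_dataframe() for table in result.document.tables]


if __name__ == "__main__":
    # "paper.pdf" is a placeholder for a local copy of the article.
    for engine in ("easyocr", "tesseract"):
        for index, frame in enumerate(extract_tables("paper.pdf", engine)):
            print(f"--- {engine}, table {index} ---")
            print(frame.to_markdown())
```

Printing both engines' output side by side makes it easier to see which cells are dropped or altered in each case.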

Docling version

2.44.0
...

Python version

3.12.3...

Please find attached the PDF showing the loss of values during data extraction.

Image

Labels: bug
