[Bug]: PDF Parser miss text after OCR #4640

@fengjac

Is there an existing issue for the same bug?

  • I have checked the existing issues.

RAGFlow workspace code commit ID

3c2c894

RAGFlow image version

3c2c894

Other environment information

Windows 11 Pro
Python 3.10.16
pytorch 12.4

Actual behavior

I used this file to test pdf_parser.py:

layout1.pdf

I found that the word "rr" is missing from the result after OCR.

As the screenshot shows, my PDF file contains the word "rr":

Image

After running the self._image_ function, the boxes look like this:

Image

The boxes do not contain the word "rr".

Expected behavior

No response

Steps to reproduce

Debug pdf_parser.py in VS Code with the layout1.pdf file (attached under Actual behavior).
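
To make the check less dependent on eyeballing the debugger, a small helper along these lines could list which expected words are absent from the OCR boxes. This is only a sketch: `missing_words` is a hypothetical helper (not part of pdf_parser.py), and it assumes each box is a dict with a `"text"` field. The sample data is illustrative, not the real content of layout1.pdf.

```python
import re
from collections import Counter


def tokenize(s):
    """Split a string into lowercase word tokens."""
    return re.findall(r"\w+", s.lower())


def missing_words(expected_text, boxes):
    """Return {word: count} for words in expected_text that the boxes lack.

    Assumption: each box is a dict carrying its recognized text under "text".
    """
    expected = Counter(tokenize(expected_text))
    found = Counter()
    for box in boxes:
        found.update(tokenize(box.get("text", "")))
    # Counter subtraction keeps only words whose expected count exceeds the found count.
    return dict(expected - found)


# Illustrative data only.
sample_boxes = [{"text": "hello world"}, {"text": "foo"}]
print(missing_words("hello rr world foo", sample_boxes))
```

With the real file, `expected_text` would be the text visible on the page and `boxes` the list inspected in the debugger after OCR.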

Additional information

No response

Labels

🐞 bug: Something isn't working, or a pull request that fixes a bug.
