
how to proceed? #5

Open
philipus opened this issue Nov 19, 2020 · 1 comment

philipus commented Nov 19, 2020

Nice work!

First, a question:
Could you provide your model (model66.zip)? I do not have a decent GPU :-( Otherwise, how much time would it take on Google Cloud with a GPU (T4, K80, P100, V100, P4)? My current val_loss at epoch 188 is 0.27689?!

Some missing links:
While going through the steps in the Python training script, I found that one needs JPG instead of BMP for TensorFlow, right?
For one XML, "10.1.1.1.xml", no BMP (JPG) exists. That is why the script crashes at this point.

One last question:
I figured out that the model outputs a table mask and a column mask. In the commentary on the original paper I found this:

Outputs: After the documents are processed using the model, the masks of tables and columns are generated. These masks are used to filter out the table and its column regions from the image. Now using the Tesseract OCR, the information is extracted from the segmented regions. Below is an image showing the masks that are generated and later extracted from the tables

Any hint on where I can find a good intro to extracting the final table content from the BMP/JPG image and the table/column masks using Tesseract OCR?

@francescoperessini

Hi @philipus,
I'm currently working on the same implementation and I can answer some of your questions.

Could you provide your model (model66.zip)? I do not have a decent GPU :-( Otherwise, how much time would it take on Google Cloud with a GPU (T4, K80, P100, V100, P4)? My current val_loss at epoch 188 is 0.27689?!

Right now I'm using Google Colab to train the model: it's free and it delivers good performance (about 15 seconds per epoch). You can upload the dataset to Drive and access it directly from the notebook on Colab. That said, having your trained model would be useful for me as well.

While going through the steps in the Python training script, I found that one needs JPG instead of BMP for TensorFlow, right?

Yes, the script "wants" JPGs as input. For the moment I've converted all the BMPs to JPGs, because I was having trouble with TensorFlow's decode_bmp function. I will come back to this later.
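For reference, the conversion can be done in a few lines with Pillow. This is a minimal sketch (the function name convert_bmps_to_jpgs is mine, not from the repo):

```python
from pathlib import Path

from PIL import Image


def convert_bmps_to_jpgs(src_dir, dst_dir):
    """Convert every .bmp in src_dir to a .jpg in dst_dir; return new paths."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    converted = []
    for bmp_path in sorted(Path(src_dir).glob("*.bmp")):
        jpg_path = dst / (bmp_path.stem + ".jpg")
        # JPEG has no alpha channel, so force RGB before saving.
        Image.open(bmp_path).convert("RGB").save(jpg_path, "JPEG", quality=95)
        converted.append(jpg_path)
    return converted
```

Keeping the original file stem means each JPG still matches its XML annotation by name.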

For one XML, "10.1.1.1.xml", no BMP (JPG) exists. That is why the script crashes at this point.

I ran into this "issue" as well; just remove the XML file that belongs to the missing image.
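As a sanity check before training, you could list every annotation that has no matching image instead of waiting for the crash. A small sketch (the function name orphan_annotations is my own):

```python
from pathlib import Path


def orphan_annotations(xml_dir, img_dir, img_ext=".jpg"):
    """Return XML annotation files with no same-named image in img_dir."""
    img_stems = {p.stem for p in Path(img_dir).glob(f"*{img_ext}")}
    return [p for p in sorted(Path(xml_dir).glob("*.xml"))
            if p.stem not in img_stems]
```

Any path this returns (e.g. 10.1.1.1.xml here) can be deleted or skipped before the training loop starts.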

One last question:
I figured out that the model outputs a table mask and a column mask. In the commentary on the original paper I found this:
Outputs: After the documents are processed using the model, the masks of tables and columns are generated. These masks are used to filter out the table and its column regions from the image. Now using the Tesseract OCR, the information is extracted from the segmented regions. Below is an image showing the masks that are generated and later extracted from the tables
Any hint on where I can find a good intro to extracting the final table content from the BMP/JPG image and the table/column masks using Tesseract OCR?

I'm also interested in this point. I haven't explored the possible solutions in depth; at the moment the only thing I can suggest is to use the output masks to cut the non-table pixels out of the original images (there are libraries that do this easily), then run Tesseract on the resulting images to extract the text. Again, any hint would help.
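To make the suggestion above concrete, here is a rough sketch using NumPy and Pillow, with the OCR step going through pytesseract (which also needs the Tesseract binary installed). The function names are mine, and the mask is assumed to be a 2-D array of probabilities the same height and width as the image:

```python
import numpy as np
from PIL import Image


def crop_table_region(image, mask, threshold=0.5):
    """Zero out non-table pixels, then crop to the mask's bounding box."""
    img = np.asarray(image.convert("RGB"))
    binary = np.asarray(mask, dtype=np.float32) > threshold
    if not binary.any():
        return None  # no table predicted on this page
    masked = img * binary[..., None]  # keep only pixels inside the mask
    ys, xs = np.where(binary)
    crop = masked[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    return Image.fromarray(crop.astype(np.uint8))


def ocr_table_region(image, mask):
    """Run Tesseract on the masked region (needs pytesseract + tesseract)."""
    import pytesseract  # imported lazily: an optional dependency
    region = crop_table_region(image, mask)
    return pytesseract.image_to_string(region) if region is not None else ""
```

Applying the column mask the same way, column by column, would give per-column crops, which tends to make reassembling the table cells from the OCR output easier.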

Hope I was clear!
