ENH: Improving labeling and the execution script #36

Open
wants to merge 2 commits into
base: master
from

Conversation

heaven00
Member

@heaven00 heaven00 commented Dec 21, 2017

Labeling

  • Added a Random Forest model that gives a binary label for groupings and non-groupings.
  • Added string matching to make header and title labeling more robust.

Execution Script

  • Added resume capabilities
  • Extracted out Default Numeric Headers as script parameters
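A minimal sketch of how these two script options could be exposed with `argparse`. The flag names `--resume` and `--default-headers` and the helper `build_parser` are assumptions made for illustration; only the positional argument names are borrowed from the `process_folder` signature shown later in the review.

```python
import argparse


def build_parser():
    """Build a command-line parser for the execution script (illustrative only)."""
    parser = argparse.ArgumentParser(
        description='Process a folder of demand draft pdfs.')
    parser.add_argument('input_folder_path', help='Folder with the input pdfs.')
    parser.add_argument('output_folder_path', help='Folder for the output.')
    # Hypothetical flag: skip pdfs whose output already exists.
    parser.add_argument('--resume', action='store_true',
                        help='Skip pdfs that were already processed.')
    # Hypothetical flag: default numeric headers passed as a list of strings.
    parser.add_argument('--default-headers', nargs='+', default=[],
                        help='Default numeric headers to fall back on.')
    return parser


if __name__ == '__main__':
    args = build_parser().parse_args()
    print(args.resume, args.default_headers)
```

With this layout the parsed `args.resume` and `args.default_headers` values could be passed straight through to `process_folder`.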

@heaven00 heaven00 changed the title ENH: Improving labelling and the execution script ENH: Improving labeling and the execution script Dec 21, 2017
Member

@gggodhwani gggodhwani left a comment

Please review the comments and make the required changes.

@@ -3,6 +3,7 @@
'''
import re
import pandas as pd
import joblib
Member

Add it to requirements.txt.

if ('Actuals' in row['text'] or
        'Budget' in row['text'] or
        'Revised' in row['text'] or
        'Estimate' in row['text']):
Member

Use a regex wherever you can, and define it at the top of the module as a constant!
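One way the chained `in` checks above could be collapsed into a single module-level pattern. This is a sketch only: the names `HEADER_KEYWORDS_RE` and `has_header_keyword` are hypothetical, and the match stays case-sensitive to mirror the original substring checks.

```python
import re

# Hypothetical constant name; the keywords are the four from the snippet above.
HEADER_KEYWORDS_RE = re.compile(r'Actuals|Budget|Revised|Estimate')


def has_header_keyword(text):
    """Return True if text contains any of the header keywords."""
    return HEADER_KEYWORDS_RE.search(text) is not None
```

The condition in the snippet would then read `if has_header_keyword(row['text']):`.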

# check capitalization of letters
if row.is_text and row.text.isupper() and pd.isnull(row.label):
    if ('REVENUE EXPENDITURE' in row.text or 'DETAILED ACCOUNT' in
            row.text or 'ABSTRACT ACCOUNT' in row.text or 'CAPITAL EXPENDITURE'
Member

Use a regex wherever you can, and define it at the top of the module as a constant!

if row.is_text and row.text.isupper() and pd.isnull(row.label):
    if ('REVENUE EXPENDITURE' in row.text or 'DETAILED ACCOUNT' in
            row.text or 'ABSTRACT ACCOUNT' in row.text or 'CAPITAL EXPENDITURE'
            in row.text or 'LOAN EXPENDITURE' in row.text):
Member

Use a regex wherever you can, and define it at the top of the module as a constant!

row.text or 'ABSTRACT ACCOUNT' in row.text or 'CAPITAL EXPENDITURE'
in row.text or 'LOAN EXPENDITURE' in row.text):
row['label'] = 'title'
if 'demand no' in row.text.lower():
Member

Use a regex wherever you can, and define it at the top of the module as a constant!
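The title keywords and the `demand no` check from the snippets above could be handled in the same way. The following is a rough sketch under assumptions: the constant and function names are hypothetical, and `re.IGNORECASE` stands in for the `.lower()` call in the original check.

```python
import re

# Hypothetical constant names; the keywords are taken from the snippets above.
TITLE_KEYWORDS_RE = re.compile(
    r'REVENUE EXPENDITURE|DETAILED ACCOUNT|ABSTRACT ACCOUNT|'
    r'CAPITAL EXPENDITURE|LOAN EXPENDITURE'
)
DEMAND_NO_RE = re.compile(r'demand no', re.IGNORECASE)


def has_title_keyword(text):
    """Return True for all-uppercase text that contains a title keyword."""
    return text.isupper() and TITLE_KEYWORDS_RE.search(text) is not None


def mentions_demand_no(text):
    """Return True if text contains 'demand no' in any letter case."""
    return DEMAND_NO_RE.search(text) is not None
```

Keeping both patterns next to each other at the top of the module makes the keyword lists easy to extend without touching the labeling logic.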

     COLUMN_COUNT = 6

-    def __init__(self, img, block_features, page_num, target_folder):
+    def __init__(self, img, block_features, page_num, target_folder,
+                 default_headers):
         self.img = img
Member

Please document all the __init__ arguments.
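A possible shape for such a docstring is sketched below. The class name `PageLabeler` is hypothetical, and every argument description is a guess from the argument name alone, so the author would need to replace them with the real meanings.

```python
class PageLabeler:  # hypothetical class name, not taken from the PR
    COLUMN_COUNT = 6

    def __init__(self, img, block_features, page_num, target_folder,
                 default_headers):
        '''Set up the labeler for a single page.

        All descriptions below are assumptions based on the argument names.

        Args:
            img: image of the page being processed.
            block_features: table of features, one row per detected block.
            page_num (int): number of the page within the source pdf.
            target_folder (str): folder where the output is written.
            default_headers (list): default numeric headers from the script.
        '''
        self.img = img
        self.block_features = block_features
        self.page_num = page_num
        self.target_folder = target_folder
        self.default_headers = default_headers
```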

'''
Check which pdfs are already generated and remove them from the complete
list of pdfs.
'''
Member

Documentation for the args and return type is missing.
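As an illustration of what the documented helper might look like, here is a sketch of a resume filter with `Args` and `Returns` sections. The function name, its parameters, and the rule for deciding that a pdf is "already generated" (an output entry sharing the pdf's base name) are all assumptions; the PR does not show the actual implementation.

```python
import os


def filter_processed_pdfs(pdf_files, output_folder_path):
    '''Check which pdfs are already generated and remove them from the
    complete list of pdfs.

    Args:
        pdf_files (list of str): file names or paths of all input pdfs.
        output_folder_path (str): folder that holds the generated output.

    Returns:
        list of str: the pdfs that still need processing, in input order.
    '''
    # Nothing has been generated yet if the output folder does not exist.
    if not os.path.isdir(output_folder_path):
        return list(pdf_files)
    # Assumption: an output entry with the same base name marks a pdf as done.
    done = {os.path.splitext(name)[0]
            for name in os.listdir(output_folder_path)}
    return [pdf for pdf in pdf_files
            if os.path.splitext(os.path.basename(pdf))[0] not in done]
```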



def process_folder(input_folder_path, output_folder_path, resume,
                   default_headers):
    '''Process a folder of demand draft pdfs and store the output in the output
    folder.
    '''
Member

Documentation for the args and return type is missing.

vertical_ratio,
page_num,
pdf_file_path,
(25, 20),
Member

Declare (25, 20) as a class variable and explain the rationale for the values.

check_and_create_folder(features_log_folder)
block_features.to_csv('{0}/{1}.csv'.format(features_log_folder,
                                           page_num), index=False)
# Blank page check
if len(block_features.index) > 3:
Member

Declare the constant 3 as a class variable?
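Both magic values flagged in the review, the `(25, 20)` tuple and the blank-page threshold `3`, could be lifted into named class variables along these lines. The class name and the constant names are hypothetical, and the rationale comments are placeholders, since the PR does not state what the numbers mean.

```python
class PageProcessor:  # hypothetical class name, not taken from the PR
    # Hypothetical name for the (25, 20) tuple passed in the call above.
    # The author should record here what the two numbers control and why.
    BLOCK_PARAMS = (25, 20)

    # Hypothetical name for the threshold used in the blank page check.
    # A page with at most this many blocks is treated as blank.
    MIN_BLOCKS_PER_PAGE = 3

    def is_blank_page(self, block_count):
        '''Return True when the page has too few blocks to be processed.'''
        return block_count <= self.MIN_BLOCKS_PER_PAGE
```

The check in the snippet would then become `if not self.is_blank_page(len(block_features.index)):`, which keeps the original `> 3` behavior.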

2 participants