-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Improving labeling and the execution script #36
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please review the comments and make required changes
@@ -3,6 +3,7 @@ | |||
''' | |||
import re | |||
import pandas as pd | |||
import joblib |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add it to the requirements.txt
if ('Actuals' in row['text'] or | ||
'Budget' in row['text'] or | ||
'Revised' in row['text'] or | ||
'Estimate' in row['text']): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make a regex wherever you can, and add it on the top as a constant!
# check capitalization of letters | ||
if row.is_text and row.text.isupper() and pd.isnull(row.label): | ||
if ('REVENUE EXPENDITURE' in row.text or 'DETAILED ACCOUNT' in | ||
row.text or 'ABSTRACT ACCOUNT' in row.text or 'CAPITAL EXPENDITURE' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make a regex wherever you can, and add it on the top as a constant!
if row.is_text and row.text.isupper() and pd.isnull(row.label): | ||
if ('REVENUE EXPENDITURE' in row.text or 'DETAILED ACCOUNT' in | ||
row.text or 'ABSTRACT ACCOUNT' in row.text or 'CAPITAL EXPENDITURE' | ||
in row.text or 'LOAN EXPENDITURE' in row.text): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make a regex wherever you can, and add it on the top as a constant!
row.text or 'ABSTRACT ACCOUNT' in row.text or 'CAPITAL EXPENDITURE' | ||
in row.text or 'LOAN EXPENDITURE' in row.text): | ||
row['label'] = 'title' | ||
if 'demand no' in row.text.lower(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make a regex wherever you can, and add it on the top as a constant!
COLUMN_COUNT = 6 | ||
|
||
def __init__(self, img, block_features, page_num, target_folder): | ||
def __init__(self, img, block_features, page_num, target_folder, | ||
default_headers): | ||
self.img = img |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please explain all the init arguments
''' | ||
Check which pdfs are already generated and remove them from the complete | ||
list of pdfs. | ||
''' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Args and return type documentation is missing?
|
||
|
||
def process_folder(input_folder_path, output_folder_path, resume, | ||
default_headers): | ||
'''Process a folder of demand draft pdfs and store the output in the output | ||
folder. | ||
''' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Args and return type documentation is missing?
vertical_ratio, | ||
page_num, | ||
pdf_file_path, | ||
(25, 20), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Declare 25,20 as class variables and explain its rationale
check_and_create_folder(features_log_folder) | ||
block_features.to_csv('{0}/{1}.csv'.format(features_log_folder, | ||
page_num), index=False) | ||
# Blank page check | ||
if len(block_features.index) > 3: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Declare constant 3 as Class variable?
Labeling
Execution Script