Hi there,
We have been trying to use dedupe to clean a large dataset of company names gathered from job postings. Before loading the list of unique company names into a modified version of mysql_example.py, we standardize them by removing punctuation and spaces, converting to lowercase, and removing common substrings like "corporate."
Here is an example of the records we want blocked together, along with the cleaning process:
Note that while "Tmobile" and "T-Mobile, US" should be labelled as the same company (as should "Target" and "target13"), "Amazon Herb Company" should be blocked separately from "Amazon Corporate LLC Pvt Ltd", "Amazon Corporation", and "Amazon Corp Seattle."
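A minimal sketch of the cleaning step described above, assuming it amounts to lowercasing, stripping punctuation and whitespace, and deleting a short list of filler substrings. The function name `standardize_name` and the `COMMON_SUBSTRINGS` tuple are hypothetical and are not taken from the modified mysql_example.py; only "corporate" is named in this post as a removed substring.

```python
import re
import string

# Hypothetical filler list: the post only names "corporate" explicitly.
COMMON_SUBSTRINGS = ("corporate",)

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def standardize_name(name, common_substrings=COMMON_SUBSTRINGS):
    """Lowercase a company name, drop punctuation and whitespace,
    then delete each filler substring."""
    cleaned = name.lower()
    cleaned = cleaned.translate(_PUNCT_TABLE)   # remove punctuation
    cleaned = re.sub(r"\s+", "", cleaned)       # remove all whitespace
    for filler in common_substrings:
        cleaned = cleaned.replace(filler, "")   # remove filler substrings
    return cleaned


if __name__ == "__main__":
    for raw in ["Tmobile", "T-Mobile, US", "Amazon Corporate LLC Pvt Ltd"]:
        print(raw, "->", standardize_name(raw))
```

Under these assumptions "Tmobile" and "T-Mobile, US" still differ after cleaning ("tmobile" versus "tmobileus"), so exact matching on the cleaned string would not pair them. Because the filler removal runs after whitespace is stripped, it can also match inside longer words: "incorporated" contains "corporate", for instance.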
No matter what we do, we always run into "BlockingError: No records have been blocked together. Is the data you are trying to match like the data you trained on? If so, try adding more training data." It is unclear whether this is caused by an error in the original mysql_example.py, our modifications to it, or the training stage, but we have tried this on numerous datasets of company names (of various sizes, levels of duplication, etc.) and gotten the same result. Here is the full stack trace:
We also have the industry NAICS code (at the 2-digit and 3-digit levels) and the location (at the state, county, and city levels) for each posting observation, in case either would be helpful.
Linked in this Google Drive folder are the modified mysql_example files we have been working with, as well as a sample of the data.
Any help or suggestions are GREATLY appreciated!