Training not providing enough matches #1077
I am having a similar issue with record linkage: the training session gives mostly distincts and only very few matches. Problem: In version 2.0.17, the labelling gives a lot of pairs (>30) that are obvious non-links, and only 1-2 pairs that could be true links. There are about 30k records in both data sets. The features I use are:
I manually inspected some of the records: there are links to be found. I had used dedupe before on similar data and did not expect this. I therefore tried different versions, and at least in version 2.0.11 the labelling works much better (i.e., many more pairs that are likely to be true links) with the same data. Environment: ubuntu 22.04. The project uses the following conda settings: #Name Version Build Channel
there's been a number of changes that could impact the active labeling. if you could isolate this to a specific release that would be helpful. if you could provide some example data where the current code seems to be performing worse, that would also be very helpful |
@fgregg thanks for the response. Based on previous comments from @f-hafner, I switched from 2.0.17 back to version 2.0.11 with my own fix for the KeyError issue, and the training now seems well balanced between distincts and matches. So the issue with not enough matches must have been introduced in 2.0.12 or later. I saw this issue consistently when testing 2.0.17 with three different datasets. Unfortunately, I can't share the data at the moment because it contains PII. Has anyone seen this issue with a public dataset? BTW, for those interested, here is my quick fix in core.py in version 2.0.11 for the KeyError encountered when the dataset size is between ~66,000 and ~92,000: |
could you narrow it down to a specific version between 2.0.11 and 2.0.17? |
Let me do some testing and will let you know... |
I should be able to share a sample of my dataset where the issue occurs; I'll let you know. |
thank you very much! |
@tigerang22 , I think I can confirm this. With 2.0.14, I stopped at 100 negative, 1 positive. With 2.0.13, I stopped at 22 negative, 18 positive. |
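For anyone comparing versions the same way, here is a minimal sketch for tallying the label balance from a saved training file, so the numbers per version can be compared without counting by hand. It assumes the training file is JSON with top-level "match" and "distinct" lists of record pairs; `label_balance` and the toy records are hypothetical and not part of dedupe.

```python
import json
from io import StringIO


def label_balance(training_file):
    """Count labelled pairs in a training file object.

    Assumption: the file is JSON with top-level "match" and "distinct"
    keys, each holding a list of record pairs.
    """
    labels = json.load(training_file)
    n_match = len(labels.get("match", []))
    n_distinct = len(labels.get("distinct", []))
    total = n_match + n_distinct
    # Share of labelled pairs that are matches; 0.0 if nothing is labelled.
    match_share = n_match / total if total else 0.0
    return {"match": n_match, "distinct": n_distinct, "match_share": match_share}


# Toy training file with made-up records (one match, three distinct pairs).
toy_file = StringIO(json.dumps({
    "match": [[{"name": "Ann Lee"}, {"name": "Anne Lee"}]],
    "distinct": [
        [{"name": "Ann Lee"}, {"name": "Bob Ray"}],
        [{"name": "Ann Lee"}, {"name": "Cy Doe"}],
        [{"name": "Bob Ray"}, {"name": "Cy Doe"}],
    ],
}))
balance = label_balance(toy_file)
print(balance)
```

Running this against the training file written after each labelling session would give one row per dedupe version, which makes it easier to see in which release the share of matches drops.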
Here is the repo with data and scripts: https://github.com/f-hafner/dedupe_training_example |
@fgregg any insight on the issue and when a future release would have the fix? Thanks |
i believe i have addressed this on main @f-hafner and @tigerang22. can you confirm that it works for your cases? @f-hafner thank you for the example code, that was very helpful |
@fgregg great! I will give it a try shortly. |
@fgregg I encountered a KeyError related to the datetime field type and it turns out that your commit yesterday doesn't have variables/date_time.py anymore. Are we expected to add that as a custom type now? Please advise. |
ugh! this is probably related to #1085 |
@fgregg I solved the datetime type issue by resetting my virtual environment. Just completed test of commit aa2b04e against my previous dataset. The same problem unfortunately still exists for me, 1 match and close to 100 distinct pairs before I stopped the testing. @f-hafner have you had luck with your scenarios? |
@tigerang22. that’s unfortunate! i |
I haven't tried it out yet, but I will let you know when I have |
Hi @fgregg , @tigerang22 I tried using the github version of dedupe (also on my sample data). It still gave almost only negatives. But I am not sure I got the right version. I installed dedupe as follows:
But then Details here: https://github.com/f-hafner/dedupe_training_example What is the correct way to install the github version? |
@f-hafner looks like you installed it okay. it's a bit simpler to do it like this:
pip install https://github.com/dedupeio/dedupe/archive/522e7b2147d61fa36d6dee6288df57aee95c4bcc.zip
that's very strange that the performance didn't get better for you. using your test repo, it seemed to be working very well for me. hmmm.... |
@fgregg is there any chance that this issue is related to the Dedupe and RecordLink DisagreementLearners if you don't already have a training file? In these situations, it seems like a randomly chosen record is used to kickoff the learning process and identify pairs of records for you to label. Is it possible that this randomly chosen record just isn't very helpful for learning initial blocking rules and setting up the active learning session? Also, since some initial blocking is occurring, I wonder if with |
i think i have a fix for this in 2.0.23 |
@fgregg Great! I will give 2.0.23 a shot. |
@fgregg I have just tested 2.0.23 and unfortunately the same issue exists. Are there any fine-tuning options that might affect this, such as calling deduper.prepare_training(temp_d) with dynamic sample_size and blocked_proportion instead of using the default values? I also noticed that previous versions such as 2.0.13 would take 4-5 mins for the prepare_training call, but 2.0.23 now takes more than 10 mins. |
@tigerang22 can you check to see if the example that @f-hafner posted also doesn't work for you? (it does for me now). |
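To make that question concrete, here is a rough sketch of what trying several sample_size and blocked_proportion values could look like. The `prepare_with_sampling` helper and the `RecordingDeduper` stand-in are hypothetical; the stand-in only records the arguments so the snippet runs without dedupe installed, and the values in the loop are arbitrary illustrations, not recommendations. With a real deduper object you would pass it in place of the stand-in.

```python
def prepare_with_sampling(deduper, data, sample_size, blocked_proportion):
    """Call prepare_training with explicit sampling settings."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    if not 0.0 <= blocked_proportion <= 1.0:
        raise ValueError("blocked_proportion must be between 0 and 1")
    deduper.prepare_training(
        data, sample_size=sample_size, blocked_proportion=blocked_proportion
    )


class RecordingDeduper:
    """Hypothetical stand-in for a deduper; it only logs what it was given."""

    def __init__(self):
        self.calls = []

    def prepare_training(self, data, training_file=None,
                         sample_size=None, blocked_proportion=None):
        self.calls.append({
            "n_records": len(data),
            "sample_size": sample_size,
            "blocked_proportion": blocked_proportion,
        })


# Made-up records keyed by id, standing in for temp_d.
toy_data = {i: {"name": f"person {i}"} for i in range(5)}

stub = RecordingDeduper()
# Arbitrary settings to compare; each run would be followed by a labelling session.
for size, proportion in [(1500, 0.9), (5000, 0.5)]:
    prepare_with_sampling(stub, toy_data, size, proportion)

print(stub.calls)
```

Whether changing these settings actually improves the match/distinct balance is exactly what is being asked here, so this only shows how the comparison could be set up.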
Hello, I am struggling with the same problem on version 2.0.23. I tried to go a little further and stopped at 10 positives and 2000 negatives. My script is based on the pgsql_big_dedupe_example (hope it's up to date :), adapted to use Django's 3.2 ORM, as I plan to build an identity manager with Dedupe. My variables are very similar to @f-hafner's: I use distinct birth, last, first and middle names (all 'Strings'), a few others (birth date, place, country, ...), and interactions to boost the scores: dedupe_fields = [
It looks like only one field is eventually used as a predicate (in my case, the logger shows it's the birth date, defined either as DateTime or String), and of course that is not enough to efficiently dedupe my 315k entries. Some entities end up with members that have only the birth date in common.
Back on 2.0.13, with the same variable definitions, I stopped at 47/10 positive, 1000/10 negative, and the following predicates: With this, I end up with ~3000 entities (out of the ~28000 I'm supposed to find).
@fgregg since it looks like it works for you, could it come from something obvious I missed in the variable definitions? Is the training engine more efficient with birth/last/first/... names split into multiple variables, or with these kept in a single string? (The question may be valid for other variables too.) Or, since I have quite a lot of entries, does the training simply need a lot more samples, both with 2.0.13 and 2.0.23? Thanks for your answers. |
I have been using dedupe 2.0.6. Recently I ran into the KeyError issue with a dataset of 78,598 records. After I upgraded to version 2.0.17, the KeyError issue was resolved. However, while doing regression testing with 2.0.17 against the previous datasets, I noticed a dramatic memory increase from 300 MB to 8-10 GB, and twice as much time as version 2.0.6, during the deduper.prepare_training() call on my Windows machine for a dataset with 121,420 records (I have a Linux app service in Azure that I had to double the size for. I haven't measured the actual memory consumption yet so can't give the metrics at the moment). The more significant problem is that although there is better sampling according to this, my training session consistently ends up with 200-300 distincts and only 3-5 matches.
@fgregg, is this problem related solely to sampling, or have other things changed since 2.0.6 that are causing what I am experiencing, i.e. memory, perf and not enough matches during training? I have noticed that the old sampling code that caused the KeyError has been moved out of core.py to convenience.py and that new sampling code is now being used.
Thanks in advance. Love the great work of this project!