Clarify, implement and document the workflow for manual deduplication #208

Open
LukasWallrich opened this issue Sep 27, 2024 · 1 comment

Comments

@LukasWallrich
Collaborator

Currently, CiteSource lets users identify potential duplicates but not resolve them. We need to figure out how to re-export the relevant ASySD functions.

@TNRiley can you check if ASySD::dedup_citations_add_manual works with our dedup_citations() output?

@TNRiley
Collaborator

TNRiley commented Oct 9, 2024

Walking through this step by step. I'm including numbers based on the data in the new_benchmark_testing vignette so it's easy to follow and for me to remember.

If we run CiteSource::dedup_citations(raw_citations, manual = TRUE) and store the result as unique_citations_manual, we get a list with two elements:

  • unique_citations_manual$unique (499 rows, representing the 499 distinct records across all sources)
  • unique_citations_manual$manual_dedup (2 rows, representing 2 different sets of potential duplicates)

Then we run CiteSource::dedup_citations_add_manual(unique_citations_manual$unique, unique_citations_manual$manual_dedup).
There is currently an issue with the extra_merge_fields argument, but once extra_merge_fields = "cite_string" is removed from the CiteSource::dedup_citations_add_manual function, the call returns

  • a single dataframe with 497 rows (two fewer than the 499 above, because ASySD and CiteSource treat every pair passed as additional_pairs as a TRUE duplicate)
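
For reference, the two calls above could be wrapped up roughly as follows. This is only a sketch: summarise_manual_dedup is a hypothetical helper name, raw_citations is assumed to be the data frame built in the new_benchmark_testing vignette, and the second call will only succeed once the extra_merge_fields issue mentioned above is worked around.

```r
# Sketch only: wraps the two calls described above and reports row counts.
# `summarise_manual_dedup` is a hypothetical helper, not part of CiteSource.
summarise_manual_dedup <- function(raw_citations) {
  if (!requireNamespace("CiteSource", quietly = TRUE)) {
    stop("The CiteSource package is required for this sketch.")
  }

  # Step 1: deduplicate and keep the potential duplicates for manual review.
  res <- CiteSource::dedup_citations(raw_citations, manual = TRUE)

  # Step 2: pass ALL candidate pairs back in, as done in the walk-through.
  # Every pair passed as additional_pairs is merged as a TRUE duplicate.
  merged <- CiteSource::dedup_citations_add_manual(
    res$unique,
    additional_pairs = res$manual_dedup
  )

  c(
    unique     = nrow(res$unique),
    candidates = nrow(res$manual_dedup),
    merged     = nrow(merged)
  )
}
```

With the vignette data, the numbers reported in this comment would correspond to unique = 499, candidates = 2 and merged = 497.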

Therefore, for a user to deal with potential duplicates within CiteSource (in R at least; the Shiny app is okay), we would need to build a function or a table that shows the user the potential duplicates, records their decision for each pair, and passes only the pairs confirmed as TRUE duplicates to dedup_citations_add_manual as additional_pairs.
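
A minimal sketch of what that review step might look like, in base R only. Everything here is an assumption for illustration: the toy candidate_pairs data frame and its column names stand in for unique_citations_manual$manual_dedup (the real columns should be checked with names()), and confirm_pairs is a hypothetical helper, not an existing CiteSource or ASySD function.

```r
# Toy stand-in for unique_citations_manual$manual_dedup.
# Column names and values are hypothetical, made up for illustration.
candidate_pairs <- data.frame(
  record_id1 = c("101", "205"),
  record_id2 = c("342", "410"),
  title1 = c("Effects of X on Y", "A review of Z"),
  title2 = c("Effects of X on Y.", "A review of W"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep only the pairs the user flagged as TRUE duplicates.
# `decisions` is a logical vector with one entry per candidate pair.
confirm_pairs <- function(pairs, decisions) {
  stopifnot(
    is.logical(decisions),
    length(decisions) == nrow(pairs),
    !anyNA(decisions)
  )
  pairs[decisions, , drop = FALSE]
}

# In an interactive session the decisions could be collected from the user,
# e.g. with utils::menu(); they are hard-coded here so the sketch runs anywhere.
decisions <- c(TRUE, FALSE)
confirmed <- confirm_pairs(candidate_pairs, decisions)
print(confirmed)

# Only the confirmed pairs would then be passed on (not run here):
# CiteSource::dedup_citations_add_manual(
#   unique_citations_manual$unique,
#   additional_pairs = confirmed
# )
```

The point of the design is that the filtering happens before dedup_citations_add_manual is called, since that function merges every pair it receives.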

TNRiley added a commit that referenced this issue Oct 9, 2024