Error in computing Local Citations with Scopus Input #467

Open
jwhwaal opened this issue May 30, 2024 · 3 comments
Comments

@jwhwaal

jwhwaal commented May 30, 2024

I have noticed an error when computing Local Citations from a Scopus input file. The problem does not occur with a Web of Science input file. In the histNetwork function there is a filter, based on the PP field, that removes false positives. However, not all papers have a PP value (page numbers). For example, Journal of Cleaner Production uses only an article identifier: Van der Waal, Johannes WH, and Thomas Thijssens. "Corporate involvement in sustainable development goals: Exploring the territory." Journal of Cleaner Production 252 (2020): 119625.

It is here:

    CR <- CR %>%
      dplyr::filter(!is.na(PY), (substr(CR$PP,1,1) %in% 0:9))

or here:

    CR <- CR %>%
      left_join(M_merge, join_by("PY", "AU"), relationship = "many-to-many") %>%
      dplyr::filter(!is.na(Included)) %>%
      group_by(PY,AU) %>%
      mutate(toRemove = ifelse(!is.na(PP.y) & PP.x!=PP.y, TRUE,FALSE)) %>% # to remove FALSE POSITIVE
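To illustrate why a missing PP matters, here is a minimal base-R sketch of the first condition quoted above, applied to a toy table. The data values are invented for illustration, and the column names only mirror the quoted snippet; this is not the actual bibliometrix internals.

```r
# Toy cited-reference table; values are invented for illustration only.
CR <- data.frame(
  AU = c("SMITH J", "VAN DER WAAL J"),
  PY = c(2015, 2020),
  PP = c("101-110", NA),  # the second reference has no page range
  stringsAsFactors = FALSE
)

# Same condition as the first filter quoted above, written in base R.
# substr(NA, 1, 1) is NA, and NA %in% 0:9 is FALSE, so the row is dropped.
keep <- !is.na(CR$PY) & (substr(CR$PP, 1, 1) %in% 0:9)

print(CR[keep, ])  # only the reference with a page range survives
```

Under these assumptions, a reference without page numbers is removed before any matching is attempted, which would explain why it never counts as a local citation.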

@massimoaria
Owner

Unfortunately, Scopus has changed the way it stores references and, to date, there is no way to identify them uniquely (the reference string does not include the DOI!).
Using the PP field is just one 'good enough' strategy we chose to adopt; unfortunately, it does lead to errors in some cases. A different strategy would produce errors of its own.

We are of course open to accepting proposals on alternative strategies for identifying local citations in Scopus.

@jwhwaal
Author

jwhwaal commented Jun 6, 2024

Indeed, the reference does not include the DOI and so lacks a unique key. I have coded a solution that works quite well but is not fast: it extracts the title from each reference (assuming the title is the longest field in the string, which is most often the case) and then computes the similarity (cosine or Levenshtein) with the TI fields of the local corpus. I got very good results on my corpus. To make it faster and to avoid problems with truncated titles (which I had in one instance), the similarity matching could be run on truncated titles (e.g. the first 100 characters).
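A minimal sketch of the idea described above, not the code actually used. It assumes reference fields are separated by commas, uses normalised Levenshtein distance via `utils::adist`, and introduces hypothetical helper names (`extract_title`, `normalise`, `match_titles`) and an arbitrary threshold `max_dist`. The reference strings are illustrative, and the second one is made up.

```r
# Hypothetical helper: take the longest comma-separated field of a reference
# string as the title candidate (the heuristic described above).
extract_title <- function(ref) {
  parts <- trimws(strsplit(ref, ",", fixed = TRUE)[[1]])
  parts[which.max(nchar(parts))]
}

# Upper-case, drop punctuation, and truncate to the first n characters.
normalise <- function(x, n = 100) {
  substr(toupper(gsub("[^[:alnum:] ]", "", x)), 1, n)
}

# For each reference, return the index of the closest local title,
# or NA if the normalised Levenshtein distance exceeds max_dist.
match_titles <- function(refs, titles, max_dist = 0.1, n = 100) {
  ref_ti <- normalise(vapply(refs, extract_title, character(1), USE.NAMES = FALSE), n)
  loc_ti <- normalise(titles, n)
  d <- utils::adist(ref_ti, loc_ti)            # references x local titles
  denom <- outer(nchar(ref_ti), nchar(loc_ti), pmax)
  denom[denom == 0] <- 1                        # guard against empty strings
  d <- d / denom
  best <- apply(d, 1, which.min)
  ifelse(d[cbind(seq_along(best), best)] <= max_dist, best, NA_integer_)
}

refs <- c(
  "Van der Waal J.W.H., Thijssens T., Corporate involvement in sustainable development goals: Exploring the territory, Journal of Cleaner Production, 252, (2020)",
  "Doe J., An unrelated paper about something else entirely, Some Journal, 1, (2001)"
)
titles <- "Corporate involvement in sustainable development goals: exploring the territory"

res <- match_titles(refs, titles)
print(res)
```

The distance matrix has one cell per (reference, local title) pair, so the cost grows with the product of the two counts, which is consistent with the "not fast" remark above.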

@massimoaria
Owner

massimoaria commented Jun 20, 2024

Thanks, I appreciate your suggestion.
We have already tried this solution, but it is too computationally expensive for everyday use.
We also tried identifying publication years to divide the references into subgroups (by year) and then computing similarity only among titles with the same publication year.
That solution is faster than the previous one, but still unsatisfactory.
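For readers unfamiliar with this kind of grouping, a rough sketch of comparing titles only within the same publication year could look as follows. It is an assumption-laden illustration, not the maintainers' implementation: the function name `match_by_year`, the threshold `max_dist`, and the toy data are all hypothetical.

```r
# Hypothetical sketch: compare reference titles only with local titles
# that share the same publication year.
match_by_year <- function(ref_ti, ref_py, loc_ti, loc_py, max_dist = 0.1) {
  out <- rep(NA_integer_, length(ref_ti))
  years <- intersect(unique(ref_py), unique(loc_py))
  years <- years[!is.na(years)]
  for (yr in years) {
    r <- which(ref_py == yr)                   # references from this year
    l <- which(loc_py == yr)                   # local documents from this year
    d <- utils::adist(toupper(ref_ti[r]), toupper(loc_ti[l]))
    denom <- outer(nchar(ref_ti[r]), nchar(loc_ti[l]), pmax)
    denom[denom == 0] <- 1                     # guard against empty strings
    d <- d / denom
    best <- apply(d, 1, which.min)
    ok <- d[cbind(seq_along(best), best)] <= max_dist
    out[r[ok]] <- l[best[ok]]                  # index into the local corpus
  }
  out
}

# Toy data, invented for illustration.
ref_ti <- c("Exploring the territory", "Exploring the territory", "A different title")
ref_py <- c(2020, 2019, 2020)
loc_ti <- c("Exploring the territory", "Another local paper")
loc_py <- c(2020, 2021)

res <- match_by_year(ref_ti, ref_py, loc_ti, loc_py)
print(res)
```

Grouping by year replaces one large distance matrix with several smaller ones, but it relies on the year being parsed correctly from each reference: in the toy data, the second reference has the right title and a different year, so it is not matched.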
