Method #1

Closed
CooperSmout opened this issue Nov 21, 2018 · 11 comments

Comments

CooperSmout commented Nov 21, 2018

See README.md for background on the problem and goals of this project. This issue is for discussion / planning of the best way to measure 'support' in the academic community for a particular pledge/campaign. 'Support' will be quantified as the proportion of citations that reference articles (or other research outputs) produced by pledgers in the last X years (controlling for time since publication).

Tentative method:

  1. Acquire metadata for all articles (or book chapters, opinion pieces, etc.) published in the previous X years (using the Crossref API; see the sketch below)
  2. Sort into research fields using a journal ontology (e.g. Science Metrix), or another method if unavailable (e.g. the open reference list, if available on Crossref)
  3. Identify articles produced by members who have signed the pledge (using ORCID; members would need to register for an ORCID iD and maintain their ORCID profile). Only authorships in the nominated position/s will be counted (defaulting to first-author papers only, but pledgers can opt in for other authorship positions as well).
  4. Compute proportion of citations generated by pledgers, separately for each research field
  5. Display this data on the FOK website (www.freeourknowledge.org)
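
Roughly, step 1 could look something like the sketch below (Python, using the requests library against the public Crossref REST API; the date window and contact address are placeholders, not project settings):

```python
# Sketch of step 1: page through Crossref metadata for works published in a
# given window, using cursor-based deep paging.
import requests

CROSSREF_WORKS = "https://api.crossref.org/works"

def harvest_crossref_works(from_date, until_date, mailto):
    params = {
        "filter": f"type:journal-article,from-pub-date:{from_date},until-pub-date:{until_date}",
        "rows": 1000,            # maximum page size
        "cursor": "*",           # start deep paging
        "mailto": mailto,        # puts requests in Crossref's "polite" pool
    }
    while True:
        message = requests.get(CROSSREF_WORKS, params=params, timeout=60).json()["message"]
        items = message["items"]
        if not items:
            return
        yield from items
        params["cursor"] = message["next-cursor"]   # resume from the last page

# Example: DOI and incoming-citation count for the first few records
for i, work in enumerate(harvest_crossref_works("2016-01-01", "2018-12-31",
                                                "hello@example.org")):
    print(work["DOI"], work.get("is-referenced-by-count", 0))
    if i >= 4:
        break
```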

Limitation: fewer than 90% of papers have a DOI, and coverage varies wildly by field (it's particularly low in the humanities). Alternatively, we would likely get better coverage by scraping Google Scholar (e.g. https://github.com/alberto-martin/googlescholar), but that could introduce more errors and make author disambiguation harder (unless we require members to maintain a Google Scholar profile?).

Another limitation: me. I'm totally new to bibliometrics, so I'm looking for feedback on the method and, ideally, help developing the platform! (:

CooperSmout commented Nov 21, 2018

Thoughts, @dhimmel @greenelab? I've been drooling over the code in your Sci-Hub study since I found it (https://github.com/greenelab/scihub) :)

dhimmel commented Nov 21, 2018

'Support' will be quantified as the proportion of citations generated by articles (or other research outputs) that were written by pledgers in the last X years

I am unclear whether you are counting outgoing or incoming citations from/to pledgers' works. It sounds like you're counting outgoing citations... i.e. how many citations pledgers made in the past X years. However, isn't it more meaningful to count how many citations from the past X years are incoming to works by pledgers? I.e. if pledgers were to have published fully OA, what percent of all citations would have been redirected to OA works?

The tentative method looks good to me. You will need three things it seems:

  1. a catalog of all articles
  2. a catalog of all citations
  3. a mapping from pledging authors to their past papers

The best current resource for 1 is Crossref IMO. The best resource for 2 is the I4OC citations now available from Crossref (extracted versions available from several locations now). I agree the best resource for 3 is ORCID. While overall ORCID coverage of authors is low, pledgers are more likely to have filled out their ORCID profiles completely. I4OC citations are probably only 50% of all citations, since some publishers like Elsevier & ACS do not share. However, depending on the exact formulation of your metric, it may be OKAY not to have every citation.
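
To make that concrete, here is a toy sketch (Python, with entirely made-up DOIs and counts) of how those three ingredients could combine into the per-field proportion described in step 4:

```python
# Sketch of the support metric: (1) a catalog mapping DOI -> research field,
# (2) incoming-citation counts per DOI, and (3) the set of DOIs attributable
# to pledgers. All values below are toy data.
from collections import defaultdict

def support_by_field(field_of, citations_to, pledger_dois):
    """Proportion of citations in each field that point to pledger-authored works."""
    total = defaultdict(int)
    pledged = defaultdict(int)
    for doi, n_citations in citations_to.items():
        field = field_of.get(doi, "unknown")
        total[field] += n_citations
        if doi in pledger_dois:
            pledged[field] += n_citations
    return {f: pledged[f] / total[f] for f in total if total[f] > 0}

field_of = {"10.1/a": "neuroscience", "10.1/b": "neuroscience", "10.1/c": "history"}
citations_to = {"10.1/a": 40, "10.1/b": 60, "10.1/c": 10}
print(support_by_field(field_of, citations_to, pledger_dois={"10.1/a"}))
# -> {'neuroscience': 0.4, 'history': 0.0}
```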

@CooperSmout

I am unclear whether you are counting outgoing or incoming citations from/to pledgers' works

Sorry for the lack of clarity! Yes the idea is to count how many times articles authored by pledgers were cited (rather than the number of times pledging authors cited other articles), as a proxy for the 'impact' of the community of pledgers. I'll change the description above to make this clearer.

The best resource for 2 is the I4OC citations now available from Crossref (extracted versions available from several locations now).

I was thinking that we could just use the Crossref API, because it seems to include a citation count for articles (?) - e.g. if you punch in https://api.crossref.org/works/10.1016/j.neuron.2012.10.038 it spits out a field called "is-referenced-by-count" (equal to 523 for this doi). But I'm unsure where these numbers come from (perhaps the I4OC database?), and how accurate they are, because they're different to the citation counts listed on publishers' websites (which presumably come from Scopus or WoS).
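
For reference, pulling that field programmatically is straightforward (a small Python sketch; the count will of course drift over time):

```python
# Fetch the Crossref citation count ("is-referenced-by-count") for one DOI.
import requests

def crossref_citation_count(doi):
    r = requests.get(f"https://api.crossref.org/works/{doi}", timeout=30)
    r.raise_for_status()
    return r.json()["message"].get("is-referenced-by-count")

print(crossref_citation_count("10.1016/j.neuron.2012.10.038"))  # 523 when this comment was written
```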

depending on the exact formulation of your metric, it may be OKAY not to have every citation

Agreed, the main thing is to ensure that the metric is a fair representation of the broader community, so if the citation counts are relatively accurate they should be good enough. I'm more concerned about the low uptake of DOIs in some fields (e.g. 60% in the humanities; https://doi.org/10.1016/j.joi.2015.11.008) and how this might skew the results if we use the Crossref/DOI method.

Vinnl commented Jan 10, 2019

I just came across this issue, and wanted to give you a heads-up on some potential problems (for which I unfortunately do not have solutions):

  1. Not everything that has a DOI has a CrossRef DOI. This might not be too bad since the vast majority of them do, but there are a few other registration agencies like DataCite: https://www.doi.org/demos.html.
  2. Only a small subset of CrossRef metadata also includes the authors' ORCID iDs.
  3. When they do include ORCID iDs, these are not necessarily verified by the publisher. Often, the corresponding author simply enters the iDs of the other authors themselves.
  4. It sometimes even happens that authors register ORCID iDs for their co-authors themselves, leading to duplication. I'm not sure how often this happens, though.
  5. Researchers usually do not update their ORCID record to include their publications, and publishers also do not always feed that info back into ORCID, so the other way around is probably not an option either.

Point 2 is most problematic, I'd guess.

sckott commented Jan 11, 2019

Not everything that has a DOI has a CrossRef DOI.

true, but you can use content negotiation (https://crosscite.org/docs.html) to resolve metadata for any DOI, regardless of which agency registered it
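
e.g. a content-negotiation request is just a normal GET against doi.org with an Accept header (Python sketch; CSL JSON is one of several formats the resolvers support):

```python
# DOI content negotiation: ask doi.org for CSL JSON, which works for Crossref,
# DataCite and mEDRA DOIs alike (see https://crosscite.org/docs.html).
import requests

def doi_csl_json(doi):
    r = requests.get(
        f"https://doi.org/{doi}",
        headers={"Accept": "application/vnd.citationstyles.csl+json"},
        timeout=30,
    )
    r.raise_for_status()
    return r.json()

meta = doi_csl_json("10.1016/j.neuron.2012.10.038")  # the Crossref DOI discussed above
print(meta.get("title"), "/", meta.get("publisher"))
```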

I was thinking that we could just use the Crossref API, ... But I'm unsure where these numbers come from (perhaps the I4OC database?), and how accurate they are, because they're different to the citation counts listed on publishers' websites (which presumably come from Scopus or WoS).

AFAIK they are from internal Crossref data, not from I4OC. Yes, I'd expect them to be different from WoS and Scopus. There's also the OpenURL Crossref service; see rcrossref::cr_citation_count for the internals of how that's done. It does seem to match up for at least one DOI I checked: http://api.crossref.org/works/10.7888/juoeh.6.265 vs. rcrossref::cr_citation_count('10.7888/juoeh.6.265')

Scraping Google Scholar is tough. I've seen blog posts describing how it can be done, but I think it's pretty painful since they attempt to block any automated usage.

Seems like matching authors to authors of papers might be difficult when there is no ORCID, yes?

CooperSmout commented Jan 14, 2019

2. Only a small subset of CrossRef metadata also includes the authors' ORCID iDs.

Seems like matching authors to authors of papers might be difficult when there is no ORCID, yes?

Yes, this is the primary limitation of the above method - authors would need to keep their ORCID profile up to date (but note that to calculate the % of support we only need pledging authors, and not all authors, to be up to date).

But it seems to me that whatever alternative method we might adopt (including scraping Google Scholar), there would still need to be some kind of verification process (e.g. automated emails like what ResearchGate uses) to check that the publications we attribute to pledging authors are actually theirs, or else we might trigger thresholds based on false data. So rather than building new lists of publications just for this project, I figured it would be simpler and less error-prone to just integrate with ORCID and request anyone who pledges to keep their profile up to date. On the plus side, pledging authors are more likely to already have an ORCID, and we can also make the case that it's good for your career to keep your ORCID profile up to date.
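
For what it's worth, pulling the DOIs off a pledger's public ORCID record looks roughly like this (a sketch against the ORCID public API v3.0; the iD shown is ORCID's long-standing example record, not a pledger):

```python
# List the DOIs attached to a public ORCID record (ORCID public API v3.0).
import requests

def orcid_dois(orcid_id):
    r = requests.get(
        f"https://pub.orcid.org/v3.0/{orcid_id}/works",
        headers={"Accept": "application/json"},
        timeout=30,
    )
    r.raise_for_status()
    dois = set()
    for group in r.json().get("group", []):
        for ext_id in group.get("external-ids", {}).get("external-id", []):
            if ext_id.get("external-id-type") == "doi":
                dois.add(ext_id.get("external-id-value", "").lower())
    return dois

print(orcid_dois("0000-0002-1825-0097"))  # ORCID's example record (Josiah Carberry)
```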

@sckott @Vinnl it seems to me that pledging authors keeping their profiles current would (mostly?) resolve the noted problems - would you agree? Or do you think there's a better approach?

@CooperSmout

Not everything that has a DOI has a CrossRef DOI.

Indeed, another limitation. I found this thread by @dhimmel useful here: greenelab/crossref#3 - it references a study showing that ~99% of Wikipedia citations reference articles with a Crossref DOI. If those stats are comparable for the scholarly literature (I'm not sure whether anyone has studied this?), we should be fine, because we're only really interested in articles that actually get cited.

true, but you can use content negotation https://crosscite.org/docs.html to throw any DOI to resolve URLs

Wow, this is cool! As I understand it, content negotiation can reveal metadata from Crossref, DataCite, and mEDRA (currently). Do you know if those other services also track citations? If so, we could just use this approach to access citation counts from all DOI services, rather than restricting ourselves to Crossref...

Vinnl commented Jan 14, 2019

rather than building new lists of publications just for this project

That definitely does not sound like the way to go, no.

I figured it would be simpler and less error-prone to just integrate with ORCID and request anyone who pledges to keep their profile up to date.

I agree that that's probably the most viable approach - I'm not sure how realistic it is, but it's probably the most realistic of all the options. Do keep in mind, though, that people who want to artificially inflate support numbers could simply add the most-cited DOIs to their ORCID profile. I'm not sure whether that would really happen, but it's good not to be taken by surprise if it does.

content negotiation can reveal metadata from Crossref, DataCite, and mEDRA (currently). Do you know if those other services also track citations?

Unfortunately, as far as I know they don't. But I guess you're right - just limiting yourself to CrossRef DOIs and using the CrossRef API is probably fine, if you consider it to be a sample of support. (Depending, of course, on the citation data in the CrossRef API being relatively complete.)

@CooperSmout

people who would want to artificially inflate support numbers could simply add the most-cited DOIs to their ORCID profile

Good point. I just checked, and ORCID allowed me to add a highly cited DOI to my record, despite none of the authors having my name. So you're right, this could allow people to game the system, but we should be able to solve it with a simple check that the pledger's name is contained in the author list.
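
A check like that could start out as simple as comparing the pledger's family name against the Crossref author metadata for the claimed DOI (rough sketch below; real matching would need to handle initials, diacritics, name changes, etc.):

```python
# Crude sanity check: does a given family name appear in the Crossref
# author list for a claimed DOI?
import requests

def name_in_author_list(doi, family_name):
    work = requests.get(f"https://api.crossref.org/works/{doi}", timeout=30).json()["message"]
    return any(
        author.get("family", "").lower() == family_name.lower()
        for author in work.get("author", [])
    )

# Should be False unless the name happens to appear on that paper
print(name_in_author_list("10.1016/j.neuron.2012.10.038", "Smout"))
```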

@CooperSmout

Just an update to the proposed method: I'm now planning to use Dimensions (https://www.dimensions.ai) rather than Science Metrix to distinguish research fields (i.e. step 2), because it classifies articles at the article level (using machine learning... cool :)) rather than at the journal level (which would complicate things for multidisciplinary journals and/or authors).

@CooperSmout

For the record, I'm closing this issue and porting any future discussions to the discussion repository, so that this repo can be used exclusively for platform code-related issues.
