Availability of a minimalist DOI citation graph #21

Open
dhimmel opened this issue Aug 15, 2017 · 4 comments

dhimmel commented Aug 15, 2017

Greetings, a while ago I posed issue #1 about downloading the OpenCitations network. Great to see http://opencitations.net/download is now available! Congrats on the milestone.

At the moment, I'm looking for a minimalist encoding of the DOI citation network. The most basic format I can think of would be tabular like:

source cited
10.1371/journal.pcbi.1004259 10.1111/j.2041-210X.2010.00012.x
10.1371/journal.pcbi.1004259 10.1002/ana.22609
10.7287/peerj.preprints.3100v1 10.1007/s11192-016-2225-6

The first row indicates that 10.1371/journal.pcbi.1004259 cites 10.1111/j.2041-210X.2010.00012.x.
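
For anyone wanting to consume a table like this, here is a minimal Python sketch of parsing it into citation pairs. The function name `parse_citation_rows` is hypothetical, and the sketch assumes whitespace-delimited rows with the `source cited` header shown above and DOIs that contain no whitespace. Lowercasing is an added choice (DOIs are case-insensitive), not something the proposal specifies.

```python
from typing import Iterable, Iterator, Tuple


def parse_citation_rows(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (source_doi, cited_doi) pairs from whitespace-delimited rows."""
    rows = iter(lines)
    header = next(rows, "").split()
    if header != ["source", "cited"]:
        raise ValueError(f"unexpected header: {header}")
    for line in rows:
        if not line.strip():
            continue  # skip blank lines
        source, cited = line.split()
        # DOIs are case-insensitive; lowercase so identical DOIs compare equal.
        yield source.lower(), cited.lower()


# The three example rows from the proposal above.
SAMPLE = """\
source cited
10.1371/journal.pcbi.1004259 10.1111/j.2041-210X.2010.00012.x
10.1371/journal.pcbi.1004259 10.1002/ana.22609
10.7287/peerj.preprints.3100v1 10.1007/s11192-016-2225-6
"""

edges = list(parse_citation_rows(SAMPLE.splitlines()))
```

Each element of `edges` is a `(source, cited)` tuple, which is enough to load the network into a graph library or a database table.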

Do the OpenCitations downloads easily expose the DOI citation network? Is this table something you would consider adding to the OpenCitations release pipeline? I suspect many users care only about this information and can forgo lots of complexity.

dhimmel added a commit to greenelab/opencitations that referenced this issue Aug 17, 2017

dhimmel commented Aug 17, 2017

DOI Citation Catalog

I created a repository for processing the OpenCitations figshare datasets: greenelab/opencitations. From the 2017-07-25 release (specifically the corpus_id and corpus_br components), I created a TSV of DOI-to-DOI citations as proposed above. It's available from the file citations-doi.tsv.xz.
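
A minimal sketch of streaming such a file in Python without decompressing it to disk first. It assumes the file is tab-delimited with a single header row and exactly two columns per record (citing DOI, then cited DOI); the actual column names in citations-doi.tsv.xz are not stated here, so the header is skipped by position rather than by name. `read_doi_citations` is a hypothetical helper.

```python
import csv
import lzma
from typing import Iterator, Tuple


def read_doi_citations(path: str) -> Iterator[Tuple[str, str]]:
    """Stream (citing_doi, cited_doi) pairs from an xz-compressed TSV."""
    with lzma.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        # QUOTE_NONE: treat quote characters inside DOIs as ordinary text.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        next(reader, None)  # assumption: the first row is a header
        for row in reader:
            if len(row) == 2:  # ignore malformed or empty rows
                yield row[0], row[1]
```

Because the function is a generator, the full set of citations never has to be held in memory at once.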

Here are the stats we generated for this dataset:

7,574,387 total DOI-to-DOI citations
203,264 DOIs with outgoing DOI citations
3,946,611 DOIs with incoming DOI citations
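
For reference, counts of this kind can be reproduced from any list of citation pairs with a few set operations. This is a sketch rather than the code that produced the numbers above: `citation_stats` is a hypothetical name, and deduplicating repeated pairs is an assumption about how the totals were counted. The sample pairs are the three rows from the proposed table.

```python
from typing import Dict, Iterable, Tuple


def citation_stats(edges: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Count distinct citations, citing DOIs, and cited DOIs."""
    unique_edges = set(edges)  # assumption: repeated pairs count once
    return {
        "citations": len(unique_edges),
        "citing_dois": len({source for source, _ in unique_edges}),
        "cited_dois": len({cited for _, cited in unique_edges}),
    }


edges = [
    ("10.1371/journal.pcbi.1004259", "10.1111/j.2041-210X.2010.00012.x"),
    ("10.1371/journal.pcbi.1004259", "10.1002/ana.22609"),
    ("10.7287/peerj.preprints.3100v1", "10.1007/s11192-016-2225-6"),
]
print(citation_stats(edges))
# {'citations': 3, 'citing_dois': 2, 'cited_dois': 3}
```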

I was surprised that references are only available for ~200,000 articles. Why is this number so low? Does Crossref possess references for more articles (which they now return via their API) or is Crossref a downstream user of OpenCitations?

Also, I didn't see the purpose of using Disk ARchive (dar) on the data exports. The figshare files are already zipped, so what's the purpose of this extra archiving step, which creates a dependency on the antiquated dar program?

@essepuntato
Owner

Hi @dhimmel

Thanks for this. I think it is incredibly useful indeed. I've already tweeted about it from the OpenCitations Twitter account:

https://twitter.com/opencitations/status/900609593998544896

In the coming months, after the launch of the new infrastructure, I would like to include your script within the OpenCitations repository, if you are fine with it, so as to release such information on a monthly basis, as highlighted in this issue. What do you think?

Coming to your questions:

  1. The roughly 200,000 are the citing articles that have been processed so far, all contained in the PubMed Central Open Access datasets. Crossref does indeed have a larger number of reference lists available now (thanks to I4OC), but we want to wait for the new OpenCitations infrastructure before starting to gather information from there as well. Currently we use the Crossref API to retrieve additional metadata about all the citing/cited articles, in particular: title, subtitle, identifiers (e.g. DOI, ISSN, ISBN, URL, and Crossref member URL), author list, publisher, container resources (issue, volume, journal, book, etc.), publication year, and pages. In addition, we use their API to disambiguate bibliographic resources and agents by means of the identifiers retrieved.

  2. Using DAR to package items is very useful for backups, since it also allows us to implement a daily incremental backup. However, I see the issue in terms of accessibility. To this end, we plan to also expose dumps in n-quads format monthly (see Publish montly n-quad dumps #16). To date, we have only experimented with this by publishing on Figshare (https://doi.org/10.6084/m9.figshare.5147068) the zipped n-quads version of the full corpus from the April 2017 dump. Once the new infrastructure is up and running, some changes may be possible. I am not sure we will abandon DAR, though, since it works quite well for incremental backups. But this is something that we should discuss in the coming months.
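
On the Crossref point, a rough Python sketch of how reference DOIs could be pulled out of a Crossref work record. It assumes the public REST endpoint `https://api.crossref.org/works/{doi}` and that, when a publisher has opened its reference lists, the record's `message` carries a `reference` array whose entries may include a `DOI` key. The helper names and the sample record below are illustrative and hand-written, not real API output, and this is not the OpenCitations pipeline.

```python
from typing import Any, Dict, List
from urllib.parse import quote

CROSSREF_WORKS = "https://api.crossref.org/works/"


def crossref_work_url(doi: str) -> str:
    """Build the works URL for a DOI, keeping '/' unescaped in the path."""
    return CROSSREF_WORKS + quote(doi, safe="/")


def reference_dois(message: Dict[str, Any]) -> List[str]:
    """Return the lowercased DOIs found in a work record's reference list."""
    return [
        ref["DOI"].lower()
        for ref in message.get("reference", [])
        if "DOI" in ref  # many references carry only unstructured text
    ]


# Hand-written stand-in for the "message" part of a Crossref response.
sample_message = {
    "DOI": "10.1371/journal.pcbi.1004259",
    "reference": [
        {"key": "ref1", "DOI": "10.1111/j.2041-210X.2010.00012.x"},
        {"key": "ref2", "unstructured": "A reference without a DOI."},
        {"key": "ref3", "DOI": "10.1002/ana.22609"},
    ],
}

url = crossref_work_url(sample_message["DOI"])
cited = reference_dois(sample_message)
```

Pairing the record's own DOI with each entry of `cited` gives rows in the same source/cited shape as the table proposed at the top of this issue.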

dhimmel commented Aug 24, 2017

> after the launch of the new infrastructure, I would like to include your script within the OpenCitations repository

That would be great! greenelab/opencitations is released under CC0, so it can be used anywhere. When initially copying the code over, I'd appreciate it if you set the git commit author to:

--author="Daniel Himmelstein <[email protected]>"

@davidshotton
Collaborator

See also Issue 7 about incorrect DOIs.
