-
New fields corresponding to NiH project outputs, which are mainly IDs that could be linked to other open datasets:
terms_id_clinicalTrial
: Clinical trial IDs for this project that can be linked directly tohttps://www.clinicaltrials.gov/ct2/show/{clinical_trial_id}
terms_id_pubmedId
: PubMed IDs for this project that can be direclty linked tohttps://pubmed.ncbi.nlm.nih.gov/{pubmed_id}
terms_title_patent
andterms_id_patent
: Patent titles and IDs. Would require linking to patstat for more info.
-
Deduplication according to strategy implemented in this PR, which helps explain the following new fields for the format
terms_{X}_project
withX
=exactDupeId
,exactDupeTitle
: refering to IDs and titles of projects which we have removed from the dataset for being exact duplicates. Only distinct titles are provided, and any funding allocations, start and end dates and funding are incorporated into the deduplicated recorded in the logical way (earliest start date, latest end date, sum of funding)nearDupeId
: referring to IDs of projects with similarity >= 0.8verySimilarId
: referring to IDs of projects with 0.8 > similarity >= 0.65similarId
: referring to IDs of projects with 0.65 > similarity >= 0.4
For example:
- this project has several exact dupes and one near dupe, which itself has one exact dupe
- this project has one very similar but non-identical project
- this project and this project are fairly similar
-
An additional form of deduplication was discovered as discussed in this PR, which says that what NiH tell you is a primary key isn't really a primary key. Instead we impute a primary key based on the core ID as per the PR and consolidate projects accordingly.
-
mesh terms are no longer supported as discussed here, here and elsewhere.
-
The curation of
json_funding_project
is dealt with more gracefully and so the sub-schema (i.e. in the mapping) has been updated from e.g.:{ "year": 1988, "cost_ref": 3654295, "start_date": 2011-07-31, "end_date": 1988-04-01 }
to
{ "year": 1988, "total_cost": 3654295, "project_end": "2011-07-31", "project_start": "1988-04-01" }
-
Cleaning steps implemented in this PR:
- Split
;
terms
into an array - CAPS --> Camel Case
- Address the bad dq issue highlighted in #51
- Check Greek characters (etc) parse ok
- Check question marks (bad unicode parsing) fixed
"NULL"
,""
,"N/A"
,[]
-->null
- Split
This repository has been archived by the owner on Aug 13, 2021. It is now read-only.