Diseases at jensen lab #1107
base: master
update `associationSource` to `associationType` and the names for the associated enum appropriately; update output csv file names; update checks for icd10 code dcids and update references to these links
change csv and tmcf file names to `experiment.*` and update `associationSource` to `associationType`
Update property names
fix links to associationType values
Update property names and name of referencing csv + tmcf file pair
….tmcf Update property names and the naming of the csv + tmcf pair files
fix links to NonCodingRNATypeEnum
…ng.tmcf Update property names and the names for the tmcf and csv file pair
…xtMining.tmcf Update property names and the file names for the csv + tmcf pair
update tmcf filepaths
add link to run.sh file
update output csv file names
Add commands to combine codingGenes-textMining csv files into a single csv
fix malformed tmcf line
fix link bug
update the script so that it downloads, cleans and formats the data in CSV, and removes the original files
Add additional notes and caveats
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request. |
As you can see there is a cascading representation of the associated ICD10 codes of 'ICD10:N', 'ICD10:N0', 'ICD10:N04' and a second tree of 'ICD10:N3', 'ICD10:N39', 'ICD10:N399'. 'ICD10:N', 'ICD10:N0', 'ICD10:N3', and 'ICD10:root' do not correspond to any ICD10 codes, so these lines were removed along with any other line in which the text following 'ICD10:' was 'root' or only one or two characters long. Then, for this particular association, the lowest unique tree leaves were taken as associations with the Gene 'HSP86'. In this case those are 'ICD10:N04' and 'ICD10:N399'. The remaining line with 'ICD10:N39' was discarded as a less specific referral than 'ICD10:N399'. Finally, the ICD10 codes were reformatted as necessary so that they follow the proper convention, in which a '.' follows the leading characters matched by the regex [A-Z][0-9][0-9]. So, codes like 'ICD10:N399' were converted into the appropriate format of 'ICD10:N39.9' through insertion of the missing '.'.
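The cleaning steps described in this README paragraph can be sketched roughly as follows. This is a minimal illustration and not the code from the PR: the helper names `is_real_code`, `most_specific`, and `add_dot` are hypothetical, and the sketch assumes every valid code is a letter followed by at least two digits.

```python
import re


def is_real_code(code):
    """True if a letter plus at least two digits follow 'ICD10:' (assumed rule)."""
    return re.fullmatch(r"ICD10:[A-Z][0-9]{2,}", code) is not None


def most_specific(codes):
    """Keep only the codes that are not a prefix of another, longer code."""
    return [c for c in codes if not any(o != c and o.startswith(c) for o in codes)]


def add_dot(code):
    """Insert '.' after the three-character category, e.g. N399 -> N39.9."""
    prefix, value = code.split(":", 1)
    return f"{prefix}:{value[:3]}.{value[3:]}" if len(value) > 3 else code


# The example association for gene 'HSP86' from the paragraph above.
raw = ["ICD10:root", "ICD10:N", "ICD10:N0", "ICD10:N04",
       "ICD10:N3", "ICD10:N39", "ICD10:N399"]
kept = [c for c in raw if is_real_code(c)]
leaves = [add_dot(c) for c in most_specific(kept)]
print(leaves)  # ['ICD10:N04', 'ICD10:N39.9']
```

Treating "less specific" as "is a string prefix of another code" is a simplification that happens to match the example; the actual script may decide this differently.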
The DISEASES datasource is composed of three datasets that were generated using three distinct methods: experiment (experimental data), knowledge (manual curation), and textmining. The specific dataset from which the data originates (i.e. 'experiments', 'knowledge', or 'textmining') is indicated by the providence associated with each property value in the corresponding Disease, DiseaseGeneAssociation, Gene, or ICD10Code node. Also note that each DiseaseGeneAssociation links a Gene to a Disease represented by either a DOID or an ICD10Code. In either case, the link is stored in the "diseaseID" property, which points to the corresponding Disease node (for a DOID) or ICD10Code node (for an ICD10 code).
is "providence" supposed to be "provenance"?
### dcid Generation
Dcids for DiseaseGeneAssociation nodes were generated as follow either:
remove "follow"?
'bio/DOID_<trailing_DOID>_<geneSymbol>
please fix formatting, everything is on one line when viewing the markdown
'bio/ICD10_<trailing_ICD10Code>_<geneSymbol>
where the <DOID> and <trailing_ICD10Code> represent the id following the ':', <geneSymbol> represents the Gene's gene symbol. For example: `bio/DOID_0050177_SEMA3F` and `bio/DOID_0050736_SEMA3F`.
should be <trailing_DOID>? to be consistent with line 54
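The dcid scheme under review can be illustrated with a short sketch. The function name `association_dcid` is hypothetical and not taken from the PR, and passing the trailing ICD10 code through unchanged (including any '.') is an assumption.

```python
def association_dcid(disease_id, gene_symbol):
    """Build a DiseaseGeneAssociation dcid from 'DOID:...' or 'ICD10:...' plus a gene symbol."""
    source, trailing = disease_id.split(":", 1)
    if source not in ("DOID", "ICD10"):
        raise ValueError(f"unsupported disease id: {disease_id}")
    # Assumption: the part after ':' is used as-is in the dcid.
    return f"bio/{source}_{trailing}_{gene_symbol}"


print(association_dcid("DOID:0050177", "SEMA3F"))  # bio/DOID_0050177_SEMA3F
```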
Generate the cleaned CSVs, splitting coding and non-coding genes into separate CSV files for each input file:
```bash
sh run.sh
```
nit: naming this script run is slightly confusing because when a script is called "run", I would expect it to be a script that does everything and to be the only script I need to run. Maybe call this "process" or "clean" or "generate_csvs" or just something more specific?
def check_for_dcid(row):
    check_for_illegal_charc(str(row['dcid']))
nit: can make this a for-loop like
for cell in [row['dcid'], row['DOID_dcid'], ...]:
    check_for_illegal_charc(str(cell))
        print('Error! dcid contains illegal characters!', s)


def check_for_dcid(row):
should we exit if there are illegal characters? or are illegal characters ok and you just want to see printed statements?
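Both review suggestions on `check_for_dcid` (looping over the dcid columns, and deciding whether to stop on illegal characters) could look something like the sketch below. It is not the PR's implementation: `has_illegal_chars` and `check_row` are hypothetical names, and the character pattern is a placeholder because the PR excerpt does not show which characters count as illegal.

```python
import re
import sys

# Assumption: anything outside letters, digits, '_', '/', '.', and '-' is illegal.
ILLEGAL = re.compile(r"[^A-Za-z0-9_/.\-]")


def has_illegal_chars(value):
    """True if the value contains a character outside the assumed allowed set."""
    return ILLEGAL.search(value) is not None


def check_row(row, columns=("dcid", "DOID_dcid")):
    """Return the names of the columns whose values contain illegal characters."""
    return [col for col in columns
            if col in row and has_illegal_chars(str(row[col]))]


bad = check_row({"dcid": "bio/DOID_0050177_SEMA3F", "DOID_dcid": "bio/DOID 0050177"})
if bad:
    print("Error! dcid contains illegal characters in:", bad, file=sys.stderr)
    # Calling sys.exit(1) here would stop the import instead of only logging.
```

Returning the offending column names keeps the choice between logging and exiting with the caller, which is the question the reviewer raises.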
def clean_data(df, data_type):
    df_tm = df
curious what the tm stands for
    df_tm = df_tm[~df_tm['Gene'].str.contains("ENSP00")]
    df = format_dcids(df, data_type)
    df_tm = format_dcids(df_tm, data_type)
    df_tm = format_RNA_type(df_tm) ## filter out genes from df with non coding RNA
nit: does the function format_RNA_type do filtering? if so, please add a comment on the function because that's not clear from the name of the function
    df = df.dropna(axis=1, how='all') # Drop columns with all NA values from df
    df_gene = df_gene.dropna(axis=1, how='all') # Drop columns with all NA values from df_gene
    df = pd.concat([df, df_gene], ignore_index=True)
    df = pd.concat([df, df_gene], ignore_index=True)
this line is the exact same as 196, did you want to do the same thing twice?
This adds all the documentation regarding the DISEASES by JensenLab import. This supersedes PR #998.