Diseases at jensen lab #1107

Open
wants to merge 44 commits into
base: master

spiekos

@spiekos spiekos commented Nov 2, 2024

This adds all the documentation regarding the DISEASES by JensenLab import. This supersedes PR #998.

Suhana Bedi and others added 30 commits August 18, 2023 17:37
update `associationSource` to `associationType` and the names for the associated enum appropriately; update output csv file names; update checks for icd10 code dcids and update references to these links
change csv and tmcf file names to `experiment.*` and update `associationSource` to `associationType`
Update property names
fix links to associationType values
Update property names and name of referencing csv + tmcf file pair
….tmcf

Update property names and the naming of the csv + tmcf pair files
fix links to NonCodingRNATypeEnum
…ng.tmcf

Update property names and the names for the tmcf and csv file pair
…xtMining.tmcf

Update property names and the file names for the csv + tmcf pair
update tmcf filepaths
add link to run.sh file
update output csv file names
Add commands to combine codingGenes-textMining csv files into a single csv
fix malformed tmcf line
update the script so that it downloads, cleans and formats the data in CSV, and removes the original files
@spiekos spiekos requested review from beets and chejennifer November 2, 2024 01:33
google-cla bot commented Nov 2, 2024

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.


As you can see there is a cascading representation of the associated ICD10 codes: one tree of 'ICD10:N', 'ICD10:N0', 'ICD10:N04' and a second tree of 'ICD10:N3', 'ICD10:N39', 'ICD10:N399'. 'ICD10:N', 'ICD10:N0', 'ICD10:N3', and 'ICD10:root' do not correspond to any ICD10 codes, so these lines were removed, along with any other line in which 'ICD10:' was followed by 'root' or by only one or two characters. Then, for this particular association, the lowest unique tree leaves were taken as associations with the Gene 'HSP86'. In this case those are 'ICD10:N04' and 'ICD10:N399'. The remaining line with 'ICD10:N39' was discarded as a less specific referral than 'ICD10:N399'. Finally, the ICD10 codes were reformatted as necessary to follow the proper convention, in which a '.' follows the leading characters matched by the regex [A-Z][0-9][0-9]. So, codes like 'ICD10:N399' were converted into the appropriate format 'ICD10:N39.9' by inserting the missing '.'.
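
The filtering described in this paragraph can be sketched roughly as follows. This is a minimal illustration and not the PR's actual code: the function name `clean_icd10_codes` is hypothetical, and it assumes the raw codes arrive without the '.' and that a code is "less specific" when it is a strict prefix of another code for the same gene.

```python
import re

PREFIX = "ICD10:"


def clean_icd10_codes(codes):
    """Return the most specific ICD10 codes for one gene, with the '.' inserted.

    Hypothetical helper: assumes input like 'ICD10:N399' (no dot yet).
    """
    # Strip the prefix and keep only bodies that look like real codes:
    # a letter, two digits, then optional further characters. This drops
    # 'root' and one- or two-character stubs such as 'N' or 'N0'.
    bodies = {c[len(PREFIX):] for c in codes if c.startswith(PREFIX)}
    bodies = {b for b in bodies if re.fullmatch(r"[A-Z][0-9][0-9][0-9A-Z]*", b)}

    # Keep only the leaves: discard a code if a longer code extends it.
    leaves = [
        b for b in bodies
        if not any(other != b and other.startswith(b) for other in bodies)
    ]

    # Insert the '.' after the first three characters when more follow.
    formatted = [b if len(b) == 3 else f"{b[:3]}.{b[3:]}" for b in leaves]
    return sorted(PREFIX + b for b in formatted)


example = ["ICD10:root", "ICD10:N", "ICD10:N0", "ICD10:N04",
           "ICD10:N3", "ICD10:N39", "ICD10:N399"]
print(clean_icd10_codes(example))  # ['ICD10:N04', 'ICD10:N39.9']
```

Treating "less specific" as a prefix test keeps the sketch short; the real script may determine the tree structure differently.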

The DISEASES datasource is composed of three datasets that were generated using three distinct methods: experiment (experimental data), knowledge (manual curation), and textmining. The specific dataset from which the data is from (i.e. 'experiments', 'knowledge', or 'textmining') is indicated by the providence associated with each property value in the corresponding Disease, DiseaseGeneAssociation, Gene, or ICD10Code node. Also note that for each DiseaseGeneAssociation the link is between a Gene and a Disease represented by either a DOID or ICD10Code. Regardless of which the link is stored in the "diseaseID" property, which points to the corresponding Disease (in cases of DOID) or ICD10Code (in cases of ICD10 code) nodes.

is "providence" supposed to be "provenance"?


### dcid Generation

Dcids for DiseaseGeneAssociation nodes were generated as follow either:

remove "follow"?

### dcid Generation

Dcids for DiseaseGeneAssociation nodes were generated as follow either:
'bio/DOID_<trailing_DOID>_<geneSymbol>

please fix formatting, everything is on one line when viewing the markdown

Dcids for DiseaseGeneAssociation nodes were generated as follow either:
'bio/DOID_<trailing_DOID>_<geneSymbol>
'bio/ICD10_<trailing_ICD10Code>_<geneSymbol>
where the <DOID> and <trailing_ICD10Code> represent the id following the ':', <geneSymbol> represents the Gene's gene symbol. For example: `bio/DOID_0050177_SEMA3F` and `bio/DOID_0050736_SEMA3F`.

should be <trailing_DOID>? to be consistent with line 54
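
For reference, the dcid scheme in the hunk above could be sketched like this. The function name `association_dcid` is hypothetical and the PR's own implementation is not shown here. The sketch assumes the disease ID is given as `DOID:<id>` or `ICD10:<code>` and that the part after the ':' is used unchanged (how a '.' in an ICD10 code is handled is not specified in the text).

```python
def association_dcid(disease_id, gene_symbol):
    """Build a DiseaseGeneAssociation dcid from a disease ID and gene symbol.

    Hypothetical helper illustrating the 'bio/<source>_<trailing id>_<geneSymbol>'
    pattern described in the README hunk.
    """
    # Split 'DOID:0050177' into the source ('DOID') and the trailing id.
    source, sep, trailing = disease_id.partition(":")
    if not sep or source not in ("DOID", "ICD10") or not trailing:
        raise ValueError(f"Unexpected disease ID: {disease_id!r}")
    return f"bio/{source}_{trailing}_{gene_symbol}"


print(association_dcid("DOID:0050177", "SEMA3F"))  # bio/DOID_0050177_SEMA3F
```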

Generate the cleaned CSVs, splitting coding and non-coding genes into separate CSV files for each input file:

```bash
sh run.sh
```

nit: naming this script run is slightly confusing because when a script is called "run", I would expect it to be a script that does everything and to be the only script I need to run. Maybe call this "process" or "clean" or "generate_csvs" or just something more specific?



def check_for_dcid(row):
    check_for_illegal_charc(str(row['dcid']))

nit: can make this a for-loop like

for cell in [row['dcid'], row['DOID_dcid'], ...]:
    check_for_illegal_charc(str(cell))

print('Error! dcid contains illegal characters!', s)


def check_for_dcid(row):

should we exit if there are illegal characters? or are illegal characters ok and you just want to see printed statements?
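
One way the two suggestions on `check_for_dcid` (loop over the dcid cells, and fail instead of only printing) might look is sketched below. It is illustrative only: the `ILLEGAL_CHARS` set and the return value of `check_for_illegal_charc` are assumptions, since the PR's own definitions are not shown in these hunks, and the column list would need to be filled in with the real dcid columns (the comment above names `'dcid'` and `'DOID_dcid'`).

```python
# Assumed set of characters that are not allowed in a dcid (hypothetical).
ILLEGAL_CHARS = set(" '\"*<>;|")


def check_for_illegal_charc(s):
    """Return the sorted illegal characters found in s (empty list if none)."""
    return sorted(set(s) & ILLEGAL_CHARS)


def check_for_dcid(row, columns=("dcid",)):
    """Raise ValueError if any dcid cell in the row has illegal characters."""
    for col in columns:
        cell = str(row[col])
        bad = check_for_illegal_charc(cell)
        if bad:
            # Raising stops the script unless the caller chooses to handle it.
            raise ValueError(
                f"Error! {col} contains illegal characters {bad}: {cell!r}")
    return row


check_for_dcid({"dcid": "bio/DOID_0050177_SEMA3F"})  # passes silently
```

Raising an exception rather than printing means a bad dcid cannot slip into the output CSV unnoticed. If printing only was intended, the `raise` could be swapped for a log line.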



def clean_data(df, data_type):
    df_tm = df

curious what the tm stands for

df_tm = df_tm[~df_tm['Gene'].str.contains("ENSP00")]
df = format_dcids(df, data_type)
df_tm = format_dcids(df_tm, data_type)
df_tm = format_RNA_type(df_tm) ## filter out genes from df with non coding RNA

nit: does the function format_RNA_type do filtering? if so, please add a comment on the function because that's not clear from the name of the function

df = df.dropna(axis=1, how='all') # Drop columns with all NA values from df
df_gene = df_gene.dropna(axis=1, how='all') # Drop columns with all NA values from df_gene
df = pd.concat([df, df_gene], ignore_index=True)
df = pd.concat([df, df_gene], ignore_index=True)

this line is the exact same as 196, did you want to do the same thing twice?

3 participants