UPDATE README & zey mays test files

yonesora56 · Nov 16, 2024 · 63a0d2e
1 parent f682b43
Showing 33 changed files with 33,242 additions and 105 deletions.
diff --git a/.gitignore b/.gitignore
@@ -1,18 +1,23 @@
 .vscode
 .cache
 .DS_Store
-/Archive
 /cwl_cache
 /index
 /Data
-/metascape_result
-/notebooks/cache
-/out/rice_up
-/out/rice_down
-/r-env
 /scripts/archive
-/test/oryza_sativa_test/uniprotkb_human_all_241107.fasta
-/test/oryza_sativa_test/uniprotkb_rice_all_240820.fasta
+/test/oryza_sativa_test/uniprotkb_39947_all.fasta
+/test/oryza_sativa_test/uniprotkb_9606_all.fasta
+/test/zea_mays_test/uniprotkb_9606_all.fasta
+/test/zea_mays_test/uniprotkb_4577_all.fasta
 /test/oryza_sativa_test/index_hit_species/*
 /test/oryza_sativa_test/index_query_species/*
+/test/zea_mays_test/index_hit_species/*
+/test/zea_mays_test/index_query_species/*
+/test/zea_mays_test/result_needle/*
+/test/zea_mays_test/result_water/*
+/test/zea_mays_test/zea_mays_random_gene_afinfo/*
+/test/zea_mays_test/zea_mays_random_gene_mmcif/*
+/test/zea_mays_test/split_fasta_hit_species/*
+/test/zea_mays_test/split_fasta_query_species/*
+/test/zea_mays_test/foldseek_output_swissprot_zm_random_evalue01.tsv
 cmd_history.txt
diff --git a/README.md b/README.md
@@ -17,7 +17,7 @@ Please check the official website for details.
 
 Here, we will explain how to use the list of rice genes as an example.
 
-### 1. Creation of a TSV file of gene and UniProt ID correspondences
+### **1. Creation of a TSV file of gene and UniProt ID correspondences**
 
 First, you will need the following gene list tsv file. (Please set the column name as "From")
 
@@ -58,4 +58,76 @@ The actual execution results are output together with the [jupyter notebook](./t
 
 &nbsp;
 
-### 2. 
+### **2. Creating and preparing indexes**
+
+The [main workflow](./Workflow/plant2human_v1.0.1.cwl) does not currently create the required indexes (neither the foldseek index nor the BLAST index).
+Please create them in advance as described below.
+
+### 2-1. Creating a foldseek index
+
+In this workflow, the structural similarity search targets the AlphaFold database so that comparisons can be made across a wider range of species.
+The index is created with the `foldseek databases` command, which is also described in a [CWL Command Line Tool file](./Tools/02_foldseek_database.cwl).
+
+Please select the database you want to use from `Alphafold/UniProt`, `Alphafold/UniProt50-minimal`, `Alphafold/UniProt50`, `Alphafold/Proteome`, and `Alphafold/Swiss-Prot`.
+You can check the details of these databases using the following command.
+
+```bash
+docker run --rm quay.io/biocontainers/foldseek:9.427df8a--pl5321hb365157_1 foldseek databases --help
+```
+
+For example, to use `Alphafold/Swiss-Prot` as the index, run the following commands:
+
+```bash
+# using docker container
+docker run -u $(id -u):$(id -g) --rm -v $(pwd):/home -e HOME=/home --workdir /home quay.io/biocontainers/foldseek:9.427df8a--pl5321hb365157_1 foldseek databases Alphafold/Swiss-Prot swissprot tmp --threads 8
+
+# create the index directory (-p also creates ./index if it does not exist yet)
+mkdir -p ./index/index_swissprot
+
+# moving index file
+mv swissprot* ./index/index_swissprot/
+```
+Note: We have written a [CWL file describing the above process](./Tools/02_foldseek_database.cwl), but it currently fails with an error that we are working to fix.
+
+&nbsp;
+
+### 2-2. Downloading a BLAST index file
+
+To obtain amino acid sequences with the `blastdbcmd` command, the FASTA files used to build the BLAST index must first be downloaded from the UniProt database.
+Because this example compares rice and human, the files can be downloaded as follows.
+
+```bash
+# Oryza sativa (all uniprot entries)
+curl -o uniprotkb_39947_all.fasta.gz "https://rest.uniprot.org/uniprotkb/stream?compressed=true&format=fasta&query=%28organism_id%3A39947%29"
+
+gzip -d uniprotkb_39947_all.fasta.gz
+
+# Homo sapiens (all uniprot entries)
+curl -o uniprotkb_9606_all.fasta.gz "https://rest.uniprot.org/uniprotkb/stream?compressed=true&format=fasta&query=%28organism_id%3A9606%29"
+
+gzip -d uniprotkb_9606_all.fasta.gz
+```
+
+&nbsp;
+
+### 3. Execution of the [main workflow](./Workflow/plant2human_v1.0.1.cwl)
+
+In this process, we perform a structural similarity search using `foldseek easy-search`, and then perform a pairwise alignment of the amino acid sequences of the hit pairs using `needle` and `water`.
+Finally, we create a scatter plot based on this information and output a [jupyter notebook](./test/oryza_sativa_test/plant2human_report.ipynb) as a report.
+An example command is as follows.
+
+```bash
+cwltool --debug --outdir ./test/oryza_sativa_test ./Workflow/plant2human_v1.0.1.cwl ./job/plant2human_job_example_os.yml
+```
+
+&nbsp;
+
+For example, you can visualize the results of structural similarity and global alignment, as shown below.
+
+![image](./image/rice_test_scatter_plot.png)
+
+&nbsp;
+
+The following scatter plot can also be obtained from the test results of [100 randomly selected Zea mays genes](./test/zea_mays_test).
+
+![image](./image/zey_mays_scatter_plot.png)
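
The README hunk above shows the UniProt download commands only for rice (39947) and human (9606), while the maize job file in this commit expects `uniprotkb_4577_all.fasta`. The following is a minimal sketch of how the same download could be parameterized by taxonomy ID. The function names `uniprot_fasta_url` and `download_uniprot_fasta` are hypothetical helpers that are not part of the repository; the URL pattern is copied from the README commands.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not part of the repository).

# Build the UniProt REST "stream" URL for all entries of one NCBI taxonomy ID.
# The URL pattern is taken from the curl commands in the README.
uniprot_fasta_url() {
    local taxid="$1"
    printf 'https://rest.uniprot.org/uniprotkb/stream?compressed=true&format=fasta&query=%%28organism_id%%3A%s%%29\n' "$taxid"
}

# Download and decompress uniprotkb_<taxid>_all.fasta into a directory.
download_uniprot_fasta() {
    local taxid="$1" outdir="$2"
    local out="${outdir}/uniprotkb_${taxid}_all.fasta.gz"
    mkdir -p "$outdir"
    # --fail makes curl exit non-zero on an HTTP error instead of saving the error page
    curl --fail -o "$out" "$(uniprot_fasta_url "$taxid")"
    gzip -d "$out"
}

# Example (not run automatically): files expected by the Zea mays job file.
# download_uniprot_fasta 4577 ./test/zea_mays_test
# download_uniprot_fasta 9606 ./test/zea_mays_test
```

Keeping the URL construction in its own function makes it possible to check the generated URL without touching the network.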
diff --git a/Tools/find_enrichment.cwl b/Tools/find_enrichment.cwl
diff --git a/image/rice_test_scatter_plot.png b/image/rice_test_scatter_plot.png
diff --git a/image/zey_mays_scatter_plot.png b/image/zey_mays_scatter_plot.png
diff --git a/job/commandlinetool/36_rice_up_get_panhomology.yml b/job/commandlinetool/36_rice_up_get_panhomology.yml
diff --git a/job/commandlinetool/3_blastdbcmd_human_down.yml b/job/commandlinetool/3_blastdbcmd_human_down.yml
diff --git a/job/commandlinetool/3_blastdbcmd_human_up.yml b/job/commandlinetool/3_blastdbcmd_human_up.yml
diff --git a/job/commandlinetool/3_blastdbcmd_rice_down.yml b/job/commandlinetool/3_blastdbcmd_rice_down.yml
diff --git a/job/commandlinetool/3_blastdbcmd_rice_up.yml b/job/commandlinetool/3_blastdbcmd_rice_up.yml
diff --git a/job/commandlinetool/3_extract_id_human_down.yml b/job/commandlinetool/3_extract_id_human_down.yml
diff --git a/job/commandlinetool/3_extract_id_human_up.yml b/job/commandlinetool/3_extract_id_human_up.yml
diff --git a/job/commandlinetool/3_extract_id_rice_down.yml b/job/commandlinetool/3_extract_id_rice_down.yml
diff --git a/job/commandlinetool/3_extract_id_rice_up.yml b/job/commandlinetool/3_extract_id_rice_up.yml
diff --git a/job/commandlinetool/3_seqretsplit_human_down.yml b/job/commandlinetool/3_seqretsplit_human_down.yml
diff --git a/job/commandlinetool/3_seqretsplit_human_up.yml b/job/commandlinetool/3_seqretsplit_human_up.yml
diff --git a/job/commandlinetool/3_seqretsplit_rice_down.yml b/job/commandlinetool/3_seqretsplit_rice_down.yml
diff --git a/job/commandlinetool/3_seqretsplit_rice_up.yml b/job/commandlinetool/3_seqretsplit_rice_up.yml
diff --git a/job/commandlinetool/40_extract_id_human.yml b/job/commandlinetool/40_extract_id_human.yml
diff --git a/job/commandlinetool/40_extract_id_rice.yml b/job/commandlinetool/40_extract_id_rice.yml
diff --git a/job/commandlinetool/40_id_mapping_panhomology.yml b/job/commandlinetool/40_id_mapping_panhomology.yml
diff --git a/job/commandlinetool/makeblastdb_human.yml b/job/commandlinetool/makeblastdb_human.yml
diff --git a/job/plant2human_job_example_os.yml b/job/plant2human_job_example_os.yml
@@ -57,11 +57,11 @@ OUTPUT_FILE_NAME_HIT_SPECIES: "foldseek_result_hit_species.txt"  # default value
 SW_INPUT_FASTA_FILE_QUERY_SPECIES:  # default value of type 'File'.
     class: File
     format: http://edamontology.org/format_1929
-    location: ../test/oryza_sativa_test/uniprotkb_rice_all_240820.fasta
+    location: ../test/oryza_sativa_test/uniprotkb_39947_all.fasta
 SW_INPUT_FASTA_FILE_HIT_SPECIES:  # default value of type 'File'.
     class: File
     format: http://edamontology.org/format_1929
-    location: ../test/oryza_sativa_test/uniprotkb_human_all_241107.fasta
+    location: ../test/oryza_sativa_test/uniprotkb_9606_all.fasta
 
 
 ROUTE_DATASET: "uniprot,ensembl_protein,ensembl_transcript,ensembl_gene,hgnc,hgnc_symbol"  # default value of type 'string'.
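
Because this hunk renames the FASTA files referenced by the job file, it may be worth confirming that every `location:` entry still points to an existing path before running `cwltool`. The sketch below assumes that relative locations are resolved against the directory containing the job file and that paths contain no spaces; `check_job_locations` is a hypothetical helper, not something shipped with the repository.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository).
# Print every "location:" value in a CWL job file whose target does not exist.
# Assumption: paths are relative to the job file's directory and contain no spaces.
check_job_locations() {
    local job="$1" base status=0 path
    base="$(dirname "$job")"
    # awk prints the second field of each line whose first field is "location:"
    for path in $(awk '$1 == "location:" { print $2 }' "$job"); do
        if [ ! -e "${base}/${path}" ]; then
            echo "not found: ${path}" >&2
            status=1
        fi
    done
    return "$status"
}

# Example (not run automatically):
# check_job_locations ./job/plant2human_job_example_os.yml
```

The function returns a non-zero status if any path is missing, so it can be chained in front of the workflow command with `&&`.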

diff --git a/job/plant2human_job_example_zm.yml b/job/plant2human_job_example_zm.yml
@@ -0,0 +1,79 @@
+INPUT_DIRECTORY:  # default value of type 'Directory'.
+    class: Directory
+    location: ../test/zea_mays_test/zea_mays_random_gene_mmcif/
+FILE_MATCH_PATTERN: "*.cif"  # default value of type 'string'.
+
+
+# foldseek inputs
+FOLDSEEK_INDEX:  # default value of type 'File'.
+    class: File
+    location: ../index/index_swissprot/swissprot
+    secondaryFiles:
+      - class: File
+        location: ../index/index_swissprot/swissprot_ca
+      - class: File
+        location: ../index/index_swissprot/swissprot_ca.dbtype
+      - class: File
+        location: ../index/index_swissprot/swissprot_ca.index
+      - class: File
+        location: ../index/index_swissprot/swissprot_h
+      - class: File
+        location: ../index/index_swissprot/swissprot_h.dbtype
+      - class: File
+        location: ../index/index_swissprot/swissprot_h.index
+      - class: File
+        location: ../index/index_swissprot/swissprot_mapping
+      - class: File
+        location: ../index/index_swissprot/swissprot_ss
+      - class: File
+        location: ../index/index_swissprot/swissprot_ss.dbtype
+      - class: File
+        location: ../index/index_swissprot/swissprot_ss.index
+      - class: File
+        location: ../index/index_swissprot/swissprot_taxonomy
+      - class: File
+        location: ../index/index_swissprot/swissprot.dbtype
+      - class: File
+        location: ../index/index_swissprot/swissprot.index
+      - class: File
+        location: ../index/index_swissprot/swissprot.lookup
+      - class: File
+        location: ../index/index_swissprot/swissprot.version
+
+OUTPUT_FILE_NAME1: "foldseek_output_swissprot_zm_random_evalue01.tsv"  # default value of type 'string'.
+EVALUE: 0.1  # default value of type 'double'.
+THREADS: 16  # default value of type 'int'.
+SPLIT_MEMORY_LIMIT: "120G"  # default value of type 'string'.
+TAXONOMY_ID_LIST: "9606,10090,3702,39947"  # default value of type 'string'.
+
+# filtering
+OUTPUT_FILE_NAME2: "foldseek_zm_random_9606.tsv"  # default value of type 'string'.
+WF_COLUMN_NUMBER_QUERY_SPECIES: 1  # default value of type 'int'.
+OUTPUT_FILE_NAME_QUERY_SPECIES: "foldseek_result_query_species.txt"  # default value of type 'string'.
+WF_COLUMN_NUMBER_HIT_SPECIES: 2  # default value of type 'int'.
+OUTPUT_FILE_NAME_HIT_SPECIES: "foldseek_result_hit_species.txt"  # default value of type 'string'.
+
+
+SW_INPUT_FASTA_FILE_QUERY_SPECIES:  # default value of type 'File'.
+    class: File
+    format: http://edamontology.org/format_1929
+    location: ../test/zea_mays_test/uniprotkb_4577_all.fasta
+SW_INPUT_FASTA_FILE_HIT_SPECIES:  # default value of type 'File'.
+    class: File
+    format: http://edamontology.org/format_1929
+    location: ../test/zea_mays_test/uniprotkb_9606_all.fasta
+
+
+ROUTE_DATASET: "uniprot,ensembl_protein,ensembl_transcript,ensembl_gene,hgnc,hgnc_symbol"  # default value of type 'string'.
+OUTPUT_FILE_NAME3: "foldseek_hit_species_togoid_convert.tsv"  # default value of type 'string'.
+
+OUT_NOTEBOOK_NAME: "plant2human_report.ipynb"  # default value of type 'string'.
+QUERY_IDMAPPING_TSV:  # default value of type 'File'.
+    class: File
+    format: http://edamontology.org/format_3475
+    location: ../test/zea_mays_test/zea_mays_random_gene_idmapping_all.tsv
+
+QUERY_GENE_LIST_TSV:  # default value of type 'File'.
+    class: File
+    format: http://edamontology.org/format_3475
+    location: ../test/zea_mays_test/zea_mays_random_gene_list.tsv
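
The `FOLDSEEK_INDEX` entry in the job file above lists sixteen files that must sit next to each other in `./index/index_swissprot`. As a rough sketch, the presence of those files could be checked after the `mv swissprot* ./index/index_swissprot/` step from the README. The suffix list is copied from the `secondaryFiles` entries; `check_foldseek_index` is a hypothetical helper, and other foldseek versions or databases may produce a different set of files.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository).
# Report which foldseek database files are missing under a directory.
# The suffix list mirrors the secondaryFiles entries of the job file.
check_foldseek_index() {
    local dir="$1" prefix="$2" missing=0 suffix
    for suffix in "" _ca _ca.dbtype _ca.index _h _h.dbtype _h.index \
                  _mapping _ss _ss.dbtype _ss.index _taxonomy \
                  .dbtype .index .lookup .version; do
        if [ ! -e "${dir}/${prefix}${suffix}" ]; then
            echo "missing: ${dir}/${prefix}${suffix}" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example (not run automatically):
# check_foldseek_index ./index/index_swissprot swissprot
```

A non-zero return status indicates that at least one file listed in the job file is absent.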