UPDATE README & zey mays test files
yonesora56 committed Nov 16, 2024
1 parent f682b43 commit 63a0d2e
Showing 33 changed files with 33,242 additions and 105 deletions.
21 changes: 13 additions & 8 deletions .gitignore
@@ -1,18 +1,23 @@
.vscode
.cache
.DS_Store
/Archive
/cwl_cache
/index
/Data
/metascape_result
/notebooks/cache
/out/rice_up
/out/rice_down
/r-env
/scripts/archive
/test/oryza_sativa_test/uniprotkb_human_all_241107.fasta
/test/oryza_sativa_test/uniprotkb_rice_all_240820.fasta
/test/oryza_sativa_test/uniprotkb_39947_all.fasta
/test/oryza_sativa_test/uniprotkb_9606_all.fasta
/test/zea_mays_test/uniprotkb_9606_all.fasta
/test/zea_mays_test/uniprotkb_4577_all.fasta
/test/oryza_sativa_test/index_hit_species/*
/test/oryza_sativa_test/index_query_species/*
/test/zea_mays_test/index_hit_species/*
/test/zea_mays_test/index_query_species/*
/test/zea_mays_test/result_needle/*
/test/zea_mays_test/result_water/*
/test/zey_mays_test/zea_mays_random_gene_afinfo/*
/test/zey_mays_test/zea_mays_random_gene_mmcif/*
/test/zey_mays_test/split_fasta_hit_species/*
/test/zey_mays_test/split_fasta_query_species/*
/test/zea_mays_test/foldseek_output_swissprot_zm_random_evalue01.tsv
cmd_history.txt
76 changes: 74 additions & 2 deletions README.md
@@ -17,7 +17,7 @@ Please check the official website for details.

Here, we will explain how to use the list of rice genes as an example.

- ### 1. Creation of a TSV file of gene and UniProt ID correspondences
+ ### **1. Creation of a TSV file of gene and UniProt ID correspondences**

First, you will need the following gene list tsv file. (Please set the column name as "From")

@@ -58,4 +58,76 @@ The actual execution results are output together with the [jupyter notebook](./t

 

- ### 2.
+ ### **2. Creating and preparing indexes**

The [main workflow](./Workflow/plant2human_v1.0.1.cwl) does not currently create the indexes it needs (neither the foldseek index nor the BLAST index).
Please prepare them in advance as described below.

### 2-1. Creating a foldseek index

In this workflow, the structural similarity search targets the AlphaFold database so that comparisons can cover a wider range of species.
The index is created with the `foldseek databases` command; a [CWL Command Line Tool file](./Tools/02_foldseek_database.cwl) is also provided for this step.

Please select the database you want to use from `Alphafold/UniProt`, `Alphafold/UniProt50-minimal`, `Alphafold/UniProt50`, `Alphafold/Proteome`, `Alphafold/Swiss-Prot`.
You can check the details of these databases using the following command.

```bash
docker run --rm quay.io/biocontainers/foldseek:9.427df8a--pl5321hb365157_1 foldseek databases --help
```

For example, to use `Alphafold/Swiss-Prot` as the index, run the following commands:

```bash
# download the Alphafold/Swiss-Prot database using the foldseek docker container
docker run -u "$(id -u):$(id -g)" --rm -v "$(pwd):/home" -e HOME=/home --workdir /home quay.io/biocontainers/foldseek:9.427df8a--pl5321hb365157_1 foldseek databases Alphafold/Swiss-Prot swissprot tmp --threads 8

# create the destination directory (-p also creates ./index if it does not exist yet)
mkdir -p ./index/index_swissprot

# move the database files into it
mv swissprot* ./index/index_swissprot/
```
Note: We have written a [CWL file describing the above process](./Tools/02_foldseek_database.cwl), but it currently fails with an error that we are working to fix.
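
Before running the workflow, it can help to confirm that all of the moved database files are present. The following is a minimal sketch and not part of this repository: `check_foldseek_index` is a hypothetical helper, and the suffix list is an assumption taken from the `secondaryFiles` entries in `job/plant2human_job_example_zm.yml`. Other foldseek versions or databases may produce a different set of files.

```bash
# Hypothetical helper (not part of this repository).
# Usage: check_foldseek_index <directory> <database prefix>
# The function body runs in a subshell so its variables do not leak.
check_foldseek_index() (
  dir="$1"
  prefix="$2"
  missing=0
  # Suffixes assumed from the secondaryFiles list in the example job file;
  # the empty suffix stands for the main database file itself.
  for suffix in "" _ca _ca.dbtype _ca.index _h _h.dbtype _h.index _mapping \
                _ss _ss.dbtype _ss.index _taxonomy .dbtype .index .lookup .version; do
    if [ ! -e "${dir}/${prefix}${suffix}" ]; then
      echo "missing: ${dir}/${prefix}${suffix}" >&2
      missing=1
    fi
  done
  # Succeed only if nothing was reported as missing.
  [ "$missing" -eq 0 ]
)
```

For example, `check_foldseek_index ./index/index_swissprot swissprot` prints each missing file to standard error and returns a non-zero status if at least one is absent.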

 

### 2-2. Downloading a BLAST index file

To retrieve amino acid sequences with the `blastdbcmd` command, the FASTA files that serve as the source of the BLAST index must first be downloaded from the UniProt database.
This example compares rice with human, so the two files can be downloaded as follows.

```bash
# Oryza sativa (all uniprot entries)
curl -o uniprotkb_39947_all.fasta.gz "https://rest.uniprot.org/uniprotkb/stream?compressed=true&format=fasta&query=%28organism_id%3A39947%29"

gzip -d uniprotkb_39947_all.fasta.gz

# Homo sapiens (all uniprot entries)
curl -o uniprotkb_9606_all.fasta.gz "https://rest.uniprot.org/uniprotkb/stream?compressed=true&format=fasta&query=%28organism_id%3A9606%29"

gzip -d uniprotkb_9606_all.fasta.gz
```
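
A failed or interrupted download can leave an empty file or an error page instead of sequences. As a minimal sketch (the helper name `check_fasta` is hypothetical and not part of this repository), the following checks that a decompressed file starts with a FASTA header line and prints how many records it contains:

```bash
# Hypothetical helper (not part of this repository).
# Usage: check_fasta <fasta file>
check_fasta() {
  # A FASTA file must begin with a header line that starts with ">".
  if ! head -n 1 "$1" | grep -q '^>'; then
    echo "not a FASTA file: $1" >&2
    return 1
  fi
  # There is one header line per record, so counting ">" lines counts sequences.
  grep -c '^>' "$1"
}
```

For example, `check_fasta uniprotkb_39947_all.fasta` prints the number of rice entries that were downloaded, or fails with a message if the file is not FASTA.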

 

### 3. Execution of the [main workflow](./Workflow/plant2human_v1.0.1.cwl)

In this process, we perform a structural similarity search using `foldseek easy-search`, and then align the amino acid sequences of each hit pair using `needle` (global alignment) and `water` (local alignment).
Finally, we create scatter plots from these results and output a [jupyter notebook](./test/oryza_sativa_test/plant2human_report.ipynb) as a report.
An example command is shown below.

```bash
cwltool --debug --outdir ./test/oryza_sativa_test ./Workflow/plant2human_v1.0.1.cwl ./job/plant2human_job_example_os.yml
```
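
Because the indexes and FASTA files are prepared by hand, a wrong path in the job file is an easy mistake to make. The sketch below is a hypothetical pre-flight check, not part of this repository. It assumes that every `location:` value in the job file is a plain relative path without quotes or trailing comments, and that relative paths are resolved against the directory containing the job file, which is how CWL treats relative locations in an input object.

```bash
# Hypothetical helper (not part of this repository).
# Usage: check_job_locations <job YAML file>
# The function body runs in a subshell so its variables do not leak.
check_job_locations() (
  job_dir=$(dirname "$1")
  # Extract the value of every "location:" entry and strip trailing whitespace.
  sed -n 's/^[[:space:]]*location:[[:space:]]*//p' "$1" | sed 's/[[:space:]]*$//' | {
    missing=0
    while IFS= read -r path; do
      # Assumption: the path is relative to the job file's directory.
      if [ ! -e "${job_dir}/${path}" ]; then
        echo "missing: ${job_dir}/${path}" >&2
        missing=1
      fi
    done
    # Succeed only if every referenced file or directory exists.
    [ "$missing" -eq 0 ]
  }
)
```

For example, `check_job_locations ./job/plant2human_job_example_os.yml` lists each referenced file or directory that cannot be found and returns a non-zero status if any is missing.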

 

For example, you can visualize the results of structural similarity and global alignment, as shown below.

![image](./image/rice_test_scatter_plot.png)

 

The following scatter plot was obtained from the test results for [100 random Zea mays genes](./test/zea_mays_test).

![image](./image/zey_mays_scatter_plot.png)
93 changes: 0 additions & 93 deletions Tools/find_enrichment.cwl

This file was deleted.

Binary file added image/rice_test_scatter_plot.png
Binary file added image/zey_mays_scatter_plot.png
Empty file modified job/commandlinetool/36_rice_up_get_panhomology.yml
100755 → 100644
Empty file.
Empty file modified job/commandlinetool/3_blastdbcmd_human_down.yml
100755 → 100644
Empty file.
Empty file modified job/commandlinetool/3_blastdbcmd_human_up.yml
100755 → 100644
Empty file.
Empty file modified job/commandlinetool/3_blastdbcmd_rice_down.yml
100755 → 100644
Empty file.
Empty file modified job/commandlinetool/3_blastdbcmd_rice_up.yml
100755 → 100644
Empty file.
Empty file modified job/commandlinetool/3_extract_id_human_down.yml
100755 → 100644
Empty file.
Empty file modified job/commandlinetool/3_extract_id_human_up.yml
100755 → 100644
Empty file.
Empty file modified job/commandlinetool/3_extract_id_rice_down.yml
100755 → 100644
Empty file.
Empty file modified job/commandlinetool/3_extract_id_rice_up.yml
100755 → 100644
Empty file.
Empty file modified job/commandlinetool/3_seqretsplit_human_down.yml
100755 → 100644
Empty file.
Empty file modified job/commandlinetool/3_seqretsplit_human_up.yml
100755 → 100644
Empty file.
Empty file modified job/commandlinetool/3_seqretsplit_rice_down.yml
100755 → 100644
Empty file.
Empty file modified job/commandlinetool/3_seqretsplit_rice_up.yml
100755 → 100644
Empty file.
Empty file modified job/commandlinetool/40_extract_id_human.yml
100755 → 100644
Empty file.
Empty file modified job/commandlinetool/40_extract_id_rice.yml
100755 → 100644
Empty file.
Empty file modified job/commandlinetool/40_id_mapping_panhomology.yml
100755 → 100644
Empty file.
Empty file modified job/commandlinetool/makeblastdb_human.yml
100755 → 100644
Empty file.
4 changes: 2 additions & 2 deletions job/plant2human_job_example_os.yml
@@ -57,11 +57,11 @@ OUTPUT_FILE_NAME_HIT_SPECIES: "foldseek_result_hit_species.txt" # default value
 SW_INPUT_FASTA_FILE_QUERY_SPECIES: # default value of type 'File'.
   class: File
   format: http://edamontology.org/format_1929
-  location: ../test/oryza_sativa_test/uniprotkb_rice_all_240820.fasta
+  location: ../test/oryza_sativa_test/uniprotkb_39947_all.fasta
 SW_INPUT_FASTA_FILE_HIT_SPECIES: # default value of type 'File'.
   class: File
   format: http://edamontology.org/format_1929
-  location: ../test/oryza_sativa_test/uniprotkb_human_all_241107.fasta
+  location: ../test/oryza_sativa_test/uniprotkb_9606_all.fasta


ROUTE_DATASET: "uniprot,ensembl_protein,ensembl_transcript,ensembl_gene,hgnc,hgnc_symbol" # default value of type 'string'.
79 changes: 79 additions & 0 deletions job/plant2human_job_example_zm.yml
@@ -0,0 +1,79 @@
INPUT_DIRECTORY: # default value of type 'Directory'.
  class: Directory
  location: ../test/zea_mays_test/zea_mays_random_gene_mmcif/
FILE_MATCH_PATTERN: "*.cif" # default value of type 'string'.


# foldseek inputs
FOLDSEEK_INDEX: # default value of type 'File'.
  class: File
  location: ../index/index_swissprot/swissprot
  secondaryFiles:
    - class: File
      location: ../index/index_swissprot/swissprot_ca
    - class: File
      location: ../index/index_swissprot/swissprot_ca.dbtype
    - class: File
      location: ../index/index_swissprot/swissprot_ca.index
    - class: File
      location: ../index/index_swissprot/swissprot_h
    - class: File
      location: ../index/index_swissprot/swissprot_h.dbtype
    - class: File
      location: ../index/index_swissprot/swissprot_h.index
    - class: File
      location: ../index/index_swissprot/swissprot_mapping
    - class: File
      location: ../index/index_swissprot/swissprot_ss
    - class: File
      location: ../index/index_swissprot/swissprot_ss.dbtype
    - class: File
      location: ../index/index_swissprot/swissprot_ss.index
    - class: File
      location: ../index/index_swissprot/swissprot_taxonomy
    - class: File
      location: ../index/index_swissprot/swissprot.dbtype
    - class: File
      location: ../index/index_swissprot/swissprot.index
    - class: File
      location: ../index/index_swissprot/swissprot.lookup
    - class: File
      location: ../index/index_swissprot/swissprot.version

OUTPUT_FILE_NAME1: "foldseek_output_swissprot_zm_random_evalue01.tsv" # default value of type 'string'.
EVALUE: 0.1 # default value of type 'double'.
THREADS: 16 # default value of type 'int'.
SPLIT_MEMORY_LIMIT: "120G" # default value of type 'string'.
TAXONOMY_ID_LIST: "9606,10090,3702,39947" # default value of type 'string'.

# filtering
OUTPUT_FILE_NAME2: "foldseek_zm_random_9606.tsv" # default value of type 'string'.
WF_COLUMN_NUMBER_QUERY_SPECIES: 1 # default value of type 'int'.
OUTPUT_FILE_NAME_QUERY_SPECIES: "foldseek_result_query_species.txt" # default value of type 'string'.
WF_COLUMN_NUMBER_HIT_SPECIES: 2 # default value of type 'int'.
OUTPUT_FILE_NAME_HIT_SPECIES: "foldseek_result_hit_species.txt" # default value of type 'string'.


SW_INPUT_FASTA_FILE_QUERY_SPECIES: # default value of type 'File'.
  class: File
  format: http://edamontology.org/format_1929
  location: ../test/zea_mays_test/uniprotkb_4577_all.fasta
SW_INPUT_FASTA_FILE_HIT_SPECIES: # default value of type 'File'.
  class: File
  format: http://edamontology.org/format_1929
  location: ../test/zea_mays_test/uniprotkb_9606_all.fasta


ROUTE_DATASET: "uniprot,ensembl_protein,ensembl_transcript,ensembl_gene,hgnc,hgnc_symbol" # default value of type 'string'.
OUTPUT_FILE_NAME3: "foldseek_hit_species_togoid_convert.tsv" # default value of type 'string'.

OUT_NOTEBOOK_NAME: "plant2human_report.ipynb" # default value of type 'string'.
QUERY_IDMAPPING_TSV: # default value of type 'File'.
  class: File
  format: http://edamontology.org/format_3475
  location: ../test/zea_mays_test/zea_mays_random_gene_idmapping_all.tsv

QUERY_GENE_LIST_TSV: # default value of type 'File'.
  class: File
  format: http://edamontology.org/format_3475
  location: ../test/zea_mays_test/zea_mays_random_gene_list.tsv