Notes about creating necessary inputs for CAT prepare based on the GTDB data.
Some commands in the notebooks are provided as markdown code blocks. To be
able to run these you need a Linux environment (Ubuntu LTS 20.04.2 was
used here), with the following core utils:
`wget`, `cut`, `grep`, `sort`, `uniq`, `sed`
I also used `parallel` at some point, but it might not be necessary.
For fetching protein files from the NCBI ftp and parsing the fasta files:
- `ncbi-genome-download` (0.3.0 was used here)
- `biopython` (1.79 was used here)
CAT itself: `cat` (5.2.3 was used here)
All other stuff is in standard python 3.9.4.
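Before running the notebooks, it can be worth checking that the core utilities listed above are actually available. A minimal stdlib-only sketch (the tool list mirrors the one above; extend it with `parallel` etc. as needed):

```python
# Sanity check: are the required command line tools on $PATH?
import shutil

REQUIRED = ["wget", "cut", "grep", "sort", "uniq", "sed"]

def missing_tools(tools):
    """Return the subset of tools that cannot be found on $PATH."""
    return [t for t in tools if shutil.which(t) is None]

if __name__ == "__main__":
    missing = missing_tools(REQUIRED)
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All core utilities found.")
```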
- Recommended
Create an isolated conda environment with the provided
`environment.yaml`. This will also install the `ipykernel` module that should
make the whole environment visible to your jupyter instance.
So
$ git clone <this repo>
$ cd <this repo>
$ conda env create -n gtdb --file=environment.yaml
You might need to install `nb_conda_kernels` in the environment you are
running jupyter from, to make the gtdb kernel available.
Testing this:
You can use a platform specific environment provided within the conda_envs
directory to get the exact same versions of packages, always.
Use the appropriate lock file for your platform. E.g. if you are on OSX
$ conda create -n gtdb --file conda_envs/conda-osx-64.lock
These files were produced with conda-lock and
$ conda-lock -f environment.yaml \
--filename-template conda_envs/conda-{platform}.lock
# Windows fails... Microtears (TM)
There are 2 main jupyter notebooks.
taxonomy.ipynb
Includes notes on the process of creating NCBI-like taxonomy files
`nodes.dmp` and `names.dmp`.
- Currently unique numeric ids are produced for all ranks in the taxonomy.
This might be unnecessary, since the prefixed names (e.g.
p__UBA9089) are already unique. Not sure if CAT expects ints or if any unique string will do.
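The numeric id assignment can be sketched roughly as follows. The `;`-separated, rank-prefixed lineage strings (`d__`, `p__`, ...) follow the GTDB taxonomy format; the id scheme itself is just an illustration, not the exact code from the notebook:

```python
# Assign a unique integer id to every rank-prefixed taxon name.
# Repeated names (e.g. a shared d__Archaea) must map to a single id.
from itertools import count

def assign_taxids(lineages, root_id=1):
    """Map every rank-prefixed name (e.g. 'p__UBA9089') to a unique int."""
    ids = {"root": root_id}
    next_id = count(root_id + 1)
    for lineage in lineages:
        for name in lineage.split(";"):
            if name not in ids:
                ids[name] = next(next_id)
    return ids

lineages = [
    "d__Archaea;p__UBA9089",
    "d__Archaea;p__Thermoproteota",
]
taxids = assign_taxids(lineages)
# d__Archaea appears in both lineages but gets exactly one id
```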
sequences.ipynb
Includes notes on the process of fetching protein sequences for all genomes
represented in the GTDB. This results in:
- `gtdb.nr.fa` - All proteins fetched for as many genomes as possible. Not
all assembly accessions have proteins.faa.gz files available.
- `prot.accession2taxid.txt` - A mapping of all protein accessions, included
in the fasta file above, to their respective taxonomy unique identifier.
Note that these are written uncompressed and they take up quite some space:
- `gtdb.nr.fa` - 309G
- `prot.accession2taxid.txt` - 23G

All protein fasta files take up 156G in total.
So, don't try this at home.
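The accession-to-taxid mapping step can be sketched like this. The real notebook uses biopython; this version is stdlib-only, and both the header format (`>ACCESSION description`) and the `taxid` argument are assumptions for illustration:

```python
# Extract protein accessions from a gzipped .faa file and write
# one "<accession>\t<taxid>" line per protein.
import gzip

def accessions_from_faa(path):
    """Yield the accession (first whitespace-separated token of each
    '>' header) from a gzipped fasta file."""
    with gzip.open(path, "rt") as fh:
        for line in fh:
            if line.startswith(">"):
                yield line[1:].split()[0]

def write_mapping(faa_path, taxid, out_fh):
    """Append mapping lines for one genome's proteins to an open file."""
    for acc in accessions_from_faa(faa_path):
        out_fh.write(f"{acc}\t{taxid}\n")
```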
All files generated by the code and commands in the notebooks are written in the
results dir.
This is currently structured as follows
$ tree -d -L 3 results
results
├── db --------------- Contains the gtdb.nr.gz and prot.accession2taxid.gz
├── proteins ---------|
│ ├── genbank |
│ │ ├── archaea |
│ │ └── bacteria | -- The raw fasta files for proteins,
│ └── refseq | per NCBI section, per domain
│ ├── archaea | (default ncbi-genome-download output structure)
│ └── bacteria |
└── taxonomy --------- Contains nodes.dmp and names.dmp
The gzipped files gtdb.nr.gz and prot.accession2taxid.gz are
manually created with
$ cat results/db/gtdb.nr.fa | gzip -c > results/db/gtdb.nr.gz
$ cat results/db/prot.accession2taxid.txt | gzip -c > results/db/prot.accession2taxid.gz
# Optionally remove the original files if gzip finished successfully.
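Given the sizes involved, it may be worth verifying the gzipped copy before deleting the original. A stdlib sketch (file names follow the commands above; `gzip -t` would also catch corrupt archives, but comparing checksums additionally catches truncated writes):

```python
# Compare a plain file against its gzipped copy, and only remove the
# original if the decompressed content is byte-identical.
import gzip
import hashlib
import os

def same_content(plain_path, gz_path, chunk=1 << 20):
    """True if gz_path decompresses to exactly the bytes of plain_path."""
    def digest(fh):
        h = hashlib.sha256()
        for block in iter(lambda: fh.read(chunk), b""):
            h.update(block)
        return h.hexdigest()
    with open(plain_path, "rb") as plain, gzip.open(gz_path, "rb") as gz:
        return digest(plain) == digest(gz)

def remove_if_verified(plain_path, gz_path):
    """Delete the uncompressed original only after a successful check."""
    if same_content(plain_path, gz_path):
        os.remove(plain_path)
        return True
    return False
```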
Use the --existing option with CAT prepare.
The prepare module checks for the existence of certain files based on their
names, so it needs to be fooled.
- Prepare some clean dirs and copy/rename/move files accordingly
## <date> is of the form YYYY-MM-DD
$ mkdir -p results/CAT_prepare/CAT_taxonomy.<date>
$ mkdir -p results/CAT_prepare/CAT_database.<date>
# Copy the required files
## Sequences
$ mv results/db/gtdb.nr.gz results/CAT_prepare/CAT_database.<date>/<date>.nr.gz
## Taxonomy
$ mv results/db/prot.accession2taxid.gz results/CAT_prepare/CAT_taxonomy.<date>/<date>.prot.accession2taxid.FULL.gz
$ mv results/taxonomy/names.dmp results/CAT_prepare/CAT_taxonomy.<date>/names.dmp
$ mv results/taxonomy/nodes.dmp results/CAT_prepare/CAT_taxonomy.<date>/nodes.dmp
- Launch the job
- Don't forget to activate the environment, or make sure you have `CAT` and
`DIAMOND` available in your `$PATH`.
- Make sure you are on a machine with enough memory (at least 200GB).
# Get in the dir from above
$ cd results/CAT_prepare
# Do it
$ CAT prepare --existing --verbose \
-d CAT_database.<date> \
-t CAT_taxonomy.<date>