A cleaned-up version of the G1034 Actinomycetota genomes from NCBI
This repository contains a workflow to fetch the 1034 GenBank files of the NBC Collection from NCBI.
The list of samples is taken from the supplementary materials of the publication and matched with the BioProject.
[TO DO]
The folder structure is designed to be compatible with BGCFlow.
-
You need mamba installed. We recommend installing miniforge.
-
Install BGCFlow
Skip this step if you already have BGCFlow installed.

    # create and activate a new conda environment
    mamba create -n bgcflow -c conda-forge python=3.11 pip openjdk -y # also install java for metabase
    conda activate bgcflow

    # install the `BGCFlow` wrapper
    pip install bgcflow_wrapper

    # make sure to use bgcflow_wrapper version >= 0.2.7
    bgcflow --version

    # Disable conda channel priority, then print the description of the setting
    conda config --set channel_priority disabled
    conda config --describe channel_priority

    # Deploy and test run BGCFlow
    bgcflow clone bgcflow # clone `BGCFlow` into a directory named bgcflow
    (cd bgcflow && bgcflow init) # initiate `BGCFlow` config and examples from template
    (cd bgcflow && bgcflow run -n) # do a dry run, remove the flag "-n" to run the example dataset
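The version requirement above can also be checked in a script rather than by eye. The following is a minimal sketch, not part of BGCFlow: `version_at_least` is a hypothetical helper, and it assumes a `sort` that supports version ordering (`-V`), as GNU coreutils does.

```shell
# Hypothetical helper: succeeds when version $1 is at least version $2.
# Assumes `sort -V` (version sort) is available, as in GNU coreutils.
version_at_least() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Example usage (assumes the wrapper was installed with pip as shown above):
# installed="$(pip show bgcflow_wrapper | awk '/^Version:/ {print $2}')"
# version_at_least "$installed" "0.2.7" || echo "Please upgrade bgcflow_wrapper" >&2
```

Version sort is used instead of a plain string comparison so that, for example, 0.2.10 is treated as newer than 0.2.7.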
-
Clone the repo
    git clone git@github.com:NBChub/G1034_NCBI_dataset.git
    cd G1034_NCBI_dataset
-
Install the conda environment
mamba env create -f env.yaml
-
Run the workflow to fetch the genbanks from NCBI
Run the command as is, or adjust the Snakemake parameters (for example, the number of cores given to `-c`) as needed.

    conda activate g1034
    (cd config/G1034_20241115/ && snakemake --use-conda -c 8 -n) # remove the -n to execute
    conda deactivate

PS: Depending on network traffic, some downloads might fail. Re-run the command to retry the failed downloads.
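If failed downloads are frequent, the re-run can be automated. The following is a minimal sketch under stated assumptions: `retry` is a hypothetical helper that is not part of this repository, and the attempt count and delay in the usage example are arbitrary.

```shell
# Hypothetical helper: run a command up to $1 times, waiting $2 seconds between attempts.
retry() {
    max_attempts="$1"
    delay="$2"
    shift 2
    attempt=1
    until "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "Giving up after $attempt attempts: $*" >&2
            return 1
        fi
        echo "Attempt $attempt failed; retrying in ${delay}s..." >&2
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}

# Example usage, run from config/G1034_20241115/ with the g1034 environment active:
# retry 3 60 snakemake --use-conda -c 8
```

The helper only repeats the same command, so it behaves exactly like re-running the workflow by hand.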
-
Remove the manually curated problematic genes using the Jupyter notebook:

    conda activate g1034
    (cd config/G1034_20241115/notebooks && jupyter notebook 01_manual_curation.ipynb)
    conda deactivate
-
Create symlinks to an existing BGCFlow clone.

    BGCFLOW_PATH="../bgcflow" # CHANGE THIS ACCORDINGLY
    ln -s "$BGCFLOW_PATH/.snakemake/" .snakemake
    ln -s "$BGCFLOW_PATH/workflow/" workflow
    ln -s "$BGCFLOW_PATH/resources/" resources
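A mistyped `BGCFLOW_PATH` produces dangling links without any warning. The following is a minimal sketch of a more defensive variant: `link_bgcflow` is a hypothetical helper, and using the presence of a `workflow/` directory as the sign of a valid clone is an assumption.

```shell
# Hypothetical helper: link .snakemake, workflow and resources from a BGCFlow
# clone ($1) into the current directory, skipping names that already exist.
link_bgcflow() {
    bgcflow_path="$1"
    # Assumption: a valid BGCFlow clone contains a workflow/ directory.
    if [ ! -d "$bgcflow_path/workflow" ]; then
        echo "No workflow/ directory under $bgcflow_path" >&2
        return 1
    fi
    for name in .snakemake workflow resources; do
        if [ -e "$name" ] || [ -L "$name" ]; then
            echo "Skipping $name: already exists" >&2
            continue
        fi
        ln -s "$bgcflow_path/$name" "$name"
    done
    return 0
}

# Example usage, run from the directory where the links should be created:
# link_bgcflow "../bgcflow"
```

As with the plain `ln -s` commands, a relative path is resolved from the directory that holds the links, so the helper should be called from that directory.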
The downloaded genbanks will be located in config/G1034_20241115/input_files
The notebook used to prepare the list of downloads is available at config/G1034_20241115/notebooks/00_clean_up.ipynb
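To confirm that the download step fetched everything, the files in the input folder can be counted and compared with the expected 1034. The following is a minimal sketch: `count_files` is a hypothetical helper, and the `*.gbk` pattern in the usage example is an assumption about how the downloaded files are named.

```shell
# Hypothetical helper: count regular files under directory $1 whose names match pattern $2.
count_files() {
    find "$1" -type f -name "$2" | wc -l | tr -d ' '
}

# Example usage from the repository root (the '*.gbk' pattern is an assumption):
# found="$(count_files config/G1034_20241115/input_files '*.gbk')"
# [ "$found" -eq 1034 ] || echo "Found $found of 1034 files" >&2
```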