BGCFlow (PanKB fork)

This is a fork of BGCFlow intended for use with PanKB (repo | website). For more information see the main BGCFlow repo and the BGCFlow paper.

BGCFlow is a systematic workflow for building pangenomes and related analytics. This fork adds PanKB-specific functionality and a minimal workflow that runs only the steps required for PanKB.

At present, BGCFlow has only been tested on Linux systems with conda.

Running BGCFlow in an Azure VM

Setting up a VM

  1. Create the VM. Create a standard Azure x64 Linux VM with the most recent Ubuntu LTS version (general instructions here; ignore the web server section and adjust settings accordingly). During setup you can keep the VM small to save costs (you can resize CPU and RAM later). You can keep the main disk size at the default 30GB, but azcopy sometimes creates huge log files and it can be convenient to have a larger disk (though this can also be easily solved by symlinking ~/.azcopy to somewhere on the data disk we will mount in a later step). For security and availability settings, follow the guidelines from DTU Biosustain (you can also check other VMs for reference). Network settings should allow only SSH from trusted IPs (DTU IPs and any place you work remotely from).
  2. Mount a data disk. Create and mount an additional disk for data storage. BGCFlow generates large amounts of data, so 2TB is usually the minimum to be comfortable. If you are going to be handling results from multiple runs, go for at least 4TB. Format the entire disk as ext4. We usually mount the data disk at /data (but it's not a hard requirement). Note that formatting and mounting new disks requires some general knowledge of file systems and the common Linux tools for listing, partitioning, formatting and mounting disks (such as lsblk, fdisk, mkfs.* and the fstab config file); a sketch of the typical commands for this step and the next is shown below.
  3. Mount the resources disk. Mount the bgcflow-resources disk (from the rg-recon resource group). This contains mainly databases and other static files used by some of the tools of the pipeline (more details here). Mounting the resources disk only requires getting its UUID with lsblk and adding it to fstab. We usually mount it at /resources (again, not a hard requirement).
  4. Symlink the resources disk. Clone this repository if you haven't yet (to a location of your choice). Symlink the resources directory to the mount location of the resources disk. From the root of this repository:
rm -rv ./resources
ln -sv /resources resources

Running tree -L 1 ./resources should yield something like:

./resources
├── automlst-simplified-wrapper-main
├── automlst-simplified-wrapper-main.back
├── checkm
├── eggnog_db
├── gtdb_download
├── gtdbtk
└── lost+found

8 directories, 0 files
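The disk setup in steps 2 and 3 typically boils down to commands like the following (a rough sketch only: the device names /dev/sdc and /dev/sdd and the placeholder UUIDs are assumptions, so check lsblk on your VM and substitute your own values):

lsblk -o NAME,SIZE,FSTYPE,UUID,MOUNTPOINT        # identify the new disks and their UUIDs
sudo mkfs.ext4 /dev/sdc                          # data disk: format the whole device as ext4
sudo mkdir -p /data /resources
sudo mount /dev/sdc /data                        # mount the data disk
sudo mount /dev/sdd /resources                   # resources disk is already formatted, just mount it
# make both mounts persistent across reboots (use the UUIDs reported by lsblk)
echo 'UUID=<data-disk-uuid>      /data       ext4  defaults,nofail  0  2' | sudo tee -a /etc/fstab
echo 'UUID=<resources-disk-uuid> /resources  ext4  defaults,nofail  0  2' | sudo tee -a /etc/fstab
# optional (step 1): keep azcopy's potentially huge log files on the data disk
mkdir -p /data/azcopy && ln -s /data/azcopy ~/.azcopy   # do this before azcopy creates ~/.azcopy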
  5. Install conda (we usually use Miniforge). If the main VM disk is small, it is recommended to put the conda installation on the data disk rather than in the default location in the home directory (see the sketch below).
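A minimal sketch of such an install with Miniforge, using /data/miniforge3 as an example prefix (the installer URL is the official Miniforge latest-release link):

wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh
bash Miniforge3-Linux-x86_64.sh -b -p /data/miniforge3   # -b: non-interactive, -p: install prefix
/data/miniforge3/bin/conda init bash                     # set up your shell, then log out and back in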
  6. Create a conda environment with snakemake 7.31.1 and python 3.11:
conda create -n bgcflow python=3.11 bioconda::snakemake=7.31.1
  7. Ensure strict channel priority for conda is disabled:
conda config --set channel_priority disabled
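A quick way to check that the environment is set up as expected:

conda activate bgcflow
snakemake --version                      # should print 7.31.1
conda config --show channel_priority     # should print: channel_priority: disabled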

Some additional BGCFlow installation info can be found in the main BGCFlow wiki (note that we do not use the BGCFlow wrapper, nor do we need GCC).

Running the pipeline

  1. Resize the VM. There are no hard requirements, but more CPU cores will make runs finish faster (large runs can take a few weeks to complete). We usually use a minimum of 32 cores and 128GB of RAM for large runs.
  2. Create the config file. Create a config directory and, inside it, a config.yaml file that should look something like this:
taxons:
  - name: <family_name>
    reference_only: False

use_ncbi_data_for_checkm: True
pankb_alleleome_only: False

rules:
  pankb_nova: True
  pankb_minimal: True
  pankb: True
  alleleome: True

Replace <family_name> with whatever family you want to compute PanKB data for and adjust the other settings as needed.

  3. Create the data directory. Create an empty directory on the data disk (any location is ok) and symlink it to ./data (from the root of this repo), such that we effectively have an empty directory named data at the root of the repo (example commands are shown below, after the tree output). At this point, running tree -l -L 2 . should yield this structure:
.
├── CITATION.cff
├── Dockerfile
├── LICENSE
├── README.md
├── config
│   └── config.yaml
├── data -> /path/to/your/empty/data/directory
├── envs.yaml
├── resources -> /resources
│   ├── automlst-simplified-wrapper-main
│   ├── automlst-simplified-wrapper-main.back
│   ├── checkm
│   ├── eggnog_db
│   ├── gtdb_download
│   └── gtdbtk
└── workflow
    └── ...

You can also create the config.yaml file next to the data directory (to keep things organized) and symlink it into ./config, rather than creating it there directly in the previous step.
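Concretely, step 3 could look like this (the /data/bgcflow_data/<family_name> layout is just an example; any location on the data disk works):

mkdir -p /data/bgcflow_data/<family_name>/data
ln -sv /data/bgcflow_data/<family_name>/data ./data       # run from the root of this repo
# optional (see note above): keep config.yaml next to the data directory and symlink it instead
mkdir -p config
ln -sv /data/bgcflow_data/<family_name>/config.yaml ./config/config.yaml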

  4. Activate the conda environment with:
conda activate bgcflow
  5. Run BGCFlow with:
snakemake --snakefile workflow/PanKB_Family --use-conda --rerun-triggers mtime -c <n_cores> --rerun-incomplete --keep-going --resources ncbi_api=1

Replace <n_cores> with the number of cores of your VM (or how many you want to allocate).
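Optionally, before committing to a run that may take weeks, you can preview the jobs Snakemake would schedule by adding its standard dry-run flag (-n) to the same command:

snakemake --snakefile workflow/PanKB_Family --use-conda -n -c <n_cores>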

Long-term storage of BGCFlow results

We use the pankbpipeline storage account in the rg-recon resource group to store the results and all the intermediate files from our BGCFlow runs for PanKB. Simply add your new data as an additional directory in the runs blob container and remember to also upload the config.yaml.

You can do it easily with azcopy (just be wary of the huge log files issue described above). For instance, if the data directory created previously (now populated) is /data/bgcflow_data/cyanobacteria/data, you can simply run:

azcopy copy --dry-run --recursive --overwrite=false /data/bgcflow_data/cyanobacteria/data "https://pankbpipeline.blob.core.windows.net/runs/cyanobacteria?$SAS"

(assumes you have a $SAS variable with a valid SAS token for the Azure storage account)

With azcopy, always do a --dry-run first to check that everything will end up in the intended location, and use --overwrite=false unless you specifically need to overwrite existing data.
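Once the dry run looks right, the actual upload is the same command without --dry-run; it is also a good moment to upload the run's config.yaml. The local paths below follow the example above and are assumptions about your own layout:

azcopy copy --recursive --overwrite=false /data/bgcflow_data/cyanobacteria/data "https://pankbpipeline.blob.core.windows.net/runs/cyanobacteria?$SAS"
azcopy copy --overwrite=false /data/bgcflow_data/cyanobacteria/config.yaml "https://pankbpipeline.blob.core.windows.net/runs/cyanobacteria/config.yaml?$SAS"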

Storage of PanKB-relevant results

The files needed specifically for PanKB need to be stored in the pankb storage account. Instructions for that can be found in the pankb_db README.
