WesterdijkInstitute/MycoCosm-Loci-Extractor

MycoCosm-Loci-Extractor

Given a list of (formatted) protein labels, this tool uses pre-existing GenBank genome files to locate the CDS feature encoding each target protein and extracts a region around it, saving that region as a GenBank file.

Requirements

  • Python 3.12+
  • Biopython (tested with v1.83)

Usage

usage: loci_list_extractor.py [-h] -t TARGET_LIST -j JGITAXONOMY -b BASE_DATA_FOLDER [-o OUTPUT_FOLDER] [-e EXTENSION]

Loci extractor

options:
  -h, --help            show this help message and exit
  -t, --target_list TARGET_LIST
                        Text file with a list of protein targets
  -j, --JGItaxonomy JGITAXONOMY
                        Path to the JGI_taxonomy.tsv file
  -b, --base_data_folder BASE_DATA_FOLDER
                        Base folder with antiSMASH results. Must match the folder structure contained in the JGI_taxonomy.tsv file
  -o, --output_folder OUTPUT_FOLDER
                        Base folder for output. Default: './output/'
  -e, --extension EXTENSION
                        Number of kilo base pairs to extend the region up- and downstream of the target protein. Default: 20 kbp
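
The effect of the -e/--extension option can be illustrated with a small coordinate calculation. This is only a sketch: the function name `region_bounds` and its parameters are hypothetical, and clamping the window to the ends of the contig is an assumption about how a region near a sequence edge would be handled, not a description of the script's actual code.

```python
def region_bounds(cds_start, cds_end, seq_len, extension_kbp=20):
    """Return (start, end) of a window extended around a CDS.

    Coordinates are 0-based, end-exclusive (the convention Biopython
    uses for feature locations). The window is clamped to the sequence,
    which is an assumption made for this sketch.
    """
    flank = extension_kbp * 1000          # kbp -> bp
    start = max(0, cds_start - flank)     # do not run past the 5' end
    end = min(seq_len, cds_end + flank)   # do not run past the 3' end
    return start, end


# A CDS in the middle of a 200 kbp contig gets the full 20 kbp on each side.
print(region_bounds(50_000, 51_500, 200_000))

# A CDS near the start of a short contig is limited by the contig itself.
print(region_bounds(5_000, 6_500, 20_000))
```

With Biopython, a window like this could then be cut out by slicing the record (`record[start:end]`), which keeps the features that lie fully inside the slice.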

Notes

TARGET_LIST must be a text file containing a simple list of protein ids, one per line. Each id has the format GROUP|PROJECT_NAME|NUM, where GROUP is a taxonomic group (optional), PROJECT_NAME is a MycoCosm genome sequencing project name and NUM is the protein id within MycoCosm.
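
As an illustration of the id format, the following sketch splits a label into its parts. The names `ProteinTarget`, `parse_target` and `read_targets` are hypothetical, the project name `Abcde1` is made up, and the handling of a missing GROUP (either two fields, or an empty first field) is an assumption; the script itself may treat such labels differently.

```python
from typing import NamedTuple, Optional


class ProteinTarget(NamedTuple):
    group: Optional[str]   # taxonomic group, None when absent
    project: str           # MycoCosm project name
    protein_id: str        # protein id within that project


def parse_target(label):
    """Split 'GROUP|PROJECT_NAME|NUM' (GROUP optional) into its fields."""
    fields = label.strip().split("|")
    if len(fields) == 3:
        group, project, num = fields
    elif len(fields) == 2:
        # Assumption: a label without GROUP has only two fields.
        group, project, num = None, fields[0], fields[1]
    else:
        raise ValueError(f"unexpected label format: {label!r}")
    if not project or not num:
        raise ValueError(f"missing project or protein id: {label!r}")
    # An empty GROUP field ('|Abcde1|123') is normalised to None.
    return ProteinTarget(group or None, project, num)


def read_targets(lines):
    """Parse every non-blank line of a target list."""
    return [parse_target(line) for line in lines if line.strip()]


print(parse_target("Eurotiomycetes|Abcde1|12345"))
print(read_targets(["Eurotiomycetes|Abcde1|12345\n", "\n", "Abcde1|678\n"]))
```

Because `read_targets` accepts any iterable of lines, an open file handle can be passed to it directly.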

The JGI_taxonomy.tsv file is generated by the MycoCosm Genome Downloader project.
