Given a list of (formatted) protein labels, this script uses pre-existing GenBank genome files to locate the CDS feature encoding each target protein, extracts a region around it, and saves that region as a GenBank file.
- Python 3.12+
- Biopython (tested with v1.83)
usage: loci_list_extractor.py [-h] -t TARGET_LIST -j JGITAXONOMY -b BASE_DATA_FOLDER [-o OUTPUT_FOLDER] [-e EXTENSION]
Loci extractor
options:
-h, --help show this help message and exit
-t, --target_list TARGET_LIST
Text file with a list of protein targets
-j, --JGItaxonomy JGITAXONOMY
Path to the JGI_taxonomy.tsv file
-b, --base_data_folder BASE_DATA_FOLDER
Base folder with antiSMASH results. Must match the folder structure contained in the JGI_taxonomy.tsv file
-o, --output_folder OUTPUT_FOLDER
Base folder for output. Default: './output/'
-e, --extension EXTENSION
Number of kilo base pairs to extend the region up- and downstream of the target protein. Default: 20 kbp
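To illustrate what the `-e` option means in terms of coordinates, here is a minimal sketch of how the extracted window could be computed. It is not taken from the script: the function name `region_bounds` is hypothetical, and clamping the window at the ends of the contig is an assumption about how edge cases are handled.

```python
def region_bounds(cds_start, cds_end, record_length, extension_kbp=20):
    """Return (start, end) of the window around a CDS, as 0-based,
    end-exclusive coordinates (the convention Biopython uses).

    Hypothetical helper: the real script may compute this differently.
    """
    if not 0 <= cds_start < cds_end <= record_length:
        raise ValueError("CDS coordinates fall outside the record")

    # Convert the extension from kilobase pairs to base pairs.
    extension_bp = int(extension_kbp * 1000)

    # Assumption: the window is clamped so it never runs past either
    # end of the contig or scaffold.
    start = max(0, cds_start - extension_bp)
    end = min(record_length, cds_end + extension_bp)
    return start, end


# A 1.5 kbp CDS in the middle of a 200 kbp scaffold, default extension.
print(region_bounds(50_000, 51_500, 200_000))

# A CDS close to both ends of a short scaffold: the window is clamped.
print(region_bounds(5_000, 6_500, 20_000))
```

With Biopython, a window like this can be applied by slicing a `SeqRecord` (`record[start:end]`), which keeps the features that lie entirely inside the slice.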
TARGET_LIST is a plain-text file containing one protein id per line. Each id has the form GROUP|PROJECT_NAME|NUM, where GROUP is a taxonomic group (optional), PROJECT_NAME is a MycoCosm genome sequencing project name, and NUM is the protein id within MycoCosm.
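The id format can be parsed along the following lines. This is only a sketch: the names `Target`, `parse_target` and `parse_target_list` are hypothetical, and it assumes that an "optional" GROUP means the field is either left empty or omitted along with its separator. The example ids are made up for illustration.

```python
from typing import NamedTuple, Optional


class Target(NamedTuple):
    group: Optional[str]   # taxonomic group, or None when absent
    project: str           # MycoCosm genome sequencing project name
    protein_id: str        # protein id within MycoCosm


def parse_target(line: str) -> Target:
    """Parse one 'GROUP|PROJECT_NAME|NUM' line (GROUP is optional)."""
    parts = line.strip().split("|")
    if len(parts) == 3:
        group, project, num = parts
    elif len(parts) == 2:
        # Assumption: GROUP may be dropped together with its separator.
        group = ""
        project, num = parts
    else:
        raise ValueError(f"Unexpected target format: {line!r}")
    if not project or not num:
        raise ValueError(f"Missing project name or protein id: {line!r}")
    return Target(group or None, project, num)


def parse_target_list(text: str) -> list:
    """Parse the contents of a target list, skipping blank lines."""
    return [parse_target(line) for line in text.splitlines() if line.strip()]


# Made-up ids, for illustration only.
example = "Eurotiomycetes|Aspnid1|1234\n\nAspnid1|5678\n"
for target in parse_target_list(example):
    print(target)
```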
The JGI_taxonomy.tsv file is generated by the MycoCosm Genome Downloader project.