Hi LMAT team!
I have been trying to follow the documentation in lmat-doc.txt to build a custom database for use with LMAT, and I have run into problems. I'll document one specific case here.
One step in the process of building a custom database is constructing a mapping file between NCBI Taxonomy Database identifiers and the full deflines from the multi-FASTA formatted file containing the reference sequences. I've followed the documentation below in that regard:
The mapping is specified as a tab delimited file with the first column containing the tax id and the second
column should contain the header associated with sequence stored in the input fasta file (WORK/test.fa below)
For example:
418127 >ref|NC_009782.1|gnl|NCBI_GENOMES|21340|gi|156978331|Staphylococcus aureus subsp. aureus Mu3, complete genome
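A file in that format could be sanity-checked with something like the sketch below. This is only an illustration of the two-column layout as I read it from the documentation: `parse_mapping_line` is a hypothetical helper, not part of LMAT, and it assumes the columns are separated by a single tab character.

```python
def parse_mapping_line(line):
    """Split one mapping line into (tax_id, header).

    Assumes the documented layout: tax id, a tab, then the full FASTA header.
    This helper is hypothetical and is not part of LMAT.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 2:
        raise ValueError(f"expected 2 tab-separated columns, got {len(fields)}")
    tax_id, header = fields
    if not tax_id.isdigit():
        raise ValueError(f"tax id is not numeric: {tax_id!r}")
    return tax_id, header


# The example line from the documentation, with an explicit tab between columns.
example = (
    "418127\t>ref|NC_009782.1|gnl|NCBI_GENOMES|21340|gi|156978331|"
    "Staphylococcus aureus subsp. aureus Mu3, complete genome"
)
tax_id, header = parse_mapping_line(example)
```

Spaces inside the header are left untouched because only tabs are treated as column separators.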
When I provide my constructed GenomeToTaxID.txt file to build_header_table.py, it breaks:
reading: /media/ephemeral/taltman/lmat/GenomeToTaxID.txt
Traceback (most recent call last):
  File "./build_header_table.py", line 44, in <module>
    gi_to_tid[t[4]] = t[0]
IndexError: list index out of range
Looking at the Python script, I see that it expects each line to have at least five columns, not two. Changing t[4] to t[1] seems to fix it.
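The failure can be reproduced in isolation with a short sketch. Only the statement `gi_to_tid[t[4]] = t[0]` comes from the traceback; how the script actually produces `t` is an assumption here (splitting each line on tabs), as is the rest of the scaffolding.

```python
# A two-column line in the documented format (tax id, tab, FASTA header).
line = (
    "418127\t>ref|NC_009782.1|gnl|NCBI_GENOMES|21340|gi|156978331|"
    "Staphylococcus aureus subsp. aureus Mu3, complete genome\n"
)

gi_to_tid = {}
# Assumption: the script splits on tabs; its real parsing code is not shown above.
t = line.rstrip("\n").split("\t")

try:
    gi_to_tid[t[4]] = t[0]  # statement from the traceback; needs >= 5 columns
    failed = False
except IndexError:
    failed = True  # a two-column line has only t[0] and t[1]

# The workaround described above: use the second column as the key.
gi_to_tid[t[1]] = t[0]
```

With a two-column input, `t` has only two elements, so the `t[4]` lookup fails while the `t[1]` lookup maps the header to its tax id.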
So, either there is a documentation bug, or there is a software bug.
Any feedback would be greatly appreciated. Thanks!