-
Notifications
You must be signed in to change notification settings - Fork 124
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Chimera detection --uchime_ref unexpected behaviour #504
Comments
Thank you for reporting this. Could you please provide some test data and the exact command line used? I'll try to reproduce the issue, and then make a minimal reproducible example if possible. |
Sure, First command reports chimera with lower score When I move desired parent A to the top of the design FASTA the results are as expected |
Thanks, I confirm there is something strange here, maybe a bug. The wrong parent A is selected when there are more than 17,195 entries between the QUERY="TATCTACCCAACGAACGGCTATACCCGCTATGCGGACTCGGTGAAAGGTCGTTTCACGATCTCGGCGGATACGTCGAAAAACACGGCCTACCTGCAGATGAACTCGCTGCGTGCCGAGGATACGGCCGTGTATTATTGTTCGCGTTGGGGCGGCCTGGGTTTCATGGCGATGGAC"
PARENT_A_HIGH="TATCTACCCAACGAACGGCTATACCCGCTATGCGGACTCGGTGAAAGGTCGTTTCACGATCTCGGCGGATACGTCGAAAAACACCGCGTACCTGCAGATGAACAGCCTGCGTGCGGAAGATACGGCGGTTTACTATTGCTCGCGCCACGGCGGTGACGGCCACTACGCCATGGAC" # SAMP23224_sim_16939
PARENT_A_LOW="TATCTACCCAACGAACGGCTATACCCGCTATGCGGACTCGGTGAAGGGTCGTTTCACGATCTCGGCCGATACGTCGAAGAACACCGCGTACTTACAGATGAACAGCCTGCGCGCGGAAGACACGGCGGTGTATTACTGTAGCCGCTGGGGCGGCGCCCTGTTCTACGCGATGGAC" # SAMP23224_sim_13735
PARENT_B="TATCTACCCGACCAATGGCTATACGCGCTACGCCGACTCGGTTAAAGGTCGCTTCACGATCTCGGCGGATACGAGCAAGAACACGGCCTACCTGCAGATGAACTCGCTGCGTGCCGAGGATACGGCCGTGTATTATTGTTCGCGTTGGGGCGGCCTGGGTTTCATGGCGATGGAC" # SAMP23224_sim_16714
FILLER_FILE="filler.fasta" # 37,198 lines
## PARENT_B PARENT_A_LOW ... no simulated entries ... PARENT_A_HIGH
vsearch \
--uchime_ref <(printf ">query\n%s\n" ${QUERY}) \
--db <(printf ">sB\n%s\n>sA_low\n%s\n" ${PARENT_B} ${PARENT_A_LOW}
# cat ${FILLER_FILE}
printf ">sA_high\n%s\n" ${PARENT_A_HIGH}
) \
--quiet \
--dbmask none \
--uchimeout - # score is 2.4872
## PARENT_B PARENT_A_LOW ... some simulated entries ... PARENT_A_HIGH
N_FILLER=34390
vsearch \
--uchime_ref <(printf ">query\n%s\n" ${QUERY}) \
--db <(printf ">sB\n%s\n>sA_low\n%s\n" ${PARENT_B} ${PARENT_A_LOW}
head -n ${N_FILLER} ${FILLER_FILE}
printf ">sA_high\n%s\n" ${PARENT_A_HIGH}
) \
--quiet \
--dbmask none \
--uchimeout - # score is 2.4872
## PARENT_B PARENT_A_LOW ... some simulated entries ... PARENT_A_HIGH
N_FILLER=34392
vsearch \
--uchime_ref <(printf ">query\n%s\n" ${QUERY}) \
--db <(printf ">sB\n%s\n>sA_low\n%s\n" ${PARENT_B} ${PARENT_A_LOW}
head -n ${N_FILLER} ${FILLER_FILE}
printf ">sA_high\n%s\n" ${PARENT_A_HIGH}
) \
--quiet \
--dbmask none \
--uchimeout - # score is 0.9055 I don't observe that pattern when using monotonous sequences (only 'A') as filler sequences. |
I too can confirm that there seems to be a bug here. I'm looking into it. |
I've looked at the examples provided by @frederic-mahe, and for those examples I do not think a bug is exposed. It is rather a consequence of the heuristics of the algorithm. Actually, sequence number 17196 in the The uchime algorithm divides the query into 4 equally long parts and initially finds a few (actually 4 in vsearch) candidate sequences that are very similar to each part. It then goes on to find the best pair of parents among those (up to 16 sequences). What happens in @frederic-mahe's example is that a the I am not sure whether the same happens in the original case presented by @nvucic, but it may be related. Perhaps there are several sequences that are all 100% identical in one of the parts. Then the sequences selected may depend on the order in the file. So it may be due to the heuristics. To improve the situation, we could increase the number of sequences considered in each region (which would take more time), or we could sort them differently. I am currently working on an updated chimera algorithm for long reads, and will have these issues in mind. |
@torognes just enabling this as a parameter would be very useful. Thank you for looking into this! |
Hello,
I've been using --uchime_ref to identify chimeras providing a list of reference sequences in FASTA format. I noticed that vsearch does not report match with highest score when there are large number of sequences (~5k) between second best and best match in cases where second match comes first in the FASTA file. I could provide test data if necessary.
Can you suggest any kind of workarund or fix for this?
Best,
Nemanja
The text was updated successfully, but these errors were encountered: