Replies: 1 comment
This is expected: contig names are assigned per sample, so they are unique only within a single assembly. Different downstream users have different requirements, and the producer is not expected to be universal :) So your current approach of renaming after assembly is correct. Note that attaching a hash might not be convenient for the majority of users; they might prefer to have something like a sample identifier appended instead. We might consider adding an option like this if time permits.
Is your feature request related to a problem? Please describe.
I have noticed that when I run a SPAdes assembly per sample, there is a 1-10% chance that a contig header exactly matches a header from a different sample. What I do now is reopen and parse all of those headers to rename them. This is required before I perform contig annotation.
Describe the solution you'd like
I would like the headers to be replaced by a hash or some other randomly generated code that has a lower chance of creating a duplicate. At the moment, a header that starts with NODE_1_ is a bit too generic, and the numbering could be replaced by a random code. My data is currently very fragmented, so this will result in millions of contigs.
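To make the request concrete, here is a minimal sketch of what a hash-based name could look like. It is only an illustration, not something SPAdes provides: the function name `hashed_contig_name`, the `contig_` prefix, and the sample identifiers are all hypothetical.

```python
import hashlib


def hashed_contig_name(sample_id: str, header: str, length: int = 16) -> str:
    """Return a short, deterministic name for one contig header.

    Hypothetical helper: sample_id is whatever label identifies the sample.
    """
    # Hash the sample identifier together with the original header, so that
    # identical headers from different samples get different names.
    key = f"{sample_id}\t{header}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Keep only a prefix of the hex digest to keep names readable; a longer
    # prefix lowers the collision risk further.
    return f"contig_{digest[:length]}"


# The same header in two samples no longer clashes.
name_a = hashed_contig_name("sampleA", "NODE_1_length_1000_cov_5.0")
name_b = hashed_contig_name("sampleB", "NODE_1_length_1000_cov_5.0")
print(name_a, name_b)
```

Because the name is derived from the sample identifier and the old header, it is reproducible across runs, unlike a purely random code.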
Describe alternatives you've considered
As an alternative for now, I have a Python script that parses all samples, creates a mapping file of all contig headers, and rewrites the contig files with the new headers. The mapping file links the old name to the new name for each sample.
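A minimal sketch of such a renaming step is below. It is not my actual script: the function name `rename_contigs`, the `__` separator, the file paths, and the three-column mapping layout are assumptions chosen for illustration. It prefixes each header with a sample identifier rather than a hash.

```python
def rename_contigs(fasta_in, fasta_out, mapping_out, sample_id):
    """Prefix every FASTA header with sample_id and record old -> new names.

    Returns the number of renamed contigs. The mapping file is opened in
    append mode so several samples can share one mapping file.
    """
    renamed = 0
    with open(fasta_in) as src, \
            open(fasta_out, "w") as dst, \
            open(mapping_out, "a") as tsv:
        for line in src:
            if line.startswith(">"):
                old = line[1:].rstrip("\n")
                new = f"{sample_id}__{old}"
                dst.write(f">{new}\n")
                # Columns: sample, old header, new header (tab-separated).
                tsv.write(f"{sample_id}\t{old}\t{new}\n")
                renamed += 1
            else:
                # Sequence lines are copied through unchanged.
                dst.write(line)
    return renamed
```

Called once per sample, for example `rename_contigs("sampleA/contigs.fasta", "sampleA/contigs.renamed.fasta", "contig_mapping.tsv", "sampleA")`, this streams the file line by line, so it does not need to hold millions of contigs in memory.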
Additional context
No response