Duplicated headers when combining contigs from different samples #1501

agusinac · 2025-07-25T11:10:52Z


Is your feature request related to a problem? Please describe. For generic questions use Q&A section in the Discussions forum above.

I have noticed that when I run a SPAdes assembly per sample, there is a 1-10% chance that a contig header exactly matches a header from a different sample. My current workaround is to re-open the contig files and parse all of those headers in order to rename them. This is required before I can perform contig annotation.

Describe the solution you'd like

I would like the headers to be replaced by a hash or some other randomly generated identifier that is less likely to produce a duplicate. At the moment, a header that starts with NODE_1_ is a bit too generic, and the numbering could be replaced by a random code.

My data is currently very fragmented, so the assemblies produce millions of contigs.
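To illustrate the kind of identifier meant here, a minimal sketch follows. It is not something SPAdes provides: the function name `hashed_header`, the sample IDs, and the 16-character digest length are all assumptions made for the example. The ID is derived from the sample ID plus the original header, so the same `NODE_1_...` header in two samples maps to two different IDs.

```python
import hashlib


def hashed_header(sample_id: str, header: str, length: int = 16) -> str:
    """Return a short hex ID derived from a sample ID and the original header.

    Hypothetical helper: the name, the tab separator and the default
    length are illustrative choices, not part of SPAdes.
    """
    digest = hashlib.sha256(f"{sample_id}\t{header}".encode("utf-8")).hexdigest()
    # Truncating the digest keeps headers short, at the cost of a higher
    # collision risk; keep more characters when there are millions of contigs.
    return digest[:length]


# The same SPAdes-style header in two different samples
a = hashed_header("sampleA", "NODE_1_length_1000_cov_5.0")
b = hashed_header("sampleB", "NODE_1_length_1000_cov_5.0")
```

Because the ID is deterministic, re-running the renaming on the same input gives the same names, which a purely random code would not.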

Describe alternatives you've considered

As an alternative for now, I have a Python script that parses all samples, creates a mapping file of all contig headers, and rewrites the contig files with the new headers. The mapping file links the old name to the new name for each sample.
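A workaround of this kind can be sketched roughly as below. This is not the actual script from this post: the function name `rename_contigs`, the `sample__header` naming scheme, and the three-column TSV mapping layout are assumptions chosen for illustration. The file is streamed line by line, so memory use stays flat even with millions of contigs.

```python
import tempfile
from pathlib import Path


def rename_contigs(fasta_in, fasta_out, sample_id, mapping_out):
    """Prefix every FASTA header with sample_id and log old -> new names.

    Hypothetical helper: the "__" separator and TSV columns
    (sample, old name, new name) are illustrative assumptions.
    Returns the number of headers renamed.
    """
    renamed_count = 0
    # The mapping file is opened in append mode so that several samples
    # can share a single mapping file.
    with open(fasta_in) as src, open(fasta_out, "w") as dst, \
            open(mapping_out, "a") as tsv:
        for line in src:
            if line.startswith(">"):
                old = line[1:].rstrip("\n")
                new = f"{sample_id}__{old}"
                dst.write(f">{new}\n")
                tsv.write(f"{sample_id}\t{old}\t{new}\n")
                renamed_count += 1
            else:
                # Sequence lines are copied through unchanged
                dst.write(line)
    return renamed_count


# Small demonstration on a temporary one-contig FASTA file
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "a.fasta").write_text(">NODE_1_length_4_cov_1.0\nACGT\n")
    count = rename_contigs(tmp / "a.fasta", tmp / "a.renamed.fasta",
                           "sampleA", tmp / "map.tsv")
    renamed = (tmp / "a.renamed.fasta").read_text()
    mapping = (tmp / "map.tsv").read_text()
```

Calling the function once per sample with the same `mapping_out` path accumulates one mapping table for the whole project.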

Additional context

No response

asl · 2025-07-25T16:07:51Z

Maintainer

> I have noticed when I perform spades assembly per sample, I have an 1-10% chance that the contig header has an exact match with another header from a different sample.

This is expected. The contig names are per-sample. Different downstream users have different requirements and the producer is not expected to be universal :) So, your current approach with post-assembly renaming is correct.

Note that attaching a hash might not be convenient for the majority of users. They might prefer to have something like a "sample identifier" appended instead. We might consider adding an option like this if time permits.
