Extract strategy duplicates HOGs in a paralog group #43

@nromashchenko

Version: 1.2.0

Command: orthoxml-tools filter --infile tests/test-data/case_filtering.orthoxml --threshold 0.3 --strategy extract --out debug.orthoxml

This extracts HOG_Mammals_1 and HOG_Mammals_2 as rootHOGs. At the same time, both are still kept inside HOG_Tetrapoda as a paralog group, because Tetrapoda also passes the check, so the two HOGs end up duplicated in the output.

Original issue:

With --strategy extract, the output file is bigger than the input file (3 GB vs. 2.3 GB). For example, the gene ID 1053013941 is referenced by two geneRef elements in the output orthoXML:

 $ grep 1053013941  FastOMA_HOGs_Feb2026_orthoXMLtools_extract_0.3.orthoxml 
        <gene id="1053013941" protId="CALSQU_R06387"/>
        <geneRef id="1053013941"/>
                                    <geneRef id="1053013941"/>
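
The grep check above only covers a single gene ID. A minimal sketch of a file-wide check, assuming only that gene references are `<geneRef id="...">` elements (as in the grep output), could look like the following. It uses Python's standard library rather than orthoxml-tools itself; the helper names `local_name` and `duplicated_gene_refs` and the sample data are hypothetical, made up for illustration.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag: str) -> str:
    """Strip a namespace prefix of the form '{uri}' from an element tag."""
    return tag.rsplit("}", 1)[-1]


def duplicated_gene_refs(source) -> dict:
    """Return {gene id: count} for ids referenced by more than one <geneRef>.

    `source` is a file path or a binary file object.
    """
    counts = Counter()
    # iterparse reads the document incrementally instead of loading it at once.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) == "geneRef":
            counts[elem.get("id")] += 1
        # Drop the element's attributes, text and children once it is processed.
        elem.clear()
    return {gene_id: n for gene_id, n in counts.items() if n > 1}


# Made-up miniature input: gene "1" is referenced in a top-level group and
# again inside a paralog group of another group, mimicking the reported output.
SAMPLE = b"""<orthoXML>
  <species name="X"><database name="d"><genes>
    <gene id="1" protId="A"/>
    <gene id="2" protId="B"/>
  </genes></database></species>
  <groups>
    <orthologGroup><geneRef id="1"/><geneRef id="2"/></orthologGroup>
    <orthologGroup><paralogGroup><geneRef id="1"/></paralogGroup></orthologGroup>
  </groups>
</orthoXML>"""

print(duplicated_gene_refs(io.BytesIO(SAMPLE)))  # {'1': 2}
```

To run it on a real output file, pass the path instead of the in-memory sample, for example `duplicated_gene_refs("debug.orthoxml")`. An empty result would mean that no gene is referenced more than once.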

I ran this:

orthoxml-tools filter --infile FastOMA_HOGs_Feb2026_raw.orthoxml  --strategy extract  --threshold 0.3 --outfile FastOMA_HOGs_Feb2026_orthoXMLtools_extract_0.3.orthoxml 

Originally posted by @sinamajidian in #42
