Field separators in quoted attributes cause error #212

Open
zwpwjwtz opened this issue Apr 1, 2023 · 4 comments · May be fixed by #215

Comments

@zwpwjwtz

zwpwjwtz commented Apr 1, 2023

While testing this GTF file with gffutils, an AttributeStringError exception was raised at line 349 of parser.py.
After some exploration, I noticed that the parser fails on lines whose attribute values contain ";" (the field separator).

Example:
NC_000964.3 RefSeq CDS 410 1747 . + 0 gene_id "BSU_00010"; ...... note "Evidence 1a: Function from experimental evidences in the studied strain; PubMedId: 2167836, 2846289, 12682299, 16120674, 1779750, 28166228; Product type f : factor"; ......
Because the semicolons are extracted as field separators first, the sub-attributes ("Evidence 1a", "PubMedId" and "Product type f") are split into separate fields, and the numbers after "PubMedId" are parsed as multiple values of a spurious "PubMedId" key. Because dialect["repeated key"] had already been set by the multiple definitions of the "db_xref" field, this triggered the exception mentioned above.
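To make the failure mode concrete, here is a minimal sketch of what happens when an attribute string is split on ";" without regard to quotes. It is a simplified illustration, not gffutils' actual parsing code, and the string `attrs` contains only the parts of the example line that are shown above (the elided fields are left out).

```python
# Simplified illustration only; this is NOT gffutils' actual parser.
# `attrs` holds just the visible parts of the example line above.
attrs = (
    'gene_id "BSU_00010"; '
    'note "Evidence 1a: Function from experimental evidences in the studied strain; '
    'PubMedId: 2167836, 2846289, 12682299, 16120674, 1779750, 28166228; '
    'Product type f : factor";'
)

# Naive approach: split on the field separator, ignoring quotes entirely.
naive_fields = [f.strip() for f in attrs.split(";") if f.strip()]

for field in naive_fields:
    print(field)
```

The quoted "note" value ends up spread over three fields, one of which starts with "PubMedId:" and looks like a key with several comma-separated values.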

I suggest parsing quotes first, before the field separators are located and parsed. Although this may require the parser to behave like a streaming parser rather than a structured one, it guarantees that no content between quotes can escape and contaminate the other fields.
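A rough sketch of the quote-first idea is below. The function name `split_attributes` and its parameters are hypothetical, chosen only for illustration; this is not part of gffutils and is not necessarily how #215 approaches the problem. It also assumes that quotes are never escaped inside a value.

```python
def split_attributes(attr_string, sep=";", quote='"'):
    """Split an attribute string on `sep`, ignoring separators inside quotes.

    Hypothetical helper for illustration; assumes no escaped quotes.
    """
    fields = []
    current = []
    in_quotes = False
    for ch in attr_string:
        if ch == quote:
            # Toggle the quoted state and keep the quote character.
            in_quotes = not in_quotes
            current.append(ch)
        elif ch == sep and not in_quotes:
            # A separator outside quotes ends the current field.
            field = "".join(current).strip()
            if field:
                fields.append(field)
            current = []
        else:
            current.append(ch)
    # Keep any trailing field that has no closing separator.
    tail = "".join(current).strip()
    if tail:
        fields.append(tail)
    return fields


example = (
    'gene_id "BSU_00010"; '
    'note "Evidence 1a: Function from experimental evidences in the studied strain; '
    'PubMedId: 2167836, 2846289, 12682299, 16120674, 1779750, 28166228; '
    'Product type f : factor";'
)
fields = split_attributes(example)
for field in fields:
    print(field)
```

Scanning character by character is what makes this behave like a streaming parser: the semicolons inside the "note" value are never seen as separators, so the value stays in one field.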

@daler
Owner

daler commented Apr 1, 2023

Thanks for reporting. Agreed, it would be nice to have this but I do not have the bandwidth to work on this at the moment. Happy to review pull requests on this though!

@daler
Owner

daler commented Jul 4, 2023

See discussion in #215

@zwpwjwtz
Author

I can confirm, using the test file, that #215 fixes this issue. Another GTF file that had the same problem now parses without errors. Thank you all for your help and contributions!

@zwpwjwtz
Author

Hi, it has been a while since the last update (#215) and I haven't encountered any problems. Shall I close this issue as completed, or leave it open for further discussion?
