Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

create_db() does not parse directives from GFF files starting in v0.11.0 #213

Open
dtdoering opened this issue Apr 7, 2023 · 3 comments

Comments

@dtdoering
Copy link

I am trying to read in a GFF that doesn't adhere to spec so I can use gffutils to fix it and then write out a corrected file. I've gotten everything working, except that the GFF header directive(s) don't appear in the output file. It appears they are not being parsed upon creation of the database.

Interestingly, this only happens when a FeatureDB is created from a file, not from e.g. a dedent()ed string as in the existing parser_test.py.

This behavior is exhibited in v0.11.1 and v0.11.0, but not v0.10.1 (thus, a workaround is to downgrade to v0.10.1).

To reproduce:

Use gffutils > v0.10.1.

Create a test file test.gff with the following:

##gff-version 3
.	.	.	.	.	.	.	.
.	.	.	.	.	.	.	.
.	.	.	.	.	.	.	.
.	.	.	.	.	.	.	.

Example code:

import gffutils
db = gffutils_create_db('test.gff', dbfn='test.db', force=True)
print(len(db.directives))

Expected output:

1

Observed output:

0

System/environment info:

OS: GNU/Linux

Python: 3.8.5 (still happens with 3.8.13)

Conda environment:

name: gffutils-py3_8_5
channels:
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - _libgcc_mutex=0.1=conda_forge
  - _openmp_mutex=4.5=2_gnu
  - argcomplete=3.0.5=pyhd8ed1ab_0
  - argh=0.27.2=pyhd8ed1ab_0
  - biopython=1.81=py38h1de0b5d_0
  - ca-certificates=2022.12.7=ha878542_0
  - gffutils=0.11.1=pyh7cba7a3_0
  - ld_impl_linux-64=2.40=h41732ed_0
  - libblas=3.9.0=16_linux64_openblas
  - libcblas=3.9.0=16_linux64_openblas
  - libffi=3.2.1=he1b5a44_1007
  - libgcc-ng=12.2.0=h65d4601_19
  - libgfortran-ng=12.2.0=h69a702a_19
  - libgfortran5=12.2.0=h337968e_19
  - libgomp=12.2.0=h65d4601_19
  - liblapack=3.9.0=16_linux64_openblas
  - libopenblas=0.3.21=pthreads_h78a6416_3
  - libsqlite=3.40.0=h753d276_0
  - libstdcxx-ng=12.2.0=h46fd767_19
  - libzlib=1.2.13=h166bdaf_4
  - ncurses=6.3=h27087fc_1
  - numpy=1.24.2=py38h10c12cc_0
  - openssl=1.1.1t=h0b41bf4_0
  - packaging=23.0=pyhd8ed1ab_0
  - pip=23.0.1=pyhd8ed1ab_0
  - pyfaidx=0.7.2.1=pyh7cba7a3_1
  - python=3.8.5=h1103e12_9_cpython
  - python_abi=3.8=3_cp38
  - pyvcf3=1.0.3=pyhdfd78af_0
  - readline=8.2=h8228510_1
  - setuptools=67.6.1=pyhd8ed1ab_0
  - simplejson=3.18.4=py38h1de0b5d_0
  - six=1.16.0=pyh6c4a22f_0
  - sqlite=3.40.0=h4ff8645_0
  - tk=8.6.12=h27826a3_0
  - wheel=0.40.0=pyhd8ed1ab_0
  - xz=5.2.6=h166bdaf_0
  - zlib=1.2.13=h166bdaf_4
prefix: /home/dtdoering__lbl.gov/.conda/envs/gffutils-py3_8_5
daler added a commit that referenced this issue Jul 4, 2023
@daler
Copy link
Owner

daler commented Jul 4, 2023

@dtdoering I'm unable to reproduce in a test -- see the new test at a4b443b and passing tests here.

Are you able to reproduce this with the latest in the v0.12rc branch? Or, if you don't get around to it for a little while, with v0.12 once it's released?

@dtdoering
Copy link
Author

Hey @daler, getting back around to this now!

I pasted your test_issue_213 function into a new script, and am getting the same results as you on the latest v0.12 tag (i.e. no issue).

However, I must have done a poor job with my initial description of the issue because I am now only able to reproduce the issue again by supplying a dialect argument in the create_db calls in your test function. This is something I had been/am doing in my real-world code when i first encountered the issue, so I'm not sure how I was reproducing the issue without that argument. My apologies! 😅

At any rate, by supplying dialect={'fmt': 'gff3'} to the create_db calls in your test function, I am now indeed getting AssertionErrors, whereas I don't get them if I omit the dialect argument and let it figure out the dialect.

@dtdoering
Copy link
Author

For the sake of completeness, here is the script (gffutils_debug.py) that I'm running your test with:

#!/usr/bin/env python

import sys
import os
import tempfile
from textwrap import dedent

gffutils_git_path = os.path.join(os.environ.get('HOME'), 'sft', 'gffutils')

sys.path.insert(1, gffutils_git_path)

import gffutils

print(f"gffutils version: {gffutils.__version__}")

def test_issue_213():
    # GFF header directives seem to be not parsed when building a db from
    # a file, even though it seems to work fine from a string.
    data = dedent(
        """
    ##gff-version 3
    .	.	.	.	.	.	.	.
    .	.	.	.	.	.	.	.
    .	.	.	.	.	.	.	.
    .	.	.	.	.	.	.	.
    """
    )

    # Ensure directives are parsed from DataIterator
    it = gffutils.iterators.DataIterator(data, from_string=True)
    assert it.directives == ["gff-version 3"]


    # Ensure they're parsed into the db from a string
    db = gffutils.create_db(data, dbfn=":memory:", from_string=True, verbose=False, dialect={'fmt': 'gff3'})
    assert db.directives == ["gff-version 3"], db.directives

    # Ensure they're parsed into the db from a file
    tmp = tempfile.NamedTemporaryFile(delete=False).name
    with open(tmp, "w") as fout:
        fout.write(data + "\n")
    db = gffutils.create_db(tmp, ":memory:", dialect={'fmt': 'gff3'})
    assert db.directives == ["gff-version 3"], db.directives
    assert len(db.directives) == 1

    # Ensure they're parsed into the db from a file, and going to a file (to
    # exactly replicate example in #213)
    db = gffutils.create_db(tmp, dbfn='issue_213.db', force=True, dialect={'fmt': 'gff3'})
    assert db.directives == ["gff-version 3"], db.directives
    assert len(db.directives) == 1

test_issue_213()

Then I run it with:

~/gffutils_debug.py

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants