Issue importing db: TypeError: cannot unpack non-iterable NoneType object #236

Closed
bschilder opened this issue Nov 30, 2024 · 9 comments

Comments

@bschilder

bschilder commented Nov 30, 2024

Hello,

Thanks for the tool, love the concept. I'm having some trouble getting the db to work, though. I've tried this with two different files (gff and gff3) and hit the same error both times.

Thanks in advance for your help,
Brian

Reprex

Download gff3

Download annotations from GENCODE.

!wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_47/gencode.v47.annotation.gff3.gz
!gunzip gencode.v47.annotation.gff3.gz

Create db

import gffutils

dbfn = "gencode.v47.annotation.db"
db = gffutils.create_db("gencode.v47.annotation.gff3",
                        dbfn=dbfn,
                        force=False,
                        keep_order=True,
                        merge_strategy='merge',
                        sort_attribute_values=True)

Import db

db = gffutils.FeatureDB(dbfn, keep_order=True)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [6], in <cell line: 11>()
     11 if os.path.exists(dbfn):
     12     print("Using existing db.")
---> 13     db = gffutils.FeatureDB(dbfn)
     14 else:
     15     print("Creating db.")

File ~/.local/lib/python3.9/site-packages/gffutils/interface.py:199, in FeatureDB.__init__(self, dbfn, default_encoding, keep_order, pragmas, sort_attribute_values, text_factory)
    191 # Load some meta info
    192 # TODO: this is a good place to check for previous versions, and offer
    193 # to upgrade...
    194 c.execute(
    195     """
    196     SELECT version, dialect FROM meta
    197     """
    198 )
--> 199 version, dialect = c.fetchone()
    200 self.version = version
    201 self.dialect = helpers._unjsonify(dialect)

TypeError: cannot unpack non-iterable NoneType object

Versioning

  • python 3.9.5
  • gffutils 0.13

All packages

``` absl-py @ file:///dev/shm/jax/0.2.24/foss-2021a-CUDA-11.3.1/absl-py-0.15.0 alabaster @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/alabaster/alabaster-0.7.12 anyio==4.6.2.post1 appdirs @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/appdirs/appdirs-1.4.4 argcomplete==3.5.1 argh==0.31.3 argon2-cffi==23.1.0 argon2-cffi-bindings==21.2.0 arrow==1.3.0 asn1crypto @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/asn1crypto/asn1crypto-1.4.0 astor @ file:///dev/shm/TensorFlow/2.6.0/foss-2021a-CUDA-11.3.1/astor/astor-0.8.1 asttokens==2.0.5 astunparse @ file:///dev/shm/TensorFlow/2.6.0/foss-2021a-CUDA-11.3.1/astunparse/astunparse-1.6.3 async-lru==2.0.4 atomicwrites @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/atomicwrites/atomicwrites-1.4.0 attrs==24.2.0 babel==2.16.0 backcall==0.2.0 bcbio-gff==0.7.1 bcrypt @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/bcrypt/bcrypt-3.2.0 beautifulsoup4==4.12.3 biopython==1.84 bitstring @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/bitstring/bitstring-3.1.7 bleach==6.2.0 blist @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/blist/blist-1.3.6 Bottleneck @ file:///dev/shm/SciPybundle/2021.05/foss-2021a/Bottleneck/Bottleneck-1.3.2 CacheControl @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/CacheControl/CacheControl-0.12.6 cachetools @ file:///dev/shm/TensorFlow/2.6.0/foss-2021a-CUDA-11.3.1/cachetools/cachetools-4.2.2 cachy @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/cachy/cachy-0.3.0 certifi @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/certifi/certifi-2020.12.5 cffi @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/cffi/cffi-1.14.5 chardet @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/chardet/chardet-4.0.0 charset-normalizer==2.0.12 clang @ file:///dev/shm/TensorFlow/2.6.0/foss-2021a-CUDA-11.3.1/clang/clang-5.0 cleo @ file:///grid/it/data/elzar/easybuild/sources/p/Python/extensions/cleo-0.8.1-py2.py3-none-any.whl click @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/click/click-7.1.2 clikit @ 
file:///grid/it/data/elzar/easybuild/sources/p/Python/extensions/clikit-0.6.2-py2.py3-none-any.whl colorama @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/colorama/colorama-0.4.4 coloredlogs==15.0.1 crashtest @ file:///grid/it/data/elzar/easybuild/sources/p/Python/extensions/crashtest-0.3.1-py3-none-any.whl cryptography @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/cryptography/cryptography-3.4.7 cycler==0.11.0 Cython @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/Cython/Cython-0.29.23 cyvcf2==0.31.1 deap @ file:///dev/shm/SciPybundle/2021.05/foss-2021a/deap/deap-1.3.1 debugpy==1.5.1 decorator @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/decorator/decorator-5.0.7 defusedxml==0.7.1 dill @ file:///dev/shm/TensorFlow/2.6.0/foss-2021a-CUDA-11.3.1/dill/dill-0.3.3 distlib @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/distlib/distlib-0.3.1 dm-tree==0.1.6 docopt @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/docopt/docopt-0.6.2 docutils @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/docutils/docutils-0.17.1 ecdsa @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/ecdsa/ecdsa-0.16.1 entrypoints==0.4 exceptiongroup==1.2.2 executing==0.8.3 expecttest @ file:///dev/shm/expecttest/0.1.3/GCCcore-10.3.0/expecttest-0.1.3 fastjsonschema==2.20.0 filelock @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/filelock/filelock-3.0.12 flatbuffers @ file:///dev/shm/flatbufferspython/2.0/GCCcore-10.3.0/flatbuffers-2.0 flit @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/flit/flit-3.2.0 flit_core @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/flitcore/flit_core-3.2.0 fonttools==4.29.1 fqdn==1.5.1 fsspec @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/fsspec/fsspec-2021.4.0 future @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/future/future-0.18.2 gast @ file:///dev/shm/TensorFlow/2.6.0/foss-2021a-CUDA-11.3.1/gast/gast-0.4.0 gffpandas==1.2.0 gffutils==0.13 google-api-core==2.23.0 google-auth @ file:///dev/shm/TensorFlow/2.6.0/foss-2021a-CUDA-11.3.1/googleauth/google-auth-1.35.0 google-auth-oauthlib @ 
file:///dev/shm/TensorFlow/2.6.0/foss-2021a-CUDA-11.3.1/googleauthoauthlib/google-auth-oauthlib-0.4.5 google-cloud-bigquery==3.27.0 google-cloud-core==2.4.1 google-crc32c==1.6.0 google-pasta @ file:///dev/shm/TensorFlow/2.6.0/foss-2021a-CUDA-11.3.1/googlepasta/google-pasta-0.2.0 google-resumable-media==2.7.2 googleapis-common-protos==1.66.0 greenlet==3.1.1 grpcio @ file:///dev/shm/TensorFlow/2.6.0/foss-2021a-CUDA-11.3.1/grpcio/grpcio-1.39.0 grpcio-status==1.68.0 gviz-api @ file:///dev/shm/TensorFlow/2.6.0/foss-2021a-CUDA-11.3.1/gvizapi/gviz_api-1.9.0 h11==0.14.0 h5py @ file:///dev/shm/h5py/3.2.1/foss-2021a/h5py-3.2.1 html5lib @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/html5lib/html5lib-1.1 httpcore==1.0.7 httpx==0.27.2 humanfriendly==10.0 idna @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/idna/idna-2.10 imagesize @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/imagesize/imagesize-1.2.0 importlib_metadata==8.5.0 iniconfig @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/iniconfig/iniconfig-1.1.1 intervaltree @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/intervaltree/intervaltree-3.1.0 intreehooks @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/intreehooks/intreehooks-1.0 ipaddress @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/ipaddress/ipaddress-1.0.23 ipykernel==6.9.1 ipython==8.1.1 ipython-genutils==0.2.0 ipython-sql==0.5.0 isoduration==20.11.0 jax @ file:///dev/shm/jax/0.2.24/foss-2021a-CUDA-11.3.1/jax/jax-jax-v0.2.24 jaxlib @ file:///dev/shm/jax/0.2.24/foss-2021a-CUDA-11.3.1/jax-jaxlib-v0.1.73/dist/jaxlib-0.1.73-cp39-none-manylinux2010_x86_64.whl jedi==0.18.1 jeepney @ file:///grid/it/data/elzar/easybuild/sources/p/Python/extensions/jeepney-0.6.0-py3-none-any.whl Jinja2==3.1.4 joblib @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/joblib/joblib-1.0.1 json5==0.10.0 jsonpointer==3.0.0 jsonschema==4.23.0 jsonschema-specifications==2024.10.1 jupyter-events==0.10.0 jupyter-lsp==2.2.5 jupyter_client==7.4.9 jupyter_core==5.7.2 jupyter_server==2.14.2 
jupyter_server_terminals==0.5.3 jupyterlab==4.3.1 jupyterlab-lsp==5.1.0 jupyterlab_pygments==0.3.0 jupyterlab_server==2.27.3 keras @ file:///grid/it/data/elzar/easybuild/sources/t/TensorFlow/extensions/keras-2.6.0-py2.py3-none-any.whl Keras-Preprocessing @ file:///dev/shm/TensorFlow/2.6.0/foss-2021a-CUDA-11.3.1/Keras_Preprocessing/Keras_Preprocessing-1.1.2 keyring @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/keyring/keyring-21.8.0 keyrings.alt @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/keyringsalt/keyrings.alt-4.0.2 kiwisolver==1.3.2 liac-arff @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/liacarff/liac-arff-2.5.0 lockfile @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/lockfile/lockfile-0.12.2 MACS2==2.2.6 Markdown @ file:///dev/shm/TensorFlow/2.6.0/foss-2021a-CUDA-11.3.1/Markdown/Markdown-3.3.4 MarkupSafe==3.0.2 matplotlib==3.5.1 matplotlib-inline==0.1.3 mistune==3.0.2 mock @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/mock/mock-4.0.3 more-itertools @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/moreitertools/more-itertools-8.7.0 mpi4py @ file:///dev/shm/SciPybundle/2021.05/foss-2021a/mpi4py/mpi4py-3.0.3 mpmath @ file:///dev/shm/SciPybundle/2021.05/foss-2021a/mpmath/mpmath-1.2.1 msgpack @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/msgpack/msgpack-1.0.2 nbclient==0.10.0 nbconvert==7.16.4 nbformat==5.10.4 nest-asyncio==1.5.4 netaddr @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/netaddr/netaddr-0.8.0 netifaces @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/netifaces/netifaces-0.10.9 nose @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/nose/nose-1.3.7 notebook_shim==0.2.4 numexpr @ file:///dev/shm/SciPybundle/2021.05/foss-2021a/numexpr/numexpr-2.7.3 numpy==1.22.3 oauthlib @ file:///dev/shm/TensorFlow/2.6.0/foss-2021a-CUDA-11.3.1/oauthlib/oauthlib-3.1.1 opt-einsum @ file:///dev/shm/jax/0.2.24/foss-2021a-CUDA-11.3.1/opt_einsum/opt_einsum-3.3.0 overrides==7.7.0 packaging==24.2 pandas @ file:///dev/shm/SciPybundle/2021.05/foss-2021a/pandas/pandas-1.2.4 pandocfilters==1.5.1 
paramiko @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/paramiko/paramiko-2.7.2 parso==0.8.3 pastel @ file:///grid/it/data/elzar/easybuild/sources/p/Python/extensions/pastel-0.2.1-py2.py3-none-any.whl pathlib2 @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/pathlib2/pathlib2-2.3.5 paycheck @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/paycheck/paycheck-1.0.2 pbr @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/pbr/pbr-5.6.0 pexpect @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/pexpect/pexpect-4.8.0 pickleshare==0.7.5 Pillow @ file:///dev/shm/Pillow/8.2.0/GCCcore-10.3.0/Pillow-8.2.0 pkginfo @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/pkginfo/pkginfo-1.7.0 platformdirs==4.3.6 pluggy @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/pluggy/pluggy-0.13.1 poetry @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/poetry/poetry-1.1.6 poetry-core @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/poetrycore/poetry-core-1.0.3 portpicker @ file:///dev/shm/TensorFlow/2.6.0/foss-2021a-CUDA-11.3.1/portpicker/portpicker-1.4.0 prettytable==3.12.0 prometheus_client==0.21.0 promise==2.3 prompt-toolkit==3.0.28 proto-plus==1.25.0 protobuf @ file:///dev/shm/protobufpython/3.17.3/GCCcore-10.3.0/protobuf-3.17.3 psutil @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/psutil/psutil-5.8.0 ptyprocess @ file:///grid/it/data/elzar/easybuild/sources/p/Python/extensions/ptyprocess-0.7.0-py2.py3-none-any.whl pure-eval==0.2.2 py @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/py/py-1.10.0 py-expression-eval @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/py_expression_eval/py_expression_eval-0.3.13 pyasn1 @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/pyasn1/pyasn1-0.4.8 pyasn1-modules @ file:///dev/shm/TensorFlow/2.6.0/foss-2021a-CUDA-11.3.1/pyasn1modules/pyasn1-modules-0.2.8 pybind11 @ file:///dev/shm/pybind11/2.6.2/GCCcore-10.3.0/pybind11-2.6.2 pycparser @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/pycparser/pycparser-2.20 pycrypto @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/pycrypto/pycrypto-2.6.1 pyfaidx==0.8.1.3 
Pygments @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/Pygments/Pygments-2.9.0 pygrgl==1.3 pylev @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/pylev/pylev-1.3.0 PyNaCl @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/PyNaCl/PyNaCl-1.4.0 pyparsing @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/pyparsing/pyparsing-2.4.7 pyrsistent @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/pyrsistent/pyrsistent-0.17.3 pysam==0.22.1 pytest @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/pytest/pytest-6.2.4 python-dateutil==2.9.0.post0 python-json-logger==2.0.7 pytoml @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/pytoml/pytoml-0.1.21 pytz @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/pytz/pytz-2021.1 PyVCF==0.6.8 PyYAML @ file:///dev/shm/PyYAML/5.4.1/GCCcore-10.3.0/PyYAML-5.4.1 pyzmq==26.2.0 referencing==0.35.1 regex @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/regex/regex-2021.4.4 requests==2.32.3 requests-oauthlib @ file:///dev/shm/TensorFlow/2.6.0/foss-2021a-CUDA-11.3.1/requestsoauthlib/requests-oauthlib-1.3.0 requests-toolbelt @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/requeststoolbelt/requests-toolbelt-0.9.1 rfc3339-validator==0.1.4 rfc3986-validator==0.1.1 rpds-py==0.21.0 rsa @ file:///dev/shm/TensorFlow/2.6.0/foss-2021a-CUDA-11.3.1/rsa/rsa-4.7.2 sacremoses==0.0.47 scandir @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/scandir/scandir-1.10.0 scipy @ file:///dev/shm/SciPybundle/2021.05/foss-2021a/scipy/scipy-1.6.3 SecretStorage @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/SecretStorage/SecretStorage-3.3.1 semantic-version @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/semantic_version/semantic_version-2.8.5 Send2Trash==1.8.3 setuptools-rust @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/setuptoolsrust/setuptools-rust-0.12.1 setuptools-scm @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/setuptools_scm/setuptools_scm-6.0.1 shellingham @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/shellingham/shellingham-1.4.0 simplegeneric @ 
file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/simplegeneric/simplegeneric-0.8.1 simplejson @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/simplejson/simplejson-3.17.2 six @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/six/six-1.16.0 sniffio==1.3.1 snowballstemmer @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/snowballstemmer/snowballstemmer-2.1.0 sortedcontainers @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/sortedcontainers/sortedcontainers-2.3.0 soupsieve==2.6 Sphinx @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/Sphinx/Sphinx-4.0.0 sphinx-bootstrap-theme @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/sphinxbootstraptheme/sphinx-bootstrap-theme-0.7.1 sphinxcontrib-applehelp @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/sphinxcontribapplehelp/sphinxcontrib-applehelp-1.0.2 sphinxcontrib-devhelp @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/sphinxcontribdevhelp/sphinxcontrib-devhelp-1.0.2 sphinxcontrib-htmlhelp @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/sphinxcontribhtmlhelp/sphinxcontrib-htmlhelp-1.0.3 sphinxcontrib-jsmath @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/sphinxcontribjsmath/sphinxcontrib-jsmath-1.0.1 sphinxcontrib-qthelp @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/sphinxcontribqthelp/sphinxcontrib-qthelp-1.0.3 sphinxcontrib-serializinghtml @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/sphinxcontribserializinghtml/sphinxcontrib-serializinghtml-1.1.4 sphinxcontrib-websupport @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/sphinxcontribwebsupport/sphinxcontrib-websupport-1.2.4 SQLAlchemy==2.0.36 sqlparse==0.5.2 stack-data==0.2.0 tabulate @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/tabulate/tabulate-0.8.9 tblib @ file:///dev/shm/TensorFlow/2.6.0/foss-2021a-CUDA-11.3.1/tblib/tblib-1.7.0 tensorboard @ file:///grid/it/data/elzar/easybuild/sources/t/TensorFlow/extensions/tensorboard-2.6.0-py3-none-any.whl tensorboard-data-server @ file:///grid/it/data/elzar/easybuild/sources/t/TensorFlow/extensions/tensorboard_data_server-0.6.1-py3-none-any.whl 
tensorboard-plugin-profile @ file:///dev/shm/TensorFlow/2.6.0/foss-2021a-CUDA-11.3.1/tensorboard_plugin_profile/tensorboard_plugin_profile-2.5.0 tensorboard-plugin-wit @ file:///grid/it/data/elzar/easybuild/sources/t/TensorFlow/extensions/tensorboard_plugin_wit-1.8.0-py3-none-any.whl tensorflow @ file:///dev/shm/TensorFlow/2.6.0/foss-2021a-CUDA-11.3.1/tensorflow-2.6.0-cp39-cp39-linux_x86_64.whl tensorflow-estimator @ file:///grid/it/data/elzar/easybuild/sources/t/TensorFlow/extensions/tensorflow_estimator-2.6.0-py2.py3-none-any.whl termcolor @ file:///dev/shm/TensorFlow/2.6.0/foss-2021a-CUDA-11.3.1/termcolor/termcolor-1.1.0 terminado==0.18.1 threadpoolctl @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/threadpoolctl/threadpoolctl-2.1.0 tinycss2==1.4.0 tokenizers==0.11.4 toml @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/toml/toml-0.10.2 tomli==2.1.0 tomlkit @ file:///grid/it/data/elzar/easybuild/sources/p/Python/extensions/tomlkit-0.7.0-py2.py3-none-any.whl toolz==0.11.2 torch==1.10.0 tornado==6.4.2 tqdm==4.62.3 traitlets==5.14.3 types-python-dateutil==2.9.0.20241003 typing-extensions @ file:///dev/shm/typingextensions/3.10.0.0/GCCcore-10.3.0/typing_extensions-3.10.0.0 ujson @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/ujson/ujson-4.0.2 uri-template==1.3.0 urllib3 @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/urllib3/urllib3-1.26.4 vcf2seq==0.7.4 virtualenv @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/virtualenv/virtualenv-20.4.6 wcwidth @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/wcwidth/wcwidth-0.2.5 webcolors==24.11.1 webencodings @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/webencodings/webencodings-0.5.1 websocket-client==1.8.0 Werkzeug @ file:///dev/shm/TensorFlow/2.6.0/foss-2021a-CUDA-11.3.1/Werkzeug/Werkzeug-2.0.1 wrapt @ file:///dev/shm/TensorFlow/2.6.0/foss-2021a-CUDA-11.3.1/wrapt/wrapt-1.12.1 xlrd @ file:///dev/shm/Python/3.9.5/GCCcore-10.3.0/xlrd/xlrd-2.0.1 zipp==3.21.0 ```
@bschilder
Author

@daler might you be able to provide some guidance on this?

@daler
Owner

daler commented Dec 3, 2024

Does the issue occur when using a smaller file? E.g., head -n 10000 from one of the files?
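
For anyone without a shell handy, here is a minimal Python sketch of the same `head -n 10000` subsetting. The function name `write_head` and the file names in the usage comment are placeholders, not part of gffutils; it only assumes the input is a plain or gzipped text file.

```python
import gzip
import itertools


def write_head(src, dst, n_lines=10_000):
    """Copy the first n_lines lines of src (plain text or .gz) to dst as plain text."""
    # Pick the opener from the file extension; both are opened in text mode.
    opener = gzip.open if str(src).endswith(".gz") else open
    with opener(src, "rt") as fin, open(dst, "w") as fout:
        # islice stops after n_lines, so the rest of a large file is never read.
        fout.writelines(itertools.islice(fin, n_lines))


# Hypothetical usage; the file names are placeholders.
# write_head("gencode.v47.annotation.gff3.gz", "tmp.gff3", 10_000)
```

Note that cutting a GFF at an arbitrary line can leave a gene without some of its children, which is fine for a smoke test but not for real analysis.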

@daler
Owner

daler commented Dec 3, 2024

Also, the merge_strategy="merge" makes it really slow to build a database. It's clear why the default of merge_strategy="error" doesn't work, for example with two different CDS entries for the same transcript in the GFF, CDS:ENST00000641515.2.

chr1    HAVANA  CDS     65565   65573   .       +       0       ID=CDS:ENST00000641515.2;Parent=ENST00000641515.2;gene_id=ENSG00000186092.7;transcript_id=ENST00000641515.2;gene_type=protein_coding;gene_name=OR4F5;transcript_type=protein_coding;transcript_name=OR4F5-201;exon_number=2;exon_id=ENSE00003813641.1;level=2;protein_id=ENSP00000493376.2;hgnc_id=HGNC:14825;tag=RNA_Seq_supported_partial,basic,Ensembl_canonical,GENCODE_Primary,MANE_Select,appris_principal_1,CCDS;ccdsid=CCDS30547.2;havana_gene=OTTHUMG00000001094.4;havana_transcript=OTTHUMT00000003223.4
chr1    HAVANA  CDS     69037   70008   .       +       0       ID=CDS:ENST00000641515.2;Parent=ENST00000641515.2;gene_id=ENSG00000186092.7;transcript_id=ENST00000641515.2;gene_type=protein_coding;gene_name=OR4F5;transcript_type=protein_coding;transcript_name=OR4F5-201;exon_number=3;exon_id=ENSE00003813949.1;level=2;protein_id=ENSP00000493376.2;hgnc_id=HGNC:14825;tag=RNA_Seq_supported_partial,basic,Ensembl_canonical,GENCODE_Primary,MANE_Select,appris_principal_1,CCDS;ccdsid=CCDS30547.2;havana_gene=OTTHUMG00000001094.4;havana_transcript=OTTHUMT00000003223.4

In such a case, it's unclear how best (or even) to merge them since they have different start/stops and are children of different exons. Since it's unclear which one should be returned if you asked for CDS:ENST00000641515.2, I think I would prefer merge_strategy="create_unique".
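
To see how widespread the repeated IDs are before choosing a merge strategy, something like the following rough sketch could help. It uses only the standard library and does not go through gffutils; `duplicate_ids` is a made-up helper name, and it assumes tab-separated GFF3 lines with an `ID=` key in the ninth column (the lines pasted above show spaces only because of how they were rendered).

```python
import re
from collections import Counter

# Matches the ID attribute at the start of the attributes column or after a ';'.
_ID_RE = re.compile(r"(?:^|;)ID=([^;]+)")


def duplicate_ids(lines):
    """Return {feature_id: count} for every ID that appears on more than one line."""
    counts = Counter()
    for line in lines:
        # Skip comments, directives, and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete 9-column feature line
        match = _ID_RE.search(fields[8])
        if match:
            counts[match.group(1)] += 1
    return {fid: n for fid, n in counts.items() if n > 1}


# Hypothetical usage; the file name is a placeholder.
# with open("tmp.gff3") as handle:
#     print(duplicate_ids(handle))
```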

I wonder if there was an issue in creating the database in your case -- out of memory, or timed out or something -- because adding the version to the database is the last thing to happen, and the error message implies that information doesn't exist. Testing with the first, say, 10k lines will help diagnose that.
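
One way to check this directly, without going through `FeatureDB`, is to run the same query shown in the traceback against the file with the standard `sqlite3` module. This is only a sketch: `has_meta_row` is a hypothetical helper name, and the only thing it assumes about the schema is the `SELECT version, dialect FROM meta` query visible in the traceback.

```python
import os
import sqlite3


def has_meta_row(dbfn):
    """Return True if the db file has a (version, dialect) row in its meta table."""
    # sqlite3.connect would silently create a missing file, so check first.
    if not os.path.exists(dbfn):
        return False
    con = sqlite3.connect(dbfn)
    try:
        row = con.execute("SELECT version, dialect FROM meta").fetchone()
    except sqlite3.DatabaseError:
        # Covers a missing meta table and files that are not SQLite databases.
        return False
    finally:
        con.close()
    return row is not None


# Hypothetical usage; the file name is a placeholder.
# print(has_meta_row("gencode.v47.annotation.db"))
```

If this returns False for an existing file, the build most likely stopped before it finished, and the file should be deleted and rebuilt rather than reused.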

@bschilder
Author

bschilder commented Dec 3, 2024

Thanks for the reply @daler

> Does the issue occur when using a smaller file? E.g., head -n 10000 from one of the files?

!head -1000 GRCh38/gencode.v47.annotation.gff3 > tmp.gff3

Using a small subset seems to work fine.

dbfn = 'tmp.db'
db = gffutils.create_db("tmp.gff3",
                        dbfn=dbfn,
                        force=False,
                        keep_order=True,
                        merge_strategy='merge',
                        sort_attribute_values=True)
db = gffutils.FeatureDB(dbfn, keep_order=True)
db
# <gffutils.interface.FeatureDB at 0x155362e9b910>

> Also, the merge_strategy="merge" makes it really slow to build a database. It's clear why the default of merge_strategy="error" doesn't work, for example with two different CDS entries for the same transcript in the GFF, CDS:ENST00000641515.2.
>
> chr1    HAVANA  CDS     65565   65573   .       +       0       ID=CDS:ENST00000641515.2;Parent=ENST00000641515.2;gene_id=ENSG00000186092.7;transcript_id=ENST00000641515.2;gene_type=protein_coding;gene_name=OR4F5;transcript_type=protein_coding;transcript_name=OR4F5-201;exon_number=2;exon_id=ENSE00003813641.1;level=2;protein_id=ENSP00000493376.2;hgnc_id=HGNC:14825;tag=RNA_Seq_supported_partial,basic,Ensembl_canonical,GENCODE_Primary,MANE_Select,appris_principal_1,CCDS;ccdsid=CCDS30547.2;havana_gene=OTTHUMG00000001094.4;havana_transcript=OTTHUMT00000003223.4
> chr1    HAVANA  CDS     69037   70008   .       +       0       ID=CDS:ENST00000641515.2;Parent=ENST00000641515.2;gene_id=ENSG00000186092.7;transcript_id=ENST00000641515.2;gene_type=protein_coding;gene_name=OR4F5;transcript_type=protein_coding;transcript_name=OR4F5-201;exon_number=3;exon_id=ENSE00003813949.1;level=2;protein_id=ENSP00000493376.2;hgnc_id=HGNC:14825;tag=RNA_Seq_supported_partial,basic,Ensembl_canonical,GENCODE_Primary,MANE_Select,appris_principal_1,CCDS;ccdsid=CCDS30547.2;havana_gene=OTTHUMG00000001094.4;havana_transcript=OTTHUMT00000003223.4
>
> In such a case, it's unclear how best (or even) to merge them since they have different start/stops and are children of different exons. Since it's unclear which one should be returned if you asked for CDS:ENST00000641515.2, I think I would prefer merge_strategy="create_unique".

OK, though I should note that I'm using the official GENCODE release annotations. Does this imply they are not following the expected standards for gff3?

I'll try merge_strategy="create_unique" with the full gff3 file and see if that helps (this will take a while).

> I wonder if there was an issue in creating the database in your case -- out of memory, or timed out or something -- because adding the version to the database is the last thing to happen, and the error message implies that information doesn't exist. Testing with the first, say, 10k lines will help diagnose that.

I don't think memory should normally be an issue; I'm running this on my interactive HPC with 16 cores and 258 GB RAM. That is, unless gffutils is processing the data in a way that leads to an explosion of memory usage.

Another possibility is that it takes so long to create the database that my interactive session times out after 12 hours (the max I can request). Though I'd hope it wouldn't take that long to process one file.
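
If a timeout is the cause, one possible guard is to build into a temporary path and rename it only once the build returns, so that an `os.path.exists(dbfn)` check never picks up a half-written database. The sketch below is just that pattern, not something gffutils does for you: `build_then_rename` and the `.partial` suffix are invented for illustration, and `build` is any callable that writes a database to the path it is given.

```python
import os


def build_then_rename(dbfn, build):
    """Run build(tmp_path), then move the result to dbfn only if build returned normally."""
    tmp = f"{dbfn}.partial"
    if os.path.exists(tmp):
        os.remove(tmp)  # leftover from an earlier interrupted run
    build(tmp)  # if this raises or the process is killed, dbfn is never created
    os.replace(tmp, dbfn)
    return dbfn


# Hypothetical usage with gffutils; file names are placeholders.
# import gffutils
# build_then_rename(
#     "gencode.v47.annotation.db",
#     lambda path: gffutils.create_db(
#         "gencode.v47.annotation.gff3",
#         dbfn=path,
#         merge_strategy="create_unique",
#     ),
# )
```

With this in place, a session that dies mid-build leaves only the `.partial` file behind, and the next run starts over instead of trying to open it.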

@bschilder
Author

bschilder commented Dec 3, 2024

FYI, the gencode.v47.annotation.gff3 file is 1.7 GB. What's the largest file you've successfully run gffutils on, @daler?

@daler
Owner

daler commented Dec 4, 2024

I use it on GENCODE files all the time; you can also leave it gzipped to save a little on space. It just so happens that the arguments you're using trigger complex behavior that in some cases can be helpful, but probably not in the general case.

The following runs in ~8 mins with under 200 MB RAM total:

gffutils.create_db(
    "gencode.gff.gz",
    dbfn="gencode_gff.db",
    merge_strategy="create_unique",
    verbose=True)

Or, for GTF,

gffutils.create_db(
    "gencode.gtf.gz",
    dbfn="gencode_gtf.db",
    merge_strategy="create_unique",
    disable_infer_transcripts=True,
    disable_infer_genes=True,
    verbose=True)

Regarding specs...

GFF expects every feature to have a unique ID (see this entry in the spec); GTF spec does not include transcript or gene features; per the spec, they are expected to be inferred from exons.

So no, GENCODE GFF and GTF files do not follow the specs, hence needing to build in detection and warning when trying to build a db from GENCODE files. But honestly, hardly anyone follows the specs . . . hence needing to build gffutils in the first place to deal with all that messiness!

For your original example, when you use the merge_strategy="merge", that triggers some rather complex behavior that involves scanning everything in the database so far to figure out what the merge candidates are. I haven't checked it, but this is probably something approaching O(n^2) complexity and I would not be surprised if spending all that effort on a GENCODE-size file took >12 hrs. In that case, I bet what happened is that the job timed out and never created the version entry, giving the original error you reported.
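
The O(n^2) guess could be checked empirically rather than taken on faith: time the build on two subsets of different sizes and fit the exponent. The following is a back-of-the-envelope sketch under the assumption that run time follows t ≈ c·n^k; both function names are made up, and the inputs are whatever timings you measure yourself.

```python
import math


def scaling_exponent(n_small, t_small, n_large, t_large):
    """Estimate k in t ~ c * n**k from two (line count, seconds) measurements."""
    return math.log(t_large / t_small) / math.log(n_large / n_small)


def extrapolate_seconds(n_small, t_small, n_target, k):
    """Project the run time for n_target lines using exponent k."""
    return t_small * (n_target / n_small) ** k


# Hypothetical usage: time create_db on two subsets, then project to the
# full file. The variables below stand for numbers you measure yourself.
# k = scaling_exponent(n1, t1, n2, t2)
# print(extrapolate_seconds(n1, t1, n_full, k) / 3600, "hours")
```

An exponent near 1 means the full file should scale linearly from the subset timing; an exponent near 2 would support the quadratic explanation for the timeout.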

Also, keep_order=True and sort_attribute_values=True are really only useful for tests, or when it's very important to retain round-trip invariance. They don't add that much work, but it's something. It's the merge stuff that's super time-consuming though.

@daler
Owner

daler commented Dec 16, 2024

Closing because I think everything is behaving as expected, but please reopen if you have any issues with the adjusted arguments.

@daler daler closed this as completed Dec 16, 2024
@bschilder
Author

Thanks so much for the detailed response @daler. Trying this again now.

Just a note, my example used gff3 format (not gff or gtf as in your examples). Not sure if this makes a difference.

@bschilder
Author

bschilder commented Dec 16, 2024

@daler, I'm still encountering the same issue as before with the gff3 file from my initial reproducible example. Namely, the function hangs indefinitely, even after modifying the arguments.

gff_fn = "GRCh38/gencode.v47.annotation.gff3"
dbfn = f"{gff_fn}.db"
db = gffutils.create_db(gff_fn,
                        dbfn=dbfn,
                        merge_strategy='create_unique')
