The doc_enc library computes cross-lingual vector representations (embeddings) of long texts for use in information retrieval and classification tasks.

Full documentation in Russian

Quick start:

First, download a small dataset of documents and a pre-trained model:

curl -O http://dn11.isa.ru:8080/doc-enc-data/datasets.docs.mini.v1.tar.gz
tar xf datasets.docs.mini.v1.tar.gz
find docs-mini/texts/ -name "*.txt" > files.txt
curl -O http://dn11.isa.ru:8080/doc-enc-data/models.def.pt
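
On systems without the find utility, the file list can be built in Python instead. The following is a minimal sketch, assuming the archive unpacks to docs-mini/texts/ as in the commands above; the helper name write_file_list is hypothetical and not part of doc_enc.

```python
from pathlib import Path


def write_file_list(texts_dir, out_path):
    """Write the paths of all .txt files under texts_dir to out_path, one per line."""
    # rglob walks the directory tree recursively, like `find -name "*.txt"`.
    paths = sorted(str(p) for p in Path(texts_dir).rglob("*.txt"))
    Path(out_path).write_text("".join(p + "\n" for p in paths), encoding="utf-8")
    return paths


# Usage (paths taken from the commands above):
# write_file_list("docs-mini/texts", "files.txt")
```

Unlike find, this sketch sorts the paths, so the order of entries in files.txt is reproducible between runs.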

If the links above do not work, the data can also be downloaded from the alternative link.

To start the conversion process, it is convenient to use a pre-built Docker image. Before doing so, make sure that the NVIDIA Container Toolkit is installed by following the NVIDIA Container Toolkit installation guide.

docker run --gpus=1 --rm -v "$(pwd)":/temp/ -w /temp \
  semvectors/doc_enc:0.1.2 \
  docenccli docs -i /temp/files.txt -o /temp/vecs -m /temp/models.def.pt

The vectors are saved to the vecs directory with the numpy.savez function, together with the corresponding file names. The example below loads the vectors from one of these files:

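import numpy as np

# 'ids' identifies the documents; 'embs' holds their vectors.
obj = np.load('vecs/0000.npz')
print(obj['ids'][:2])   # first two identifiers
print(obj['embs'][:2])  # first two embeddings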
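
Once loaded, the vectors can be used for a simple retrieval experiment. The sketch below is not part of doc_enc: the helpers load_vectors and most_similar are hypothetical names, and it assumes that every .npz file in vecs contains the 'ids' and 'embs' arrays shown above, that the arrays load without pickling, and that cosine similarity is a reasonable way to compare these embeddings.

```python
import glob
import os

import numpy as np


def load_vectors(vecs_dir):
    """Concatenate the 'ids' and 'embs' arrays from every .npz file in vecs_dir."""
    ids, embs = [], []
    for path in sorted(glob.glob(os.path.join(vecs_dir, "*.npz"))):
        with np.load(path) as obj:
            ids.append(obj["ids"])
            embs.append(obj["embs"])
    if not ids:
        raise FileNotFoundError(f"no .npz files found in {vecs_dir}")
    return np.concatenate(ids), np.concatenate(embs)


def most_similar(ids, embs, query_index, k=5):
    """Return the k documents closest to embs[query_index] by cosine similarity."""
    # Normalize rows so that a dot product equals cosine similarity.
    norms = np.linalg.norm(embs, axis=1, keepdims=True)
    unit = embs / np.clip(norms, 1e-12, None)
    scores = unit @ unit[query_index]
    scores[query_index] = -np.inf  # exclude the query document itself
    top = np.argsort(-scores)[:k]
    return [(str(ids[i]), float(scores[i])) for i in top]


# Usage:
# ids, embs = load_vectors("vecs")
# for doc_id, score in most_similar(ids, embs, query_index=0, k=3):
#     print(doc_id, round(score, 3))
```

This brute-force comparison is adequate for the small sample dataset; a larger collection would likely call for an approximate nearest-neighbor index.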

