Extract Sentence Embeddings from Hugging Face pre-trained models.
This repo contains code for both TensorFlow and PyTorch. We can extract sentence embeddings for our dataset using any pre-trained Hugging Face model. Out-of-the-box embeddings sometimes work well and sometimes do not. If you want to train or fine-tune on your own dataset, check out sentence-transformers.
The extracted embeddings can be used for semantic similarity search, clustering, and similar tasks.
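As a rough illustration of the similarity search use case, the sketch below ranks a few corpus vectors against a query by cosine similarity. The vectors are made-up 2-dimensional stand-ins for real sentence embeddings, and `cosine_similarity_matrix` is a hypothetical helper name, not something provided by this repo.

```python
import numpy as np

def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    # Normalize each row to unit length; the clip guards against zero vectors.
    a_norm = a / np.clip(np.linalg.norm(a, axis=1, keepdims=True), 1e-12, None)
    b_norm = b / np.clip(np.linalg.norm(b, axis=1, keepdims=True), 1e-12, None)
    return a_norm @ b_norm.T

# Toy stand-ins for sentence embeddings (assumed values, for illustration only).
corpus = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query = np.array([[2.0, 0.1]])

scores = cosine_similarity_matrix(query, corpus)  # shape: (1, 3)
best = int(np.argmax(scores[0]))                  # index of the closest corpus row
print(best, scores[0])
```

With real embeddings, `corpus` would hold one row per sentence in the dataset and `query` the embedding of the search sentence.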
- tensorflow 2.0.0
- pytorch 1.6.0
- transformers 3.0.2
The code works in the following way:
- Load the model and its respective tokenizer.
- Tokenize our sentences.
- Get the token embeddings.
- Convert the token embeddings into a single sentence embedding [1].
[1] There are many techniques for converting token embeddings into sentence embeddings, but mean pooling is considered the state of the art (SOTA).
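The mean pooling step can be sketched as follows. This is a minimal NumPy version under assumed shapes, not the repo's own code: `token_embeddings` is taken to be `(batch, seq_len, hidden)` and `attention_mask` to be `(batch, seq_len)` with 1 for real tokens and 0 for padding. The function name `mean_pool` and the toy values are hypothetical.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sentence, ignoring padded positions."""
    token_embeddings = np.asarray(token_embeddings, dtype=np.float64)
    # Add a trailing axis so the mask broadcasts over the hidden dimension.
    mask = np.asarray(attention_mask, dtype=np.float64)[..., None]
    summed = (token_embeddings * mask).sum(axis=1)  # (batch, hidden)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

# Two toy "sentences" with 3 token slots and hidden size 2.
# The last slot of the first sentence is padding and must not affect the mean.
token_embeddings = np.array([
    [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],
    [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]],
])
attention_mask = np.array([[1, 1, 0],
                           [1, 1, 1]])

sentence_embeddings = mean_pool(token_embeddings, attention_mask)
print(sentence_embeddings)  # one vector per sentence, shape (2, 2)
```

Weighting by the attention mask matters because padded positions would otherwise drag the average toward whatever values the model emits for padding tokens.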
Benchmarks using SentEval are coming soon.
This repo is inspired by sentence-transformers. The PyTorch code is taken from their repo.