
Commit e28ec8a

omrylcn and qgallouedec authored

Fix sentence-transformers examples links (#2843)

* Update matryoshka.md
* Other link fixes

Co-authored-by: Quentin Gallouédec <[email protected]>
1 parent e83855a commit e28ec8a

File tree: 5 files changed, +24 -24 lines changed


embedding-quantization.md

Lines changed: 5 additions & 5 deletions
@@ -340,13 +340,13 @@ The following [demo](https://huggingface.co/spaces/sentence-transformers/quantiz
The following scripts can be used to experiment with embedding quantization for retrieval & beyond. There are three categories:

* **Recommended Retrieval**:
- * [semantic_search_recommended.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/embedding-quantization/semantic_search_recommended.py): This script combines binary search with scalar rescoring, much like the above demo, for cheap, efficient, and performant retrieval.
+ * [semantic_search_recommended.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/sentence_transformer/applications/embedding-quantization/semantic_search_recommended.py): This script combines binary search with scalar rescoring, much like the above demo, for cheap, efficient, and performant retrieval.
* **Usage**:
- * [semantic_search_faiss.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/embedding-quantization/semantic_search_faiss.py): This script showcases regular usage of binary or scalar quantization, retrieval, and rescoring using FAISS, by using the <a href="https://sbert.net/docs/package_reference/quantization.html#sentence_transformers.quantization.semantic_search_faiss"><code>semantic_search_faiss</code></a> utility function.
- * [semantic_search_usearch.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/embedding-quantization/semantic_search_usearch.py): This script showcases regular usage of binary or scalar quantization, retrieval, and rescoring using USearch, by using the <a href="https://sbert.net/docs/package_reference/quantization.html#sentence_transformers.quantization.semantic_search_usearch"><code>semantic_search_usearch</code></a> utility function.
+ * [semantic_search_faiss.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/sentence_transformer/applications/embedding-quantization/semantic_search_faiss.py): This script showcases regular usage of binary or scalar quantization, retrieval, and rescoring using FAISS, by using the <a href="https://sbert.net/docs/package_reference/quantization.html#sentence_transformers.quantization.semantic_search_faiss"><code>semantic_search_faiss</code></a> utility function.
+ * [semantic_search_usearch.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/sentence_transformer/applications/embedding-quantization/semantic_search_usearch.py): This script showcases regular usage of binary or scalar quantization, retrieval, and rescoring using USearch, by using the <a href="https://sbert.net/docs/package_reference/quantization.html#sentence_transformers.quantization.semantic_search_usearch"><code>semantic_search_usearch</code></a> utility function.
* **Benchmarks**:
- * [semantic_search_faiss_benchmark.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/embedding-quantization/semantic_search_faiss_benchmark.py): This script includes a retrieval speed benchmark of `float32` retrieval, binary retrieval + rescoring, and scalar retrieval + rescoring, using FAISS. It uses the <a href="https://sbert.net/docs/package_reference/quantization.html#sentence_transformers.quantization.semantic_search_faiss"><code>semantic_search_faiss</code></a> utility function. Our benchmarks especially show speedups for `ubinary`.
- * [semantic_search_usearch_benchmark.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/embedding-quantization/semantic_search_usearch_benchmark.py): This script includes a retrieval speed benchmark of `float32` retrieval, binary retrieval + rescoring, and scalar retrieval + rescoring, using USearch. It uses the <a href="https://sbert.net/docs/package_reference/quantization.html#sentence_transformers.quantization.semantic_search_usearch"><code>semantic_search_usearch</code></a> utility function. Our experiments show large speedups on newer hardware, particularly for `int8`.
+ * [semantic_search_faiss_benchmark.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/sentence_transformer/applications/embedding-quantization/semantic_search_faiss_benchmark.py): This script includes a retrieval speed benchmark of `float32` retrieval, binary retrieval + rescoring, and scalar retrieval + rescoring, using FAISS. It uses the <a href="https://sbert.net/docs/package_reference/quantization.html#sentence_transformers.quantization.semantic_search_faiss"><code>semantic_search_faiss</code></a> utility function. Our benchmarks especially show speedups for `ubinary`.
+ * [semantic_search_usearch_benchmark.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/sentence_transformer/applications/embedding-quantization/semantic_search_usearch_benchmark.py): This script includes a retrieval speed benchmark of `float32` retrieval, binary retrieval + rescoring, and scalar retrieval + rescoring, using USearch. It uses the <a href="https://sbert.net/docs/package_reference/quantization.html#sentence_transformers.quantization.semantic_search_usearch"><code>semantic_search_usearch</code></a> utility function. Our experiments show large speedups on newer hardware, particularly for `int8`.
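
For orientation, here is a minimal sketch of the two-stage idea these scripts implement: a cheap binary first stage followed by rescoring with higher-precision embeddings. The toy corpus and the `all-MiniLM-L6-v2` model are chosen purely for illustration.

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sentence_transformers.quantization import quantize_embeddings

# Toy corpus and example model, for illustration only.
model = SentenceTransformer("all-MiniLM-L6-v2")
corpus = [
    "A man is eating food.",
    "A man is riding a horse.",
    "A cheetah is chasing its prey.",
]
query = "Someone on horseback"

corpus_emb = model.encode(corpus, normalize_embeddings=True)
query_emb = model.encode([query], normalize_embeddings=True)

# Stage 1: pack each embedding into bits ("ubinary") and rank by Hamming distance.
corpus_ubin = quantize_embeddings(corpus_emb, precision="ubinary")
query_ubin = quantize_embeddings(query_emb, precision="ubinary")
hamming = np.unpackbits(corpus_ubin ^ query_ubin, axis=1).sum(axis=1)
candidates = np.argsort(hamming)[:2]  # keep top_k * rescore_multiplier candidates

# Stage 2: rescore only the candidates with the full-precision query embedding.
scores = corpus_emb[candidates] @ query_emb[0]
for i in np.argsort(-scores):
    print(f"{scores[i]:.4f}  {corpus[candidates[i]]}")
```

The linked `semantic_search_faiss` and `semantic_search_usearch` utilities wrap this same pattern behind a FAISS or USearch index; see the API reference links above for their exact parameters.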

### Future work

matryoshka.md

Lines changed: 7 additions & 7 deletions
@@ -111,9 +111,9 @@ References:

See the following complete scripts as examples of how to apply the `MatryoshkaLoss` in practice:

- * **[matryoshka_nli.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/matryoshka/matryoshka_nli.py)**: This example uses the `MultipleNegativesRankingLoss` with `MatryoshkaLoss` to train a strong embedding model using Natural Language Inference (NLI) data. It is an adaptation of the [NLI](../nli/README) documentation.
- * **[matryoshka_nli_reduced_dim.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/matryoshka/matryoshka_nli_reduced_dim.py)**: This example uses the `MultipleNegativesRankingLoss` with `MatryoshkaLoss` to train a strong embedding model with a small maximum output dimension of 256. It trains using Natural Language Inference (NLI) data, and is an adaptation of the [NLI](../nli/README) documentation.
- * **[matryoshka_sts.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/matryoshka/matryoshka_sts.py)**: This example uses the `CoSENTLoss` with `MatryoshkaLoss` to train an embedding model on the training set of the `STSBenchmark` dataset. It is an adaptation of the [STS](../sts/README) documentation.
+ * **[matryoshka_nli.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/sentence_transformer/training/matryoshka/matryoshka_nli.py)**: This example uses the `MultipleNegativesRankingLoss` with `MatryoshkaLoss` to train a strong embedding model using Natural Language Inference (NLI) data. It is an adaptation of the [NLI](../nli/README) documentation.
+ * **[matryoshka_nli_reduced_dim.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/sentence_transformer/training/matryoshka/matryoshka_nli_reduced_dim.py)**: This example uses the `MultipleNegativesRankingLoss` with `MatryoshkaLoss` to train a strong embedding model with a small maximum output dimension of 256. It trains using Natural Language Inference (NLI) data, and is an adaptation of the [NLI](../nli/README) documentation.
+ * **[matryoshka_sts.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/sentence_transformer/training/matryoshka/matryoshka_sts.py)**: This example uses the `CoSENTLoss` with `MatryoshkaLoss` to train an embedding model on the training set of the `STSBenchmark` dataset. It is an adaptation of the [STS](../sts/README) documentation.
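
For orientation, the core pattern these scripts share is to wrap an inner loss in `MatryoshkaLoss`. The sketch below assumes `microsoft/mpnet-base` as the base model and uses illustrative Matryoshka dimensions; it is not a substitute for the full scripts.

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

model = SentenceTransformer("microsoft/mpnet-base")

# The inner loss is re-applied at every listed dimension, so the leading
# entries of each embedding stay useful after truncation.
inner_loss = MultipleNegativesRankingLoss(model)
loss = MatryoshkaLoss(model, inner_loss, matryoshka_dims=[768, 512, 256, 128, 64])

# `loss` is then passed to the usual Sentence Transformers training entry point
# together with (anchor, positive) style data, as the linked scripts do.
```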

## How do I use 🪆 Matryoshka Embedding models?

@@ -129,7 +129,7 @@ Keep in mind that although processing smaller embeddings for downstream tasks (r

In Sentence Transformers, you can load a Matryoshka Embedding model just like any other model, but you can specify the desired embedding size using the `truncate_dim` argument. After that, you can perform inference using the [`SentenceTransformer.encode`](https://sbert.net/docs/package_reference/SentenceTransformer.html#sentence_transformers.SentenceTransformer.encode) function, and the embeddings will be automatically truncated to the specified size.

- Let's try to use a model that I trained using [`matryoshka_nli.py`](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/matryoshka/matryoshka_nli.py) with [`microsoft/mpnet-base`](https://huggingface.co/microsoft/mpnet-base):
+ Let's try to use a model that I trained using [`matryoshka_nli.py`](https://github.com/UKPLab/sentence-transformers/blob/master/examples/sentence_transformer/training/matryoshka/matryoshka_nli.py) with [`microsoft/mpnet-base`](https://huggingface.co/microsoft/mpnet-base):

```python
from sentence_transformers import SentenceTransformer
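# The diff cuts the original snippet off at this import; the rest below is a
# hedged reconstruction of the usage described above (truncate_dim at load
# time, then a normal encode call), with illustrative sentences.
from sentence_transformers.util import cos_sim

matryoshka_dim = 64  # any size up to the model's full output dimension

model = SentenceTransformer("tomaarsen/mpnet-base-nli-matryoshka", truncate_dim=matryoshka_dim)
embeddings = model.encode(
    [
        "The weather is so nice!",
        "It's so sunny outside!",
        "He drove to the stadium.",
    ]
)
print(embeddings.shape)  # (3, 64): already truncated to matryoshka_dim

# Cosine similarity of the first sentence to the other two
similarities = cos_sim(embeddings[0], embeddings[1:])
print(similarities)
```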
@@ -202,8 +202,8 @@ similarities = cos_sim(embeddings[0], embeddings[1:])

Now that Matryoshka models have been introduced, let's look at the actual performance that we may be able to expect from a Matryoshka embedding model versus a regular embedding model. For this experiment, I have trained two models:

- * [tomaarsen/mpnet-base-nli-matryoshka](https://huggingface.co/tomaarsen/mpnet-base-nli-matryoshka): Trained by running [`matryoshka_nli.py`](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/matryoshka/matryoshka_nli.py) with [`microsoft/mpnet-base`](https://huggingface.co/microsoft/mpnet-base).
- * [tomaarsen/mpnet-base-nli](https://huggingface.co/tomaarsen/mpnet-base-nli): Trained by running a modified version of [`matryoshka_nli.py`](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/matryoshka/matryoshka_nli.py) where the training loss is only `MultipleNegativesRankingLoss` rather than `MatryoshkaLoss` on top of `MultipleNegativesRankingLoss`. I also use [`microsoft/mpnet-base`](https://huggingface.co/microsoft/mpnet-base) as the base model.
+ * [tomaarsen/mpnet-base-nli-matryoshka](https://huggingface.co/tomaarsen/mpnet-base-nli-matryoshka): Trained by running [`matryoshka_nli.py`](https://github.com/UKPLab/sentence-transformers/blob/master/examples/sentence_transformer/training/matryoshka/matryoshka_nli.py) with [`microsoft/mpnet-base`](https://huggingface.co/microsoft/mpnet-base).
+ * [tomaarsen/mpnet-base-nli](https://huggingface.co/tomaarsen/mpnet-base-nli): Trained by running a modified version of [`matryoshka_nli.py`](https://github.com/UKPLab/sentence-transformers/blob/master/examples/sentence_transformer/training/matryoshka/matryoshka_nli.py) where the training loss is only `MultipleNegativesRankingLoss` rather than `MatryoshkaLoss` on top of `MultipleNegativesRankingLoss`. I also use [`microsoft/mpnet-base`](https://huggingface.co/microsoft/mpnet-base) as the base model.

Both of these models were trained on the AllNLI dataset, which is a concatenation of the [SNLI](https://huggingface.co/datasets/snli) and [MultiNLI](https://huggingface.co/datasets/multi_nli) datasets. I have evaluated these models on the [STSBenchmark](https://huggingface.co/datasets/mteb/stsbenchmark-sts) test set using multiple different embedding dimensions. The results are plotted in the following figure:

@@ -231,4 +231,4 @@ In this demo, you can dynamically shrink the output dimensions of the [`nomic-ai
* Kusupati, A., Bhatt, G., Rege, A., Wallingford, M., Sinha, A., Ramanujan, V., ... & Farhadi, A. (2022). Matryoshka representation learning. Advances in Neural Information Processing Systems, 35, 30233-30249. https://arxiv.org/abs/2205.13147
* Matryoshka Embeddings — Sentence-Transformers documentation. (n.d.). https://sbert.net/examples/training/matryoshka/README.html
* UKPLab. (n.d.). GitHub. https://github.com/UKPLab/sentence-transformers
- * Unboxing Nomic Embed v1.5: Resizable Production Embeddings with Matryoshka Representation Learning. (n.d.). https://blog.nomic.ai/posts/nomic-embed-matryoshka
+ * Unboxing Nomic Embed v1.5: Resizable Production Embeddings with Matryoshka Representation Learning. (n.d.). https://blog.nomic.ai/posts/nomic-embed-matryoshka

static-embeddings.md

Lines changed: 1 addition & 1 deletion
@@ -1327,7 +1327,7 @@ Additionally, there are quite a few possible extensions that are likely to impro
2. Model Souping: Combining weights from multiple models trained in the same way with different seeds or data distributions.
3. Curriculum Learning: Train on examples of increasing difficulties.
4. [Guided False In-Batch Negatives Filtering](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#sentence_transformers.losses.CachedGISTEmbedLoss): Exclude false negatives via an efficient pre-trained embedding model.
- 5. [Seed Optimization for the Random Weight Initialization](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/data_augmentation/train_sts_seed_optimization.py): Train the first steps with various seeds to find one with a useful weight initialization.
+ 5. [Seed Optimization for the Random Weight Initialization](https://github.com/UKPLab/sentence-transformers/blob/master/examples/sentence_transformer/training/data_augmentation/train_sts_seed_optimization.py): Train the first steps with various seeds to find one with a useful weight initialization.
6. Tokenizer Retraining: Retrain a tokenizer with modern texts and learnings.
7. Gradient Caching: Applying GradCache via [`CachedMultipleNegativesRankingLoss`](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#sentence_transformers.losses.CachedMultipleNegativesRankingLoss) allows for larger batches, which often result in superior performance.
8. [Model Distillation](https://sbert.net/examples/training/distillation/README.html): Rather than training exclusively using supervised training data, we can also feed unsupervised data through a larger embedding model and distil those embeddings into the static embedding-based student model.
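
As a rough illustration of item 2 (Model Souping), the snippet below averages the parameters of identically trained runs; the checkpoint paths are hypothetical.

```python
import torch
from sentence_transformers import SentenceTransformer

# Hypothetical checkpoints: the same recipe trained with different random seeds.
checkpoints = ["runs/static-seed0", "runs/static-seed1", "runs/static-seed2"]
state_dicts = [SentenceTransformer(path).state_dict() for path in checkpoints]

# Average every tensor across runs, keeping each tensor's original dtype
# (integer buffers such as position ids stay integers).
souped_state = {
    key: torch.stack([sd[key].float() for sd in state_dicts])
    .mean(dim=0)
    .to(state_dicts[0][key].dtype)
    for key in state_dicts[0]
}

souped = SentenceTransformer(checkpoints[0])
souped.load_state_dict(souped_state)
souped.save("runs/static-souped")
```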

zh/embedding-quantization.md

Lines changed: 5 additions & 5 deletions
@@ -339,17 +339,17 @@ int8

- **Recommended Retrieval**:

- - [semantic_search_recommended.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/embedding-quantization/semantic_search_recommended.py): This script combines binary search with scalar rescoring, much like the demo above, for cheap, efficient, and performant retrieval.
+ - [semantic_search_recommended.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/sentence_transformer/applications/embedding-quantization/semantic_search_recommended.py): This script combines binary search with scalar rescoring, much like the demo above, for cheap, efficient, and performant retrieval.

- **Usage**:

- - [semantic_search_faiss.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/embedding-quantization/semantic_search_faiss.py): This script showcases regular binary or scalar quantization, retrieval, and rescoring with FAISS, via the <a href="https://sbert.net/docs/package_reference/quantization.html#sentence_transformers.quantization.semantic_search_faiss"><code>semantic_search_faiss</code></a> utility function.
- - [semantic_search_usearch.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/embedding-quantization/semantic_search_usearch.py): This script showcases regular binary or scalar quantization, retrieval, and rescoring with USearch, via the <a href="https://sbert.net/docs/package_reference/quantization.html#sentence_transformers.quantization.semantic_search_usearch"><code>semantic_search_usearch</code></a> utility function.
+ - [semantic_search_faiss.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/sentence_transformer/applications/embedding-quantization/semantic_search_faiss.py): This script showcases regular binary or scalar quantization, retrieval, and rescoring with FAISS, via the <a href="https://sbert.net/docs/package_reference/quantization.html#sentence_transformers.quantization.semantic_search_faiss"><code>semantic_search_faiss</code></a> utility function.
+ - [semantic_search_usearch.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/sentence_transformer/applications/embedding-quantization/semantic_search_usearch.py): This script showcases regular binary or scalar quantization, retrieval, and rescoring with USearch, via the <a href="https://sbert.net/docs/package_reference/quantization.html#sentence_transformers.quantization.semantic_search_usearch"><code>semantic_search_usearch</code></a> utility function.

- **Benchmarks**:

- - [semantic_search_faiss_benchmark.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/embedding-quantization/semantic_search_faiss_benchmark.py): This script includes a retrieval speed benchmark of `float32` retrieval, binary retrieval + rescoring, and scalar retrieval + rescoring using FAISS. It uses the <a href="https://sbert.net/docs/package_reference/quantization.html#sentence_transformers.quantization.semantic_search_faiss"><code>semantic_search_faiss</code></a> utility function. Our benchmarks especially show speedups for `ubinary`.
- - [semantic_search_usearch_benchmark.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/embedding-quantization/semantic_search_usearch_benchmark.py): This script includes a retrieval speed benchmark of `float32` retrieval, binary retrieval + rescoring, and scalar retrieval + rescoring using USearch. It uses the <a href="https://sbert.net/docs/package_reference/quantization.html#sentence_transformers.quantization.semantic_search_usearch"><code>semantic_search_usearch</code></a> utility function. Our experiments show large speedups on newer hardware, particularly for `int8`.
+ - [semantic_search_faiss_benchmark.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/sentence_transformer/applications/embedding-quantization/semantic_search_faiss_benchmark.py): This script includes a retrieval speed benchmark of `float32` retrieval, binary retrieval + rescoring, and scalar retrieval + rescoring using FAISS. It uses the <a href="https://sbert.net/docs/package_reference/quantization.html#sentence_transformers.quantization.semantic_search_faiss"><code>semantic_search_faiss</code></a> utility function. Our benchmarks especially show speedups for `ubinary`.
+ - [semantic_search_usearch_benchmark.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/sentence_transformer/applications/embedding-quantization/semantic_search_usearch_benchmark.py): This script includes a retrieval speed benchmark of `float32` retrieval, binary retrieval + rescoring, and scalar retrieval + rescoring using USearch. It uses the <a href="https://sbert.net/docs/package_reference/quantization.html#sentence_transformers.quantization.semantic_search_usearch"><code>semantic_search_usearch</code></a> utility function. Our experiments show large speedups on newer hardware, particularly for `int8`.

### Future work
