diff --git a/embedding-quantization.md b/embedding-quantization.md
index 1fbd09d227..45e764d339 100644
--- a/embedding-quantization.md
+++ b/embedding-quantization.md
@@ -340,13 +340,13 @@ The following [demo](https://huggingface.co/spaces/sentence-transformers/quantiz
 The following scripts can be used to experiment with embedding quantization for retrieval & beyond. There are three categories:
 
 * **Recommended Retrieval**:
-  * [semantic_search_recommended.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/embedding-quantization/semantic_search_recommended.py): This script combines binary search with scalar rescoring, much like the above demo, for cheap, efficient, and performant retrieval.
+  * [semantic_search_recommended.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/sentence_transformer/applications/embedding-quantization/semantic_search_recommended.py): This script combines binary search with scalar rescoring, much like the above demo, for cheap, efficient, and performant retrieval.
 * **Usage**:
-  * [semantic_search_faiss.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/embedding-quantization/semantic_search_faiss.py): This script showcases regular usage of binary or scalar quantization, retrieval, and rescoring using FAISS, by using the semantic_search_faiss utility function.
-  * [semantic_search_usearch.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/embedding-quantization/semantic_search_usearch.py): This script showcases regular usage of binary or scalar quantization, retrieval, and rescoring using USearch, by using the semantic_search_usearch utility function.
+  * [semantic_search_faiss.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/sentence_transformer/applications/embedding-quantization/semantic_search_faiss.py): This script showcases regular usage of binary or scalar quantization, retrieval, and rescoring using FAISS, by using the semantic_search_faiss utility function.
+  * [semantic_search_usearch.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/sentence_transformer/applications/embedding-quantization/semantic_search_usearch.py): This script showcases regular usage of binary or scalar quantization, retrieval, and rescoring using USearch, by using the semantic_search_usearch utility function.
 * **Benchmarks**:
-  * [semantic_search_faiss_benchmark.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/embedding-quantization/semantic_search_faiss_benchmark.py): This script includes a retrieval speed benchmark of `float32` retrieval, binary retrieval + rescoring, and scalar retrieval + rescoring, using FAISS. It uses the semantic_search_faiss utility function. Our benchmarks especially show show speedups for `ubinary`.
-  * [semantic_search_usearch_benchmark.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/embedding-quantization/semantic_search_usearch_benchmark.py): This script includes a retrieval speed benchmark of `float32` retrieval, binary retrieval + rescoring, and scalar retrieval + rescoring, using USearch. It uses the semantic_search_usearch utility function. Our experiments show large speedups on newer hardware, particularly for `int8`.
+  * [semantic_search_faiss_benchmark.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/sentence_transformer/applications/embedding-quantization/semantic_search_faiss_benchmark.py): This script includes a retrieval speed benchmark of `float32` retrieval, binary retrieval + rescoring, and scalar retrieval + rescoring, using FAISS. It uses the semantic_search_faiss utility function. Our benchmarks especially show speedups for `ubinary`.
+  * [semantic_search_usearch_benchmark.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/sentence_transformer/applications/embedding-quantization/semantic_search_usearch_benchmark.py): This script includes a retrieval speed benchmark of `float32` retrieval, binary retrieval + rescoring, and scalar retrieval + rescoring, using USearch. It uses the semantic_search_usearch utility function. Our experiments show large speedups on newer hardware, particularly for `int8`.
 
 ### Future work
 
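For a rough sense of what the scripts above wrap, here is a minimal sketch of quantizing embeddings with the `quantize_embeddings` helper. It is a sketch only: the corpus sentences are placeholders, the printed shapes assume the 1024-dimensional `mixedbread-ai/mxbai-embed-large-v1` model, and it assumes a sentence-transformers version that ships the quantization helpers (v2.6 or later).

```python
# Minimal sketch, assuming sentence-transformers >= 2.6 with the quantization helpers.
from sentence_transformers import SentenceTransformer
from sentence_transformers.quantization import quantize_embeddings

# Any Sentence Transformers embedding model works the same way.
model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

# Placeholder corpus for illustration.
corpus = [
    "Embedding quantization shrinks vectors to save memory.",
    "Binary embeddings can be rescored with int8 or float32 embeddings.",
]
embeddings = model.encode(corpus, normalize_embeddings=True)

# Pack each dimension into a single bit (uint8-packed), as used for Hamming-distance retrieval.
binary_embeddings = quantize_embeddings(embeddings, precision="ubinary")

# Map each float32 dimension to one signed byte; ranges are calibrated from the given embeddings here.
int8_embeddings = quantize_embeddings(embeddings, precision="int8")

print(embeddings.shape, embeddings.dtype)                  # e.g. (2, 1024) float32
print(binary_embeddings.shape, binary_embeddings.dtype)    # e.g. (2, 128) uint8
print(int8_embeddings.shape, int8_embeddings.dtype)        # e.g. (2, 1024) int8
```

In practice, a larger calibration set than the corpus itself is preferable for `int8` ranges, which is exactly what the benchmark scripts above exercise.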
diff --git a/matryoshka.md b/matryoshka.md
index d6c9fa98fc..9ca0339a40 100644
--- a/matryoshka.md
+++ b/matryoshka.md
@@ -111,9 +111,9 @@ References:
 
 See the following complete scripts as examples of how to apply the `MatryoshkaLoss` in practice:
 
-* **[matryoshka_nli.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/matryoshka/matryoshka_nli.py)**: This example uses the `MultipleNegativesRankingLoss` with `MatryoshkaLoss` to train a strong embedding model using Natural Language Inference (NLI) data. It is an adaptation of the [NLI](../nli/README) documentation.
-* **[matryoshka_nli_reduced_dim.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/matryoshka/matryoshka_nli_reduced_dim.py)**: This example uses the `MultipleNegativesRankingLoss` with `MatryoshkaLoss` to train a strong embedding model with a small maximum output dimension of 256. It trains using Natural Language Inference (NLI) data, and is an adaptation of the [NLI](../nli/README) documentation.
-* **[matryoshka_sts.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/matryoshka/matryoshka_sts.py)**: This example uses the `CoSENTLoss` with `MatryoshkaLoss` to train an embedding model on the training set of the `STSBenchmark` dataset. It is an adaptation of the [STS](../sts/README) documentation.
+* **[matryoshka_nli.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/sentence_transformer/training/matryoshka/matryoshka_nli.py)**: This example uses the `MultipleNegativesRankingLoss` with `MatryoshkaLoss` to train a strong embedding model using Natural Language Inference (NLI) data. It is an adaptation of the [NLI](../nli/README) documentation.
+* **[matryoshka_nli_reduced_dim.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/sentence_transformer/training/matryoshka/matryoshka_nli_reduced_dim.py)**: This example uses the `MultipleNegativesRankingLoss` with `MatryoshkaLoss` to train a strong embedding model with a small maximum output dimension of 256. It trains using Natural Language Inference (NLI) data, and is an adaptation of the [NLI](../nli/README) documentation.
+* **[matryoshka_sts.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/sentence_transformer/training/matryoshka/matryoshka_sts.py)**: This example uses the `CoSENTLoss` with `MatryoshkaLoss` to train an embedding model on the training set of the `STSBenchmark` dataset. It is an adaptation of the [STS](../sts/README) documentation.
 
 ## How do I use 🪆 Matryoshka Embedding models?
 
@@ -129,7 +129,7 @@ Keep in mind that although processing smaller embeddings for downstream tasks (r
 
 In Sentence Transformers, you can load a Matryoshka Embedding model just like any other model, but you can specify the desired embedding size using the `truncate_dim` argument. After that, you can perform inference using the [`SentenceTransformers.encode`](https://sbert.net/docs/package_reference/SentenceTransformer.html#sentence_transformers.SentenceTransformer.encode) function, and the embeddings will be automatically truncated to the specified size.
 
-Let's try to use a model that I trained using [`matryoshka_nli.py`](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/matryoshka/matryoshka_nli.py) with [`microsoft/mpnet-base`](https://huggingface.co/microsoft/mpnet-base):
+Let's try to use a model that I trained using [`matryoshka_nli.py`](https://github.com/UKPLab/sentence-transformers/blob/master/examples/sentence_transformer/training/matryoshka/matryoshka_nli.py) with [`microsoft/mpnet-base`](https://huggingface.co/microsoft/mpnet-base):
 
 ```python
 from sentence_transformers import SentenceTransformer
@@ -202,8 +202,8 @@ similarities = cos_sim(embeddings[0], embeddings[1:])
 
 Now that Matryoshka models have been introduced, let's look at the actual performance that we may be able to expect from a Matryoshka embedding model versus a regular embedding model. For this experiment, I have trained two models:
 
-* [tomaarsen/mpnet-base-nli-matryoshka](https://huggingface.co/tomaarsen/mpnet-base-nli-matryoshka): Trained by running [`matryoshka_nli.py`](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/matryoshka/matryoshka_nli.py) with [`microsoft/mpnet-base`](https://huggingface.co/microsoft/mpnet-base).
-* [tomaarsen/mpnet-base-nli](https://huggingface.co/tomaarsen/mpnet-base-nli): Trained by running a modified version of [`matryoshka_nli.py`](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/matryoshka/matryoshka_nli.py) where the training loss is only `MultipleNegativesRankingLoss` rather than `MatryoshkaLoss` on top of `MultipleNegativesRankingLoss`. I also use [`microsoft/mpnet-base`](https://huggingface.co/microsoft/mpnet-base) as the base model.
+* [tomaarsen/mpnet-base-nli-matryoshka](https://huggingface.co/tomaarsen/mpnet-base-nli-matryoshka): Trained by running [`matryoshka_nli.py`](https://github.com/UKPLab/sentence-transformers/blob/master/examples/sentence_transformer/training/matryoshka/matryoshka_nli.py) with [`microsoft/mpnet-base`](https://huggingface.co/microsoft/mpnet-base).
+* [tomaarsen/mpnet-base-nli](https://huggingface.co/tomaarsen/mpnet-base-nli): Trained by running a modified version of [`matryoshka_nli.py`](https://github.com/UKPLab/sentence-transformers/blob/master/examples/sentence_transformer/training/matryoshka/matryoshka_nli.py) where the training loss is only `MultipleNegativesRankingLoss` rather than `MatryoshkaLoss` on top of `MultipleNegativesRankingLoss`. I also use [`microsoft/mpnet-base`](https://huggingface.co/microsoft/mpnet-base) as the base model.
 
 Both of these models were trained on the AllNLI dataset, which is a concatenation of the [SNLI](https://huggingface.co/datasets/snli) and [MultiNLI](https://huggingface.co/datasets/multi_nli) datasets. I have evaluated these models on the [STSBenchmark](https://huggingface.co/datasets/mteb/stsbenchmark-sts) test set using multiple different embedding dimensions.
 The results are plotted in the following figure:
@@ -231,4 +231,4 @@ In this demo, you can dynamically shrink the output dimensions of the [`nomic-ai
 * Kusupati, A., Bhatt, G., Rege, A., Wallingford, M., Sinha, A., Ramanujan, V., ... & Farhadi, A. (2022). Matryoshka representation learning. Advances in Neural Information Processing Systems, 35, 30233-30249. https://arxiv.org/abs/2205.13147
 * Matryoshka Embeddings — Sentence-Transformers documentation. (n.d.). https://sbert.net/examples/training/matryoshka/README.html
 * UKPLab. (n.d.). GitHub. https://github.com/UKPLab/sentence-transformers
-* Unboxing Nomic Embed v1.5: Resizable Production Embeddings with Matryoshka Representation Learning. (n.d.). https://blog.nomic.ai/posts/nomic-embed-matryoshka
\ No newline at end of file
+* Unboxing Nomic Embed v1.5: Resizable Production Embeddings with Matryoshka Representation Learning. (n.d.). https://blog.nomic.ai/posts/nomic-embed-matryoshka
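To make the `truncate_dim` behaviour described in that hunk concrete, here is a minimal sketch; it uses the model named in the post and the blog's own example sentences, and assumes a sentence-transformers version with `truncate_dim` support (v2.4 or later).

```python
# Minimal sketch, assuming sentence-transformers >= 2.4 (truncate_dim support).
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Load the Matryoshka model referenced above, truncating every embedding to 64 dimensions.
model = SentenceTransformer("tomaarsen/mpnet-base-nli-matryoshka", truncate_dim=64)

embeddings = model.encode([
    "The weather is so nice!",
    "It's so sunny outside!",
    "He drove to the stadium.",
])
print(embeddings.shape)  # => (3, 64)

# Similarities still behave sensibly at the truncated dimensionality.
print(cos_sim(embeddings[0], embeddings[1:]))
```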
diff --git a/static-embeddings.md b/static-embeddings.md
index 67634fad13..a6ed750621 100644
--- a/static-embeddings.md
+++ b/static-embeddings.md
@@ -1327,7 +1327,7 @@ Additionally, there are quite a few possible extensions that are likely to impro
 2. Model Souping: Combining weights from multiple models trained in the same way with different seeds or data distributions.
 3. Curriculum Learning: Train on examples of increasing difficulties.
 4. [Guided False In-Batch Negatives Filtering](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#sentence_transformers.losses.CachedGISTEmbedLoss): Exclude false negatives via an efficient pre-trained embedding model.
-5. [Seed Optimization for the Random Weight Initialization](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/data_augmentation/train_sts_seed_optimization.py): Train the first steps with various seeds to find one with a useful weight initialization.
+5. [Seed Optimization for the Random Weight Initialization](https://github.com/UKPLab/sentence-transformers/blob/master/examples/sentence_transformer/training/data_augmentation/train_sts_seed_optimization.py): Train the first steps with various seeds to find one with a useful weight initialization.
 6. Tokenizer Retraining: Retrain a tokenizer with modern texts and learnings.
 7. Gradient Caching: Applying GradCache via [`CachedMultipleNegativesRankingLoss`](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#sentence_transformers.losses.CachedMultipleNegativesRankingLoss) allows for larger batches, which often result in superior performance.
 8. [Model Distillation](https://sbert.net/examples/training/distillation/README.html): Rather than training exclusively using supervised training data, we can also feed unsupervised data through a larger embedding model and distil those embeddings into the static embedding-based student model.
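As a small illustration of the Gradient Caching extension in point 7 of that list, here is a hedged sketch of swapping in `CachedMultipleNegativesRankingLoss`; the base model and the `mini_batch_size` value are placeholders, not the recipe from the post.

```python
# Minimal sketch of the GradCache idea: large effective batches with bounded memory.
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import CachedMultipleNegativesRankingLoss

# Placeholder base model; a static-embedding or transformer-based model both work.
model = SentenceTransformer("microsoft/mpnet-base")

# The loss sees the full (large) training batch, but backpropagates through
# mini-batches of `mini_batch_size` examples at a time, caching gradients in between.
loss = CachedMultipleNegativesRankingLoss(model, mini_batch_size=32)
```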
diff --git a/zh/embedding-quantization.md b/zh/embedding-quantization.md
index 5452321634..fe7120ff1c 100644
--- a/zh/embedding-quantization.md
+++ b/zh/embedding-quantization.md
@@ -339,17 +339,17 @@ int8
 
 - **推荐检索**:
-  - [semantic_search_recommended.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/embedding-quantization/semantic_search_recommended.py): 此脚本结合了二进制搜索和标量重打分,与上面的演示类似,以实现廉价、高效且性能良好的检索。
+  - [semantic_search_recommended.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/sentence_transformer/applications/embedding-quantization/semantic_search_recommended.py): 此脚本结合了二进制搜索和标量重打分,与上面的演示类似,以实现廉价、高效且性能良好的检索。
 - **使用**:
-  - [semantic_search_faiss.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/embedding-quantization/semantic_search_faiss.py): 此脚本展示了使用 FAISS 的常规二进制或标量量化、检索和重打分的使用方式,通过使用 semantic_search_faiss 实用函数。
-  - [semantic_search_usearch.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/embedding-quantization/semantic_search_usearch.py): 此脚本展示了使用 USearch 的常规二进制或标量量化、检索和重打分的使用方式,通过使用 semantic_search_usearch 实用函数。
+  - [semantic_search_faiss.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/sentence_transformer/applications/embedding-quantization/semantic_search_faiss.py): 此脚本展示了使用 FAISS 的常规二进制或标量量化、检索和重打分的使用方式,通过使用 semantic_search_faiss 实用函数。
+  - [semantic_search_usearch.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/sentence_transformer/applications/embedding-quantization/semantic_search_usearch.py): 此脚本展示了使用 USearch 的常规二进制或标量量化、检索和重打分的使用方式,通过使用 semantic_search_usearch 实用函数。
 - **基准测试**:
-  - [semantic_search_faiss_benchmark.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/embedding-quantization/semantic_search_faiss_benchmark.py): 此脚本包括了对 `float32` 检索、二进制检索加重打分和标量检索加重打分的检索速度基准测试,使用 FAISS。它使用了 semantic_search_faiss 实用函数。我们的基准测试特别显示了 `ubinary` 的速度提升。
-  - [semantic_search_usearch_benchmark.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/embedding-quantization/semantic_search_usearch_benchmark.py): 此脚本包括了对 `float32` 检索、二进制检索加重打分和标量检索加重打分的检索速度基准测试,使用 USearch。它使用了 semantic_search_usearch 实用函数。我们的实验在新硬件上显示了巨大的速度提升,特别是对于 `int8` 。
+  - [semantic_search_faiss_benchmark.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/sentence_transformer/applications/embedding-quantization/semantic_search_faiss_benchmark.py): 此脚本包括了对 `float32` 检索、二进制检索加重打分和标量检索加重打分的检索速度基准测试,使用 FAISS。它使用了 semantic_search_faiss 实用函数。我们的基准测试特别显示了 `ubinary` 的速度提升。
+  - [semantic_search_usearch_benchmark.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/sentence_transformer/applications/embedding-quantization/semantic_search_usearch_benchmark.py): 此脚本包括了对 `float32` 检索、二进制检索加重打分和标量检索加重打分的检索速度基准测试,使用 USearch。它使用了 semantic_search_usearch 实用函数。我们的实验在新硬件上显示了巨大的速度提升,特别是对于 `int8` 。
 
 ### 未来工作
 
diff --git a/zh/matryoshka.md b/zh/matryoshka.md
index 730d6fbf70..a4f02659b1 100644
--- a/zh/matryoshka.md
+++ b/zh/matryoshka.md
@@ -129,9 +129,9 @@ model.fit(
 
 请查看以下完整脚本,了解如何在实际应用中使用 `MatryoshkaLoss` :
 
-- **[matryoshka_nli.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/matryoshka/matryoshka_nli.py)**: 此示例使用 `MultipleNegativesRankingLoss` 与 `MatryoshkaLoss` 结合,利用自然语言推理 (NLI) 数据训练一个强大的嵌入模型。这是对 [NLI](../nli/README) 文档的改编。
-- **[matryoshka_nli_reduced_dim.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/matryoshka/matryoshka_nli_reduced_dim.py)**: 此示例使用 `MultipleNegativesRankingLoss` 与 `MatryoshkaLoss` 结合,训练一个最大输出维度为 256 的小型嵌入模型。它使用自然语言推理 (NLI) 数据进行训练,这是对 [NLI](../nli/README) 文档的改编。
-- **[matryoshka_sts.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/matryoshka/matryoshka_sts.py)**: 此示例使用 `CoSENTLoss` 与 `MatryoshkaLoss` 结合,在 `STSBenchmark` 数据集的训练集上训练一个嵌入模型。这是对 [STS](../sts/README) 文档的改编。
+- **[matryoshka_nli.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/sentence_transformer/training/matryoshka/matryoshka_nli.py)**: 此示例使用 `MultipleNegativesRankingLoss` 与 `MatryoshkaLoss` 结合,利用自然语言推理 (NLI) 数据训练一个强大的嵌入模型。这是对 [NLI](../nli/README) 文档的改编。
+- **[matryoshka_nli_reduced_dim.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/sentence_transformer/training/matryoshka/matryoshka_nli_reduced_dim.py)**: 此示例使用 `MultipleNegativesRankingLoss` 与 `MatryoshkaLoss` 结合,训练一个最大输出维度为 256 的小型嵌入模型。它使用自然语言推理 (NLI) 数据进行训练,这是对 [NLI](../nli/README) 文档的改编。
+- **[matryoshka_sts.py](https://github.com/UKPLab/sentence-transformers/blob/master/examples/sentence_transformer/training/matryoshka/matryoshka_sts.py)**: 此示例使用 `CoSENTLoss` 与 `MatryoshkaLoss` 结合,在 `STSBenchmark` 数据集的训练集上训练一个嵌入模型。这是对 [STS](../sts/README) 文档的改编。
 
@@ -150,7 +150,7 @@
 ### 在 Sentence Transformers 中
 
 在 Sentence Transformers 中,你可以像加载普通模型一样加载俄罗斯套娃嵌入模型,并使用 [`SentenceTransformers.encode`](https://sbert.net/docs/package_reference/SentenceTransformer.html#sentence_transformers.SentenceTransformer.encode) 进行推理。获取嵌入后,我们可以将它们截断到我们所需的尺寸,如果需要,我们还可以对它们进行归一化。
-让我们尝试使用我使用 [`matryoshka_nli.py`](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/matryoshka/matryoshka_nli.py) 和 [`microsoft/mpnet-base`](https://huggingface.co/microsoft/mpnet-base) 训练的模型:
+让我们尝试使用我使用 [`matryoshka_nli.py`](https://github.com/UKPLab/sentence-transformers/blob/master/examples/sentence_transformer/training/matryoshka/matryoshka_nli.py) 和 [`microsoft/mpnet-base`](https://huggingface.co/microsoft/mpnet-base) 训练的模型:
 
 ```python
 from sentence_transformers import SentenceTransformer
@@ -223,8 +223,8 @@ similarities = cos_sim(embeddings[0], embeddings[1:])
 
 现在我们已经介绍了俄罗斯套娃模型,让我们来看看我们可以从俄罗斯套娃嵌入模型与常规嵌入模型中实际期待的绩效表现。为了这个实验,我训练了两个模型:
 
-- [tomaarsen/mpnet-base-nli-matryoshka](https://huggingface.co/tomaarsen/mpnet-base-nli-matryoshka): 通过运行 [`matryoshka_nli.py`](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/matryoshka/matryoshka_nli.py) 与 [`microsoft/mpnet-base`](https://huggingface.co/microsoft/mpnet-base) 进行训练。
-- [tomaarsen/mpnet-base-nli](https://huggingface.co/tomaarsen/mpnet-base-nli): 通过运行修改版的 [`matryoshka_nli.py`](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/matryoshka/matryoshka_nli.py) 进行训练,其中训练损失仅为 `MultipleNegativesRankingLoss` ,而不是在 `MultipleNegativesRankingLoss` 之上的 `MatryoshkaLoss` 。我也使用 [`microsoft/mpnet-base`](https://huggingface.co/microsoft/mpnet-base) 作为基础模型。
+- [tomaarsen/mpnet-base-nli-matryoshka](https://huggingface.co/tomaarsen/mpnet-base-nli-matryoshka): 通过运行 [`matryoshka_nli.py`](https://github.com/UKPLab/sentence-transformers/blob/master/examples/sentence_transformer/training/matryoshka/matryoshka_nli.py) 与 [`microsoft/mpnet-base`](https://huggingface.co/microsoft/mpnet-base) 进行训练。
+- [tomaarsen/mpnet-base-nli](https://huggingface.co/tomaarsen/mpnet-base-nli): 通过运行修改版的 [`matryoshka_nli.py`](https://github.com/UKPLab/sentence-transformers/blob/master/examples/sentence_transformer/training/matryoshka/matryoshka_nli.py) 进行训练,其中训练损失仅为 `MultipleNegativesRankingLoss` ,而不是在 `MultipleNegativesRankingLoss` 之上的 `MatryoshkaLoss` 。我也使用 [`microsoft/mpnet-base`](https://huggingface.co/microsoft/mpnet-base) 作为基础模型。
 
 这两个模型都在 AllNLI 数据集上进行了训练,该数据集是 [SNLI](https://huggingface.co/datasets/snli) 和 [MultiNLI](https://huggingface.co/datasets/multi_nli) 数据集的拼接。我使用多种不同的嵌入维度在这些模型上评估了 [STSBenchmark](https://huggingface.co/datasets/mteb/stsbenchmark-sts) 测试集。结果绘制在下面的图表中:
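For completeness, here is a minimal sketch of the `MatryoshkaLoss` setup that the `matryoshka_nli.py` scripts referenced throughout this diff are built around; the base model and the dimension list are placeholders, not the exact training configuration of those scripts.

```python
# Minimal sketch of wrapping a base loss in MatryoshkaLoss (placeholder model and dims).
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

model = SentenceTransformer("microsoft/mpnet-base")

base_loss = MultipleNegativesRankingLoss(model)
# Optimize the same objective at several truncated dimensionalities at once,
# so the leading dimensions of each embedding stay useful on their own.
loss = MatryoshkaLoss(model, base_loss, matryoshka_dims=[768, 512, 256, 128, 64])
```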