
Commit 8087aea

danielfleischer, Peter Izsak, and mosheber committed
Compatibility with Haystack v2
Co-authored-by: Peter Izsak <[email protected]>
Co-authored-by: Moshe Berchansky <[email protected]>
1 parent 3959edd commit 8087aea

File tree

140 files changed: +11969 −10135 lines


Demo.md (new file, +134 lines)

# Chat Demo with Chainlit

We provide a conversational demo with a multi-modal agent, built on the [Chainlit](https://github.com/Chainlit/chainlit) framework. For more information, please visit the official documentation [here](https://docs.chainlit.io/get-started/overview).

For a simple chat experience, we load an LLM, such as [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct), by specifying its configuration like so:

```yaml
task: text-generation
model: "meta-llama/Meta-Llama-3-8B-Instruct"
do_sample: false
max_new_tokens: 300
```
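
A configuration like this essentially describes a local Haystack generator. As a minimal sketch (assuming the standard Haystack v2 `HuggingFaceLocalGenerator` API rather than fastRAG internals), the equivalent direct construction would look roughly like:

```python
# Minimal sketch: the YAML above maps onto a Haystack v2 local generator.
# Assumes `pip install haystack-ai` and enough memory to host the model.
from haystack.components.generators import HuggingFaceLocalGenerator

generator = HuggingFaceLocalGenerator(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    task="text-generation",
    generation_kwargs={"max_new_tokens": 300, "do_sample": False},
)
generator.warm_up()  # loads the underlying transformers pipeline
print(generator.run(prompt="Hello!")["replies"][0])
```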

Then, run the following:

```sh
CONFIG=config/regular_chat.yaml chainlit run fastrag/ui/chainlit_no_rag.py
```

For a chat using a RAG pipeline, specify the tools you wish to use in the following format:

```yaml
chat_model:
  generator_kwargs:
    model: microsoft/Phi-3-mini-128k-instruct
    task: "text-generation"
    generation_kwargs:
      max_new_tokens: 300
      do_sample: false
    huggingface_pipeline_kwargs:
      torch_dtype: torch.bfloat16
      max_new_tokens: 300
      do_sample: false
      trust_remote_code: true
  generator_class: haystack.components.generators.hugging_face_local.HuggingFaceLocalGenerator
tools:
  - type: doc
    query_handler:
      type: "haystack_yaml"
      params:
        pipeline_yaml_path: "config/empty_doc_only_retrieval_pipeline.yaml"
    index_handler:
      type: "haystack_yaml"
      params:
        pipeline_yaml_path: "config/empty_index_pipeline.yaml"
    params:
      name: "docRetriever"
      description: 'useful for when you need to retrieve text to answer questions. Use the following format: {{ "input": [your tool input here ] }}.'
      output_variable: "documents"
```

Then, run the application using the command:

```sh
CONFIG=config/rag_pipeline_chat.yaml chainlit run fastrag/ui/chainlit_pipeline.py
```

## Screenshot

![Chainlit RAG chat demo](./assets/chainlit_demo_example.png)
62+
63+
64+
# Multi-Modal Conversational Agent with Chainlit
65+
66+
In this demo, we use the [```xtuner/llava-llama-3-8b-v1_1-transformers```]https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-transformers) model as a conversational agent, that can decide which retriever to use to respond to the user's query.
67+
To perform that, we use dynamic reasoning with [ReAct](https://arxiv.org/abs/2210.03629) prompts, resulting in multiple logical turns.
68+
To explore all the steps to build the agent system, you can check out our [Example Notebook](../examples/multi_modal_react_agent.ipynb).
69+
For more information on how to use ReAct, feel free to visit [Haystack's original tutorial](https://haystack.deepset.ai/tutorials/25_customizing_agent), which our demo is based on.
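
To make the control flow concrete, here is a minimal, framework-agnostic sketch of a ReAct-style loop; `generate` and `tools` are caller-supplied placeholders, not fastRAG APIs:

```python
# Illustrative ReAct loop (hypothetical helpers, not the fastRAG implementation):
# the LLM alternates Thought/Action steps, each tool result is appended as an
# Observation, and the loop ends when the model emits a final answer.
def react_answer(question, generate, tools, max_turns=5):
    transcript = f"Question: {question}\n"
    for _ in range(max_turns):
        step = generate(transcript)  # e.g. 'Thought: ... Action: docRetriever[intel headquarters]'
        transcript += step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:")[-1].strip()
        # Very simple parser for "Action: <tool>[<input>]"
        action = step.split("Action:")[-1].strip()
        tool_name, tool_input = action.split("[", 1)
        observation = tools[tool_name.strip()](tool_input.rstrip("]"))  # run the chosen retriever/tool
        transcript += f"Observation: {observation}\n"
    return transcript
```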

To run the demo, simply run:

```sh
CONFIG=config/visual_chat_agent.yaml chainlit run fastrag/ui/chainlit_multi_modal_agent.py
```

## Screenshot

![Chainlit multi-modal agent demo](./assets/chainlit_agent_example.png)

# Available Chat Templates

## Default Template

```
The following is a conversation between a human and an AI. Do not generate the user response to your output.
{memory}
Human: {query}
AI:
```

## Llama 2 Template (Llama2)

```
<s>[INST] <<SYS>>
The following is a conversation between a human and an AI. Do not generate the user response to your output.
<</SYS>>

{memory}{query} [/INST]
```

Notice that here, the user messages will be:

```
<s>[INST] {USER_QUERY} [/INST]
```

And the model messages will be:

```
{ASSISTANT_RESPONSE} </s>
```
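
For clarity, here is a rough sketch of how the Llama 2 template above could be filled in with plain string formatting; the helper below is illustrative, not the template class fastRAG actually uses:

```python
# Illustrative rendering of the Llama 2 chat template above (assumed helper).
# `memory` holds prior turns already wrapped in [INST] ... [/INST] ... </s> form.
LLAMA2_SYSTEM = (
    "<s>[INST] <<SYS>>\n"
    "The following is a conversation between a human and an AI. "
    "Do not generate the user response to your output.\n"
    "<</SYS>>\n\n"
)

def render_llama2_prompt(memory: str, query: str) -> str:
    return f"{LLAMA2_SYSTEM}{memory}{query} [/INST]"

print(render_llama2_prompt(memory="", query="What is fastRAG?"))
```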

## User-Assistant (UserAssistant)

```
### System:
The following is a conversation between a human and an AI. Do not generate the user response to your output.
{memory}

### User: {query}
### Assistant:
```

## User-Assistant for Llava (UserAssistantLlava)

For the v1.5 Llava models, we define a specific template, as shown in [this post regarding Llava models](https://huggingface.co/docs/transformers/model_doc/llava).

```
{memory}

USER: {query}
ASSISTANT:
```
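
This USER/ASSISTANT format matches what the Hugging Face Llava 1.5 checkpoints expect. As a small hedged example with the plain `transformers` pipeline (the model id is shown only as an example, not part of this demo's config):

```python
# Minimal sketch of querying a Llava v1.5 checkpoint with the template above;
# "llava-hf/llava-1.5-7b-hf" is an example model id, not this demo's configured model.
from transformers import pipeline

pipe = pipeline("image-to-text", model="llava-hf/llava-1.5-7b-hf")
prompt = "USER: <image>\nWhat is shown in this picture? ASSISTANT:"
outputs = pipe(
    "https://upload.wikimedia.org/wikipedia/commons/1/13/2200_Mission_College_Boulevard.jpg",
    prompt=prompt,
    generate_kwargs={"max_new_tokens": 100},
)
print(outputs[0]["generated_text"])
```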

README.md (+10 −11)

@@ -1,6 +1,3 @@
-> [!IMPORTANT]
-> We're migrating `fastRAG` to Haystack v2.0 API and will release a major update soon. Stay tuned!
-
 <div align="center">
 <img src="assets/fastrag_header.png" width="300"/>

@@ -10,7 +7,7 @@
 <p>Build and explore efficient retrieval-augmented generative models and applications</p>
 </h4>

-:round_pushpin: <a href="#round_pushpin-installation">Installation</a> • :rocket: <a href="components.md">Components</a> • :books: <a href="examples.md">Examples</a> • :red_car: <a href="getting_started.md">Getting Started</a> • :pill: <a href="demo/README.md">Demos</a> • :pencil2: <a href="scripts/README.md">Scripts</a> • :bar_chart: <a href="benchmarks/README.md">Benchmarks</a>
+:round_pushpin: <a href="#round_pushpin-installation">Installation</a> • :rocket: <a href="components.md">Components</a> • :books: <a href="examples.md">Examples</a> • :red_car: <a href="getting_started.md">Getting Started</a> • :pill: <a href="Demo.md">Demos</a> • :pencil2: <a href="scripts/README.md">Scripts</a> • :bar_chart: <a href="benchmarks/README.md">Benchmarks</a>


 </div>
@@ -21,11 +18,13 @@ with a comprehensive tool-set for advancing retrieval augmented generation.

 Comments, suggestions, issues and pull-requests are welcomed! :heart:

+> [!IMPORTANT]
+> Now compatible with Haystack v2+. Please report any possible issues you find.
+
 ## :mega: Updates

-- **2024-04**: [(Extra Demo)](extras/rag_on_client.ipynb) **Chat with your documents on Intel Meteor Lake iGPU**.
-- **2023-12**: Gaudi2, ONNX runtime and LlamaCPP support; Optimized Embedding models; Multi-modality and Chat demos; [REPLUG](https://arxiv.org/abs/2301.12652) text generation.
-- **2023-06**: ColBERT index modification: adding/removing documents; see [IndexUpdater](https://github.com/stanford-futuredata/ColBERT/blob/main/colbert/index_updater.py).
+- **2023-12**: Gaudi2 and ONNX runtime support; Optimized Embedding models; Multi-modality and Chat demos; [REPLUG](https://arxiv.org/abs/2301.12652) text generation.
+- **2023-06**: ColBERT index modification: adding/removing documents; see [IndexUpdater](libs/colbert/colbert/index_updater.py).
 - **2023-05**: [RAG with LLM and dynamic prompt synthesis example](examples/rag-prompt-hf.ipynb).
 - **2023-04**: Qdrant `DocumentStore` support.

@@ -57,6 +56,10 @@ For a brief overview of the various unique components in fastRAG refer to the [C
 <td><a href="components.md#fastrag-running-llms-with-onnx-runtime">ONNX Runtime</a></td>
 <td><em>Running LLMs with optimized ONNX-runtime</td>
 </tr>
+<tr>
+<td><a href="components.md#fastrag-running-quantized-llms-using-openvino">OpenVINO</a></td>
+<td><em>Running quantized LLMs using OpenVINO</td>
+</tr>
 <tr>
 <td><a href="components.md#fastrag-running-rag-pipelines-with-llms-on-a-llama-cpp-backend">Llama-CPP</a></td>
 <td><em>Running RAG Pipelines with LLMs on a Llama CPP backend</td>
@@ -117,10 +120,6 @@ pip install .[qdrant] # Support for Qdrant store
 pip install .[colbert] # Support for ColBERT+PLAID; requires FAISS
 pip install .[faiss-cpu] # CPU-based Faiss library
 pip install .[faiss-gpu] # GPU-based Faiss library
-pip install .[knowledge_graph] # Libraries for working with spacy and KG
-
-# User interface (for demos)
-pip install .[ui]

 # Benchmarking
 pip install .[benchmark]

assets/chainlit_agent_example.png (403 KB)

assets/chainlit_demo_example.png (433 KB)
@@ -0,0 +1,17 @@
[
  {
    "image_url": "https://upload.wikimedia.org/wikipedia/commons/1/13/2200_Mission_College_Boulevard.jpg",
    "title": "Intel Headquarters",
    "content": "Headquarters in Santa Clara, California, in 2023"
  },
  {
    "image_url": "https://upload.wikimedia.org/wikipedia/commons/6/64/Intel_8742_153056995.jpg",
    "title": "Intel 8742",
    "content": "The die from an Intel 8742, an 8-bit microcontroller that includes a CPU running at 12 MHz, 128 bytes of RAM, 2048 bytes of EPROM, and I/O in the same chip"
  },
  {
    "image_url": "https://upload.wikimedia.org/wikipedia/commons/f/fc/Intel_Costa_12_2007_SJO_105b.jpg",
    "title": "microprocessor facility",
    "content": "Intel microprocessor facility in Costa Rica was responsible in 2006 for 20% of Costa Rican exports and 4.9% of the country's GDP."
  }
]

chainlit.md (new file, +1 line)

# Chat with an LLM 📚

components.md (+52 −14)

@@ -1,7 +1,5 @@
 # fast**RAG** Components Overview

-
-
 ## REPLUG
 <image align="right" src="assets/replug.png" width="600">

@@ -12,9 +10,9 @@ process a larger number of retrieved documents, without limiting ourselves to th
 method works with any LLM, no fine-tuning is needed. See ([Shi et al. 2023](#shiREPLUGRetrievalAugmentedBlackBox2023))
 for more details.

-We provide implementation for the REPLUG ensembling inference, using the invocation layer
-`ReplugHFLocalInvocationLayer`; Our implementation supports most Hugging FAce models with `.generate()` capabilities (such that implement the generation mixin); For a complete example, see [REPLUG Parallel
-Reader](examples/replug_parallel_reader.ipynb) notebook.
+We provide an implementation of REPLUG ensembling inference, using the `ReplugGenerator` generator; our implementation
+supports most Hugging Face models with `.generate()` capabilities (i.e. those that implement the generation mixin). For a
+complete example, see the [REPLUG Parallel Reader](examples/replug_parallel_reader.ipynb) notebook.
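
For intuition, the ensembling step amounts to averaging per-document next-token distributions weighted by retrieval scores; a minimal sketch of that idea (illustrative only, not the `ReplugGenerator` code) is:

```python
# Illustrative REPLUG-style ensembling (not the fastRAG implementation):
# run the LM once per (document + query) context and average the next-token
# probabilities, weighted by each document's retrieval score.
import torch

def replug_next_token_probs(per_doc_logits: torch.Tensor,
                            retrieval_scores: torch.Tensor) -> torch.Tensor:
    # per_doc_logits: [num_docs, vocab_size], retrieval_scores: [num_docs]
    doc_weights = torch.softmax(retrieval_scores, dim=0)         # normalize retrieval scores
    token_probs = torch.softmax(per_doc_logits, dim=-1)          # per-document distributions
    return (doc_weights.unsqueeze(-1) * token_probs).sum(dim=0)  # weighted ensemble
```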

 ## ColBERT v2 with PLAID Engine

@@ -82,11 +80,10 @@ We enabled support for running LLMs on Habana Gaudi (DL1) and Habana Gaudi 2 by
 See below an example for loading a `PromptModel` with Habana backend:

 ```python
-from fastrag.prompters.invocation_layers.gaudi_hugging_face_inference import GaudiHFLocalInvocationLayer
+from fastrag.generators import GaudiGenerator

-PrompterModel = PromptModel(
+generator = GaudiGenerator(
     model_name_or_path= "meta-llama/Llama-2-7b-chat-hf",
-    invocation_layer_class=GaudiHFLocalInvocationLayer,
     model_kwargs= dict(
         max_new_tokens=50,
         torch_dtype=torch.bfloat16,
@@ -137,18 +134,60 @@ quantizer.quantize(save_dir=os.path.join(converted_model_path, 'quantized'), qua

 ### Loading the Quantized Model

-Now that our model is quantized, we can load it in our framework, by specifying the ```ORTInvocationLayer``` invocation layer.
+Now that our model is quantized, we can load it in our framework by using the ```ORTGenerator``` generator.

 ```python
-PrompterModel = PromptModel(
-    model_name_or_path= "my/local/path/quantized",
-    invocation_layer_class=ORTInvocationLayer,
+generator = ORTGenerator(
+    model="my/local/path/quantized",
+    task="text-generation",
+    generation_kwargs={
+        "max_new_tokens": 100,
+    }
+)
+```
+
+## fastRAG running quantized LLMs using OpenVINO
+
+We provide a method for running quantized LLMs with [OpenVINO](https://docs.openvino.ai/2024/home.html) and [optimum-intel](https://github.com/huggingface/optimum-intel).
+We recommend checking out our [notebook](examples/rag_with_openvino.ipynb) with all the details, including the quantization and pipeline construction.
+
+### Installation
+
+Run the following command to install our dependencies:
+
+```
+pip install -e .[openvino]
+```
+
+For more information regarding the installation process, we recommend checking out the guides provided by [OpenVINO](https://docs.openvino.ai/2024/home.html) and [optimum-intel](https://github.com/huggingface/optimum-intel).
+
+### LLM Quantization
+
+We can use the [OpenVINO tutorial notebook](https://github.com/openvinotoolkit/openvino_notebooks/blob/main/notebooks/254-llm-chatbot/254-llm-chatbot.ipynb) to quantize an LLM to our liking.
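
As a minimal sketch of one way to produce an OpenVINO model directory with optimum-intel (an assumed workflow; the linked notebook covers the full quantization flow), the export step might look like:

```python
# Assumed optimum-intel workflow (illustrative): export the model to OpenVINO IR
# and save it to the directory that is later passed as `compressed_model_dir`.
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "microsoft/phi-2"
ov_model = OVModelForCausalLM.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

ov_model.save_pretrained("path/to/model")
tokenizer.save_pretrained("path/to/model")
```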
+
+### Loading the Quantized Model
+
+Now that our model is quantized, we can load it in our framework by using the ```OpenVINOGenerator``` component.
+
+```python
+from fastrag.generators.openvino import OpenVINOGenerator
+
+openvino_compressed_model_path = "path/to/model"
+
+generator = OpenVINOGenerator(
+    model="microsoft/phi-2",
+    compressed_model_dir=openvino_compressed_model_path,
+    device_openvino="CPU",
+    task="text-generation",
+    generation_kwargs={
+        "max_new_tokens": 100,
+    }
 )
 ```
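
Assuming the component follows the common Haystack v2 generator interface (an assumption, not something this diff shows), it can then be queried directly or wired into a pipeline:

```python
# Assumed usage, following the standard Haystack v2 generator interface.
result = generator.run(prompt="What does fastRAG provide?")
print(result["replies"][0])
```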

 ## fastRAG Running RAG Pipelines with LLMs on a Llama CPP backend

-To run LLM effectively on CPUs, especially on client side machines, we offer a method for running LLMs using the [llama-cpp](https://github.com/ggerganov/llama.cpp).
+To run LLMs effectively on CPUs, especially on client-side machines, we offer a method for running LLMs using [llama-cpp](https://github.com/ggerganov/llama.cpp).
 We recommend checking out our [tutorial notebook](examples/client_inference_with_Llama_cpp.ipynb) with all the details, including processes such as downloading GGUF models.

 ### Installation
@@ -176,7 +215,6 @@ PrompterModel = PromptModel(
 )
 ```

-
 ## Optimized Embedding Models

 Bi-encoder Embedders are key components of Retrieval Augmented Generation pipelines. Mainly used for indexing documents and for online re-ranking. We provide support for quantized `int8` models that have low latency and high throughput, using [`optimum-intel`](https://github.com/huggingface/optimum-intel) framework.

config/data/wikipedia_hf_6M.yaml (+1)

@@ -5,3 +5,4 @@ dataset_info:
   split: "train"
 encoding_method: "wikipedia_hf_multisentence"
 batch_size: 500
+limit:

config/data/wikitext_hf.yaml (new file, +8 lines)

type: fastrag.data_loaders.HFDatasetLoader
dataset_info:
  path: wikitext
  name: wikitext-2-v1
  split: train
encoding_method: text
batch_size: 100
limit:

config/doc_chat_ort.yaml (−1)

@@ -6,5 +6,4 @@ chat_model:
 session.intra_op_thread_affinities: '3,4;5,6;7,8;9,10;11,12'
 intra_op_num_threads: 6
 model_name_or_path: '/tmp/facebook_opt-iml-max-1.3b/quantized'
-invocation_layer_class: fastrag.prompters.invocation_layers.ort.ORTInvocationLayer
 doc_pipeline_file: "config/empty_retrieval_pipeline.yaml"

config/embedder/dpr.yaml (−9)

This file was deleted.

@@ -0,0 +1,7 @@
type: haystack.components.embedders.SentenceTransformersDocumentEmbedder
init_parameters:
  model: "BAAI/llm-embedder"
  meta_fields_to_embed: ['title'] # optional
  prefix: "Represent this document for retrieval: "
  batch_size: 256
  device:
@@ -0,0 +1,6 @@
type: haystack.components.embedders.SentenceTransformersTextEmbedder
init_parameters:
  model: "BAAI/llm-embedder"
  prefix: "Represent this query for retrieving relevant documents: "
  batch_size: 256
  device:

config/embedder/sentence-transformer.yaml (−9)

This file was deleted.
