A FastAPI-based Retrieval-Augmented Generation (RAG) service that combines document retrieval with text generation.
- Create a conda environment and install the dependencies from requirements.txt

  TIP: Check this example for how to use Slurm to create a conda environment.

  ```bash
  conda create -n rag python=3.10 -y
  conda activate rag
  git clone https://github.com/ed-aisys/edin-mls-25-spring.git
  cd edin-mls-25-spring/task-2
  pip install -r requirements.txt
  ```
- Run the service

  ```bash
  python serving_rag.py
  ```
- Test the service

  ```bash
  curl -X POST "http://localhost:8000/rag" -H "Content-Type: application/json" -d '{"query": "Which animals can hover in the air?"}'
  ```
Note:
If you encounter issues while downloading model checkpoints on a GPU machine, try the following workaround:

- Manually download the model on the host machine first:

  ```bash
  conda activate rag
  huggingface-cli download <model_name>
  ```
- Create a new script (bash or Python) to test the service at different request rates. A reference implementation is TraceStorm.
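  A minimal closed-loop load generator might look like the sketch below. It assumes the service from `serving_rag.py` is listening on `http://localhost:8000/rag` with the JSON payload shape from the curl example above; the endpoint URL, rates, and helper names are illustrative, not prescribed by the task.

  ```python
  """Fire requests at a fixed rate and record per-request latency."""
  import json
  import threading
  import time
  import urllib.request

  URL = "http://localhost:8000/rag"  # assumed endpoint, matches the curl example


  def build_payload(query: str) -> bytes:
      """Encode the request body the /rag endpoint expects."""
      return json.dumps({"query": query}).encode()


  def send_request(query: str) -> float:
      """Send one request and return its latency in seconds."""
      req = urllib.request.Request(
          URL, data=build_payload(query),
          headers={"Content-Type": "application/json"},
      )
      start = time.perf_counter()
      with urllib.request.urlopen(req) as resp:
          resp.read()
      return time.perf_counter() - start


  def run_at_rate(rate_per_sec: float, duration_sec: float) -> list[float]:
      """Spawn one thread per request at a fixed arrival rate."""
      latencies: list[float] = []
      lock = threading.Lock()
      threads = []

      def worker() -> None:
          latency = send_request("Which animals can hover in the air?")
          with lock:
              latencies.append(latency)

      deadline = time.time() + duration_sec
      while time.time() < deadline:
          t = threading.Thread(target=worker)
          t.start()
          threads.append(t)
          time.sleep(1.0 / rate_per_sec)
      for t in threads:
          t.join()
      return latencies


  if __name__ == "__main__":
      for rate in (1, 2, 4, 8):
          lats = sorted(run_at_rate(rate, duration_sec=10))
          print(f"rate={rate}/s  n={len(lats)}  p50={lats[len(lats) // 2]:.3f}s")
  ```

  Sweeping the rate upward until latency degrades gives you the saturation point of the unmodified service, which is the baseline for the optimizations below.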
- Implement a request queue to handle concurrent requests

  A potential design:
  - Create a request queue
  - Put incoming requests into the queue instead of processing them directly
  - Start a background thread that listens on the request queue
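  The design above can be sketched as follows. The names (`REQUEST_QUEUE`, `process_query`, `handle_request`) are illustrative, and `process_query` stands in for the real retrieve-and-generate pipeline in `serving_rag.py`.

  ```python
  """Queue-based request handling: enqueue, process in a background thread."""
  import queue
  import threading

  # Each item: (query, slot to write the result into, event to signal completion)
  REQUEST_QUEUE: "queue.Queue[tuple[str, dict, threading.Event]]" = queue.Queue()


  def process_query(query: str) -> str:
      # Placeholder for the actual RAG pipeline (retrieval + generation).
      return f"answer to: {query}"


  def worker_loop() -> None:
      """Background thread: block on the queue, process one request at a time."""
      while True:
          query, result_slot, done = REQUEST_QUEUE.get()
          result_slot["result"] = process_query(query)
          done.set()  # wake up the handler waiting on this request


  threading.Thread(target=worker_loop, daemon=True).start()


  def handle_request(query: str) -> str:
      """What the FastAPI endpoint would do: enqueue, then wait for the result."""
      result_slot: dict = {}
      done = threading.Event()
      REQUEST_QUEUE.put((query, result_slot, done))
      done.wait()
      return result_slot["result"]
  ```

  The endpoint handler still blocks per request, but all model work is now funneled through one place, which is what makes the batching step below possible.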
- Implement a batch processing mechanism
  - Take up to MAX_BATCH_SIZE requests from the queue, or wait up to MAX_WAITING_TIME
  - Process the batched requests together
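  One way to implement the collection step is the sketch below: block for the first request, then keep pulling until either the size cap or the waiting-time budget is hit. The constant values and function name are illustrative, not taken from `serving_rag.py`.

  ```python
  """Collect up to MAX_BATCH_SIZE requests or until MAX_WAITING_TIME elapses."""
  import queue
  import time

  MAX_BATCH_SIZE = 8        # illustrative value; tune against your workload
  MAX_WAITING_TIME = 0.05   # seconds; latency you are willing to trade for batching


  def collect_batch(q: "queue.Queue[str]") -> list[str]:
      """Return a non-empty batch bounded by size and waiting time."""
      batch = [q.get()]  # wait indefinitely for at least one request
      deadline = time.monotonic() + MAX_WAITING_TIME
      while len(batch) < MAX_BATCH_SIZE:
          remaining = deadline - time.monotonic()
          if remaining <= 0:
              break
          try:
              batch.append(q.get(timeout=remaining))
          except queue.Empty:
              break  # budget exhausted with a partial batch
      return batch
  ```

  The background thread then runs `collect_batch` in a loop and feeds each batch through the model in one forward pass, signaling each request's completion event afterwards.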
- Measure the performance of the optimized system and compare it against the original service
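  For the comparison, summarizing the latency samples from each load-test run with the same statistics keeps the two systems directly comparable. The helpers below are one simple way to do that (nearest-rank percentiles); the function names are illustrative.

  ```python
  """Summarize latency samples so baseline and optimized runs can be compared."""
  import math


  def percentile(samples: list[float], p: float) -> float:
      """Nearest-rank percentile of a non-empty sample list."""
      ordered = sorted(samples)
      k = max(0, math.ceil(p / 100 * len(ordered)) - 1)
      return ordered[k]


  def summarise(name: str, latencies: list[float]) -> dict:
      """Report sample count, median, and tail latency for one run."""
      return {
          "name": name,
          "n": len(latencies),
          "p50": percentile(latencies, 50),
          "p99": percentile(latencies, 99),
      }
  ```

  Comparing p50 and p99 at each request rate, for both the original and the batched service, shows where batching helps throughput and what it costs in tail latency.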
- Draw a conclusion