[BUG] Inability to search all documents or a group of documents with GraphRAG #548

Open
jradikk opened this issue Dec 3, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@jradikk

jradikk commented Dec 3, 2024

Description

Whenever you try to use all documents or a group of documents with GraphRAG, you get either

searching in doc_ids []
INFO:ktem.index.file.pipelines:Skip retrieval because of no selected files: DocumentRetrievalPipeline(

for all documents or something similar to

AssertionError: GraphRAG index not found for file_id: ["d6f18887-0b01-4df0-a30d-997c919d60f1", "08bad9e8-ef3f-4e89-abed-21da7d4f9611"]

for a group of documents. However, searching any of these documents one at a time works without problems. Since RAG is mostly used to query a large number of different documents, this makes Kotaemon unusable unless you stick with File Collections.
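Since single-document search works, one possible stopgap is to query each selected file separately and merge the hits. The sketch below only illustrates that idea: `retrieve_per_file` and the `retrieve_one` callable are hypothetical names, not part of kotaemon, and the exception types caught are an assumption based on the `AssertionError` shown above.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def retrieve_per_file(
    query: str,
    file_ids: Iterable[str],
    retrieve_one: Callable[[str, str], List[str]],
) -> Tuple[List[str], Dict[str, str]]:
    """Query each file separately and merge the hits.

    `retrieve_one(query, file_id)` is a hypothetical callable standing in for
    whatever performs a single-file GraphRAG search; it is not a kotaemon API.
    """
    hits: List[str] = []
    failures: Dict[str, str] = {}
    for file_id in file_ids:
        try:
            hits.extend(retrieve_one(query, file_id))
        except (AssertionError, LookupError) as exc:
            # Record the failure and continue with the remaining files.
            failures[file_id] = str(exc)
    return hits, failures
```

Hits are simply concatenated in file order, so nothing is ranked across files, and any relationship that spans two documents would not be found this way.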

Reproduction steps

1. Go to Files, upload more than one document
2. Go to Chat, click on 'Graph Collection', choose "Select All"
3. Send any kind of message
4. Observe that the response references no documents and is completely unrelated

Screenshots

No response

Logs

use_quick_index_mode False
reader_mode default
Using reader <kotaemon.loaders.pdf_loader.PDFThumbnailReader object at 0x7f1fdbdd19f0>
Page numbers: 4
Got 4 page thumbnails
Adding documents to doc store
indexing step took 0.2428741455078125
Initializing project at 
/app/ktem_app_data/user_data/files/graphrag/da6da42a-8eb2-4c0b-afb2-a8a56fc509d7

/usr/local/lib/python3.10/site-packages/numpy/core/fromnumeric.py:59: FutureWarning: 'DataFrame.swapaxes' is deprecated and will be removed in a future version. Please use 'DataFrame.transpose' instead.
  return bound(*args, **kwds)
/usr/local/lib/python3.10/site-packages/numpy/core/fromnumeric.py:59: FutureWarning: 'DataFrame.swapaxes' is deprecated and will be removed in a future version. Please use 'DataFrame.transpose' instead.
  return bound(*args, **kwds)
/usr/local/lib/python3.10/site-packages/numpy/core/fromnumeric.py:59: FutureWarning: 'DataFrame.swapaxes' is deprecated and will be removed in a future version. Please use 'DataFrame.transpose' instead.
  return bound(*args, **kwds)
/usr/local/lib/python3.10/site-packages/numpy/core/fromnumeric.py:59: FutureWarning: 'DataFrame.swapaxes' is deprecated and will be removed in a future version. Please use 'DataFrame.transpose' instead.
  return bound(*args, **kwds)
/usr/local/lib/python3.10/site-packages/datashaper/engine/verbs/convert.py:72: FutureWarning: errors='ignore' is deprecated and will raise in a future version. Use to_datetime without passing `errors` and catch exceptions explicitly instead
  datetime_column = pd.to_datetime(column, errors="ignore")
/usr/local/lib/python3.10/site-packages/datashaper/engine/verbs/convert.py:72: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
  datetime_column = pd.to_datetime(column, errors="ignore")
User-id: None, can see public conversations: False
User-id: 1, can see public conversations: True
User-id: 1, can see public conversations: True
Session reasoning type None use mindmap (default) use citation (default) language (default)
Session LLM 
Reasoning class <class 'ktem.reasoning.simple.FullQAPipeline'>
Reasoning state {'app': {'regen': False}, 'pipeline': {}}
Thinking ...
Retrievers [DocumentRetrievalPipeline(DS=<kotaemon.storages.docstores.lancedb.LanceDBDocumentStore object at 0x7f2004dca7d0>, FSPath=PosixPath('/app/ktem_app_data/user_data/files/index_1'), Index=<class 'ktem.index.file.index.IndexTable'>, Source=<class 'ktem.index.file.index.Source'>, VS=<kotaemon.storages.vectorstores.chroma.ChromaVectorStore object at 0x7f2004dcaef0>, get_extra_table=False, llm_scorer=LLMTrulensScoring(concurrent=True, normalize=10, prompt_template=<kotaemon.llms.prompts.template.PromptTemplate object at 0x7f1fd8106290>, system_prompt_template=<kotaemon.llms.prompts.template.PromptTemplate object at 0x7f1fd8106170>, top_k=3, user_prompt_template=<kotaemon.llms.prompts.template.PromptTemplate object at 0x7f1fd8106080>), mmr=False, rerankers=[TeiFastReranking(endpoint_url='http://proxy:3000/v1/rerank', is_truncated=True, model_name='jina')], retrieval_mode='hybrid', top_k=10, user_id=1), GraphRAGRetrieverPipeline(DS=<theflow.base.unset_ object at 0x7f20fe849300>, FSPath=<theflow.base.unset_ object at 0x7f20fe849300>, Index=<class 'ktem.index.file.index.IndexTable'>, Source=<theflow.base.unset_ object at 0x7f20fe849300>, VS=<theflow.base.unset_ object at 0x7f20fe849300>, file_ids=[], user_id=<theflow.base.unset_ object at 0x7f20fe849300>), LightRAGRetrieverPipeline(DS=<theflow.base.unset_ object at 0x7f20fe849300>, FSPath=<theflow.base.unset_ object at 0x7f20fe849300>, Index=<class 'ktem.index.file.index.IndexTable'>, Source=<theflow.base.unset_ object at 0x7f20fe849300>, VS=<theflow.base.unset_ object at 0x7f20fe849300>, file_ids=[], user_id=<theflow.base.unset_ object at 0x7f20fe849300>), NanoGraphRAGRetrieverPipeline(DS=<theflow.base.unset_ object at 0x7f20fe849300>, FSPath=<theflow.base.unset_ object at 0x7f20fe849300>, Index=<class 'ktem.index.file.index.IndexTable'>, Source=<theflow.base.unset_ object at 0x7f20fe849300>, VS=<theflow.base.unset_ object at 0x7f20fe849300>, file_ids=[], user_id=<theflow.base.unset_ object at 0x7f20fe849300>)]
searching in doc_ids []
INFO:ktem.index.file.pipelines:Skip retrieval because of no selected files: DocumentRetrievalPipeline(
  (vector_retrieval): <function Function._prepare_child.<locals>.exec at 0x7f1fbbfe31c0>
  (embedding): <function Function._prepare_child.<locals>.exec at 0x7f1fbbfe32e0>
)
Got 0 retrieved documents
len (original) 0
Got 0 images
Trying LLM streaming
INFO:httpx:HTTP Request: POST http://vllm:8000/v1/chat/completions "HTTP/1.1 200 OK"
Got 0 cited docs
INFO:httpx:HTTP Request: POST http://vllm:8000/v1/chat/completions "HTTP/1.1 200 OK"

Browsers

No response

OS

Linux

Additional information

No response

@jradikk jradikk added the bug Something isn't working label Dec 3, 2024
@jradikk
Author

jradikk commented Dec 3, 2024

Additionally, based on this commit, it seems that groups are intentionally unsupported. Maybe the "Select All" option is intentionally unsupported as well?

@Silverls96

assert (
    len(self.file_ids) <= 1
), "GraphRAG retriever only supports one file_id at a time"

file_id = self.file_ids[0]

I saw this in the code. Is it related to this issue?
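To make the question concrete, here is a minimal sketch of how those two quoted statements behave for zero, one, and several selected files. `pick_file_id` and `describe` are hypothetical stand-ins that reproduce only the quoted lines; the rest of the retriever is not modelled.

```python
def pick_file_id(file_ids):
    """Hypothetical stand-in reproducing only the two quoted statements."""
    assert (
        len(file_ids) <= 1
    ), "GraphRAG retriever only supports one file_id at a time"
    return file_ids[0]


def describe(file_ids):
    """Report which outcome the guard produces for a given selection."""
    try:
        return f"ok: {pick_file_id(file_ids)}"
    except AssertionError as exc:
        return f"AssertionError: {exc}"
    except IndexError:
        # An empty list passes the length check but has no element 0.
        return "IndexError: empty selection"


for selection in (["a"], ["a", "b"], []):
    print(selection, "->", describe(selection))
```

Taken in isolation, the guard accepts exactly one file, rejects two or more with its own message, and fails on an empty list at the indexing step. The message reported above is "GraphRAG index not found for file_id", which differs from this one, so the error in the report may be raised by a different check.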


2 participants