feat: integrate got-ocr2.0 as image reader #355

Open
wants to merge 18 commits into
base: main
25 changes: 25 additions & 0 deletions .github/workflows/stale.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,25 @@
name: Close stale issues
on:
  schedule:
    - cron: "0 0 * * *" # Runs at 00:00 UTC every day

jobs:
  stale:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/stale@v9
        with:
          repo-token: ${{ secrets.GITHUB_TOKEN }}

          stale-issue-message: |
            Hi folks 😊, it looks like this issue has been inactive for a while. Is there any update or further action
            needed? If not, we might consider closing it soon to keep our board clean and focused.

            But don't worry, you can reopen it anytime when needed. Thank you for your contributions to Kotaemon🪴

          days-before-issue-stale: 30
          days-before-issue-close: 3
          days-before-pr-stale: 90
          days-before-pr-close: -1
          exempt-issue-labels: "documentation,tutorial,TODO"
          operations-per-run: 300 # The maximum number of operations per run, used to control rate limiting.
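To make the timing in this workflow concrete, here is a small Python sketch of the schedule implied by the four `days-before-*` values. The helper name `stale_timeline` is hypothetical and not part of `actions/stale`. It assumes no further activity on the issue or pull request, and that a negative close value means the item is never closed automatically.

```python
from __future__ import annotations

from datetime import date, timedelta


def stale_timeline(
    last_activity: date, days_before_stale: int, days_before_close: int
) -> tuple[date, date | None]:
    """Return (date marked stale, date closed or None if never closed)."""
    stale_on = last_activity + timedelta(days=days_before_stale)
    # Assumption: a negative days-before-close value disables auto-closing.
    if days_before_close < 0:
        return stale_on, None
    return stale_on, stale_on + timedelta(days=days_before_close)


last = date(2024, 1, 1)
# Issues: stale after 30 days, closed 3 days later.
issue_stale, issue_closed = stale_timeline(last, 30, 3)
# Pull requests: stale after 90 days, never closed.
pr_stale, pr_closed = stale_timeline(last, 90, -1)
```

With these settings an untouched issue is closed 33 days after its last activity, while a pull request only receives the stale label.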
9 changes: 9 additions & 0 deletions docker-compose.dev.yml
@@ -0,0 +1,9 @@
services:
  kotaemon:
    volumes:
      - "./ktem_app_data:/app/ktem_app_data"
      - "./libs/kotaemon:/app/kotaemon"
      - "./libs/ktem:/app/ktem"
      - "./flowsettings.py:/app/flowsettings.py"
    ports:
      - "7860:7860"
30 changes: 30 additions & 0 deletions docker-compose.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,30 @@
services:
  kotaemon:
    build:
      context: .
      target: lite
      dockerfile: Dockerfile
    env_file: .env
    environment:
      - GRADIO_SERVER_NAME=0.0.0.0
      - GRADIO_SERVER_PORT=7860
    ports:
      - "7860:7860"
    networks:
      - backend
  # gocr:
  #   image: ghcr.io/phv2312/got-ocr2.0:main
  #   ports:
  #     - "8881:8881"
  #   deploy:
  #     resources:
  #       reservations:
  #         devices:
  #           - driver: nvidia
  #             count: 1
  #             capabilities: [gpu]
  #   networks:
  #     - backend
networks:
  backend:
    driver: bridge
30 changes: 30 additions & 0 deletions integration/got-ocr2.md
@@ -0,0 +1,30 @@
## Extension Manager and GOT-OCR2.0 Loader

## Key Features

### 1. **GOCR2 as Image Reader**

- **GOCR2ImageReader** is a new class that reads images using the [**GOT-OCR2.0** OCR engine](https://github.com/Ucas-HaoranWei/GOT-OCR2.0).
- The reader sends images to an OCR service endpoint. The endpoint can be passed explicitly; otherwise it is read from the `GOCR2_ENDPOINT` environment variable and falls back to `http://localhost:8881/ai/infer/`.
- Failed API calls are retried with exponential backoff, up to 6 attempts in total.
- It loads an image file, extracts its text content, and returns it as a list of `Document` objects with file metadata.
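
The endpoint precedence described above can be sketched in a few lines of Python. This is a minimal illustration, not the reader itself: `resolve_endpoint` and its `env` parameter are hypothetical names introduced so the lookup can be exercised without touching the real environment, and the non-default URLs are placeholders.

```python
import os

DEFAULT_ENDPOINT = "http://localhost:8881/ai/infer/"


def resolve_endpoint(endpoint=None, env=None):
    """Pick the endpoint: explicit argument, then GOCR2_ENDPOINT, then default."""
    # `env` is a hypothetical hook for testing; by default the process
    # environment is used.
    env = os.environ if env is None else env
    return endpoint or env.get("GOCR2_ENDPOINT", DEFAULT_ENDPOINT)


explicit = resolve_endpoint("http://ocr.internal:9000/ai/infer/", env={})
from_env = resolve_endpoint(env={"GOCR2_ENDPOINT": "http://gocr:8881/ai/infer/"})
fallback = resolve_endpoint(env={})
```

One consequence of this pattern: an empty explicit argument falls through to the environment, whereas a `GOCR2_ENDPOINT` that is set but empty is used as-is rather than replaced by the default.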

#### Setup

- We provide a Docker image that serves GOT-OCR2.0 through FastAPI. Pull and start it with:

```bash
docker run -d --gpus all -p 8881:8881 ghcr.io/phv2312/got-ocr2.0:main
```

- The detailed implementation is in [ocr_loader.py](/libs/kotaemon/kotaemon/loaders/ocr_loader.py)

### 2. **Extension Manager**

- ExtensionManager allows users to dynamically manage multiple loaders for different file types.

- Users can switch between multiple loaders for the same file extension, for example choosing between GOCR2ImageReader and
  the unstructured data parser for `.png` files. This gives the flexibility to pick the best-suited loader for the task at hand.

- To change the default loader, go to **Settings**, then **Loader settings**. It displays a grid of extensions and
  their supported loaders. Any change is saved to the database, as with other settings.
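
The idea of several loaders per extension with one selected default can be sketched as a small registry. This is a hypothetical stand-in: the real `ExtensionManager` is not shown in this diff, so the class name `LoaderRegistry`, its methods, and the rule that the first registered loader becomes the default are all assumptions made for illustration. Strings stand in for reader instances.

```python
from collections import defaultdict


class LoaderRegistry:
    """Hypothetical stand-in for an extension-to-loader registry."""

    def __init__(self):
        self._loaders = defaultdict(dict)  # extension -> {loader name: factory}
        self._default = {}  # extension -> selected loader name

    def register(self, extension, name, factory):
        self._loaders[extension][name] = factory
        # Assumption: the first loader registered for an extension is its default.
        self._default.setdefault(extension, name)

    def choices(self, extension):
        return list(self._loaders[extension])

    def select(self, extension, name):
        if name not in self._loaders[extension]:
            raise KeyError(f"no loader {name!r} registered for {extension!r}")
        self._default[extension] = name

    def get(self, extension):
        # Build the currently selected loader for this extension.
        return self._loaders[extension][self._default[extension]]()


registry = LoaderRegistry()
registry.register(".png", "unstructured", lambda: "UnstructuredReader instance")
registry.register(".png", "gocr2", lambda: "GOCR2ImageReader instance")
before = registry.get(".png")
registry.select(".png", "gocr2")
after = registry.get(".png")
```

Storing factories rather than instances keeps registration cheap and defers construction until a loader is actually needed.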
3 changes: 0 additions & 3 deletions libs/kotaemon/kotaemon/indices/ingests/__init__.py

This file was deleted.

137 changes: 0 additions & 137 deletions libs/kotaemon/kotaemon/indices/ingests/files.py

This file was deleted.

3 changes: 2 additions & 1 deletion libs/kotaemon/kotaemon/loaders/__init__.py
Expand Up @@ -7,7 +7,7 @@
 from .excel_loader import ExcelReader, PandasExcelReader
 from .html_loader import HtmlReader, MhtmlReader
 from .mathpix_loader import MathpixPDFReader
-from .ocr_loader import ImageReader, OCRReader
+from .ocr_loader import GOCR2ImageReader, ImageReader, OCRReader
 from .pdf_loader import PDFThumbnailReader
 from .txt_loader import TxtReader
 from .unstructured_loader import UnstructuredReader
Expand All @@ -32,4 +32,5 @@
     "PDFThumbnailReader",
     "WebReader",
     "DoclingReader",
+    "GOCR2ImageReader",
 ]
71 changes: 71 additions & 0 deletions libs/kotaemon/kotaemon/loaders/ocr_loader.py
Expand Up @@ -191,3 +191,74 @@ def load_data(
)

return result


class GOCR2ImageReader(BaseReader):
    """Read an image using GOT-OCR2.0

    Args:
        endpoint: URL of the GOT-OCR2.0 endpoint. If not provided, the reader
            looks for the environment variable `GOCR2_ENDPOINT` and otherwise
            uses the default (http://localhost:8881/ai/infer/)
    """

    default_endpoint = "http://localhost:8881/ai/infer/"

    def __init__(self, endpoint: Optional[str] = None):
        """Init the OCR reader with OCR endpoint (FullOCR pipeline)"""
        super().__init__()
        self.endpoint = endpoint or os.getenv("GOCR2_ENDPOINT", self.default_endpoint)

    def load_data(
        self, file_path: Path, extra_info: dict | None = None, **kwargs
    ) -> List[Document]:
        """Load data using OCR reader

        Args:
            file_path (Path): Path to the image file
            extra_info (dict | None): Extra metadata merged into each document

        Returns:
            List[Document]: list of documents extracted from the image file
        """

        @retry(
            stop=stop_after_attempt(6),
            wait=wait_exponential(multiplier=20, exp_base=2, min=1, max=1000),
            after=after_log(logger, logging.WARNING),
        )
        def _tenacious_api_post(
            url: str, file_path: Path, ocr_type: str = "ocr", **kwargs
        ):
            with file_path.open("rb") as content:
                files = {"file": content}
                data = {"ocr_type": ocr_type}
                resp = requests.post(url=url, files=files, data=data, **kwargs)
                resp.raise_for_status()
            return resp

        file_path = Path(file_path).resolve()

        # call the API from GOCR endpoint
        if "response_content" in kwargs:
            # overriding response content if specified
            ocr_results = kwargs["response_content"]
        else:
            # call original API
            resp = _tenacious_api_post(url=self.endpoint, file_path=file_path)
            ocr_results = [resp.json()["result"]]

        # extra_info defaults to an empty dict, so the update below is always safe
        extra_info = extra_info or {}
        result = []
        for ocr_result in ocr_results:
            metadata = {"file_name": file_path.name, "page_label": 1}
            metadata.update(extra_info)

            result.append(
                Document(
                    content=ocr_result,
                    metadata=metadata,
                )
            )

        return result
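
To get a feel for how long the retry settings above can keep a call waiting, the following Python sketch models the wait schedule. It is a plain-Python approximation, not tenacity itself: it assumes each wait is `multiplier * exp_base ** (attempt - 1)` clamped to `[min, max]`, which matches how recent tenacity releases describe `wait_exponential`, so verify against the installed version. The function name `backoff_schedule` is hypothetical.

```python
def backoff_schedule(attempts, multiplier=20, exp_base=2, min_wait=1, max_wait=1000):
    """Waits (in seconds) between attempts under a clamped exponential model."""
    waits = []
    # There is one wait after each failed attempt except the last one.
    for attempt in range(1, attempts):
        raw = multiplier * exp_base ** (attempt - 1)
        waits.append(max(min_wait, min(raw, max_wait)))
    return waits


waits = backoff_schedule(attempts=6)
total = sum(waits)
```

Under this model a request that keeps failing waits 20, 40, 80, 160, and 320 seconds between its six attempts, roughly ten minutes in total, before the call gives up. That is worth keeping in mind when the OCR service is simply not running.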
15 changes: 0 additions & 15 deletions libs/kotaemon/tests/test_ingestor.py

This file was deleted.

4 changes: 4 additions & 0 deletions libs/ktem/ktem/app.py
Expand Up @@ -8,6 +8,7 @@
 from ktem.components import reasonings
 from ktem.exceptions import HookAlreadyDeclared, HookNotDeclared
 from ktem.index import IndexManager
+from ktem.loaders.extensions import extension_manager
 from ktem.settings import BaseSettingGroup, SettingGroup, SettingReasoningGroup
 from theflow.settings import settings
 from theflow.utils.modules import import_dotted_string
Expand Down Expand Up @@ -63,6 +64,9 @@ def __init__(self):
         self.default_settings = SettingGroup(
             application=BaseSettingGroup(settings=settings.SETTINGS_APP),
             reasoning=SettingReasoningGroup(settings=settings.SETTINGS_REASONING),
+            extension=BaseSettingGroup(
+                settings=extension_manager.generate_gradio_settings()
+            ),
         )

         self._callbacks: dict[str, list] = {}