jplfaria · May 5, 2025
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -0,0 +1,36 @@
+name: CI
+
+on:
+  push:
+    branches: [main]
+  pull_request:
+    branches: [main]
+
+jobs:
+  lint:
+    runs-on: ubuntu-latest
+    strategy:
+      matrix:
+        python-version: ['3.10', '3.11']
+    steps:
+      - uses: actions/checkout@v3
+      - name: Setup Python ${{ matrix.python-version }}
+        uses: actions/setup-python@v4
+        with:
+          python-version: ${{ matrix.python-version }}
+      - name: Install dependencies
+        run: |
+          pip install --upgrade pip
+          pip install black isort flake8 mypy pytest mkdocs mkdocs-material
+      - name: Black check
+        run: black --check .
+      - name: isort check
+        run: isort --check-only .
+      - name: flake8 lint
+        run: flake8 .
+      - name: mypy type check
+        run: mypy .
+      - name: pytest
+        run: pytest -q
+      - name: Build docs
+        run: mkdocs build --strict
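
Note on the version matrix in `ci.yml`: YAML resolves an unquoted `3.10` as the float 3.1, so matrix entries of this form need to be quoted strings for `actions/setup-python` to receive `3.10`. The sketch below imitates that resolution with the standard library only; `resolve_plain_scalar` is a hypothetical stand-in for a YAML loader, not a real parser.

```python
# Sketch: why an unquoted 3.10 in a YAML list is risky.
# Assumption: the loader resolves plain scalars that look like decimals
# to floats, as the YAML core schema does. This helper is a stand-in,
# not a YAML parser.

def resolve_plain_scalar(text: str):
    """Return a float when the text parses as one, otherwise the string."""
    try:
        return float(text)
    except ValueError:
        return text


# Unquoted entries go through scalar resolution; quoted ones stay strings.
unquoted = [resolve_plain_scalar(v) for v in ["3.10", "3.11"]]
quoted = ["3.10", "3.11"]

print([str(v) for v in unquoted])  # ['3.1', '3.11']
print(quoted)                      # ['3.10', '3.11']
```

The first list shows the trailing zero being lost, which would select Python 3.1 rather than 3.10.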
diff --git a/.github/workflows/deploy-docs.yml b/.github/workflows/deploy-docs.yml
@@ -0,0 +1,26 @@
+name: Deploy Docs
+
+on:
+  push:
+    branches: [main]
+
+jobs:
+  deploy:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v3
+      - name: Set up Python
+        uses: actions/setup-python@v4
+        with:
+          python-version: '3.x'
+      - name: Install MkDocs
+        run: |
+          pip install --upgrade pip
+          pip install mkdocs mkdocs-material
+      - name: Build documentation
+        run: mkdocs build --strict
+      - name: Deploy to GitHub Pages
+        uses: peaceiris/actions-gh-pages@v3
+        with:
+          github_token: ${{ secrets.GITHUB_TOKEN }}
+          publish_dir: ./site
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
@@ -0,0 +1,19 @@
+repos:
+  - repo: https://github.com/psf/black
+    rev: 25.1.0
+    hooks:
+      - id: black
+        language_version: python3
+  - repo: https://github.com/pycqa/isort
+    rev: 6.0.1
+    hooks:
+      - id: isort
+  - repo: https://github.com/myint/autoflake
+    rev: v1.4
+    hooks:
+      - id: autoflake
+        args: [--remove-all-unused-imports, --remove-unused-variables, --in-place, --recursive]
+  - repo: https://github.com/charliermarsh/ruff-pre-commit
+    rev: v0.11.8
+    hooks:
+      - id: ruff
diff --git a/README.md b/README.md
@@ -1,164 +1,180 @@
-# CONCORDIA  
+# CONCORDIA
 *CONcordance of Curated & Original Raw Descriptions In Annotations*
 
-Concordia compares two functional-annotation sources—old vs new, RAST vs UniProt, manual vs AI—and writes a tidy table with  
-
-* **`similarity_Pubmedbert`** – cosine similarity from the PubMedBERT sentence-embedding model  
-* **`label`** – a one-word judgement from an LLM (`o3-mini` by default) drawn from a 7-class ontology  
-* **`note`** – the LLM’s ultra-short reason (blank when the heuristic label is used)
-
-You can choose from **four processing modes**:
-
-| Mode | What happens |
-|------|--------------|
-| **llm** | LLM only — cheapest if you already trust the model |
-| **local** | PubMedBERT embeddings → cosine → heuristic label (no LLM) |
-| **dual** | Embeddings **and** LLM (baseline) |
-| **simhint** | Same as **dual** **plus** the cosine similarity is prefixed to the prompt as a weak prior |
-
----
+A toolkit for annotation concordance and entity relationship classification using embeddings and LLMs.
+
+## Features
+- **gateway-check**: Argo Gateway API connectivity check on startup with prod/dev endpoint fallback
+- **local**: PubMedBERT embeddings → cosine similarity → heuristic labels
+- **zero-shot**: Single LLM call with optional similarity hints
+- **vote**: Multiple LLM calls with majority vote (with vote tracking)
+- **rac** (Beta): Retrieval-Augmented Classification with example memory
+- **fallback**: Safe local fallback on errors
+- Template-driven prompt management with versioned external templates (v1.x, v2, v2.1, v3.0, v3.1)
+- Ad-hoc mode for quick two-sentence comparisons (without requiring a CSV file)
+- **list-templates**: List available prompt templates
+- **batch processing**: Control both file chunking and LLM batch sizes
+- **verbose**: Show detailed evidence and explanations
 
 ## Installation
+
+### Using Poetry (recommended)
 ```bash
 git clone https://github.com/you/concordia.git
 cd concordia
-poetry install          # installs deps + CLI
-poetry shell            # activate the virtualenv
+poetry install          # install dependencies & CLI entry-point
+poetry shell            # activate the virtual environment
 ```
 
----
-
-## Quick-start recipes
+### Alternative via pip
+```bash
+pip install concordia
+```
 
-| Goal | Command |
-|------|---------|
-| CSV with default LLM (o3-mini → dev) | `concord example_data/annotations_test.csv` |
-| TSV with GPT-4o (prod)               | `concord example_data/annotations_test.tsv --llm-model gpt4o` |
-| Embed-only (no LLM)                  | `concord …csv --mode local` |
-| Dual (baseline)                      | `concord …csv --mode dual` |
-| Similarity-hint                      | `concord …csv --mode simhint` |
-| Two ad-hoc strings                   | `concord --text-a "RecA" --text-b "DNA recombinase A"` |
-| **Overwrite** existing output        | add `--force` |
+### Syncing Local Dependencies
+If you've installed additional Python packages in your environment, you can compare them with Poetry-managed dependencies:
+```bash
+# export current environment packages
+pip freeze > env-requirements.txt
 
----
+# export Poetry-managed requirements
+poetry export -f requirements.txt --without-hashes > poetry-requirements.txt
 
-## Accepted input formats
+# view differences
+diff env-requirements.txt poetry-requirements.txt
+```
+Manually add any missing packages to `pyproject.toml` under `[tool.poetry.dependencies]` and run `poetry update`.
 
-| Extension | Loader | Note |
-|-----------|--------|------|
-| `.csv`           | `pandas.read_csv(sep=',')` | default |
-| `.tsv` / `.tab`  | `pandas.read_csv(sep='\t')`| or `--sep "\t"` |
-| `.json`          | `pandas.read_json()`       | list-of-objects **or** column-orient |
+## Quickstart
+**CLI**
+```bash
+# Simplified command structure (single invocation)
+concord data/pairs.csv --mode zero-shot --output results.csv
+concord data/pairs.csv --mode local --output local.csv
+concord data/pairs.csv --mode vote --output results_vote.csv
+concord data/pairs.csv --mode rac --output results_rac.csv
 
-If you do **not** pass `--col-a / --col-b`, the **first two textual columns that don’t end with `id`** are taken.
+# Direct text comparison (no CSV required)
+concord --text-a "Entity A" --text-b "Entity B" --mode zero-shot
 
----
+# List available templates
+concord --list-templates
 
-## Minimal sample CSV
-```csv
-annotation_a,annotation_b
-DNA repair protein RecA,Recombinase A
-Hypothetical protein,Uncharacterized protein
+# Control batch processing
+concord data/pairs.csv --batch-size 32 --llm-batch-size 12
 ```
-Any extra columns (e.g. `gene_id`) are preserved in the output.
-
----
-
-## CLI options
 
-| Flag | Description |
-|------|-------------|
-| **`FILE`**             | Input table (`.csv`, `.tsv`, `.json`) |
-| `--text-a / --text-b`  | Compare two strings instead of a file |
-| `--mode`               | `llm` (default) | `local` | `dual` | `simhint` |
-| `--llm-model`          | Gateway LLM (`gpto3mini`, `gpt4o`, …) |
-| `--retry`              | Automatic blank-reply retries (default 5) |
-| `--force`              | Overwrite existing output instead of appending |
-| `--output`             | Destination path (file-mode only) |
-| `--cfg`                | Alternate YAML config |
-| `--col-a / --col-b`    | Explicit annotation columns |
-| `--sep`                | Custom delimiter for text files (e.g. `"\t"`) |
+**Python**
+```python
+from concord.pipeline import run_pair, run_file
+label, sim, evidence = run_pair("Entity A", "Entity B", "config.yaml")
+print(label, sim, evidence)
+```
 
----
+## Evaluation
+After generating predictions, evaluate them against the gold standard:
+```bash
+python eval/evaluate_gold_standard.py \
+  --gold-standard eval/synthetic_gold_standard_v1.csv \
+  --predictions example_data/results_v2_zero.csv \
+  --relationship-column relationship \
+  --plot
+```
 
-## Modes in detail
+## Configuration (`config.yaml`)
+```yaml
+engine:
+  mode: zero-shot        # local | zero-shot | vote | rac
+  sim_hint: false       # Optional: prefix similarity hint to prompts
 
-| Mode | Pipeline | Extra columns |
-|------|----------|---------------|
-| **llm**    | Skip embeddings; every pair → LLM               | `label`, `note` |
-| **local**  | PubMedBERT embeddings → cosine → heuristic      | `similarity_Pubmedbert`, `label` |
-| **dual**   | Embeddings **and** LLM for every row            | `similarity_Pubmedbert`, `label`, `note` |
-| **simhint**| Same as **dual**, but cosine similarity is sent to the LLM prompt | `similarity_Pubmedbert`, `label`, `note` |
+llm:
+  model: gpt4o          # use without hyphens
+  stream: false
+  user: ${ARGO_USER}
 
-> **Embedding model** [`NeuML/pubmedbert-base-embeddings`](https://huggingface.co/NeuML/pubmedbert-base-embeddings) (Apache-2.0)
+local:
+  model_id: NeuML/pubmedbert-base-embeddings
+  device: cpu           # cpu or cuda
 
----
+# RAC mode settings (Beta)
+rac:
+  example_limit: 3      # Number of examples to include in prompts
+  similarity_threshold: 0.6  # Minimum similarity to include example
+  auto_store: true      # Auto-save classifications to vector store
 
-## 7-label ontology
+data_dir: "./data"      # Where to store the vector database
+```
 
-| Label | Meaning |
-|-------|---------|
-| **Exact**        | Same function; wording/punctuation only |
-| **Synonym**      | Semantically equivalent paraphrase |
-| **Broader**      | A ⊃ B (A more general) |
-| **Narrower**     | A ⊂ B (A more specific) |
-| **Related**      | Same pathway / complex / family but not parent–child |
-| **Uninformative**| Placeholder or extremely generic |
-| **Different**    | No functional overlap |
+### Configuration Fields
+- `engine.mode`: select mode (`local`, `zero-shot`, `vote`, `rac`)
+- `engine.sim_hint`: boolean flag to prefix cosine similarity hint to LLM prompts (default: false)
+- `engine.sim_threshold`: similarity threshold for local mode (default: 0.98)
+- `engine.vote_temps`: list of temperatures for vote mode LLM calls (default: `[0.8, 0.2, 0.0]`)
+- `llm.model`: Gateway model name (e.g. `gpt4o`, `gpt35`, `gpto3mini`)
+- `llm.stream`: `true` to use streaming `/streamchat/` endpoint
+- `llm.user`: Argo Gateway username (via `ARGO_USER`)
+- `llm.api_key`: Argo Gateway API key (via `ARGO_API_KEY`)
+- `prompt_ver`: explicit prompt version to use (overrides bucket routing)
+- `local.model_id`: embedding model ID (PubMedBERT or SPECTER2)
+- `local.device`: device for embeddings (`cpu` or `cuda`)
+- `local.batch_size`: batch size for file processing
+- `rac.example_limit`: number of similar examples to retrieve (for RAC mode)
+- `rac.similarity_threshold`: minimum similarity score for examples (0-1)
+- `rac.auto_store`: whether to automatically store successful classifications
+- `data_dir`: directory for storing vector database and other data
 
-### Alias system — why you rarely see **Unknown**
+## RAC Mode (Beta)
 
-Older checkpoints reply with tokens like *Identical* or *Partial*.  
-`llm_label()` holds a tiny alias map so such answers are remapped automatically; genuine blanks are the only source of `Unknown`.
+The Retrieval-Augmented Classification (RAC) mode is currently in beta development. This mode enhances classification by retrieving similar previously classified examples and including them in the prompt for context.
 
----
+### Current Limitations
 
-## Prompting — tweak in one place
+RAC mode has one main limitation, and several improvements are planned:
 
-All prompt text lives in **`concord/llm/prompts.py`**.  
-Edit the few-shot examples or definitions to experiment; no other code must change.
+1. **All Classifications Get Stored**: When `auto_store` is enabled, every successful LLM classification is stored in the vector database, regardless of quality or accuracy.
 
----
+2. **Planned Improvements**:
+   - Human validation before storing examples
+   - Confidence thresholds from the LLM responses
+   - Selective storage based on specific characteristics or patterns
+   - Improved embedding methods for better similarity matching
 
-## Config snapshot (`concord/config.yaml`)
-```yaml
-engine:
-  mode: llm
+### Using RAC Mode
 
-llm:
-  model: gpto3mini        # auto-routes to apps-dev
-  stream: false
-  user: ${ARGO_USER}
+```bash
+# First time setup - create data directory
+mkdir -p data
 
-local:
-  model_id: NeuML/pubmedbert-base-embeddings
+# Run with RAC mode (will build up examples over time)
+concord data/pairs.csv --mode rac --output results_rac.csv
 ```
-*(Leave `env` unset — o-series → apps-dev, GPT-4* → apps-prod.)*
-
----
 
-## Progress, recovery & overwrite
-
-* Output is **appended row-by-row** – abort with Ctrl-C and rerun; finished pairs are skipped.  
-  Add `--force` to **replace** an existing file instead.  
-* Live progress bar example:
-  ```
-  Processing 73%|██████████████▋ | 730/1000 [00:14<00:05, 49.2it/s]
-  ```
-  o3-mini ≈ 0.10 s per pair; embeddings ≈ 1 ms per string on Apple M-series.
+## Documentation
+```bash
+mkdocs serve
+```
+Published site: https://<org>.github.io/concordia/
 
----
+## Environment Variables
+- `ARGO_USER`: ANL login for Argo Gateway (required)
+- `ARGO_API_KEY`: API key for private Argo Gateway (optional)
 
-## FAQ
+## Contributing
+See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.
 
-| Q | A |
-|---|---|
-| **Why keep similarity in LLM mode?** | Free sanity check (~ 2 ms) |
-| **Still see “Unknown”?** | Reply didn’t start with any known token; tweak alias or prompt |
-| **Where are model weights?** | Hugging Face cache (`~/.cache/huggingface`) |
-| **Crash recovery?** | Append-only output; rerun resumes automatically |
-| **Overwrite output?** | Use `--force` to replace an existing file |
+## Testing
+Run all tests via `pytest`:
+```bash
+pytest
+```
 
----
+## Development
+We enforce formatting and linting with pre-commit hooks:
+```bash
+pip install pre-commit
+pre-commit install
+pre-commit run --all-files
+```
 
-*Happy concording! – Stars ⭐, issues 🐞, and PRs 💡 welcome.*
+## License
+Apache-2.0
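
Note on the `vote` mode described in the README: the following is a minimal sketch of majority voting with vote tracking, assuming one LLM call per temperature in `engine.vote_temps`. The `classify` callable, the `majority_vote` helper, and the example labels are hypothetical illustrations, not functions or values taken from the `concord` package. The tie-breaking rule (the label seen first wins) is this sketch's choice.

```python
from collections import Counter
from typing import Callable, Dict, Sequence, Tuple

# Default temperatures the README lists for vote mode.
VOTE_TEMPS = (0.8, 0.2, 0.0)


def majority_vote(
    classify: Callable[[str, str, float], str],
    text_a: str,
    text_b: str,
    temps: Sequence[float] = VOTE_TEMPS,
) -> Tuple[str, Dict[str, int]]:
    """Call `classify` once per temperature and return (winner, tally).

    `classify` is a hypothetical stand-in for a single LLM call that
    returns one label. Ties go to the label that was seen first.
    """
    votes = [classify(text_a, text_b, t) for t in temps]
    tally = Counter(votes)
    winner, _ = tally.most_common(1)[0]
    return winner, dict(tally)


def fake_classify(text_a: str, text_b: str, temperature: float) -> str:
    """Deterministic stub: disagrees only at the highest temperature."""
    return "Related" if temperature > 0.5 else "Synonym"


label, votes = majority_vote(fake_classify, "RecA", "DNA recombinase A")
print(label, votes)  # Synonym {'Related': 1, 'Synonym': 2}
```

Returning the tally alongside the winning label keeps the individual votes available for inspection, which is one way to read "vote tracking".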