Closed
36 changes: 36 additions & 0 deletions .github/workflows/ci.yml
@@ -0,0 +1,36 @@
name: CI

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  lint:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        # Quoted on purpose: an unquoted 3.10 is parsed by YAML as the float 3.1
        python-version: ["3.10", "3.11"]
    steps:
      - uses: actions/checkout@v3
      - name: Setup Python ${{ matrix.python-version }}
        uses: actions/setup-python@v4
        with:
          python-version: ${{ matrix.python-version }}
      - name: Install dependencies
        run: |
          pip install --upgrade pip
          pip install black isort flake8 mypy pytest mkdocs mkdocs-material
      - name: Black check
        run: black --check .
      - name: isort check
        run: isort --check-only .
      - name: flake8 lint
        run: flake8 .
      - name: mypy type check
        run: mypy .
      - name: pytest
        run: pytest -q
      - name: Build docs
        run: mkdocs build --strict
26 changes: 26 additions & 0 deletions .github/workflows/deploy-docs.yml
@@ -0,0 +1,26 @@
name: Deploy Docs

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.x'
      - name: Install MkDocs
        run: |
          pip install --upgrade pip
          pip install mkdocs mkdocs-material
      - name: Build documentation
        run: mkdocs build --strict
      - name: Deploy to GitHub Pages
        uses: peaceiris/actions-gh-pages@v3
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          publish_dir: ./site
19 changes: 19 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,19 @@
repos:
  - repo: https://github.com/psf/black
    rev: 25.1.0
    hooks:
      - id: black
        language_version: python3
  - repo: https://github.com/pycqa/isort
    rev: 6.0.1
    hooks:
      - id: isort
  - repo: https://github.com/myint/autoflake
    rev: v1.4
    hooks:
      - id: autoflake
        args: [--remove-all-unused-imports, --remove-unused-variables, --in-place, --recursive]
  - repo: https://github.com/charliermarsh/ruff-pre-commit
    rev: v0.11.8
    hooks:
      - id: ruff
262 changes: 139 additions & 123 deletions README.md
@@ -1,164 +1,180 @@
# CONCORDIA
# CONCORDIA
*CONcordance of Curated & Original Raw Descriptions In Annotations*

Concordia compares two functional-annotation sources—old vs new, RAST vs UniProt, manual vs AI—and writes a tidy table with

* **`similarity_Pubmedbert`** – cosine similarity from the PubMedBERT sentence-embedding model
* **`label`** – a one-word judgement from an LLM (`o3-mini` by default) drawn from a 7-class ontology
* **`note`** – the LLM’s ultra-short reason (blank when the heuristic label is used)

You can choose from **four processing modes**:

| Mode | What happens |
|------|--------------|
| **llm** | LLM only — cheapest if you already trust the model |
| **local** | PubMedBERT embeddings → cosine → heuristic label (no LLM) |
| **dual** | Embeddings **and** LLM (baseline) |
| **simhint** | Same as **dual** **plus** the cosine similarity is prefixed to the prompt as a weak prior |

---
A toolkit for annotation concordance and entity relationship classification using embeddings and LLMs.

## Features
- **gateway-check**: Argo Gateway API connectivity check on startup with prod/dev endpoint fallback
- **local**: PubMedBERT embeddings → cosine similarity → heuristic labels
- **zero-shot**: Single LLM call with optional similarity hints
- **vote**: Multiple LLM calls with majority vote (with vote tracking)
- **rac** (Beta): Retrieval-Augmented Classification with example memory
- **fallback**: Safe local fallback on errors
- Template-driven prompt management with versioned external templates (v1.x, v2, v2.1, v3.0, v3.1)
- Ad-hoc mode for quick two-sentence comparisons (without requiring a CSV file)
- **list-templates**: List available prompt templates
- **batch processing**: Control both file chunking and LLM batch sizes
- **verbose**: Show detailed evidence and explanations
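
The `local` pipeline listed above (embeddings → cosine similarity → heuristic label) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Concordia's implementation: the toy vectors stand in for PubMedBERT embeddings, the function names are hypothetical, and the two-way `Exact`/`Different` rule is an assumed heuristic. Only the 0.98 default comes from the documented `engine.sim_threshold` setting.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


def heuristic_label(similarity, threshold=0.98):
    """Hypothetical rule: at or above the threshold is 'Exact', else 'Different'."""
    return "Exact" if similarity >= threshold else "Different"


# Toy 3-dimensional "embeddings"; real ones would come from the embedding model.
vec_a = [0.2, 0.7, 0.1]
vec_b = [0.2, 0.7, 0.1]
vec_c = [0.9, 0.0, 0.4]

print(heuristic_label(cosine_similarity(vec_a, vec_b)))  # identical vectors
print(heuristic_label(cosine_similarity(vec_a, vec_c)))  # dissimilar vectors
```

A threshold this close to 1.0 means only near-identical embeddings would count as a match, which fits a conservative local-only mode.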

## Installation

### Using Poetry (recommended)
```bash
git clone https://github.com/you/concordia.git
cd concordia
poetry install # installs deps + CLI
poetry shell # activate the virtualenv
poetry install # install dependencies & CLI entry-point
poetry shell # activate the virtual environment
```

---

## Quick-start recipes
### Alternative via pip
```bash
pip install concordia
```

| Goal | Command |
|------|---------|
| CSV with default LLM (o3-mini → dev) | `concord example_data/annotations_test.csv` |
| TSV with GPT-4o (prod) | `concord example_data/annotations_test.tsv --llm-model gpt4o` |
| Embed-only (no LLM) | `concord …csv --mode local` |
| Dual (baseline) | `concord …csv --mode dual` |
| Similarity-hint | `concord …csv --mode simhint` |
| Two ad-hoc strings | `concord --text-a "RecA" --text-b "DNA recombinase A"` |
| **Overwrite** existing output | add `--force` |
### Syncing Local Dependencies
If you've installed additional Python packages in your environment, you can compare them with Poetry-managed dependencies:
```bash
# export current environment packages
pip freeze > env-requirements.txt

---
# export Poetry-managed requirements
poetry export -f requirements.txt --without-hashes > poetry-requirements.txt

## Accepted input formats
# view differences
diff env-requirements.txt poetry-requirements.txt
```
Manually add any missing packages to `pyproject.toml` under `[tool.poetry.dependencies]` and run `poetry update`.

| Extension | Loader | Note |
|-----------|--------|------|
| `.csv` | `pandas.read_csv(sep=',')` | default |
| `.tsv` / `.tab` | `pandas.read_csv(sep='\t')`| or `--sep "\t"` |
| `.json` | `pandas.read_json()` | list-of-objects **or** column-orient |
## Quickstart
**CLI**
```bash
# Simplified command structure (single invocation)
concord data/pairs.csv --mode zero-shot --output results.csv
concord data/pairs.csv --mode local --output local.csv
concord data/pairs.csv --mode vote --output results_vote.csv
concord data/pairs.csv --mode rac --output results_rac.csv

If you do **not** pass `--col-a / --col-b`, the **first two textual columns that don’t end with `id`** are taken.
# Direct text comparison (no CSV required)
concord --text-a "Entity A" --text-b "Entity B" --mode zero-shot

---
# List available templates
concord --list-templates

## Minimal sample CSV
```csv
annotation_a,annotation_b
DNA repair protein RecA,Recombinase A
Hypothetical protein,Uncharacterized protein
# Control batch processing
concord data/pairs.csv --batch-size 32 --llm-batch-size 12
```
Any extra columns (e.g. `gene_id`) are preserved in the output.

---

## CLI options

| Flag | Description |
|------|-------------|
| **`FILE`** | Input table (`.csv`, `.tsv`, `.json`) |
| `--text-a / --text-b` | Compare two strings instead of a file |
| `--mode` | `llm` (default), `local`, `dual`, or `simhint` |
| `--llm-model` | Gateway LLM (`gpto3mini`, `gpt4o`, …) |
| `--retry` | Automatic blank-reply retries (default 5) |
| `--force` | Overwrite existing output instead of appending |
| `--output` | Destination path (file-mode only) |
| `--cfg` | Alternate YAML config |
| `--col-a / --col-b` | Explicit annotation columns |
| `--sep` | Custom delimiter for text files (e.g. `"\t"`) |
**Python**
```python
from concord.pipeline import run_pair, run_file
label, sim, evidence = run_pair("Entity A", "Entity B", "config.yaml")
print(label, sim, evidence)
```

---
## Evaluation
After generating predictions, evaluate them against the gold standard:
```bash
python eval/evaluate_gold_standard.py \
--gold-standard eval/synthetic_gold_standard_v1.csv \
--predictions example_data/results_v2_zero.csv \
--relationship-column relationship \
--plot
```

## Modes in detail
## Configuration (`config.yaml`)
```yaml
engine:
mode: zero-shot # local | zero-shot | vote | rac
sim_hint: false # Optional: prefix similarity hint to prompts

| Mode | Pipeline | Extra columns |
|------|----------|---------------|
| **llm** | Skip embeddings; every pair → LLM | `label`, `note` |
| **local** | PubMedBERT embeddings → cosine → heuristic | `similarity_Pubmedbert`, `label` |
| **dual** | Embeddings **and** LLM for every row | `similarity_Pubmedbert`, `label`, `note` |
| **simhint**| Same as **dual**, but cosine similarity is sent to the LLM prompt | `similarity_Pubmedbert`, `label`, `note` |
llm:
model: gpt4o # use without hyphens
stream: false
user: ${ARGO_USER}

> **Embedding model** [`NeuML/pubmedbert-base-embeddings`](https://huggingface.co/NeuML/pubmedbert-base-embeddings) (Apache-2.0)
local:
model_id: NeuML/pubmedbert-base-embeddings
device: cpu # cpu or cuda

---
# RAC mode settings (Beta)
rac:
example_limit: 3 # Number of examples to include in prompts
similarity_threshold: 0.6 # Minimum similarity to include example
auto_store: true # Auto-save classifications to vector store

## 7-label ontology
data_dir: "./data" # Where to store the vector database
```

| Label | Meaning |
|-------|---------|
| **Exact** | Same function; wording/punctuation only |
| **Synonym** | Semantically equivalent paraphrase |
| **Broader** | A ⊃ B (A more general) |
| **Narrower** | A ⊂ B (A more specific) |
| **Related** | Same pathway / complex / family but not parent–child |
| **Uninformative**| Placeholder or extremely generic |
| **Different** | No functional overlap |
### Configuration Fields
- `engine.mode`: select mode (`local`, `zero-shot`, `vote`, `rac`)
- `engine.sim_hint`: boolean flag to prefix cosine similarity hint to LLM prompts (default: false)
- `engine.sim_threshold`: similarity threshold for local mode (default: 0.98)
- `engine.vote_temps`: list of temperatures for vote mode LLM calls (default: `[0.8, 0.2, 0.0]`)
- `llm.model`: Gateway model name (e.g. `gpt4o`, `gpt35`, `gpto3mini`)
- `llm.stream`: `true` to use streaming `/streamchat/` endpoint
- `llm.user`: Argo Gateway username (via `ARGO_USER`)
- `llm.api_key`: Argo Gateway API key (via `ARGO_API_KEY`)
- `prompt_ver`: explicit prompt version to use (overrides bucket routing)
- `local.model_id`: embedding model ID (PubMedBERT or SPECTER2)
- `local.device`: device for embeddings (`cpu` or `cuda`)
- `local.batch_size`: batch size for file processing
- `rac.example_limit`: number of similar examples to retrieve (for RAC mode)
- `rac.similarity_threshold`: minimum similarity score for examples (0-1)
- `rac.auto_store`: whether to automatically store successful classifications
- `data_dir`: directory for storing vector database and other data
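
How `vote` mode could combine the temperatures in `engine.vote_temps` is sketched below. It is an illustration under assumptions rather than the project's code: `classify` is a hypothetical callable standing in for one LLM call at a given temperature, `majority_vote` is an invented name, and ties go to the label seen first, a rule the README does not specify.

```python
from collections import Counter


def majority_vote(classify, temps=(0.8, 0.2, 0.0)):
    """Call `classify` once per temperature and return (winning label, all votes)."""
    votes = [classify(t) for t in temps]
    # most_common(1) returns the first label to reach the highest count,
    # so a tie is resolved in favor of the earlier vote.
    winner, _count = Counter(votes).most_common(1)[0]
    return winner, votes


def fake_classify(temperature):
    """Stand-in for an LLM call: the high-temperature sample disagrees."""
    return "Related" if temperature > 0.5 else "Synonym"


label, votes = majority_vote(fake_classify)
print(label, votes)
```

Returning the individual votes alongside the winner keeps them available for the vote tracking mentioned in the feature list.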

### Alias system — why you rarely see **Unknown**
## RAC Mode (Beta)

Older checkpoints reply with tokens like *Identical* or *Partial*.
`llm_label()` holds a tiny alias map so such answers are remapped automatically; genuine blanks are the only source of `Unknown`.
Retrieval-Augmented Classification (RAC) mode is currently in beta. It enhances classification by retrieving similar, previously classified examples and including them in the prompt as context.
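
One way to picture the retrieval step, using the documented `rac.example_limit` and `rac.similarity_threshold` settings, is the sketch below. The data layout, function names, and prompt wording are hypothetical, the similarity scores are made-up illustrative values assumed to be precomputed, and the real vector store is replaced by a plain list.

```python
def select_examples(scored_examples, limit=3, threshold=0.6):
    """Keep the `limit` most similar stored examples at or above `threshold`."""
    eligible = [ex for ex in scored_examples if ex["similarity"] >= threshold]
    eligible.sort(key=lambda ex: ex["similarity"], reverse=True)
    return eligible[:limit]


def build_prompt(text_a, text_b, examples):
    """Prepend the retrieved examples to the pair that needs a label."""
    lines = ["Previously classified examples:"]
    for ex in examples:
        lines.append(f"- A: {ex['a']} | B: {ex['b']} -> {ex['label']}")
    lines.append(f"Classify: A: {text_a} | B: {text_b}")
    return "\n".join(lines)


# Toy memory; the similarity of each stored pair to the query is assumed given.
memory = [
    {"a": "DNA repair protein RecA", "b": "Recombinase A", "label": "Synonym", "similarity": 0.91},
    {"a": "Hypothetical protein", "b": "Uncharacterized protein", "label": "Uninformative", "similarity": 0.42},
    {"a": "RecA", "b": "DNA recombinase A", "label": "Synonym", "similarity": 0.75},
    {"a": "Entity A", "b": "Entity B", "label": "Different", "similarity": 0.66},
    {"a": "Entity C", "b": "Entity D", "label": "Related", "similarity": 0.60},
]

chosen = select_examples(memory)
print(build_prompt("RecA protein", "Recombinase RecA", chosen))
```

Because every successful classification is stored when `auto_store` is enabled, anything in memory that clears the threshold can reach the prompt, including earlier mistakes.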

---
### Current Limitations

## Prompting — tweak in one place
RAC mode has one known limitation, and several improvements are planned:

All prompt text lives in **`concord/llm/prompts.py`**.
Edit the few-shot examples or definitions to experiment; no other code must change.
1. **All Classifications Get Stored**: Currently, all successful LLM classifications are stored in the vector database if `auto_store` is enabled, regardless of quality or accuracy.

---
2. **Planned Improvements**:
- Human validation before storing examples
- Confidence thresholds from the LLM responses
- Selective storage based on specific characteristics or patterns
- Improved embedding methods for better similarity matching

## Config snapshot (`concord/config.yaml`)
```yaml
engine:
mode: llm
### Using RAC Mode

llm:
model: gpto3mini # auto-routes to apps-dev
stream: false
user: ${ARGO_USER}
```bash
# First time setup - create data directory
mkdir -p data

local:
model_id: NeuML/pubmedbert-base-embeddings
# Run with RAC mode (will build up examples over time)
concord data/pairs.csv --mode rac --output results_rac.csv
```
*(Leave `env` unset — o-series → apps-dev, GPT-4* → apps-prod.)*

---

## Progress, recovery & overwrite

* Output is **appended row-by-row** – abort with Ctrl-C and rerun; finished pairs are skipped.
Add `--force` to **replace** an existing file instead.
* Live progress bar example:
```
Processing 73%|██████████████▋ | 730/1000 [00:14<00:05, 49.2it/s]
```
o3-mini ≈ 0.10 s per pair; embeddings ≈ 1 ms per string on Apple M-series.
## Documentation
```bash
mkdocs serve
```
Published site: https://<org>.github.io/concordia/

---
## Environment Variables
- `ARGO_USER`: ANL login for Argo Gateway (required)
- `ARGO_API_KEY`: API key for private Argo Gateway (optional)
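
A script that wraps the CLI could check these variables up front, as in the sketch below. This is an assumption-laden illustration: `load_gateway_credentials` is a hypothetical helper, not part of Concordia, and the error message is invented. Only the variable names and their required/optional status come from the list above.

```python
import os


def load_gateway_credentials(env=None):
    """Read Argo Gateway settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    user = env.get("ARGO_USER")
    if not user:
        # ARGO_USER is documented as required, so fail early with a clear message.
        raise RuntimeError("ARGO_USER must be set to use the Argo Gateway")
    # ARGO_API_KEY is optional; None means no key was provided.
    return {"user": user, "api_key": env.get("ARGO_API_KEY")}


# Example with an explicit mapping instead of the real environment.
print(load_gateway_credentials({"ARGO_USER": "alice"}))
```

Passing the environment in as a mapping keeps the check easy to exercise in tests without modifying the real process environment.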

## FAQ
## Contributing
See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.

| Q | A |
|---|---|
| **Why keep similarity in LLM mode?** | Free sanity check (~ 2 ms) |
| **Still see “Unknown”?** | Reply didn’t start with any known token; tweak alias or prompt |
| **Where are model weights?** | Hugging Face cache (`~/.cache/huggingface`) |
| **Crash recovery?** | Append-only output; rerun resumes automatically |
| **Overwrite output?** | Use `--force` to replace an existing file |
## Testing
Run all tests via `pytest`:
```bash
pytest
```

---
## Development
We enforce formatting and linting with pre-commit hooks:
```bash
pip install pre-commit
pre-commit install
pre-commit run --all-files
```

*Happy concording! – Stars ⭐, issues 🐞, and PRs 💡 welcome.*
## License
Apache-2.0