Merged
69 changes: 41 additions & 28 deletions README.md
@@ -2,7 +2,7 @@

<img src="assets/logo.svg" alt="Janitr logo" width="200">

A browser extension that filters crypto scams, AI-generated replies, and promotional spam from your social media feeds. Inference runs locally on-device; advanced mode can optionally fetch model runs from Hugging Face.
A browser extension that filters crypto scams, AI-generated replies, and promotional spam from your social media feeds. Inference runs locally on-device.

> **⚠️ Work in Progress**: This is an MVP. Currently it only works on X (Twitter) for demoing scam detection. Try it out, and if you have ideas for new content categories or improvements, tag or DM [@janitr_ai](https://x.com/janitr_ai) on X.

@@ -24,11 +24,14 @@ Models trained on top of this dataset are a separate concern. Different models w

A core principle is that models must run **locally on your device** — no cloud, no API calls, no data leaving your browser. This means optimizing for small model sizes and fast inference so detection works on everything from phones to older laptops. Privacy isn't optional.

The current implementation uses **fastText** (123KB quantized model running via WebAssembly), but the underlying ML approach may evolve as we expand to more content categories.
The current implementation supports **two local backends**:

- **Transformer (default):** ONNX Runtime Web, stronger scam detection, larger model
- **fastText:** ultra-small WASM path, useful as a lightweight fallback

**Current dataset:**

~2,900 multi-label samples, all sourced from X via browser automation and human-verified. See [LABELS.md](docs/LABELS.md) for the full label guide.
~4,200+ multi-label samples, all sourced from X via browser automation and human-verified. See [LABELS.md](docs/LABELS.md) for the full label guide.

This entire project — data collection, labeling, model training, and the extension itself — was built using [OpenClaw](https://github.com/openclaw/openclaw), an open framework for personal AI assistants.

@@ -48,29 +51,30 @@ The approach: start narrow (crypto scams have clear ground truth), prove the pip

## How It Works

- **fastText model** runs in-browser via WebAssembly
- **Transformer model** runs in-browser via ONNX Runtime Web (default backend)
- **fastText model** runs in-browser via WebAssembly (optional fallback backend)
- **Content scripts** scan posts and DMs as you scroll
- **3-class detection**: `scam`, `topic_crypto`, `clean` (backed by a [100+ label taxonomy](docs/LABELS.md))
- **Thresholds** are tunable per-class to control false positive rate
- **No network calls during inference** — classification happens on your CPU
- **No network calls during inference** — classification happens on your CPU (network is only used when you explicitly download remote runs in advanced mode)
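
The per-class thresholding described in the bullets above can be sketched as follows. This is a minimal illustration in Python, not the extension's TypeScript code: the function name `apply_thresholds` and the shape of the score dictionary are assumptions, and the threshold values are used purely as an example (they are the fastText values listed for `extension/fasttext/thresholds.json` in this diff).

```python
# Hypothetical sketch: names and data shapes are assumptions, not the extension's API.
# Threshold values are illustrative, copied from the fastText thresholds shown in this diff.
THRESHOLDS = {"scam": 0.93, "topic_crypto": 0.91, "clean": 0.1}


def apply_thresholds(
    scores: dict[str, float],
    thresholds: dict[str, float] = THRESHOLDS,
) -> list[str]:
    """Return every label whose score meets or exceeds its own per-class threshold."""
    return [
        label
        for label, threshold in thresholds.items()
        if scores.get(label, 0.0) >= threshold
    ]


# A post with a high scam score and low scores for the other classes.
labels = apply_thresholds({"scam": 0.95, "topic_crypto": 0.40, "clean": 0.02})
print(labels)
```

Raising a class's threshold makes that label fire less often, which lowers its false positive rate at the cost of recall; this is the trade-off the tunable thresholds expose.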

## Model Performance

| Metric | Value |
| ---------------------- | ------------------ |
| Model size | 123 KB (quantized) |
| Scam precision | 95% |
| Scam recall | 64% |
| topic_crypto precision | 79% |
| topic_crypto recall | 37% |
| Target FPR | ≤ 2% |
Frozen-split benchmark (expanded holdout, see `docs/reports/2026-02-14-frozen-split-fasttext-vs-transformer-benchmark.md`):

| Metric | fastText (ftz) | Transformer (int8 ONNX) |
| ----------- | -------------- | ----------------------- |
| Scam P | 0.9128 | 0.9375 |
| Scam R | 0.5551 | 0.6122 |
| Scam F1 | 0.6904 | 0.7407 |
| Scam FPR | 0.0158 | 0.0121 |
| Macro F1 | 0.7838 | 0.8013 |
| Exact Match | 0.7624 | 0.8241 |
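
As a quick sanity check on the table above, the Scam F1 row can be recomputed from the Scam P and Scam R rows using the standard harmonic-mean definition, F1 = 2PR / (P + R). The helper name `f1_score` is an assumption for illustration and is not taken from the project's evaluation scripts.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision/recall pairs copied from the benchmark table.
for name, p, r in [("fastText", 0.9128, 0.5551), ("Transformer", 0.9375, 0.6122)]:
    print(f"{name}: Scam F1 = {f1_score(p, r):.4f}")
```

Both results round to the reported values (0.6904 and 0.7407), so the three scam rows are internally consistent.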

Current thresholds (`extension/fasttext/thresholds.json`), tuned for ≤ 2% FPR:
Model artifact sizes from the same benchmark:

- `scam`: 0.93
- `topic_crypto`: 0.91
- `clean`: 0.1
- fastText `.ftz`: ~123 KB
- transformer int8 ONNX: ~3.4 MB

## Hugging Face Experiment Runs

@@ -83,6 +87,7 @@ Janitr keeps a rolling artifact repo on Hugging Face for large model files and d
### Extension approach (advanced mode)

- The extension ships with bundled local models and works offline by default.
- Default backend is `transformer`; you can switch backend from the popup or options page.
- In advanced mode (`Options` page), you can:

1. list remote runs from Hugging Face
@@ -97,23 +102,29 @@ Janitr keeps a rolling artifact repo on Hugging Face for large model files and d
### Training pipeline

```bash
python -m venv .venv
source .venv/bin/activate
pip install fasttext-wheel

make prepare # prepare train/valid splits
make train # train fastText model
make eval # evaluate on test set
# FastText path
uv run --project scripts python scripts/prepare_data.py
uv run --project scripts python scripts/train_fasttext.py
uv run --project scripts python scripts/evaluate.py

# Transformer path (teacher -> student -> eval)
uv run --project scripts python scripts/train_transformer_teacher.py --seeds 13,42,7
uv run --project scripts python scripts/calibrate_teacher.py
uv run --project scripts python scripts/cache_teacher_logits.py
uv run --project scripts python scripts/train_transformer_student_distill.py
uv run --project scripts python scripts/export_transformer_student_onnx.py
uv run --project scripts python scripts/quantize_transformer_student.py
uv run --project scripts python scripts/evaluate_transformer.py
```

### Extension development

The extension lives in `extension/`. Key files:

- `manifest.json` — Chrome extension manifest (MV3)
- `content-script.js` — injected into pages, scans DOM
- `background.js` — service worker
- `fasttext/` — WASM runtime + quantized model + thresholds
- `src/` — TypeScript source for background/content/offscreen/options/popup
- `transformer/` — bundled transformer runtime + model loader
- `fasttext/` — WASM runtime + quantized fallback model + thresholds

### Quantization

@@ -145,7 +156,9 @@ See `docs/DATA_LABELING.md` for the labeling workflow.

## Local-First

No network calls. No telemetry. No cloud. Your browsing stays private.
No telemetry. No cloud inference. Your browsing stays private.

By default, inference is fully local. Optional network access is only used when you explicitly download remote experiment artifacts in advanced mode.

## Similar Projects

22 changes: 20 additions & 2 deletions docs/logs/2026-02-15.md
@@ -4,8 +4,8 @@ author: bob <unknown@example.com>
date: 2026-02-15
tz: Europe/Berlin
created: 2026-02-15T00:08:59+01:00
last_section: 2026-02-15T19:14:36+01:00
updated: 2026-02-15T19:14:36+01:00
last_section: 2026-02-15T21:03:09+01:00
updated: 2026-02-15T21:03:09+01:00
---

## 00:08
@@ -591,3 +591,21 @@ Switched formatting/tooling to uv everywhere: package scripts and lint-staged no
## 19:14

Pinned CI uv toolchain version to 0.7.3 in both GitHub Actions jobs via astral-sh/setup-uv to keep formatter/lint execution deterministic across runners.

## 19:35

Pulled latest origin/main (fast-forwarded local main to 5082240) and switched back to feat/tiny-transformer-impl to keep branch context unchanged.

Updated root README to reflect current state: transformer is default backend with fastText fallback, dataset size updated to ~4.2k+, model performance section replaced with frozen-split fastText vs transformer benchmark snapshot, advanced mode/backend switch behavior clarified, and training commands standardized to uv-based fastText + transformer pipelines.

Refined README consistency: extension development section now points to TypeScript source/runtime directories, and Local-First section now explicitly distinguishes local inference from optional user-triggered model downloads in advanced mode.

Removed the Hugging Face advanced-mode clause from the opening README sentence; intro now states only the core local on-device inference promise.

## 20:56

Opened a fresh PR branch from updated main to avoid previously merged/overlapping history. Created and pushed docs/readme-transformer-refresh-2026-02-15 with README transformer-first refresh changes and validation via npm run format:check.

## 21:03

Opened new README refresh PR using GitHub App credentials from ~/.config/github-app helper (in-memory token only). PR URL: https://github.com/janitrai/janitr/pull/23.