
Feat/tiny transformer impl #22

Merged
osolmaz merged 48 commits into main from feat/tiny-transformer-impl on Feb 15, 2026

Conversation

@osolmaz (Member) commented Feb 15, 2026

No description provided.

- Targets existing classifier: scam, clean, topic_crypto (multi-label, ~4k samples)
- Teacher: cardiffnlp/twitter-roberta-large-2022-154m (primary), vinai/bertweet-large (alt)
- Ensemble strategy: 3 seeds at 4k, single teacher at 100k-1M scale
- Dual-head: softmax (scam/clean) + sigmoid (topics)
- Distillation: logit-based, T=2-4, intermediate-layer matching, DAPT
- Student: 4-layer BERT, hidden 192, 4 heads, int8 ONNX ≤5MB
- Daily log entry for 2026-02-13
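The temperature-scaled logit distillation described in the commit summary (T=2-4, teacher logits matched by the 4-layer student) can be sketched numerically. This is an illustrative snippet only; the actual training code is not part of this diff, and every name in it is an assumption:

```typescript
// Minimal numeric sketch of temperature-scaled logit distillation.
// Softmax with temperature: dividing logits by T softens the distribution.
function softmax(logits: number[], temperature: number): number[] {
  const scaled = logits.map((l) => l / temperature);
  const max = Math.max(...scaled);
  const exps = scaled.map((s) => Math.exp(s - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// KL(teacher || student) on the softened distributions, scaled by T^2 as is
// conventional so gradient magnitudes stay comparable across temperatures.
function distillationLoss(
  teacherLogits: number[],
  studentLogits: number[],
  temperature: number,
): number {
  const p = softmax(teacherLogits, temperature);
  const q = softmax(studentLogits, temperature);
  let kl = 0;
  for (let i = 0; i < p.length; i++) {
    kl += p[i] * Math.log(p[i] / q[i]);
  }
  return kl * temperature * temperature;
}
```

With the dual-head setup above, a loss like this would apply to the softmax head (scam/clean); the sigmoid topic head would instead use a per-label binary distillation term.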

Deep scrape progress:
- replies.jsonl: 14→62 ground truth samples (+48)
- 61 levelsio callouts collected (Jan 15 - Feb 13)
- 27 new entries from deep fetch (14 deleted/missing parent)
- Bot handles: 525, Taggers: 18
- Still more to collect going back to Dec 2022
- 113 levelsio callouts collected (Dec 13 2025 - Feb 13 2026)
- 103 ground truth samples in replies.jsonl (was 61)
- 527 bot handles
- ~40% deletion rate on flagged tweets (bots delete after callout)
@cloudflare-workers-and-pages bot commented Feb 15, 2026

Deploying janitr with Cloudflare Pages

Latest commit: efd030b
Status: ✅  Deploy successful!
Preview URL: https://3756ea41.janitr.pages.dev
Branch Preview URL: https://feat-tiny-transformer-impl.janitr.pages.dev



@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: beef260475


const ENGINE_FASTTEXT: Engine = "fasttext";
const ENGINE_TRANSFORMER: Engine = "transformer";
const ENGINE_AUTO: Engine = "auto";
const DEFAULT_ENGINE: Engine = ENGINE_TRANSFORMER;

P1: Default to fastText until bundled transformer assets exist

DEFAULT_ENGINE now points to transformer, but the transformer loader expects bundled files (student.int8.onnx, student_config.json, tokenizer/vocab.txt, thresholds.json) that are not present in this commit; I verified with rg --files and only fastText assets are tracked. On a fresh install, every inference request will attempt transformer first, fail, and then fall back, which adds repeated failing work and noisy fallback behavior in normal usage. Keep the default on fastText unless those bundled transformer artifacts are shipped.

Useful? React with 👍 / 👎.
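The change this review asks for could look roughly like the following. `resolveEngine` and `hasAsset` are hypothetical names for illustration, not code from this PR; the asset list mirrors the files named in the comment:

```typescript
type Engine = "fasttext" | "transformer" | "auto";

// Sketch: only prefer the transformer engine when its bundled assets are
// actually present, otherwise resolve to fastText up front instead of
// failing on every inference request and falling back.
function resolveEngine(
  requested: Engine,
  hasAsset: (path: string) => boolean,
): Engine {
  const transformerAssets = [
    "student.int8.onnx",
    "student_config.json",
    "tokenizer/vocab.txt",
    "thresholds.json",
  ];
  const transformerReady = transformerAssets.every(hasAsset);
  if (requested === "fasttext") return "fasttext";
  // "transformer" and "auto" both degrade cleanly when assets are missing.
  return transformerReady ? "transformer" : "fasttext";
}
```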

Comment on lines +616 to +619
ortWasmPathPrefix: runtimeWasmBaseUrlForDir(
DEFAULT_HF_EXPERIMENTS_REPO,
RUNTIME_DIR_CANDIDATES[0],
),

P2: Use local ORT wasm path for bundled transformer mode

Bundled mode currently sets ortWasmPathPrefix to a Hugging Face URL, so ONNX Runtime must fetch its wasm binary from the network even when the source is builtin. That means transformer mode cannot be truly local/offline and will fail (then fall back) in environments where HF is blocked or unavailable. builtin should resolve runtime wasm from extension-packaged assets (e.g., via runtime URL) rather than a remote host.

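A local-first resolution might look like this sketch. `chrome.runtime.getURL` is the standard extension API for packaged assets; the function name `ortWasmPrefixFor` and the `"runtime/"` directory are assumptions for illustration:

```typescript
declare const chrome:
  | { runtime?: { getURL(path: string): string } }
  | undefined;

// Sketch: in builtin mode, resolve the ONNX Runtime wasm binaries from
// extension-packaged assets; only use the remote prefix otherwise, so
// builtin mode stays fully offline-capable.
function ortWasmPrefixFor(
  source: "builtin" | "remote",
  remotePrefix: string,
): string {
  if (
    source === "builtin" &&
    typeof chrome !== "undefined" &&
    chrome?.runtime?.getURL
  ) {
    return chrome.runtime.getURL("runtime/");
  }
  return remotePrefix;
}
```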

}
return new Promise<StoragePayload>((resolve, reject) => {
area.get(key, (value: StoragePayload) => {
const err = chrome?.runtime?.lastError || browser?.runtime?.lastError;

P2: Guard browser namespace access in storage callbacks

This callback error check can throw ReferenceError: browser is not defined in Chrome callback-mode paths when chrome.runtime.lastError is empty, because the right-hand browser?.runtime?.lastError is still evaluated. In that case storageGet/storageSet fail on successful operations, breaking backend/source persistence on fallback API paths. Add a typeof browser !== 'undefined' guard before touching browser.

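The guard could be factored as below: `browser` is only evaluated when it exists, so Chrome's callback paths cannot throw a ReferenceError. The helper name and object shapes are illustrative assumptions:

```typescript
declare const chrome:
  | { runtime?: { lastError?: { message?: string } } }
  | undefined;
declare const browser:
  | { runtime?: { lastError?: { message?: string } } }
  | undefined;

// Sketch: read lastError from whichever extension namespace is defined.
// `typeof x !== "undefined"` is the one check that is safe on an
// undeclared global; plain `browser?.runtime` would still throw.
function callbackError(): { message?: string } | undefined {
  const chromeErr =
    typeof chrome !== "undefined" ? chrome?.runtime?.lastError : undefined;
  if (chromeErr) return chromeErr;
  return typeof browser !== "undefined"
    ? browser?.runtime?.lastError
    : undefined;
}
```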

@osolmaz osolmaz merged commit 5082240 into main Feb 15, 2026
3 checks passed
@osolmaz osolmaz deleted the feat/tiny-transformer-impl branch February 15, 2026 18:35
