- Targets: existing classifier labels scam, clean, topic_crypto (multi-label, ~4k samples)
- Teacher: cardiffnlp/twitter-roberta-large-2022-154m (primary), vinai/bertweet-large (alt)
- Ensemble strategy: 3 seeds at 4k samples, single teacher at 100k-1M scale
- Dual-head: softmax head (scam/clean) + sigmoid head (topics)
- Distillation: logit-based, T=2-4, intermediate-layer matching, DAPT
- Student: 4-layer BERT, hidden size 192, 4 attention heads, int8 ONNX ≤5 MB
- Daily log entry for 2026-02-13
…ples

Deep scrape progress:
- replies.jsonl: 14→62 ground truth samples (+48)
- 61 levelsio callouts collected (Jan 15 - Feb 13)
- 27 new entries from deep fetch (14 deleted/missing parent)
- Bot handles: 525, Taggers: 18
- Still more to collect going back to Dec 2022
- 113 levelsio callouts collected (Dec 13 2025 - Feb 13 2026)
- 103 ground truth samples in replies.jsonl (was 61)
- 527 bot handles
- ~40% deletion rate on flagged tweets (bots delete after callout)
…ning had deleted parents)
Deploying janitr with Cloudflare Pages

| Latest commit: | efd030b |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://3756ea41.janitr.pages.dev |
| Branch Preview URL: | https://feat-tiny-transformer-impl.janitr.pages.dev |
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: beef260475
```typescript
const ENGINE_FASTTEXT: Engine = "fasttext";
const ENGINE_TRANSFORMER: Engine = "transformer";
const ENGINE_AUTO: Engine = "auto";
const DEFAULT_ENGINE: Engine = ENGINE_TRANSFORMER;
```
Default to fastText until bundled transformer assets exist
DEFAULT_ENGINE now points to transformer, but the transformer loader expects bundled files (student.int8.onnx, student_config.json, tokenizer/vocab.txt, thresholds.json) that are not present in this commit; I verified with rg --files and only fastText assets are tracked. On a fresh install, every inference request will attempt transformer first, fail, and then fall back, which adds repeated failing work and noisy fallback behavior in normal usage. Keep the default on fastText unless those bundled transformer artifacts are shipped.
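One way to implement the suggestion is to keep fastText as the default and resolve the engine choice once, against whether the bundled transformer assets are actually present, instead of failing per-request. This is a hedged sketch: the `Engine` type and `ENGINE_*` constants come from the diff above, while `resolveEngine` and the `transformerAssetsPresent` flag are illustrative names we introduce here.

```typescript
type Engine = "fasttext" | "transformer" | "auto";

const ENGINE_FASTTEXT: Engine = "fasttext";
const ENGINE_TRANSFORMER: Engine = "transformer";
// Keep the safe default until student.int8.onnx etc. are shipped.
const DEFAULT_ENGINE: Engine = ENGINE_FASTTEXT;

// Resolve the requested engine once against asset availability, so
// inference requests never repeatedly attempt-and-fail the transformer.
function resolveEngine(requested: Engine, transformerAssetsPresent: boolean): Engine {
  if (requested === "transformer" && !transformerAssetsPresent) {
    return ENGINE_FASTTEXT; // explicit request, but nothing to load
  }
  if (requested === "auto") {
    return transformerAssetsPresent ? ENGINE_TRANSFORMER : ENGINE_FASTTEXT;
  }
  return requested;
}
```

A single startup probe for the bundled files can feed `transformerAssetsPresent`, keeping the fallback decision out of the hot path.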
```typescript
ortWasmPathPrefix: runtimeWasmBaseUrlForDir(
  DEFAULT_HF_EXPERIMENTS_REPO,
  RUNTIME_DIR_CANDIDATES[0],
),
```
Use local ORT wasm path for bundled transformer mode
Bundled mode currently sets ortWasmPathPrefix to a Hugging Face URL, so ONNX Runtime must fetch its wasm binary from the network even when the source is builtin. That means transformer mode cannot be truly local/offline and will fail (then fall back) in environments where HF is blocked or unavailable. builtin should resolve runtime wasm from extension-packaged assets (e.g., via runtime URL) rather than a remote host.
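A sketch of the suggested fix, assuming a WebExtension context: resolve the wasm prefix from packaged assets via `chrome.runtime.getURL` when the source is builtin, and only use the remote prefix otherwise. The `"ort/"` packaged directory and the `ortWasmPrefix` helper are assumptions for illustration; `chrome.runtime.getURL` is the standard extension API for packaged-asset URLs.

```typescript
declare const chrome: any; // extension global; absent outside extensions

function ortWasmPrefix(source: "builtin" | "remote", remotePrefix: string): string {
  if (source === "builtin" && typeof chrome !== "undefined" && chrome.runtime?.getURL) {
    // Serve ONNX Runtime wasm from extension-packaged files, so
    // builtin mode works fully offline (directory name assumed).
    return chrome.runtime.getURL("ort/");
  }
  return remotePrefix;
}
```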
```typescript
}
return new Promise<StoragePayload>((resolve, reject) => {
  area.get(key, (value: StoragePayload) => {
    const err = chrome?.runtime?.lastError || browser?.runtime?.lastError;
```
Guard browser namespace access in storage callbacks
This callback error check can throw ReferenceError: browser is not defined in Chrome callback-mode paths when chrome.runtime.lastError is empty, because the right-hand browser?.runtime?.lastError is still evaluated. In that case storageGet/storageSet fail on successful operations, breaking backend/source persistence on fallback API paths. Add a typeof browser !== 'undefined' guard before touching browser.
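The guard the review asks for can be sketched as a small helper: `typeof x !== "undefined"` is safe on an undeclared global, whereas the original `browser?.runtime?.lastError` still dereferences the `browser` binding and throws in Chrome. The `getLastError` name is illustrative, not from the codebase.

```typescript
// Extension globals; one or both may be absent depending on browser.
declare const chrome: any;
declare const browser: any;

function getLastError(): { message?: string } | undefined {
  // typeof guards first, so touching an undeclared namespace can
  // never raise ReferenceError in either browser's callback path.
  if (typeof chrome !== "undefined" && chrome.runtime?.lastError) {
    return chrome.runtime.lastError;
  }
  if (typeof browser !== "undefined" && browser.runtime?.lastError) {
    return browser.runtime.lastError;
  }
  return undefined;
}
```

With this in place, `storageGet`/`storageSet` callbacks can check `getLastError()` and resolve successfully when it returns `undefined`.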