Skip to content

Latest commit

 

History

History
242 lines (197 loc) · 23.5 KB

File metadata and controls

242 lines (197 loc) · 23.5 KB

Managed under atelier. Before starting, read C:\Users\kenrin\Project\.atelier\CHARTER.md (from WSL: /mnt/c/Users/kenrin/Project/.atelier/CHARTER.md), the current week log in .atelier/logs/, and this project's brief + log at .atelier/projects/coding/synonymicon/. Clock out per the charter when done.

Synonymicon

What this is

A multi-source synonym discovery tool with frequency-band filtering. The combined WordNet + fastText pipeline returns candidates; the user filters by frequency band. Originally framed as obscurity-only ("excavate rare lexical outliers"); the framing is broadening toward "better thesaurus including obscure stuff" — the architecture has value across the full Zipf range, not only the rare end.

Not a vocabulary learning tool. Not a definitions-first tool — definitions are supporting context for picking the right candidate.

Stack

  • Python + Flask, synchronous, single-process
  • No database — all state computed on the fly, nothing persisted
  • wordfreq for frequency: zipf_frequency(word, 'en') at query time. No precomputed frequency index. Zipf scale: 0 = vanishingly rare, 7 = extremely common
  • NLTK WordNet — primary synonym source (synset lemmas)
  • fastText (fasttext-wiki-news-subwords-300 via gensim) — secondary/fallback synonym source. Note: the gensim distribution is KeyedVectors (pretrained vectors only), not the full FastText model — OOV inputs raise KeyError and must be caught. WordNet still covers OOV cases. The vocabulary is lowercased, so most_similar is queried with the lowercased word (a capitalized input would otherwise KeyError and silently drop the whole embedding source). Disable the ~1 GB model load with SYNONYMICON_FASTTEXT=0 (dev/test mode — WordNet-only results, instant startup; the test suite uses this).
  • Definition fallback chain: Wiktionary API → Webster's 1913 (local JSON at data/websters1913.json) → WordNet gloss → "[undefined]" (literal string, rendered in italics). Wiktionary REST API requires a descriptive User-Agent header per Wikimedia policy — requests without one return 403 or get rate-limited. Use requests for fetches and beautifulsoup4 (bs4) for HTML stripping.
  • Corpus frequency tablesdata/subtlex_us.xlsx (SUBTLEX-US, Brysbaert & New 2009), data/bnc_all.al (BNC, Kilgarriff, ~939k lemmas), data/google_1grams.txt (Norvig, 333k words), data/wikipedia_freq.txt (IlyaSemenov 2023, 2.77M words), data/kaggle_freq.csv (rtatman, 333k words), data/hermitdave_freq.txt (OpenSubtitles 2018, 1.66M words), data/scriptsmith_freq.txt (Project Gutenberg, 3.08M words), data/leipzig_news_2025.txt (Leipzig News 2025, 634k types), data/leipzig_web_com_2018.txt (Leipzig Web COM 2018, 480k types), and data/leipzig_web_uk_2018.txt (Leipzig Web UK 2018, 445k types) loaded at startup alongside wordfreq. get_zipf(word, corpus) dispatches to the selected source. Counts are aggregated (summed) per lowercased word before the Zipf is computed, so BNC's per-POS-tag rows and Leipzig's capitalization variants are combined rather than silently dropped (a dict[word, zipf] is built once per corpus at load via a shared count-aggregating loader). SUBTLEX-US Zipf values are pre-computed (read directly, no aggregation); BNC is log10(count × 1B / 85_714_226); Google/Kaggle use log10(count) - 2.634; Wikipedia uses log10(count) - 0.5; OpenSubtitles uses log10(count) + 0.37; Gutenberg uses log10(count) - 0.5; Leipzig News uses log10(count) + 1.6763; Leipzig Web COM uses log10(count) + 1.7779; Leipzig Web UK uses log10(count) + 1.6987. Every offset is calibrated so the corpus's "the" lands at ≈ 7.73 (the wordfreq anchor) for cross-corpus tier consistency; the only exceptions are SUBTLEX (uses its own pre-computed Zipf, "the" ≈ 7.47) and BNC ("the" ≈ 7.86 via the principled per-billion formula). When adding or recalibrating a count corpus, verify get_zipf('the', <corpus>) ≈ 7.73. For the default wordfreq source, an unknown word (wordfreq Zipf 0.0) is returned as None so OOV candidates are dropped uniformly with the dict-backed corpora (rather than polluting the absurd band).
  • Frontend: single-page HTML/CSS/JS served from static/. Three themes cycled via footer button, persisted in localStorage. Right-heavier layout (38/62); integrated search/control surface on the left; single rounded word surface containing column cells on the right.

Layout

Backend is split into focused modules (was a single app.py until the Session-21 refactor):

  • app.py — Flask app: routes, request validation, response assembly
  • config.py — constants: TIERS, POS_MAP, VALID_POS/VALID_RANKS/VALID_CORPORA, scoring/limit constants, FASTTEXT_ENABLED
  • candidates.py — WordNet + fastText candidate generation, blended scoring/sorting, morphology/artifact filtering, get_band_label, get_senses
  • corpora.py — corpus loaders (count aggregation) + get_zipf dispatch
  • definitions.py — Wiktionary → Webster's → WordNet gloss → [undefined] fallback chain and caches
  • scripts/setup_nltk.py — one-time NLTK wordnet/omw-1.4 download
  • data/websters1913.json — Webster's 1913, loaded at startup
  • data/subtlex_us.xlsx — SUBTLEX-US frequency table (Brysbaert & New 2009), loaded at startup
  • data/google_1grams.txt — Norvig Google 1-grams (333k words), loaded at startup
  • data/wikipedia_freq.txt — Wikipedia frequency list (IlyaSemenov 2023, 2.77M words), loaded at startup
  • data/kaggle_freq.csv — Kaggle rtatman unigram frequency (333k words), loaded at startup
  • data/hermitdave_freq.txt — OpenSubtitles frequency list (hermitdave 2018, 1.66M words), loaded at startup
  • data/scriptsmith_freq.txt — Project Gutenberg frequency list (scriptsmith topwords, 3.08M words), loaded at startup
  • data/leipzig_news_2025.txt — Leipzig News 2025 (634k word types), loaded at startup
  • data/leipzig_web_com_2018.txt — Leipzig Web COM 2018 (480k word types), loaded at startup
  • data/leipzig_web_uk_2018.txt — Leipzig Web UK 2018 (445k word types), loaded at startup
  • static/ — frontend files (index.html; CSS and JS inline in the same file)
  • .venv/ — Python venv (gitignored)

Mobile layout (≤768px)

Single breakpoint at max-width: 768px:

  • .layout switches to grid-template-columns: 1fr; grid-template-rows: auto 1fr; gap: 0
  • .left-panel gets flex-direction: column; min-width: 0
  • Search tray: full-width, border-radius: 0, box-shadow removed, border-bottom: 1px solid var(--border) only
  • Watermark & hidden
  • .control-button uses flex: 0 0 auto (auto-width, not fixed 8rem) so three controls fit on narrow viewports
  • Nav arrows repositioned inside surface edges; .nav-arrow.left { left: 0.25rem }, .nav-arrow.right { right: 0.25rem }
  • Footer: position: sticky; bottom: 0; z-index: 20; about button hidden via footer .footer-link#about-button { display: none }
  • 100dvh used instead of 100vh on html, body, .shell to handle mobile browser chrome
  • overflow: hidden on .layout, .right-panel, .word-surface, .columns-grid to prevent content overflow on small viewports

Themes

Three themes cycled via a footer button, persisted in localStorage. Cycle order: lumen → penumbra → umbra → lumen.

  • lumen — light, warm-paper background. Default on first load.
  • penumbra — dark, #1a1a1e background.
  • umbra — OLED black, true #000 background.

All color properties use CSS variables overridden by body[data-theme="..."]. Hardcoded rgba values in page dots, entry headwords, and footer links were replaced with variables in Session 7b — keep this discipline; new color uses must go through variables, not raw rgba.

Frequency tiers

TIERS = {
    'all':      (float('-inf'), float('inf')),  # default; everything
    'common':   (4.0, float('inf')),
    'uncommon': (3.0, 4.0),
    'rare':     (2.0, 3.0),
    'exotic':   (1.0, 2.0),
    'absurd':   (float('-inf'), 1.0),
}
COMMON_FLOOR = 4.0  # band-label threshold; no longer used for filtering

Tier filtering: zmin <= z < zmax.

Candidate filtering

Results from both WordNet and fastText pass through these filters before frequency matching:

  • Query exclusion: case-insensitive match against the input word
  • Morphological variants: query word + common inflections (-s, -es, -ed, -ing, -er, -ers, double-consonant variants). Words ending in e additionally get the e-drop forms (make→making/maker) and the simple-suffix forms (make→makes/maker) — the branch is additive, not exclusive, so inflected forms of the query don't slip through.
  • Repeated characters: any character repeating 4+ times (re.search(r'(.)\1{3,}', key)) — catches embedding junk like "loooove" while sparing valid triple-consonant compounds (wallless, crossstitch).
  • Non-letter start: result must begin with [a-z]
  • Double hyphen: -- in key
  • Short words: fewer than 3 characters
  • Trailing punctuation cleanup: artifacts like "walk-" or "walk." stripped and re-checked against morph set

Definition truncation

Definitions over 200 characters are truncated at the last word boundary with "…" appended. The full definition is cached in DEFINITION_CACHE; truncation happens at API response time only.

Band labels

The API includes a band field on each result. Band labels match TIERS keys exactly. This was off-by-one before Session 7b's fix; get_band_label(zipf) in candidates.py is the single source of truth, and any change to TIERS boundaries must update get_band_label in lockstep.

Zipf range band value
Zipf ≥ 4.0 common
3.0 ≤ Zipf < 4.0 uncommon
2.0 ≤ Zipf < 3.0 rare
1.0 ≤ Zipf < 2.0 exotic
Zipf < 1.0 absurd

Synonym scoring

Blended single list, no source labels exposed in UI:

  • WordNet candidates: flat score = 1.5
  • fastText candidates: score = cosine similarity
  • Overlap: WordNet wins (true synonym trumps embedding neighbor)
  • Normalize for comparison/lookup on lowercase; WordNet lemma underscores become spaces
  • Multiword candidates are allowed for MVP
  • fastText cosine cutoff: FASTTEXT_COSINE_CUTOFF = 0.65
  • Sort is user-selectable via the rank param (default common):
    • common — Zipf descending, score descending as tiebreaker (the default/original order)
    • rare — Zipf ascending, score descending as tiebreaker
    • relevance — score descending, Zipf descending as tiebreaker

API

GET /synonyms?word=<x>&tier=<t>&pos=<p>&corpus=<c>&rank=<r> — returns a JSON object:

{
  "senses":          [{"id": "<synset>", "gloss": "...", "pos": "noun"}],
  "query_in_corpus": true,
  "results":         [{"word": "...", "zipf": 3.4, "definition": "...", "band": "uncommon"}]
}

senses is WordNet sense metadata for the query (capped at 8, filtered by pos when set, empty [] for 2-word phrases). results is the blended/sorted/frequency-filtered candidate list. query_in_corpus is true/false for single-word queries (whether the word exists in the selected corpus's frequency table), or null for 2-word phrases.

Valid tier values: all (default), common, uncommon, rare, exotic, absurd. Comma-separated lists accepted (tier=uncommon,rare).

Valid rank values: common (default), rare, relevance — controls result sort order (see Synonym scoring). Unknown values return 400 with available_ranks list.

Valid pos values: all (default), noun, verb, adj, adv. Multi-select: noun,verb. When pos is specified, WordNet candidates are filtered to matching POS synsets; fastText standalone candidates are excluded (fastText has no POS metadata). Unknown pos values return 400 with available_pos list.

Valid corpus values: wordfreq (default), subtlex (SUBTLEX-US film subtitles), bnc (British National Corpus, Kilgarriff), google_1grams (Norvig), wikipedia (Wikipedia 2023), kaggle (rtatman), opensubtitles (OpenSubtitles 2018), gutenberg (Project Gutenberg), leipzig_news (Leipzig News 2025), leipzig_web_com (Leipzig Web COM 2018), leipzig_web_uk (Leipzig Web UK 2018). Controls which frequency table is used for Zipf filtering. Unknown values return 400 with available_corpora list.

BNC lookup: The query word is lemmatized via NLTK WordNetLemmatizer (noun form first, verb form as fallback) before BNC Zipf lookup, because BNC surface forms are POS-tagged and the corpus lookup requires exact form matching.

Phrases of up to 2 words supported (e.g., word=hard+work). 3+ words return 400.

Performance

  • Definition cache: DEFINITION_CACHE at module scope caches full get_definition results. Repeated lookups across queries are instant.
  • Concurrent fetches: /synonyms uses ThreadPoolExecutor(max_workers=10) for parallel definition lookups. Fresh queries ~5-10x faster than sequential; cache-hit queries near-instant.

Parameter precedence:

  • Exactly one of min/max → 400
  • Neither → use tier; missing tier → 400
  • Unknown tier value → 400 with available tier names
  • Missing word → 400

Frontend control surface (left panel)

The left panel contains, top to bottom:

  1. Serif "Synonymicon" wordmark in the top-left corner — just the word in a serif face, 1.75rem (Session 7b bump). No logo glyph, no ornament.
  2. Integrated search/control surface, anchored at the upper-third (margin-top ~18vh), structured as a single rounded "tray" containing:
    • Inner search card (input field with magnifying-glass icon and submit-arrow button). Placeholder text: discover.
    • Four flat dropdowns sitting on the tray below the search card: corpus: <current>, frequency: <current>, pos: <current>, and sort: <current>.
    • The tray, search card, and dropdowns form three layered visual surfaces — outer tray (--surface), inner search card (--column), and the bare dropdowns on the tray.
    • Each dropdown trigger carries aria-haspopup/aria-expanded (synced on open/close); a single document handler closes all menus on outside-click or Escape.
  3. Watermark & glyph in the bottom-left, ~18rem, ~7% opacity, fills the otherwise-empty lower portion of the panel.

Frequency dropdown is checkbox-style with multi-select. all is mutually exclusive with bands; selecting any band deselects all. Empty selection reverts to all. Selecting all individual bands collapses to all. Trigger label shows all when all selected, band name when one selected, custom when multiple selected. A divider separates all from the band options.

POS dropdown mirrors the frequency pattern: all mutually exclusive with individual POSes. Trigger label shows all / POS name / custom. Selecting all collapses to all.

Sort dropdown is single-select with options common (default), rare, relevance → sent as the rank param. The band-separator labels are derived from the same FREQUENCY_TIERS label set so a band reads identically as a control option and as an inline separator. (The rank-count framing like 10k-30k is calibrated to wordfreq; under other corpora the same Zipf band maps to a different real-world rank — a known labeling limitation, not a per-corpus relabel.)

UI label tier param
all (default) all
common common
10k-30k uncommon
30k-80k rare
80k-150k exotic
150k+ absurd

Display labels (10k-30k, etc.) are display-only; backend filters on Zipf.

The corpus dropdown has eleven options: wordfreq (internet-derived, default), subtlex (US film/TV subtitles, Brysbaert & New 2009), bnc (British National Corpus, Kilgarriff), google_1grams (Norvig), wikipedia (IlyaSemenov 2023), kaggle (rtatman), opensubtitles (OpenSubtitles 2018), gutenberg (Project Gutenberg), leipzig_news (Leipzig News 2025), leipzig_web_com (Leipzig Web COM 2018), leipzig_web_uk (Leipzig Web UK 2018). Switching corpora re-filters results through the selected frequency table; tier boundaries (Zipf values) remain fixed — only the per-word Zipf value changes. Corpus selection is persisted in URL params (?corpus=...) and restored on back/forward navigation and page load.

Frontend results surface (right panel)

The right panel holds one rounded "word surface" containing column cells per page. Pagination is page-based, not continuous-scroll.

  • Responsive column count. Column count adapts to window width via getColumnsPerPage(): ≥1150px → 3 columns, ≥800px → 2 columns, <800px → 1 column. Grid template columns and width are set inline by JS. Resize handler debounced at 150ms.
  • Continuous-flow chunking. Results fill column 1 to capacity, overflow to column 2, then column 3, then page 2's column 1, etc. The fill logic is continuous; only the visible window is paginated.
  • Peek column. When more pages exist and column count > 1, ~36px of the next page's first column bleeds past the surface's right edge as an affordance signaling "more results continue." Disabled at 1-column width. On the last page, the columns sit flush with no peek.
  • Page indicator. Bar-style active dot (wider, darker) with thin dots for inactive pages, centered below the columns. Clickable.
  • Edge arrows. Left and right circular arrow buttons positioned outside the surface edges. Disabled (faded) when at the bounds.
  • Page transition. 220ms fade + slight horizontal slide on page change. The columns container has its key-equivalent state cycled to retrigger the animation.

Within columns

  • Entry layout: serif headword (~2rem, slight letterspacing), small superscript Zipf badge, italic serif definition (~1rem, muted color, ~1.35 line-height).
  • Pivot-on-click: clicking a result headword fires a new search for that word. cursor: pointer is the only visual affordance — no underlines, no link styling — but the headword carries role="link"/tabindex="0" and an Enter/Space keydown so the pivot is keyboard- and AT-operable (it is a <div>, not a heading — results are a list, not document sections). Page resets to 1 on pivot. Browser history via history.pushState({word, tiers, pos, corpus, rank}); the popstate listener restores all of word + tier + pos + corpus + rank on back/forward (and resets to the empty state when navigating back to the initial entry). popstate only restores — it never pushes. On page load, ?word=...&tier=...&pos=...&corpus=...&rank=... params are read and auto-searched (bookmarkable queries).

Brand wordmark reset. The "Synonymicon" wordmark is clickable and resets to the empty state. Enter key with an empty search field also triggers the reset. history.replaceState clears the URL so refreshing on the empty state stays there.

  • Hover state: color deepens on both headword and definition. No movement, no scale, no shadow change.
  • [undefined] rendering: italic, lighter muted color than regular definitions.
  • Band separators (Session 7b): when results span multiple bands and the current page contains a band transition, a small-caps muted header with hairline divider above appears inline at the position of the transition. Cross-column tracking via prevBand state through the render loop ensures a continuing band does not get a redundant header at the top of a new column. A new band starting at the top of a column does get a header.

Surface layering

  • Outer word surface: --surface background, large radius (--radius-outer, 2rem), soft shadow.
  • Inner column cells: --column background, smaller radius (--radius-inner, 1.5rem), subtle shadow, hairline border. Use background-clip: padding-box to avoid corner-leak rendering artifacts. Scrollable vertically (overflow-y: auto) so long band sections don't clip.
  • The two-level layering (outer surface + inner cells) is intentional and earns its complexity by making the column boundaries legible without dividers.

Frontend states

  • Loading: 6px dot, 1.2s pulse animation, anchored bottom-center of the word surface.
  • Empty: contextual message — begin with a word when no query has been issued, no synonyms found in this band when a query returned zero results in the selected tier. Empty state renders directly on the --surface tray background without a .column-cell wrapper — the message cell uses grid-column: 1/-1; display: grid; place-items: center.
  • Error: muted red message in the word surface area.

Run commands

cd ~/projects/synonymicon
source .venv/bin/activate
flask run --no-reload

Dev server on localhost:5000. Use --no-reload because the fastText model loads at module scope and the reloader would spawn two processes that both load it. Server startup ~2.5–3 minutes due to fastText (~1GB into RAM).

Non-goals — do not add these

  • General synonyms across the full frequency range with no filtering applied (the multi-source pipeline is broad, but the tool always exposes frequency as a control)
  • Dictionary-like features beyond definition (etymology, pronunciation, usage examples)
  • Languages other than English
  • A fourth or fifth ad-hoc theme — three is the committed set; further themes need a deliberate session
  • simple / advanced mode toggle in the UI
  • Any database, ORM, or persistent storage

Scope rails

  • Do not introduce a database. Ephemeral in-memory caches are fine; do not add persistent storage. (The definition cache is in-memory module-scope only.)
  • Do not add features outside the MVP scope listed above.
  • Frontend is desktop-first and right-heavier (38/62). Left side fixed for the integrated search/control surface; right side for the word surface only. No top bar; no controls on the right side.
  • Results render inside the single word surface with band separators flowing inline through the columns. Do not fix one band per column. Do not mirror or duplicate results across containers. No recursive or looped result repetition.
  • Frontend is plain inline HTML/CSS/JS. No build tools, no bundler, no framework. Tailwind, React, etc. are out of scope — translation from any external mockup must produce vanilla output.
  • Do not do client-side sorting, scoring, definition lookup, or ranking — the backend returns results in final order; render as-is.
  • The current visual treatment intentionally includes: soft shadows on surfaces, a watermark & glyph in the lower-left of the search panel, a 220ms page-transition animation, and serif typography (Cormorant Garamond). These are part of the agreed visual model — do not remove them as "decorative excess." Do not, however, add further ornament: no additional decorative glyphs, no additional animations beyond page transition and color-state hover, no additional taglines or branding marks beyond the wordmark and watermark already present.

Coding rules

  • Backend is split into focused modules (app.py, config.py, candidates.py, corpora.py, definitions.py); each has a single responsibility. Most changes are still tiny — a sort tweak, a band-label adjustment, one corpus row in corpora.py's _COUNT_CORPORA table. Don't re-merge the modules or refactor for "cleanliness" without a reason.
  • Frontend is single-file. Inline CSS in <style>, inline JS in <script>. Do not split into separate files unless there's a load-time reason.
  • The fixed-height heuristic for items-per-column (ITEM_HEIGHT_PX = 130) is acknowledged-imperfect; do not "fix" it without replacing it with proper dynamic measurement (and only do that as a deliberate session task, not a side effect of other work).
  • When a Planned for Session 8 item lands, fold it into the relevant spec section in the same change and remove the bullet. The planned-changes section is a staging area, not permanent documentation.