Consider add Term-Text mapping #532

Open
1 of 9 tasks
jzohrab opened this issue Dec 7, 2024 · 0 comments
Labels
architecture fundamental architecture redesign

Comments

@jzohrab
Collaborator

jzohrab commented Dec 7, 2024

Currently, Lute doesn't build an 'index' of its terms; it only searches through texts dynamically at runtime.

Adding an always-accurate term index (a textterms table with fields TtID, TtTxID, TtWoID, TtCount) could be useful:

  • gives learners potentially useful stats on which words are common
  • lets us build stats graphs for books (for the index page) really quickly. (Note this wouldn't help when evaluating new texts to see how difficult they would be, because those texts wouldn't be indexed unless they're also fully imported.)
  • simplifies finding term references, as they'll now be pre-calculated and cached
  • lets users do book-level filters for terms (e.g. show me status 1 terms in this book)
  • lets users list terms by descending frequency (potentially useful when importing a new book, as it allows some pre-processing of the data)
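
As a rough illustration of the last two points, here is a minimal sketch of the proposed table and a book-level query, using Python's built-in sqlite3. Only textterms and its four fields come from this proposal; the texts and words tables, their columns (TxBkID, WoText, WoStatus), and the helper names make_db and book_terms are assumptions made for the example, not Lute's actual schema or API.

```python
import sqlite3

# Hypothetical minimal schema. Only textterms and its four fields are from
# the proposal; the other tables and columns are assumed for illustration.
SCHEMA = """
CREATE TABLE texts (TxID INTEGER PRIMARY KEY, TxBkID INTEGER NOT NULL);
CREATE TABLE words (
    WoID INTEGER PRIMARY KEY,
    WoText TEXT NOT NULL,
    WoStatus INTEGER NOT NULL
);
CREATE TABLE textterms (
    TtID INTEGER PRIMARY KEY,
    TtTxID INTEGER NOT NULL REFERENCES texts (TxID),
    TtWoID INTEGER NOT NULL REFERENCES words (WoID),
    TtCount INTEGER NOT NULL,
    UNIQUE (TtTxID, TtWoID)  -- one row per (text, term) pair
);
"""


def make_db():
    """Create an in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def book_terms(conn, book_id, status=None):
    """Return (term, total count) rows for one book, most frequent first.

    If status is given, only terms with that status are returned.
    """
    sql = """
        SELECT w.WoText, SUM(tt.TtCount) AS total
        FROM textterms tt
        JOIN texts t ON t.TxID = tt.TtTxID
        JOIN words w ON w.WoID = tt.TtWoID
        WHERE t.TxBkID = ? AND (? IS NULL OR w.WoStatus = ?)
        GROUP BY w.WoID
        ORDER BY total DESC, w.WoText
    """
    return conn.execute(sql, (book_id, status, status)).fetchall()
```

With a table like this, "show me status 1 terms in this book" would be a call such as book_terms(conn, book_id, status=1), and the unfiltered call gives the descending-frequency list.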

To do:

  • Figure out how to handle text deletion. If a user deletes a text, the counts for its terms will decrease. That should be fine: deleting texts might skew some small things, but it's nothing to be too worried about. Lute does the best it can with the data it has; it can't be perfect.
  • wait for #531 ("Searching for references should take sentence casing into account") to be completed, as doing multiterm searches requires casing to be correct
  • see if this really is useful. I'm not sure that it is ... it's 'interesting' data, but interesting can just mean more numbers and crap, and not really help encourage reading.
  • note that common words ('the', 'a', etc.) will be found on every single page, so indexing these may be overkill and might just end up producing a ton of data, e.g. 1000 pages × 100 unique words each = 100K records
  • figure out how to backfill existing texts, which is a big processing job. It could be done piecemeal, or as a startup job that is only run once, depending on efficiency
  • remove all textterms rows where TtTxID = TxID when a text is edited
  • remove all textterms rows where TtTxID = TxID when a text is rendered, then rebuild that text's entries during the render
  • cascade delete textterms rows where TtTxID = TxID when a text is deleted, and rows where TtWoID matches the deleted word's ID when a word is deleted
  • on save of a new multiword term, check all existing text sentences to see if it should be added to the index for that page (requires #531, "Searching for references should take sentence casing into account", to be done)
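
The delete-and-rebuild and cascade-delete items above could look roughly like the following sketch, again using Python's built-in sqlite3. The texts and words tables are reduced to bare ID columns, and make_db, rebuild_text_index, and the word_ids argument (the IDs of terms matched in a text during render) are hypothetical names for illustration, not existing Lute code.

```python
import sqlite3
from collections import Counter


def make_db():
    """Create an in-memory database with cascading foreign keys."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when enabled on the connection.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE texts (TxID INTEGER PRIMARY KEY);
        CREATE TABLE words (WoID INTEGER PRIMARY KEY);
        CREATE TABLE textterms (
            TtID INTEGER PRIMARY KEY,
            TtTxID INTEGER NOT NULL REFERENCES texts (TxID) ON DELETE CASCADE,
            TtWoID INTEGER NOT NULL REFERENCES words (WoID) ON DELETE CASCADE,
            TtCount INTEGER NOT NULL
        );
    """)
    return conn


def rebuild_text_index(conn, text_id, word_ids):
    """Replace all textterms rows for one text with fresh counts.

    word_ids is the (assumed) list of term IDs found in the text,
    with one entry per occurrence.
    """
    counts = Counter(word_ids)
    with conn:  # single transaction: drop the old rows, insert the new ones
        conn.execute("DELETE FROM textterms WHERE TtTxID = ?", (text_id,))
        conn.executemany(
            "INSERT INTO textterms (TtTxID, TtWoID, TtCount) VALUES (?, ?, ?)",
            [(text_id, word_id, n) for word_id, n in counts.items()],
        )
```

Calling rebuild_text_index on every edit or render keeps the rows for that text in sync without touching other texts, and the ON DELETE CASCADE clauses remove index rows automatically when a text or word is deleted.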
@jzohrab jzohrab added this to Lute-v3 Dec 7, 2024
@jzohrab jzohrab changed the title from "Consider add Term-Text mapping (BLOCKED by #531)" to "Consider add Term-Text mapping - BLOCKED by #531" Dec 7, 2024
@jzohrab jzohrab changed the title from "Consider add Term-Text mapping - BLOCKED by #531" to "Consider add Term-Text mapping - BLOCKED by 531" Dec 7, 2024
@jzohrab jzohrab changed the title from "Consider add Term-Text mapping - BLOCKED by 531" to "Consider add Term-Text mapping" Dec 14, 2024
@jzohrab jzohrab added the architecture fundamental architecture redesign label Jan 8, 2025