[ENH] Add Dynamic Alphabet Sizes for SFA #2844
Minor comments only; I didn't pick up anything major. Otherwise LGTM.
    X_test = zscore(X_test.squeeze(), axis=1)
    histogram_type = "equi-width"

    # print("Testing")
Leftover comment.
    alphabet_allocation_methods = {
        "linear_scale",
        "log_scale",
        "sqrt_scale",
    }
Ideally, the tests would import this set so that they automatically cover any methods added in the future.
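A minimal sketch of what that could look like, assuming the set is importable from the module that defines it. The set is redefined locally here only to keep the snippet self-contained, and `check_all_allocation_methods` and `run_case` are hypothetical names, not part of the PR:

```python
# In a real test module this set would be imported from wherever the PR
# defines it; it is copied here only so the sketch runs on its own.
alphabet_allocation_methods = {
    "linear_scale",
    "log_scale",
    "sqrt_scale",
}


def check_all_allocation_methods(run_case):
    """Run a test callable once per allocation method (hypothetical helper).

    Iterating over the shared set means a newly added method is exercised
    without touching the test. sorted() keeps the order deterministic.
    """
    return {method: run_case(method) for method in sorted(alphabet_allocation_methods)}


# Example: a stand-in test case that just echoes the method name.
results = check_all_allocation_methods(lambda method: method.upper())
```

With pytest, the same idea is usually written as `@pytest.mark.parametrize("method", sorted(alphabet_allocation_methods))`.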
        normed_scale = variance / variance.mean()
    elif self.alphabet_allocation_method == "log_scale":
        variance = np.log2((self.dft_variance[self.support]) + 1)
        normed_scale = variance / variance.mean()
Minor, but you could compute `normed_scale` once after the if/elif branches, since every branch does it.
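A sketch of that refactor as a standalone function. Only the `log_scale` branch is taken from the diff; the `linear_scale` and `sqrt_scale` transforms and the function name `normalised_scale` are assumptions made for illustration:

```python
import numpy as np


def normalised_scale(dft_variance, method):
    """Hypothetical standalone version of the branch logic in the diff."""
    v = np.asarray(dft_variance, dtype=float)

    if method == "linear_scale":
        variance = v                 # assumed: use the raw variance
    elif method == "log_scale":
        variance = np.log2(v + 1)    # as shown in the diff
    elif method == "sqrt_scale":
        variance = np.sqrt(v)        # assumed: square-root damping
    else:
        raise ValueError(f"Unknown allocation method: {method!r}")

    # The shared normalisation step, hoisted out of the branches.
    return variance / variance.mean()
```

Each branch now only selects the transform, and the division by the mean appears once.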
This PR introduces the concept of dynamic alphabet sizes to SFA.
The alphabet size is treated as a budget that is distributed over all coefficients to maximize the tightness of the lower bound. Alphabet sizes are assigned in proportion to the variance of each coefficient, using one of three strategies: `linear_scale`, `log_scale`, or `sqrt_scale`.
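The allocation idea can be sketched roughly as follows. This is an illustration under assumptions, not the PR's implementation: the helper name `allocate_alphabet_sizes`, the `min_size` floor, the rounding step, and the linear and square-root transforms are all hypothetical, while the `log2(v + 1)` transform and the division by the mean follow the diff.

```python
import numpy as np

# Transform applied to the per-coefficient variance before normalising.
# Only log2(v + 1) appears in the diff; the other two are assumptions.
TRANSFORMS = {
    "linear_scale": lambda v: v,
    "log_scale": lambda v: np.log2(v + 1),
    "sqrt_scale": np.sqrt,
}


def allocate_alphabet_sizes(variances, alphabet_size, method="linear_scale", min_size=2):
    """Spread a budget of len(variances) * alphabet_size symbols (hypothetical helper)."""
    scaled = TRANSFORMS[method](np.asarray(variances, dtype=float))
    # Dividing by the mean gives weights that average to 1, so the
    # unrounded sizes sum to exactly word_length * alphabet_size.
    normed_scale = scaled / scaled.mean()
    sizes = np.round(normed_scale * alphabet_size).astype(int)
    # Rounding and the minimum size can shift the total slightly.
    return np.maximum(sizes, min_size)


# Word length 4, alphabet size 4: a budget of 16 symbols.
sizes = allocate_alphabet_sizes([8.0, 4.0, 2.0, 2.0], alphabet_size=4)
print(sizes, sizes.sum())  # [8 4 2 2] 16
```

High-variance coefficients receive more symbols than low-variance ones. The `log_scale` and `sqrt_scale` options flatten the differences between coefficients before the budget is split.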
Illustration
Example with Alphabet Sizes [4, 4, 2, 2] and variance-based feature selection:

Example
For example, with a word length of 4 and an alphabet size of 4 per coefficient, the total budget is 16 = 4 * 4.
CD-Diagram for (average) alphabet-size 64
Experiments
This kind of assignment is most beneficial for smaller alphabet sizes. The TLB (tightness of lower bound) results, where larger is better, show large improvements for alphabet sizes of 2 to 8.