Alex lemmatizer classifier 2 #1422
Open
AngledLuffa wants to merge 16 commits into dev from alex_lemmatizer_classifier_2
Conversation
AngledLuffa force-pushed the alex_lemmatizer_classifier_2 branch 30 times, most recently from 21cb859 to f8455f4 on September 16, 2024 02:27
AngledLuffa force-pushed the alex_lemmatizer_classifier_2 branch 13 times, most recently from b99e698 to 8fa012b on November 12, 2024 07:31
AngledLuffa force-pushed the alex_lemmatizer_classifier_2 branch from a3cd587 to 1a39fa9 on November 22, 2024 21:26
… token in English or other lemmas with ambiguous resolutions
- Includes a data processing class for extracting sentences of interest
- Has evaluation functions for single-example and multi-example cases
- Adds utility functions for loading a dataset from file and for handling unknown tokens during embedding lookup
- Can use charlm models for training
- Includes a baseline which uses a transformer to compare against the LSTM model. Uses AutoTokenizer and AutoModel to load the transformer; a specific model name can be provided with the --bert_model flag
- Includes a feature to drop certain lemmas, or rather, to only accept lemmas if they match a regex. This will be particularly useful for a language like Farsi, where the training data has only 6 and 1 examples of the 3rd and 4th most common expansions
- Automatically extracts the label information from the dataset, and saves the label_decoder in both the regular model and the transformer baseline model
- Word vectors are trainable in the LSTM model
- Word vectors used are the ones shipped with Stanza for each language, not specifically GloVe. This allows using word vectors for whichever language we are processing
- Model selection during the training loop is done using eval-set performance, for both the baseline and the LSTM model
- Training/testing is done via batch processing for speed
- Includes UPOS tags in data processing/loading; UPOS embeddings for the words are then used in the LSTM model as an additional signal for the query word
- Implements a multihead attention option for the LSTM model, and adds positional encodings to the MultiHeadAttention layer
- The common train() method from the two trainer classes is factored into a single parent class, which should make it easier to update pieces and keep them in sync
- Keeps the dataset in a single object rather than a bunch of lists. This makes it easier to shuffle and keeps everything in one place
- Does not save the transformer, charlm, or original word vector file in the model files; word vectors are finetuned and the deltas are saved
- Imports the full path
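The delta-saving scheme mentioned above (finetuned word vectors are stored as differences against the shipped embedding matrix, so the large base file never needs to be duplicated in the model checkpoint) can be sketched roughly as follows. The function names here are illustrative, not Stanza's actual API:

```python
import numpy as np

# Hypothetical sketch: persist only the finetuning deltas, then rebuild the
# finetuned embedding matrix from the shipped base vectors at load time.
def save_deltas(original: np.ndarray, finetuned: np.ndarray) -> np.ndarray:
    """Return the (small-magnitude) difference to store in the checkpoint."""
    return finetuned - original

def load_with_deltas(original: np.ndarray, deltas: np.ndarray) -> np.ndarray:
    """Reconstruct finetuned vectors from shipped base + saved deltas."""
    return original + deltas

base = np.random.rand(5, 3).astype(np.float32)       # shipped word vectors
tuned = base + 0.01 * np.random.rand(5, 3).astype(np.float32)  # after training
restored = load_with_deltas(base, save_deltas(base, tuned))
assert np.allclose(restored, tuned)
```

This keeps checkpoints small while still allowing the embeddings to be trainable, at the cost of requiring the original vector file to be available at load time.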
… to store other information with the data, such as the tag being processed
… charlms if they exist
- run_lemma_classifier.py now automatically tries to pick a save name and training filename appropriate for the dataset being trained. We still need to calculate the lemmas to predict and use a language-appropriate wordvec file before we can do other languages, though
- Adds the ability to use run_lemma_classifier.py in --score_dev mode
- Adds --score_test to the lemma_classifier as well
- Connects the transformer baseline to the run_lemma_classifier script
- Reports the dev & test scores when running in TRAIN mode
…taset fa_perdt, ja_gsd, AR, HI as current options for the lemma classifier
This requires using a target regex instead of target word to make it simpler to match multiple words at once in the data preparation code
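Matching by regex rather than by literal word lets one pattern cover several surface forms at once during data preparation. A minimal sketch, assuming a hypothetical pattern and helper (not the PR's actual code):

```python
import re

# Illustrative target regex: match the clitic "'s" case-insensitively.
# A broader pattern could cover multiple ambiguous words in one pass.
TARGET_RE = re.compile(r"^'s$", re.IGNORECASE)

def find_targets(tokens):
    """Return indices of tokens the lemma classifier should disambiguate."""
    return [i for i, tok in enumerate(tokens) if TARGET_RE.match(tok)]

tokens = "He 's done and the dog 's bowl is empty".split()
print(find_targets(tokens))  # → [1, 6]
```

Each matched index can then be paired with its sentence to form one training or evaluation example for the classifier.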
Add a sample 9/2/2 dataset and test that it gets read in a way we might like
…mmaClassifier model. Call evaluate_model just in case, although the expectation is that the F1 isn't going to be great
… Will be useful for integrating with the Pipeline. Save the target upos for a lemma classifier along with the target words
…- now running on text with a lemma trainer that has one or more of these classifiers should attach the words correctly
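Attaching the classifier's decisions during lemmatization might look roughly like the following: only words matching the classifier's saved target words AND target UPOS tags get their lemma overridden. All names here are hypothetical, not Stanza's internal API:

```python
# Hedged sketch: override lemmas only for (word, upos) pairs the
# classifier was trained to disambiguate; leave everything else alone.
def attach_lemmas(words, target_words, target_upos, predict):
    """words: list of dicts with 'text', 'upos', 'lemma' keys.
    predict(words, i): classifier's lemma choice for the word at index i."""
    for i, w in enumerate(words):
        if w["text"].lower() in target_words and w["upos"] in target_upos:
            w["lemma"] = predict(words, i)
    return words

words = [{"text": "He", "upos": "PRON", "lemma": "he"},
         {"text": "'s", "upos": "AUX", "lemma": "'s"}]
# Toy stand-in for the trained classifier: always predicts "be".
attach_lemmas(words, target_words={"'s"}, target_upos={"AUX"},
              predict=lambda ws, i: "be")
```

Gating on both the target word and its UPOS keeps the classifier from firing on, say, a possessive "'s" tagged PART when it was trained only on the auxiliary reading.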
…ssarily require the sentences be written anywhere
…luding all of them, there seems to be enough 's -> have without adding artificial data
…r data, make run_lemma automatically attach it
…f lemma_classifier is not specifically set. Pass along the charlm args to the lemma classifier as well
Add a word classifier to cover ambiguous lemmas such as 's