I want to train NER on google Colab as I have windows machine #10

Open
Fatima-Sajid opened this issue Mar 12, 2022 · 18 comments

Comments

@Fatima-Sajid

Could you guide me through the steps for training an NER model for Urdu on Colab, or suggest an approach and describe its steps?

@AngledLuffa
Contributor

AngledLuffa commented Mar 12, 2022 via email

@Fatima-Sajid
Author

Yes, I have a dataset. I am sharing the paper name and dataset link; can you please guide me? The paper is "Urdu Named Entity Recognition: Corpus Generation and Deep Learning Applications", and the dataset link given in the paper is: MK-PUCIT can be downloaded from https://www.dropbox.com/sh/1ivw7ykm2tugg94/AAB9t5wnN7FynESpo7TjJW8la .... Please let me know the steps: the format of the training file, where to keep it, and how to run the training. I have been using Jupyter Notebook; is that OK, or do you suggest something different? Thanks a lot.

@AngledLuffa
Contributor

AngledLuffa commented Mar 13, 2022 via email

@ZaraSajid

ZaraSajid commented Mar 14, 2022

I made files modeled on the ones given in stanza-train (en_sample.train.bio, en_sample.test.bio, en_sample.dev.bio) and kept them in the folder ner/training. I ran the command !python run_ner.py en_sample.dev.bio, and it produced the error below:
2022-03-14 08:41:42 INFO: Training program called with:
run_ner.py en_sample.dev.bio
2022-03-14 08:41:42 DEBUG: en_sample.dev.bio: en_sample.dev.bio
2022-03-14 08:41:42 INFO: en_sample.dev.bio: saved_models/ner/en_sample.dev.bio_nertagger.pt does not exist, training new model
2022-03-14 08:41:42 WARNING: The data for en_sample.dev.bio is missing or incomplete. Attempting to rebuild...
2022-03-14 08:41:42 ERROR: Unable to build the data. Please correctly build the files in data/ner/en_sample.dev.bio.train.json, data/ner/en_sample.dev.bio.dev.json, data/ner/en_sample.dev.bio.test.json and then try again.
Traceback (most recent call last):
File "run_ner.py", line 166, in <module>
main()
File "run_ner.py", line 163, in main
common.main(run_treebank, "ner", "nertagger", add_ner_args)
File "/usr/local/lib/python3.7/dist-packages/stanza/utils/training/common.py", line 106, in main
temp_output_file.name, command_args, extra_args)
File "run_ner.py", line 90, in run_treebank
prepare_ner_dataset.main(short_name)
File "/usr/local/lib/python3.7/dist-packages/stanza/utils/datasets/ner/prepare_ner_dataset.py", line 426, in main
raise ValueError(f"dataset {dataset_name} currently not handled")
ValueError: dataset en_sample.dev.bio currently not handled
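
Reading the log above, the argument given to run_ner.py appears to be treated as a dataset name rather than a file name: the script reports looking for three JSON files whose names are built from that argument. The sketch below only reproduces the naming pattern visible in the error message. `expected_ner_files` is a hypothetical helper written for illustration, not part of Stanza, and the idea that a shorter name such as `en_sample` is what the script wants is an assumption not confirmed in this thread.

```python
def expected_ner_files(dataset_name, data_dir="data/ner"):
    """Build the three JSON paths named in the error message.

    Hypothetical helper: it mirrors the pattern seen in the log,
    data/ner/<dataset_name>.<split>.json, and nothing more.
    """
    return [f"{data_dir}/{dataset_name}.{split}.json"
            for split in ("train", "dev", "test")]


# The argument used in the failing command:
print(expected_ner_files("en_sample.dev.bio"))

# Assumption: passing a plain dataset name would make the script
# look for these files instead.
print(expected_ner_files("en_sample"))
```

Under that assumption, the .bio files would first have to be converted so that the three .json files exist under data/ner before training starts.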

@AngledLuffa
Contributor

AngledLuffa commented Mar 14, 2022 via email

@Fatima-Sajid
Author

Thanks, but now the error is about the language code. I ran this command:
!python run_ner.py /usr/local/lib/python3.7/dist-packages/stanza/data/ner/en_sample.dev.bio
and the error is:
2022-03-15 08:04:29 INFO: Training program called with:
run_ner.py /usr/local/lib/python3.7/dist-packages/stanza/data/ner/en_sample.dev.bio
Traceback (most recent call last):
File "run_ner.py", line 166, in <module>
main()
File "run_ner.py", line 163, in main
common.main(run_treebank, "ner", "nertagger", add_ner_args)
File "/usr/local/lib/python3.7/dist-packages/stanza/utils/training/common.py", line 89, in main
short_name = treebank_to_short_name(treebank)
File "/usr/local/lib/python3.7/dist-packages/stanza/models/common/constant.py", line 180, in treebank_to_short_name
raise ValueError("Unable to find language code for %s" % lang)
ValueError: Unable to find language code for /usr/local/lib/python3.7/dist
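
The path in the message stops at `dist`, which suggests the argument was cut at the first hyphen and the leading piece was taken as the language. This is an inference from the error text alone, not from Stanza's source, so treat the snippet below as an illustration of why a full path cannot work as the argument rather than as a description of the library's internals.

```python
# The argument from the failing command.
arg = "/usr/local/lib/python3.7/dist-packages/stanza/data/ner/en_sample.dev.bio"

# Assumption: the name is split on "-" and the first piece is read
# as the language. The hyphen in "dist-packages" then cuts the path.
language_part = arg.split("-")[0]
print(language_part)  # /usr/local/lib/python3.7/dist
```

The printed value matches the string in the ValueError, which is consistent with the script expecting a dataset name and not a path.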

@AngledLuffa
Contributor

AngledLuffa commented Mar 15, 2022 via email

@Fatima-Sajid
Author

Kindly tell me what I am supposed to do. I understand that you did not suggest using the full path, but I don't understand where the data/ner directory is. Or do I have to create it?

@AngledLuffa
Contributor

AngledLuffa commented Mar 15, 2022 via email

@Fatima-Sajid
Author

I made the directory as you guided, but the problem still exists. I am pasting the commands and errors from Jupyter Notebook:
import stanza
cd Lib\site-packages\stanza\utils
cd Lib\site-packages\stanza\utils\datasets\ner
C:\Users\Fatima\AppData\Local\Programs\Python\Python310\Lib\site-packages\stanza\utils\datasets\ner
import stanza.utils.datasets.ner.prepare_ner_file as prepare_ner_file
cd stanza
cd utils/dataset
cd datasets
!python run_ner.py en_sample.train.bio
!python run_ner.py en_sample.train.bio
prepare_ner_file.process_dataset(en_sample.train.bio, output_json)

NameError Traceback (most recent call last)
Input In [6], in <cell line: 1>()
----> 1 prepare_ner_file.process_dataset(en_sample.train.bio, output_json)

NameError: name 'en_sample' is not defined
cd
C:\Users\Fatima
Lib\site-packages\stanza\utils\training
cd AppData\Local\Programs\Python\Python310\Lib\site-packages\stanza\utils\training
C:\Users\Fatima\AppData\Local\Programs\Python\Python310\Lib\site-packages\stanza\utils\training
!python run_ner.py en_sample.train.bio
2022-03-22 15:10:57 INFO: Training program called with:
run_ner.py en_sample.train.bio
2022-03-22 15:10:57 DEBUG: en_sample.train.bio: en_sample.train.bio
2022-03-22 15:10:57 INFO: en_sample.train.bio: saved_models/ner/en_sample.train.bio_nertagger.pt does not exist, training new model
2022-03-22 15:10:57 WARNING: The data for en_sample.train.bio is missing or incomplete. Attempting to rebuild...
2022-03-22 15:10:57 ERROR: Unable to build the data. Please correctly build the files in data/ner\en_sample.train.bio.train.json, data/ner\en_sample.train.bio.dev.json, data/ner\en_sample.train.bio.test.json and then try again.
Traceback (most recent call last):
File "C:\Users\Fatima\AppData\Local\Programs\Python\Python310\Lib\site-packages\stanza\utils\training\run_ner.py", line 166, in <module>
main()
File "C:\Users\Fatima\AppData\Local\Programs\Python\Python310\Lib\site-packages\stanza\utils\training\run_ner.py", line 163, in main
common.main(run_treebank, "ner", "nertagger", add_ner_args)
File "C:\Users\Fatima\AppData\Local\Programs\Python\Python310\lib\site-packages\stanza\utils\training\common.py", line 105, in main
run_treebank(mode, paths, treebank, short_name,
File "C:\Users\Fatima\AppData\Local\Programs\Python\Python310\Lib\site-packages\stanza\utils\training\run_ner.py", line 90, in run_treebank
prepare_ner_dataset.main(short_name)
File "C:\Users\Fatima\AppData\Local\Programs\Python\Python310\lib\site-packages\stanza\utils\datasets\ner\prepare_ner_dataset.py", line 426, in main
raise ValueError(f"dataset {dataset_name} currently not handled")
ValueError: dataset en_sample.train.bio currently not handled
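
The NameError earlier in this comment comes from writing the file name without quotes, so Python reads `en_sample.train.bio` as attribute access on an undefined variable `en_sample`. A minimal sketch of the same call with string arguments follows. The module path and the two-argument `process_dataset` call are copied from the commands pasted above. The output location `data/ner/en_sample.train.json` is an assumption based on the file names in the error log, and the guards are only there so the snippet runs even when Stanza or the input file is missing.

```python
import os

# File names must be string literals, not bare names.
input_bio = "en_sample.train.bio"
# Assumed output name, following the pattern shown in the error log.
output_json = os.path.join("data", "ner", "en_sample.train.json")

try:
    # Module path as used earlier in this comment.
    import stanza.utils.datasets.ner.prepare_ner_file as prepare_ner_file
except ImportError:
    prepare_ner_file = None  # Stanza is not installed in this environment

if prepare_ner_file is not None and os.path.exists(input_bio):
    # Make sure the target directory exists before writing into it.
    os.makedirs(os.path.dirname(output_json), exist_ok=True)
    prepare_ner_file.process_dataset(input_bio, output_json)
else:
    print("Skipping conversion: Stanza or", input_bio, "not found")
```

The same conversion would presumably be repeated for the dev and test files.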

@Fatima-Sajid
Author

The program will look for the .json files in the data/ner directory, which you may need to create if this is your first time training a Stanza NER model. You can change the expected path by setting the $NER_DATA_DIR environment variable.

This is from the page https://stanfordnlp.github.io/stanza/training.html. It says the program requires .json files, while the other page, which shows example data, has files with the .bio extension. That page is https://github.com/stanfordnlp/stanza-train (see its data directory).
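
The quoted passage mentions two things that can be done from a notebook cell: creating the data/ner directory and, optionally, pointing `NER_DATA_DIR` at it. A minimal sketch follows. `ensure_ner_data_dir` is a hypothetical helper name, and whether the training script picks the variable up exactly this way is taken from the quoted documentation rather than verified here.

```python
import os


def ensure_ner_data_dir(base_dir):
    """Create <base_dir>/data/ner if needed and export NER_DATA_DIR.

    Hypothetical helper for illustration only.
    """
    ner_dir = os.path.join(base_dir, "data", "ner")
    os.makedirs(ner_dir, exist_ok=True)   # no error if it already exists
    os.environ["NER_DATA_DIR"] = ner_dir  # visible to this process and its children
    return ner_dir


ner_dir = ensure_ner_data_dir(os.getcwd())
print("NER data directory:", ner_dir)
```

Commands started from the same notebook with `!python ...` should inherit the variable, since they run as child processes of the kernel.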

@Fatima-Sajid
Author

https://github.com/stanfordnlp/stanza/blob/de44be871282e05f79f23f5f5e284aceb672726b/stanza/utils/training/run_ner.py

Can this link help me train using my dataset? If yes, how?

@AngledLuffa
Contributor

AngledLuffa commented Mar 22, 2022 via email

@Fatima-Sajid
Author

Thanks. Kindly also look at this matter (instructions + error):
import stanza
cd /home/zara/.local/lib/python3.9/site-packages
/home/zara/.local/lib/python3.9/site-packages
cd stanza
/home/zara/.local/lib/python3.9/site-packages/stanza
import stanza.utils.datasets.prepare_ner_file as prepare_ner_file

ModuleNotFoundError Traceback (most recent call last)
Input In [37], in <cell line: 1>()
----> 1 import stanza.utils.datasets.prepare_ner_file as prepare_ner_file

ModuleNotFoundError: No module named 'stanza.utils.datasets.prepare_ner_file'
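
The import that fails here lacks the `ner` package segment that the earlier comment in this thread used (`stanza.utils.datasets.ner.prepare_ner_file`). A small check like the one below can show which dotted path resolves in a given installation before importing it. `module_exists` is a hypothetical helper, and the result depends on the installed Stanza version, so no output is claimed here.

```python
import importlib.util


def module_exists(dotted_name):
    """Return True if the dotted module path can be located.

    Hypothetical helper: find_spec imports parent packages, so a
    missing parent (for example, Stanza not installed) is reported
    as False instead of raising.
    """
    try:
        return importlib.util.find_spec(dotted_name) is not None
    except ImportError:
        return False


for name in (
    "stanza.utils.datasets.prepare_ner_file",      # path from the failing import
    "stanza.utils.datasets.ner.prepare_ner_file",  # path used earlier in the thread
):
    print(name, "->", module_exists(name))
```

Changing directories with `cd` does not affect this lookup for an installed package; only the dotted path in the import statement matters.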

@ZaraSajid

ZaraSajid commented Oct 11, 2022 via email

@AngledLuffa
Contributor

I simplified the instructions as much as I could a couple months ago. Please go through that and tell me if it helps.

https://stanfordnlp.github.io/stanza/new_language_ner.html

@ZaraSajid

ZaraSajid commented Oct 13, 2022 via email
