Checking this repo against the Quote-500K Dataset #185
TomLucidor
started this conversation in
General
Replies: 1 comment 2 replies
I have a dataset of quotes that is useful for language models, and I need to pick out only the English ones to keep the data clean. Many of the quotes are in other languages (Spanish, French, Italian, Hindi, Chinese, etc.), but I also realized that some of them are buggy (e.g. bad characters, numerals, formatting), and I would like to try cleaning those out as well. Here are the quotes that were flagged as non-English (some of them are actually English):
expected_foreign.txt
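As a rough illustration of the cleanup described above, here is a minimal sketch of a pre-filter that could run before any language detection. It uses only the Python standard library, and the function names (`is_buggy`, `latin_ratio`, `triage`) and the `min_latin` threshold are hypothetical, not part of this repo. It separates "buggy" quotes (replacement characters, control characters, no letters at all) and quotes written mostly in a non-Latin script (e.g. Hindi, Chinese). It cannot tell English apart from Spanish, French, or Italian, so those would still need a real language detector.

```python
import unicodedata


def is_buggy(quote: str) -> bool:
    """Flag quotes with encoding debris, control characters, or no letters."""
    # U+FFFD is the Unicode replacement character left behind by bad decoding.
    if "\ufffd" in quote:
        return True
    for ch in quote:
        # Category "Cc" covers control characters; ordinary whitespace is allowed.
        if unicodedata.category(ch) == "Cc" and ch not in "\t\n\r":
            return True
    # A quote made only of digits and punctuation is treated as buggy.
    return not any(ch.isalpha() for ch in quote)


def latin_ratio(quote: str) -> float:
    """Return the share of alphabetic characters that are Latin-script letters."""
    letters = [ch for ch in quote if ch.isalpha()]
    if not letters:
        return 0.0
    latin = sum(1 for ch in letters if "LATIN" in unicodedata.name(ch, ""))
    return latin / len(letters)


def triage(quote: str, min_latin: float = 0.9) -> str:
    """Sort a quote into 'buggy', 'non_latin', or 'candidate'.

    'candidate' only means the quote is clean Latin-script text; it may still
    be Spanish, French, Italian, etc. and needs a language detector afterwards.
    The 0.9 threshold is an arbitrary assumption, not a tuned value.
    """
    if is_buggy(quote):
        return "buggy"
    if latin_ratio(quote) < min_latin:
        return "non_latin"
    return "candidate"
```

Only the quotes that come back as `"candidate"` would then need to go through the language detector, which keeps the obviously broken and non-Latin entries out of the false-positive list.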
Questions:
Note: I will drop the "false negative" cases once the data has been handled carefully, as the original file is 98 KB.