Exploring Dataset: The Reddit dataset is divided into three subsets: training, validation, and testing. The training set is used to train the model, letting it learn hidden patterns in the data and fit the model parameters; the validation set provides an unbiased evaluation of the model fitted on the training set and supplies the information needed for tuning hyperparameters; and the test set provides an unbiased evaluation of the final model fitted on the training set. The dataset is divided into three parts to ensure that the classification algorithm can generalise effectively to fresh data after being trained. There are 1200 Reddit posts in the training set, and 400 in each of the validation and test sets. The label to predict is the 'Subreddit' column in the dataset, which contains nine distinct values (PS4, pcgaming, NintendoSwitch, antiMLM, HydroHomies, Coffee, xbox, Soda, tea). Within each split (train/validation/test), the labels occur with roughly similar frequencies, so each split is approximately balanced. However, although every label appears in all three splits, the labels are not distributed identically across the train/validation/test sets, which raises the suspicion of some imbalance in the data.
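As a quick sanity check on the split sizes and label distribution, a short sketch such as the following can be used (the file names are assumptions about how the three splits are stored; the label column is 'Subreddit' as described above):

import pandas as pd

# Hypothetical file names for the three splits.
train = pd.read_csv("reddit_train.csv")
valid = pd.read_csv("reddit_validation.csv")
test = pd.read_csv("reddit_test.csv")

# Compare the size and per-label proportions of each split to check for imbalance.
for name, df in [("train", train), ("validation", valid), ("test", test)]:
    print(name, len(df))
    print(df["Subreddit"].value_counts(normalize=True).round(3))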
Classifiers Implemented
1. Dummy Classifier with strategy="most_frequent"
2. Dummy Classifier with strategy="stratified"
3. Logistic Regression with One-hot vectorization
4. Logistic Regression with TF-IDF vectorization (default settings)
5. SVC Classifier with One-hot vectorization (SVM with RBF kernel, default settings)
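A minimal scikit-learn sketch of the five pipelines listed above is given below. The "one-hot" text vectorization is approximated here with CountVectorizer(binary=True), and max_iter is raised only so that logistic regression converges on sparse text features; these details are assumptions, not necessarily the exact settings used in the project.

from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

pipelines = {
    # Baselines: DummyClassifier ignores the input features entirely.
    "dummy_most_frequent": DummyClassifier(strategy="most_frequent"),
    "dummy_stratified": DummyClassifier(strategy="stratified"),
    # One-hot vectorization approximated with binary token counts.
    "logreg_onehot": make_pipeline(CountVectorizer(binary=True),
                                   LogisticRegression(max_iter=1000)),
    "logreg_tfidf": make_pipeline(TfidfVectorizer(),
                                  LogisticRegression(max_iter=1000)),
    # SVC uses the RBF kernel by default.
    "svc_onehot": make_pipeline(CountVectorizer(binary=True), SVC()),
}

# Hypothetical usage: X_* are lists of post bodies, y_* the 'Subreddit' labels.
# for name, model in pipelines.items():
#     model.fit(X_train, y_train)
#     print(name, model.score(X_valid, y_valid))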
Creation of Custom Tokenizer
A function named custom_tokenizer() converts the text (the body of a Reddit post) into a list of words, on the assumption that the better the body text is tokenized, the better the final result will be. A sketch of the function follows the list below.
The function custom_tokenizer():
• removes HTML elements, URLs, and digits from the Reddit post body
• removes emojis; emojis were observed in the post bodies and are not relevant to the problem statement
• removes all special characters and punctuation from the post body
• removes stop words and extra spaces
• expands commonly seen short forms of words, e.g. isn't -> is not, can't -> can not
• normalises the text by converting all words to lowercase
• tokenizes/splits the body text into a list of tokens/words
• applies lemmatization to the tokens (e.g. dancing -> dance)
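A minimal sketch of such a tokenizer is shown here; it relies on NLTK's stop-word list and WordNet lemmatizer (assumed to be downloaded) and uses a small, illustrative contraction table rather than the full mapping used in the project.

import re
import string

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Assumes the NLTK resources 'stopwords' and 'wordnet' have been downloaded.
STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()
# Illustrative subset of the short-form expansions.
CONTRACTIONS = {"isn't": "is not", "can't": "can not", "won't": "will not", "don't": "do not"}

def custom_tokenizer(text):
    text = text.lower()                                    # normalise case
    text = re.sub(r"<[^>]+>", " ", text)                   # drop HTML elements
    text = re.sub(r"http\S+|www\.\S+", " ", text)          # drop URLs
    for short, full in CONTRACTIONS.items():               # expand short forms
        text = text.replace(short, full)
    text = re.sub(r"[^\x00-\x7f]", " ", text)              # drop emojis / non-ASCII symbols
    text = re.sub(r"\d+", " ", text)                       # drop digits
    text = re.sub(rf"[{re.escape(string.punctuation)}]", " ", text)  # drop punctuation
    tokens = text.split()                                  # tokenize on whitespace
    tokens = [t for t in tokens if t not in STOP_WORDS]    # drop stop words
    return [LEMMATIZER.lemmatize(t, pos="v") for t in tokens]  # lemmatize (dancing -> dance)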
Feature Engineering
The following features were selected to add to the tuned model.
• Added other properties of Reddit posts
'Title' has been added as an additional feature. In a normal context, the title of a Reddit post gives a very brief description of the subreddit. It is shown in the browser tab's text, as well as in Reddit and Google search results. As a result, adding the title as a new feature may give a significant increase in classifier performance, as sketched below.
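One possible way to wire this in is to vectorize both text columns and feed them to a single classifier; the column names 'Title' and 'Body' are assumptions about the dataframe layout.

from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Vectorize the post body and the title separately, then concatenate the features.
features = ColumnTransformer([
    ("body", TfidfVectorizer(tokenizer=custom_tokenizer), "Body"),
    ("title", TfidfVectorizer(tokenizer=custom_tokenizer), "Title"),
])
model = make_pipeline(features, LogisticRegression(max_iter=1000))
# model.fit(train[["Body", "Title"]], train["Subreddit"])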
• Word embedding using Gensim word2vec
word2vec is used for word embedding. The Reddit post 'Title' is passed to a gensim-based function. Since word2vec is capable of capturing word similarity using vector arithmetic, it may improve model performance; the generated vectors are also small and flexible in size. gensim.utils.simple_preprocess is used for text preprocessing and tokenization. A sketch of this idea is given below.
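The following is a minimal gensim sketch, assuming a 'Title' column and representing each title by the average of its word vectors; the averaging step and the chosen vector_size are assumptions for illustration, not necessarily the exact method used.

import numpy as np
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

# Tokenize the titles with gensim's simple_preprocess.
title_tokens = [simple_preprocess(t) for t in train["Title"].fillna("")]

# Train a small word2vec model; the generated vectors are compact (vector_size=100).
w2v = Word2Vec(sentences=title_tokens, vector_size=100, window=5, min_count=2, workers=4)

def title_vector(tokens, model=w2v):
    # Average the vectors of the in-vocabulary tokens of one title.
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

X_title = np.vstack([title_vector(toks) for toks in title_tokens])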