Find Movie Similarity from Plot Summaries Using K-means and Dendrograms
K-means is an unsupervised learning algorithm that groups similar data points into clusters. This task shows how to use machine learning to cluster movies by plot similarity, based on the text of the 'wiki plot' and 'imdb plot' fields in the dataset.
METHOD
- We imported the libraries needed for data analysis.
- We read the movie dataset into a dataframe from an Excel (.xlsx) file.
- We inspected the data with the info and describe methods.
- We performed some exploratory data analysis (EDA).
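The loading and inspection steps above can be sketched as follows. This is a minimal illustration, not the original notebook: the column names, the sample rows, and the file name mentioned in the comment are all assumptions, and a small in-memory frame stands in for the real Excel file so the snippet runs on its own.

```python
import pandas as pd

# Hypothetical stand-in for the movie dataset. With the real file, this
# step would be something like pd.read_excel("movies.xlsx"), where the
# file name is an assumption.
movies_df = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "wiki_plot": ["A detective hunts a killer.",
                  "Two friends cross the desert.",
                  None],
    "imdb_plot": ["A cop chases a murderer.",
                  None,
                  "A family moves to a farm."],
})

# Column types and non-null counts
movies_df.info()

# Summary statistics for every column, including the text columns
print(movies_df.describe(include="all"))

# A simple EDA check: how many plots are missing in each column?
missing = movies_df.isna().sum()
print(missing)
```

Counting missing values early is useful here because a movie with no plot text in either field cannot be compared with the others.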
Tokenization is a way to split text into tokens. These tokens could be paragraphs, sentences, or individual words. NLTK provides a number of tokenizers in the tokenize module.
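As a small illustration of word-level tokenization, the sketch below uses two tokenizers from NLTK's tokenize module. The sample sentence is made up, and these two tokenizers were chosen only because they are regex-based and need no extra data; sent_tokenize and word_tokenize would also work but require a one-time download of NLTK's Punkt resources.

```python
from nltk.tokenize import RegexpTokenizer, wordpunct_tokenize

# Made-up sample text for illustration
text = "Vito's daughter is getting married. Guests arrive."

# Keep only runs of word characters, dropping punctuation entirely
word_tokens = RegexpTokenizer(r"\w+").tokenize(text)
print(word_tokens)
# ['Vito', 's', 'daughter', 'is', 'getting', 'married', 'Guests', 'arrive']

# Split into word runs and punctuation runs, keeping the punctuation
punct_tokens = wordpunct_tokenize(text)
print(punct_tokens)
# ['Vito', "'", 's', 'daughter', 'is', 'getting', 'married', '.', 'Guests', 'arrive', '.']
```

Which tokenizer fits best depends on whether punctuation should survive as tokens; for clustering plots, dropping it is usually enough.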
- We then created a function that performs both tokenization and stemming.
- We imported the KMeans class from scikit-learn to perform the clustering.
- We computed the cosine similarity between the plots and derived a similarity distance from it.
- We plotted a dendrogram, from which we obtained the different clusters of movies.
- We printed the movies that were most similar to each other.
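
The steps in this list can be sketched end to end as follows. This is one possible arrangement under stated assumptions, not the original notebook: the four titles and plots are invented, TF-IDF is assumed as the text vectorization, the similarity distance is taken to be 1 minus the cosine similarity, complete linkage is an arbitrary choice, and the dendrogram is computed with no_plot=True so the snippet runs without a plotting backend.

```python
import numpy as np
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import RegexpTokenizer
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented sample data standing in for the movie dataset
titles = ["Crime Story", "Night Cop", "Star Voyage", "Red Planet Run"]
plots = [
    "A detective hunts a ruthless killer through the city streets.",
    "A city detective chases a killer and hunts for clues.",
    "Astronauts travel through space to explore a distant planet.",
    "A space crew of astronauts explores a dangerous planet.",
]

stemmer = SnowballStemmer("english")
tokenizer = RegexpTokenizer(r"[A-Za-z]+")


def tokenize_and_stem(text):
    """Split text into alphabetic tokens and reduce each one to its stem."""
    return [stemmer.stem(token) for token in tokenizer.tokenize(text)]


# Turn each plot into a TF-IDF vector using the custom tokenizer
vectorizer = TfidfVectorizer(tokenizer=tokenize_and_stem, token_pattern=None)
tfidf_matrix = vectorizer.fit_transform(plots)

# K-means clustering on the TF-IDF vectors (2 clusters is an assumption)
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(tfidf_matrix)

# Cosine similarity between plots, converted to a distance
similarity_distance = 1 - cosine_similarity(tfidf_matrix)
similarity_distance = np.clip(similarity_distance, 0, None)
np.fill_diagonal(similarity_distance, 0)

# Hierarchical clustering expects a condensed distance vector
condensed = squareform(similarity_distance, checks=False)
mergings = linkage(condensed, method="complete")

# Compute the dendrogram layout; drop no_plot=True to draw it with matplotlib
tree = dendrogram(mergings, labels=titles, orientation="left", no_plot=True)
print("Leaf order:", tree["ivl"])

# For each movie, print its cluster and its closest other movie
masked = similarity_distance.copy()
np.fill_diagonal(masked, np.inf)
nearest = masked.argmin(axis=1)
for i, title in enumerate(titles):
    print(f"{title} (cluster {labels[i]}) is most similar to {titles[nearest[i]]}")
```

Passing the condensed vector to linkage, rather than the square matrix, makes SciPy treat the values as pairwise distances instead of as feature rows.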