TASK 5: Movie-Similarity

Find movie similarity from plot summaries using K-means clustering and dendrograms.

What is K-means?

K-means is an unsupervised learning algorithm that groups similar data points into clusters. This task shows how to use machine learning to cluster movies based on the similarity of their plots, using the 'wiki plot' and the 'imdb plot' in the dataset.
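As a rough illustration of the idea, here is a minimal sketch of K-means on a handful of made-up 2-D points using scikit-learn. The data, the choice of two clusters, and the random seed are assumptions for the example, not values taken from this task.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D points forming two obvious groups (made-up data for illustration).
points = np.array([
    [1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
    [8.0, 8.2], [7.9, 8.1], [8.3, 7.7],
])

# n_clusters is chosen by hand here; random_state makes the run repeatable.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)

print(labels)                   # cluster index assigned to each point
print(kmeans.cluster_centers_)  # one centroid per cluster
```

Points that sit close together receive the same label. Which group is called 0 and which is called 1 is arbitrary.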

METHOD

  1. We imported the libraries needed for the data analysis.
  2. We read the movie data from an Excel (.xlsx) file into a DataFrame.
  3. We inspected the data with the info and describe methods.
  4. We performed exploratory data analysis (EDA).
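The loading and inspection steps might look roughly like the sketch below. It assumes pandas is the library in use. The file name in the comment is hypothetical, and a tiny hand-made DataFrame with placeholder titles and plots stands in for the real spreadsheet so the snippet runs on its own. The column names are guesses based on the 'wiki plot' and 'imdb plot' fields mentioned above.

```python
import pandas as pd

# With the real spreadsheet this would be something like:
# movies_df = pd.read_excel("movies.xlsx")   # hypothetical file name
# A small stand-in frame (placeholder data) keeps the example self-contained.
movies_df = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "wiki_plot": ["A detective hunts a killer.", "Two friends cross a desert.", None],
    "imdb_plot": ["A cop chases a murderer.", "A road trip through the desert.", "A heist goes wrong."],
})

movies_df.info()                           # column types and non-null counts
print(movies_df.describe(include="all"))   # summary statistics per column

# A simple EDA check: how many values are missing in each column?
missing = movies_df.isna().sum()
print(missing)
```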

Tokenization

Tokenization is a way to split text into tokens. These tokens could be paragraphs, sentences, or individual words. NLTK provides a number of tokenizers in the tokenize module.
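For a feel of what tokenization produces, here is a small sketch using two NLTK tokenizers that work without extra downloads. The example sentence and the regular expressions are made up for illustration. NLTK's sent_tokenize and word_tokenize are common alternatives, but they need a one-time download of the Punkt models.

```python
from nltk.tokenize import RegexpTokenizer, wordpunct_tokenize

plot = "The detective vanished. His partner searched the city!"

# Split into runs of word characters and runs of punctuation.
word_punct_tokens = wordpunct_tokenize(plot)

# Keep alphabetic words only, dropping punctuation.
words = RegexpTokenizer(r"[A-Za-z]+").tokenize(plot)

# A rough sentence splitter: text up to and including ., ! or ?
sentences = [s.strip() for s in RegexpTokenizer(r"[^.!?]+[.!?]").tokenize(plot)]

print(word_punct_tokens)
print(words)
print(sentences)
```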

  1. We created a function that performs both tokenization and stemming.
  2. We imported the KMeans class from scikit-learn to perform the clustering.
  3. We computed the cosine similarity between the plots and derived a similarity distance from it.
  4. We plotted a dendrogram of the similarity distances, from which we obtained the different clusters of movies.
  5. We printed the movies that were most similar to each other.
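The steps above can be sketched end to end as follows. This is a minimal illustration under several assumptions, not the exact code used in this task: the titles and plots are made-up placeholders, TF-IDF is assumed as the text vectorizer, the similarity distance is taken to be 1 minus the cosine similarity, and complete linkage is an arbitrary choice for the dendrogram.

```python
import re

import numpy as np
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import wordpunct_tokenize
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder data standing in for the movie dataset.
titles = ["Movie A", "Movie B", "Movie C", "Movie D"]
plots = [
    "A detective hunts a serial killer in a rainy city.",
    "A rookie detective chases a killer through the city.",
    "Astronauts repair a damaged spaceship near Mars.",
    "A spaceship crew of astronauts is stranded on Mars.",
]

stemmer = SnowballStemmer("english")

def tokenize_and_stem(text):
    """Split text into tokens, keep those containing letters, and stem them."""
    tokens = wordpunct_tokenize(text.lower())
    return [stemmer.stem(t) for t in tokens if re.search("[a-z]", t)]

# Turn each plot into a TF-IDF vector (assumed vectorizer).
vectorizer = TfidfVectorizer(tokenizer=tokenize_and_stem, token_pattern=None)
tfidf = vectorizer.fit_transform(plots)

# Cluster the plot vectors; two clusters is a hand-picked value.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(tfidf)

# Assumed definition: similarity distance = 1 - cosine similarity.
distance = np.clip(1.0 - cosine_similarity(tfidf), 0.0, None)
np.fill_diagonal(distance, 0.0)  # remove floating-point noise on the diagonal

# Hierarchical clustering on the condensed distance matrix.
merges = linkage(squareform(distance, checks=False), method="complete")

# no_plot=True returns the tree layout without drawing; drop it (with
# matplotlib installed) to draw the dendrogram itself.
tree = dendrogram(merges, labels=titles, orientation="left", no_plot=True)

# Most similar movie to the first one: smallest distance after itself.
closest = int(np.argsort(distance[0])[1])
print(f"{titles[0]} is most similar to {titles[closest]}")
print("Leaf order:", tree["ivl"])
print("K-means labels:", labels)
```

Converting the square distance matrix to condensed form with squareform makes linkage treat the values as pairwise distances rather than as feature rows.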