TASK 5: Movie-Similarity

Find movie similarity from plot summaries using K-means clustering and dendrograms.

What is K-means?

K-means is an unsupervised learning algorithm that groups similar data points into clusters. This task shows how to use machine learning to cluster movies by plot similarity, based on the 'wiki plot' and 'imdb plot' text in the dataset.
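To make the idea concrete, here is a minimal sketch of the K-means loop on a handful of 1-D points, written with the standard library only. It is an illustration, not the code used in this task: the function name kmeans_1d, the sample points, and the starting centroids are all made up for the example.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Illustrative K-means on 1-D points, starting from the given centroids."""
    centroids = list(centroids)
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: abs(p - centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = sum(members) / len(members)
    return labels, centroids


# Hypothetical data: two obvious groups, around 1.5 and around 10.
points = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
labels, centroids = kmeans_1d(points, centroids=[0.0, 5.0])
print(labels)     # cluster index assigned to each point
print(centroids)  # final cluster centers
```

In the task itself the points are not single numbers but vectors built from the movie plots, and the clustering is done with scikit-learn rather than by hand.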

METHOD

We imported the necessary libraries for data analysis.
We read the movie data into a dataframe using the read_excel method.
We inspected the data with the info and describe methods.
We performed some exploratory data analysis (EDA).

Tokenization

Tokenization is a way to split text into tokens. These tokens could be paragraphs, sentences, or individual words. NLTK provides a number of tokenizers in the tokenize module.
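As a small illustration of word-level tokenization, the sketch below uses two NLTK tokenizers that work without downloading extra NLTK data. The sample sentence is invented, and the choice of tokenizers is an assumption for the example; the notebooks in this task may use different ones (sentence tokenizers, for instance, need additional NLTK data to be downloaded first).

```python
from nltk.tokenize import RegexpTokenizer, wordpunct_tokenize

# Hypothetical plot snippet, used only for illustration.
plot = "Michael returns home. He joins the family business!"

# Keep runs of word characters only, dropping punctuation.
word_tokens = RegexpTokenizer(r"\w+").tokenize(plot)

# Split into words and punctuation, keeping punctuation as separate tokens.
punct_tokens = wordpunct_tokenize(plot)

print(word_tokens)
print(punct_tokens)
```

Dropping punctuation is usually the more convenient choice when the tokens feed a clustering step, since punctuation carries little information about plot similarity.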

We then created a function that performs both tokenization and stemming.
We imported the KMeans class from scikit-learn to perform the clustering.
We computed the cosine similarity and derived the similarity distance from it.
We plotted a dendrogram, from which we obtained the different clusters of movies.
We printed the movies that were similar.
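The steps above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the code from the notebooks: the four titles and plots are invented, TF-IDF is assumed as the way plots are turned into vectors, the number of clusters is set to 2 for the toy data, and the helper name tokenize_and_stem is hypothetical. The dendrogram is computed with no_plot=True so the example runs without matplotlib; drop that argument (with matplotlib installed) to draw it.

```python
import numpy as np
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import RegexpTokenizer
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-in data; the task reads real plots from the movie dataset.
titles = ["Space A", "Space B", "Heist A", "Heist B"]
plots = [
    "An astronaut crew travels through space to a distant planet.",
    "A lone astronaut drifts in space while orbiting a planet.",
    "A gang of thieves plans a bank robbery and steals the money.",
    "Thieves rob a bank, but the robbery fails and the money vanishes.",
]

word_tokenizer = RegexpTokenizer(r"[a-z]+")
stemmer = SnowballStemmer("english")


def tokenize_and_stem(text):
    """Lowercase the text, split it into words, and stem each word."""
    return [stemmer.stem(tok) for tok in word_tokenizer.tokenize(text.lower())]


# Turn each plot into a TF-IDF vector (assumed vectorization step).
vectorizer = TfidfVectorizer(tokenizer=tokenize_and_stem, token_pattern=None)
tfidf = vectorizer.fit_transform(plots)

# Cluster the plot vectors with K-means.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
clusters = km.fit_predict(tfidf)

# Similarity distance: 1 - cosine similarity, cleaned of rounding noise.
distance = np.clip(1 - cosine_similarity(tfidf), 0, None)
np.fill_diagonal(distance, 0)

# Hierarchical clustering on the condensed distance matrix.
Z = linkage(squareform(distance, checks=False), method="complete")
tree = dendrogram(Z, labels=titles, orientation="left", no_plot=True)

# For each movie, print the closest other movie.
masked = distance.copy()
np.fill_diagonal(masked, np.inf)
nearest = masked.argmin(axis=1)
for i, j in enumerate(nearest):
    print(f"{titles[i]} is most similar to {titles[j]}")
```

One design note: linkage is given the condensed form of the distance matrix via squareform, because passing the square matrix directly would make SciPy treat each row as an observation rather than as precomputed distances.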

