InformationDiffusion

M1 - Internship - Lab HC - Information Diffusion in Online Communities

##Main contributions##

Longitudinal analysis of discussions around Brexit on two social media platforms: Twitter and Reddit;
Tool for visualizing the dynamics of discussion topics and trajectory of users;
Prediction of future political stance based on features defined using the structure of online diffusions.

##Datasets##

https://www.dropbox.com/s/xx9y6setca7vz0q/Data.zip?dl=0

data_circulate/data_dt_brexittweets.Rdata - the full Twitter dataset used for training the NB classifier
data_circulate/data_dt_brexittweets_sample.Rdata - a sample of the Twitter dataset used for training the NB classifier
data_tabulation - results of the prediction of neutral users (not important)
correct_model_no_hashtags.rds - the NB classifier used for labelling the political stance
diffusions_comments_extra.csv - the Reddit dataset (the comments)
diffusions_submissions_extra.csv - the Reddit dataset (the initial submissions, i.e. the roots of the diffusions)
Fx_improved_data.RData - the training dataset built using features extracted from the structure of the diffusions
nb_with_hashtags_july.rds - an NB classifier trained on the Twitter dataset and intended for labelling Twitter data (it takes into account Twitter-specific information such as hashtags and mentions)

##Content##

###Code Files###

2019.04.02_statistics.R
- Read the input Reddit dataset
- Split the posts set into time-periods
- Aggregate all textual content of the dataset and build a document-term matrix (DTM), then build word clouds for each period and for the whole interval
- Plot heatmap with the evolution of the topics in the 15 time-periods
- Plot two timeframes in the same graphic
- Plot transitions of common users between different time periods
- Find and plot the leaders of two different time periods (leader = user with a high number of posts)
- Structural analysis of the Reddit dataset: density of posts, number of unique users, number of submissions, posts etc., and CCDF plots for the number of comments per diffusion
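As a rough illustration of the CCDF plots mentioned above, the following minimal Python sketch computes the empirical CCDF of the number of comments per diffusion. The repository script itself is written in R; the function name `ccdf`, the convention P(X >= x), and the sample counts are assumptions made for this example only.

```python
from collections import Counter

def ccdf(values):
    """Return sorted (x, P(X >= x)) pairs for the empirical distribution of values."""
    n = len(values)
    counts = Counter(values)
    points = []
    remaining = n  # number of observations greater than or equal to the current x
    for x in sorted(counts):
        points.append((x, remaining / n))
        remaining -= counts[x]
    return points

# Hypothetical comment counts for five diffusions (not taken from the dataset).
comments_per_diffusion = [1, 1, 2, 5, 10]
for x, p in ccdf(comments_per_diffusion):
    print(x, p)
```

Plotting these pairs on log-log axes is a common way to inspect heavy-tailed diffusion sizes.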
2019.04.19_distances_for_tsne.R
- different tests on how we should represent the users using T-SNE in the Terms Space (w or w/o PCA, w or w/o tf-idf)
- not crucial to understand as they don't affect the Machine Learning part.
2019.04.23_PCA_on_covariance_matrix.R
- represent the users in the term space using PCA instead of T-SNE; this turned out not to be very successful
2019.04.25_LDA_users_and_terms_in_topics_space.R
- split the users into periods, aggregate users' replies, build DTMs, apply LDA, and represent the users in a 10-dimensional space corresponding to the probability distribution of topics over the users' discourses
- print the most important terms of each topic
- apply T-SNE to reduce dimensionality, then plot users and terms in the same 2D space
2019.04.26_CLUSTERING_users_in_word_space.R
- split the users into periods, aggregate users' replies, build DTMs, use T-SNE to reduce dimensionality, then plot and cluster the users in the terms space (using k-medoids and hdbscan)
2019.04.26_TSNE_users_in_words_space.R
- split the users into periods, aggregate users' replies, build DTMs, use T-SNE to reduce dimensionality, plot users
2019.04.29_LDA_k_cross_validation_to_detect_topic_numbers.R
- for each time period perform k-fold CV to detect the optimal number of topics that should be sought via LDA (experimental)
2019.05.01_CLUSTERING_users_in_topic_space.R
- split the users into periods, aggregate users' replies, build DTMs, apply LDA, obtain the representation of the users in topic space, use T-SNE to reduce dimensionality, then plot and cluster the users in the topic space (using k-medoids and hdbscan)
2019.05.06_LDA_distances_between_topics.R
- for a certain period, aggregate users texts and apply LDA to obtain the most important topics
- use KL divergence to compute the distance between the topics and build a heatmap with these distances
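A minimal Python sketch of the distance computation described above, assuming each topic is given as a probability distribution over the same vocabulary. The function names, the `eps` smoothing term, and the toy topics are assumptions for illustration; the R script may differ, for example by symmetrizing the divergence.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    # eps guards against division by zero when q assigns zero mass to a term.
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def distance_matrix(topics):
    """Pairwise KL divergences between topic-term distributions."""
    return [[kl_divergence(p, q) for q in topics] for p in topics]

# Hypothetical topic-term distributions over a three-word vocabulary.
topics = [
    [0.7, 0.2, 0.1],
    [0.1, 0.2, 0.7],
    [0.7, 0.2, 0.1],
]
for row in distance_matrix(topics):
    print([round(d, 3) for d in row])
```

KL divergence is not symmetric, so the resulting matrix generally is not either; a heatmap built from it should be read row by row.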
2019.05.07_LDA_parallel_distances_between_topics.R
- same as 2019.05.06_LDA_distances_between_topics.R, but run in parallel
2019.05.08_LDA_determine_subjects_of_clusters.R
- represent users in the topic space using LDA and cluster the users in this new space using K-medoids
- determine the most important clusters and get the thematics in each of them
2019.05.15_SENTIMENT_ANALYSIS_SentimentR.R
- try to polarize the users via sentiment analysis with the SentimentR library - did not work.
2019.05.15_SENTIMENT_ANALYSIS_Vader.R
- try to polarize the users via sentiment analysis with the VadeR library - did not work.
2019.05.24_Twitter_Political_Stance_Detector.R
- using the Twitter Dataset and Ken Benoit's methodology, a political stance labeler is trained
- aggregate each user's tweets and filter out users who do not use the targeted hashtags (hashtags in favour of or against Brexit)
- filter out users who did not send at least a minimum threshold number of tweets
- compute a leave score for each user as the difference between the number of leave hashtags and the number of remain hashtags, and sort all users according to this score
- keep the top 10% and bottom 10% as the training set
- train a Naive Bayes classifier on this smaller corpus
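The scoring and selection steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the hashtag sets, the user data, and the function names are hypothetical, and the filtering and Naive Bayes training steps are left out.

```python
def leave_score(hashtags, leave_tags, remain_tags):
    """Number of leave hashtags minus number of remain hashtags used by one user."""
    leave = sum(1 for h in hashtags if h in leave_tags)
    remain = sum(1 for h in hashtags if h in remain_tags)
    return leave - remain

def select_training_users(user_hashtags, leave_tags, remain_tags, fraction=0.10):
    """Return (top, bottom) users by leave score, each holding `fraction` of all users."""
    scores = {u: leave_score(tags, leave_tags, remain_tags)
              for u, tags in user_hashtags.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return ranked[:k], ranked[-k:]

# Hypothetical hashtag sets and users (not taken from the dataset).
LEAVE_TAGS = {"#voteleave"}
REMAIN_TAGS = {"#strongerin"}
users = {f"u{i}": ["#voteleave"] * i + ["#strongerin"] * (9 - i) for i in range(10)}

leave_users, remain_users = select_training_users(users, LEAVE_TAGS, REMAIN_TAGS)
print(leave_users, remain_users)
```

The two returned groups would then serve as the labelled corpus on which a classifier is trained.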
2019.07.16_Reddit_Classify_F0.R
- manufacture textual features in order to train the political stance predictor on the Reddit Dataset
- aggregate every users replies and build a DTM, then obtain the 100 most frequent overall terms - this will be the features vocabulary
- for every two consecutive time periods, get the users common to both periods: from the first period, take the training features (the frequencies of the terms in the features vocabulary) and the current political stance of every common user; from the second period, take the future political stance
- after the first phase, filter the resulting training set: remove the training examples that show a Neutral-to-Neutral transition and whose training features appear only once
- save the resulting dataset in a file named: F0_improved_data.csv
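The pairing of consecutive time periods described above can be sketched in Python as below. The data layout (one dictionary per period mapping a user to a feature vector and a stance), the stance labels, and the function name are assumptions for illustration; the filtering step and the CSV export are not shown.

```python
def build_training_rows(periods):
    """Pair each period with the next one and emit one row per common user.

    periods: list of dicts, one per time period, mapping user -> (features, stance).
    Each row is (user, features_in_first_period, current_stance, future_stance).
    """
    rows = []
    for current, following in zip(periods, periods[1:]):
        # Only users active in both periods can provide a future stance.
        for user in sorted(current.keys() & following.keys()):
            features, current_stance = current[user]
            _, future_stance = following[user]
            rows.append((user, features, current_stance, future_stance))
    return rows

# Hypothetical term-frequency features and stances for three periods.
periods = [
    {"alice": ([3, 0], "Neutral"), "bob": ([1, 2], "Leave")},
    {"alice": ([2, 1], "Remain"), "carol": ([0, 4], "Remain")},
    {"alice": ([1, 1], "Remain"), "carol": ([0, 2], "Leave")},
]
for row in build_training_rows(periods):
    print(row)
```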
2019.05.20_Reddit_Classify_F1.R
- build FS1 (which considers the user activity information: number of received replies, number of submitted replies, number of initiated threads and number of submitted comments)
2019.06.05_Reddit_Classify_F2.R
- build FS2 (which considers the user activity per group: number of received replies from each group, number of submitted replies to each group, number of initiated threads and number of submitted comments)
2019.05.30_Reddit_Classify_F3.R
- build FS3 (which considers the structure of the diffusions a user is part of - ratio of users from each group in the diffusions)
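One possible reading of the FS3 ratio feature, sketched in Python: for a single diffusion, count the distinct participants in each stance group and divide by the number of distinct participants. The group labels, the function name, and the choice to count each user once per diffusion are assumptions, not details confirmed by the repository.

```python
from collections import Counter

def group_ratios(participants, stance_of, groups=("Leave", "Remain", "Neutral")):
    """Share of distinct participants of one diffusion that belong to each group."""
    counts = Counter(stance_of[u] for u in set(participants))
    total = sum(counts.values())
    return {g: counts[g] / total if total else 0.0 for g in groups}

# Hypothetical diffusion: user "a" comments twice but is counted once.
stance_of = {"a": "Leave", "b": "Leave", "c": "Remain", "d": "Neutral"}
participants = ["a", "b", "c", "d", "a"]
print(group_ratios(participants, stance_of))
```

Per-user features could then be obtained by aggregating these ratios over all diffusions a user takes part in.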
2019.06.11_Reddit_Build_Mixed_Features.R
- different combinations of the above feature sets
2019.06.17_Tabulation_Of_Users_Per_period.R
- (experimental) - for detecting the ratio of neutral users who have transitions to the pro / against Brexit side
2019.07.04_Twitter_Dataset_Statistics.R
- the same exploratory analysis that was performed on the Reddit dataset is now performed on the Twitter dataset.
2019.07.10_Twitter_FS1_Building.R
- building FS1 - Twitter version - initiated threads = original tweets, comments = retweets
2019.07.16_Twitter_FS2_Building.R
- building FS2
Python/ReditCrawler/crawlReddit.py
- the python script used for obtaining the Reddit dataset
Python/RunClassifiers - the models trained in Python for performing the prediction

###Reports###

contains the PDFs with the partial reports delivered throughout the internship

###Final Documents###

M1_Report_MardaleAndrei_LabHC.pdf - final report describing this study in detail
M1_Presentation_MardaleAndrei_LabHC.pptx - the presentation slides for defending the project
M1_Poster_MardaleAndrei_LabHC.pdf - the poster for the 2nd Manutech conference
