Documentation and data for the paper
Note: All code is made available under the Apache License, Version 2.0.
Run the following in order to re-create the analysis and data used in section 4.3:
-
Assembly of the CCC corpus dataset:
see folder
Produce data.csv, a CSV of CCC corpus data merged by sample, with text analysis of the sample contexts.
see more information for details.
-
Assess significant associations at token level:
see folder
Produce p_response_given_context.csv, a CSV of statistics on the co-occurrence of each token with the majority-vote contentious/non-contentious label.
see more information for details.
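As an illustration of what a token-level association statistic can look like, the sketch below takes per-sample tokens and annotator labels, reduces the labels to a majority vote, and computes for each token the share of its samples voted contentious together with a one-sided Fisher's exact p-value. This is an assumption-laden example, not the repository's script: the function names, the dictionary keys, the tie handling (tied samples are dropped) and the choice of Fisher's exact test are all hypothetical, and the actual columns of p_response_given_context.csv may differ.

```python
from collections import Counter
from math import comb


def majority_vote(labels):
    """Return the most common annotator label, or None if the top two tie."""
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows: token present / token absent.
    Columns: contentious / non-contentious.
    Returns the probability of seeing at least `a` contentious samples
    among the samples containing the token, with all margins fixed.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k)
               for k in range(a, upper + 1))
    return tail / comb(n, col1)


def token_statistics(samples):
    """Compute per-token statistics from (tokens, annotator_labels) pairs.

    Assumption: samples whose annotators tie are excluded from the counts.
    """
    voted = [(set(tokens), majority_vote(labels)) for tokens, labels in samples]
    voted = [(tokens, vote) for tokens, vote in voted if vote is not None]
    n_cont = sum(1 for _, vote in voted if vote == "contentious")
    n_non = len(voted) - n_cont

    with_token = Counter()       # samples containing the token
    cont_with_token = Counter()  # ... that were also voted contentious
    for tokens, vote in voted:
        for tok in tokens:
            with_token[tok] += 1
            if vote == "contentious":
                cont_with_token[tok] += 1

    stats = {}
    for tok, total in with_token.items():
        a = cont_with_token[tok]
        b = total - a
        stats[tok] = {
            "p_contentious_given_token": a / total,
            "p_value": fisher_exact_greater(a, b, n_cont - a, n_non - b),
        }
    return stats


if __name__ == "__main__":
    # Hypothetical toy data: token names and labels are placeholders.
    toy = [
        (["alpha", "beta"], ["contentious", "contentious", "non-contentious"]),
        (["alpha"], ["contentious", "contentious"]),
        (["beta", "gamma"], ["non-contentious", "non-contentious", "contentious"]),
        (["gamma"], ["non-contentious"]),
    ]
    for token, row in sorted(token_statistics(toy).items()):
        print(token, row)
```

With many tokens tested at once, a multiple-comparison correction would normally be applied to the p-values; which correction the analysis uses, if any, is not stated here.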
-
Build an embeddings set, a reduced set of 2D embeddings via UMAP, and a hierarchical clustering matrix via UMAP:
see folder; see more information for further details
Run:
python3 PIPELINE.py
-
View selected tokens (significantly associated with contentious or non-contentious samples) on a t-SNE reduced embedding space:
see folder; see tsne_cluster.py for details
Run:
python3 tsne_cluster.py
-
Assess significant associations of selected hierarchical clusters with contentious and non-contentious majority-vote samples:
see folder; see PIPELINE.py for details
Run:
python3 PIPELINE.py
Refer to investigated clusters for the scripts used to assess the statistical association of 'cleaned' token groups.