K-CAP_2021

Documentation and data for the paper

Note: All code is made available under the Apache License, Version 2.0.

Investigating Annotator Agreement

The script used to calculate annotator agreement, and additional files used for the results analysis.
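
The agreement statistic computed by the script is not stated here. As a minimal sketch, assuming every sample was labelled by the same number of annotators and that a chance-corrected statistic such as Fleiss' kappa is of interest, agreement could be computed along these lines. The function name `fleiss_kappa` and the label strings are illustrative and are not taken from the repository.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of per-sample label lists.

    Assumption: each inner list holds one label per annotator and all
    samples have the same number of annotators (at least two).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(r) != n_raters for r in ratings):
        raise ValueError("every sample needs the same number (>= 2) of labels")

    categories = sorted({label for r in ratings for label in r})
    counts = [Counter(r) for r in ratings]

    # Share of all label assignments that fall into each category.
    p_cat = [
        sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories
    ]
    # Observed agreement per sample: fraction of agreeing annotator pairs.
    p_item = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)

    if p_expected == 1.0:
        # Only one category was ever used; agreement is trivially perfect.
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


if __name__ == "__main__":
    # Hypothetical labels from three annotators on four samples.
    example = [
        ["contentious", "contentious", "contentious"],
        ["non-contentious", "non-contentious", "contentious"],
        ["non-contentious", "non-contentious", "non-contentious"],
        ["contentious", "contentious", "non-contentious"],
    ]
    print(round(fleiss_kappa(example), 3))
```

A value of 1 indicates perfect agreement, 0 indicates agreement at chance level, and negative values indicate agreement below chance.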

Investigating Context

Run the following steps, in order, to re-create the analysis and data used in Section 4.3:

Assembly of the CCC corpus dataset:

See folder.

Produces data.csv, a CSV of CCC corpus data merged by sample, with text analysis of the sample contexts.

See more information for details.

Assess significant associations at token level:

See folder.

Produces p_response_given_context.csv, a CSV of statistics on the co-occurrence of tokens with majority-vote contentious/non-contentious labels.
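
The statistics behind this file can be illustrated with a small sketch. It assumes that each sample carries a set of context tokens and a majority-vote label, and it uses a two-sided Fisher's exact test on the 2x2 co-occurrence table as a stand-in significance test; the test actually used by the repository scripts is not stated here. All function names, dictionary keys, and label strings below are hypothetical.

```python
from collections import Counter
from math import comb


def majority_vote(votes):
    """Return the label chosen by a strict majority of annotators, else None."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum all tables that are at most as likely as the observed one.
    p = sum(
        prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9)
    )
    return min(1.0, p)


def token_association(samples, token, target="contentious"):
    """samples: iterable of (set_of_context_tokens, majority_label) pairs."""
    a = b = c = d = 0
    for tokens, label in samples:
        present, hit = token in tokens, label == target
        if present and hit:
            a += 1  # token present, target label
        elif present:
            b += 1  # token present, other label
        elif hit:
            c += 1  # token absent, target label
        else:
            d += 1  # token absent, other label
    p_given_token = a / (a + b) if (a + b) else float("nan")
    return {
        "token": token,
        "p_target_given_token": p_given_token,
        "p_value": fisher_exact_two_sided(a, b, c, d),
    }


if __name__ == "__main__":
    # Hypothetical samples: (context tokens, annotator votes).
    raw = [
        ({"x", "z"}, ["contentious", "contentious", "non-contentious"]),
        ({"y"}, ["non-contentious", "non-contentious", "non-contentious"]),
    ]
    labelled = [(tokens, majority_vote(votes)) for tokens, votes in raw]
    print(token_association(labelled, "x"))
```

With many tokens tested at once, the resulting p-values would normally need a multiple-comparison correction before being read as significant.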

See more information for details.

Build an embedding set, a reduced set of 2D embeddings via UMAP, and a hierarchical clustering matrix via UMAP:

See folder; see more information for further details.

Run:
```
python3 PIPELINE.py
```
View selected tokens (those significantly associated with contentious or non-contentious samples) in a t-SNE reduced embedding space:

See folder; see tsne_cluster.py for details.

Run:
```
python3 tsne_cluster.py
```
Assess significant associations of selected hierarchical clusters with contentious and non-contentious majority-vote samples:

See folder; see PIPELINE.py for details.

Run:
```
python3 PIPELINE.py
```
Refer to investigated clusters for the scripts used to assess the statistical association of 'cleaned' token groups.