K-CAP_2021

Documentation and data for the paper

Note: All code is made available under the Apache License, Version 2.0.

Investigating Annotator Agreement

Contains the script used to calculate annotator agreement, along with additional files used for the results analysis.
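As an illustration of what an agreement calculation can look like, here is a minimal sketch of Fleiss' kappa in plain Python. This README does not say which agreement measure the repository's script uses, so the choice of Fleiss' kappa, the function name `fleiss_kappa`, and the example labels are all assumptions made for illustration only.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of samples, each a list of annotator labels.

    Hypothetical helper: every sample must carry the same number of labels.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(r) != n_raters for r in ratings):
        raise ValueError("each sample needs the same number (>= 2) of labels")

    counts = [Counter(r) for r in ratings]
    categories = sorted({label for r in ratings for label in r})

    # Observed agreement: proportion of agreeing annotator pairs per sample.
    p_items = [
        (sum(c[k] ** 2 for k in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_bar = sum(p_items) / n_items

    # Expected agreement from the overall label proportions.
    p_cat = [sum(c[k] for c in counts) / (n_items * n_raters) for k in categories]
    p_e = sum(p * p for p in p_cat)

    if p_e == 1.0:
        # Only one label was ever used; agreement is trivially perfect.
        return 1.0
    return (p_bar - p_e) / (1.0 - p_e)


# Example labels are made up, not taken from the corpus.
example = [
    ["contentious", "contentious", "contentious"],
    ["non-contentious", "non-contentious", "non-contentious"],
]
kappa = fleiss_kappa(example)
```

With full agreement on every sample, as in `example`, the observed agreement is 1 and kappa is 1; systematic disagreement pushes it below 0.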

Investigating Context

Run the following in order to re-create the analysis and data used in section 4.3:

  1. Assembly of the CCC corpus dataset:

    see folder

    Produces data.csv, a CSV of CCC corpus data merged by sample, with text analysis of the sample contexts.

    see more information for details.

  2. Assess significant associations at token level:

    see folder

    Produces p_response_given_context.csv, a CSV of statistics on co-occurrences between tokens and majority-vote contentious/non-contentious labels.

    see more information for details.

  3. Build an embeddings set, a reduced set of 2D embeddings via UMAP, and a hierarchical clustering matrix via UMAP:

    see folder; see more information for further details

    Run:

    python3 PIPELINE.py
    
  4. View selected tokens (significantly associated with contentious or non-contentious samples) on a t-SNE reduced embedding space:

    see folder; see tsne_cluster.py for details

    Run:

    python3 tsne_cluster.py
    
  5. Assess significant associations of selected hierarchical clusters with contentious and non-contentious majority-vote samples:

    see folder; see PIPELINE.py for details

    Run:

    python3 PIPELINE.py
    

    Refer to investigated clusters for the scripts used to assess the statistical association of 'cleaned' token groups.
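To make the token-level statistics in step 2 more concrete, the sketch below estimates the probability of a majority-vote label given that a token occurs in a sample's context. It is not the repository's script: the function name `p_response_given_token`, the input shape, and the example data are hypothetical, and the significance testing mentioned in step 2 is not covered here.

```python
from collections import Counter


def p_response_given_token(samples):
    """Estimate P(label | token) from (context_tokens, majority_label) pairs.

    Hypothetical helper: a token is counted at most once per sample.
    """
    token_total = Counter()
    token_label = Counter()
    for tokens, label in samples:
        for tok in set(tokens):
            token_total[tok] += 1
            token_label[(tok, label)] += 1
    return {
        (tok, label): n / token_total[tok]
        for (tok, label), n in token_label.items()
    }


# Made-up example data, not drawn from the CCC corpus.
samples = [
    (["old", "word"], "contentious"),
    (["old", "map"], "non-contentious"),
    (["word"], "contentious"),
]
probs = p_response_given_token(samples)
```

Here `probs[("old", "contentious")]` is 0.5, because "old" appears in one contentious and one non-contentious sample. Counting each token once per sample keeps long contexts from dominating the estimate, which is a design choice of this sketch rather than something the README specifies.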