
Here is our data folder structure!

.
└── data/
    ├── Sequence labeling-based version/
    │   ├── Syllable/
    │   │   ├── dev_BIO_syllable.csv
    │   │   ├── test_BIO_syllable.csv
    │   │   └── train_BIO_syllable.csv
    │   └── Word/
    │       ├── dev_BIO_Word.csv
    │       ├── test_BIO_Word.csv
    │       └── train_BIO_Word.csv
    ├── Span Extraction-based version/
    │   ├── dev.csv
    │   └── train.csv
    └── Test_data/
        └── test.csv

Sequence labeling-based version

Syllable

Description:

This folder contains the syllable-level data for the sequence labeling-based version of the task. The data is divided into train and dev files, plus a test file kept for reference only. Each file contains the following columns:
- index: The id of the word.
- word: The syllables of the sentence, obtained by tokenizing with the VnCoreNLP tokenizer and then splitting each token on underscores. Plain whitespace tokenization is not enough because some words are badly formatted: e.g. "điện.thoại của tôi" would be split into ["điện.thoại", "của", "tôi"] instead of the correct syllable sequence ["điện", "thoại", "của", "tôi"]. We therefore tokenize with VnCoreNLP first and then split the words into syllables, e.g. "điện.thoại của tôi" ---(VnCoreNLP)---> ["điện_thoại", "của", "tôi"] ---(split by "_")---> ["điện", "thoại", "của", "tôi"].
- tag: The tag of the syllable. The tag is either B-T (beginning of a hate and offensive span), I-T (inside a hate and offensive span), or O (outside any span).
The train_BIO_syllable and dev_BIO_syllable files are used for training and validation of the XLM-R model, respectively.
The test_BIO_syllable file is provided for reference only and is not used for testing the model. Please use the test.csv file in the Test_data folder for testing the model.
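
The second step of the tokenization described above (splitting word-segmented tokens on "_") can be sketched as follows. This is a minimal illustration, not the script used to build the dataset: the function name `split_into_syllables` is hypothetical, and the input list stands in for the output of a word segmenter such as VnCoreNLP, which is not called here.

```python
def split_into_syllables(segmented_words):
    """Split word-segmented tokens (syllables joined by "_") into syllables."""
    syllables = []
    for word in segmented_words:
        # A multi-syllable word such as "điện_thoại" becomes ["điện", "thoại"].
        # Empty pieces (e.g. from a stray leading "_") are dropped; this is an
        # assumption of the sketch, not documented dataset behaviour.
        syllables.extend(piece for piece in word.split("_") if piece)
    return syllables


# Hypothetical word-segmenter output for "điện.thoại của tôi".
segmented = ["điện_thoại", "của", "tôi"]
print(split_into_syllables(segmented))  # ['điện', 'thoại', 'của', 'tôi']
```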

Word

Description:

This folder contains the word-level data for the sequence labeling-based version of the task. The data is divided into train and dev files, plus a test file kept for reference only. Each file contains the following columns:
- index: The id of the word.
- word: The words of the sentence, as produced by the VnCoreNLP tokenizer.
- tag: The tag of the word. The tag is either B-T (beginning of a hate and offensive span), I-T (inside a hate and offensive span), or O (outside any span).
The train_BIO_Word and dev_BIO_Word files are used for training and validation of the PhoBERT model, respectively.
The test_BIO_Word file is provided for reference only and is not used for testing the model. Please use the test.csv file in the Test_data folder for testing the model.
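
To show how the B-T / I-T / O tags relate to spans, here is a small sketch that groups the tokens of one sentence into tagged spans. It assumes the tokens and tags of a single sentence have already been read into two parallel lists; the function name `bio_to_spans` and the placeholder tokens are hypothetical, and treating an I-T with no preceding B-T as O is a design choice of this sketch.

```python
def bio_to_spans(tokens, tags):
    """Group tokens tagged B-T / I-T into spans; return each span as a string."""
    spans = []
    current = []  # tokens of the span currently being built
    for token, tag in zip(tokens, tags):
        if tag == "B-T":
            # A new span starts; close the previous one if it is still open.
            if current:
                spans.append(current)
            current = [token]
        elif tag == "I-T" and current:
            # Continue the open span.
            current.append(token)
        else:
            # "O", or an "I-T" with no open span: close any open span.
            if current:
                spans.append(current)
            current = []
    if current:
        spans.append(current)
    return [" ".join(span) for span in spans]


# Placeholder tokens, for illustration only.
tokens = ["a", "b", "c", "d", "e"]
tags = ["O", "B-T", "I-T", "O", "B-T"]
print(bio_to_spans(tokens, tags))  # ['b c', 'e']
```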

Span Extraction-based version

Description:

This folder contains the data for the span extraction-based version of the task. The data is divided into two files: train and dev. Each file contains the following columns:
- content: The content of the sentence.
- index_spans: The character indices of the hate and offensive spans in the sentence, in the format [start, end], where start is the index of the first character of a span and end is the index of its last character.
The train and dev files are used for training and validation of the BiLSTM-CRF model, respectively.
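
As a sketch of how the [start, end] indices map back to text, the snippet below slices the spans out of a sentence. It assumes index_spans has already been parsed into a list of [start, end] pairs with an inclusive end index, as described above; how the value is serialized in the CSV is not specified here. The function name `extract_spans` and the example sentence are hypothetical.

```python
def extract_spans(content, index_spans):
    """Return the substring of `content` covered by each [start, end] pair."""
    # `end` is the index of the last character of the span, so the slice
    # must run to end + 1 to include it.
    return [content[start:end + 1] for start, end in index_spans]


# Placeholder sentence, for illustration only.
content = "abc def ghi"
print(extract_spans(content, [[4, 6]]))  # ['def']
```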
