Computing the number of words #34

danishpruthi · 2019-09-05T17:09:50Z

Most files share similar data reading code, like

Lines 18 to 22 in a9e8be5

    
    train = list(read_dataset("../data/classes/train.txt"))
    w2i = defaultdict(lambda: UNK, w2i)
    dev = list(read_dataset("../data/classes/test.txt"))
    nwords = len(w2i)
    ntags = len(t2i)

In most of the examples, the variable nwords is used as the effective vocabulary size, for instance when allocating the parameters of the embedding matrix:

nn4nlp-code/01-intro/cbow.py

Line 30 in a9e8be5

W_emb = model.add_lookup_parameters((nwords, EMB_SIZE)) # Word embeddings

However, the dev/test set likely contains many words that never occur in the training data. Because w2i is a defaultdict, looking up such a word adds it to w2i as a new key. Its value is mapped to UNK, but the key is still counted in len(w2i), which is likely not intended. Often this overcounting does not change the results, but it can be problematic in some cases.
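To make the overcounting concrete, here is a minimal sketch that mimics the reading code above on in-memory data. The toy sentences, the `tag ||| words` line format, and the `count_words` helper are assumptions made for illustration; they are not taken from the repository. It compares `len(w2i)` measured before and after the dev set is read.

```python
from collections import defaultdict

# Hypothetical toy data standing in for train.txt and test.txt
# (the "tag ||| words" line format is an assumption).
TRAIN_LINES = ["1 ||| a good movie", "0 ||| a bad movie"]
DEV_LINES = ["1 ||| a truly great film"]


def count_words():
    # While reading the training data, every new word gets a fresh index.
    w2i = defaultdict(lambda: len(w2i))
    UNK = w2i["<unk>"]  # index 0

    def read_dataset(lines):
        for line in lines:
            tag, words = line.lower().strip().split(" ||| ")
            yield [w2i[x] for x in words.split(" ")], tag

    train = list(read_dataset(TRAIN_LINES))

    # From here on, unseen words map to UNK, but each lookup of an
    # unseen word still inserts that word as a new key in the dict.
    w2i = defaultdict(lambda: UNK, w2i)

    nwords_before = len(w2i)  # counted before reading dev
    dev = list(read_dataset(DEV_LINES))
    nwords_after = len(w2i)   # counted after reading dev, as in the snippet above

    return nwords_before, nwords_after, dev


before, after, dev = count_words()
print(before, after)  # 5 8
print(dev)            # [([1, 0, 0, 0], '1')]
```

In this sketch, the three dev-only words inflate the count from 5 to 8 even though every index actually used stays below 5. One possible fix, under these assumptions, is to compute `nwords = len(w2i)` before the dev set is read.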
