notes.txt
PROBLEMS ENCOUNTERED SO FAR:
- preprocess1.py
(patched) preprocess2.py assumes the input file is composed of blocks
(delimited by \n\n), each containing one sentence (on 1 line) + its
annotation (drug mentions etc.).
However, some of the sentences span 2 lines, so the parsing fails. These are
the sentences that have a first part delimited by [] and the "real" sentence
on the following line.
For the moment, I only removed the \n manually so that the parsing is ok.
There are about 5 such sentences; just searching the text file for "]" is
enough to find and fix them.
THINGS TO DO:
- The code should be modified to handle this case automatically.
  Maybe simply look for "]\n" and change the \n into " "? (a sketch follows)
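A minimal sketch of that fix (assumptions: the raw file is plain UTF-8 text,
the file names are placeholders, and "]\n" only occurs at these broken line
breaks; the negative lookahead spares "]\n\n", i.e. real block boundaries):

    import re

    # Rejoin sentences that were split after a bracketed "[...]" prefix.
    # The (?!\n) lookahead keeps "]\n\n" block boundaries intact.
    with open("input.txt", encoding="utf-8") as f:
        text = f.read()

    fixed = re.sub(r"\]\n(?!\n)", "] ", text)

    with open("input_fixed.txt", "w", encoding="utf-8") as f:
        f.write(fixed)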
- main_train_val.py
No 95/5 train/validation split is done in the code; the files are simply read as given.
THINGS TO DO:
- (FIXED) create the files programmatically
- we have ~26K sentences, so it's probably not a problem to just choose the
  validation sentences randomly (without caring for the distribution of the
  classes); however, we should check that. No such check is mentioned in the
  paper. (a split sketch follows)
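A minimal sketch of such a split (file names are placeholders; the split is
purely random, i.e. exactly the unstratified version we still need to verify):

    import random

    random.seed(42)  # make the split reproducible

    # Read the \n\n-delimited blocks described above.
    with open("train_full.txt", encoding="utf-8") as f:
        blocks = [b for b in f.read().split("\n\n") if b.strip()]

    random.shuffle(blocks)
    cut = int(0.95 * len(blocks))

    with open("train95.txt", "w", encoding="utf-8") as f:
        f.write("\n\n".join(blocks[:cut]) + "\n")
    with open("val5.txt", "w", encoding="utf-8") as f:
        f.write("\n\n".join(blocks[cut:]) + "\n")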
For some reason, the last row of train95.txt contains only the sentence
(without the annotation), and this makes the main script crash.
For the moment I've just manually removed the last line.
THINGS TO DO:
- check why this happens and fix it (a defensive filter is sketched below)
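Until the cause is found, a defensive reader could simply drop malformed
blocks (assumption: a well-formed block has at least 2 lines, sentence +
annotation; check this against the real format):

    def read_blocks(path):
        # Keep only blocks that have both a sentence and annotation line(s).
        with open(path, encoding="utf-8") as f:
            blocks = [b for b in f.read().split("\n\n") if b.strip()]
        good = [b for b in blocks if len(b.splitlines()) >= 2]
        if len(good) != len(blocks):
            print("warning: dropped", len(blocks) - len(good),
                  "malformed block(s) in", path)
        return good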
There is a problem in test2 with a sentence in which (unclear why) there is
no drugb (apparently they assigned N to other drugs, so not sure what
happened). I manually removed the sentence ("The objective of this study...")
and left only the first occurrence.
THINGS TO DO:
- figure out what is actually going on there and fix it (a filter for such
  sentences is sketched below)
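A sketch of that filter (the "druga"/"drugb" markers are an assumption about
the annotation format; adjust to whatever identifiers the files really use):

    def has_both_drugs(block):
        # Assumed markers; swap in the real annotation identifiers.
        lowered = block.lower()
        return "druga" in lowered and "drugb" in lowered

    def filter_pairs(blocks):
        kept = [b for b in blocks if has_both_drugs(b)]
        print("removed", len(blocks) - len(kept),
              "block(s) missing a drug mention")
        return kept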
EMBEDDINGS:
- https://github.com/chop-dbhi/drug_word_embeddings
- http://bio.nlplab.org/
- https://github.com/jakerochlinmarcus/biomedical-word-embeddings
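For reference, a minimal loading sketch with gensim (the file name is a
placeholder; bio.nlplab.org ships word2vec-format binaries, the other two
repos may use different formats):

    from gensim.models import KeyedVectors

    # Placeholder file name; point this at the downloaded vectors.
    wv = KeyedVectors.load_word2vec_format("pubmed_vectors.bin", binary=True)
    print(wv.vector_size)                   # embedding dimensionality
    print(wv.most_similar("aspirin")[:3])   # quick sanity check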
THINGS TO TRY:
The main problems in the model seem to be related to:
- very long sentences (https://machinelearningmastery.com/handle-long-sequences-long-short-term-memory-recurrent-neural-networks/)
- unbalanced classes
So, things we could do are:
- Remove the first part of the sentences if no druga or drugb is present (<- this seems to work!!!)
- remove useless words from all sentences
- remove stop ("empty") words, either everywhere or for very long sentences only
- maybe only consider words that are not too distant from the 2 drugs? like
  only n words before DRUGA and m after DRUGB (sketched after this list)
- oversample / undersample / similar methods for NLP / data augmentation?
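A minimal sketch of the n/m window idea (assumptions: sentences are already
tokenized and the two entities are blinded to DRUGA/DRUGB placeholder tokens):

    def window_around_drugs(tokens, n=5, m=5):
        # Keep n tokens before the first drug and m after the second;
        # min/max handles either mention order.
        try:
            a = tokens.index("DRUGA")
            b = tokens.index("DRUGB")
        except ValueError:
            return tokens  # a drug token is missing; leave the sentence as-is
        lo, hi = min(a, b), max(a, b)
        return tokens[max(0, lo - n): hi + m + 1]

    print(window_around_drugs(
        "the objective was to test DRUGA together with DRUGB in patients".split(),
        n=3, m=2))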
Moreover, other things we could try are:
- change the embedding (already done, but maybe try other ones)?
- change the number of neurons per layer? (see the sketch below)
- look for other activation functions?
- add per-word features such as POS tags or similar
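A sketch of how those knobs could be exposed (this is NOT the repo's actual
architecture, just an assumed Keras-style Bi-LSTM classifier; vocab size,
dimensions and class count are placeholders):

    from tensorflow.keras import layers, models

    def build_model(vocab_size=30000, emb_dim=200,
                    lstm_units=128, dense_units=64,
                    activation="tanh", n_classes=5):
        # Embedding -> Bi-LSTM -> dense classifier; units and activation
        # are parameters so they are easy to vary in experiments.
        model = models.Sequential([
            layers.Embedding(vocab_size, emb_dim),
            layers.Bidirectional(layers.LSTM(lstm_units)),
            layers.Dense(dense_units, activation=activation),
            layers.Dense(n_classes, activation="softmax"),
        ])
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        return model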