notes.txt
PROBLEMS ENCOUNTERED SO FAR:
- preprocess1.py
(patched) preprocess2.py assumes the input file is composed of blocks
(delimited by \n\n), each containing one sentence (on 1 line) + its
annotation (drug mentions etc.).
However, some of the sentences span 2 lines, so the parsing fails. These are
the sentences that have a first part delimited by [] and the "real" sentence
on the following line.
For the moment, I only removed the \n manually so that the parsing is ok.
There are about 5 such sentences; just searching the text file for "]" is
enough to find and fix them.
THINGS TO DO:
- The code should be modified to handle this case automatically.
  Maybe simply look for "]\n" and change the \n into " "? (a sketch follows)
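A minimal sketch of that fix (assumptions: the raw file is plain UTF-8 text,
the file names are placeholders, and "]\n" only occurs at these broken line
breaks; the negative lookahead spares "]\n\n", i.e. real block boundaries):

    import re

    # Rejoin sentences that were split after a bracketed "[...]" prefix.
    # The (?!\n) lookahead keeps "]\n\n" block boundaries intact.
    with open("input.txt", encoding="utf-8") as f:
        text = f.read()

    fixed = re.sub(r"\]\n(?!\n)", "] ", text)

    with open("input_fixed.txt", "w", encoding="utf-8") as f:
        f.write(fixed)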
- main_train_val.py
No 95/5 train/validation split is done in the code; the files are simply read as given.
THINGS TO DO:
- (FIXED) create the files programmatically
- we have ~26K sentences, so it's probably not a problem to just choose the
  validation sentences randomly (without caring for the distribution of the
  classes); however, we should check that. No such check is mentioned in the
  paper. (a split sketch follows)
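A minimal sketch of such a split (file names are placeholders; the split is
purely random, i.e. exactly the unstratified version we still need to verify):

    import random

    random.seed(42)  # make the split reproducible

    # Read the \n\n-delimited blocks described above.
    with open("train_full.txt", encoding="utf-8") as f:
        blocks = [b for b in f.read().split("\n\n") if b.strip()]

    random.shuffle(blocks)
    cut = int(0.95 * len(blocks))

    with open("train95.txt", "w", encoding="utf-8") as f:
        f.write("\n\n".join(blocks[:cut]) + "\n")
    with open("val5.txt", "w", encoding="utf-8") as f:
        f.write("\n\n".join(blocks[cut:]) + "\n")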
For some reason, the last row of train95.txt contains only the sentence
(without the annotation), and this makes the main script crash.
For the moment I've just manually removed the last line.
THINGS TO DO:
- check why this happens and fix it (a defensive filter is sketched below)
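Until the cause is found, a defensive reader could simply drop malformed
blocks (assumption: a well-formed block has at least 2 lines, sentence +
annotation; check this against the real format):

    def read_blocks(path):
        # Keep only blocks that have both a sentence and annotation line(s).
        with open(path, encoding="utf-8") as f:
            blocks = [b for b in f.read().split("\n\n") if b.strip()]
        good = [b for b in blocks if len(b.splitlines()) >= 2]
        if len(good) != len(blocks):
            print("warning: dropped", len(blocks) - len(good),
                  "malformed block(s) in", path)
        return good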
There is a problem in test2 with a sentence in which (unclear why) there is
no drugb (apparently they assigned N to other drugs, so not sure what
happened). I manually removed the sentence ("The objective of this study...")
and left only the first occurrence.
THINGS TO DO:
- figure out what is actually going on there and fix it (a filter for such
  sentences is sketched below)
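A sketch of that filter (the "druga"/"drugb" markers are an assumption about
the annotation format; adjust to whatever identifiers the files really use):

    def has_both_drugs(block):
        # Assumed markers; swap in the real annotation identifiers.
        lowered = block.lower()
        return "druga" in lowered and "drugb" in lowered

    def filter_pairs(blocks):
        kept = [b for b in blocks if has_both_drugs(b)]
        print("removed", len(blocks) - len(kept),
              "block(s) missing a drug mention")
        return kept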
EMBEDDINGS:
- https://github.com/chop-dbhi/drug_word_embeddings
- http://bio.nlplab.org/
- https://github.com/jakerochlinmarcus/biomedical-word-embeddings
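For reference, a minimal loading sketch with gensim (the file name is a
placeholder; bio.nlplab.org ships word2vec-format binaries, the other two
repos may use different formats):

    from gensim.models import KeyedVectors

    # Placeholder file name; point this at the downloaded vectors.
    wv = KeyedVectors.load_word2vec_format("pubmed_vectors.bin", binary=True)
    print(wv.vector_size)                   # embedding dimensionality
    print(wv.most_similar("aspirin")[:3])   # quick sanity check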
THINGS TO TRY:
The main problems in the model seem to be related to:
- very long sentences (https://machinelearningmastery.com/handle-long-sequences-long-short-term-memory-recurrent-neural-networks/)
- unbalanced classes
So, things we could do are:
- Remove the first part of the sentences if no druga or drugb is present (<- this seems to work!!!)
- remove useless words from all sentences
- remove stop ("empty") words, either everywhere or for very long sentences only
- maybe only consider words that are not too distant from the 2 drugs? like
  only n words before DRUGA and m after DRUGB (sketched after this list)
- oversample / undersample / similar methods for NLP / data augmentation?
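A minimal sketch of the n/m window idea (assumptions: sentences are already
tokenized and the two entities are blinded to DRUGA/DRUGB placeholder tokens):

    def window_around_drugs(tokens, n=5, m=5):
        # Keep n tokens before the first drug and m after the second;
        # min/max handles either mention order.
        try:
            a = tokens.index("DRUGA")
            b = tokens.index("DRUGB")
        except ValueError:
            return tokens  # a drug token is missing; leave the sentence as-is
        lo, hi = min(a, b), max(a, b)
        return tokens[max(0, lo - n): hi + m + 1]

    print(window_around_drugs(
        "the objective was to test DRUGA together with DRUGB in patients".split(),
        n=3, m=2))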
Moreover, other things we could try are:
- change the embedding (already done, but maybe try other ones)?
- change the number of neurons per layer? (see the sketch below)
- look for other activation functions?
- add per-word features such as POS tags or similar
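A sketch of how those knobs could be exposed (this is NOT the repo's actual
architecture, just an assumed Keras-style Bi-LSTM classifier; vocab size,
dimensions and class count are placeholders):

    from tensorflow.keras import layers, models

    def build_model(vocab_size=30000, emb_dim=200,
                    lstm_units=128, dense_units=64,
                    activation="tanh", n_classes=5):
        # Embedding -> Bi-LSTM -> dense classifier; units and activation
        # are parameters so they are easy to vary in experiments.
        model = models.Sequential([
            layers.Embedding(vocab_size, emb_dim),
            layers.Bidirectional(layers.LSTM(lstm_units)),
            layers.Dense(dense_units, activation=activation),
            layers.Dense(n_classes, activation="softmax"),
        ])
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        return model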