Home

Shujian2015 edited this page Feb 9, 2018 · 8 revisions

Useful features

Length of text
Mean price of each category
Mean of brand/shipping
Average of word embeddings: look up every word in Word2vec and take the average of the vectors (paper, GitHub, Quora)
Better way to remove stop words (cached)
Reduce TF time
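
The averaged word embedding feature above can be sketched as follows. This is a minimal illustration, not the code from the linked sources: the `average_embedding` helper and the toy 3-dimensional vectors are hypothetical, and a real run would load pretrained Word2vec vectors instead. Out-of-vocabulary words are skipped here, and a text with no known words maps to a zero vector; both choices are assumptions.

```python
from typing import Dict, List, Sequence


def average_embedding(
    text: str, vectors: Dict[str, Sequence[float]], dim: int
) -> List[float]:
    """Average the vectors of all in-vocabulary tokens; zeros if none are found."""
    tokens = text.lower().split()
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:
        return [0.0] * dim
    return [sum(v[i] for v in found) / len(found) for i in range(dim)]


# Toy vectors (hypothetical values, not real Word2vec output).
toy_vectors = {
    "red": [1.0, 0.0, 0.0],
    "dress": [0.0, 1.0, 0.0],
    "shoes": [0.0, 0.0, 1.0],
}

# "nwt" is not in the toy vocabulary, so only "red" and "dress" are averaged.
features = average_embedding("Red dress NWT", toy_vectors, dim=3)
```

The resulting fixed-length vector can be appended to the other features regardless of how long the text is.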

Tricks

Stage 2: 1, 2, Mine

Worth a read:

strategy
Ridge: performance/computation time trade off
ensemble averaging
Why Ridge is much better than other sklearn models
Efficient Way to do TFIDF
Using log price as the dependent variable. But be careful with the "without zero price" kernels: because they also remove zero-price rows from the validation set, the local CV score becomes useless. If you want to remove zero prices, remove them inside the fold (from the training part only), so the validation set still resembles the original dataset and your CV score should resemble the LB
Wordbatch(TFIDF) vs WordSequence
Best single model
Wordbatch for preprocessing and modeling
Surpass 0.40000
LB shake up
CNN or RNN: Best single model
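
The advice on zero prices in the list above can be sketched as a cross-validation loop. This is only an illustration under stated assumptions: the `kfold_indices` and `cv_rmsle` helpers are hypothetical, the "model" is a constant mean-of-log-price baseline standing in for Ridge or any other regressor, and the error is measured as RMSE in `log1p` space. The point is only where the filtering happens: zero-price rows are dropped from the training part of each fold and kept in the validation part.

```python
import math
import random
from typing import List, Tuple


def kfold_indices(n: int, k: int, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Shuffle indices 0..n-1 and return k (train, validation) index splits."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        val = folds[i]
        train = [j for f_i, f in enumerate(folds) if f_i != i for j in f]
        splits.append((train, val))
    return splits


def cv_rmsle(prices: List[float], k: int = 3) -> float:
    """Cross-validate a constant baseline, filtering zero prices from training only."""
    scores = []
    for train, val in kfold_indices(len(prices), k):
        # Remove zero prices inside the fold: training rows only.
        train = [i for i in train if prices[i] > 0]
        # Placeholder model: predict the mean log1p price of the training rows.
        pred = sum(math.log1p(prices[i]) for i in train) / len(train)
        # The validation rows are untouched, so zero prices are still scored.
        sq_err = [(pred - math.log1p(prices[i])) ** 2 for i in val]
        scores.append(math.sqrt(sum(sq_err) / len(sq_err)))
    return sum(scores) / len(scores)


# Toy prices (made up for illustration), including two zero-price rows.
prices = [0.0, 10.0, 12.0, 0.0, 35.0, 8.0, 20.0, 15.0, 3.0]
score = cv_rmsle(prices, k=3)
```

Filtering the whole dataset before splitting would instead remove the zero-price rows from every validation fold, which is the situation the note warns about.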

Top players

LB

Ideas

Rewrite the code:
- "without merge(fitting on train and transforming on test) my CV and LB loss increased by 0.009. I can't figure out the reason." Link
- Test set into batches. link
- Better val set for TF
Tune: dropout/FC layers
Use averaged GloVe for TF
Other features for TF: Quora solutions
- No 1: Number of capital letters, question marks etc...
- No 3: We used TFIDF and LSA distances, word co-occurrence measures (pointwise mutual information), word matching measures, fuzzy word matching measures (edit distance, character ngram distances, etc), LDA, word2vec distances, part of speech and named entity features, and some other minor features. These features were mostly recycled from a previous NLP competition, and were not nearly as essential in this competition.
- No 8 -> a lot
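
The simple count features mentioned for Quora solution No 1 (capital letters, question marks and so on) could look roughly like this. The `surface_features` helper and the exact set of counts are assumptions for illustration, not taken from that solution.

```python
from typing import Dict


def surface_features(text: str) -> Dict[str, int]:
    """Cheap character-level counts that can be fed to a model next to TFIDF."""
    return {
        "n_chars": len(text),
        "n_words": len(text.split()),
        "n_capitals": sum(1 for c in text if c.isupper()),
        "n_question_marks": text.count("?"),
        "n_exclamation_marks": text.count("!"),
        "n_digits": sum(1 for c in text if c.isdigit()),
    }


# Made-up item description used only to exercise the function.
example = surface_features("NEW iPhone 6s, works great! Any questions?")
```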

Tried:

Combine (condition and shipping)
Concatenation of brand, item description and product name
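
A rough sketch of the two combinations tried above. The field names (`item_condition_id`, `shipping`, `brand_name`, `name`, `item_description`) and the `combine_fields` helper are assumptions chosen for illustration; the page does not say how the combination was implemented or whether it helped.

```python
from typing import Any, Dict


def combine_fields(item: Dict[str, Any]) -> Dict[str, str]:
    """Build one crossed categorical feature and one concatenated text field."""
    # Cross condition and shipping into a single categorical value, e.g. "3_1".
    condition_shipping = f"{item['item_condition_id']}_{item['shipping']}"
    # Concatenate brand, product name and description, skipping missing values.
    parts = [item.get("brand_name"), item.get("name"), item.get("item_description")]
    full_text = " ".join(str(p) for p in parts if p)
    return {"condition_shipping": condition_shipping, "full_text": full_text}


# Made-up item used only to exercise the function.
combined = combine_fields(
    {
        "item_condition_id": 3,
        "shipping": 1,
        "brand_name": None,
        "name": "Leather wallet",
        "item_description": "Barely used",
    }
)
```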