Shujian2015 edited this page Feb 9, 2018 · 8 revisions

Useful features

  • Length of text
  • Mean price of each category
  • Mean of brand/shipping
  • Average of word embeddings: Look up every word in Word2vec and average the vectors. paper, Github Quora
  • Better way to remove stop words (cached)
  • Reduce TF time
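
The first few features above can be sketched in plain Python. This is a minimal, dependency-free illustration, not the code used for the competition: the column names (`category_name`, `price`, `name`), the helper names, and the toy two-dimensional vectors are all assumptions made for the example. The group means are computed from training rows only, with a global-mean fallback for categories that never appear in training.

```python
from collections import defaultdict


def text_length(text):
    """Length of a text field in characters; missing text counts as 0."""
    return len(text) if text else 0


def fit_group_means(rows, key="category_name", target="price"):
    """Mean target per group, computed on the training rows only.

    The key/target names are assumed column names. Pass key="brand_name"
    or key="shipping" (also assumed names) for the brand/shipping means.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    total = 0.0
    for row in rows:
        sums[row[key]] += row[target]
        counts[row[key]] += 1
        total += row[target]
    means = {group: sums[group] / counts[group] for group in sums}
    global_mean = total / len(rows)  # fallback for unseen groups
    return means, global_mean


def average_embedding(text, vectors, dim):
    """Average the vectors of all in-vocabulary words; zeros if none match."""
    found = [vectors[w] for w in text.lower().split() if w in vectors]
    if not found:
        return [0.0] * dim
    return [sum(column) / len(found) for column in zip(*found)]


# Toy training rows (hypothetical data).
train = [
    {"category_name": "Men/Tops", "price": 10.0, "name": "blue shirt"},
    {"category_name": "Men/Tops", "price": 20.0, "name": "red shirt"},
    {"category_name": "Electronics", "price": 90.0, "name": "phone case"},
]
means, global_mean = fit_group_means(train)

# Stand-in for a real Word2vec lookup table (hypothetical 2-d vectors).
toy_vectors = {"blue": [1.0, 0.0], "shirt": [0.0, 1.0]}
```

At prediction time, `means.get(row["category_name"], global_mean)` gives the feature value for a test row, and `average_embedding(row["name"], toy_vectors, 2)` gives a fixed-size vector regardless of how many words the text has.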

Tricks


Worth a read:

Top players


Ideas

  • Rewrite the code:
    • "without merge(fitting on train and transforming on test) my CV and LB loss increased by 0.009. I can't figure out the reason." Link
    • Split the test set into batches. link
    • Better validation set for TF
  • Tune: dropout/FC layers
  • Use averaged GloVe for TF
  • Other features for TF: Quora solutions
    • No 1: Number of capital letters, question marks, etc.
    • No 3: We used TFIDF and LSA distances, word co-occurrence measures (pointwise mutual information), word matching measures, fuzzy word matching measures (edit distance, character ngram distances, etc), LDA, word2vec distances, part of speech and named entity features, and some other minor features. These features were mostly recycled from a previous NLP competition, and were not nearly as essential in this competition.
    • No 8 -> a lot
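
Two of the ideas above are easy to sketch: the surface count features from the Quora No 1 solution, and processing the test set in batches. The function names, the feature names, and the exact set of counts below are assumptions for illustration; the notes only name capital letters and question marks explicitly.

```python
def surface_counts(text):
    """Cheap character-level counts for one text field."""
    text = text or ""  # treat missing text as empty
    return {
        "n_chars": len(text),
        "n_capitals": sum(1 for c in text if c.isupper()),
        "n_question_marks": text.count("?"),
        "n_exclamation_marks": text.count("!"),
        "n_digits": sum(1 for c in text if c.isdigit()),
    }


def iter_batches(rows, batch_size):
    """Yield consecutive slices of rows so the test set is never
    transformed or predicted on all at once."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


# Hypothetical usage: compute features batch by batch.
test_texts = ["NEW iPhone 7?", "used shoes", None]
features = []
for batch in iter_batches(test_texts, batch_size=2):
    features.extend(surface_counts(text) for text in batch)
```

Batching bounds peak memory by the batch size, which may matter when the feature matrix for the full test set does not fit in memory.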

Tried:
