Link: https://www.kaggle.com/c/nlp-getting-started
History of Word Embeddings
Traditionally, we represent text features with bag-of-words (e.g., TF-IDF or count vectorization). Beyond BoW, we can also apply LDA or LSA to word features. However, these representations have limitations such as high-dimensional, sparse vectors. A word embedding, in contrast, is a dense feature in a low-dimensional vector space, and word embeddings have been shown to provide better features for most NLP problems.
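To make the sparsity point concrete, here is a minimal scikit-learn sketch of bag-of-words and TF-IDF features; the toy sentences are purely illustrative and not taken from the competition data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus, used only for illustration
docs = [
    "forest fire near the town",
    "no fire here just a sunny day",
]

# Bag of Words: each document becomes a sparse vector of word counts
bow = CountVectorizer()
X_counts = bow.fit_transform(docs)       # shape: (n_docs, vocab_size), sparse

# TF-IDF: the same counts, re-weighted by inverse document frequency
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(docs)

print(X_counts.shape, X_tfidf.shape)     # high-dimensional, mostly zeros
```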
In 2013, Mikolov et al. made word embeddings popular, and they eventually became the state of the art in NLP. They released the word2vec toolkit, allowing everyone to benefit from pre-trained models. Later on, gensim provided a convenient wrapper so that we can load different pre-trained word embedding models, including Word2Vec (by Google), GloVe (by Stanford), and fastText (by Facebook).
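For example, gensim's downloader API can fetch one of its published pre-trained models (a minimal sketch, assuming the gensim package is installed and an internet connection is available on first use):

```python
import gensim.downloader as api

# Downloads (once) and loads 100-dimensional GloVe vectors
glove = api.load("glove-wiki-gigaword-100")

print(glove["disaster"].shape)                 # (100,) dense vector
print(glove.most_similar("disaster", topn=3))  # nearest neighbours in embedding space
```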
Twelve years before Mikolov et al. introduced Word2Vec, Bengio et al. published a paper [1] on language modeling that contained the initial idea of word embeddings. At that time, they called this process “learning a distributed representation for words”.
- 2001: Bengio et al. introduced the concept of word embeddings
- 2008: Collobert and Weston introduced the concept of pre-trained models
- 2013: Mikolov et al. released pre-trained Word2Vec models
Approaches
- Bag of Words, N-grams, and their TF-IDF weights
- Shallow neural networks
- Attempt to use ConvNets (Zhang and LeCun, 2015)
- CNNs for Sentence Classification (Yoon Kim); see the sketch after this list
- Very Deep CNN Architecture (Facebook AI Research)
- Fine-tuning of BERT for text classification
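To make the CNN-for-sentence-classification idea concrete, here is a minimal Keras sketch in the spirit of Kim's architecture; the vocabulary size, sequence length, filter counts, and window sizes are illustrative assumptions, not values from the referenced papers.

```python
import tensorflow as tf

VOCAB_SIZE = 20000   # assumed vocabulary size
MAX_LEN = 50         # assumed maximum tweet length in tokens
EMB_DIM = 100        # assumed embedding dimension

inputs = tf.keras.Input(shape=(MAX_LEN,), dtype="int32")
x = tf.keras.layers.Embedding(VOCAB_SIZE, EMB_DIM)(inputs)

# Parallel convolutions with different window sizes, then max-over-time pooling
branches = []
for kernel_size in (3, 4, 5):
    c = tf.keras.layers.Conv1D(128, kernel_size, activation="relu")(x)
    c = tf.keras.layers.GlobalMaxPooling1D()(c)
    branches.append(c)

x = tf.keras.layers.Concatenate()(branches)
x = tf.keras.layers.Dropout(0.5)(x)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)  # 1 = real disaster

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```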
Dataset
- id
- keyword
- location
- text
- target: 1 = real disaster, 0 = not a real disaster (see the loading sketch after this list)
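The columns can be inspected after loading the competition CSVs with pandas (a small sketch; the file paths assume the files sit next to the script rather than the Kaggle input directory):

```python
import pandas as pd

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")   # has no target column

print(train.columns.tolist())            # ['id', 'keyword', 'location', 'text', 'target']
print(train["target"].value_counts())    # class balance of real vs. non-disaster tweets
```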
Toolkit
TensorFlow, scikit-learn
7,613 examples for training and 3,263 examples for testing.
Preprocessing
- Convert all text to lowercase
- Remove all punctuation marks
- Remove URLs, emojis, and HTML tags (a sketch of these cleaning steps follows this list)
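A sketch of these cleaning steps using plain Python and regular expressions; the exact patterns below are assumptions for illustration rather than the project's original code.

```python
import re
import string

URL_RE = re.compile(r"https?://\S+|www\.\S+")
HTML_RE = re.compile(r"<.*?>")
EMOJI_RE = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # flags
    "]+"
)

def clean_text(text: str) -> str:
    text = text.lower()                     # lowercase
    text = URL_RE.sub("", text)             # remove URLs
    text = HTML_RE.sub("", text)            # remove HTML tags
    text = EMOJI_RE.sub("", text)           # remove emojis
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    return text

print(clean_text("Fire near <b>LA</b>! Details: https://example.com \U0001F525"))
```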
Encoding
- Character-level encoding
- Word-level encoding
  - Bag of Words
  - GloVe
- Sentence-level encoding
  - Google Universal Sentence Encoder (sketches of the GloVe and Universal Sentence Encoder steps follow this list)
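A sketch of the word-level GloVe encoding and the sentence-level Universal Sentence Encoder; it assumes a local copy of glove.6B.100d.txt and the tensorflow_hub package, and the toy sentences are illustrative.

```python
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

texts = ["forest fire near la ronge", "i love fruits"]   # toy, already cleaned

# --- Word-level: build an embedding matrix from GloVe vectors ---
vectorizer = tf.keras.layers.TextVectorization(output_sequence_length=50)
vectorizer.adapt(texts)
vocab = vectorizer.get_vocabulary()

glove = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:    # assumed local GloVe file
    for line in f:
        word, *vec = line.split()
        glove[word] = np.asarray(vec, dtype="float32")

embedding_matrix = np.zeros((len(vocab), 100))
for i, word in enumerate(vocab):
    if word in glove:
        embedding_matrix[i] = glove[word]                 # rows for known words
# This matrix can initialise a (frozen) Keras Embedding layer.

# --- Sentence-level: Google Universal Sentence Encoder from TF Hub ---
use = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
sentence_vectors = use(texts)                             # shape: (2, 512)
print(sentence_vectors.shape)
```

Character-level encoding can be built analogously, for example with tf.keras.preprocessing.text.Tokenizer(char_level=True), which maps each character rather than each word to an integer index.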