GPT training for Paragraph embeddings #49

Open
reaganrewop opened this issue May 22, 2019 · 3 comments
Labels
experiment, poc (Prototyping new approach), priority

Comments

@reaganrewop
Contributor

Test GPT-2 feature representation linearity and its scalability for paragraph vectors.

@reaganrewop reaganrewop self-assigned this May 22, 2019
@reaganrewop reaganrewop added the experiment and poc (Prototyping new approach) labels May 22, 2019
@vdpappu vdpappu changed the title from "GPT2 training on ether data." to "GPT training on ether data." Jun 17, 2019
@vdpappu vdpappu assigned master10 and unassigned reaganrewop Jun 17, 2019
@ArjunKini
Contributor

Will be extending GPT to get paragraph embeddings using an LSTM-based "head" trained offline on sentence features extracted from GPT.
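
For reference, a minimal sketch of the sentence-feature extraction step such a head would consume, written against the current Hugging Face `transformers` API. Mean-pooling the final hidden states is an assumption here; the exact pooling used in the experiments isn't specified in this thread.

```python
import torch
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2").eval()

def gpt_sentence_feature(sentence: str) -> torch.Tensor:
    """Return a fixed-size feature vector for one sentence.

    Assumption: mean-pool GPT-2's final hidden states over tokens.
    An offline-trained LSTM head would consume vectors like these.
    """
    input_ids = tokenizer.encode(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(input_ids)
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)  # shape: (768,)
```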

@vdpappu vdpappu changed the title from "GPT training on ether data." to "GPT training for Paragraph embeddings" Jul 29, 2019
@ArjunKini
Contributor

I tested GPT and BERT for possible paragraph embedding applications. BERT gave a narrow range of similarity scores, 0.7-0.99 across out-of-domain and in-domain topics, as opposed to 0.2-0.9 for GPT, possibly due to the way tokens are aggregated to get the pooled feature representation of a sentence.
Moreover, BERT paragraph embeddings formed by aggregating sentence-level features are sensitive to noise (appending an out-of-domain sentence to the end of an in-domain paragraph reduces the score drastically). GPT, on the other hand, is more resilient to the added noise.
Adding an LSTM head to BERT did not alleviate these problems.

Conclusion: GPT-based paragraph embeddings are more stable than BERT-based ones.
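
For illustration, a minimal sketch of the kind of noise-sensitivity check described above, assuming mean-pooled GPT-2 sentence features and mean aggregation into a paragraph vector. The sentences and the reference text below are made up; the score ranges quoted above come from the actual experiments, not from this snippet.

```python
import torch
import torch.nn.functional as F
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2").eval()

def sent_vec(sentence):
    # Mean-pooled GPT-2 final hidden states (assumed pooling).
    input_ids = tokenizer.encode(sentence, return_tensors="pt")
    with torch.no_grad():
        return model(input_ids).last_hidden_state.mean(dim=1).squeeze(0)

def para_vec(sentences):
    # Baseline aggregation: mean of sentence vectors (what a learned head would replace).
    return torch.stack([sent_vec(s) for s in sentences]).mean(dim=0)

in_domain = [
    "The deployment failed because the config was missing a key.",
    "Rolling back to the previous release restored the service.",
]
out_of_domain = "My cat likes to sleep on the windowsill."
reference = para_vec(["Deployment failures are usually fixed by a rollback."])

clean = F.cosine_similarity(para_vec(in_domain), reference, dim=0).item()
noisy = F.cosine_similarity(para_vec(in_domain + [out_of_domain]), reference, dim=0).item()
print(f"clean: {clean:.3f}  with appended noise: {noisy:.3f}")
```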

@ArjunKini
Contributor

Conclusion 2: GPT paragraph embeddings show good topic separation and can be used to separate segments based on context. Instead of summing up the sentence-level feature vectors, a Bi-LSTM head was used to aggregate them, which resulted in better context capture across a paragraph.
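
A minimal sketch of what such a Bi-LSTM aggregation head could look like; the hidden size and the use of the final forward/backward hidden states are assumptions, not the exact configuration used here.

```python
import torch
import torch.nn as nn

class BiLSTMParagraphHead(nn.Module):
    """Aggregates a sequence of per-sentence GPT feature vectors into a
    single paragraph embedding (hypothetical sketch, dimensions assumed)."""

    def __init__(self, feat_dim=768, hidden_dim=256):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden_dim, batch_first=True,
                           bidirectional=True)

    def forward(self, sent_feats):
        # sent_feats: (batch, num_sentences, feat_dim)
        _, (h_n, _) = self.rnn(sent_feats)
        # Concatenate the final forward and backward hidden states.
        return torch.cat([h_n[0], h_n[1]], dim=-1)  # (batch, 2 * hidden_dim)

# Usage: 4 sentences of GPT-2 features (hidden size 768) -> one 512-d paragraph vector.
head = BiLSTMParagraphHead()
para_emb = head(torch.randn(1, 4, 768))
print(para_emb.shape)  # torch.Size([1, 512])
```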
