@Subhanshusethi
  1. Added a valid-indices check that, while loading the tokens file, ensures the number of captions matches the number of embeddings; mismatched entries are filtered out.
  2. Fine-tuning GPT-2 depends heavily on dataset quality. Removed single letters, special characters, and stop words (NLTK's default English list) to reduce the influence of connector words while training in the embedding space.
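The valid-indices check in point 1 could be sketched roughly as below. The helper name, the empty-caption filter, and the data layout are assumptions for illustration, not the PR's actual code:

```python
def filter_valid_pairs(captions, embeddings):
    """Keep only indices where a non-empty caption has a matching embedding.

    Hypothetical sketch of the valid-indices check: entries beyond the
    shorter of the two lists, or with empty captions, are dropped so that
    len(captions) == len(embeddings) after loading.
    """
    n = min(len(captions), len(embeddings))
    valid = [i for i in range(n) if captions[i]]
    return [captions[i] for i in valid], [embeddings[i] for i in valid]


# Example: four captions (one empty) but only three embeddings.
caps = ["a dog runs", "", "a red car", "a cat sleeps"]
embs = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
caps_f, embs_f = filter_valid_pairs(caps, embs)
print(len(caps_f), len(embs_f))  # → 2 2
```

After filtering, the two lists are guaranteed to be index-aligned, so a caption at position `i` always corresponds to the embedding at position `i`.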

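The cleaning described in point 2 might look like the sketch below. A small hard-coded stop-word set stands in for NLTK's default English list (`nltk.corpus.stopwords.words("english")`, which needs a one-time `nltk.download("stopwords")`); the function name and regex are illustrative assumptions:

```python
import re

# Stand-in stop-word set; the PR uses NLTK's default English list.
STOP_WORDS = {"a", "an", "the", "is", "on", "in", "and", "of"}

def clean_caption(text):
    """Strip special characters, then drop single letters and stop words."""
    # Replace anything that is not a letter or whitespace with a space.
    text = re.sub(r"[^a-zA-Z\s]", " ", text.lower())
    tokens = [t for t in text.split() if len(t) > 1 and t not in STOP_WORDS]
    return " ".join(tokens)


print(clean_caption("A dog, & a cat... sit on the mat!"))  # → dog cat sit mat
```

Running each caption through a step like this before tokenization keeps the embedding space focused on content words rather than connectors and punctuation.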