
[AI] Create a Custom Dataset to Evaluate and Fine-Tune BERT #15

Open
sajz opened this issue Nov 4, 2024 · 0 comments
sajz commented Nov 4, 2024

Objective

Develop a custom dataset to test and evaluate our BERT topic modeling model. The current dataset (the default newsletter data) makes it hard to assess the model's quality. By creating a controlled dataset with predefined topics and structure, we can better evaluate the model's performance and identify areas for improvement.

Description

  • Data Sources:

    • Option 1: Export messages from the Element messaging client.
    • Option 2: Generate synthetic messages using AI tools, following the format of the Element messaging client.
  • Dataset Requirements:

    • Include messages across 5–10 predefined topics.
    • Introduce overlaps between topics to mimic real-world data complexity.
    • Ensure a variety of messages per topic, utilizing different keywords and lexical fields.
    • The dataset should be large enough (e.g., 500–1000 messages) for meaningful evaluation (see the example record below).
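
To make the requirements concrete, here is a minimal sketch of one labeled message record. The field names (message_id, sender, timestamp, body, topics) are an assumption rather than the actual Element export schema and should be adjusted to whatever the export provides; topics is a list so that overlapping messages can carry more than one label.

```python
# Hypothetical record layout for one labeled message. The field names are an
# assumption, not the real Element/Matrix export schema; adjust as needed.
example_record = {
    "message_id": "msg-0001",
    "sender": "@alice:example.org",           # placeholder sender
    "timestamp": "2024-11-04T10:15:00Z",      # placeholder timestamp
    "body": "Just booked flights for the conference, any hotel tips?",
    "topics": ["travel", "technology"],       # two labels = an overlap message
}
```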

Steps to Follow

  1. Define Predefined Topics:

    • Identify specific topics relevant to our domain (e.g., technology, health, finance, wellness, entertainment, travel).
    • For each topic, list associated keywords, phrases, and lexical fields.
  2. Data Generation:

    • Option 1: Export from Element Messaging Client

      • Collect messages that correspond to the predefined topics.
      • Use the test room created on the Bored Labs server in Element.
      • Format messages consistently (e.g., JSON or CSV).
    • Option 2: Generate Synthetic Messages Using AI

      • Use AI tools to create messages for each topic.
      • Craft prompts that guide the AI to produce messages with desired content and style.
      • Ensure messages are diverse in vocabulary and structure.
      • Ensure the format matches what one would get from Element (a minimal generation sketch follows this list).
  3. Incorporate Topic Overlaps:

    • Design messages that intentionally include keywords from multiple topics.
    • Create scenarios where topics naturally intersect.
  4. Ensure Message Variety:

    • Vary message lengths (short, medium, long).
    • Include different writing styles and tones.
    • Use synonyms and related terms to enrich lexical diversity.
    • Test what happens when abbreviations or new words (e.g., the name of a new project) are introduced.
  5. Organize and Format the Dataset:

    • Label each message with its corresponding topic(s) for validation purposes.
    • Store messages in a format compatible with our BERT model (e.g., plain text files, CSV).
  6. Quality Assurance:

    • Review the dataset to verify topic representation and message quality (human review).
    • Check for balance in the number of messages per topic.
    • Ensure that overlaps are correctly implemented.
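
If the synthetic route (Option 2) is chosen, the generation loop could look roughly like the sketch below. This is only an outline under stated assumptions: the topic/keyword lists are illustrative, the output file name is arbitrary, the record layout is the hypothetical one shown earlier, and generate_message is a stub placeholder, not a real API — it must be replaced with a call to whichever AI tool the team settles on. The ~15% overlap rate is chosen to land inside the 10–20% target from the validation criteria.

```python
import json
import random

# Illustrative topics and keywords (step 1); replace with the agreed list.
TOPICS = {
    "technology": ["release", "bug", "server", "update", "deploy"],
    "health": ["sleep", "doctor", "checkup", "symptoms", "recovery"],
    "finance": ["budget", "invoice", "funding", "expenses", "forecast"],
    "travel": ["flight", "hotel", "itinerary", "visa", "conference"],
    "entertainment": ["movie", "concert", "playlist", "episode", "tickets"],
}

LENGTHS = ["short", "medium", "long"]   # step 4: vary message length
OVERLAP_RATE = 0.15                     # step 3: ~10-20% overlapping messages
MESSAGES_PER_TOPIC = 100                # 5 topics x 100 = 500 messages


def generate_message(prompt: str) -> str:
    """Placeholder generator so the pipeline runs end to end.

    This is NOT a real API call; swap in the chosen text-generation tool here.
    """
    return f"[synthetic message placeholder for prompt: {prompt[:60]}...]"


def build_prompt(topics: list[str], length: str) -> str:
    # Step 2, Option 2: craft a prompt that steers content, style, and keywords.
    keywords = [kw for t in topics for kw in random.sample(TOPICS[t], 2)]
    return (
        f"Write a {length} chat message, as it would appear in an Element room, "
        f"about {' and '.join(topics)}. Work in some of these words naturally: "
        f"{', '.join(keywords)}. Vary tone and wording."
    )


records = []
for topic in TOPICS:
    for i in range(MESSAGES_PER_TOPIC):
        # Step 3: occasionally mix in a second topic to create an overlap.
        labels = [topic]
        if random.random() < OVERLAP_RATE:
            labels.append(random.choice([t for t in TOPICS if t != topic]))
        prompt = build_prompt(labels, random.choice(LENGTHS))
        records.append({
            "message_id": f"{topic}-{i:04d}",
            "sender": f"@synthetic_user_{i % 10}:example.org",   # placeholder
            "timestamp": "2024-11-04T00:00:00Z",                 # placeholder
            "body": generate_message(prompt),
            "topics": labels,           # step 5: labels kept for validation
        })

# Step 5: store in a model-friendly format (JSON Lines here; CSV also works).
with open("custom_bert_dataset.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```

Keeping this script and its prompts under version control also covers the last deliverable, so additional comparison datasets can be generated later without starting from scratch.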

Validation Criteria for this task

  • Topic Coverage: Each predefined topic has an adequate number of messages (e.g., at least 50 messages per topic).
  • Overlaps Implemented: A subset of messages (e.g., 10–20%) should contain overlaps between topics.
  • Variety and Diversity: Messages exhibit a range of lengths, styles, and vocabulary.
  • Correct Labeling: All messages are accurately labeled with their topic(s).
  • Data Quality: Messages are coherent, relevant, and free of errors.
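
A small check script along these lines could automate parts of step 6 and the criteria above. It assumes the JSON Lines layout and file name sketched earlier (body and topics fields), which are assumptions rather than a fixed format; coherence and relevance still need human review.

```python
import json
from collections import Counter

MIN_PER_TOPIC = 50            # topic coverage threshold from the criteria
OVERLAP_RANGE = (0.10, 0.20)  # expected share of multi-topic messages

# Load the dataset (assumed JSON Lines file produced by the generation sketch).
with open("custom_bert_dataset.jsonl", encoding="utf-8") as f:
    records = [json.loads(line) for line in f]

topic_counts: Counter[str] = Counter()
overlap_count = 0
lengths = []

for rec in records:
    topic_counts.update(rec["topics"])
    if len(rec["topics"]) > 1:
        overlap_count += 1
    lengths.append(len(rec["body"].split()))

total = len(records)
overlap_share = overlap_count / total if total else 0.0

print(f"Total messages: {total}")
for topic, count in sorted(topic_counts.items()):
    flag = "OK" if count >= MIN_PER_TOPIC else "TOO FEW"
    print(f"  {topic}: {count} ({flag})")
print(f"Overlapping messages: {overlap_share:.1%} "
      f"(target {OVERLAP_RANGE[0]:.0%}-{OVERLAP_RANGE[1]:.0%})")
print(f"Message length (words): min={min(lengths)}, max={max(lengths)}, "
      f"mean={sum(lengths) / total:.1f}")
```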

Expected Deliverables

  • A structured dataset containing all messages, ready for model input.
  • Documentation outlining the dataset creation process, including:
    • Topics selected and associated keywords.
    • Methodology for data collection/generation.
    • Any scripts, prompts, or tools used in the process, so we don't have to start from scratch if we need other datasets for comparison.
@sajz sajz added this to Concord Nov 4, 2024
@sajz sajz converted this from a draft issue Nov 4, 2024
@sajz sajz changed the title AI - Create Test Data Set [AI] Create Test Data Set Nov 4, 2024
@sajz sajz assigned sajz and unassigned sajz Nov 4, 2024
@sajz sajz changed the title [AI] Create Test Data Set [AI] Create a Custom Dataset to Evaluate and Fine-Tune BERT Nov 4, 2024