Conversation

@charlie1587
Collaborator

Summary

Refactored the tokenizer integration by moving it from the epoch file to the dataset module for better modularity and maintainability. Also fixed an issue where tokenization was significantly slowing down the data loading process.

Changes

  • Moved tokenizer logic into dataset.py
  • Removed old tokenizer reference from the training/epoch file
  • Optimized tokenization speed during dataset loading (see the sketch below)
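
A minimal sketch of what the refactor might look like, assuming a HuggingFace-style tokenizer interface; the class name, field layout, and parameters below are illustrative, not the project's exact API:

```python
from torch.utils.data import Dataset


class DNADataset(Dataset):
    """Sketch: tokenize once at construction instead of once per epoch."""

    def __init__(self, sequences, tokenizer, max_length=512):
        # Tokenizing the whole corpus up front (rather than inside the
        # training/epoch loop, as before) means every epoch reuses the
        # cached tensors, which is where the speedup comes from.
        encoded = tokenizer(
            sequences,
            padding="max_length",
            truncation=True,
            max_length=max_length,
            return_tensors="pt",
        )
        self.input_ids = encoded["input_ids"]
        self.attention_mask = encoded["attention_mask"]

    def __len__(self):
        return self.input_ids.size(0)

    def __getitem__(self, idx):
        # Fixed-shape tensors let the default collate_fn stack batches.
        return {
            "input_ids": self.input_ids[idx],
            "attention_mask": self.attention_mask[idx],
        }
```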

zmgong and others added 30 commits April 4, 2025 02:36
Add bioscan_bert_checkpoint_trained_with_canada_1_5_m and bioscan_bert_checkpoint_trained_with_bioscan_5_m
correct max_length of tokenizer
…o .hidden_states[-1].mean(dim=1). This commit is for debugging.
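
For context, `.hidden_states[-1].mean(dim=1)` mean-pools the last encoder layer into one embedding per sequence. A minimal sketch, assuming a HuggingFace-style model; the checkpoint name is a placeholder, not one of the BIOSCAN checkpoints above:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Placeholder checkpoint; substitute the actual BarcodeBERT checkpoint path.
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

inputs = tokenizer("ACGTACGTACGT", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# hidden_states[-1] is the last encoder layer: (batch, seq_len, hidden_dim).
# Averaging over the token dimension yields one vector per sequence.
embedding = outputs.hidden_states[-1].mean(dim=1)  # (batch, hidden_dim)
```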
zmgong and others added 19 commits April 26, 2025 22:11
…d token type ids (which are not being used) for the old BarcodeBERT model.

Based on the old KmerTokenizer, the new one will have attention masks to indicate the pad tokens (just 'N').
- Fixed `Dataset` class in dataset.py to support the new BarcodeBERT tokenizer
- Modified `__getitem__` method to ensure proper tensor handling when processing DNA input
- Removed redundant device movement of DNA inputs in inference_epoch.py
- Added wrapper for NewKmerTokenizer to improve compatibility with existing code (see the sketch below)
- Ensured consistent tensor shapes to prevent "storage not resizable" errors during batching
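
A hedged sketch of k-mer tokenization with an attention mask over 'N' padding, as described above; the real NewKmerTokenizer API is not reproduced here, so the vocabulary handling and the `k`, `stride`, and `max_length` parameters are all assumptions:

```python
import torch


class KmerTokenizerSketch:
    """Sketch of a k-mer tokenizer that emits attention masks for 'N' pads."""

    def __init__(self, vocab, k=4, stride=4, max_length=128):
        self.vocab = vocab              # assumed dict: k-mer string -> id
        self.k, self.stride = k, stride
        self.max_length = max_length
        self.pad_id = vocab["N" * k]    # the all-'N' k-mer acts as padding

    def __call__(self, seq):
        # Slide a window of width k over the sequence to produce k-mers.
        kmers = [seq[i:i + self.k]
                 for i in range(0, len(seq) - self.k + 1, self.stride)]
        # Unknown k-mers fall back to the pad id in this simplified sketch.
        ids = [self.vocab.get(km, self.pad_id) for km in kmers][:self.max_length]
        n_pad = self.max_length - len(ids)
        # Mask is 1 for real tokens, 0 for 'N' padding, so the model can
        # ignore pad positions during attention.
        attention_mask = [1] * len(ids) + [0] * n_pad
        ids = ids + [self.pad_id] * n_pad
        # Every item has shape (max_length,), so the default collate can
        # stack a batch without hitting "storage not resizable" errors.
        return {
            "input_ids": torch.tensor(ids, dtype=torch.long),
            "attention_mask": torch.tensor(attention_mask, dtype=torch.long),
        }
```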