Conversation

@charlie1587
Collaborator

Summary

Refactored the tokenizer integration by moving it from the epoch file to the dataset module for better modularity and maintainability. Also fixed an issue where tokenization was significantly slowing down the data loading process.

Changes

  • Moved tokenizer logic into dataset.py
  • Removed old tokenizer reference from the training/epoch file
  • Optimized tokenization speed during dataset loading (see the sketch below)
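
A minimal sketch of what the refactor might look like, assuming a HuggingFace-style tokenizer interface; the class name, field layout, and parameters below are illustrative, not the project's exact API:

```python
from torch.utils.data import Dataset


class DNADataset(Dataset):
    """Sketch: tokenize once at construction instead of once per epoch."""

    def __init__(self, sequences, tokenizer, max_length=512):
        # Tokenizing the whole corpus up front (rather than inside the
        # training/epoch loop, as before) means every epoch reuses the
        # cached tensors, which is where the speedup comes from.
        encoded = tokenizer(
            sequences,
            padding="max_length",
            truncation=True,
            max_length=max_length,
            return_tensors="pt",
        )
        self.input_ids = encoded["input_ids"]
        self.attention_mask = encoded["attention_mask"]

    def __len__(self):
        return self.input_ids.size(0)

    def __getitem__(self, idx):
        # Fixed-shape tensors let the default collate_fn stack batches.
        return {
            "input_ids": self.input_ids[idx],
            "attention_mask": self.attention_mask[idx],
        }
```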

zmgong and others added 30 commits April 4, 2025 02:36
Add bioscan_bert_checkpoint_trained_with_canada_1_5_m and bioscan_bert_checkpoint_trained_with_bioscan_5_m
correct max_length of tokenizer
…o .hidden_states[-1].mean(dim=1). This commit is for debugging.
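
For context, `.hidden_states[-1].mean(dim=1)` mean-pools the last encoder layer into one embedding per sequence. A minimal sketch, assuming a HuggingFace-style model; the checkpoint name is a placeholder, not one of the BIOSCAN checkpoints above:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Placeholder checkpoint; substitute the actual BarcodeBERT checkpoint path.
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

inputs = tokenizer("ACGTACGTACGT", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# hidden_states[-1] is the last encoder layer: (batch, seq_len, hidden_dim).
# Averaging over the token dimension yields one vector per sequence.
embedding = outputs.hidden_states[-1].mean(dim=1)  # (batch, hidden_dim)
```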
zmgong and others added 19 commits April 26, 2025 22:11
…d token type ids (which are not being used) for the old BarcodeBERT model.

Based on the old KmerTokenizer, the new one will have attention masks to indicate the pad tokens (just 'N').
- Fixed `Dataset` class in dataset.py to support the new BarcodeBERT tokenizer
- Modified `__getitem__` method to ensure proper tensor handling when processing DNA input
- Removed redundant device movement of DNA inputs in inference_epoch.py
- Added wrapper for NewKmerTokenizer to improve compatibility with existing code (see the sketch below)
- Ensured consistent tensor shapes to prevent "storage not resizable" errors during batching
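
A hedged sketch of k-mer tokenization with an attention mask over 'N' padding, as described above; the real NewKmerTokenizer API is not reproduced here, so the vocabulary handling and the `k`, `stride`, and `max_length` parameters are all assumptions:

```python
import torch


class KmerTokenizerSketch:
    """Sketch of a k-mer tokenizer that emits attention masks for 'N' pads."""

    def __init__(self, vocab, k=4, stride=4, max_length=128):
        self.vocab = vocab              # assumed dict: k-mer string -> id
        self.k, self.stride = k, stride
        self.max_length = max_length
        self.pad_id = vocab["N" * k]    # the all-'N' k-mer acts as padding

    def __call__(self, seq):
        # Slide a window of width k over the sequence to produce k-mers.
        kmers = [seq[i:i + self.k]
                 for i in range(0, len(seq) - self.k + 1, self.stride)]
        # Unknown k-mers fall back to the pad id in this simplified sketch.
        ids = [self.vocab.get(km, self.pad_id) for km in kmers][:self.max_length]
        n_pad = self.max_length - len(ids)
        # Mask is 1 for real tokens, 0 for 'N' padding, so the model can
        # ignore pad positions during attention.
        attention_mask = [1] * len(ids) + [0] * n_pad
        ids = ids + [self.pad_id] * n_pad
        # Every item has shape (max_length,), so the default collate can
        # stack a batch without hitting "storage not resizable" errors.
        return {
            "input_ids": torch.tensor(ids, dtype=torch.long),
            "attention_mask": torch.tensor(attention_mask, dtype=torch.long),
        }
```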