Skip to content

Add JUnit 5 support and initial unit test for Tokenizer #26

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

ayush0407
Copy link

This PR adds initial testing support to the project by integrating JUnit 5 and creating the first unit test for the Tokenizer class.

Changes:

  • Added JUnit 5 dependency in pom.xml
  • Created TokenizerTest.java in src/test/java/llama
  • Added a simple test to validate the tokenization of a basic prompt

This sets the foundation for adding more tests across the codebase.

@CLAassistant
Copy link

CLAassistant commented Jun 15, 2025

CLA assistant check
All committers have signed the CLA.

Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR integrates JUnit 5 into the project and adds the first unit tests for the Tokenizer class to establish a testing foundation.

  • Added JUnit 5 dependency in pom.xml
  • Created TokenizerTest.java with basic tokenization tests

Reviewed Changes

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.

File Description
pom.xml Added JUnit 5 dependency for unit testing support
src/test/java/com/example/tokenizer/impl/TokenizerTest.java Introduced initial test class covering core tokenizer methods
Comments suppressed due to low confidence (1)

src/test/java/com/example/tokenizer/impl/TokenizerTest.java:39

  • [nitpick] Add tests for edge cases (e.g., empty or null inputs) to encodeOrdinary to ensure the tokenizer handles them without errors.
void testEncodeOrdinary() {

Copy link
Member

@mikepapadim mikepapadim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your contribution @ayush0407

I have a few comments:

  • the changes in pom file are missing
  • we need to have some instructions how to run the tests
  • we need to wait for #17 to be merged first as it refactors a bit the functionality of Tokenizer class to support multiple tokenizers

@ayush0407
Copy link
Author

Thank you for your contribution @ayush0407

I have a few comments:

  • the changes in pom file are missing
  • we need to have some instructions how to run the tests
  • we need to wait for [model] Add support for Mistral models #17 to be merged first as it refactors a bit the functionality of Tokenizer class to support multiple tokenizers
    Sure I forgot to Push the pom file with the JUnit dependency.

@ayush0407
Copy link
Author

Ill fix the Rest by Eod today thanks for the Suggestion

@ayush0407
Copy link
Author

Hi @mikepapadim, Review changes completed.

@mikepapadim
Copy link
Member

hello @ayush0407, it still missing instructions on how to run the test and the expected output, also a test for mistral tokenizer. Can you possibly add these? thanks

@ayush0407
Copy link
Author

hello @ayush0407, it still missing instructions on how to run the test and the expected output, also a test for mistral tokenizer. Can you possibly add these? thanks

sure.

@ayush0407
Copy link
Author

Hi @mikepapadim, please review the updated PR.

@mikepapadim
Copy link
Member

Thank you @ayush0407

Copy link
Collaborator

@orionpapadakis orionpapadakis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work, @ayush0407! 👍

To make the JUnit support more complete and meaningful, I suggest the following improvements:

  1. Add unit tests for LlamaTokenizer methods.
  2. Include more targeted checks, prioritizing the encode and decode methods for both LlamaTokenizer and MistralTokenizer.

This will help ensure correctness and consistency across tokenizer implementations.

Comment on lines +47 to +48
assertNotNull(tokens);
assertFalse(tokens.isEmpty());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would expect a check based on the expected result rather than checking if it's not null and empty.

Comment on lines +51 to +54
@Test
void testRegexPatternReturnsNull() {
assertNull(tokenizer.regexPattern());
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

regexPattern() should not return null

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This TokenizerTest covers the static replaceControlCharacters() methods which are implemented in the Tokenizer interface, not LlamaTokenizer. In order this junit support to make sense I would suggest to add coverage for LlamaTokenizer as well.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants