Describe the bug
I am unable to download the test data required by `added_tokens.rs` and the other integration tests. Running `cargo test --test added_tokens` fails with "Files not found" errors, specifically:
Files not found, run make test to download these files: Os { code: 2, kind: NotFound, message: "The system cannot find the file specified." }
Steps to Reproduce
- Clone the `tokenizers` repository (or pull `main` if already cloned).
- Attempt to run the integration tests: `cargo test --test added_tokens` (fails).
- Attempt the previously documented way to download the test data: `make test` (fails with "command not found" if `make` is not installed; even when it is, the underlying download script appears to be missing).
- Attempt to run the download script manually: `bash scripts/download-test-data.sh` (fails with "No such file or directory" because the `scripts/` folder is no longer present on `main`).
- Attempt to download `test-data.zip` directly in a browser from the previously documented Hugging Face dataset URL: https://huggingface.co/datasets/huggingface/tokenizers-test-data/resolve/main/test-data.zip (fails with "Repository not found" / 404).
- Attempt to browse the `scripts/` folder on GitHub's `main` branch: https://github.com/huggingface/tokenizers/tree/main/scripts (404, indicating the folder is gone).

The full command sequence is collected below for easy copy-paste reproduction.
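For convenience, here is the same sequence as a single shell session (Git Bash), run from the directory I normally invoke `cargo` from (where the crate's `Cargo.toml` lives):

```sh
# Reproduction sequence (Git Bash on Windows), from the Rust crate directory.
cargo test --test added_tokens        # fails: "Files not found, run make test to download these files..."

# Previously documented download paths no longer work:
make test                             # make not found, or the target's script is gone
bash scripts/download-test-data.sh    # "No such file or directory": scripts/ is gone from main

# The old dataset URL now returns "Repository not found" / 404
# (equivalent to the browser check described above):
curl -I https://huggingface.co/datasets/huggingface/tokenizers-test-data/resolve/main/test-data.zip
```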
Expected behavior
`added_tokens.rs` and the other integration tests should pass once the required test data has been downloaded. The test data should be accessible via `make test` or a clear, public download link.
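For illustration only, a minimal download step along these lines would be enough to unblock contributors. Both `DATA_URL` and the `data/` destination below are assumptions of mine, not the project's actual values; a documented equivalent of this is exactly what I'm asking for:

```sh
#!/usr/bin/env bash
# Hypothetical stand-in for the removed scripts/download-test-data.sh.
# DATA_URL and DEST_DIR are placeholders, not real project values.
set -euo pipefail

DATA_URL="${DATA_URL:?set DATA_URL to a public test-data archive}"
DEST_DIR="${DEST_DIR:-data}"   # assumed directory the test harness reads from

mkdir -p "$DEST_DIR"
curl -fL "$DATA_URL" -o test-data.zip
unzip -o test-data.zip -d "$DEST_DIR"
rm test-data.zip
```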
Screenshots/Error Messages
See the `Files not found` error quoted in the description above, and the `Repository not found` / 404 responses returned for the dataset URL and the `scripts/` folder.
Environment:
- OS: Windows 10/11
- Shell: Git Bash / PowerShell
- Rust version: (output of `rustc --version`)
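The exact versions can be filled in from the standard toolchain commands:

```sh
# Commands I would run to capture the exact toolchain versions.
rustc --version
cargo --version
```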
Additional context
This issue significantly impacts local development and testing, making it difficult for new contributors to verify changes against the full test suite. My primary change (an optimized whitespace pre-tokenizer) is ready, and 192 core unit tests pass, but these integration tests are blocked by this data accessibility issue.
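In case it helps other contributors who hit this, the in-crate unit tests still pass for me without the downloaded fixtures; only the file-based integration tests are blocked. Assuming the "core unit tests" above correspond to the crate's `--lib` tests, the split looks like:

```sh
# In-crate unit tests: pass without the downloaded test data.
cargo test --lib

# File-based integration tests: blocked until the test data is reachable again.
cargo test --test added_tokens
```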