
Cannot download test data: 'make test' and direct links fail with "Repository not found" / 404 #1820

@8ria

Describe the bug
I am unable to download the test data required by added_tokens.rs and the other integration tests. Running cargo test --test added_tokens fails with a "Files not found" error:
Files not found, run make test to download these files: Os { code: 2, kind: NotFound, message: "The system cannot find the file specified." }
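For context, the Os { code: 2, kind: NotFound, ... } part of that message is the Debug rendering of a std::io::Error raised when a file is opened at a path that does not exist. The sketch below reproduces a message of the same shape; the helper name open_test_file and the path are hypothetical and are not taken from the tokenizers test code.

```rust
use std::fs::File;
use std::io;
use std::path::Path;

/// Hypothetical helper: try to open a test-data file and, on failure,
/// wrap the io::Error in a message shaped like the one in this report.
fn open_test_file(path: &Path) -> Result<File, String> {
    File::open(path).map_err(|e: io::Error| {
        // `{:?}` on an OS-level io::Error prints `Os { code: .., kind: .., message: .. }`.
        format!(
            "Files not found, run make test to download these files: {:?}",
            e
        )
    })
}

fn main() {
    // Assumed path, for illustration only; any nonexistent path behaves the same.
    let path = Path::new("data/hypothetical-missing-file.json");
    match open_test_file(path) {
        Ok(_) => println!("found {}", path.display()),
        Err(msg) => println!("{msg}"),
    }
}
```

The message text inside Os { ... } comes from the operating system, which is why it reads "The system cannot find the file specified." on Windows and differs on other platforms.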

Steps to Reproduce

  1. Clone the tokenizers repository (or pull to main if already cloned).
  2. Attempt to run the integration tests: cargo test --test added_tokens (fails).
  3. Attempt to download the test data with the previously documented method: make test (fails with 'command not found' when make is not installed; when make is installed it still fails, apparently because the underlying download script is missing or inaccessible).
  4. Attempt to manually run the download script: bash scripts/download-test-data.sh (fails with "No such file or directory" because the scripts/ folder is no longer present on main).
  5. Attempt to download the test-data.zip directly via browser from the previously provided Hugging Face dataset URL: https://huggingface.co/datasets/huggingface/tokenizers-test-data/resolve/main/test-data.zip (fails with "Repository not found" or 404).
  6. Attempt to access the scripts/ folder on GitHub's main branch: https://github.com/huggingface/tokenizers/tree/main/scripts (results in a 404 error, indicating the folder is gone).
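The local part of these steps can be checked quickly before running the tests. The following is a minimal sketch, assuming it is run from the root of a tokenizers checkout; the function name missing_paths and the candidate list are illustrative, and apart from scripts/download-test-data.sh (named in step 4) the entries Makefile and data are assumptions about the repository layout.

```rust
use std::path::{Path, PathBuf};

/// Return every candidate (relative to `root`) that does not exist on disk.
fn missing_paths(root: &Path, candidates: &[&str]) -> Vec<PathBuf> {
    candidates
        .iter()
        .map(|rel| root.join(rel))
        .filter(|p| !p.exists())
        .collect()
}

fn main() {
    // Assumed layout: only the script path is taken from the report itself.
    let candidates = ["Makefile", "scripts/download-test-data.sh", "data"];
    let missing = missing_paths(Path::new("."), &candidates);

    if missing.is_empty() {
        println!("all expected paths are present");
    } else {
        for p in &missing {
            println!("missing: {}", p.display());
        }
    }
}
```

This only covers the local checkout; the remote failures in steps 5 and 6 still have to be confirmed in a browser or with an HTTP client.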

Expected behavior
added_tokens.rs and other integration tests should pass after successfully downloading the required test data. The test data should be accessible via make test or a clear, public download link.

Screenshots/Error Messages
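  • cargo test --test added_tokens: Files not found, run make test to download these files: Os { code: 2, kind: NotFound, message: "The system cannot find the file specified." }
  • Direct download of test-data.zip from the Hugging Face URL above: "Repository not found" (404)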

Environment:

  • OS: Windows 10/11
  • Shell: Git Bash / PowerShell
  • Rust version: (e.g., rustc --version)

Additional context
This issue significantly impacts local development and testing, making it difficult for new contributors to verify changes against the full test suite. My primary change (an optimized whitespace pre-tokenizer) is ready, and 192 core unit tests pass, but these integration tests are blocked by this data accessibility issue.
