Skip to content

fix binary file check #377

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

ix-56h
Copy link
Contributor

@ix-56h ix-56h commented Jul 3, 2025

Closes #375

Topic

tiktoken crash sometimes with binary files, after diging i've found that we did some check to ignore binary files but not strong enough.

@ix-56h ix-56h self-assigned this Jul 3, 2025
@ix-56h
Copy link
Contributor Author

ix-56h commented Jul 3, 2025

@NicolasIRAGNE i've fix encoding issue on windows, we now use utf-16 le instead of utf-16.

I let you decide what we need to do from now.

@NicolasIRAGNE
Copy link
Contributor

I'll take a look whenever I have a bit of time, I'm not sure what this does but it seems better than initial check

But I am curious as to what the actual error is. Do we read too many characters? What happens if the file is just a huge numbers of ascii chars?

Also, #375 did this fix the problem as well?

@ix-56h
Copy link
Contributor Author

ix-56h commented Jul 4, 2025

But I am curious as to what the actual error is. Do we read too many characters? What happens if the file is just a huge numbers of ascii chars?

The current problem:

  1. we read a chunk of a file
  2. we wrongly determine binaries using a too much simple method
  3. this leads to include bytes within the final output context
  4. the tiktoken package, which uses rust, panic when unwrapping parsing result

Also, #375 did this fix the problem as well?

Yes. This is the main purpose of this with another one : The context should now ignore a lot more (if it's not all) binaries, so the context should be a lot more usable for LLMs on various repositories.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bug: stackoverflow error with some repos
4 participants