-
Notifications
You must be signed in to change notification settings - Fork 808
fix binary file check #377
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
@NicolasIRAGNE i've fix encoding issue on windows, we now use utf-16 le instead of utf-16. I let you decide what we need to do from now. |
I'll take a look whenever I have a bit of time, I'm not sure what this does but it seems better than initial check But I am curious as to what the actual error is. Do we read too many characters? What happens if the file is just a huge numbers of ascii chars? Also, #375 did this fix the problem as well? |
The current problem:
Yes. This is the main purpose of this with another one : The context should now ignore a lot more (if it's not all) binaries, so the context should be a lot more usable for LLMs on various repositories. |
Closes #375
Topic
tiktoken crash sometimes with binary files, after diging i've found that we did some check to ignore binary files but not strong enough.