-
Notifications
You must be signed in to change notification settings - Fork 277
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tag strings without a known encoding as ASCII-8BIT. #619
base: master
Are you sure you want to change the base?
Conversation
Almost all of the string used inside a git repository are encoding unaware. This includes things like refnames, commit metadata, path names, and a lot more. The exception is commit messages, which can optionally be tagged with an encoding through a header in the commit metadata. We should only tag strings with an encoding if we know the exact encoding, and otherwise tag them as US-ASCII 8-BIT (binary).
return rugged_signature_new( | ||
git_commit_author(commit), | ||
git_commit_message_encoding(commit)); | ||
return rugged_signature_new(git_commit_author(commit)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is its own can of worms, but for some use-cases we have assumed that the encoding of the signature is the same as of the commit message. This sometimes even holds true, and this is a place where we kinda have a good idea of what the encoding is likely to be, rather than no idea like in the other cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is probably an OK assumption to make. I haven't seen any exceptions regarding this in production (only the path and tag stuff)
Just to be clear, US-ASCII and ASCII-8BIT are totally different. US-ASCII is a known encoding, where ASCII-8BIT means "we don't know what the encoding of this string is". All strings should have an encoding, ASCII-8BIT just means that we don't know what the encoding is (or that it's truly binary data like an image or something). |
I was thinking about this last night. I thought it might be nice if we could configure a repository with encodings. Something like this: repo = Rugged::Repository.new(ARGV[0], path_encoding: ::Encoding::UTF_8) That way if you're dealing with a repository where you know the encoding of the paths, you can just configure the repo with the encoding to use. We could provide options for the bits of data where we can't know the encoding like paths and tags (those are the only two I can think of, blobs should always be binary). If we add this configuration value, then we can default it to UTF-8 in order to maintain backwards compatibility. |
Whoops, you're right, I mixed that up! 😄 |
I'm not sure this is a good solution. If you have different people that work on a repo, they might have set different encodings on their machines. I don't think that's too uncommon, especially for people working on windows. 😞 |
On Windows, at least, NTFS stores filenames as UTF-16 and there is, AFAIK, no way to use any other encoding. This is fortunate, since the Windows Git implementations do an insane amount of UTF16 <-> UTF8 conversion to be able to talk to the Windows APIs, since they all speak UTF-16 (or UCS-2 in some cases. Yay!) Similarly, HFS+ will use a canonically decomposed UTF-16. I'm just providing data here, I still think that returning these as |
@tenderlove Is this going to break anything if we would push this into production as-is? |
…r/binary-all-the-things
I think it will work but we should test. I think we're retagging them as binary in all places so this should be OK Aaron Patterson
|
Almost all of the string used inside a git repository are encoding unaware.
This includes things like refnames, commit metadata, path names, and a lot more. The exception is commit messages, which can optionally be tagged with an encoding through a header in the commit metadata. We should only tag strings with an encoding if we know the exact encoding, and otherwise tag them as ASCII 8-BIT (binary).