Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tag strings without a known encoding as ASCII-8BIT. #619

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

arthurschreiber
Copy link
Member

@arthurschreiber arthurschreiber commented Jul 8, 2016

Almost all of the string used inside a git repository are encoding unaware.

This includes things like refnames, commit metadata, path names, and a lot more. The exception is commit messages, which can optionally be tagged with an encoding through a header in the commit metadata. We should only tag strings with an encoding if we know the exact encoding, and otherwise tag them as ASCII 8-BIT (binary).

Almost all of the string used inside a git repository are encoding unaware.

This includes things like refnames, commit metadata, path names, and a lot more. The exception is commit messages, which can optionally be tagged with an encoding through a header in the commit metadata. We should only tag strings with an encoding if we know the exact encoding, and otherwise tag them as US-ASCII 8-BIT (binary).
return rugged_signature_new(
git_commit_author(commit),
git_commit_message_encoding(commit));
return rugged_signature_new(git_commit_author(commit));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is its own can of worms, but for some use-cases we have assumed that the encoding of the signature is the same as of the commit message. This sometimes even holds true, and this is a place where we kinda have a good idea of what the encoding is likely to be, rather than no idea like in the other cases.

Copy link
Contributor

@tenderlove tenderlove Jul 8, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is probably an OK assumption to make. I haven't seen any exceptions regarding this in production (only the path and tag stuff)

@tenderlove
Copy link
Contributor

tenderlove commented Jul 8, 2016

US-ASCII 8-BIT

Just to be clear, US-ASCII and ASCII-8BIT are totally different. US-ASCII is a known encoding, where ASCII-8BIT means "we don't know what the encoding of this string is". All strings should have an encoding, ASCII-8BIT just means that we don't know what the encoding is (or that it's truly binary data like an image or something).

@tenderlove
Copy link
Contributor

tenderlove commented Jul 8, 2016

I was thinking about this last night. I thought it might be nice if we could configure a repository with encodings. Something like this:

repo = Rugged::Repository.new(ARGV[0], path_encoding: ::Encoding::UTF_8)

That way if you're dealing with a repository where you know the encoding of the paths, you can just configure the repo with the encoding to use. We could provide options for the bits of data where we can't know the encoding like paths and tags (those are the only two I can think of, blobs should always be binary).

If we add this configuration value, then we can default it to UTF-8 in order to maintain backwards compatibility.

@arthurschreiber
Copy link
Member Author

Just to be clear, US-ASCII and ASCII-8BIT are totally different.

Whoops, you're right, I mixed that up! 😄

@arthurschreiber
Copy link
Member Author

I thought it might be nice if we could configure a repository with encodings.

I'm not sure this is a good solution. If you have different people that work on a repo, they might have set different encodings on their machines. I don't think that's too uncommon, especially for people working on windows. 😞

@ethomson
Copy link
Member

If you have different people that work on a repo, they might have set different encodings on their machines. I don't think that's too uncommon, especially for people working on windows. 😞

On Windows, at least, NTFS stores filenames as UTF-16 and there is, AFAIK, no way to use any other encoding. This is fortunate, since the Windows Git implementations do an insane amount of UTF16 <-> UTF8 conversion to be able to talk to the Windows APIs, since they all speak UTF-16 (or UCS-2 in some cases. Yay!)

Similarly, HFS+ will use a canonically decomposed UTF-16.

I'm just providing data here, I still think that returning these as ASCII-8BIT probably still makes a lot of sense.

@arthurschreiber
Copy link
Member Author

@tenderlove Is this going to break anything if we would push this into production as-is?

@tenderlove
Copy link
Contributor

I think it will work but we should test. I think we're retagging them as binary in all places so this should be OK

Aaron Patterson
http://tenderlovemaking.com/
I'm on an iPhone so I apologize for top posting.

On Aug 18, 2016, at 7:48 AM, Arthur Schreiber [email protected] wrote:

@tenderlove Is this going to break anything if we would push this into production as-is?


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants