From d746dac6590a34ffdbfa31fb3767151157b3f319 Mon Sep 17 00:00:00 2001
From: Elijah Newren
Date: Sat, 23 Nov 2024 13:38:49 -0800
Subject: [PATCH] wip

---
 Documentation/curated-examples-from-issues.md | 424 ++++++++++++++++++
 1 file changed, 424 insertions(+)
 create mode 100644 Documentation/curated-examples-from-issues.md

diff --git a/Documentation/curated-examples-from-issues.md b/Documentation/curated-examples-from-issues.md
new file mode 100644
index 00000000..fcfab993
--- /dev/null
+++ b/Documentation/curated-examples-from-issues.md
@@ -0,0 +1,424 @@
+# Curated examples from issues
+
+Lots of people have filed issues against git-filter-repo, and many times it
+boils down to questions of "How do I?" or "Why doesn't this work?"
+
+I thought I'd collect a bunch of these as example repository filterings
+that others may be interested in.
+
+## Table of Contents
+
+  * [Adding files to root commits](#adding-files-to-root-commits)
+  * [Purge a large list of files](#purge-a-large-list-of-files)
+
+## Adding files to root commits
+
+
+
+Here's an example that will take `/path/to/existing/README.md` and
+store it as `README.md` in the repository, and take
+`/home/myusers/mymodule.gitignore` and store it as `src/.gitignore` in
+the repository:
+
+```
+git filter-repo --commit-callback "if not commit.parents: commit.file_changes += [
+    FileChange(b'M', b'README.md', b'$(git hash-object -w '/path/to/existing/README.md')', b'100644'),
+    FileChange(b'M', b'src/.gitignore', b'$(git hash-object -w '/home/myusers/mymodule.gitignore')', b'100644')]"
+```
+
+Alternatively, you could also use the [insert-beginning contrib script](../contrib/filter-repo-demos/insert-beginning).
+
+## Purge a large list of files
+
+
+
+Put all the filenames in some file (one per line),
+e.g. ../DELETED_FILENAMES.txt, and then run
+
+```
+git filter-repo --invert-paths --paths-from-file ../DELETED_FILENAMES.txt
+```
+
+## Extracting a library to a separate repo
+
+
+
+```
+git filter-repo \
+    --path src/some-folder/some-feature \
+    --path-rename src/some-folder/some-feature/:src/
+```
+
+## Replace words in all commit messages
+
+
+
+```
+git filter-repo --message-callback 'return message.replace(b"stuff", b"task")'
+```
+
+## Only keep files from two branches
+
+
+
+Let's say you know that the files currently present on two branches
+are the only files that matter. Files that used to exist in either of
+these branches, or files that only exist on some other branch, should
+all be deleted from all versions of history. This can be accomplished
+by getting a list of filenames from each branch, combining them, sorting
+the list and picking out just the unique entries, then passing to
+`--paths-from-file`:
+
+```
+git ls-tree -r --name-only ${BRANCH1} >../my-files
+git ls-tree -r --name-only ${BRANCH2} >>../my-files
+sort ../my-files | uniq >../my-relevant-files
+git filter-repo --paths-from-file ../my-relevant-files
+```
+
+## Renormalize end-of-line characters and add a .gitattributes
+
+
+
+```
+contrib/filter-repo-demos/lint-history dos2unix
+[edit .gitattributes]
+contrib/filter-repo-demos/insert-beginning .gitattributes
+```
+
+## Remove spaces at the end of lines
+
+
+
+Removing all spaces at the end of lines of non-binary files, including
+stripping trailing carriage returns:
+
+```
+git filter-repo --replace-text <(echo 'regex:[\r\t ]+(\n|$)==>\n')
+```
+
+## Having both exclude and include rules for filenames
+
+
+
+If you want to have rules to both include and exclude filenames, you
+can simply invoke `git filter-repo` multiple times. Alternatively,
+you can dispense with `--path` arguments and instead use the more
+generic `--filename-callback`. For example, to include all files under
+`src/` except for `src/README.md`:
+
+```
+git filter-repo --filename-callback '
+  if filename == b"src/README.md":
+    return None
+  if filename.startswith(b"src/"):
+    return filename
+  return None'
+```
+
+## Removing paths with a certain extension
+
+
+
+```
+git filter-repo --invert-paths --path-glob '*.xsa'
+```
+
+or
+
+```
+git filter-repo --filename-callback '
+  if filename.endswith(b".xsa"):
+    return None
+  return filename'
+```
+
+## Removing a directory
+
+
+
+```
+git filter-repo --path node_modules/electron/dist/ --invert-paths
+```
+
+## Convert from NFD filenames to NFC
+
+
+
+Given that Mac does utf-8 normalization of filenames, and has
+historically switched which kind of normalization it does, users may
+have committed files with alternative normalizations to their
+repository. If someone wants to convert filenames in NFD form to NFC,
+they could run
+
+```
+git filter-repo --filename-callback '
+  try:
+    return subprocess.check_output("iconv -f utf-8-mac -t utf-8".split(),
+                                   input=filename)
+  except:
+    return filename
+'
+```
+
+or
+
+```
+git filter-repo --filename-callback '
+  import unicodedata
+  try:
+    return bytearray(unicodedata.normalize("NFC", filename.decode("utf-8")), "utf-8")
+  except:
+    return filename
+'
+```
+
+## Set the committer of the last few commits to myself
+
+
+
+```
+git filter-repo --refs main~5..main --commit-callback '
+  commit.committer_name = b"My Wonderful Self"
+  commit.committer_email = b"my@self.org"
+'
+```
+
+## Handling special characters, e.g. accents in names
+
+
+
+Since characters like ë and á are multi-byte characters and Python
+won't allow you to directly place those in a bytestring
+(e.g. b"Raphaël González" would result in a `SyntaxError: bytes can
+only contain ASCII literal characters` error from Python), you just
+need to make a normal string and then convert to a bytestring to
+handle these. For example, changing the author name and email where
+the author email is currently `example@test.com`:
+
+```
+git filter-repo --refs main~5..main --commit-callback '
+  if commit.author_email == b"example@test.com":
+    commit.author_name = "Raphaël González".encode()
+    commit.author_email = b"rgonzalez@test.com"
+'
+```
+
+## Handling repository corruption
+
+
+
+First, run fsck to get a list of the corrupt objects, e.g.:
+```
+$ git fsck
+error in commit 166f57b3fbe31257100361ecaf735f305b533b21: missingSpaceBeforeDate: invalid author/committer line - missing space before date
+Checking object directories: 100% (256/256), done.
+```
+
+Then print out that object literally to a temporary file:
+```
+$ git cat-file -p 166f57b3fbe31257100361ecaf735f305b533b21 >tmp
+```
+
+Taking a look at the file would show, for example:
+```
+$ cat tmp
+tree e1d871155fce791680ec899fe7869067f2b4ffd2
+author My Name 1673287380 -0800
+committer My Name 1673287380 -0800
+
+Initial
+```
+
+Edit that file to fix the error (in this case, the missing space
+between author email and author date):
+
+```
+tree e1d871155fce791680ec899fe7869067f2b4ffd2
+author My Name 1673287380 -0800
+committer My Name 1673287380 -0800
+
+Initial
+```
+
+Save the updated file, then use `git-replace` to make a replace reference
+for it.
+```
+$ git replace -f 166f57b3fbe31257100361ecaf735f305b533b21 $(git hash-object -t commit -w tmp)
+```
+
+Then remove the temporary file `tmp` and run `filter-repo` to consume
+the replace reference and make it permanent:
+
+```
+$ rm tmp
+$ git filter-repo --proceed
+```
+
+Note that if you have multiple corrupt objects, you only need to run
+filter-repo once; just wait to do that step until you have all the
+replacements in place.
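The manual edit in the corruption example can also be sketched in Python. This is only a minimal sketch and not part of git-filter-repo: `fix_missing_space` is a hypothetical helper, the `my@example.com` address is a placeholder chosen for illustration, and the code assumes the only defect is a date that directly follows the closing `>` of the email.

```python
import re

# Matches an author/committer header whose timestamp directly follows
# the closing ">" of the email, i.e. the missingSpaceBeforeDate case.
_MISSING_SPACE = re.compile(
    rb"^((?:author|committer) [^\n]*>)(\d+ [+-]\d{4})$", re.MULTILINE
)


def fix_missing_space(raw_commit: bytes) -> bytes:
    """Insert the missing space between the email and the date (hypothetical helper)."""
    return _MISSING_SPACE.sub(rb"\1 \2", raw_commit)


# Placeholder commit object text; the email address is made up for this example.
broken = (
    b"tree e1d871155fce791680ec899fe7869067f2b4ffd2\n"
    b"author My Name <my@example.com>1673287380 -0800\n"
    b"committer My Name <my@example.com>1673287380 -0800\n"
    b"\n"
    b"Initial\n"
)

fixed = fix_missing_space(broken)
print(fixed.decode())
```

The repaired bytes could then be written to `tmp` and handed to `git hash-object -t commit -w tmp`, as in the steps above.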
+
+## Removing all files with a backslash in them
+
+
+
+```
+git filter-repo --filename-callback 'return None if b"\\" in filename else filename'
+```
+
+## Replace a binary blob in history
+
+
+
+Let's say you committed a binary blob, perhaps an image file, with
+sensitive data, and never modified it. You want to replace it with
+the contents of some alternate file, currently found at
+`../alternative-file.jpg` (it can have a different filename than what
+is stored in the repository). Let's also say the hash of the old file
+was `f4ede2e944868b9a08401dafeb2b944c7166fd0a`. You can replace it
+with either
+
+```
+git filter-repo --blob-callback '
+  if blob.original_id == b"f4ede2e944868b9a08401dafeb2b944c7166fd0a":
+    blob.data = open("../alternative-file.jpg", "rb").read()
+'
+```
+
+or
+
+```
+git replace -f f4ede2e944868b9a08401dafeb2b944c7166fd0a $(git hash-object -w ../alternative-file.jpg)
+git filter-repo --proceed
+```
+
+## Remove commits older than N days
+
+
+
+This is such a bad use case. I'm tempted to leave it out, but it has
+come up multiple times, and there are people who are totally fine with
+changing every commit hash in their repository and throwing away
+history periodically. First, identify an ${OLD_COMMIT} that you want
+to be a new root commit, then run:
+
+```
+git replace --graft ${OLD_COMMIT}
+git filter-repo --proceed
+```
+
+## Replacing pngs with compressed alternative
+
+
+
+Let's say you committed thousands of pngs that were poorly compressed,
+but later aggressively recompressed the pngs and committed and pushed.
+Unfortunately, clones are slow because they still contain the poorly
+compressed pngs and you'd like to rewrite history to pretend that the
+aggressively compressed versions were used when the files were first
+introduced.
+
+First, take a look at the commit that aggressively recompressed the pngs:
+
+```
+git log -1 --raw --no-abbrev ${COMMIT_WHERE_YOU_COMPRESSED_PNGS}
+```
+
+That will show output like
+```
+:100755 100755 edf570fde099c0705432a389b96cb86489beda09 9cce52ae0806d695956dcf662cd74b497eaa7b12 M resources/foo.png
+:100755 100755 644f7c55e1a88a29779dc86b9ff92f512bf9bc11 88b02e9e45c0a62db2f1751b6c065b0c2e538820 M resources/bar.png
+```
+
+Use that to make a --file-info-callback to fix up the original versions:
+```
+git filter-repo --file-info-callback '
+  if filename == b"resources/foo.png" and blob_id == b"edf570fde099c0705432a389b96cb86489beda09":
+    blob_id = b"9cce52ae0806d695956dcf662cd74b497eaa7b12"
+  if filename == b"resources/bar.png" and blob_id == b"644f7c55e1a88a29779dc86b9ff92f512bf9bc11":
+    blob_id = b"88b02e9e45c0a62db2f1751b6c065b0c2e538820"
+  return (filename, mode, blob_id)
+'
+```
+
+## Updating submodule hashes
+
+
+
+Let's say you have a repo with a submodule at src/my-submodule, and
+that you feel the wrong commit hashes of the submodule were committed
+within your project and you want them updated according to the
+following table:
+```
+old                                      new
+edf570fde099c0705432a389b96cb86489beda09 9cce52ae0806d695956dcf662cd74b497eaa7b12
+644f7c55e1a88a29779dc86b9ff92f512bf9bc11 88b02e9e45c0a62db2f1751b6c065b0c2e538820
+```
+
+You could do this as follows:
+```
+git filter-repo --file-info-callback '
+  if filename == b"src/my-submodule" and blob_id == b"edf570fde099c0705432a389b96cb86489beda09":
+    blob_id = b"9cce52ae0806d695956dcf662cd74b497eaa7b12"
+  if filename == b"src/my-submodule" and blob_id == b"644f7c55e1a88a29779dc86b9ff92f512bf9bc11":
+    blob_id = b"88b02e9e45c0a62db2f1751b6c065b0c2e538820"
+  return (filename, mode, blob_id)
+'
+```
+
+Yes, `blob_id` is kind of a misnomer here since the file's hash
+actually refers to a commit from the sub-project. But `blob_id` is
+the name of the parameter passed to the --file-info-callback, so that
+is what must be used.
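When the old/new table grows beyond a couple of rows, the chained `if` statements get unwieldy. The following standalone Python sketch shows the same remapping driven by a dictionary. It is an illustration only: `remap_file_info`, `SUBMODULE_PATH` and `REMAP` are hypothetical names, the function merely mimics the `(filename, mode, blob_id)` return shape used in the callbacks above, and it does not call git-filter-repo itself.

```python
# Hypothetical names for illustration; the hashes come from the table above.
SUBMODULE_PATH = b"src/my-submodule"
REMAP = {
    b"edf570fde099c0705432a389b96cb86489beda09": b"9cce52ae0806d695956dcf662cd74b497eaa7b12",
    b"644f7c55e1a88a29779dc86b9ff92f512bf9bc11": b"88b02e9e45c0a62db2f1751b6c065b0c2e538820",
}


def remap_file_info(filename, mode, blob_id):
    """Return (filename, mode, blob_id), swapping in the new hash when one is listed."""
    if filename == SUBMODULE_PATH:
        # Unlisted hashes fall through unchanged.
        blob_id = REMAP.get(blob_id, blob_id)
    return (filename, mode, blob_id)


# b"160000" is the file mode git uses for submodule (gitlink) entries.
print(remap_file_info(SUBMODULE_PATH, b"160000", b"edf570fde099c0705432a389b96cb86489beda09"))
```

The same dictionary lookup could presumably be pasted into the body of a `--file-info-callback`, with the table defined at the top of the callback text.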
+
+## Using multi-line strings in callbacks
+
+
+
+Since the text for callbacks has spaces inserted at the front of every
+line, multi-line strings are normally munged. For example, the command
+
+```
+git filter-repo --blob-callback '
+  blob.data = bytes("""\
+    This is the new
+    file that I am
+    replacing every blob
+    with. It is great.
+    """, "utf-8")
+'
+```
+
+would likely result in a file with extra spaces at the front of every line:
+```
+      This is the new
+      file that I am
+      replacing every blob
+      with. It is great.
+```
+
+(Note that each line starts with 6 spaces, even though there were only
+4 spaces in your callback.)
+
+
+However, you can use textwrap.dedent to avoid this. For example:
+
+```
+git filter-repo --blob-callback '
+  import textwrap
+  blob.data = bytes(textwrap.dedent("""\
+    This is the new
+    file that I am
+    replacing every blob
+    with. It is great.
+    """), "utf-8")
+'
+```
+
+That will result in a file with contents
+```
+This is the new
+file that I am
+replacing every blob
+with. It is great.
+```
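To see why `textwrap.dedent` is insensitive to the extra indentation, here is a small standalone Python sketch that can be run outside of git-filter-repo. The two-space prefix added with `textwrap.indent` is an assumption that stands in for whatever indentation the callback wrapper adds (the 4 versus 6 spaces noted above); the variable names are made up for this example.

```python
import textwrap

# The text as typed inside the callback: every line indented by 4 spaces.
as_written = (
    "    This is the new\n"
    "    file that I am\n"
    "    replacing every blob\n"
    "    with. It is great.\n"
)

# Assumption: the wrapper prefixes each line with 2 more spaces, giving 6.
as_seen = textwrap.indent(as_written, "  ")

# dedent strips the longest common leading whitespace, so both inputs
# collapse to the same unindented text.
plain = textwrap.dedent(as_seen)
print(plain, end="")

blob_data = bytes(plain, "utf-8")
```

Because only the common margin is removed, any deliberate extra indentation on individual lines would survive.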