From 615720731589138739c656d2f04c3b613ce0410c Mon Sep 17 00:00:00 2001 From: Elijah Newren Date: Wed, 23 Oct 2024 13:15:20 -0700 Subject: [PATCH] filter-repo: add a --file-info-callback This callback answers a common request users have to be able to operate on both file names and content, and throws in the mode while at it. It also makes our lint-history contrib example re-implementable as a few lines of shell script, but we'll leave it around anyway. Signed-off-by: Elijah Newren --- .../converting-from-filter-branch.md | 50 ++++-- Documentation/git-filter-repo.txt | 118 +++++++++++++- git-filter-repo | 154 ++++++++++++++++-- t/t9392-filter-repo-python-callback.sh | 144 ++++++++++++++++ ...epo-sanity-checks-and-bigger-repo-setup.sh | 9 + 5 files changed, 445 insertions(+), 30 deletions(-) diff --git a/Documentation/converting-from-filter-branch.md b/Documentation/converting-from-filter-branch.md index 959b3527..7543cb6c 100644 --- a/Documentation/converting-from-filter-branch.md +++ b/Documentation/converting-from-filter-branch.md @@ -320,20 +320,44 @@ filter-branch: ' ``` -filter-repo decided not to provide a way to run an external program to -do filtering, because most filter-branch uses of this ability are -riddled with [safety -problems](https://git-scm.com/docs/git-filter-branch#SAFETY) and -[performance -issues](https://git-scm.com/docs/git-filter-branch#PERFORMANCE). -However, in special cases like this it's fairly safe. One can write a -script that uses filter-repo as a library to achieve this, while also -gaining filter-repo's automatic handling of other concerns like -rewriting commit IDs in commit messages or pruning commits that become -empty. In fact, one of the [contrib +though it has the disadvantage of running on every c file for every +commit in history, even if some commits do not modify any c files. This +means this kind of command can be excruciatingly slow. + +The same functionality is slightly more involved in filter-repo for +two reasons: + - fast-export and fast-import split file contents and file names into + completely different data structures that aren't normally available + together + - to run a program on a file, you'll need to write the contents to the + a file, execute the program on that file, and then read the contents + of the file back in + +```shell + git filter-repo --file-info-callback ' + if not filename.endswith(b".c"): + return (filename, mode, blob_id) # no changes + + contents = value.get_contents_by_identifier(blob_id) + tmpfile = os.path.basename(filename) + with open(tmpfile, "wb") as f: + f.write(contents) + subprocess.check_call(["clang-format", "-style=file", "-i", filename]) + with open(filename, "rb") as f: + contents = f.read() + new_blob_id = value.insert_file_with_contents(contents) + + return (filename, mode, new_blob_id) + ' +``` + +However, one can write a script that uses filter-repo as a library to +simplify this, while also gaining filter-repo's automatic handling of +other concerns like rewriting commit IDs in commit messages or pruning +commits that become empty. In fact, one of the [contrib demos](../contrib/filter-repo-demos), -[lint-history](../contrib/filter-repo-demos/lint-history), handles -this exact type of situation already: +[lint-history](../contrib/filter-repo-demos/lint-history), was +specifically written to make this kind of case really easy: ```shell lint-history --relevant 'return filename.endswith(b".c")' \ diff --git a/Documentation/git-filter-repo.txt b/Documentation/git-filter-repo.txt index e4c99e43..5717a9ca 100644 --- a/Documentation/git-filter-repo.txt +++ b/Documentation/git-filter-repo.txt @@ -288,6 +288,14 @@ Generic callback code snippets --refname-callback :: Python code body for processing refnames; see <>. +--file-info-callback :: + Python code body for processing the combination of filename, mode, + and associated file contents; see <. Note that when + --file-info-callback is specified, any replacements specified by + --replace-text will not be automatically applied; instead, you + have control within the --file-info-callback to choose which files + to apply those transformations to. + --blob-callback :: Python code body for processing blob objects; see <>. @@ -1164,8 +1172,9 @@ that you should be aware of before using them; see the "API BACKWARD COMPATIBILITY CAVEAT" comment near the top of git-filter-repo source code. -All callback functions are of the same general format. For a command line -argument like +Most callback functions are of the same general format +(--file-info-callback is an exception which will be noted later). For +a command line argument like -------------------------------------------------- --foo-callback 'BODY' @@ -1209,6 +1218,7 @@ callbacks are: --name-callback --email-callback --refname-callback +--file-info-callback -------------------------------------------------- in each you are expected to simply return a new value based on the one @@ -1272,10 +1282,106 @@ git-filter-repo --filename-callback ' ' -------------------------------------------------- -In contrast, the blob, reset, tag, and commit callbacks are not -expected to return a value, but are instead expected to modify the -object passed in. Major fields for these objects are (subject to API -backward compatibility caveats mentioned previously): +The file-info callback is more involved. It is designed to be used in +cases where filtering depends on both filename and contents (and maybe +mode). It is called for file changes other than deletions (since +deletions have no file contents to operate on). The file info +callback takes four parameters (filename, mode, blob_id, and value), +and expects three to be returned (filename, mode, blob_id). The +filename is handled similar to the filename callback; it can be used +to rename the file (or set to None to drop the change). The mode is a +simple bytestring (b"100644" for regular non-executable files, +b"100755" for executable files/scripts, b"120000" for symlinks, and +b"160000" for submodules). The blob_id is most useful in conjunction +with the value parameter. The value parameter is an instance of a +class that has the following functions + value.get_contents_by_identifier(blob_id) -> contents (bytestring) + value.get_size_by_identifier(blob_id) -> size_of_blob (int) + value.insert_file_with_contents(contents) -> blob_id + value.is_binary(contents) -> bool + value.apply_replace_text(contents) -> new_contents (bytestring) +and has the following member data you can write to + value.data (dict) +These functions allow you to get the contents of the file, or its +size, create a new file in the stream whose blob_id you can return, +check whether some given contents are binary (using the heuristic from +the grep(1) command), and apply the replacement rules from --replace-text +(note that --file-info-callback makes the changes from --replace-text not +auto-apply). You could use this for example to only apply the changes +from --replace-text to certain file types and simultaneously rename the +files it applies the changes to: + +-------------------------------------------------- +git-filter-repo --file-info-callback ' + if not filename.endswith(b".config"): + # Make no changes to the file; return as-is + return (filename, mode, blob_id) + + new_filename = filename[0:-7] + b".cfg" + + contents = value.get_contents_by_identifier(blob_id) + new_contents = value.apply_replace_text(contents) + new_blob_id = value.insert_file_with_contents(new_contents) + + return (new_filename, mode, new_blob_id) +-------------------------------------------------- + +Note that if history has multiple revisions with the same file +(e.g. it was cherry-picked to multiple branches or there were a number +of reverts), then the --file-info-callback will be called multiple +times. If you want to avoid processing the same file multiple times, +then you can stash transformation results in the value.data dict. +For, example, we could modify the above example to make it only apply +transformations on blob_ids we have not seen before: + +-------------------------------------------------- +git-filter-repo --file-info-callback ' + if not filename.endswith(b".config"): + # Make no changes to the file; return as-is + return (filename, mode, blob_id) + + new_filename = filename[0:-7] + b".cfg" + + if blob_id in value.data: + return (new_filename, mode, value.data[blob_id]) + + contents = value.get_contents_by_identifier(blob_id) + new_contents = value.apply_replace_text(contents) + new_blob_id = value.insert_file_with_contents(new_contents) + value.data[blob_id] = new_blob_id + + return (new_filename, mode, new_blob_id) +-------------------------------------------------- + +An alternative example for the --file-info-callback is to make all +.sh files executable and add an extra trailing newline to the .sh +files: + +-------------------------------------------------- +git-filter-repo --file-info-callback ' + if not filename.endswith(b".sh"): + # Make no changes to the file; return as-is + return (filename, mode, blob_id) + + # There are only 4 valid modes in git: + # - 100644, for regular non-executable files + # - 100755, for executable files/scripts + # - 120000, for symlinks + # - 160000, for submodules + new_mode = b"100755" + + contents = value.get_contents_by_identifier(blob_id) + new_contents = contents + b"\n" + new_blob_id = value.insert_file_with_contents(new_contents) + + return (filename, new_mode, new_blob_id) +-------------------------------------------------- + +In contrast to the previous callback types, the blob, reset, tag, and +commit callbacks are not expected to return a value, but are instead +expected to modify the object passed in. Major fields for these +objects are (subject to API backward compatibility caveats mentioned +previously): * Blob: `original_id` (original hash) and `data` * Reset: `ref` (name of reference) and `from_ref` (hash or integer mark) diff --git a/git-filter-repo b/git-filter-repo index ee6da900..174773d1 100755 --- a/git-filter-repo +++ b/git-filter-repo @@ -625,6 +625,7 @@ class Blob(_GitElementWithId): """ self.dumped = 1 BLOB_HASH_TO_NEW_ID[self.original_id] = self.id + BLOB_NEW_ID_TO_HASH[self.id] = self.original_id file_.write(b'blob\n') file_.write(b'mark :%d\n' % self.id) @@ -1543,6 +1544,7 @@ def record_id_rename(old_id, new_id): _IDS = _IDs() _SKIPPED_COMMITS = set() BLOB_HASH_TO_NEW_ID = {} +BLOB_NEW_ID_TO_HASH = {} class SubprocessWrapper(object): @staticmethod @@ -1802,7 +1804,7 @@ class FilteringOptions(object): # Provide a long helpful examples section example_text = _('''CALLBACKS - All callback functions are of the same general format. For a command line + Most callback functions are of the same general format. For a command line argument like --foo-callback 'BODY' @@ -1810,7 +1812,11 @@ class FilteringOptions(object): def foo_callback(foo): BODY - Thus, to replace 'Jon' with 'John' in author/committer/tagger names: + The exception on callbacks is the --file-info-callback, which will be + discussed further below. + + Given the callback style, we can thus make a simple callback to replace + 'Jon' with 'John' in author/committer/tagger names: git filter-repo --name-callback 'return name.replace(b"Jon", b"John")' To remove all 'Tested-by' tags in commit (or tag) messages: @@ -1822,6 +1828,29 @@ class FilteringOptions(object): Note that if BODY resolves to a filename, then the contents of that file will be used as the BODY in the callback function. + The --file-info-callback has a more involved function callback; for it the + following code will be compiled and called: + def file_info_callback(filename, mode, blob_id, value): + BODY + + It is designed to be used in cases where filtering depends on both + filename and contents (and maybe mode). It is called for file changes + other than deletions (since deletions have no file contents to operate + on). This callback is expected to return a tuple of (filename, mode, + blob_id). It can make use of the following functions from the value + instance: + value.get_contents_by_identifier(blob_id) -> contents (bytestring) + value.get_size_by_identifier(blob_id) -> size_of_blob (int) + value.insert_file_with_contents(contents) -> blob_id + value.is_binary(contents) -> bool + value.apply_replace_text(contents) -> new_contents (bytestring) + and can read/write the following data member from the value instance: + value.data (dict) + + The filename can be used for renaming the file similar to + --filename-callback (or None to drop the change), and mode is one + of b'100644', b'100755', b'120000', or b'160000'. + For more detailed examples and explanations AND caveats, see https://htmlpreview.github.io/?https://github.com/newren/git-filter-repo/blob/docs/html/git-filter-repo.html#CALLBACKS @@ -2048,6 +2077,9 @@ EXAMPLES callback.add_argument('--filename-callback', metavar="FUNCTION_BODY_OR_FILE", help=_("Python code body for processing filenames; see CALLBACKS " "sections below.")) + callback.add_argument('--file-info-callback', metavar="FUNCTION_BODY_OR_FILE", + help=_("Python code body for processing file and metadata; see " + "CALLBACKS sections below.")) callback.add_argument('--message-callback', metavar="FUNCTION_BODY_OR_FILE", help=_("Python code body for processing messages (both commit " "messages and tag messages); see CALLBACKS section below.")) @@ -2232,6 +2264,10 @@ EXAMPLES args.max_blob_size = int(args.max_blob_size[0:-1]) * mult[suffix] else: args.max_blob_size = int(args.max_blob_size) + if args.file_info_callback and ( + args.stdin or args.blob_callback or args.filename_callback): + raise SystemExit(_("Error: --file-info-callback is incompatible with " + "--stdin, --blob-callback,\nand --filename-callback.")) @staticmethod def get_replace_text(filename): @@ -2821,6 +2857,56 @@ class RepoAnalyze(object): RepoAnalyze.write_report(reportdir, stats) sys.stdout.write(_("done.\n")) +class FileInfoValueHelper: + def __init__(self, replace_text, insert_blob_func, source_working_dir): + self.data = {} + self._replace_text = replace_text + self._insert_blob_func = insert_blob_func + cmd = ['git', 'cat-file', '--batch-command'] + self._cat_file_process = subproc.Popen(cmd, + stdin = subprocess.PIPE, + stdout = subprocess.PIPE, + cwd = source_working_dir) + + def finalize(self): + self._cat_file_process.stdin.close() + self._cat_file_process.wait() + + def get_contents_by_identifier(self, blobhash): + self._cat_file_process.stdin.write(b'contents '+blobhash+b'\n') + self._cat_file_process.stdin.flush() + line = self._cat_file_process.stdout.readline() + (oid, oidtype, size) = line.split() + size = int(size) # Convert e.g. b'6283' to 6283 + assert(oidtype == b'blob') + contents_plus_newline = self._cat_file_process.stdout.read(size+1) + return contents_plus_newline[:-1] # return all but the newline + + def get_size_by_identifier(self, blobhash): + self._cat_file_process.stdin.write(b'info '+blobhash+b'\n') + self._cat_file_process.stdin.flush() + line = self._cat_file_process.stdout.readline() + (oid, oidtype, size) = line.split() + size = int(size) # Convert e.g. b'6283' to 6283 + assert(oidtype == b'blob') + return size + + def insert_file_with_contents(self, contents): + blob = Blob(contents) + self._insert_blob_func(blob) + return blob.id + + def is_binary(self, contents): + return b"\0" in contents[0:8192] + + def apply_replace_text(self, contents): + new_contents = contents + for literal, replacement in self._replace_text['literals']: + new_contents = new_contents.replace(literal, replacement) + for regex, replacement in self._replace_text['regexes']: + new_contents = regex.sub(replacement, new_contents) + return new_contents + class InputFileBackup: def __init__(self, input_file, output_file): self.input_file = input_file @@ -2869,7 +2955,8 @@ class RepoFilter(object): commit_callback = None, tag_callback = None, reset_callback = None, - done_callback = None): + done_callback = None, + file_info_callback = None): self._args = args @@ -2889,8 +2976,12 @@ class RepoFilter(object): self._name_callback = name_callback # author, committer, tagger self._email_callback = email_callback # author, committer, tagger self._refname_callback = refname_callback # from commit/tag/reset + self._file_info_callback = file_info_callback # various file info self._handle_arg_callbacks() + # Helpers for callbacks + self._file_info_value = None + # Defaults for input self._input = None self._fep = None # Fast Export Process @@ -2972,15 +3063,20 @@ class RepoFilter(object): self._hash_re = re.compile(br'(\b[0-9a-f]{7,40}\b)') def _handle_arg_callbacks(self): - def make_callback(argname, bdy): + def make_callback(args, bdy): callback_globals = {g: globals()[g] for g in public_globals} callback_locals = {} - exec('def callback({}, _do_not_use_this_var = None):\n'.format(argname)+ + if type(args) == str: + args = (args, '_do_not_use_this_var = None') + exec('def callback({}):\n'.format(', '.join(args))+ ' '+'\n '.join(bdy.splitlines()), callback_globals, callback_locals) return callback_locals['callback'] - def handle(which): - callback_field = '_{}_callback'.format(which) - code_string = getattr(self._args, which+'_callback') + def handle(which, args=None): + which_under = which.replace('-','_') + if not args: + args = which + callback_field = '_{}_callback'.format(which_under) + code_string = getattr(self._args, which_under+'_callback') if code_string: if os.path.exists(code_string): with open(code_string, 'r', encoding='utf-8') as f: @@ -2988,12 +3084,12 @@ class RepoFilter(object): if getattr(self, callback_field): raise SystemExit(_("Error: Cannot pass a %s_callback to RepoFilter " "AND pass --%s-callback" - % (which, which))) + % (which_under, which))) if 'return ' not in code_string and \ which not in ('blob', 'commit', 'tag', 'reset'): raise SystemExit(_("Error: --%s-callback should have a return statement") % which) - setattr(self, callback_field, make_callback(which, code_string)) + setattr(self, callback_field, make_callback(args, code_string)) handle('filename') handle('message') handle('name') @@ -3003,6 +3099,7 @@ class RepoFilter(object): handle('commit') handle('tag') handle('reset') + handle('file-info', ('filename', 'mode', 'blob_id', 'value')) def _run_sanity_checks(self): self._sanity_checks_handled = True @@ -3498,6 +3595,7 @@ class RepoFilter(object): blob.skip() if ( self._args.replace_text + and not self._file_info_callback # not (if blob contains zero byte in the first 8Kb, that is, if blob is binary data) and not b"\0" in blob.data[0:8192] ): @@ -3749,6 +3847,37 @@ class RepoFilter(object): orig_file_changes = set(commit.file_changes) self._filter_files(commit) + # Process the --file-info-callback + if self._file_info_callback: + if self._file_info_value is None: + source_working_dir = self._args.source or b'.' + self._file_info_value = FileInfoValueHelper(self._args.replace_text, + self.insert, + source_working_dir) + new_file_changes = [] + for change in commit.file_changes: + if change.type != b'D': + assert(change.type == b'M') + (filename, mode, blob_id) = \ + self._file_info_callback(change.filename, + change.mode, + change.blob_id, + self._file_info_value) + if mode is None: + # TODO: Should deletion of the file even be a feature? Might + # want to remove this branch of the if-elif-else. + assert(filename is not None) + assert(blob_id is not None) + new_change = FileChange(b'D', filename) + elif filename is None: + continue # Drop the FileChange from this commit + else: + new_change = FileChange(b'M', filename, blob_id, mode) + else: + new_change = change # use change as-is for deletions + new_file_changes.append(new_change) + commit.file_changes = new_file_changes + # Call the user-defined callback, if any if self._commit_callback: self._commit_callback(commit, self.callback_metadata(aux_info)) @@ -3963,7 +4092,8 @@ class RepoFilter(object): else: self._read_stash() skip_blobs = (self._blob_callback is None and - self._args.replace_text is None and + (self._args.replace_text is None or + self._file_info_callback is not None) and self._args.source == self._args.target) extra_flags = [] if skip_blobs: @@ -4068,6 +4198,8 @@ class RepoFilter(object): self._finalize_handled = True self._done_callback and self._done_callback() + if self._file_info_value: + self._file_info_value.finalize() if not self._args.quiet: self._progress_writer.finish() diff --git a/t/t9392-filter-repo-python-callback.sh b/t/t9392-filter-repo-python-callback.sh index 1f8be45f..2109a83c 100755 --- a/t/t9392-filter-repo-python-callback.sh +++ b/t/t9392-filter-repo-python-callback.sh @@ -60,6 +60,19 @@ test_expect_success '--filename-callback' ' ) ' +test_expect_success '--file-info-callback acting like --filename-callback' ' + setup fileinfo-as-filename-callback && + ( + cd fileinfo-as-filename-callback && + git filter-repo --file-info-callback "return (None if filename.endswith(b\".doc\") else b\"src/\"+filename, mode, blob_id)" && + git log --format=%n --name-only | sort | uniq | grep -v ^$ > f && + ! grep file.doc f && + COMPARE=$(wc -l filtered_f && + test_line_count = $COMPARE filtered_f + ) +' + test_expect_success '--message-callback' ' setup message-callback && ( @@ -129,6 +142,21 @@ test_expect_success '--blob-callback' ' ) ' +test_expect_success '--file-info-callback acting like --blob-callback' ' + setup fileinfo-as-blob-callback && + ( + cd fileinfo-as-blob-callback && + git log --format=%n --name-only | sort | uniq | grep -v ^$ > f && + test_line_count = 5 f && + rm f && + git filter-repo --file-info-callback " + size = value.get_size_by_identifier(blob_id) + return (None if size > 25 else filename, mode, blob_id)" && + git log --format=%n --name-only | sort | uniq | grep -v ^$ > f && + test_line_count = 2 f + ) +' + test_expect_success '--commit-callback' ' setup commit-callback && ( @@ -227,4 +255,120 @@ test_expect_success 'tweaking just a tag' ' ) ' +test_expect_success '--file-info-callback messing with history' ' + setup messing_with_files && + ( + cd messing_with_files && + + echo "1-2-3-4==>1-2-3-4-5" >replacement && + # Trying to count the levels of backslash escaping is not fun. + echo "regex:\\\$[^\$]*\\\$==>cvs is lame" >>replacement && + git filter-repo --force --file-info-callback " + size = value.get_size_by_identifier(blob_id) + contents = value.get_contents_by_identifier(blob_id) + if not value.is_binary(contents): + contents = value.apply_replace_text(contents) + if contents[-1] != 10: + contents += bytes([10]) + blob_id = value.insert_file_with_contents(contents) + newname = bytes(reversed(filename)) + if size == 27 and len(contents) == 27: + newname = None + return (newname, mode, blob_id) + " --replace-text replacement && + + cat <<-EOF >expect && + c.raboof + dlrow + ecivda + terces + EOF + + git ls-files >actual && + test_cmp expect actual && + + echo "The launch code is 1-2-3-4-5." >expect && + test_cmp expect terces && + + echo " cvs is lame" >expect && + test_cmp expect c.raboof + ) +' + +test_expect_success '--file-info-callback and deletes and drops' ' + setup file_info_deletes_drops && + ( + cd file_info_deletes_drops && + + git rm file.doc && + git commit -m "Nuke doc file" && + + git filter-repo --force --file-info-callback " + size = value.get_size_by_identifier(blob_id) + (newname, newmode) = (filename, mode) + if filename == b\"world\" and size == 12: + newname = None + if filename == b\"advice\" and size == 77: + newmode = None + return (newname, newmode, blob_id) + " + + cat <<-EOF >expect && + foobar.c + secret + world + EOF + + echo 1 >expect && + git rev-list --count HEAD -- world >actual && + test_cmp expect actual && + + echo 2 >expect && + git rev-list --count HEAD -- advice >actual && + test_cmp expect actual && + + echo hello >expect && + test_cmp expect world + ) +' + +test_lazy_prereq UNIX2DOS ' + unix2dos -h + test $? -ne 127 +' + +test_expect_success UNIX2DOS '--file-info-callback acting like lint-history' ' + setup lint_history_replacement && + ( + cd lint_history_replacement && + git ls-files -s | grep -v file.doc >expect && + + git filter-repo --force --file-info-callback " + if not filename.endswith(b\".doc\"): + return (filename, mode, blob_id) + + if blob_id in value.data: + return (filename, mode, value.data[blob_id]) + + contents = value.get_contents_by_identifier(blob_id) + tmpfile = os.path.basename(filename) + with open(tmpfile, \"wb\") as f: + f.write(contents) + subprocess.check_call([\"unix2dos\", filename]) + with open(filename, \"rb\") as f: + contents = f.read() + new_blob_id = value.insert_file_with_contents(contents) + + value.data[blob_id] = new_blob_id + return (filename, mode, new_blob_id) + " && + + git ls-files -s | grep -v file.doc >actual && + test_cmp expect actual && + + printf "A file that you cant trust\r\n" >expect && + test_cmp expect file.doc + ) +' + test_done diff --git a/t/t9394-filter-repo-sanity-checks-and-bigger-repo-setup.sh b/t/t9394-filter-repo-sanity-checks-and-bigger-repo-setup.sh index 82e107b5..6dacc84b 100755 --- a/t/t9394-filter-repo-sanity-checks-and-bigger-repo-setup.sh +++ b/t/t9394-filter-repo-sanity-checks-and-bigger-repo-setup.sh @@ -1001,6 +1001,15 @@ test_expect_success 'other startup error cases and requests for help' ' test_must_fail git filter-repo --path-rename foo:bar --use-base-name 2>err && test_i18ngrep ": --use-base-name and --path-rename are incompatible" err && + test_must_fail git filter-repo --file-info-callback "None" --stdin 2>err && + test_i18ngrep ": --file-info-callback is incompatible with" err && + + test_must_fail git filter-repo --file-info-callback "None" --blob-callback "None" 2>err && + test_i18ngrep ": --file-info-callback is incompatible with" err && + + test_must_fail git filter-repo --file-info-callback "None" --filename-callback "None" 2>err && + test_i18ngrep ": --file-info-callback is incompatible with" err && + test_must_fail git filter-repo --path-rename foo:bar/ 2>err && test_i18ngrep "either ends with a slash then both must." err &&