@@ -15,6 +15,183 @@ subscribe, see [the Git Rev News page](https://git.github.io/rev_news/rev_news/)
This edition covers what happened during the month of January 2020.

+ ## The Pros and Cons of Reposurgeon (*written by <a href="http://www.catb.org/~esr/">Eric S. Raymond</a>*)
+
+ On January 12th 2020, the history of the GNU Compiler Collection was
+ lifted from Subversion to Git. At 280K commits, with a history
+ containing traces of two previous version-control systems (CVS and
+ RCS), this was the largest and most complex repository conversion of an
+ open-source project ever. It swamped the previous record-holder -
+ Emacs's move from Bazaar to Git back in 2014 - by an order of magnitude.
+
+ Both those conversions were done by [reposurgeon](https://gitlab.com/esr/reposurgeon).
+ Neither of them could practically have been performed by any other
+ conversion tool available. This article will explain why that is, and
+ under what circumstances you might consider using reposurgeon
+ yourself.
+
+ Let's start with a brief description of what reposurgeon actually
+ does. When you use it, you start by reading in a version-control
+ repository...but actually, that's not quite right. What reposurgeon
+ actually does is read in a git fast-import stream. It looks like it
+ reads repositories because it knows how to call front ends that use
+ exporters such as git-fast-export and cvs-fast-export to serialize a
+ repository for it.
+
+ Actually, that's not quite right either. Subversion doesn't have an
+ exporter - there is no svn-fast-export (well, not one that works for
+ more than trivial cases, anyway). Instead, reposurgeon reads the
+ native serialization produced by Subversion's svnadmin dump
+ tool. Internally, this is massaged into the equivalent of a git
+ fast-import stream and represented as one inside reposurgeon.
+
+ There are reposurgeon-compatible exporters for RCS, CVS, bzr, hg, SRC,
+ bk, and of course git itself. With a little extra work using sccs2rcs
+ it's possible to reach all the way back to collections of SCCS files.
+
+ Now that you've caught your repository, what can you do with it?
+
+ I observed earlier that what you have, internally, is a deserialized
+ version of a git fast-import stream. A productive way to think about
+ what reposurgeon does is to remember that this is basically just a DAG
+ (directed acyclic graph) with text attached to the nodes. Now think of
+ reposurgeon as an editor for this graph and its nodes. Then, think of
+ it as a DSL (domain-specific language) designed to be *scripted* -
+ that is, designed to reproducibly apply editing procedures to this
+ graph.
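To make the "DAG with text attached" view concrete, here is a minimal Python sketch that reads a tiny fast-import stream and records each commit's message and parent marks. It is an illustration only, not reposurgeon's code: the `Commit` class, the `parse_stream` function and the `SAMPLE` data are hypothetical names, and the parser assumes the counted `data <n>` form with no inline file data.

```python
import io
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Commit:
    """One node of the graph: a mark, a branch, a message, parent marks."""
    mark: str
    branch: str
    message: Optional[str] = None
    parents: List[str] = field(default_factory=list)


def parse_stream(raw: bytes) -> Dict[str, Commit]:
    """Return {mark: Commit} for the commits found in a fast-import stream."""
    stream = io.BytesIO(raw)
    commits: Dict[str, Commit] = {}
    current: Optional[Commit] = None
    while True:
        line = stream.readline()
        if not line:
            break
        line = line.rstrip(b"\n")
        if line.startswith(b"commit "):
            current = Commit(mark="", branch=line[7:].decode())
        elif line == b"blob" or line.startswith((b"tag ", b"reset ")):
            # Not a commit: ignore marks and parents until the next commit.
            current = None
        elif line.startswith(b"mark ") and current is not None:
            current.mark = line[5:].decode()
            commits[current.mark] = current
        elif line.startswith(b"data "):
            # Counted form: exactly <n> bytes of payload follow the header.
            payload = stream.read(int(line[5:]))
            if current is not None and current.message is None:
                current.message = payload.decode("utf-8", "replace")
        elif line.startswith((b"from ", b"merge ")) and current is not None:
            current.parents.append(line.split(b" ", 1)[1].decode())
    return commits


# A hand-written two-commit stream (made-up identity and timestamps).
SAMPLE = (
    b"blob\nmark :1\ndata 6\nhello\n"
    b"commit refs/heads/master\nmark :2\n"
    b"committer A U Thor <author@example.com> 1577836800 +0000\n"
    b"data 6\nfirst\n"
    b"M 100644 :1 hello.txt\n\n"
    b"commit refs/heads/master\nmark :3\n"
    b"committer A U Thor <author@example.com> 1577836900 +0000\n"
    b"data 7\nsecond\n"
    b"from :2\nM 100644 :1 hello.txt\n\n"
)

if __name__ == "__main__":
    for mark, commit in parse_stream(SAMPLE).items():
        print(mark, commit.parents, repr(commit.message))
```

Once the stream is in this form, an "edit" such as reparenting a commit amounts to changing a node's `parents` list before serializing the graph again.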
+
+ So the general answer to "what can you do with it" is "anything you
+ want to". I enjoy thinking about and implementing DSLs, and once I had
+ the basic design idea it was pretty much inevitable that I was going
+ to write the most general set of primitives I could imagine - and I
+ have a very fertile imagination.
+
+ Elijah Newren's aside on reposurgeon in [Git Rev News 54](https://git.github.io/rev_news/2019/08/21/edition-54/)
+ described it as “GDB for history rewriting”. That's a pretty good
+ analogy, actually. Better than even I knew until recently, because it
+ turns out the Python Cmd library I originally used to write its
+ command interpreter was designed to emulate the interface style of gdb
+ and earlier symbolic debuggers.
+
+ Accordingly, you can immediately use reposurgeon for a lot of
+ relatively simple tasks like (1) removing extremely bulky content that
+ shouldn't have been checked in, (2) partitioning and merging
+ repositories, (3) transcoding Latin-1 metadata to UTF-8, and (4)
+ debubbling an unnecessarily complex history to make reading it easier.
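As an illustration of task (3), a common heuristic for transcoding metadata is to leave bytes that already decode as UTF-8 alone and treat everything else as Latin-1. The sketch below assumes that heuristic; the function name `to_utf8` is hypothetical and this is not a description of how reposurgeon itself does it.

```python
def to_utf8(raw: bytes) -> bytes:
    """Return raw as UTF-8, treating bytes that are not valid UTF-8 as Latin-1."""
    try:
        raw.decode("utf-8")
        return raw  # already valid UTF-8, leave untouched
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a code point, so this cannot fail.
        return raw.decode("latin-1").encode("utf-8")


if __name__ == "__main__":
    # A committer name stored as Latin-1 (0xE9 is "é" in that encoding).
    print(to_utf8(b"Jos\xe9 Example"))
```

The heuristic can misfire on Latin-1 text that happens to be valid UTF-8, which is one reason to review the results rather than apply it blindly.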
+
+ Often, though, those things can be done with other tools like Elijah's
+ git-filter-repo. It's repository conversions for which you are likely
+ to actually *need* the full power of a domain-specific language
+ designed for repository surgery.
+
+ Which brings us to how you write out your graph as a live
+ repository. Reposurgeon doesn't do that directly either. When it needs
+ to write out a repository, it hands a git fast-import stream to an
+ importer back end. That could be git fast-import itself, or the
+ corresponding importers for hg, bzr, darcs, bk, RCS, or SRC.
+
+ Here's what reading in and immediately converting a small Subversion
+ dump would look like:
+
+ ```shell
+ $ reposurgeon
+ reposurgeon% read < foo.svn
+ 23 svn revisions (0K/s)
+ * foo
+ reposurgeon% prefer git
+ git is the preferred type.
+ reposurgeon% rebuild bar
+ reposurgeon: rebuild is complete.
+ reposurgeon: no preservations.
+ reposurgeon%
+ ```
+
+ In theory you now have a Git repository named "bar" in your current
+ directory that is a perfect translation of foo. In practice, for any
+ nontrivial repository, you probably have a bit of a mess on your
+ hands.
+
+ If you had read in any Git repository and written it out again, you'd
+ get a perfect copy. But when you're moving histories between
+ *different* version-control systems, you have to deal with the
+ mismatch between the source system's model of version control and the
+ target's.
+
+ A good example of this is the fact that Subversion doesn't have
+ anything directly corresponding to a Git tag. A Subversion tag is
+ actually a directory copy operation with a target under the tags/
+ directory. The copy operation leaves a commit in place which, if moved
+ literally to gitspace, would just be junk. What you want is to move
+ the metadata of that commit to an annotated tag.
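The idea of moving a tag-copy commit's metadata onto an annotated tag can be sketched in Python as follows. This is only an illustration under stated assumptions, not reposurgeon's implementation: the dictionary layout, the function name `tag_copy_to_annotated_tag` and the sample values are hypothetical. The output follows the `tag` command syntax of git fast-import (`tag`, `from`, `tagger`, `data`).

```python
import posixpath


def tag_copy_to_annotated_tag(commit: dict) -> bytes:
    """Build a fast-import `tag` stanza from a Subversion tag-copy commit.

    `commit` is a hypothetical record with keys: path (the copy target under
    tags/), parent (mark of the commit the copy was taken from), committer,
    when, and message.
    """
    name = posixpath.basename(commit["path"].rstrip("/"))
    body = commit["message"].encode("utf-8")
    header = (
        f"tag {name}\n"
        f"from {commit['parent']}\n"
        f"tagger {commit['committer']} {commit['when']}\n"
        f"data {len(body)}\n"  # byte count of the tag message
    )
    return header.encode("utf-8") + body + b"\n"


# Made-up example of the "junk" commit left behind by an svn copy into tags/.
JUNK_COMMIT = {
    "path": "tags/release-1.0",
    "parent": ":41",
    "committer": "J. Random Hacker <jrh@example.com>",
    "when": "1577836800 +0000",
    "message": "Tag release 1.0\n",
}

if __name__ == "__main__":
    print(tag_copy_to_annotated_tag(JUNK_COMMIT).decode("utf-8"))
```

The copy commit itself is then dropped from the stream, so the tag points at the commit the copy was taken from and still carries the original author, date and comment.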
+
+ Many attempts at importers silently botch this in practice, but at least
+ it can be handled automatically in theory - and reposurgeon does that. The
+ mess you're likely to have on your hands anyway is due to Subversion
+ operator errors, scar tissue from a previous conversion out of CVS, and
+ use of git-svn as a live gateway to the repository.
+
+ The most common symptom of all these error sources is misplaced branch
+ joins; in extreme cases you may even have disconnected
+ branches. Reposurgeon enables you to audit for and repair this kind of
+ defect. Here are a few examples of that kind of repair done on the GCC
+ repository:
+
+ ```
+ # /branches/GC_5_0_ALPHA_1
+ <27855>|<27860> reparent --use-order
+ # /branches/apple-200511-release-branch
+ <105446>|<105574> reparent --use-order
+ # /branches/apple-gcc_os_35-branch
+ <90334>|<90607> reparent --use-order
+ # /branches/apple-tiger-release-branch
+ <96593>|<96595> reparent --use-order
+ ```
+
+ The GCC conversion was pretty hairy - 343 lines of DSL scripting - but
+ there are whole new levels of complexity when, as still sometimes
+ happens, you need to recover history from pre-version-controlled
+ sources to stitch the repository together.
+
+ In [one extreme case](http://esr.ibiblio.org/?p=2491), I ended up
+ stitching together material from 18 different release tarballs, 11
+ unreleased snapshot tarballs, one release tarball I could reconstruct,
+ one release tarball mined out of an obsolete Red Hat source RPM, two
+ shar archives, a pax archive, five published patches, two zip files, a
+ darcs archive, and a partial RCS history.
+
+ But reposurgeon can handle this, because it makes conversion
+ experiments easy. The workflow it's designed for is carefully building
+ a script that assembles your source repository and other data into a
+ simulacrum of what a Git repository tracking your project from the
+ beginning of time would have looked like.
+
+ Almost never will you get this right the first time. It takes testing,
+ polishing, tripping over assumptions you didn't know you and your
+ tools were making, and correcting for those assumptions. In the GCC
+ case it took many hours of work to locate and develop fixes for the
+ misplaced branch joins.
+
+ A subtle but important point is that I didn't do that work
+ myself. That kind of thing is not a job for reposurgeon's maintainer,
+ it's a job for a "Mr. Inside" who knows the project's history
+ intimately - in this case it was actually the GCC project lead, Joseph
+ Myers. One of reposurgeon's requirements is that it has to be a tool
+ that a "Mr. Inside" can learn to use with minimum friction.
+
+ And generally it is, if you're being driven to it by the kind of
+ problem it was designed to solve - it's like gdb that way. I've been
+ taken to task about the tool having no intro documentation; this is
+ not because I'm lazy, it's because there's
+ [no plausible way to write any](http://esr.ibiblio.org/?p=8551), any
+ more than there is for gdb. You're ready to learn reposurgeon, as
+ Joseph Myers did, when you're stuck into a conversion or editing
+ problem so deep that the *very* complete
+ [reposurgeon command reference](http://www.catb.org/~esr/reposurgeon/reposurgeon.html)
+ starts to make sense to you.
+
+ You can find more about conversions with reposurgeon
+ [here](http://www.catb.org/~esr/reposurgeon/dvcs-migration-guide.html).
+

## Discussions

<!-- -
Jakub Narębski &lt;<[email protected]>&gt;,
Markus Jansen &lt;<[email protected]>&gt; and
Kaartic Sivaraam &lt;<[email protected]>&gt;
- with help from XXX.
+ with help from Eric S. Raymond.