Commit bc844d3

committed
rn-60: add reposurgeon article
1 parent c593d66 commit bc844d3

1 file changed

+178
-1
lines changed

rev_news/drafts/edition-60.md

Lines changed: 178 additions & 1 deletion
@@ -15,6 +15,183 @@ subscribe, see [the Git Rev News page](https://git.github.io/rev_news/rev_news/)
1515

1616
This edition covers what happened during the month of January 2020.
1717

18+
## The Pros and Cons of Reposurgeon (*written by <a href="http://www.catb.org/~esr/">Eric S. Raymond</a>*)
19+
20+
On January 12th 2020, the history of the GNU Compiler Collection was
21+
lifted from Subversion to Git. At 280K commits, with a history
22+
containing traces of two previous version-control systems (CVS and
23+
RCS), this was the largest and most complex repository conversion of an
24+
open-source project ever. It swamped the previous record-holder -
25+
Emacs's move from Bazaar to Git back in 2014 - by an order of magnitude.
26+
27+
Both those conversions were done by [reposurgeon](https://gitlab.com/esr/reposurgeon).
28+
Neither of them could practically have been performed by any other
29+
conversion tool available. This article will explain why that is, and
30+
under what circumstances you might consider using reposurgeon
31+
yourself.
32+
33+
Let's start with a brief description of what reposurgeon actually
34+
does. When you use it, you start by reading in a version-control
35+
repository...but actually, that's not quite right. What reposurgeon
36+
actually does is read in a git fast-import stream. It looks like it
37+
reads repositories because it knows how to call front ends that use
38+
exporters such as git-fast-export and cvs-fast-export to serialize a
39+
repository for it.
40+
41+
Actually, that's not quite right either. Subversion doesn't have an
42+
exporter - there is no svn-fast-export (well, not one that works for
43+
more than trivial cases, anyway). Instead, reposurgeon reads the
44+
native serialization produced by Subversion's svnadmin dump
45+
tool. Internally, this is massaged into the equivalent of a git
46+
fast-import stream and represented as one inside reposurgeon.
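
To make the stream idea concrete, here is a minimal sketch in Python that walks a git fast-import stream and records each commit's mark, parent marks, and message. It is not reposurgeon's own code. It assumes a simplified stream that uses only counted `data` commands (no delimited form), and the function name `parse_commits` and the sample stream are invented for illustration.

```python
import io


def parse_commits(stream):
    """Return (mark, parents, message) tuples for each commit in the stream."""
    src = io.BytesIO(stream)
    commits = []
    current = None  # the commit being read, or None outside a commit
    while True:
        line = src.readline()
        if not line:
            break
        line = line.rstrip(b"\n")
        if line.startswith(b"commit "):
            current = {"mark": None, "parents": [], "message": None}
            commits.append(current)
        elif line == b"blob" or line.startswith((b"tag ", b"reset ")):
            current = None
        elif line.startswith(b"data "):
            # A counted payload follows; read exactly that many bytes.
            payload = src.read(int(line[5:].decode()))
            # The first data section inside a commit is its message.
            if current is not None and current["message"] is None:
                current["message"] = payload.decode("utf-8", "replace")
        elif current is not None and line.startswith(b"mark "):
            current["mark"] = line[5:].decode()
        elif current is not None and line.startswith((b"from ", b"merge ")):
            current["parents"].append(line.split(b" ", 1)[1].decode())
    return [(c["mark"], c["parents"], c["message"]) for c in commits]


# A hand-written two-commit stream (hypothetical content).
SAMPLE = (
    b"blob\n"
    b"mark :1\n"
    b"data 6\n"
    b"hello\n"
    b"\n"
    b"commit refs/heads/master\n"
    b"mark :2\n"
    b"committer A U Thor <author@example.com> 0 +0000\n"
    b"data 8\n"
    b"initial\n"
    b"M 100644 :1 README\n"
    b"\n"
    b"commit refs/heads/master\n"
    b"mark :3\n"
    b"committer A U Thor <author@example.com> 60 +0000\n"
    b"data 7\n"
    b"second\n"
    b"from :2\n"
    b"M 100644 :1 README\n"
)

commits = parse_commits(SAMPLE)
```

The point of the sketch is only that the stream is a flat, line-oriented serialization from which the commit graph can be recovered by following `mark`, `from`, and `merge`.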
47+
48+
There are reposurgeon-compatible exporters for RCS, CVS, bzr, hg, SRC,
49+
bk, and of course git itself. With a little extra work using sccs2rcs
50+
it's possible to reach all the way back to collections of SCCS files.
51+
52+
Now that you've caught your repository, what can you do with it?
53+
54+
I observed earlier that what you have, internally, is a deserialized
55+
version of a git fast-import stream. A productive way to think about
56+
what reposurgeon does is to remember that this is basically just a DAG
57+
(directed acyclic graph) with text attached to the nodes. Now think of
58+
reposurgeon as an editor for this graph and its nodes. Then, think of
59+
it as a DSL (domain-specific language) designed to be *scripted* -
60+
that is, designed to reproducibly apply editing procedures to this
61+
graph.
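
The "scripted graph editor" idea can be sketched in a few lines of Python. This is a toy model under loud assumptions, not how reposurgeon is implemented: commits carry only a mark, a comment, and parent marks (file contents are ignored), and the names `Commit`, `delete_commit`, `set_comment`, and `run_script` are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Commit:
    mark: str
    comment: str
    parents: list = field(default_factory=list)


def delete_commit(graph, mark):
    """Remove a commit and reattach its children to its parents."""
    gone = graph.pop(mark)
    for commit in graph.values():
        if mark in commit.parents:
            i = commit.parents.index(mark)
            commit.parents[i:i + 1] = gone.parents


def set_comment(graph, mark, text):
    """Replace the text attached to one node."""
    graph[mark].comment = text


def run_script(graph, script):
    """Apply a list of (operation, *arguments) steps in order."""
    for op, *args in script:
        op(graph, *args)
    return graph


def sample_graph():
    # Invented three-commit linear history.
    return {
        ":1": Commit(":1", "initial import"),
        ":2": Commit(":2", "oops, checked in a huge binary", [":1"]),
        ":3": Commit(":3", "fix typo", [":2"]),
    }


SCRIPT = [
    (delete_commit, ":2"),
    (set_comment, ":3", "Fix typo in README"),
]

result = run_script(sample_graph(), SCRIPT)
rerun = run_script(sample_graph(), SCRIPT)
```

Because the edits live in a script rather than in someone's shell history, running them again on a fresh copy of the input gives the same graph, which is what makes repeated conversion experiments practical.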
62+
63+
So the general answer to "what can you do with it" is "anything you
64+
want to". I enjoy thinking about and implementing DSLs, and once I had
65+
the basic design idea it was pretty much inevitable that I was going
66+
to write the most general set of primitives I could imagine - and I
67+
have a very fertile imagination.
68+
69+
Elijah Newren's aside on reposurgeon in [Git Rev News 54](https://git.github.io/rev_news/2019/08/21/edition-54/)
70+
described it as “GDB for history rewriting”. That's a pretty good
71+
analogy, actually. Better than even I knew until recently, because it
72+
turns out the Python Cmd library I originally used to write its
73+
command interpreter was designed to emulate the interface style of gdb
74+
and earlier symbolic debuggers.
75+
76+
Accordingly, you can immediately use reposurgeon for a lot of
77+
relatively simple tasks like (1) removing extremely bulky content that
78+
shouldn't have been checked in, (2) partitioning and merging
79+
repositories, (3) transcoding Latin-1 metadata to UTF-8, (4)
80+
debubbling an unnecessarily complex history to make reading it easier.
81+
82+
Often, though, those things can be done with other tools like Elijah Newren's
83+
git-filter-repo. It's repository conversions for which you are likely
84+
to actually *need* the full power of a domain-specific language
85+
designed for repository surgery.
86+
87+
Which brings us to how you write out your graph as a live
88+
repository. Reposurgeon doesn't do that directly either. When it needs
89+
to write out a repository, it hands a git fast-import stream to an
90+
importer back end. That could be git fast-import itself, or the
91+
corresponding importers for hg, bzr, darcs, bk, RCS, or SRC.
92+
93+
Here's what reading in and immediately converting a small Subversion
94+
dump would look like:
95+
96+
```shell
97+
$ reposurgeon
98+
reposurgeon% read <foo.svn
99+
23 svn revisions (0K/s)
100+
* foo
101+
reposurgeon% prefer git
102+
git is the preferred type.
103+
reposurgeon% rebuild bar
104+
reposurgeon: rebuild is complete.
105+
reposurgeon: no preservations.
106+
reposurgeon%
107+
```
108+
109+
In theory you now have a Git repository named "bar" in your current
110+
directory that is a perfect translation of foo. In practice, for any
111+
nontrivial repository, you probably have a bit of a mess on your
112+
hands.
113+
114+
If you had read in any Git repository and written it out again, you'd
115+
get a perfect copy. But when you're moving histories between
116+
*different* version-control systems, you have to deal with the
117+
mismatch between the source system's model of version control and the
118+
target's.
119+
120+
A good example of this is the fact that Subversion doesn't have
121+
anything directly corresponding to a Git tag. A Subversion tag is
122+
actually a directory copy operation with a target under the tags/
123+
directory. The copy operation leaves a commit in place which, if moved
124+
literally to gitspace, would just be junk. What you want is to move
125+
the metadata of that commit to an annotated tag.
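
As a rough sketch of that transformation in Python: a commit that sits on a `tags/` path, has exactly one parent, and changes no files is turned into an annotated tag pointing at its parent. The detection rule here is a deliberate simplification and an assumption, not reposurgeon's actual algorithm, and the names `Commit`, `Tag`, and `tagify` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Commit:
    mark: str
    branch: str
    author: str
    comment: str
    parents: list = field(default_factory=list)
    fileops: list = field(default_factory=list)


@dataclass
class Tag:
    name: str
    target: str
    tagger: str
    comment: str


def tagify(commits):
    """Split commits into real commits and annotated tags."""
    kept, tags = [], []
    for c in commits:
        is_tag_copy = (
            c.branch.startswith("tags/")
            and len(c.parents) == 1
            and not c.fileops
        )
        if is_tag_copy:
            # Move the commit's metadata onto a tag aimed at its parent.
            name = c.branch[len("tags/"):]
            tags.append(Tag(name, c.parents[0], c.author, c.comment))
        else:
            kept.append(c)
    return kept, tags


# Invented history: one real commit, then a pure directory copy to tags/1.0.
history = [
    Commit(":1", "trunk", "alice", "first release", [], ["M README"]),
    Commit(":2", "tags/1.0", "bob", "Tag release 1.0", [":1"]),
]

kept, tags = tagify(history)
```

The sketch ignores the harder cases, such as a tag directory that was modified after the copy, which is exactly where real conversions get messy.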
126+
127+
Many attempts at importers silently botch this in practice, but at least
128+
it can be handled automatically in theory - and reposurgeon does that. The
129+
mess you're likely to have on your hands anyway is due to Subversion
130+
operator errors, scar tissue from a previous conversion out of CVS, and
131+
use of git-svn as a live gateway to the repository.
132+
133+
The most common symptom of all these error sources is misplaced branch
134+
joins; in extreme cases you may even have disconnected
135+
branches. Reposurgeon enables you to audit for and repair this kind of
136+
defect. Here are a few examples of that kind of repair done on the GCC
137+
repository:
138+
139+
```
140+
# /branches/GC_5_0_ALPHA_1
141+
<27855>|<27860> reparent --use-order
142+
# /branches/apple-200511-release-branch
143+
<105446>|<105574> reparent --use-order
144+
# /branches/apple-gcc_os_35-branch
145+
<90334>|<90607> reparent --use-order
146+
# /branches/apple-tiger-release-branch
147+
<96593>|<96595> reparent --use-order
148+
```
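
For readers who think better in a general-purpose language, here is a toy Python model of what a repair like the first one above does to the graph. It assumes the simplified reading that the last commit in the selection is the one modified and the earlier ones become its parents; the helper `reparent` and the starting parent links are invented for illustration, and only the two revision numbers come from the script.

```python
def reparent(parents_of, selection):
    """Give the last commit in `selection` the earlier ones as its parents."""
    *new_parents, child = selection
    parents_of[child] = list(new_parents)
    return parents_of


# Map from legacy Subversion revision ID to parent revision IDs.
# The initial links are hypothetical: 27860 starts out joined at the
# wrong place.
history = {
    "27855": ["27854"],
    "27859": ["27858"],
    "27860": ["27859"],
}

reparent(history, ["27855", "27860"])
```

After the call, 27860 hangs off 27855 and nothing else in the graph has moved, which is the whole repair: one misplaced branch join corrected.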
149+
150+
The GCC conversion was pretty hairy - 343 lines of DSL scripting - but
151+
there are whole new levels of complexity when, as still sometimes
152+
happens, you need to recover history from pre-version-controlled
153+
sources to stitch the repository together.
154+
155+
In [one extreme case](http://esr.ibiblio.org/?p=2491), I ended up
156+
stitching together material from 18 different release tarballs, 11
157+
unreleased snapshot tarballs, one release tarball I could reconstruct,
158+
one release tarball mined out of an obsolete Red Hat source RPM, two
159+
shar archives, a pax archive, five published patches, two zip files, a
160+
darcs archive, and a partial RCS history.
161+
162+
But reposurgeon can handle this, because it makes conversion
163+
experiments easy. The workflow it's designed for is carefully building
164+
a script that assembles your source repository and other data into a
165+
simulacrum of what a Git repository tracking your project from the
166+
beginning of time would have looked like.
167+
168+
Almost never will you get this right the first time. It takes testing,
169+
polishing, tripping over assumptions you didn't know you and your
170+
tools were making, and correcting for those assumptions. In the GCC
171+
case it took many hours of work to locate and develop fixes for the
172+
misplaced branch joins.
173+
174+
A subtle but important point is that I didn't do that work
175+
myself. That kind of thing is not a job for reposurgeon's maintainer,
176+
it's a job for a "Mr. Inside" who knows the project's history
177+
intimately - in this case it was actually the GCC project lead, Joseph
178+
Myers. One of reposurgeon's requirements is that it has to be a tool
179+
that a "Mr. Inside" can learn to use with minimum friction.
180+
181+
And generally it is, if you're being driven to it by the kind of
182+
problem it was designed to solve - it's like gdb that way. I've been
183+
taken to task about the tool having no intro documentation; this is
184+
not because I'm lazy, it's because there's
185+
[no plausible way to write any](http://esr.ibiblio.org/?p=8551), any
186+
more than there is for gdb. You're ready to learn reposurgeon, as
187+
Joseph Myers did, when you're stuck into a conversion or editing
188+
problem so deep that the *very* complete
189+
[reposurgeon command reference](http://www.catb.org/~esr/reposurgeon/reposurgeon.html)
190+
starts to make sense to you.
191+
192+
You can find more about conversions with reposurgeon
193+
[here](http://www.catb.org/~esr/reposurgeon/dvcs-migration-guide.html).
194+
18195
## Discussions
19196

20197
<!---
@@ -54,4 +231,4 @@ Christian Couder &lt;<[email protected]>&gt;,
54231
Jakub Narębski &lt;<[email protected]>&gt;,
55232
Markus Jansen &lt;<[email protected]>&gt; and
56233
Kaartic Sivaraam &lt;<[email protected]>&gt;
57-
with help from XXX.
234+
with help from Eric S. Raymond.
