On 8/3/07, Michael Haggerty <mhagger alum.mit.edu> wrote:
> Jon Smirl wrote:
> > How big was the dump file? A fully exploded Mozilla CVS is hundreds of
> > gigabytes. It takes a long time to do hundreds of gigabytes of IO.
>
> No, it was more like 30 GB, maybe 50 GB (I didn't write the number
> down). Note that the dump file eliminates some redundancy; for example,
> a file copy is simply an annotation and doesn't require the file's full
> text to be written out. I've been using dumpfile output for my tests
> simply because it is faster than piping the output into "svnadmin load",
> making my cvs2svn benchmarks complete faster (I'm mostly interested in
> relative numbers). But you are right--I should be a little bit less
> lazy and write a --no-output option to use just for testing.
>
> > I was doing a full import including all the branches, not trunk only.
> > git-fast-import totally avoids writing fully exploded versions anywhere
> > but across the pipe between the two processes.
> >
> > The process worked like this:
> > 1) parse the CVS file and start reconstructing versions from it
> > 2) feed these versions to git-fast-import
> > 3) git-fast-import accepts the versions, diffs them, and writes the
> > diffs straight into a pack file.
> > 4) cvs2svn figures out all of the changesets
> > 5) send the changeset data to git-fast-import, which uses the info to
> > make commit trees.
> >
> > Disk IO was totally minimized.
> > Each CVS file is only scanned once.
> > The packfile written by git-fast-import is all sequential writes.
>
> I can't say what happens in git-fast-import, but the cvs2svn side in the
> current code is exactly as you describe when outputting to git.
> Except...see below.
>
> But when outputting to SVN, the procedure is necessarily different
> because the text is needed in OutputPass, not in CollectRevsPass:
>
> 1. Parse the CVS file and write deltas to a database along with file
> dependency tree information. Invert the deltas on trunk, and also write
> the fulltext of revision 1.1 to the database.
>
> 2. In FilterSymbolsPass, compute how many times each revision will be
> needed after cvs2svn has pruned off excluded branches and unneeded
> revisions.
>
> 3. In OutputPass, retrieve revision N by retrieving revision N-1 then
> applying the diff. If revision N will be needed again (for example as
> the base revision for another diff), store its fulltext to a database.
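
For anyone following along, steps 2 and 3 of the quoted procedure can be
sketched roughly as below. This is only a toy model and not cvs2svn's
actual code: the class name RevisionStore, the in-memory dict standing in
for the fulltext database, and the use of plain functions as "deltas" are
all assumptions made for illustration.

```python
class RevisionStore:
    """Toy model: rebuild revision N from N-1, caching by remaining use count."""

    def __init__(self, fulltext_1, deltas, use_counts):
        # deltas[n] is a function mapping the text of revision n-1 to revision n
        # (hypothetical stand-in for an RCS delta).
        self.deltas = deltas
        # remaining[n] = how many more times revision n's fulltext is needed,
        # counting both output and use as the base of another delta (step 2).
        self.remaining = dict(use_counts)
        # Stand-in for the fulltext database; revision 1 starts out stored.
        self.cache = {1: fulltext_1}

    def get(self, n):
        if n in self.cache:
            text = self.cache[n]
        else:
            # Step 3: retrieve revision n-1, then apply the delta.
            text = self.deltas[n](self.get(n - 1))
        self.remaining[n] -= 1
        if self.remaining[n] > 0:
            self.cache[n] = text      # needed again, keep the fulltext
        else:
            self.cache.pop(n, None)   # last use, drop it
        return text


deltas = {2: lambda t: t + "b\n", 3: lambda t: t + "c\n"}
use_counts = {1: 2, 2: 2, 3: 1}  # output once, plus once as a delta base
store = RevisionStore("a\n", deltas, use_counts)
texts = [store.get(n) for n in (1, 2, 3)]
```

Once the last revision has been retrieved, every use count has reached
zero and the cache is empty, which is the point of counting uses up front.
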

The svn database format forces the revisions to be written into the
file in changeset order, right? There is no indirection layer like git
has, since git uses the digest as an index key. The decoupling of the
commit tree and the actual revisions allows much faster imports. git's
trick is that the indirection allows the deltas to be stored in any
order.

If there is no indirection layer it may be faster to reread the CVS
files to get the revisions. Rereading the files will lower the
pressure against the disk cache a lot. Writing the revisions to a
database guarantees that the disk cache will get flushed. Are you IO
or compute bound?

I think you said once that there was an indirection layer inside the
SVN database but none of the external tools provided access to it. git
was the same way; fastimport was specifically written to get at the
indirection.

How hard would it be to write a tool that allows the versions to be
recorded in file order instead of changeset order?
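
To make the indirection idea concrete, here is a minimal sketch of a
digest-keyed store where revisions are written in file order and the
commit trees are assembled afterwards in changeset order. It is an
illustration only: the names objects, put, file_revs and changesets are
made up, the dict stands in for a pack file, and it hashes the raw
content with SHA-1 rather than using git's real object format.

```python
import hashlib

objects = {}  # digest -> content; toy stand-in for a pack file


def put(content: bytes) -> str:
    """Store content under its digest and return the digest."""
    digest = hashlib.sha1(content).hexdigest()
    objects[digest] = content
    return digest


# Pass 1: write every revision in file order, one file at a time.
file_revs = {"a.c": [b"a1", b"a2"], "b.c": [b"b1"]}
digests = {
    (path, i): put(text)
    for path, revs in file_revs.items()
    for i, text in enumerate(revs)
}

# Pass 2: build trees in changeset order, referring to revisions by digest only.
changesets = [[("a.c", 0), ("b.c", 0)], [("a.c", 1)]]
tree = {}
commits = []
for changeset in changesets:
    for path, i in changeset:
        tree[path] = digests[(path, i)]
    commits.append(dict(tree))  # snapshot of path -> digest
```

The second pass never touches file contents, so the order in which the
revisions were written does not matter to it.
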
>
> > Compute and IO pipelining was achieved by computing the sha in cvs2git
> > and storing it along with the revision info. When the changeset was
> > computed cvs2git could supply fastimport with the shas needed to
> > build the tree objects. This allowed everything to be sent into
> > fastimport in a fire-and-forget mode allowing asynchronous
> > processing. Asynchronous processes were important to disk and compute
> > scheduling efficiency.
>
> I don't understand the need to compute digests in cvs2git. I've been
> using git-fast-import's marks to correlate the blobs written in
> CollectRevsPass with the changesets output in OutputPass. Is there an
> advantage to using digests instead of marks?

It is probably 0.1% faster. It was just the original way we started
doing things; Shawn didn't want to keep a table mapping marks to
digests. For Mozilla the table ends up being 200MB.
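
For comparison, the two ways of referring to a blob look roughly like
this on the fast-import stream. This is a hedged sketch, not the code
either tool uses: the helper names git_blob_sha1, blob_command and
filemodify are invented here, and the file mode 100644 and the path
README are just example values. A digest reference assumes the blob is
already known to the importer.

```python
import hashlib


def git_blob_sha1(data: bytes) -> str:
    """Digest git assigns to a blob: SHA-1 over 'blob <len>\\0' plus the data."""
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()


def blob_command(data: bytes, mark: int) -> bytes:
    """fast-import 'blob' command that labels the blob with a mark."""
    return b"blob\nmark :%d\ndata %d\n" % (mark, len(data)) + data + b"\n"


def filemodify(dataref: str, path: str) -> bytes:
    """fast-import 'M' line; dataref is either ':<mark>' or a 40-hex digest."""
    return ("M 100644 %s %s\n" % (dataref, path)).encode()


data = b"hello\n"
stream_blob = blob_command(data, 1)

# With marks, the importer keeps the mark -> digest table.
by_mark = filemodify(":1", "README")

# With digests, the exporter computes and remembers the digest itself.
by_digest = filemodify(git_blob_sha1(data), "README")
```

Either way the commit only carries a short reference; the difference is
which side of the pipe has to hold the lookup table.
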
>
> Michael
>
--
Jon Smirl
jonsmirl gmail.com