Thread: Re: cvs2svn conversion directly to git ready for experimentation




Re: cvs2svn conversion directly to git ready for experimentation
user name
2007-08-03 18:34:26
Jon Smirl wrote:
> How big was the dump file? A fully exploded Mozilla CVS is hundreds of
> gigabytes. It takes a long time to do hundreds of gigabytes of IO.

No, it was more like 30 GB, maybe 50 GB (I didn't write the number
down).  Note that the dump file eliminates some redundancy; for example,
a file copy is simply an annotation and doesn't require the file's full
text to be written out.  I've been using dumpfile output for my tests
simply because it is faster than piping the output into "svnadmin load",
making my cvs2svn benchmarks complete faster (I'm mostly interested in
relative numbers).  But you are right--I should be a little bit less
lazy and write a --no-output option to use just for testing.

> I was doing a full import including all the branches, not trunk only.
> git-fast-import totally avoids writing fully exploded versions anywhere
> but across the pipe between the two processes.
> 
> The process worked like this:
> 1) parse the cvs file and start reconstructing versions from it
> 2) feed these versions to git-fastimport
> 3) git-fastimport accepts the versions, diffs them, and writes the
> diffs straight into a pack file.
> 4) cvs2svn figured out all of the changesets
> 5) send the changeset data to git-fastimport which uses the info to
> make commit trees.
> 
> disk io was totally minimized.
> Each cvs file is only scanned once.
> the packfile written by git-fastimport is all sequential writes.

I can't say what happens in git-fast-import, but the cvs2svn side in the
current code is exactly as you describe when outputting to git.
Except...see below.

But when outputting to SVN, the procedure is necessarily different
because the text is needed in OutputPass, not in CollectRevsPass:

1. Parse the CVS file and write deltas to a database along with file
dependency tree information.  Invert the deltas on trunk, and also write
the fulltext of revision 1.1 to the database.

2. In FilterSymbolsPass, compute how many times each revision will be
needed after cvs2svn has pruned off excluded branches and unneeded
revisions.

3. In OutputPass, retrieve revision N by retrieving revision N-1 then
applying the diff.  If revision N will be needed again (for example as
the base revision for another diff), store its fulltext to a database.
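
The three steps above can be sketched in Python roughly as follows.
This is only an illustration, not the cvs2svn code: the class name
RevisionReader, the in-memory dicts standing in for the databases, and
the simplified RCS-style "a"/"d" delta commands are all assumptions
made for the example.  Use counts include both the output of a
revision and each time it serves as the base for another delta.

```python
def apply_rcs_delta(lines, delta):
    """Apply a simplified RCS-style delta to a list of lines.

    `delta` is a list of strings: "dN C" deletes C lines starting at
    1-based line N; "aN C" adds the next C entries after line N.
    Line numbers refer to the base text.
    """
    out = []
    pos = 0  # index in `lines` of the next line not yet copied
    i = 0
    while i < len(delta):
        cmd = delta[i]
        i += 1
        start, count = (int(x) for x in cmd[1:].split())
        if cmd[0] == "d":
            out.extend(lines[pos:start - 1])
            pos = start - 1 + count
        elif cmd[0] == "a":
            out.extend(lines[pos:start])
            pos = start
            out.extend(delta[i:i + count])
            i += count
        else:
            raise ValueError("unknown delta command: %r" % cmd)
    out.extend(lines[pos:])
    return out


class RevisionReader:
    """Hypothetical helper: rebuild fulltexts, caching only what is reused."""

    def __init__(self, root_text, deltas, parents, use_counts):
        self.root_text = root_text          # fulltext of revision 1.1
        self.deltas = deltas                # rev -> delta against its parent
        self.parents = parents              # rev -> parent rev (None for root)
        self.remaining = dict(use_counts)   # rev -> uses still outstanding
        self.cache = {}                     # rev -> stored fulltext

    def checkout(self, rev):
        text = self.cache.get(rev)
        if text is None:
            parent = self.parents[rev]
            if parent is None:
                text = self.root_text
            else:
                text = apply_rcs_delta(self.checkout(parent), self.deltas[rev])
        # Keep the fulltext only while somebody will still ask for it.
        self.remaining[rev] -= 1
        if self.remaining[rev] > 0:
            self.cache[rev] = text
        else:
            self.cache.pop(rev, None)
        return text


# Toy history: 1.1 -> 1.2 -> 1.3, with branch revision 1.2.2.1 off 1.2.
root = ["a", "b", "c"]
deltas = {
    "1.2": ["d2 1", "a2 1", "B"],
    "1.3": ["a3 1", "d"],
    "1.2.2.1": ["d1 1"],
}
parents = {"1.1": None, "1.2": "1.1", "1.3": "1.2", "1.2.2.1": "1.2"}
use_counts = {"1.1": 2, "1.2": 3, "1.3": 1, "1.2.2.1": 1}

reader = RevisionReader(root, deltas, parents, use_counts)
texts = {rev: reader.checkout(rev) for rev in ["1.1", "1.2", "1.3", "1.2.2.1"]}
```

With correct use counts the cache is empty again once the last
revision has been output, which is the point of computing the counts
in an earlier pass.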

> Compute and IO pipelining was achieved by computing the sha in cvs2git
> and storing it along with the revision info. When the change set was
> computed cvs2git could supply fastimport with the sha's needed to
> build the tree objects. This allowed everything to be sent into
> fastimport in a fire and forget mode allowing asynchronous
> processing. Asynchronous processes were important to disk and compute
> scheduling efficiency.

I don't understand the need to compute digests in cvs2git.  I've been
using git-fast-import's marks to correlate the blobs written in
CollectRevsPass with the changesets output in OutputPass.  Is there an
advantage to using digests instead of marks?
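
For readers unfamiliar with marks, here is a minimal Python sketch of
a git-fast-import stream in which a blob is written first and a later
commit refers to it by mark.  The helper names write_blob and
write_commit, the committer identity, and the file contents are
invented for the example; only the stream syntax (blob, mark, data,
commit, committer, from, M) comes from git-fast-import itself.

```python
import io


def write_blob(out, mark, data):
    """Emit one blob and label it with a mark (an integer chosen by us)."""
    out.write(b"blob\nmark :%d\ndata %d\n" % (mark, len(data)))
    out.write(data)
    out.write(b"\n")


def write_commit(out, mark, ref, committer, when, message, files, parent=None):
    """Emit one commit whose files refer to earlier blobs by mark."""
    msg = message.encode("utf-8")
    out.write(b"commit %s\n" % ref.encode("utf-8"))
    out.write(b"mark :%d\n" % mark)
    out.write(b"committer %s %d +0000\n" % (committer.encode("utf-8"), when))
    out.write(b"data %d\n" % len(msg))
    out.write(msg)
    out.write(b"\n")
    if parent is not None:
        out.write(b"from :%d\n" % parent)
    for path, blob_mark in files:
        out.write(b"M 100644 :%d %s\n" % (blob_mark, path.encode("utf-8")))
    out.write(b"\n")


stream = io.BytesIO()
# First phase: blobs are written as each file revision is reconstructed.
write_blob(stream, 1, b"hello\n")
# Later phase: the changeset names the blob only by its mark.
write_commit(
    stream,
    mark=2,
    ref="refs/heads/master",
    committer="Example <example@example.invalid>",
    when=1186166066,
    message="Initial import\n",
    files=[("README", 1)],
)
```

The bytes in stream.getvalue() could be piped into "git fast-import"
in an empty repository.  The front end never needs to know a digest;
the importer resolves each mark to the object it stored earlier.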

Michael

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribecvs2svn.tigris.org
For additional commands, e-mail: users-helpcvs2svn.tigris.org


Re: cvs2svn conversion directly to git ready for experimentation
user name
2007-08-03 18:57:07
On 8/3/07, Michael Haggerty <mhaggeralum.mit.edu> wrote:
> Jon Smirl wrote:
> > How big was the dump file? A fully exploded Mozilla CVS is hundreds of
> > gigabytes. It takes a long time to do hundreds of gigabytes of IO.
>
> No, it was more like 30 GB, maybe 50 GB (I didn't write the number
> down).  Note that the dump file eliminates some redundancy; for example,
> a file copy is simply an annotation and doesn't require the file's full
> text to be written out.  I've been using dumpfile output for my tests
> simply because it is faster than piping the output into "svnadmin load",
> making my cvs2svn benchmarks complete faster (I'm mostly interested in
> relative numbers).  But you are right--I should be a little bit less
> lazy and write a --no-output option to use just for testing.
>
> > I was doing a full import including all the branches, not trunk only.
> > git-fast-import totally avoids writing fully exploded versions anywhere
> > but across the pipe between the two processes.
> >
> > The process worked like this:
> > 1) parse the cvs file and start reconstructing versions from it
> > 2) feed these versions to git-fastimport
> > 3) git-fastimport accepts the versions, diffs them, and writes the
> > diffs straight into a pack file.
> > 4) cvs2svn figured out all of the changesets
> > 5) send the changeset data to git-fastimport which uses the info to
> > make commit trees.
> >
> > disk io was totally minimized.
> > Each cvs file is only scanned once.
> > the packfile written by git-fastimport is all sequential writes.
>
> I can't say what happens in git-fast-import, but the cvs2svn side in the
> current code is exactly as you describe when outputting to git.
> Except...see below.
>
> But when outputting to SVN, the procedure is necessarily different
> because the text is needed in OutputPass, not in CollectRevsPass:
>
> 1. Parse the CVS file and write deltas to a database along with file
> dependency tree information.  Invert the deltas on trunk, and also write
> the fulltext of revision 1.1 to the database.
>
> 2. In FilterSymbolsPass, compute how many times each revision will be
> needed after cvs2svn has pruned off excluded branches and unneeded
> revisions.
>
> 3. In OutputPass, retrieve revision N by retrieving revision N-1 then
> applying the diff.  If revision N will be needed again (for example as
> the base revision for another diff), store its fulltext to a database.

The svn database format forces the revisions to be written into the
file in changeset order, right? There is no indirection layer like git
has since it uses the digest as an index key. The decoupling of the
commit tree and the actual revisions allows much faster imports. git's
trick is that the indirection allows the deltas to be stored in any
order.

If there is no indirection layer it may be faster to reread the CVS
files to get the revisions. Rereading the files will lower the
pressure against the disk cache a lot. Writing the revisions to a
database guarantees that the disk cache will get flushed. Are you IO
or compute bound?

I think you said once that there was an indirection layer inside the
SVN database but none of the external tools provided access to it. git
was the same way; fastimport was specifically written to get at the
indirection.

How hard would it be to write a tool that allows the versions to be
recorded in file order instead of change set order?


>
> > Compute and IO pipelining was achieved by computing the sha in cvs2git
> > and storing it along with the revision info. When the change set was
> > computed cvs2git could supply fastimport with the sha's needed to
> > build the tree objects. This allowed everything to be sent into
> > fastimport in a fire and forget mode allowing asynchronous
> > processing. Asynchronous processes were important to disk and compute
> > scheduling efficiency.
>
> I don't understand the need to compute digests in cvs2git.  I've been
> using git-fast-import's marks to correlate the blobs written in
> CollectRevsPass with the changesets output in OutputPass.  Is there an
> advantage to using digests instead of marks?

It is probably 0.1% faster. It was just the original way we started
doing things; Shawn didn't want to keep a table mapping marks to
digests. For Mozilla the table ends up being 200MB.
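
As a rough illustration of the digest approach (not the original
cvs2git code), the front end can compute git's blob id itself and
name the blob by that id in the commit, so the importer needs no
mark-to-digest table.  The function names below are made up for the
example; the blob id is the SHA-1 of a "blob <length>" header, a NUL
byte, and the content, and git-fast-import accepts a full 40-character
SHA-1 in place of a mark in an "M" line.

```python
import hashlib


def git_blob_id(data):
    """Return the hex SHA-1 that git assigns to a blob with this content."""
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()


def filemodify_line(path, data):
    """Build an "M" line that refers to the blob by digest, not by mark."""
    return "M 100644 %s %s\n" % (git_blob_id(data), path)


# The digest is known as soon as the revision text is reconstructed,
# so it can be stored with the revision info and reused when the
# changeset is emitted later.
content = b"hello world\n"
blob_id = git_blob_id(content)
line = filemodify_line("README", content)
```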

>
> Michael
>


-- 
Jon Smirl
jonsmirlgmail.com


