
Thread: Re: cvs2svn conversion directly to git ready for experimentation




2007-08-03 17:41:05
Jon Smirl wrote:
> On 8/3/07, Michael Haggerty <mhaggeralum.mit.edu> wrote:
>> Jon Smirl wrote:
>>> On 8/3/07, Michael Haggerty <mhaggeralum.mit.edu> wrote:
>>>> Jon, I would like very much to hear how you propose to get a 60-fold
>>>> speed increase in cvs2svn.  I've never heard of any plausible way to
>>>> accomplish anything even close to this.
>>>>
>>>> Please note that the user wants to convert to Subversion, not git.  But
>>>> even converting to git, I don't think that such speeds are possible
>>>> without massive changes that would include processing everything in RAM
>>>> and switching large parts of cvs2svn from Python to a compiled language.
>>> Make a bulk importer for SVN like git-fast-import. I measured some SVN
>>> imports and the bulk of the time was spent forking off SVN. Before
>>> git-fast-import it would have taken git two weeks to import Mozilla
>>> CVS.
>> I'm curious about your measurements.  Were these measurements using
>> cvs2svn with the "-s" option (output directly to SVN repository)?  With
>> this option, cvs2svn uses "svnadmin load", which is only invoked once
>> for all commits.  I don't understand why "svnadmin load" would be
>> spawning SVN processes.
> 
> It's been six months (or more) since I worked with this. I don't
> remember the exact results. It's easy enough to reproduce. Just build
> a kernel with oprofile enabled and everything else is very simple. I
> do remember that the cvs2svn python code was taking about 10% of the
> time and other things were taking 90%.
>
> You need the kernel oprofile support turned on. In the cases I looked
> at the bulk of the time was spent in the kernel. If the top line of
> the kernel oprofile is copy_page_range, fork is the source of the
> problem.

Thanks for the tip.
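
The difference under discussion (a loader process forked once per item versus a single loader fed every item on stdin, as "svnadmin load" is used with "-s") can be sketched roughly as follows. This is only an illustration: CHILD is a hypothetical stand-in that upper-cases its input, the function names are made up, and none of it is cvs2svn or Subversion code.

```python
import subprocess
import sys

# Hypothetical stand-in for a loader: a child that upper-cases each input line.
CHILD = "import sys\nfor line in sys.stdin:\n    sys.stdout.write(line.upper())\n"


def load_one_process_per_item(items):
    """Spawn a fresh child for every item: one fork/exec per item."""
    results = []
    spawns = 0
    for item in items:
        proc = subprocess.run(
            [sys.executable, "-c", CHILD],
            input=item + "\n",
            capture_output=True,
            text=True,
            check=True,
        )
        spawns += 1
        results.append(proc.stdout.strip())
    return results, spawns


def load_single_stream(items):
    """Spawn one child and stream every item through its stdin."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD],
        input="".join(item + "\n" for item in items),
        capture_output=True,
        text=True,
        check=True,
    )
    return proc.stdout.splitlines(), 1


if __name__ == "__main__":
    items = ["rev%d" % i for i in range(5)]
    print(load_one_process_per_item(items))
    print(load_single_stream(items))
```

Both functions produce the same results; only the number of process creations differs, and that is the cost a kernel profile dominated by copy_page_range would point at.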

> I never had trouble with the speed of the cvs2svn python code. All of
> the performance problems I looked at were in other parts of the
> process.

The Python code is taking up more time now to handle the dependency
graph processing.  I haven't done any exact measurements lately, though.

> Unless you've changed things, cvs2svn is still forking off cvs to parse
> the input files. That's an O(n*n/2)*big K process versus parsing them
> internally, O(n). For millions of revisions this difference is very
> obvious.

We *have* changed things, as I mentioned in another part of this thread.
That's what --use-internal-co is all about.
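
To make the quadratic-versus-linear estimate concrete, here is a small counting sketch. It assumes a deliberately simplified model of an RCS file: a single trunk with no branches, the head revision stored in full, and each older revision reachable by applying one more reverse delta. The function names are hypothetical, and the model ignores the fixed per-process cost (the "big K" above).

```python
def delta_applications_separate_checkouts(n):
    """Delta applications needed to check out all n revisions one at a time.

    A revision that is d steps behind the head needs d reverse deltas,
    so the total is 0 + 1 + ... + (n - 1) = n * (n - 1) / 2.
    """
    return n * (n - 1) // 2


def delta_applications_single_walk(n):
    """Delta applications needed when walking the chain once, head to root.

    Each of the n - 1 deltas is applied exactly once.
    """
    return max(n - 1, 0)


if __name__ == "__main__":
    for n in (10, 1000, 1000000):
        print(
            n,
            delta_applications_separate_checkouts(n),
            delta_applications_single_walk(n),
        )
```

Under this model the separate-checkout total grows as roughly n*n/2, while the single walk grows linearly in n.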

>> (NB: git-fast-import has the advantage of dumping raw data to disk and
>> not having to worry about computing diffs etc. right away, because one
>> can do a "git repack" later.  The "git repack" is really part of the
>> import process too, but it does not contribute to the repository
>> downtime because it can be run while the repository is live.  An SVN
>> bulk importer wouldn't have that luxury.)
> 
> git-fast-import is computing the diffs, and the repository it makes is
> a usable one; it's just not packed as efficiently as possible. The
> repository built from git-fast-import on Mozilla CVS was about 700MB. If
> it wasn't computing the diffs it would have been 20GB.
>
> Running git-repack compressed the 700MB more efficiently and turned it
> into 400MB. It's smaller because git-repack can select from many
> versions to compute the smallest diff, whereas git-fast-import always
> computes the diff from the previous version.

OK, I had misunderstood this point.  Thanks for the clarification.
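
A toy model of why picking the delta base from several earlier versions can beat always using the immediately preceding one. The cost function below (characters of the target that cannot be copied from the base, as found by Python's difflib) and the "window" parameter are assumptions made for illustration only; git's actual delta format and base-selection heuristics are different.

```python
import difflib


def delta_cost(base, target):
    """Toy delta size: characters of target not copied from base."""
    matcher = difflib.SequenceMatcher(None, base, target, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return len(target) - matched


def cost_against_previous(versions):
    """Store the first version in full, then delta each one against its predecessor."""
    total = len(versions[0])
    for prev, cur in zip(versions, versions[1:]):
        total += delta_cost(prev, cur)
    return total


def cost_against_best_of_window(versions, window=10):
    """Store the first version in full, then pick the cheapest base
    among up to `window` earlier versions for each later one."""
    total = len(versions[0])
    for i in range(1, len(versions)):
        candidates = versions[max(0, i - window):i]
        total += min(delta_cost(base, versions[i]) for base in candidates)
    return total


if __name__ == "__main__":
    # The third version reverts the change made in the second.
    history = ["alpha beta gamma", "alpha XXXX gamma", "alpha beta gamma"]
    print(cost_against_previous(history))
    print(cost_against_best_of_window(history))
```

Because the immediately preceding version is always among the candidates, the windowed total can never exceed the previous-only total in this model; it is strictly smaller whenever some older version is a closer match, as with the reverted change in the example.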

Michael


