Jon Smirl wrote:
> On 8/3/07, Michael Haggerty <mhagger alum.mit.edu> wrote:
>> Jon Smirl wrote:
>>> On 8/3/07, Michael Haggerty <mhagger alum.mit.edu> wrote:
>>>> Jon, I would like very much to hear how you propose to get a 60-fold
>>>> speed increase in cvs2svn. I've never heard of any plausible way to
>>>> accomplish anything even close to this.
>>>>
>>>> Please note that the user wants to convert to Subversion, not git. But
>>>> even converting to git, I don't think that such speeds are possible
>>>> without massive changes that would include processing everything in RAM
>>>> and switching large parts of cvs2svn from Python to a compiled language.
>>> Make a bulk importer for SVN like git-fast-import. I measured some SVN
>>> imports and the bulk of the time was spent forking off SVN. Before
>>> git-fast-import it would have taken git two weeks to import Mozilla
>>> CVS.
>> I'm curious about your measurements. Were these measurements using
>> cvs2svn with the "-s" option (output directly to SVN repository)? With
>> this option, cvs2svn uses "svnadmin load", which is only invoked once
>> for all commits. I don't understand why "svnadmin load" would be
>> spawning SVN processes.
>
> It's been six months (or more) since I worked with this. I don't
> remember the exact results. It's easy enough to reproduce. Just build
> a kernel with oprofile enabled and everything else is very simple. I
> do remember that the cvs2svn python code was taking about 10% of the
> time and other things were taking 90%.
>
> You need the kernel oprofile support turned on. In the cases I looked
> at, the bulk of the time was spent in the kernel. If the top line of
> the kernel oprofile is copy_page_range, fork is the source of the
> problem.
Thanks for the tip.
> I never had trouble with the speed of the cvs2svn python code. All of
> the performance problems I looked at were in other parts of the
> process.
The python code is taking up more time now to handle the dependency
graph processing. I haven't done any exact measurements lately, though.
> Unless you've changed things, cvs2svn is still forking off cvs to parse
> the input files. That's an O(n*n/2) * big K process versus parsing them
> internally, which is O(n). For millions of revisions this difference is
> very obvious.
We *have* changed things, as I mentioned in another part of this thread.
That's what --use-internal-co is all about.
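
To make the O(n*n/2) versus O(n) point concrete, here is a toy cost model. It is only a sketch under stated assumptions, not cvs2svn's actual code: it assumes a single linear chain of n revisions stored as deltas from the head, counts delta applications only, and ignores branches and the per-process startup cost (the "big K" above). The function names are made up for illustration.

```python
def deltas_applied_per_checkout(n_revisions):
    """Cost when every revision is checked out independently.

    Assumption: the revision at distance d from the head needs d delta
    applications, and each checkout starts again from the head text.
    """
    return sum(range(n_revisions))  # 0 + 1 + ... + (n-1) = n*(n-1)/2


def deltas_applied_single_pass(n_revisions):
    """Cost when the delta chain is walked once, emitting each revision."""
    return max(n_revisions - 1, 0)


if __name__ == "__main__":
    for n in (10, 1000, 100000):
        print(n, deltas_applied_per_checkout(n), deltas_applied_single_pass(n))
```

Under these assumptions the per-checkout count grows quadratically while the single-pass count grows linearly, which is the gap described in the quoted paragraph.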
>> (NB: git-fast-import has the advantage of dumping raw data to disk and
>> not having to worry about computing diffs etc. right away, because one
>> can do a "git repack" later. The "git repack" is really part of the
>> import process too, but it does not contribute to the repository
>> downtime because it can be run while the repository is live. An SVN
>> bulk importer wouldn't have that luxury.)
>
> git-fast-import is computing the diffs, and the repository it makes is
> a usable one; it's just not packed as efficiently as possible. The
> repository built from git-fast-import on Mozilla CVS was about 700MB. If
> it wasn't computing the diffs it would have been 20GB.
>
> Running git-repack compressed the 700MB more efficiently and turned it
> into 400MB. It's smaller because git-repack can select from many
> versions to compute the smallest diff, whereas git-fast-import always
> computes the diff from the previous version.
OK, I had misunderstood this point. Thanks for the clarification.
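
The difference between the two delta strategies can be sketched with a toy model. This is an illustration under loud assumptions, not git's real delta code: the "delta cost" here is just the number of target lines that difflib cannot match in the base, and the function names and the `window` parameter are hypothetical.

```python
from difflib import SequenceMatcher


def delta_cost(base_lines, target_lines):
    """Toy delta size: target lines that are not matched in the base."""
    matcher = SequenceMatcher(None, base_lines, target_lines, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return len(target_lines) - matched


def previous_only_cost(versions):
    """Always diff against the immediately preceding version."""
    total = len(versions[0])  # first version is stored in full
    for prev, cur in zip(versions, versions[1:]):
        total += delta_cost(prev, cur)
    return total


def best_in_window_cost(versions, window=10):
    """Diff against whichever earlier version in the window is cheapest."""
    total = len(versions[0])  # first version is stored in full
    for i in range(1, len(versions)):
        candidates = versions[max(0, i - window):i]
        total += min(delta_cost(base, versions[i]) for base in candidates)
    return total


if __name__ == "__main__":
    # The third version reverts to the first, so an older base is cheaper.
    history = [
        ["alpha", "beta", "gamma"],
        ["alpha", "BETA CHANGED", "gamma", "delta"],
        ["alpha", "beta", "gamma"],
    ]
    print(previous_only_cost(history), best_in_window_cost(history))
```

Since the previous version is always among the candidates, the windowed strategy can never cost more than the previous-only one in this model; it wins whenever an older version happens to be a closer match, as with the revert in the sample history.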
Michael
---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe cvs2svn.tigris.org
For additional commands, e-mail: users-help cvs2svn.tigris.org