On 8/3/07, Michael Haggerty <mhagger alum.mit.edu> wrote:
> Jon Smirl wrote:
> > On 8/3/07, Michael Haggerty <mhagger alum.mit.edu> wrote:
> >> Jon, I would like very much to hear how you propose to get a 60-fold
> >> speed increase in cvs2svn. I've never heard of any plausible way to
> >> accomplish anything even close to this.
> >>
> >> Please note that the user wants to convert to Subversion, not git. But
> >> even converting to git, I don't think that such speeds are possible
> >> without massive changes that would include processing everything in RAM
> >> and switching large parts of cvs2svn from Python to a compiled language.
> >
> > Make a bulk importer for SVN like git-fast-import. I measured some SVN
> > imports and the bulk of the time was spent forking off SVN. Before
> > git-fast-import it would have taken git two weeks to import Mozilla
> > CVS.
>
> I'm curious about your measurements. Were these measurements using
> cvs2svn with the "-s" option (output directly to SVN repository)? With
> this option, cvs2svn uses "svnadmin load", which is only invoked once
> for all commits. I don't understand why "svnadmin load" would be
> spawning SVN processes.

It's been six months (or more) since I worked with this, and I don't
remember the exact results, but they're easy enough to reproduce: just
build a kernel with oprofile enabled and everything else is very simple.
I do remember that the cvs2svn Python code was taking about 10% of the
time and other things were taking the other 90%.

You need the kernel oprofile support turned on. In the cases I looked
at, the bulk of the time was spent in the kernel. If the top line of
the kernel oprofile is copy_page_range, fork is the source of the
problem.
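
A rough way to see the cost of process creation without a profiler is
to push the same items through one long-lived child process and through
one fresh child per item. The sketch below does that; the child commands
are hypothetical stand-ins (they only upper-case a line) and have
nothing to do with what cvs2svn or SVN actually run. It also measures
whole process startup (fork, exec, interpreter start), not
copy_page_range specifically, so treat the numbers as an illustration
only.

```python
import subprocess
import sys
import time

# Hypothetical stand-ins for "process one revision": upper-case the input.
CHILD_ONE = "import sys; sys.stdout.write(sys.stdin.read().upper())"
CHILD_ALL = "import sys\nfor line in sys.stdin:\n    sys.stdout.write(line.upper())"


def spawn_per_item(items):
    """Start a fresh child process for every item."""
    out = []
    for item in items:
        result = subprocess.run(
            [sys.executable, "-c", CHILD_ONE],
            input=item + "\n", capture_output=True, text=True, check=True,
        )
        out.append(result.stdout.strip())
    return out


def spawn_once(items):
    """Start one child process and stream every item through it."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_ALL],
        input="".join(item + "\n" for item in items),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()


def timed(fn, items):
    """Return (result, elapsed seconds) for fn(items)."""
    start = time.perf_counter()
    result = fn(items)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    items = ["rev%d" % i for i in range(200)]
    _, per_item = timed(spawn_per_item, items)
    _, once = timed(spawn_once, items)
    print("one process per item: %.2fs" % per_item)
    print("one process total:    %.2fs" % once)
```

Both functions produce the same output; only the number of processes
started differs, which is the variable of interest here.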

I never had trouble with the speed of the cvs2svn Python code. All of
the performance problems I looked at were in other parts of the
process.

Another quick experiment: alter cvs2svn to not output anything and
measure it that way.

Unless you've changed things, cvs2svn is still forking off cvs to parse
the input files. That's an O(n*n/2) process with a big constant K,
versus O(n) for parsing them internally. For millions of revisions this
difference is very obvious.
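
The arithmetic behind that estimate can be sketched under a simplified
model, which is an assumption and not a measurement of cvs2svn: a file's
n revisions are stored as a chain of deltas from one full text, an
external checkout of revision i walks i deltas from the start of the
chain, and an internal parser walks the chain once. The function names
are made up for illustration.

```python
def deltas_applied_per_revision_checkout(n):
    """Every revision is rebuilt from scratch: revision i costs i delta steps."""
    return sum(range(1, n + 1))  # equals n * (n + 1) / 2


def deltas_applied_single_pass(n):
    """The file is parsed once, so each delta is applied exactly once."""
    return n


if __name__ == "__main__":
    for n in (10, 1_000, 1_000_000):
        print(n,
              deltas_applied_per_revision_checkout(n),
              deltas_applied_single_pass(n))
```

This counts delta applications only. The "big K" (the cost of starting
a process for each checkout) multiplies the quadratic side on top of
that.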
>
> "svnadmin load" is conceptually very much like git-fast-import. Its
> input is an svn dumpfile, which is analogous to git-fast-import's import
> format and comparable in terms of verbosity and parseability. (It can
> be fed svndiffs instead of revision fulltexts, but we don't do that yet.
> I don't think it would save much time anyway, as AFAIK Subversion
> recomputes the svndiffs on the "server side".)
>
> To be sure, "svnadmin load" is one or two abstraction layers further
> from the repository than git-fast-import, as it works through the SVN
> libraries. But I think that most of its work is munging the data into
> SVN's not-very-fast repository file format, which would have to be done
> by a bulk importer, anyway.
>
> (NB: git-fast-import has the advantage of dumping raw data to disk and
> not having to worry about computing diffs etc. right away, because one
> can do a "git repack" later. The "git repack" is really part of the
> import process too, but it does not contribute to the repository
> downtime because it can be run while the repository is live. An SVN
> bulk importer wouldn't have that luxury.)

git-fast-import is computing the diffs, and the repository it makes is
a usable one; it's just not packed as efficiently as possible. The
repository built by git-fast-import from Mozilla CVS was about 700MB.
If it wasn't computing the diffs it would have been 20GB.

Running git-repack compressed the 700MB more efficiently and turned it
into 400MB. It's smaller because git-repack can select from many
versions to compute the smallest diff, whereas git-fast-import always
computes the diff from the previous version.
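
To illustrate why choosing among several candidate bases can beat
always using the previous version, here is a small sketch. It does not
use git's delta format: as a stand-in, the "delta size" is the zlib
output size when the base is supplied as a preset dictionary, and the
window of 10 candidates is an arbitrary choice for the example.

```python
import zlib


def delta_size(base, target):
    """Proxy for a delta: size of target compressed with base as a preset dictionary."""
    comp = zlib.compressobj(9, zlib.DEFLATED, zlib.MAX_WBITS,
                            zlib.DEF_MEM_LEVEL, zlib.Z_DEFAULT_STRATEGY, base)
    return len(comp.compress(target) + comp.flush())


def total_previous_only(versions):
    """Delta every version against its immediate predecessor."""
    return sum(delta_size(versions[i - 1], versions[i])
               for i in range(1, len(versions)))


def total_best_in_window(versions, window=10):
    """Delta every version against the best base among the last `window` versions."""
    total = 0
    for i in range(1, len(versions)):
        candidates = versions[max(0, i - window):i]
        total += min(delta_size(base, versions[i]) for base in candidates)
    return total
```

Because the candidate window always includes the immediate predecessor,
the second total can never be larger than the first. It gets smaller
whenever an older version is a closer match, for example when a change
is later reverted.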
>
> Michael
>
--
Jon Smirl
jonsmirl gmail.com