Hi..
Alex:
Undoubtedly, XenSource is the cause of the OProfile problem. We looked
into a patch for that, but there is only one for version 3 and we are
using version 4 (and yes, we are using the "official" XenSource one).
As for the disks, each of the 6 disks is defined as an LVM volume: one
is allocated 80GB for system/log, and the other 5 are allocated fully
(465GB of the 465.5GB) to the databases. Every two databases share a
disk. Nothing else is using the 5 database disks: no other instance
(since none exist) and not Dom0.
Hope that clarifies the disk mapping.
James:
While anything we put on the server complicates things a bit, I don't
believe Xen is really the issue here. If the problem were Xen related,
why would a scheduling problem affect "no recip" in such a consistent
way, even after compacting the databases and moving them around? If
Xapian used DMA, undocumented interrupts or something else out of the
ordinary, I would understand why Xen would be the first thing to look
into, but what makes you think that Xen in the mix can explain the
variation in estimates, the strange performance issues with specific
queries only, and the other strange things we see?
We will certainly try to profile things, and even test without Xen if
we can't profile on it. Again, with this being the only VM instance
running on that machine, there is little scheduling to do and no
competition over IO and other resources. But even if there were, why
would the effect be so constant on the "no recip" search? It doesn't
make much sense unless we are missing something.
We did indeed test things well over 3 times. This is why I picked "no
recip" as a search. It consistently performs badly, even when run a
second or third time right after the first (see debug output).
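
A minimal sketch, not the script used for the numbers in this thread,
of how the 1st, 2nd and 3rd back-to-back executions of a query could be
timed, with mean and standard deviation over many runs. The names
time_repeated_query, run_query and fake_query are hypothetical;
run_query stands in for whatever actually executes the search.

```python
import statistics
import time


def time_repeated_query(run_query, repeats=3, runs=100):
    """Time `repeats` back-to-back executions of run_query, `runs` times over.

    Returns one (mean, stdev) tuple in seconds per position: index 0 is
    the 1st execution within each run, index 1 the 2nd, and so on.
    `runs` must be at least 2 for the standard deviation to be defined.
    """
    samples = [[] for _ in range(repeats)]
    for _ in range(runs):
        for position in range(repeats):
            start = time.perf_counter()
            run_query()  # hypothetical callable that performs the search
            samples[position].append(time.perf_counter() - start)
    return [(statistics.mean(s), statistics.stdev(s)) for s in samples]


if __name__ == "__main__":
    # Stand-in workload (an assumption); replace with the real search.
    def fake_query():
        sum(range(10000))

    results = time_repeated_query(fake_query, repeats=3, runs=20)
    for i, (mean, sd) in enumerate(results, 1):
        print(f"query #{i}: mean={mean * 1000:.3f} ms, sd={sd * 1000:.3f} ms")
```

Note that this does nothing to flush the OS page cache between runs, so
only the very first execution is truly cold.
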
Below are stats from 100 runs:
Chris:
We will try to test it without Xen as well later on. Keep in mind that
to do so we will have to move aside 10 databases of 50GB, reinstall the
machine and move the databases back into place. We would do it first
thing if we believed it was a Xen issue, although if we can't profile
things we might do this anyway (or test it on a different machine).
Olly: Sorry, we removed the old Database10 after compacting it. Since
then we haven't seen the seg fault. We will keep a close eye on it and
contact you as soon as we see such an error again.
Best regards,
Ron.
Alexandre Gauthier wrote:
> Chris Good wrote:
>> Ron Kass wrote:
>>
>>> Not sure what you mean by "other VMs could well be confusing your
>>> results"
>>> We use XenServer on this machine, but we have only one instance
>>> (DomU), and only this instance is running everything locally. So
>>> there are no other VMs to confuse things, and even if there were,
>>> they have nothing to do with the VM we run the test on or with the
>>> test itself.
>>> (Can you clarify what you mean?)
>>>
>>
>> If you have multiple VMs sharing the same hardware then activity on
>> one will obviously affect the performance on other VMs. Since you're
>> running a lone DomU other DomUs aren't going to be competing for
>> resources but it's possible that something in Dom0 is getting
>> swapped in and running.
>>
>> How are you accessing your drives, is DomU accessing the raw devices
>> or is it mapped via virtual files from Dom0?
>>
>> Is it possible to run these tests either directly from Dom0 or even
>> better with a non-xen kernel?
>>
>> Given your current configuration of a single VM xen isn't adding
>> anything so removing it would eliminate any side-effects of it. I
>> also suspect that it would cure your oprofile issue.
>>
>> Chris
>>
>>
> Sorry to intrude, but if I may offer some insight, the Dom0 instance
> in a Xen set-up is just as paravirtualized as a DomU -- it just has
> control access to memory inside DomUs, and offers the drivers back-end
> interfaces. The Dom0 and DomUs both run on top of the Xen kernel.
>
> Also, if he is running a commercial Xen from XenSource, he won't have
> access to the Dom0, which is a custom frankenstein mix of SuSE and
> RHEL with no other purpose but to control the DomUs, a bit like ESX.
>
> The question of the DomU's disk mapping is still valid, and I'd be
> curious to hear the answer. I also think Xen is responsible for the
> oprofile troubles, I get that on a Debian DomU as well.
>
> I hope this vaguely helps...
>
> Alex
>
>
James Aylett wrote:
> On Wed, Oct 24, 2007 at 04:04:22PM +0200, Ron Kass wrote:
>
>
>> Although we should never rule out something completely without
>> checking, I believe quite strongly that the issues we are seeing are
>> not coming from Xen, as per this instance it is a regular dedicated
>> Linux (centos 5) machine and the resources are fully dedicated to it.
>>
>
> It seems to me that there are two distinct problems. You have some
> queries that are underperforming, which with some profiling will
> expose either something unusual about your database or code, or a
> bottleneck or optimisation problem in Xapian.
>
> The other is the variation. I agree with Chris that adding Xen into
> the mix is complicating matters considerably. Things like IO
> scheduling, for instance, become harder in even the best
> virtualisation systems. It's bad enough that a single instance of an
> OS can suddenly start doing things you don't expect, even with no
> other significant userspace clients :-/
>
> Out of interest, are your figures averages of multiple runs? If not,
> I'd be interested in seeing 1st, 2nd and 3rd query times (broken down
> as Olly suggests), but with mean & sd over say 100 runs.
>
> (Apologies if you have done that - I've been trying to follow this
> thread closely, but an explosion of posts has combined with a busy
> period at my end.)
>
> J
>
>
Chris Good wrote:
> Ron Kass wrote:
>
>> I believe quite strongly that the issues we are seeing are not coming
>> from Xen, as per this instance it is a regular dedicated Linux
>> (centos 5) machine and the resources are fully dedicated to it.
>>
>
> I'd still encourage you to give it a go if only to rule it out and let
> you run oprofile. Running inside Xen certainly shouldn't affect your
> match sets but the diskwriter process kicking in could fully explain
> some of the timing variances that you've seen when re-running queries.
>
Olly Betts wrote:
>> Anyway, we have actually used xapian-compact on the databases to see
>> if it helps. It appears to have got rid of the segmentation fault
>> error on database 10, but the slowness and the variations in
>> estimates still exist.
>>
>
> A seg fault is clearly a bug somewhere, and I'd really like to know
> where. Do you still have the un-compacted database, or if not can you
> recreate it? If so, please rerun the test on it under gdb as I
> requested in my previous mail!
>
> Cheers,
> Olly
_______________________________________________
Xapian-discuss mailing list
Xapian-discuss lists.xapian.org
http://lists.xapian.org/mailman/listinfo/xapian-discuss