Thread: Hard to track problem




Hard to track problem
2006-10-25 08:53:10
One of our Xen systems reliably stops working in an interesting, but not
entirely easy-to-understand way.

First, a sketch of our general configuration:

We're running two mostly identical systems in a remote data center.  The
systems run NetBSD 4.0_BETA on Xen 2.  Hardware-wise, they're AMD64 (single
core, running NetBSD/i386 because of Xen); they have re(4)-driven Realtek
8169B ethernet on board.  The card is driven by the dom0 but doesn't have a
publicly reachable IP address there -- the first interface responding to
outside traffic is a single-purpose domU that does firewalling and
routing to the other domUs.

Because we don't have a private network at the colocation center, the
router domUs also have an IPsec-based VPN amongst each other.  The domUs
run ucarp over that VPN so that we can fail over single domUs if
something happens.

One of the domUs is our 'database engine', running pgpool as the HA
layer on top of postgresql so we can keep the databases in sync on both
machines easily.  But as soon as we feed a larger database dump into
pgpool, the entire machine stops responding to outside network traffic.
The dom0 console logs a "re0: watchdog timeout", and pinging outside
machines from one of the domUs leads to a "no buffers available" message
on the xennet interface.  First tests indicate that the problem is that
too much network traffic is generated at once -- feeding the dump to
just one postgresql server is fine, but as soon as pgpool starts to
distribute the transactions onto both machines, the one sending the feed
is dead again.

We've already increased NMBCLUSTERS to 20480 -- so I suppose that should
suffice for a while.  Alas, I have no idea how to debug this.  What
buffers could we still increase?  And -- how can we track down this
problem well enough to stop it from happening?

Cheers,
 Konrad

Hard to track problem
2006-10-25 11:11:44
On Wed, Oct 25, 2006 at 10:53:10AM +0200, Konrad Neuwirth wrote:
> One of our Xen systems reliably stops working in an interesting, but not
> entirely easy-to-understand way.
>
> First, a sketch of our general configuration:
>
> We're running two mostly identical systems in a remote data center.  The
> systems run NetBSD 4.0_BETA on Xen 2.  Hardware-wise, they're AMD64 (single
> core, running NetBSD/i386 because of Xen); they have re(4)-driven Realtek
> 8169B ethernet on board.  The card is driven by the dom0 but doesn't have a
> publicly reachable IP address there -- the first interface responding to
> outside traffic is a single-purpose domU that does firewalling and
> routing to the other domUs.
>
> Because we don't have a private network at the colocation center, the
> router domUs also have an IPsec-based VPN amongst each other.  The domUs
> run ucarp over that VPN so that we can fail over single domUs if
> something happens.
>
> One of the domUs is our 'database engine', running pgpool as the HA
> layer on top of postgresql so we can keep the databases in sync on both
> machines easily.  But as soon as we feed a larger database dump into
> pgpool, the entire machine stops responding to outside network traffic.
> The dom0 console logs a "re0: watchdog timeout", and pinging outside
> machines from one of the domUs leads to a "no buffers available" message
> on the xennet interface.  First tests indicate that the problem is that
> too much network traffic is generated at once -- feeding the dump to
> just one postgresql server is fine, but as soon as pgpool starts to
> distribute the transactions onto both machines, the one sending the feed
> is dead again.
>
> We've already increased NMBCLUSTERS to 20480 -- so I suppose that should
> suffice for a while.  Alas, I have no idea how to debug this.  What
> buffers could we still increase?  And -- how can we track down this
> problem well enough to stop it from happening?

My guess is that it's a bug in the re(4) driver.  Could you try using
the rtk driver instead (if it supports your card), or using -current?
There have been some fixes to re(4) in -current which may solve your
problem.
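
To tell a driver hang apart from plain mbuf exhaustion, it may help to
log buffer and interface counters while the dump is being fed, so the
last snapshot before the "re0: watchdog timeout" can be inspected
afterwards.  The following is only a minimal sketch in Python: the
function names, the log path and the polling interval are illustrative
and not from this thread, the commands are the standard netstat(1) and
sysctl(8) tools, and whether kern.mbuf.nmbclusters is available on
4.0_BETA is an assumption.

```python
import subprocess
import time

# Assumed diagnostic commands. They are standard NetBSD tools, but which
# counters actually matter for this hang is a guess.
COMMANDS = [
    ["netstat", "-m"],                    # mbuf and cluster usage
    ["netstat", "-i"],                    # per-interface packet/error counters
    ["sysctl", "kern.mbuf.nmbclusters"],  # cluster limit (assumed to exist)
]


def run_command(argv):
    """Run one command and return its stdout, or a marker on failure."""
    try:
        result = subprocess.run(argv, capture_output=True, text=True, timeout=10)
        return result.stdout
    except (OSError, subprocess.SubprocessError) as exc:
        return "<failed: %s>" % exc


def snapshot(commands=COMMANDS, runner=run_command, clock=time.time):
    """Collect one timestamped record holding the output of every command."""
    return {
        "time": clock(),
        "outputs": {" ".join(argv): runner(argv) for argv in commands},
    }


def log_forever(path, interval=5.0):
    """Append a snapshot to `path` every `interval` seconds until killed."""
    while True:
        record = snapshot()
        with open(path, "a") as log:
            log.write("=== %s ===\n" % time.ctime(record["time"]))
            for name, output in record["outputs"].items():
                log.write("--- %s ---\n%s\n" % (name, output))
        time.sleep(interval)
```

Running something like log_forever("/var/tmp/netstate.log") on both the
dom0 and the database domU during a feed, and comparing the final
snapshots, would show whether cluster usage climbs toward the limit
before the interface stops or whether re0 stalls while buffers are
still free.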

-- 
Manuel Bouyer, LIP6, Universite Paris VI.           Manuel.Bouyerlip6.fr
     NetBSD: 26 ans d'experience feront toujours la difference