Thread: Hard to track problem

2006-10-25 11:11:44
On Wed, Oct 25, 2006 at 10:53:10AM +0200, Konrad Neuwirth wrote:
> One of our Xen systems reliably stops working in an interesting, but not
> entirely easy-to-understand way.
> 
> First, a sketch of our general configuration:
> 
> We're running two mostly identical systems in a remote data center.  The
> systems run NetBSD 4.0_BETA on Xen 2.  Hardware-wise, they're AMD64
> (single core), running NetBSD/i386 because of Xen; they have an
> re(4)-driven Realtek 8169B Ethernet on board.  The card is driven by the
> dom0 but doesn't have a publicly reachable IP address there -- the first
> interface responding to outside traffic is a single-purpose domU that
> does firewalling and routing to the other domUs.
> 
> Because we don't have a private network at the colocation center, the
> router domUs also have an IPsec-based VPN amongst each other.  The domUs
> run ucarp over that VPN so that we can fail over single domUs if
> something happens.
> 
> One of the domUs is our 'database engine', running pgpool as the HA
> layer on top of postgresql so we can keep the databases in sync on both
> machines easily.  But as soon as we feed a larger database dump into
> pgpool, the entire machine stops responding to outside network traffic.
> The dom0 console logs a "re0: watchdog timeout", and pinging outside
> machines from one of the domUs leads to a "no buffers available" message
> on the xennet interface.  First tests indicate that the problem is that
> too much network traffic is generated at once: feeding the dump to
> just one postgresql server is fine, but as soon as pgpool starts to
> distribute the transactions onto both machines, the one sending the feed
> is dead again.
> 
> We've already increased NMBCLUSTERS to 20480 -- so I suppose that should
> suffice for a while.  Alas, I have no idea how to debug this.  What
> buffers could we still increase?  And -- how can we track down this
> problem well enough to stop it from happening?

My guess is that it's a bug in the re(4) driver. Could you try using
the rtk driver instead (if it supports your card), or using -current?
There have been some fixes to re(4) in -current which may solve your
problem.
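
For the rtk test, a dom0 kernel config change along these lines might
be enough. This is only a sketch: it assumes the dom0 kernel config
attaches both drivers with the usual PCI attachment lines and already
contains the PHY entries rtk needs, and whether rtk attaches to the
8169B at all has to be checked against rtk(4).

```
# Sketch of a dom0 kernel config change (assumed attachment lines):
# disable re(4) so that rtk(4) gets a chance to attach to the card.
#re*	at pci? dev ? function ?	# commented out for this test
rtk*	at pci? dev ? function ?	# only helps if rtk(4) supports the chip
```

After rebuilding and booting the dom0 kernel, dmesg shows which driver
attached; if neither does, the card is not supported by rtk and trying
-current is the remaining option.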

-- 
Manuel Bouyer, LIP6, Universite Paris VI.           Manuel.Bouyerlip6.fr
     NetBSD: 26 ans d'experience feront toujours la difference