One of our Xen systems reliably stops working in an interesting, but
not entirely easy-to-understand, way.
First, a sketch of our general configuration:
We're running two mostly identical systems in a remote data center. The
systems run NetBSD 4.0_BETA on Xen 2. Hardware-wise, they're AMD64
(single core) running NetBSD/i386 because of Xen; they have re-driven
Realtek 8169B ethernet on board. The card is driven by the dom0 but
doesn't have a publicly reachable IP address there -- the first
interface responding to outside traffic is a single-purpose domU that
does firewalling and routing to the other domUs.
Because we don't have a private network at the colocation center, the
router domUs also have an IPsec-based VPN between them. The domUs run
ucarp over that VPN so that we can fail over individual domUs if
something happens.
One of the domUs is our 'database engine', running pgpool as the HA
layer on top of PostgreSQL so we can easily keep the databases in sync
on both machines. But as soon as we feed a larger database dump into
pgpool, the entire machine stops responding to outside network traffic.
The dom0 console logs a "re0: watchdog timeout", and pinging outside
machines from one of the domUs leads to a "no buffers available"
message on the xennet interface. First tests indicate that the problem
is that too much network traffic is generated at once -- feeding the
dump to just one PostgreSQL server is fine, but as soon as pgpool starts
to distribute the transactions onto both machines, the one sending the
feed is dead again.
We've already increased NMBCLUSTERS to 20480 -- so I suppose that
should suffice for a while. Alas, I have no idea how to debug this.
What buffers could we still increase? And -- how can we track this
problem down well enough to stop it from happening?
Cheers,
Konrad