[ Sorry for the late reply... ]
On Sat, 11 Aug 2007, Greg Blair wrote:
> We can tolerate an exchange data drop out but cannot tolerate excessive
> timeouts, say greater than 20 msec.
Then I'd say that MPI over TCP/IP was a poor choice for data exchange
between processes. Something like UDP seems a lot more appropriate,
possibly with some control mechanism on top, like RDP (Reliable
Datagram Protocol) or, even better, RTP (Real-time Transport Protocol),
which is often used for video/audio transmissions with the same
characteristics as your transmission: dropping is bad, delay is worse.
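
To illustrate the "dropping is bad, delay is worse" trade-off, here is a
rough Python sketch of a UDP receiver that gives up on a datagram after
20 msec instead of waiting for a retransmission. The names make_socket
and recv_or_skip and the loopback address are my own assumptions for
the example, not anything from your application or from LAM.

```python
import socket

# 20 msec, the tolerance mentioned in the quoted message.
DEADLINE_S = 0.020

def make_socket(bind_addr=("127.0.0.1", 0)):
    """Create a UDP socket whose blocking calls time out after DEADLINE_S.

    The loopback address and port 0 (let the OS pick) are assumptions
    made only so that the sketch is self-contained.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(bind_addr)
    sock.settimeout(DEADLINE_S)
    return sock

def recv_or_skip(sock, bufsize=2048):
    """Return the next datagram, or None if nothing arrives in time.

    A None result means "treat this exchange as dropped and move on",
    which is the behaviour TCP cannot give you.
    """
    try:
        data, _peer = sock.recvfrom(bufsize)
        return data
    except socket.timeout:
        return None
```

The caller decides what a missed exchange means (reuse the previous
data, interpolate, skip the step). Sequence numbers or anything
RTP-like would have to be layered on top; this sketch does not do that.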
> 4. Recompiling the kernel with TCP timeout reduced from 250 to 50 msec
> - helps but does not solve the problem.
This just allows the kernel to notice sooner that a packet might be
missing and to retry the transmission; it only eases the symptoms, but
does not cure the cause. You can check this by looking at the
retransmission counts among the TCP statistics (e.g.
'netstat --statistics --tcp').
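
If you want to watch the retransmission count over time rather than
read the netstat output by hand, something along these lines might
help. It is only a sketch and assumes a Linux node, where
/proc/net/snmp carries a "Tcp:" header line followed by a "Tcp:" value
line; the function names are made up for the example.

```python
def tcp_counters(snmp_text):
    """Parse the 'Tcp:' header/value line pair from /proc/net/snmp text.

    Returns a dict mapping counter names (e.g. 'RetransSegs') to ints.
    """
    tcp_lines = [
        line.split()[1:]
        for line in snmp_text.splitlines()
        if line.startswith("Tcp:")
    ]
    if len(tcp_lines) < 2:
        raise ValueError("no Tcp: header/value pair found")
    names, values = tcp_lines[0], tcp_lines[1]
    return {name: int(value) for name, value in zip(names, values)}

def retransmit_ratio(counters):
    """Fraction of sent segments that were retransmissions."""
    sent = counters["OutSegs"]
    return counters["RetransSegs"] / sent if sent else 0.0

if __name__ == "__main__":
    # Linux-only path; an assumption about the node being inspected.
    with open("/proc/net/snmp") as fh:
        counters = tcp_counters(fh.read())
    print(counters["RetransSegs"], retransmit_ratio(counters))
```

Sampling this before and after a run on each node should show whether
the stalls coincide with a jump in retransmitted segments.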
> 5. Changing 1 gigE switches - same problems but frequency of problem
> varies with switch.
This seems to indicate that the hardware side is responsible for
losing packets. It doesn't necessarily mean that the switch is bad; it
could also be a problem with cabling, network cards and especially
link negotiation between card and switch port.
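
A quick way to spot a bad negotiation result on many nodes is to
compare what each interface reports against what you expect. The
sketch below assumes a Linux kernel that exposes the speed and duplex
attributes under /sys/class/net/<iface>/ (reading them can fail if the
link is down or the driver does not support it); the helper names and
the 1000 Mbit/s default are assumptions for the example.

```python
import os

def link_settings(iface, sysfs_root="/sys/class/net"):
    """Read the negotiated speed (Mbit/s) and duplex mode for iface.

    sysfs_root is a parameter only so the function can be pointed at a
    different directory; on a real node the default should be used.
    """
    def read(attr):
        with open(os.path.join(sysfs_root, iface, attr)) as fh:
            return fh.read().strip()
    return {"speed_mbps": int(read("speed")), "duplex": read("duplex")}

def looks_misnegotiated(settings, expected_mbps=1000):
    """True if the link is not at the expected speed or not full duplex."""
    return (
        settings["speed_mbps"] != expected_mbps
        or settings["duplex"] != "full"
    )
```

A port that came up at 100 Mbit/s or half duplex on a gigabit network
would be a likely candidate for the packet loss. The error counters on
the card and the switch port are worth a look as well.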
> 9. Switched from LAM 2.1.1 to 2.1.2 to 2.1.3 - no change (We have not
> tried 2.1.4)
LAM is currently at 7.1.x; is this a typo on your side?
--
Bogdan Costescu
IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu IWR.Uni-Heidelberg.De
_______________________________________________
This list is archived at http://www.lam-mpi.org/MailArchives/lam/