
Thread: LAM: Send/Recv delay due to network timeouts + getitimer interrupts




LAM: Send/Recv delay due to network timeouts + getitimer interrupts
Canada
2007-08-11 11:37:01
We have had a 4 machine MPI application running under LAM for about a
year. Each machine acquires and processes real-time data. Information
about the acquired data is exchanged with the other 3 machines. The
system uses matched MPI Ssend/Recv calls over jumbo frame (MTU=9000)
1gigE Ethernet. Each machine is connected to a 1 gigE jumbo-frame
configured switch.

About 99.9999% of the time this works. 

However the TCP transfers, underneath the MPI software layer, sometimes
time out. The kernel generates retries and eventually the TCP packet is
transferred, the Ssend/Recv calls complete and we have our data. This
creates an excessive delay for our application, the real-time
acquisition falls apart and we have to restart the system.

We can tolerate an exchange data drop out but cannot tolerate excessive
timeouts, say greater than 20 msec.
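One way to quantify how often the 20 msec budget is actually broken is to time every exchange against a monotonic clock and count the overruns. A minimal sketch in C, assuming POSIX clock_gettime(); the names elapsed_ms, missed_deadline, timed_exchange and EXCHANGE_DEADLINE_MS are hypothetical and not part of LAM or MPI.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdbool.h>
#include <time.h>

/* Hypothetical budget taken from the post: anything slower is an overrun. */
#define EXCHANGE_DEADLINE_MS 20.0

/* Milliseconds elapsed between two CLOCK_MONOTONIC samples. */
static double elapsed_ms(const struct timespec *start, const struct timespec *end)
{
    return (double)(end->tv_sec - start->tv_sec) * 1e3 +
           (double)(end->tv_nsec - start->tv_nsec) / 1e6;
}

/* True if the interval from start to end exceeded the real-time budget. */
static bool missed_deadline(const struct timespec *start, const struct timespec *end)
{
    return elapsed_ms(start, end) > EXCHANGE_DEADLINE_MS;
}

/* Run one exchange (e.g. an MPI_Ssend/MPI_Recv pair wrapped in a callback)
 * and report whether it overran the budget. */
static bool timed_exchange(void (*exchange)(void *), void *arg)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    exchange(arg);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return missed_deadline(&t0, &t1);
}
```

A monotonic clock is used so that NTP adjustments during a run cannot make an exchange look shorter or longer than it really was. Logging each overrun with its timestamp would allow later correlation with the kernel's retransmission counters.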

We have tried: 

1.  Send and Ssend calls - made no difference 

2. Using standard Ethernet MTU=1500 in place of jumbo-frame MTU=9000
Ethernet - jumbo frames are about 10% faster and do not affect the
time-out issue.

3. Kernels from 2.6.17 through to 2.6.21 - made no difference.

4. Recompiling the kernel with TCP timeout reduced from 250 to 50
msec - helps but does not solve the problem.

5. Changing 1 gigE switches - same problems but frequency of problem
varies with switch.

6. Interrupting the Ssend/Recv calls with a SIGALRM signal generated
from a "getitimer" system call. MPI does not return an error code (as
expected). It hangs when interrupted.

7. Enabling system call interrupts with the "sysinterrupt" system call
and using the "getitimer" SIGALRM mechanism - no change - MPI still
hangs.

8. Tried GridMPI. GridMPI attempts to solve the bursty packet problem
by pacing the packets at a fixed spacing between packets. See
http://www.gridmpi.org/ and specifically the
http://www.gridmpi.org/publications/cluster05-matsuda.pdf document.
GridMPI, while it plausibly explained our dilemma, did not cure it. We
went back to LAM-MPI.

9. Switched from LAM 2.1.1 to 2.1.2 to 2.1.3 - no change (We have not
tried 2.1.4)
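For reference on items 6 and 7: in POSIX it is setitimer() that arms an interval timer and causes SIGALRM to be delivered (getitimer() only reads the timer's current value), and whether a blocked system call fails with EINTR or is silently restarted is governed by the SA_RESTART flag passed to sigaction(). The older siginterrupt() call sets the same flag and is presumably what "sysinterrupt" refers to. The sketch below demonstrates the mechanism on a plain pipe read; read_with_alarm and on_alarm are hypothetical names. Note that a library which retries its own system calls after EINTR would simply resume waiting, which would be consistent with the hang described above; whether LAM's TCP layer behaves that way is an assumption, not something checked here.

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

/* The handler does nothing; its only job is to make a blocked call return. */
static void on_alarm(int sig) { (void)sig; }

/* Block in read() on fd, but give up after timeout_ms milliseconds.
 * Returns the byte count, or -1 with errno == EINTR if the timer fired first. */
static ssize_t read_with_alarm(int fd, void *buf, size_t len, long timeout_ms)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;                      /* no SA_RESTART: read() fails with EINTR */
    sigaction(SIGALRM, &sa, NULL);

    struct itimerval tv;
    memset(&tv, 0, sizeof tv);            /* it_interval = 0: one-shot timer */
    tv.it_value.tv_sec = timeout_ms / 1000;
    tv.it_value.tv_usec = (timeout_ms % 1000) * 1000;
    setitimer(ITIMER_REAL, &tv, NULL);    /* setitimer arms; getitimer only reads */

    ssize_t n = read(fd, buf, len);
    int saved = errno;

    struct itimerval off;
    memset(&off, 0, sizeof off);
    setitimer(ITIMER_REAL, &off, NULL);   /* disarm in case read() finished first */
    errno = saved;
    return n;
}
```

If SA_RESTART were set instead, the same read() would be restarted by the kernel after the handler ran and would keep blocking, which is the behaviour items 6 and 7 were trying to avoid.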

Any thoughts or suggestions?

_______________________________________________
This list is archived at http://www.lam-mpi.org/MailArchives/lam/

Re: LAM: Send/Recv delay due to network timeouts + getitimer interrupts
United States
2007-08-13 20:51:46
On Aug 11, 2007, at 10:37 AM, Greg Blair wrote:

> We have had a 4 machine MPI application running under LAM for about a
> year. Each machine acquires and processes real-time data. Information
> about the acquired data is exchanged with the other 3 machines. The
> system uses matched MPI Ssend/Recv calls over jumbo frame (MTU=9000)
> 1gigE Ethernet. Each machine is connected to a 1 gigE jumbo-frame
> configured switch.
>
> About 99.9999% of the time this works.
>
> However the TCP transfers, underneath the MPI software layer, sometimes
> time out. The kernel generates retries and eventually the TCP packet is
> transferred, the Ssend/Recv calls complete and we have our data. This
> creates an excessive delay for our application, the real-time
> acquisition falls apart and we have to restart the system.

Unfortunately, this is out of the scope of what LAM/MPI was designed
for (real-time message delivery, that is), so I can't offer too much
advice. I don't have enough in-depth knowledge of the TCP stack to
have an opinion on how to keep its retransmission time down to the
point you need. Perhaps there's someone on this list more familiar
with TCP than I am...
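To put numbers on the retransmission time: the standard TCP retransmission timeout (RFC 2988, since updated by RFC 6298) is computed from a smoothed RTT and its variation, clamped to a minimum, and doubled after each timeout. On a switched gigabit LAN the measured RTT is far below a millisecond, so the minimum dominates. That fits item 4 of the original post: lowering the kernel's floor from 250 to 50 msec shortens each stall, but a lost segment still costs at least one full floor, and a second loss costs twice that. A sketch of the calculation in C; struct rto_state, rto_update and rto_backoff are hypothetical names, clock granularity is ignored, and the floor is a parameter rather than any particular kernel's constant.

```c
/* RTT estimator state, all values in milliseconds. */
struct rto_state {
    double srtt;      /* smoothed round-trip time */
    double rttvar;    /* round-trip time variation */
    double rto;       /* current retransmission timeout */
    int have_sample;  /* 0 until the first RTT measurement */
};

/* Fold in one RTT sample r and return the new RTO, clamped to min_rto. */
static double rto_update(struct rto_state *s, double r, double min_rto)
{
    if (!s->have_sample) {
        s->srtt = r;                              /* first sample */
        s->rttvar = r / 2.0;
        s->have_sample = 1;
    } else {
        double err = s->srtt - r;
        if (err < 0)
            err = -err;
        s->rttvar = 0.75 * s->rttvar + 0.25 * err;  /* beta = 1/4, uses old SRTT */
        s->srtt = 0.875 * s->srtt + 0.125 * r;      /* alpha = 1/8 */
    }
    s->rto = s->srtt + 4.0 * s->rttvar;           /* K = 4 */
    if (s->rto < min_rto)
        s->rto = min_rto;                         /* on a LAN the floor wins */
    return s->rto;
}

/* Exponential backoff after a timeout: double the RTO, capped at max_rto. */
static double rto_backoff(struct rto_state *s, double max_rto)
{
    s->rto *= 2.0;
    if (s->rto > max_rto)
        s->rto = max_rto;
    return s->rto;
}
```

For example, a first sample of 0.2 msec gives 0.2 + 4 * 0.1 = 0.6 msec before clamping, so with a 50 msec floor the RTO is 50 msec, and after one backoff it is 100 msec, already well past a 20 msec budget.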

Good luck,

Brian

-- 
   Brian Barrett
   LAM/MPI Developer
   Make today a LAM/MPI day!



Re: LAM: Send/Recv delay due to network timeouts + getitimer interrupts
Germany
2007-08-17 11:34:32
[ Sorry for the late reply... ]

On Sat, 11 Aug 2007, Greg Blair wrote:

> We can tolerate an exchange data drop out but cannot tolerate excessive
> timeouts, say greater than 20 msec.

Then I'd say that you chose poorly in using MPI over TCP/IP for data
exchange between processes. Something like UDP seems a lot more
appropriate, possibly with some control mechanisms like RDP (Reliable
Datagram Protocol) or, even better, RTP (Real-time Transport Protocol),
which is often used for video/audio transmissions with the same
characteristics as your transmission: dropping is bad, delay is worse.

> 4. Recompiling the kernel with TCP timeout reduced from 250 to 50
> msec - helps but does not solve the problem.

This just allows the kernel to notice sooner that a packet might be
missing and retry transmission - it only eases the symptoms, but does
not cure the cause. You can check this by looking at the retransmission
count among the TCP statistics (e.g. 'netstat --statistics --tcp').
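This check can be scripted, so that retransmissions can be correlated with the stalls the application sees. On Linux, the TCP counters that 'netstat --statistics --tcp' reports come from /proc/net/snmp, where a "Tcp:" line of field names is followed by a "Tcp:" line of values in the same order (RetransSegs is one of them). A sketch of a parser for that layout in C; snmp_tcp_counter is a hypothetical name, and it works on text already read into memory so it can be tested without touching /proc.

```c
#include <stdlib.h>
#include <string.h>

/* Look up one counter (e.g. "RetransSegs") in text laid out like Linux's
 * /proc/net/snmp: a "Tcp:" header line of names followed by a "Tcp:" line
 * of values. Stores the value in *out and returns 0, or returns -1. */
static int snmp_tcp_counter(const char *text, const char *name, long long *out)
{
    const char *lines[2] = {0};
    int found = 0;

    /* Find the first two lines that start with "Tcp:". */
    for (const char *p = text; *p && found < 2; ) {
        if (strncmp(p, "Tcp:", 4) == 0)
            lines[found++] = p + 4;
        const char *nl = strchr(p, '\n');
        if (!nl)
            break;
        p = nl + 1;
    }
    if (found < 2)
        return -1;

    /* Walk header names and values in lockstep until the name matches. */
    const char *h = lines[0], *v = lines[1];
    for (;;) {
        h += strspn(h, " ");
        v += strspn(v, " ");
        if (*h == '\n' || *h == '\0' || *v == '\n' || *v == '\0')
            return -1;                    /* ran out of fields */
        size_t hl = strcspn(h, " \n");
        size_t vl = strcspn(v, " \n");
        if (hl == strlen(name) && strncmp(h, name, hl) == 0) {
            *out = strtoll(v, NULL, 10);
            return 0;
        }
        h += hl;
        v += vl;
    }
}
```

In practice one would read /proc/net/snmp before and after an exchange that overran its budget; a RetransSegs increase across that window would support the packet-loss explanation rather than, say, a scheduling delay on the host.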

> 5. Changing 1 gigE switches - same problems but frequency of problem
> varies with switch.

This seems to indicate that the hardware side is responsible for
losing packets. It doesn't necessarily mean that the switch is bad; it
could also be a problem with cabling, network cards and especially link
negotiation between card and switch port.

> 9. Switched from LAM 2.1.1 to 2.1.2 to 2.1.3 - no change (We have not
> tried 2.1.4)

LAM is currently at 7.1.x; is this a typo on your side?

-- 
Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.CostescuIWR.Uni-Heidelberg.De

