Thread: LAM: Trouble Specifying SSI Collectives




LAM: Trouble Specifying SSI Collectives
user name
2006-05-11 19:39:49
Hi,

We have an application that is exhibiting very poor performance in a
section of code dominated by calls to various MPI collectives.  The
performance we're seeing with LAM 7.1.2 is much worse than with MPICH
1.2.6.

In an attempt to figure out where the trouble is, I decided to compare
the performance of LAM's different collective modules by explicitly
specifying "-ssi coll xxx" on my mpirun command.  This works fine for
lam_basic, but with both smp and shmem, I get complaints from LAM:

"No SSI coll modules said that they were available to run.  This should
not happen."

The smp test case is running with eight processes spread across four
dual-processor nodes (two per node); the shmem test uses four processes
on a single quad-processor node.

I double-checked the log files from my LAM build, and the smp and shmem
modules both configured and compiled cleanly.  lamtests-7.1.2 runs
successfully as well.

I suspect there's something simple I've overlooked, and I'm hoping
someone on the list can enlighten me.  Here's the mpirun command I use
with "smp":

/usr/local/v9a/generic/lam-7.1.2/bin/mpirun \
    -ssi boot rsh \
    -ssi rpi usysv \
    -ssi coll smp \
    -ssi coll_base_associative 1 \
    -ssi ssi_verbose stdout \
    -nsigs -pty -w -wd ~tom/tests/oceanM -sa -v -nger \
    /tmp/pbslam.app_schema.28722

The "shmem" command is identical except that I substitute "shmem" for
"smp".

Here are the options we used to configure LAM:

./configure \
    --prefix=/usr/local/v9a/generic/lam-7.1.2 \
    --with-boot=rsh \
    --with-rpi=usysv \
    --with-rsh=/bin/rsh \
    --with-rpi-gm=/usr/local/gm \
    --with-rpi-gm-lib=/usr/local/gm/lib/sparcv9 \
    --with-fd-size=4096
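
For anyone wanting to repeat the comparison, the three mpirun runs
described above differ only in the module name given to "-ssi coll".
The following is a minimal dry-run sketch that only prints the three
command lines, so they can be inspected before anything is launched.
The paths and options are copied from the command quoted earlier in
this message; the build_cmd helper and the MPIRUN and APP_SCHEMA
variable names are hypothetical and exist only for illustration.

```shell
#!/bin/sh
# Dry-run sketch: print one mpirun command line per collective module.
# MPIRUN and APP_SCHEMA hold the paths quoted in the message above;
# build_cmd is a hypothetical helper, not part of LAM.
MPIRUN=/usr/local/v9a/generic/lam-7.1.2/bin/mpirun
APP_SCHEMA=/tmp/pbslam.app_schema.28722

build_cmd() {
    # $1 is the coll module name to request via "-ssi coll".
    # "~tom" sits inside double quotes, so it is printed literally here.
    echo "$MPIRUN -ssi boot rsh -ssi rpi usysv -ssi coll $1" \
         "-ssi coll_base_associative 1 -ssi ssi_verbose stdout" \
         "-nsigs -pty -w -wd ~tom/tests/oceanM -sa -v -nger $APP_SCHEMA"
}

# Print the command for each module being compared.
for module in lam_basic smp shmem; do
    build_cmd "$module"
done
```

Each printed line can then be pasted into a shell, or timed, one module
at a time.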

We're running this under Solaris.  LAM is built using Sun's Studio 11
compiler suite.  We see the same problem under Solaris 9 on UltraSPARC
and Solaris 10 on x86/x64 (AMD64).  Off-node communication in both cases
is TCP/IP over Gigabit or Fast Ethernet with usysv.

Any thoughts on what's going wrong and/or how to fix it would be greatly
appreciated.

-Tom

-- 
Tom Crockett

College of William and Mary               email:  tomcompsci.wm.edu
Computational Science Cluster             phone:  (757) 221-2762
Savage House                              fax:    (757) 221-2023
P.O. Box 8795
Williamsburg, VA  23187-8795



_______________________________________________
This list is archived at http://www.lam-mpi.org/MailArchives/lam/
LAM: Trouble Specifying SSI Collectives
user name
2006-05-30 16:59:18
Tom Crockett wrote:
> We have an application that is exhibiting very poor performance in a
> section of code dominated by calls to various MPI collectives.

After a few weeks of experimentation, we are now convinced this is not a
LAM issue, but rather a problem with the bge Gigabit Ethernet driver in
Solaris 10 x86.  If we run the same executable on the same nodes using
IP over InfiniBand instead of IP over Gigabit Ethernet, the performance
discrepancy disappears.  So LAM 7's collectives perform just fine.

-Tom

-- 
Tom Crockett

College of William and Mary               email:  tomcompsci.wm.edu
Computational Science Cluster             phone:  (757) 221-2762
Savage House                              fax:    (757) 221-2023
P.O. Box 8795
Williamsburg, VA  23187-8795

