Thread: LAM: Problem porting to 7.1.4...




LAM: Problem porting to 7.1.4...
user name
2007-12-05 11:08:12
Greetings!  We've used lam 6.x for years successfully, but now have
problems running the same application recompiled against lam 7.1.4.

1) When using the lamd rpi, certain nodes report a bad rank in
   MPI_Allgather:


MPI_Recv: internal MPI error: Bad address (rank 3, comm 3)
Rank (12, MPI_COMM_WORLD): Call stack within LAM:
Rank (12, MPI_COMM_WORLD):  - MPI_Recv()
Rank (12, MPI_COMM_WORLD):  - MPI_Allgather()
Rank (12, MPI_COMM_WORLD):  - main()

Program received signal SIGPIPE, Broken pipe.
[Switching to Thread -1222748480 (LWP 23432)]
0xb72b9dee in write () from /lib/tls/libc.so.6

The call in question is:

  MPI_Allgather(w,mnr,MPI_DOUBLE,q1,mnr,MPI_DOUBLE,ccomm);

ccomm is set up thus (np=16):


  comm=MPI_COMM_WORLD;

  MPI_Comm_rank(comm,&id);


  MPI_Comm_size(comm,&np);
  if (np!=npn)
    error("np!=npnn");

  idr=id/ncb;
  idc=id%ncb;
  
  MPI_Comm_split(comm,idr,idc,&rcomm);
  MPI_Comm_split(comm,idc,idr,&ccomm);


I can confirm both sets of code are executed by node 15.
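
A minimal sketch of which ranks end up together in ccomm, assuming
ncb=4 (ncb is not stated above; 4 is only a guess for np=16).  The
helper names are hypothetical and merely imitate what MPI_Comm_split
does with its color and key arguments: ranks with the same color share
a communicator, and are ordered within it by key.

```c
#include <assert.h>

/* Hypothetical helpers mirroring idr=id/ncb and idc=id%ncb above.
   ncb (the number of columns) is an assumption, not given in the post. */

/* Color used for ccomm: ranks in the same column share it. */
static int col_color(int id, int ncb) { return id % ncb; }

/* Key used for ccomm: the row index orders ranks inside a column. */
static int col_key(int id, int ncb) { return id / ncb; }

/* Number of world ranks that land in the same ccomm as `id`. */
static int ccomm_size(int id, int np, int ncb)
{
    int n = 0;
    for (int other = 0; other < np; other++)
        if (col_color(other, ncb) == col_color(id, ncb))
            n++;
    return n;
}

/* Rank of `id` inside its ccomm.  Keys are distinct within one color
   here, so the rank is the number of members with a smaller key. */
static int ccomm_rank(int id, int np, int ncb)
{
    int rank = 0;
    for (int other = 0; other < np; other++)
        if (col_color(other, ncb) == col_color(id, ncb) &&
            col_key(other, ncb) < col_key(id, ncb))
            rank++;
    return rank;
}
```

Calling ccomm_rank(12, 16, 4) from a small main() gives 3, so under
the ncb=4 assumption world rank 12 would be rank 3 of its column
communicator.  That would be consistent with the "(rank 3, ...)" in the
error line above, though that reading is a guess.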


2) I had written by hand versions of allreduce and bcast which no
   longer work (random message corruption as yet not diagnosed
   further)

static __inline__ int
qdp_allreduce(void *a,int nn,MPI_Comm c,MPI_Datatype d,int size,
	      void (*f)(void *,void *,int)) {

  int i,j,k,r,s;
  static MPI_Comm sc;
  static int si,sj,ss,sr;
  MPI_Request req;
  static void *b1,*b,*be;

  if (be-b1<size*nn)
    r_mem(b,size*nn);

  if (sc==c) {
    i=si;
    j=sj;
    r=sr;
    s=ss;
  } else {
    MPI_Comm_rank(c,&r);
    MPI_Comm_size(c,&s);
    for (i=0,j=1;j+j<=s;i++,j+=j);
    si=i;
    sj=j;
    sr=r;
    ss=s;
    sc=c;
  }

  if (r>=sj) {
    
    MPI_Isend(a,nn,d,r-(s-sj),s,c,&req);
    MPI_Wait(&req,MPI_STATUS_IGNORE);
    MPI_Recv(a,nn,d,r-(s-sj),s,c,MPI_STATUS_IGNORE);

  } else {

    if (r>=sj-(s-sj)) {
      MPI_Recv(b1,nn,d,r+(s-sj),s,c,MPI_STATUS_IGNORE);
      (*f)(a,b1,nn);
    }

    for (i--,j=j/2;i>=0;i--,j=j/2) {
      
      k=r/(j+j);
      k+=k;
      k=r/j-k;
      k=k ? r-j : r+j;
      
      MPI_Isend(a,nn,d,k,i,c,&req);
      MPI_Recv(b1,nn,d,k,i,c,MPI_STATUS_IGNORE);
      MPI_Wait(&req,MPI_STATUS_IGNORE);
      (*f)(a,b1,nn);
      
    }

    if (r>=sj-(s-sj)) {
      MPI_Isend(a,nn,d,r+(s-sj),s,c,&req);
      MPI_Wait(&req,MPI_STATUS_IGNORE);
    }

  }

  return 0;

}
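
For readers following the index arithmetic in qdp_allreduce, here is a
sketch in plain C (no MPI calls) of the three rank computations it
relies on.  The function names are hypothetical and introduced only for
illustration; the arithmetic is copied from the routine above.  It
suggests that the partner chosen in the exchange loop is simply r with
bit j flipped, i.e. a recursive-doubling pattern over the largest
power-of-two subset of ranks.

```c
#include <assert.h>

/* Largest power of two <= s, as computed by
   for (i=0,j=1;j+j<=s;i++,j+=j); in the routine above. */
static int pow2_floor(int s)
{
    int j = 1;
    while (j + j <= s)
        j += j;
    return j;
}

/* Partner selection, copied step by step from the exchange loop. */
static int partner_by_division(int r, int j)
{
    int k = r / (j + j);
    k += k;
    k = r / j - k;            /* 1 if bit j of r is set, else 0 */
    return k ? r - j : r + j;
}

/* Equivalent form when j is a power of two: flip bit j of r. */
static int partner_by_xor(int r, int j) { return r ^ j; }

/* Rank that an "extra" rank r >= sj hands its data to before the
   exchange loop; returns -1 if r is not an extra rank. */
static int fold_target(int r, int s)
{
    int sj = pow2_floor(s);
    return r >= sj ? r - (s - sj) : -1;
}
```

With s=6, for example, pow2_floor gives 4, ranks 4 and 5 fold into
ranks 2 and 3, and ranks 0..3 then exchange pairwise at distances 2
and 1.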

static __inline__ int
qdp_bcast_lin(double *a,int nn,MPI_Comm c,int r,int s,int w) {

  int i;
  static MPI_Request *rq1,*rq,*rqe;

  if (s-1>rqe-rq1)
    r_mem(rq,s-1);

  if (r!=w)
    MPI_Recv(a,nn,MPI_DOUBLE,w,0,c,MPI_STATUS_IGNORE);
  else
    for (i=1,rq=rq1;i<s;i++)
      MPI_Send_init(a,nn,MPI_DOUBLE,(w+i)%s,0,c,rq++);

  if (rq>rq1) {
    MPI_Startall(rq-rq1,rq1);
    for (;--rq>=rq1;)
      MPI_Request_free(rq);
    rq++;
  }


  return 0;

}
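
A small plain-C check (no MPI calls) of the destination arithmetic in
qdp_bcast_lin.  The helper names are hypothetical.  It only confirms
that (w+i)%s for i=1..s-1 reaches every rank except the root w exactly
once; it says nothing about when the posted sends complete.

```c
#include <assert.h>

/* Destination of the i-th send posted by root w among s ranks, as in
   MPI_Send_init(...,(w+i)%s,...) for i = 1 .. s-1. */
static int bcast_dest(int w, int i, int s) { return (w + i) % s; }

/* Returns 1 if every rank other than w is targeted exactly once and
   w itself is never targeted; returns 0 otherwise. */
static int bcast_covers_all(int w, int s)
{
    for (int t = 0; t < s; t++) {
        int hits = 0;
        for (int i = 1; i < s; i++)
            if (bcast_dest(w, i, s) == t)
                hits++;
        if (hits != (t == w ? 0 : 1))
            return 0;
    }
    return 1;
}
```

For instance, with w=3 and s=4 the sends go to ranks 0, 1 and 2 in
that order.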

Has anything changed regarding the blocking/non-blocking status of any
of these calls?

Finally, my code is in several libraries, two of which independently
setup static communicators for parallelization -- is there now some
internal interference for such a strategy within the lam library?

Please let me know if any further details are needed.  lamtests
appears to run fine on this installation.


Thanks!


-- 
Camm Maguire			     			cammenhanced.com
==========================================================================
"The earth is but one country, and mankind its citizens."  --  Baha'u'llah
_______________________________________________
This list is archived at http://www.lam-mpi.org/MailArchives/lam/

Re: LAM: Problem porting to 7.1.4...
user name
2007-12-06 08:40:09
On Dec 5, 2007, at 12:08 PM, Camm Maguire wrote:

> Greetings!  We've used lam 6.x for years successfully, but now have
> problems running the same application recompiled against lam 7.1.4.
>
> 1) When using the lamd rpi, certain nodes report a bad rank in
>   MPI_Allgather:
>
> MPI_Recv: internal MPI error: Bad address (rank 3, comm 3)
> Rank (12, MPI_COMM_WORLD): Call stack within LAM:
> Rank (12, MPI_COMM_WORLD):  - MPI_Recv()
> Rank (12, MPI_COMM_WORLD):  - MPI_Allgather()
> Rank (12, MPI_COMM_WORLD):  - main()

Does it work with the other RPI's?  (unlikely, but I thought I'd ask)

>
> 2) I had written by hand versions of allreduce and bcast which no
>   longer work (random message corruption as yet not diagnosed
>   further)
>
> static __inline__ int
> qdp_allreduce(void *a,int nn,MPI_Comm c,MPI_Datatype d,int size,
> 	      void (*f)(void *,void *,int)) {
>
>  int i,j,k,r,s;

Woof; that's a little too much for me to analyze without a Cisco
support contract, and Open MPI.

> Has anything changed regarding the blocking/non-blocking status of any
> of these calls?

Not really.  I think the core algorithms for allgather have not
changed in LAM for a long, long time.  But I'm afraid that I don't
remember the specifics...

There was a big change in the 7 series when we moved to the component
architecture stuff.  So there was a bit of refactoring of the
collective algorithm code, but the core algorithms should still be the
same.

Have you tried configuring LAM for memory debugging and running your
code through a memory-checking debugger?

Have you tried Open MPI?

> Finally, my code is in several libraries, two of which independently
> setup static communicators for parallelization -- is there now some
> internal interference for such a strategy within the lam library?

I'm not quite sure what you're asking -- MPI gives you MPI_COMM_WORLD
by default.  If you need to subset beyond that, then you can use calls
like MPI_COMM_SPLIT (like you showed, above) and friends.  Are you
asking something beyond that?

-- 
Jeff Squyres
Cisco Systems

Re: LAM: Problem porting to 7.1.4...
user name
2008-01-02 10:49:23
Greetings -- so sorry for the delay here.

Jeff Squyres <jsquyrescisco.com> writes:

> On Dec 5, 2007, at 12:08 PM, Camm Maguire wrote:
> 
> > Greetings!  We've used lam 6.x for years successfully, but now have
> > problems running the same application recompiled against lam 7.1.4.
> >
> > 1) When using the lamd rpi, certain nodes report a bad rank in
> >   MPI_Allgather:
> >
> > MPI_Recv: internal MPI error: Bad address (rank 3, comm 3)
> > Rank (12, MPI_COMM_WORLD): Call stack within LAM:
> > Rank (12, MPI_COMM_WORLD):  - MPI_Recv()
> > Rank (12, MPI_COMM_WORLD):  - MPI_Allgather()
> > Rank (12, MPI_COMM_WORLD):  - main()
> 
> Does it work with the other RPI's?  (unlikely, but I thought I'd ask)
> 

No, but the point of failure is often different.

> >
> > 2) I had written by hand versions of allreduce and bcast which no
> >   longer work (random message corruption as yet not diagnosed
> >   further)
> >
> > static __inline__ int
> > qdp_allreduce(void *a,int nn,MPI_Comm c,MPI_Datatype d,int size,
> > 	      void (*f)(void *,void *,int)) {
> >
> >  int i,j,k,r,s;
> 
> Woof; that's a little too much for me to analyze without a Cisco
> support contract, and Open MPI.
> 

OK, no need, as vanilla allreduce triggers the problem.

> > Has anything changed regarding the blocking/non-blocking status of any
> > of these calls?
>
> Not really.  I think the core algorithms for allgather have not
> changed in LAM for a long, long time.  But I'm afraid that I don't
> remember the specifics...
>
> There was a big change in the 7 series when we moved to the component
> architecture stuff.  So there was a bit of refactoring of the
> collective algorithm code, but the core algorithms should still be the
> same.
> 

OK, I confirm that the same code compiled against lam 6.5.9 runs
flawlessly on the same cluster.  So either an error was introduced in
a subsequent lam release, or 6.5.9 has a bug which masks a bug in my
code, which seems less likely.  How can I chase this down?

> Have you tried configuring LAM for memory debugging and running your
> code through a memory-checking debugger?
> 

Not yet, but this looks promising.  I take it using mpirun to launch
an xterm on each node, running gdb on the code with LD_PRELOAD set to
libefence.so.0.0 is the method of choice?  If not, any more details
here please?

> Have you tried Open MPI?
> 

Alas, no.  As you know, I maintain lam for Debian, and have not found
the time to package openmpi.  Someone else now has, and I am unsure
whether the source compatibility design between the lam and mpich
packages has been maintained or not.  Simply have not had time.

> > Finally, my code is in several libraries, two of which independently
> > setup static communicators for parallelization -- is there now some
> > internal interference for such a strategy within the lam library?
>
> I'm not quite sure what you're asking -- MPI gives you MPI_COMM_WORLD
> by default.  If you need to subset beyond that, then you can use calls
> like MPI_COMM_SPLIT (like you showed, above) and friends.  Are you
> asking something beyond that?
> 

Not really, just confirming that one should be able to divide the
cluster in different ways in different subroutines.  In any case, this
is not the issue, as removing same does not remove the error.

Suggestions most appreciated.

> -- 
> Jeff Squyres
> Cisco Systems

-- 
Camm Maguire			     			cammenhanced.com
==========================================================================
"The earth is but one country, and mankind its citizens."  --  Baha'u'llah