Thread: LAM: Patch for lamhalt lamd segfault if tkill not found




LAM: Patch for lamhalt lamd segfault if tkill not found
user name
2006-10-11 19:21:30
Hi.  I sent this message to the lam-devel list a few days ago, but that
list appears dead.  I'm assuming the developers have moved on to
OpenMPI.  So, a repost here.

In LAM 7.1.2, I found a segfault in lamd when "lamhalt" is used to
tear down a LAM network.

It happens if the "tkill" executable is not found.

It's in the appropriately named function diediedie() in
otb/sys/haltd/haltd.c

I traced it out, and what's going on is this:

It's building a list of locations to search for "tkill", and passes that
to sfh_path_findv().

The result is the tkillpath string.  It is not checked before passing it
along into sfh_argv_add() later, when it's building up a command line
for tkill to execute with.

Problem is, sfh_argv_add() can't accept a NULL string, as it attempts to
do strlen() on it, and segfaults there.
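
To illustrate the failure mode, here is a minimal sketch of an
argv-append helper that refuses a NULL string instead of handing it to
strlen().  The name argv_add_checked() and its return convention are
hypothetical, made up for this example; this is not LAM's
sfh_argv_add().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an argv-append helper; NOT LAM's
 * sfh_argv_add().  Appends a copy of arg to the NULL-terminated vector.
 * Returns 0 on success, -1 if arg is NULL or memory runs out. */
int argv_add_checked(int *count, char ***vec, const char *arg)
{
    assert(count != NULL && vec != NULL);
    if (arg == NULL)
        return -1;              /* refuse instead of calling strlen(NULL) */

    size_t len = strlen(arg);
    char *copy = malloc(len + 1);
    if (copy == NULL)
        return -1;
    memcpy(copy, arg, len + 1);

    /* Room for the new entry plus the terminating NULL pointer. */
    char **grown = realloc(*vec, ((size_t)*count + 2) * sizeof(*grown));
    if (grown == NULL) {
        free(copy);
        return -1;
    }
    grown[*count] = copy;
    grown[*count + 1] = NULL;
    *vec = grown;
    *count += 1;
    return 0;
}
```

Start with a count of 0 and a NULL vector.  With a check like this, a
failed path lookup turns into an error return the caller can handle
rather than a crash.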

The underlying problem is that $PATH is not being searched correctly.

The sfh_path_findv() function will expand environment variables, so
$PATH gets expanded as the single string /bin:/usr/bin:/whatever, which
is not the correct behaviour for $PATH.  We need to further break up
$PATH, and iterate through each of its components.
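
As a rough sketch of what "break up $PATH and iterate" means, the
function below walks a colon-separated list and tests each directory
with access().  The name find_in_path_list() is hypothetical, and this
is not the implementation of sfh_path_env_find(); it only shows the
general technique.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper, not LAM's sfh_path_env_find(): search each
 * colon-separated directory in path_list for name, testing the
 * candidate with access().  Returns a malloc'd full path (caller
 * frees), or NULL if no component matches. */
char *find_in_path_list(const char *name, const char *path_list, int mode)
{
    assert(name != NULL);
    if (path_list == NULL)
        return NULL;

    const char *start = path_list;
    for (;;) {
        /* Each component ends at the next ':' or at end of string. */
        size_t dirlen = strcspn(start, ":");
        /* An empty component conventionally means the current directory. */
        const char *dir = dirlen ? start : ".";
        size_t dlen = dirlen ? dirlen : 1;

        size_t need = dlen + 1 + strlen(name) + 1;
        char *candidate = malloc(need);
        if (candidate == NULL)
            return NULL;
        snprintf(candidate, need, "%.*s/%s", (int)dlen, dir, name);

        if (access(candidate, mode) == 0)
            return candidate;
        free(candidate);

        if (start[dirlen] == '\0')
            return NULL;        /* last component tried, nothing found */
        start += dirlen + 1;    /* skip past the ':' */
    }
}
```

A call such as find_in_path_list("tkill", getenv("PATH"), X_OK) tries
/bin/tkill, then /usr/bin/tkill, and so on.  Treating the whole
expanded string as one directory instead yields a candidate like
/bin:/usr/bin:/whatever/tkill, which will almost never exist.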

I have a small patch that does this:

1) Use a different function, sfh_path_env_find(), which *does* correctly
break up $PATH, if tkillpath isn't found earlier.

2) If this also fails to find tkillpath, then punt, by doing exit(1).
There's nothing more that can be done anyway, as the execution of tkill
is guaranteed to fail if it can't be found.  This exit call is what
would be done anyway if tkill's fork/exec fails.

This patch works for me, no more segfault.

Josh Lehan
Scyld

diff -urN OLD/lam-7.1.2/otb/sys/haltd/haltd.c NEW/lam-7.1.2/otb/sys/haltd/haltd.c
--- OLD/lam-7.1.2/otb/sys/haltd/haltd.c	2006-02-23 15:26:55.000000000 -0800
+++ NEW/lam-7.1.2/otb/sys/haltd/haltd.c	2006-10-06 21:05:42.000000000 -0700
@@ -214,8 +217,27 @@
   sfh_argv_add(&pathc, &pathv, "$LAMHOME/bin");
   sfh_argv_add(&pathc, &pathv, LAM_BINDIR);
 
-  tkillpath = sfh_path_findv(fname, pathv, R_OK, environ);
+  tkillpath = sfh_path_findv(fname, pathv, X_OK, environ);
   sfh_argv_free(pathv);
+  
+  if (NULL == tkillpath)
+  {
+    tkillpath = sfh_path_env_find(fname, X_OK);
+
+    if (NULL == tkillpath)
+    {
+    	 exit(1);
+	  }
+  }
+  
   sfh_argv_add(&argc, &argv, tkillpath);
 
   if (ao_taken(lam_daemon_optd, "d")) {

_______________________________________________
This list is archived at http://www.lam-mpi.org/MailArchives/lam/
LAM: Patch for lamhalt lamd segfault if tkill not found
user name
2006-10-13 21:01:24
On Oct 11, 2006, at 12:21 PM, Josh Lehan wrote:

> Hi.  I sent this message to the lam-devel list a few days ago, but
> that list appears dead.  I'm assuming the developers have moved on
> to OpenMPI.

It's not dead dead dead, but only viewed with quite low frequency
because most of us are spending 99% of our time on Open MPI.

> In LAM 7.1.2, I found a segfault in lamd when "lamhalt" is used to
> tear down a LAM network.
>
> It happens if the "tkill" executable is not found.

How exactly does this happen, actually?  The code as it stands
searches $LAMHOME/bin and the compiled-in default $bindir; is tkill
not found there?  Do you not have the tkill binary distributed to the
back-end bproc nodes?

> It's in the appropriately named function diediedie() in
> otb/sys/haltd/haltd.c
>
> I traced it out, and what's going on is this:
>
> It's building a list of locations to search for "tkill", and passes
> that to sfh_path_findv().

I'd actually augment your patch to not even search $PATH in the first
place (since it's going to be searched wrong).  Something like the
attached.

Thanks for the patch -- I'm no longer a LAM maintainer, but this
patch meets with my approval.

-- 
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems


LAM: Patch for lamhalt lamd segfault if tkill not found
user name
2006-10-16 18:10:47
Jeff Squyres wrote:
> How exactly does this happen, actually?  The code as it stands searches
> $LAMHOME/bin and the compiled-in default $bindir; is tkill not found
> there?  Do you not have the tkill binary distributed to the back-end
> bproc nodes?

That's right, the tkill binary either is not there, or is just in the
PATH.  LAMHOME is not set.

> I'd actually augment your patch to not even search $PATH in the first
> place (since it's going to be searched wrong).  Something like the
> attached.

OK, I'll delete the line that searches $PATH in the wrong way.

> Thanks for the patch -- I'm no longer a LAM maintainer, but this patch
> meets with my approval.

You're welcome 

Josh
LAM: Patch for lamhalt lamd segfault if tkill not found
user name
2006-12-28 18:32:20
On Oct 16, 2006, at 2:10 PM, Josh Lehan wrote:

> Jeff Squyres wrote:
>> How exactly does this happen, actually?  The code as it stands
>> searches $LAMHOME/bin and the compiled-in default $bindir; is tkill
>> not found there?  Do you not have the tkill binary distributed to the
>> back-end bproc nodes?
>
> That's right, the tkill binary either is not there, or is just in the
> PATH.  LAMHOME is not set.
>
>> I'd actually augment your patch to not even search $PATH in the first
>> place (since it's going to be searched wrong).  Something like the
>> attached.
>
> OK, I'll delete the line that searches $PATH in the wrong way.
>
>> Thanks for the patch -- I'm no longer a LAM maintainer, but this
>> patch meets with my approval.
>
> You're welcome 

Sorry for the extraordinarily long delay.  I've applied Jeff's
version of the patch to the LAM source tree.  I'll be releasing a
second beta of LAM 7.1.3 this afternoon that will include the patch.
Hopefully, if there are no other outstanding issues with the beta, we
can actually do a 7.1.3 release in the near future.

Thanks again!

Brian

-- 
   Brian Barrett
   LAM/MPI developer and all around nice guy
   Have a LAM/MPI day: http://www.lam-mpi.org/

