Hi. I sent this message to the lam-devel list a few days ago, but that
list appears dead. I'm assuming the developers have moved on to
OpenMPI. So, a repost here.
In LAM 7.1.2, I found a segfault in lamd when "lamhalt" is used to
tear down a LAM network.
It happens if the "tkill" executable is not found.
It's in the appropriately named function diediedie() in
otb/sys/haltd/haltd.c.
I traced it out, and what's going on is this:
It builds a list of locations to search for "tkill", and passes that
to sfh_path_findv().
The result is the tkillpath string. It is not checked before being
passed along into sfh_argv_add() later, when the command line for
executing tkill is being built up.
Problem is, sfh_argv_add() can't accept a NULL string, as it attempts
to do strlen() on it, and segfaults there.
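To illustrate the failure mode, here is a minimal sketch of an argv-building helper that refuses a NULL string instead of handing it to strlen(). The name argv_add_checked() and its return convention are made up for this example; this is not LAM's sfh_argv_add(), whose internals I'm only describing from the trace above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an argv-building helper (NOT LAM's
 * sfh_argv_add()). Appends a copy of arg to a NULL-terminated vector.
 * Returns 0 on success, -1 if arg is NULL or allocation fails. */
int argv_add_checked(int *argc, char ***argv, const char *arg)
{
    assert(argc != NULL && argv != NULL);

    if (arg == NULL)
        return -1;      /* refuse, rather than calling strlen(NULL) */

    size_t len = strlen(arg);
    char *copy = malloc(len + 1);
    if (copy == NULL)
        return -1;
    memcpy(copy, arg, len + 1);

    /* Room for the new entry plus the terminating NULL pointer. */
    char **grown = realloc(*argv, (size_t)(*argc + 2) * sizeof(char *));
    if (grown == NULL) {
        free(copy);
        return -1;
    }

    grown[*argc] = copy;
    grown[*argc + 1] = NULL;
    *argv = grown;
    *argc += 1;
    return 0;
}
```

With a check like this, a failed path lookup shows up as an error return at the call site rather than as a crash inside the helper.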
The underlying problem is that $PATH is not being searched correctly.
The sfh_path_findv() function will expand environment variables, so
$PATH gets expanded to /bin:/usr/bin:/whatever, and that whole string
is then treated as a single location. That is not the correct
behaviour for $PATH. We need to further break up $PATH, and iterate
through each of its components.
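As a rough sketch of what "break up $PATH and iterate" means, the function below walks a colon-separated list and tests each directory with access(). The name find_in_path_list() is hypothetical and the code is only an illustration under POSIX assumptions; it is not how sfh_path_env_find() is implemented.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper (NOT LAM's sfh_path_env_find()). Looks for name
 * in each directory of a colon-separated list and returns a malloc'd
 * full path for the first entry that access() accepts with the given
 * mode, or NULL if there is none. The caller frees the result. */
char *find_in_path_list(const char *name, const char *pathlist, int mode)
{
    assert(name != NULL);

    if (pathlist == NULL)
        return NULL;

    const char *p = pathlist;
    for (;;) {
        size_t dlen = strcspn(p, ":");   /* length of this component */

        /* By convention an empty component means the current directory. */
        const char *dir = dlen ? p : ".";
        size_t n = dlen ? dlen : 1;

        size_t need = n + 1 + strlen(name) + 1;   /* dir + '/' + name + NUL */
        char *full = malloc(need);
        if (full == NULL)
            return NULL;
        snprintf(full, need, "%.*s/%s", (int)n, dir, name);

        if (access(full, mode) == 0)
            return full;
        free(full);

        if (p[dlen] == '\0')
            break;                       /* that was the last component */
        p += dlen + 1;                   /* skip past the ':' */
    }
    return NULL;
}
```

A caller would pass getenv("PATH") as the list, e.g. find_in_path_list("tkill", getenv("PATH"), X_OK). Testing the unsplit string as if it were one directory, which is effectively what happens today, cannot match any real file.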
I have a small patch that does this:
1) If tkillpath isn't found earlier, use a different function,
sfh_path_env_find(), which *does* correctly break up $PATH.
2) If this also fails to find tkill, then punt, by doing exit(1).
There's nothing more that can be done anyway, as the execution of
tkill is guaranteed to fail if it can't be found. This exit call is
what would be done anyway if tkill's fork/exec fails.
This patch works for me, no more segfault.
Josh Lehan
Scyld
diff -urN OLD/lam-7.1.2/otb/sys/haltd/haltd.c NEW/lam-7.1.2/otb/sys/haltd/haltd.c
--- OLD/lam-7.1.2/otb/sys/haltd/haltd.c 2006-02-23 15:26:55.000000000 -0800
+++ NEW/lam-7.1.2/otb/sys/haltd/haltd.c 2006-10-06 21:05:42.000000000 -0700
@@ -214,8 +217,27 @@
sfh_argv_add(&pathc, &pathv, "$LAMHOME/bin");
sfh_argv_add(&pathc, &pathv, LAM_BINDIR);
- tkillpath = sfh_path_findv(fname, pathv, R_OK, environ);
+ tkillpath = sfh_path_findv(fname, pathv, X_OK, environ);
sfh_argv_free(pathv);
+
+ if (NULL == tkillpath)
+ {
+ tkillpath = sfh_path_env_find(fname, X_OK);
+
+ if (NULL == tkillpath)
+ {
+ exit(1);
+ }
+ }
+
sfh_argv_add(&argc, &argv, tkillpath);
if (ao_taken(lam_daemon_optd, "d")) {