List Info

Thread: Obscure bad file descriptor error on FreeBSD




Obscure bad file descriptor error on FreeBSD
user name
2007-06-07 03:06:07
Hello.

I've been trying to track down an incredibly obsure problem
that's been
causing my client to use 100% of the CPU time when
apparently doing
nothing. I tracked the problem down to an invalid file
descriptor being
passed to poll() in a loop, resulting in a bad file
descriptor error and
in effect creating a busy loop. My OS is FreeBSD 6.2, silc
toolkit 1.1.

As can be seen in the output below, my program is reponsible
for file
descriptor 0, but the silc client library seems to be under
the
impression that file descriptor 0 belongs to it:

$ ./s706 TEST_CONF1 TEST_CONTROL1
request: enter private key password: ********
confdir_chdir opened 5
confdir_restoredir closed 5
confdir_chdir opened 0
confdir_restoredir closed 0
confdir_chdir opened 0
confdir_restoredir closed 0
confdir_chdir opened 0
confdir_restoredir closed 0
info: client running
info: pid 38065
info: user mc
info: connect 192.168.3.33 10706
confdir_chdir opened 9
confdir_restoredir closed 9
info: found cached server key
info: verified server key
9D48B5CA79D3CFE4A99974E97BD1124C069D9650
Assertion failed: (fcntl(fd, 3, 0) != -1), function
silc_poll, file
silcunixschedule.c, line 147.
Abort trap (core dumped)

I've gone over this repeatedly and I'm pretty much certain
the error is
not my fault (the confdir_* functions above are the only
functions in my
client that call open() and close() and as can be seen, the
confdir_chdir function receives file descriptor 0 as the
result of an
open() call and is therefore the owner of this descriptor).

I believe the silc client library is incorrectly caching a
file
descriptor that it no longer owns, and as a result is
passing this
descriptor to poll() and causing repeated 'bad file
descriptor' errors.
This may alternatively be down to the way that the FreeBSD
thread
library behaves - the fork() call looks suspect to me - that
can
obviously mean that two different entities have references
to the same
table of file descriptors.

I added an assertion in silcunixschedule.c, just before the
call to
poll() to check for bad file descriptors. The assertion is
essentially:

  assert(fcntl(fd, F_GETFL, 0) == -1);

...which is just a call that can check if a file descriptor
is valid or
not.

Here is a system call trace:

Note that the silc library reads from file descriptor 0
here, early in
the execution of the program:

 37529 s706     RET   write 752/0x2f0
 37529 s706     CALL  fcntl(0,0x3,0)
 37529 s706     RET   fcntl 2
 37529 s706     CALL  fcntl(0x4,0x3,0)
 37529 s706     RET   fcntl 6
 37529 s706     CALL  fcntl(0x7,0x3,0)
 37529 s706     RET   fcntl 6
 37529 s706     CALL  poll(0x806a000,0x3,0x7464)
 37529 s706     RET   poll 1
 37529 s706     CALL  read(0,0xbfbfe92f,0x1)
 37529 s706     GIO   fd 0 read 1 byte
       "!"
 37529 s706     RET   read 1
 37529 s706     CALL  gettimeofday(0xbfbfe960,0)
 37529 s706     RET   gettimeofday 0
 37529 s706     CALL  fcntl(0,0x3,0)
 37529 s706     RET   fcntl 2
 37529 s706     CALL  fcntl(0x4,0x3,0)
 37529 s706     RET   fcntl 6
 37529 s706     CALL  fcntl(0x7,0x3,0)
 37529 s706     RET   fcntl 6

Then in the next system call:

 37529 s706     CALL  poll(0x806a000,0x3,0x73b6)
 37529 s706     RET   fork 0
 37529 s706     CALL  kse_release(0x805bf40)
 37529 s706     RET   kse_release 0
 37529 s706     CALL  read(0x7,0x8083000,0x10400)
 37529 s706     GIO   fd 7 read 784 bytes

The process fork()ed - this is most likely due to the
FreeBSD thread
library, as hinted at by the presence of kse_release()
straight after.
A few instructions later:

 37529 s706     RET   open 9
 37529 s706     CALL  fchdir(0x3)
 37529 s706     RET   fchdir 0
 37529 s706     CALL  open(0x804e5c6,0x104,0)
 37529 s706     NAMI  "server_key"
 37529 s706     RET   open -1 errno 2 No such file or
directory
 37529 s706     CALL  fchdir(0x9)
 37529 s706     RET   fchdir 0
 37529 s706     CALL  close(0)
 37529 s706     RET   close 0
 37529 s706     CALL  write(0x2,0x80536c0,0x42)
 37529 s706     GIO   fd 2 wrote 66 bytes
       "warn: unknown server key
9D48B5CA79D3CFE4A99974E97BD1124C069D9650
       "
 37529 s706     RET   write 66/0x42
 37529 s706     CALL  open(0x804eb5d,0x4,0)
 37529 s706     NAMI  "."
 37529 s706     RET   open 0
 37529 s706     CALL  fchdir(0x3)
 37529 s706     RET   fchdir 0
 37529 s706     CALL  open(0x804e5eb,0x104,0)
 37529 s706     NAMI  "server_trust_initially"
 37529 s706     RET   open 10/0xa
 37529 s706     CALL  fstat(0xa,0xbfbfe5b0)

Note that my process called open() and got file descriptor 0
returned
to it. A bit further on:

 37529 s706     RET   read 1
 37529 s706     CALL  write(0x7,0x8065800,0x30)
 37529 s706     GIO   fd 7 wrote 48 bytes
       0x0000 0026 0002 0a00 1008 0200 0000 0000 0000 
|.&..............|
       0x0010 0000 0000 0000 0000 0001 c0a8 0321 d229 
|.............!.)|
       0x0020 c4eb 2e70 c30e e2a6 4f1f da46 0000 0000 
|...p....O..F....|
 37529 s706     RET   write 48/0x30
 37529 s706     CALL  fcntl(0,0x3,0)
                      ^^^^^^^^^^^^^^
                      This is where I added an assertion in
                      silcunixschedule.c

 37529 s706     RET   fcntl -1 errno 9 Bad file descriptor
 37529 s706     CALL  write(0x2,0xbfbfe2c0,0x62)
 37529 s706     GIO   fd 2 wrote 98 bytes
       "Assertion failed: (fcntl(fd, 3, 0) != -1),
function silc_poll, file si
        lcunixschedule.c, line 147.
       "
 37529 s706     RET   write 98/0x62
 37529 s706     CALL  kse_thr_interrupt(0,0x4,0x6)
 37529 s706     NAMI  "s706.core"

I believe that the silc client library thread still believes
file
descriptor 0 to be the socket connected to the server when
in fact
isn't. This could have happened because when the process
called fork(),
my client modified the file descriptor table by closing a
file and
obviously the silc client library couldn't be aware of
this.

Unfortunately, I've still not been able to track down why
the file
descriptor becomes invalid. I can connect to the server, run
commands,
do all of the usual essentials but when the client is doing
nothing,
it's endlessly looping on a failed poll() call...

Any assistance or ideas would be much appreciated.

MC
____________________________________________________________
___________
Info:    https://lists.silcnet.org/mailman/listinfo/silc-announce

Archive: htt
ps://lists.silcnet.org/pipermail/silc-announce
FAQ:     http://silcnet.org/su
pport/faq/

Re: Obscure bad file descriptor error on FreeBSD
user name
2007-06-07 03:55:30
: Note that my process called open() and got file descriptor
0 returned
: to it. A bit further on:
: 
:  37529 s706     RET   read 1
:  37529 s706     CALL  write(0x7,0x8065800,0x30)
:  37529 s706     GIO   fd 7 wrote 48 bytes
:        0x0000 0026 0002 0a00 1008 0200 0000 0000 0000 
|.&..............|
:        0x0010 0000 0000 0000 0000 0001 c0a8 0321 d229 
|.............!.)|
:        0x0020 c4eb 2e70 c30e e2a6 4f1f da46 0000 0000 
|...p....O..F....|
:  37529 s706     RET   write 48/0x30
:  37529 s706     CALL  fcntl(0,0x3,0)
:                       ^^^^^^^^^^^^^^
:                       This is where I added an assertion
in
:                       silcunixschedule.c
: 
: descriptor 0 to be the socket connected to the server when
in fact
: isn't. This could have happened because when the process
called fork(),
: my client modified the file descriptor table by closing a
file and
: obviously the silc client library couldn't be aware of
this.
: 
: Unfortunately, I've still not been able to track down why
the file
: descriptor becomes invalid. I can connect to the server,
run commands,
: do all of the usual essentials but when the client is
doing nothing,
: it's endlessly looping on a failed poll() call...
: 
First, it's odd that you get fd 0 in the first place.  Isn't
0 reserved 
for stdin, at least according to stdin(3).

Second, do you pass these file descriptors you open with
these confdir_* 
functions to SILC Scheduler?  And if you do, how do you pass
them?

Third, if SILC scheduler does listen to these fd's and you
close them or 
FreeBSD closes them, the Fd task must be removed from the
scheduler.  If 
it isn't it will re-schedule it can the end result is bad
file descriptor 
error.

Fourth, the byte "!" that the SILC scheduler read
is the 
silc_schedule_wakeup call that sends "!" byte to
the wakeup socket.  The 
socket is opened by the scheduler when silc_schedule_init is
called.  It 
must not be closed.  If it is, the behavior of the scheduler
is 
undefined.  You don't have access to that fd directly so
only way to close 
that is by doing it intentionally, ie. specifically closing
that fd.

Without seeing code and debug outputs I can't say what's
wrong.

	Pekka
____________________________________________________________
____________
 Pekka Riikonen                                 priikone at
silcnet.org
 Secure Internet Live Conferencing (SILC)       http://silcnet.org/
____________________________________________________________
___________
Info:    https://lists.silcnet.org/mailman/listinfo/silc-announce

Archive: htt
ps://lists.silcnet.org/pipermail/silc-announce
FAQ:     http://silcnet.org/su
pport/faq/

Re: Obscure bad file descriptor error on FreeBSD
user name
2007-06-07 04:13:36
Isn't the following the cause of the problem:

:  37529 s706     RET   open 9
fd = open();
fd == 9
:  37529 s706     CALL  fchdir(0x3)
:  37529 s706     RET   fchdir 0
ret = fchdir(3);
ret == 0;
:  37529 s706     CALL  open(0x804e5c6,0x104,0)
:  37529 s706     NAMI  "server_key"
:  37529 s706     RET   open -1 errno 2 No such file or
directory
fd = open("server_key")
fd == -1
:  37529 s706     CALL  fchdir(0x9)
:  37529 s706     RET   fchdir 0
ret = fchdir(9);
ret == 0;
:  37529 s706     CALL  close(0)
:  37529 s706     RET   close 0
ret == close(0);
^^^^^^^^^^^^^^
ret == 0

shouldn't this read close(9) or close(3).  I can't see you
opening fd 0 
anywhere but I see you closing it.

:  37529 s706     CALL  write(0x2,0x80536c0,0x42)
:  37529 s706     GIO   fd 2 wrote 66 bytes
:        "warn: unknown server key
9D48B5CA79D3CFE4A99974E97BD1124C069D9650
:        "

	Pekka
____________________________________________________________
____________
 Pekka Riikonen                                 priikone at
silcnet.org
 Secure Internet Live Conferencing (SILC)       http://silcnet.org/
____________________________________________________________
___________
Info:    https://lists.silcnet.org/mailman/listinfo/silc-announce

Archive: htt
ps://lists.silcnet.org/pipermail/silc-announce
FAQ:     http://silcnet.org/su
pport/faq/

Re: Obscure bad file descriptor error on FreeBSD
user name
2007-06-07 04:47:51
On 07/06/07, Pekka Riikonen <priikoneiki.fi> wrote:
>
> : Note that my process called open() and got file
descriptor 0 returned
> : to it. A bit further on:
> :
> :  37529 s706     RET   read 1
> :  37529 s706     CALL  write(0x7,0x8065800,0x30)
> :  37529 s706     GIO   fd 7 wrote 48 bytes
> :        0x0000 0026 0002 0a00 1008 0200 0000 0000 0000
 |.&..............|
> :        0x0010 0000 0000 0000 0000 0001 c0a8 0321 d229
 |.............!.)|
> :        0x0020 c4eb 2e70 c30e e2a6 4f1f da46 0000 0000
 |...p....O..F....|
> :  37529 s706     RET   write 48/0x30
> :  37529 s706     CALL  fcntl(0,0x3,0)
> :                       ^^^^^^^^^^^^^^
> :                       This is where I added an
assertion in
> :                       silcunixschedule.c
> :
> : descriptor 0 to be the socket connected to the server
when in fact
> : isn't. This could have happened because when the
process called fork(),
> : my client modified the file descriptor table by
closing a file and
> : obviously the silc client library couldn't be aware
of this.
> :
> : Unfortunately, I've still not been able to track down
why the file
> : descriptor becomes invalid. I can connect to the
server, run commands,
> : do all of the usual essentials but when the client is
doing nothing,
> : it's endlessly looping on a failed poll() call...
> :
> First, it's odd that you get fd 0 in the first place. 
Isn't 0 reserved
> for stdin, at least according to stdin(3).

Yes.

>
> Second, do you pass these file descriptors you open
with these confdir_*
> functions to SILC Scheduler?  And if you do, how do you
pass them?
>

No, they're never used outside of the confdir_* functions.

>
> Fourth, the byte "!" that the SILC scheduler
read is the
> silc_schedule_wakeup call that sends "!" byte
to the wakeup socket.  The
> socket is opened by the scheduler when
silc_schedule_init is called.  It
> must not be closed.  If it is, the behavior of the
scheduler is
> undefined.  You don't have access to that fd directly
so only way to close
> that is by doing it intentionally, ie. specifically
closing that fd.
>

Ok.

More to follow...
MC
____________________________________________________________
___________
Info:    https://lists.silcnet.org/mailman/listinfo/silc-announce

Archive: htt
ps://lists.silcnet.org/pipermail/silc-announce
FAQ:     http://silcnet.org/su
pport/faq/

Re: Obscure bad file descriptor error on FreeBSD
user name
2007-06-07 05:19:43
On 07/06/07, Pekka Riikonen <priikoneiki.fi> wrote:
>
> Isn't the following the cause of the problem:
>
> :  37529 s706     RET   open 9
> fd = open();
> fd == 9
> :  37529 s706     CALL  fchdir(0x3)
> :  37529 s706     RET   fchdir 0
> ret = fchdir(3);
> ret == 0;
> :  37529 s706     CALL  open(0x804e5c6,0x104,0)
> :  37529 s706     NAMI  "server_key"
> :  37529 s706     RET   open -1 errno 2 No such file or
directory
> fd = open("server_key")
> fd == -1
> :  37529 s706     CALL  fchdir(0x9)
> :  37529 s706     RET   fchdir 0
> ret = fchdir(9);
> ret == 0;
> :  37529 s706     CALL  close(0)
> :  37529 s706     RET   close 0
> ret == close(0);
> ^^^^^^^^^^^^^^
> ret == 0
>
> shouldn't this read close(9) or close(3).  I can't see
you opening fd 0
> anywhere but I see you closing it.

Oh dear, oh dear, oh dear.

I've been doing this since 5AM, and after replacing every
single open() and close() call with a central function
in a vain attempt to track file descriptors (valgrind
would just helpfully crash as pthread_attr* is not
currently implemented and the SILC library uses it
somewhere), I discover this gem:

  if (confdir_close_fd(cd->prevfd == -1)) {
     ...

Look at that one closely. Seems a ')' decided to go on
a short walk. That's one of the ugliest bugs I've ever
witnessed.

Thanks for not judging too harshly. I'm off to the home
for damaged C programmers.

In conclusion, not SILC's fault.
MC
____________________________________________________________
___________
Info:    https://lists.silcnet.org/mailman/listinfo/silc-announce

Archive: htt
ps://lists.silcnet.org/pipermail/silc-announce
FAQ:     http://silcnet.org/su
pport/faq/

[1-5]

about | contact  Other archives ( Real Estate discussion Medical topics )