Thread: startup cache scan hang




startup cache scan hang
2007-05-27 15:17:19
Hi,

Has anyone seen OpenAFS hang on startup, seemingly during a cache
scan?  I experienced hangs on over 50% of our machines after both
recent security updates - even after waiting 15 minutes or more, /afs
doesn't mount.  This is OpenAFS 1.4.4 on both Intel and PowerPC
machines (the problem seems a bit more prevalent on PowerPC).  We
don't have any similar problems on Linux or Solaris.

Here's what the system log says:

May 27 14:42:24 bender kernel[0]: Starting AFS cache scan...
[...]
May 27 14:42:28 bender kernel[0]: [256] waiting for afs_osi_ctxtp
May 27 14:42:33 bender kernel[0]: [256] waiting for afs_osi_ctxtp

And there seem to be a bunch of zombie afsds around.  The below
was transcribed from the screen since my console/SSH connections hung
entirely shortly thereafter, so there may be a few errors in it.

USER    PID %CPU %MEM   VSZ   RSS  TT  STAT STARTED    TIME COMMAND
root    237  0.0  0.4  27692  2028  ??  U    2:42PM  0:00.24 /usr/sbin/afsd -afsdb -stat 10000 -dcache 2500 -daemons 5 -volumes 70 -dynroot -fakestat-all
root    255  0.0  0.0      0     0  ??  Z   31Dec69  0:00.00 (afsd)
root    256  0.0  0.0  27692  2028  ??  Us   2:42PM  0:00.00 /usr/sbin/afsd -afsdb -stat 10000 -dcache 2500 -daemons 5 -volumes 70 -dynroot -fakestat-all
root    252  0.0  0.0      0     0  ??  Z   31Dec69  0:00.00 (afsd)
root    253  0.0  0.0      0     0  ??  Z   31Dec69  0:00.00 (afsd)
root    254  0.0  0.0      0     0  ??  Z   31Dec69  0:00.00 (afsd)
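
For checking a listing like this across many machines, a small script can pick out the zombie rows. The following is a minimal Python sketch and not part of the original report: the function name `find_zombies` and the abbreviated `SAMPLE` text are assumptions made for illustration, and it assumes `ps aux`-style output with PID, STAT and COMMAND header columns.

```python
# Hypothetical helper: list zombie processes in `ps aux`-style text.
# SAMPLE is an abbreviated copy of the listing above, used only for illustration.
SAMPLE = """\
USER    PID %CPU %MEM   VSZ   RSS  TT  STAT STARTED    TIME COMMAND
root    237  0.0  0.4  27692  2028  ??  U    2:42PM  0:00.24 /usr/sbin/afsd -afsdb -dynroot
root    255  0.0  0.0      0     0  ??  Z   31Dec69  0:00.00 (afsd)
root    256  0.0  0.0  27692  2028  ??  Us   2:42PM  0:00.00 /usr/sbin/afsd -afsdb -dynroot
root    252  0.0  0.0      0     0  ??  Z   31Dec69  0:00.00 (afsd)
"""


def find_zombies(ps_text):
    """Return (pid, command) pairs for rows whose STAT field starts with 'Z'."""
    lines = [ln.strip() for ln in ps_text.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split()
    pid_col = header.index("PID")
    stat_col = header.index("STAT")
    cmd_col = header.index("COMMAND")
    zombies = []
    for line in lines[1:]:
        # Split only up to the COMMAND column so arguments stay in one field.
        fields = line.split(None, cmd_col)
        if len(fields) <= cmd_col:
            continue  # short or malformed row, skip it
        if fields[stat_col].startswith("Z"):
            zombies.append((int(fields[pid_col]), fields[cmd_col]))
    return zombies


if __name__ == "__main__":
    for pid, command in find_zombies(SAMPLE):
        print(pid, command)
```

The header is parsed by column name rather than by fixed position, so the same function should also work on the `ps jaxww` layout used later in the thread.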

rxdebug says:

Free packets: 130, packet reclaims: 0, calls: 60, used FDs: 64
not waiting for packets.
0 calls waiting for a thread
1 threads are idle
rx stats: free packets 130, allocs 130, alloc-failures(rcv 0/0,send 0/0,ack 0)
   greedy 0, bogusReads 0 (last from host 0), noPackets 0, noBuffers 0, selects 0, sendSelects 0
   packets read: data 60 ack 51 busy 0 abort 0 ackall 0 challenge 0 response 0 debug 8 params 0 unused 0 unused 0 unused 0 version 0
   other read counters: data 60, ack 51, dup 0 spurious 0 dally 0
   packets sent: data 51 ack 0 busy 0 abort 9 ackall 0 challenge 0 response 0 debug 0 params 0 unused 0 unused 0 unused 0 version 0
   other send counters: ack 0, data 102 (not resends), resends 0, pushed 0, acked&ignored 0
        (these should be small) sendFailed 0, fatalErrors 0
   3 server connections, 0 client connections, 3 peer structs, 3 call structs, 0 free call structs
Done.

and cmdebug returns none_waiting for everything.

I've been thinking of simply blowing away the cache directory before
starting AFS - would that be likely to help?  Is there any other info
that's useful in diagnosing the problem?

-- 
Nicholas Riley <njrileyuiuc.edu> | <http://www.uiuc.edu/ph/www/njriley>
_______________________________________________
port-darwin mailing list
port-darwinopenafs.org
https://lists.openafs.org/mailman/listinfo/port-darwin


Re: startup cache scan hang
2007-05-27 15:48:38
On Sun, May 27, 2007 at 03:17:19PM -0500, Nicholas Riley wrote:
> And there seem to be a bunch of zombie afsds around.  The below
> was transcribed from the screen since my console/SSH connections hung
> entirely shortly thereafter, so there may be a few errors in it.
>
> USER    PID %CPU %MEM   VSZ   RSS  TT  STAT STARTED    TIME COMMAND
> root    237  0.0  0.4  27692  2028  ??  U    2:42PM  0:00.24 /usr/sbin/afsd -afsdb -stat 10000 -dcache 2500 -daemons 5 -volumes 70 -dynroot -fakestat-all
> root    255  0.0  0.0      0     0  ??  Z   31Dec69  0:00.00 (afsd)
> root    256  0.0  0.0  27692  2028  ??  Us   2:42PM  0:00.00 /usr/sbin/afsd -afsdb -stat 10000 -dcache 2500 -daemons 5 -volumes 70 -dynroot -fakestat-all
> root    252  0.0  0.0      0     0  ??  Z   31Dec69  0:00.00 (afsd)
> root    253  0.0  0.0      0     0  ??  Z   31Dec69  0:00.00 (afsd)
> root    254  0.0  0.0      0     0  ??  Z   31Dec69  0:00.00 (afsd)

So it seems two copies of afsd are starting - is this correct?  Here's
another machine:

smithers:~ macadmin$ ps jaxww|head -1; ps jaxww|grep afs
USER       PID  PPID  PGID   SESS JOBC STAT  TT       TIME COMMAND
root       236   129   129 3e65058    0 U     ??    0:00.12 /usr/sbin/afsd -afsdb -stat 10000 -dcache 2500 -daemons 5 -volumes 70 -dynroot -fakestat-all
root       237   236   129 3e65058    0 Z     ??    0:00.00 (afsd)
root       238   236   129 3e65058    0 Z     ??    0:00.00 (afsd)
root       239   236   129 3e65058    0 Z     ??    0:00.00 (afsd)
root       240   236   129 3e65058    0 Z     ??    0:00.00 (afsd)
root       241     1   241 3e654a8    0 Us    ??    0:00.00 /usr/sbin/afsd -afsdb -stat 10000 -dcache 2500 -daemons 5 -volumes 70 -dynroot -fakestat-all
root       340   318   318 3e63678    0 U     ??    0:00.00 /usr/sbin/afsd -shutdown

Tracing pid 236's parents:

root       129    49   129 3e65058    0 Ss    ??    0:00.01 /bin/sh /Library/StartupItems/OpenAFS/OpenAFS start
root        49     1    49 3e665e8    0 SNs   ??    0:00.12 SystemStarter
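
The parent tracing above can also be done mechanically from saved `ps jaxww` output. This is a minimal Python sketch under stated assumptions: the names `parse_ps` and `ancestors` are hypothetical, the `SAMPLE` text is an abbreviated copy of the rows shown above, and the walk stops at the first PPID that is not present in the listing.

```python
# Hypothetical helpers: follow PPID links in `ps jaxww`-style text.
# SAMPLE is abbreviated from the listings above, for illustration only.
SAMPLE = """\
USER       PID  PPID  PGID   SESS JOBC STAT  TT       TIME COMMAND
root       236   129   129 3e65058    0 U     ??    0:00.12 /usr/sbin/afsd -afsdb -dynroot
root       237   236   129 3e65058    0 Z     ??    0:00.00 (afsd)
root       241     1   241 3e654a8    0 Us    ??    0:00.00 /usr/sbin/afsd -afsdb -dynroot
root       129    49   129 3e65058    0 Ss    ??    0:00.01 /bin/sh /Library/StartupItems/OpenAFS/OpenAFS start
root        49     1    49 3e665e8    0 SNs   ??    0:00.12 SystemStarter
"""


def parse_ps(ps_text):
    """Map each PID to a (ppid, command) tuple."""
    lines = [ln.strip() for ln in ps_text.splitlines() if ln.strip()]
    if not lines:
        return {}
    header = lines[0].split()
    pid_col = header.index("PID")
    ppid_col = header.index("PPID")
    cmd_col = header.index("COMMAND")
    table = {}
    for line in lines[1:]:
        # Keep the full command line, including arguments, in the last field.
        fields = line.split(None, cmd_col)
        if len(fields) <= cmd_col:
            continue  # short or malformed row, skip it
        table[int(fields[pid_col])] = (int(fields[ppid_col]), fields[cmd_col])
    return table


def ancestors(table, pid):
    """Return [(pid, command), ...] from pid upward through known parents."""
    chain = []
    seen = set()
    while pid in table and pid not in seen:  # `seen` guards against loops
        seen.add(pid)
        ppid, command = table[pid]
        chain.append((pid, command))
        pid = ppid
    return chain


if __name__ == "__main__":
    for pid, command in ancestors(parse_ps(SAMPLE), 237):
        print(pid, command)
```

Comparing the chain for a zombie child with the chain for the second afsd (pid 241 here, whose parent is pid 1) shows at a glance which copy was launched from the startup item.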

On the machines where AFS started successfully, I don't see the second
copy of afsd started from /Library/StartupItems/OpenAFS... did it exit?

Confusedly,

-- 
Nicholas Riley <njrileyuiuc.edu> | <http://www.uiuc.edu/ph/www/njriley>


Re: startup cache scan hang
2007-05-27 15:52:26
One last potentially useful data point - the afsd processes are stuck
in an ioctl.  I couldn't attach GDB to them but I could get a sample:

Analysis of sampling pid 236 every 10.000000 milliseconds
Call graph:
    100 Thread_1007
      100 0x1831
        100 0x190a
          100 0x7a31
            100 0x44b6
              100 ioctl
                100 ioctl

Analysis of sampling pid 241 every 10.000000 milliseconds
Call graph:
    100 Thread_1007
      100 0x1831
        100 0x190a
          100 0x7a31
            100 0x416e
              100 ioctl
                100 ioctl

-- 
Nicholas Riley <njrileyuiuc.edu> | <http://www.uiuc.edu/ph/www/njriley>


Re: startup cache scan hang
2007-05-28 22:06:15
At 3:17 PM -0500 5/27/07, Nicholas Riley wrote:
>Hi,
>
>Has anyone seen OpenAFS hang on startup, seemingly during a cache
>scan?  I experienced hangs on over 50% of our machines after both
>recent security updates - even after waiting 15 minutes or more, /afs
>doesn't mount.

I had one problem with OpenAFS hanging on startup. It happened when
I was *downgrading* from 1.5.<mumble> to 1.4.<latest> of the day.
In my case, I rebooted in single-user mode, and removed the entire
cache directory.  (I forget if I re-created it after removing it,
or if I just let openafs recreate it).

I haven't seen any hang problem since then.  I'm running openafs
on two PowerPC machines and one Mactel machine.
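
The cache removal described above could be scripted roughly as follows. This is a minimal Python sketch under stated assumptions, not a tested recovery procedure: the function name `clear_cache_dir` is hypothetical, the caller must pass the cache directory named in the client's cacheinfo file (no path is hard-coded here), and it should only be run while afsd is not running. It empties the directory but keeps the directory itself, which sidesteps the question of whether it has to be re-created by hand.

```python
import os
import shutil


def clear_cache_dir(cache_dir):
    """Delete everything inside cache_dir, keeping the directory itself.

    Returns the number of top-level entries removed.
    Assumption: afsd is stopped, so nothing is using the cache files.
    """
    # Collect entries first so the directory is not modified while scanning.
    with os.scandir(cache_dir) as it:
        entries = list(it)
    for entry in entries:
        if entry.is_dir(follow_symlinks=False):
            shutil.rmtree(entry.path)
        else:
            os.unlink(entry.path)  # regular files and symlinks
    return len(entries)


if __name__ == "__main__":
    import sys

    # Usage: python clear_cache.py <cache directory>
    print(clear_cache_dir(sys.argv[1]), "entries removed")
```

Taking the path as an argument rather than guessing a default is deliberate, since pointing a script like this at the wrong directory is destructive.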

-- 
Garance Alistair Drosehn            =   gadgilead.netel.rpi.edu
Senior Systems Programmer           or  gadfreebsd.org
Rensselaer Polytechnic Institute    or  drosihrpi.edu


Re: startup cache scan hang
2007-05-29 12:17:07
On Sun, 27 May 2007, Nicholas Riley wrote:

> Hi,
>
> Has anyone seen OpenAFS hang on startup, seemingly during a cache
> scan?  I experienced hangs on over 50% of our machines after both
> recent security updates - even after waiting 15 minutes or more, /afs
> doesn't mount.  This is OpenAFS 1.4.4 on both Intel and PowerPC
> machines (the problem seems a bit more prevalent on PowerPC).  We
> don't have any similar problems on Linux or Solaris.
>
> Here's what the system log says:
>
> May 27 14:42:24 bender kernel[0]: Starting AFS cache scan...
> [...]
> May 27 14:42:28 bender kernel[0]: [256] waiting for afs_osi_ctxtp
> May 27 14:42:33 bender kernel[0]: [256] waiting for afs_osi_ctxtp
>
> And there seem to be a bunch of zombie afsds around.  The below
> was transcribed from the screen since my console/SSH connections hung
> entirely shortly thereafter, so there may be a few errors in it.
>
> USER    PID %CPU %MEM   VSZ   RSS  TT  STAT STARTED    TIME COMMAND
> root    237  0.0  0.4  27692  2028  ??  U    2:42PM  0:00.24 /usr/sbin/afsd -afsdb -stat 10000 -dcache 2500 -daemons 5 -volumes 70 -dynroot -fakestat-all
> root    255  0.0  0.0      0     0  ??  Z   31Dec69  0:00.00 (afsd)
> root    256  0.0  0.0  27692  2028  ??  Us   2:42PM  0:00.00 /usr/sbin/afsd -afsdb -stat 10000 -dcache 2500 -daemons 5 -volumes 70 -dynroot -fakestat-all
> root    252  0.0  0.0      0     0  ??  Z   31Dec69  0:00.00 (afsd)
> root    253  0.0  0.0      0     0  ??  Z   31Dec69  0:00.00 (afsd)
> root    254  0.0  0.0      0     0  ??  Z   31Dec69  0:00.00 (afsd)

> I've been thinking of simply blowing away the cache directory before
> starting AFS - would that be likely to help?  Is there any other info
> that's useful in diagnosing the problem?

A kernel backtrace would be of tremendous help.


