Thread: help locating problems...




help locating problems...
user name
2006-08-10 01:25:31
Okay, I'm new to FMA.  Solaris b44 is reporting the following error:

TIME                 UUID                                 SUNW-MSG-ID
Aug 09 10:47:31.0395 35fc7dee-e5b9-6028-d333-cbcd5a272c35 PCI-8000-7J
  100%  fault.io.pci.device-interr

        Problem in: hc:///motherboard=0/hostbridge=0/pcibus=1/pcidev=3/pcifn=0
           Affects: dev:////pci1f,700000/network3
               FRU: hc:///component=MB



What I can't easily figure out is where the code that generates this
fault resides.  I'm guessing it is in pcisch, but honestly, I'm a bit
at a loss.

Also, are there any design documents for FMA available anywhere?
Ideally I'd like something that both helps me figure out problems like
this (down to tracking them to a line of code) and tells me how to
start adding code to inject my own errors from code that I've written.
(E.g. how do I play with FMA in an unbundled NIC driver, etc.)

-- 
Garrett D'Amore, Principal Software Engineer
Tadpole Computer / Computing Technologies Division,
General Dynamics C4 Systems
http://www.tadpolecomputer.com/
Phone: 951 325-2134  Fax: 951 325-2191

_______________________________________________
fm-discuss mailing list
fm-discuss@opensolaris.org
help locating problems...
user name
2006-08-10 02:32:35
What you're seeing is that we believe there is a fault in a network
controller.  You should also have seen a console message with the case
number (35fc7dee-e5b9-6028-d333-cbcd5a272c35) and the message ID
(PCI-8000-7J).  If you go to the URL http://www.sun.com/msg and type in
the message ID, PCI-8000-7J, you should be taken to a knowledge article
describing this problem in a little more detail.

The faulty NIC is apparently on the motherboard (MB), which is the FRU
that would have to be replaced to fix the problem.
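
The "Problem in" line of the report is an FMRI in the hc scheme, and its
path reads as a chain of component=instance pairs from the motherboard
down to the PCI function.  As a rough illustration only, the Python
sketch below splits the FMRI quoted in this thread into those pairs.
The function name parse_hc_fmri is made up for this example, and the
parsing assumes the simple form shown here; real hc FMRIs may carry
additional fields that this sketch does not handle.

```python
def parse_hc_fmri(fmri):
    """Split a simple hc-scheme FMRI into (component, instance) pairs.

    Hypothetical helper: handles only the plain "hc:///a=0/b=1" form
    quoted in the thread, with an empty authority section.
    """
    prefix = "hc:///"
    if not fmri.startswith(prefix):
        raise ValueError("not a simple hc FMRI: %r" % fmri)
    pairs = []
    for element in fmri[len(prefix):].split("/"):
        # "pcibus=1" becomes ("pcibus", "1"); values stay as strings.
        name, _, instance = element.partition("=")
        pairs.append((name, instance))
    return pairs


problem_in = "hc:///motherboard=0/hostbridge=0/pcibus=1/pcidev=3/pcifn=0"
for name, instance in parse_hc_fmri(problem_in):
    print(name, instance)
```

Read this way, the fault is pinned to PCI device 3, function 0, on bus 1
behind host bridge 0, while the FRU line names the whole motherboard as
the replaceable part.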

You can use the case number to dig out a little more information.
Running

fmdump -V -u 35fc7dee-e5b9-6028-d333-cbcd5a272c35

should show you which ereports led our diagnosis algorithm to believe
the device was faulty.
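
If there are several cases to chase, pulling the case number out of the
summary by hand gets tedious.  The following Python sketch, offered as
one possible approach rather than an established tool, extracts the UUID
and message ID from summary text like that pasted above and builds the
fmdump invocation mentioned in this message.  The names find_cases and
fmdump_command are invented for the example, and the regular expression
assumes the column layout shown in this thread.

```python
import re

# Summary lines as pasted in the original message.
REPORT = """\
TIME                 UUID                                 SUNW-MSG-ID
Aug 09 10:47:31.0395 35fc7dee-e5b9-6028-d333-cbcd5a272c35 PCI-8000-7J
  100%  fault.io.pci.device-interr
"""

# A case UUID followed by whitespace and the message ID column.
UUID_RE = re.compile(
    r"\b([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})\s+(\S+)"
)


def find_cases(text):
    """Return (uuid, message_id) for each summary line found in text."""
    return UUID_RE.findall(text)


def fmdump_command(uuid):
    """Build the argument list for the command suggested in the thread."""
    return ["fmdump", "-V", "-u", uuid]


for uuid, msg_id in find_cases(REPORT):
    print(msg_id, " ".join(fmdump_command(uuid)))
```

The sketch only prints the command; passing the list to
subprocess.run() on a system that has fmdump would execute it.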

-tim

help locating problems...
user name
2006-08-10 02:51:48
And regarding the "code that generates the fault": there isn't one
particular call in the kernel that generates that fault.  Instead, code
in the kernel has generated one or more ereports (short for error
reports).  Those ereports have travelled from the kernel to the
userland fmd, which has recorded them in its error log.  An fmd plugin,
eft, has then coalesced those ereports and decided what fault could
cause the observed symptoms.  Eft has then published something called a
suspect list, which in this case has one entry: the fault you are
seeing.  The suspect list was broadcast to another plugin, the syslog
messaging agent, which should have put a summary of this "case" into
syslog and onto the console.

That's a really brief description of the fault management architecture.
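
The flow described above (ereports are logged, a diagnosis plugin turns
them into a suspect list, and agents receive that list) can be modelled
in a few lines of Python.  This is a conceptual toy, not fmd's plugin
API: the ToyFmd class, the dictionary fields, the diagnosis rule and the
"ereport.example" class name are all invented for illustration.

```python
class ToyFmd:
    """Conceptual stand-in for the flow described above; not fmd's real API."""

    def __init__(self, diagnose, agents):
        self.error_log = []        # every ereport received is recorded
        self.diagnose = diagnose   # plays the role of the eft plugin
        self.agents = agents       # plugins that receive suspect lists

    def post_ereport(self, ereport):
        self.error_log.append(ereport)
        suspects = self.diagnose(self.error_log)
        if suspects:
            # Broadcast the suspect list to every subscribed agent.
            for agent in self.agents:
                agent(suspects)


def toy_diagnosis(error_log):
    """Hypothetical rule: any ereport on the device implies one fault."""
    if any(e["detector"] == "pcidev=3" for e in error_log):
        return [{"class": "fault.io.pci.device-interr", "certainty": 100}]
    return []


console = []  # stands in for syslog and the console


def syslog_agent(suspects):
    for s in suspects:
        console.append("%d%%  %s" % (s["certainty"], s["class"]))


fmd = ToyFmd(toy_diagnosis, [syslog_agent])
# "ereport.example" is a placeholder class name, not a real one.
fmd.post_ereport({"class": "ereport.example", "detector": "pcidev=3"})
print(console)
```

The point of the model is that the fault class string only ever appears
in the diagnosis step, never in the code that posts the ereport.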

-tim

help locating problems...
user name
2006-08-10 04:06:43
Tim Haley wrote:
>
> And regarding the "code that generates the fault", there isn't one
> particular call in the kernel that generates that fault.  Instead code
> in the kernel has generated one or more ereports (short for error
> reports).  Those error reports have come from the kernel to the
> user-land fmd.  The fmd has recorded them in its error log.  An fmd
> plugin, eft, has then coalesced those ereports and decided what fault
> can cause the observed symptoms.  Eft has then published something
> called a suspect list, which in this case has one entry, the fault you
> are seeing.  The suspect list was broadcast to another plugin, the
> syslog messaging agent, which should have put a summary of this "case"
> into syslog and onto the console.

I already more-or-less gathered this high-level description.

What I'm having a hard time with is figuring out which code is sending
the named fault.  The current fault framework makes it very hard (at
least for me) to identify this by, for example, grepping for
fault.io.pci.device-interr in the framework.

Is there some easy correlation where I can find things like these
strings in a call from (for example) the PCI nexus driver?

It would be really, really helpful to be able to go from the logged
message to a line of code somewhere.  The notion "well, you had a
random PCI fault on this particular piece of hardware" is really useful
if I am a sysadmin and need to replace the hardware.  But as an
engineer developing hardware platforms or writing device driver code,
this is a lot less useful.

(In this particular case, we have seen some other platform-wide
problems with ethernet on this particular device -- too many packet
drops in sunvts, for example.  We don't know why, but I'd really like
to be able to correlate this to what some driver thinks is going wrong,
because then I stand a much better chance of correlating it to a
problem that may exist with, for example, the design of the platform
(maybe some kind of electrical problem or a mis-connected trace or
somesuch).)

In the old days, code that just did cmn_err() was a bit easier because
we could grep for specific strings.  Now, with the FMA stuff, that
doesn't seem to work anymore.

    -- Garrett

help locating problems...
user name
2006-08-10 20:50:35
Garrett D'Amore wrote:

> In the old days, code that just did cmn_err() was a bit easier because
> we could grep for specific strings.  Now, with the FMA stuff, that
> doesn't seem to work anymore.

... or you got nothing at all, perhaps because the error detector was
not enabled or the blanket response to an error interrupt would be to
panic/complain with some generic message.

But the point is that the code which catches the error often/typically
is not the code which aggravated the error to begin with (if indeed it
is a driver defect as opposed to a hardware defect).  It is typically
an error interrupt handler or trap handler.  So even if it vomits a
cmn_err directly you're none the wiser.

In your case you have a fault.io.pci.device-interr.  Faults are what
may be diagnosed from the incoming ereport telemetry flow, so you won't
find that string in the kernel but in the diagnosis stuff.  PCI is
diagnosed using the eversholt language, and the rules for this
"internal device fault" are in
usr/src/cmd/fm/eversholt/files/common/pci.esc.  This describes a fault
propagation tree: faults propagate (->) to errors (aberrant condition
present, possibly not yet detected), which propagate to ereports.  When
an ereport is received, eversholt works backwards from the ereport
(what the detector saw) towards a fault, testing various conditions
along the way.  There are a bunch of ways to get a device internal
fault from the tree, but you can look at the ereports associated with
your fault (via fmdump and fmdump -e) and see what particular class of
ereport you did experience.
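
To make the "works backwards" idea concrete, here is a small Python
sketch of walking a propagation tree from an ereport back to the faults
that could explain it.  It is not eversholt syntax and does not reflect
the contents of pci.esc: apart from the fault class from this thread,
every node name is a placeholder invented for the example, and the
condition testing that eversholt performs along the way is left out.

```python
# Forward propagations, written the way the text describes them:
# fault -> error -> ereport. Every name below except the fault class
# from the thread is a placeholder invented for this sketch.
PROPAGATIONS = [
    ("fault.io.pci.device-interr", "error.example.a"),
    ("fault.io.pci.device-interr", "error.example.b"),
    ("fault.example.other", "error.example.b"),
    ("error.example.a", "ereport.example.x"),
    ("error.example.b", "ereport.example.y"),
]


def candidate_faults(ereport, propagations=PROPAGATIONS):
    """Walk the edges backwards from an ereport to the faults behind it."""
    # Invert the edges: for each node, which nodes propagate to it?
    causes = {}
    for src, dst in propagations:
        causes.setdefault(dst, []).append(src)

    found, seen, stack = set(), set(), [ereport]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        if node.startswith("fault."):
            found.add(node)
        stack.extend(causes.get(node, []))
    return sorted(found)


print(candidate_faults("ereport.example.x"))
print(candidate_faults("ereport.example.y"))
```

In this toy tree the first ereport points at a single fault while the
second is ambiguous between two, which is why knowing the exact ereport
class you received narrows down which branch of the rules applied.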

Hope that helps

Cheers

Gavin