Thread: Re: FMA/networking post-UV




Re: FMA/networking post-UV
user name
2008-02-20 01:28:50
On Wed, Feb 20, 2008 at 12:35:06AM -0500, Peter Memishian wrote:
> 
> Mike/Cindi,
> 
> As you may recall, one of the problems with "style 2" DLPI datalinks is
> that opening them bypasses the FMA I/O retire checks in spec_open() (since
> the kernel doesn't know what piece of hardware is actually being accessed
> until the DL_ATTACH_REQ is done).
> 
> However, the /dev/net directory introduced by the recent Clearview UV
> putback consists only of "style 1" DLPI links, which the spec_open()
> checks correctly catch, causing ENXIO to be returned.  Since all libdlpi
> applications check /dev/net first, these style-1 links are now preferred.
> From a RAS standpoint, this is a marked improvement.  However, we've
> already encountered a handful of systems with a network device that
> apparently mostly worked (even though FMA had retired it) which failed to
> open with ENXIO after upgrading to the UV bits.  Of course, once the user
> runs "fmadm faulty", everything falls into place -- but to most, the
> connection between the ENXIO error and FMA may not occur (especially since
> FMA may have done the retire months ago).  I fear this will lead to
> support calls and frustration.
> 
> As such, I had a few points I wanted your input on:
> 
> 	1. Has there been any discussion of a new errno for this case?
> 	   If we had a new errno, such as ERETIRED or EFAULTED, API
> 	   consumers could differentiate this case if appropriate, and
> 	   moreover strerror() could say something more helpful than "No
> 	   such device or address".

Not generally due to the long-standing murky issue of whether
adding new Solaris specific errno values is a thing we do or not.
But personally I have no issue with that.  Another approach would
be just to have dladm deal with it -- i.e. if it gets ENXIO then
make additional calls to realize something is faulty as opposed
to unattached.  It is already the case that ENXIO is overloaded:
e.g. driver failed to attach versus nothing actually there.
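
As an illustration of the "have the tool deal with it" approach (not code from this thread), the following is a minimal sketch in C of separating the overloaded ENXIO into "retired" versus "not there". The type names, the `classify_open_error()` function, and the `is_retired` callback are all hypothetical; in particular, the sketch does not assume which FMA or device-tree interface a real dladm would query.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result codes for this sketch; not part of any Solaris API. */
typedef enum {
	LINK_OPEN_OK,		/* open() succeeded */
	LINK_OPEN_RETIRED,	/* ENXIO, and the follow-up check reports a retire */
	LINK_OPEN_ABSENT,	/* ENXIO for another reason (e.g. nothing attached) */
	LINK_OPEN_OTHER		/* any other errno */
} link_open_status_t;

/*
 * Hypothetical follow-up query supplied by the caller.  A real tool would
 * have to ask FMA or the device tree here; this sketch does not assume
 * which interface that would be.
 */
typedef bool (*retire_check_fn)(const char *path);

/* Stand-in check used only to exercise the sketch. */
bool
demo_always_retired(const char *path)
{
	(void) path;
	return (true);
}

/*
 * Classify the errno left by a failed open() of a datalink.  The extra
 * query runs only when the error is the overloaded ENXIO.
 */
link_open_status_t
classify_open_error(int err, const char *path, retire_check_fn is_retired)
{
	if (err == 0)
		return (LINK_OPEN_OK);
	if (err != ENXIO)
		return (LINK_OPEN_OTHER);
	if (is_retired != NULL && is_retired(path))
		return (LINK_OPEN_RETIRED);
	return (LINK_OPEN_ABSENT);
}
```

A caller would pass the errno from its failed open() and, on LINK_OPEN_RETIRED, point the user at "fmadm faulty" instead of printing the generic ENXIO text.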

> 	2. It seems uneven to have retired networking hardware but not
> 	   have anything reported by dladm -- minimally, I'd think it
> 	   appropriate for show-phys to report this, and (given the
> 	   severity) maybe show-link as well.  (However, I don't want
> 	   dladm to impinge on fmadm's duties.)

It's always a good thing for participating subsystems to report
enriched fault status for their resources, since by definition such
reporting can always be somewhat more useful and better than the
generic FMA view.  The key is to make it connect to the FMA output
(e.g. msgid values).  Examples of this today include svcs -x
and zpool status -x.  Making dladm do same would be a good thing.
 
> 	3. It worries me that in all the cases we've seen thus far, the
> 	   fault was "repaired" and never seen again.  Is this common, or
> 	   is this indicative of bugs in our fault detection code?

Do you have an example?  i.e. what fault was diagnosed?

-Mike

-- 
Mike Shapiro, Solaris Kernel Development.
blogs.sun.com/mws/
_______________________________________________
fm-discuss mailing list
fm-discuss@opensolaris.org

Re: FMA/networking post-UV
user name
2008-02-20 02:02:08
 > > As such, I had a few points I wanted your input on:
 > > 
 > > 	1. Has there been any discussion of a new errno for this case?
 > > 	   If we had a new errno, such as ERETIRED or EFAULTED, API
 > > 	   consumers could differentiate this case if appropriate, and
 > > 	   moreover strerror() could say something more helpful than "No
 > > 	   such device or address".
 > 
 > Not generally due to the long-standing murky issue of whether
 > adding new Solaris specific errno values is a thing we do or not.

Looking at errno.h, I see some recent additions -- e.g. extended
accounting added ENOTACTIVE, and a number of error codes were added for
robust mutexes.  So there seems to be precedent.  (I understand that we
are stuck with at most 151 error codes from now until another major
release, and thus need to be cautious -- but if push came to shove
someday, I'd hope we could recycle ENOANO and other useless junk.)

 > But personally I have no issue with that.

Cool.  To be clear, we're not in a position to make these changes right
now, so there's plenty of time to discuss this.

 > Another approach would be just to have dladm deal with it -- i.e. if it
 > gets ENXIO then make additional calls to realize something is faulty as
 > opposed to unattached.  It is already the case that ENXIO is
 > overloaded: e.g. driver failed to attach versus nothing actually there.

I agree that's possible, though of course it doesn't improve things for
commands that will just do a strerror(errno) after the failed open().
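
For readers unfamiliar with the pattern being referred to, here is a minimal sketch in C of how such a command typically builds its error message. The `format_open_failure()` helper is hypothetical and written only for this illustration; the point is that the text comes entirely from strerror(), so a retired device that fails with ENXIO yields the generic ENXIO wording with no hint of FMA.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: format "<path>: <strerror text>" into buf, the way
 * many commands report a failed open().  Returns what snprintf() returns.
 */
int
format_open_failure(char *buf, size_t len, const char *path, int err)
{
	return (snprintf(buf, len, "%s: %s", path, strerror(err)));
}
```

A command would call this with the errno saved immediately after the failed open(); only a distinct errno value would change the text the user sees.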

 > > 	2. It seems uneven to have retired networking hardware but not
 > > 	   have anything reported by dladm -- minimally, I'd think it
 > > 	   appropriate for show-phys to report this, and (given the
 > > 	   severity) maybe show-link as well.  (However, I don't want
 > > 	   dladm to impinge on fmadm's duties.)
 > 
 > It's always a good thing for participating subsystems to report
 > enriched fault status for their resources, since by definition such
 > reporting can always be somewhat more useful and better than the
 > generic FMA view.  The key is to make it connect to the FMA output
 > (e.g. msgid values).  Examples of this today include svcs -x
 > and zpool status -x.  Making dladm do same would be a good thing.

Great. 

 > > 	3. It worries me that in all the cases we've seen thus far, the
 > > 	   fault was "repaired" and never seen again.  Is this common, or
 > > 	   is this indicative of bugs in our fault detection code?
 > 
 > Do you have an example?  i.e. what fault was diagnosed?

For instance, per 6664330, there was a PCI express fault
[http://sun.com/msg/PCIEX-8000-0A] in early December, but it went
unnoticed and the networking device continued to be used without incident
(until the eventual upgrade to build 83).

-- 
meem
