List Info

Thread: use PG_writeback in lustre




use PG_writeback in lustre
country flaguser name
United States
2007-05-16 14:54:20
Please don't reply to lustre-devel. Instead, comment in
Bugzilla by using the following link:
htt
ps://bugzilla.lustre.org/show_bug.cgi?id=11710



(In reply to comment #60)
> Created an attachment (id=10547)
Please don't reply to lustre-devel. Instead, comment in
Bugzilla by using the following link:
 --> (https://bugzilla.lustre.org/attachment.cgi?id
=10547&action=view)
> handling recoverable error. v2

Thanks Alexey, I had a chance to review the latest proposed
patch and DLD.
In general I like the refined approach, in particular
breaking the 
recoverable errors in two levels at the rpc and osc layer is
nice.
Plus the optimization of tracking the readonly state in the
import is
nice, but the side effect of having a client evicted is it a
higher
price than I think we're willing to pay.  After auditing the
patch 
I still have other concerns.

1) Minor: The proposed patch is against 1.6 not 1.4, so I
have not yet tried
the patch I've only read through it.  It will need to be
backported to
1.4 because this is serious flaw for 1.4 based release as
well.

2) I'm not thrilled with how fatal errors are now handled by
simply
flagging the page in error in the client page cache.  This
approach
is simply wasteful of the client cache and to my paranoid
mind a little
bit dangerous for client cache cohearency.  There's no
reason to leave 
this page around, it really needs to be discarded as soon as
possible. 

Perhaps the best solution which now occurs to me is to keep
a list of
these pages as was done in LLNL patch.  But instead of
adding another
ll_ap_truncate() method (which was clunky) we could
integrate this cache
reclamation in to llap_shrink_cache().  This function would
have access 
to this fatal page list and all pages in error could be
disposed of first 
before touching any other pages.  We could then simply make
sure
llap_shrunk_cache() is called from brw_interpret_oap() after

osc_ap_completion() has run on all the pages.

We also need to add a multinode test to ensure clients
mounted on
different physcal nodes see the exact same behavior in the
fatal
and recoverable case.  Consistency is very import to us.

3) With this patch ll_ap_completion cannot fail, we should
take a look
at the other users such as obdecho/liblustre and see if it
makes sense 
to make this a void return.  If they can't fail either then
the
conditional check in osc_ap_completion() is just pointless.


I'm sure we'll give this a spin at LLNL soon.  But just as a
heads up there
is a good chance we'll stick with our local patch for our
1.4.8 based release.
We have many weeks of successful testing at scale with this
patch and our
deadline for a getting a new production release ready is
almost here.

_______________________________________________
Lustre-devel mailing list
Lustre-develclusterfs.com
https://mail.clusterfs.com/mailman/listinfo/lustre-devel


[1]

about | contact  Other archives ( Real Estate discussion Medical topics )