Thread: 2 bonnies can stop disk activity permanently




2006-10-08 22:37:22
Hi Bruce,

I'm the "veronica" Arne mentioned on the freebsd-fs mailing list.
Regarding the effectiveness of a higher blocksize, these are my findings:

areca RAID5 (8x da, 128KB stripe, default newfs, NCQ enabled)
              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
ARC8xR5  8480 119973 91.3 247178 58.6 67862 17.5 90426 86.9 172490 24.0 120.7  0.5

areca RAID5 (8x da, 128KB stripe, 64KB blocksize newfs, NCQ enabled)
              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
ARC8xR5  8480 128920 97.8 265920 58.9 116787 31.0 103261 97.8 392970 53.8 119.8  0.6

As you can see, block reads increased from ~172MB/s to ~392MB/s, quite
a significant increase. The rewrite speed also increased, from ~67MB/s
to ~116MB/s.
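
For a quick sanity check of those ratios, the K/sec figures from the two
bonnie runs above can be compared directly. This is only a sketch: the
dictionary and function names are made up for illustration, and the numbers
are simply copied from the two tables.

```python
# K/sec figures copied from the two bonnie runs above.
default_newfs = {"block write": 247178, "rewrite": 67862, "block read": 172490}
bs64k_newfs = {"block write": 265920, "rewrite": 116787, "block read": 392970}

def speedups(before, after):
    """Return after/before for every test present in the first run."""
    return {name: after[name] / before[name] for name in before}

ratios = speedups(default_newfs, bs64k_newfs)
for name, ratio in ratios.items():
    print(f"{name:12s} x{ratio:.2f}")
```

Block reads a bit more than double and rewrites improve by roughly 70%,
while block writes barely move.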

Of course these tests were run on a brand-new, empty filesystem, which
might not tally with real-life crowded filesystems. But at least there
is much potential in a higher blocksize, and it would be a shame for it
to crash FreeBSD. There are quite a few people who store big files on
big RAID arrays; they could profit from a non-crashing FreeBSD with a
bigger blocksize. Besides, a crashing VFS/GEOM isn't all that sexy.


- Veronica
_______________________________________________
freebsd-fs@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-fs
To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org"
2006-10-09 19:37:33
On Mon, 9 Oct 2006, Fluffles.net wrote:

> I'm the "veronica" Arne mentioned on the freebsd-fs mailing list.
> Regarding the effectiveness of a higher blocksize, these are my findings:
> ...
> As you can see, block reads increased from ~172MB/s to ~392MB/s, quite
> a significant increase. The rewrite speed also increased, from ~67MB/s
> to ~116MB/s.
> ...

This is a bit surprising.  FreeBSD is supposed to cluster the i/o so
that (especially for large files on new file systems) almost all i/o
is done in blocks of size 64K or 128K.

I suspect the problems are that the 64K-block i/o is usually perfectly
misaligned unless the fs itself has 64K-blocks and the fs's partition
starts on a 64K-block boundary, and that some hardware or firmware
(mainly RAIDs) want the blocks to be aligned.  I'm not very familiar
with RAIDs but think it would take a fairly advanced/expensive one to
reblock all the i/o so that the alignment doesn't matter.  It would
take more advanced/complicated clustering code or better buffering code
than FreeBSD has to do the reblocking at the clustering or buffering
level.  Perhaps even 64K-blocks are too small with your RAID's stripe
size of 128K.

Bruce
_______________________________________________
2006-10-09 20:47:25
Bruce Evans wrote:
> On Mon, 9 Oct 2006, Fluffles.net wrote:
>
>> ...
>> As you can see, block reads increased from ~172MB/s to ~392MB/s, quite
>> a significant increase. The rewrite speed also increased, from ~67MB/s
>> to ~116MB/s.
>> ...
>
> This is a bit surprising.  FreeBSD is supposed to cluster the i/o so
> that (especially for large files on new file systems) almost all i/o
> is done in blocks of size 64K or 128K.
>
> I suspect the problems are that the 64K-block i/o is usually perfectly
> misaligned unless the fs itself has 64K-blocks and the fs's partition
> starts on a 64K-block boundary, and that some hardware or firmware
> (mainly RAIDs) want the blocks to be aligned.  I'm not very familiar
> with RAIDs but think it would take a fairly advanced/expensive one to
> reblock all the i/o so that the alignment doesn't matter.  It would
> take more advanced/complicated clustering code or better buffering code
> than FreeBSD has to do the reblocking at the clustering or buffering
> level.  Perhaps even 64K-blocks are too small with your RAID's stripe
> size of 128K.
>
> Bruce

Yes, it's a well-known problem that the combination of
fdisk+disklabel+ufs means that all FS blocks are mis-aligned in the
worst way possible (blocks start on odd sector numbers).  This
_horribly_ pessimizes RAID-5 on most controllers.  Solving it reliably
and automatically is hard, though.  The filesystem ultimately needs to
know the physical sector that it starts on, and compensate accordingly.
You could cheat by having the disklabel tools always align partitions,
but the tool would still need to know the physical address of where it
starts in the slice.  Either way, something high up needs to get the
logical to physical translation of the sectors.

Suggestions have been made to just put blind offsets into the disklabel
tool that assumes the common case (mbr is present and is a known length,
and that the disklabel is in the first slice of the MBR).  Obviously,
this is only a crude hack.  I get around this right now by not using a
disklabel or fdisk table on arrays where I value speed.  For those, I
just put a filesystem directly on the array, and boot off of a small
system disk.

Scott
_______________________________________________
2006-10-09 21:50:30
On Mon, 09 Oct 2006 14:47:25 -0600, in sentex.lists.freebsd.fs you wrote:

>this is only a crude hack.  I get around this right now by not using a
>disklabel or fdisk table on arrays where I value speed.  For those, I
>just put a filesystem directly on the array, and boot off of a small
>system disk.


Hi Scott,
	How is that done?  Just newfs -O2 -U /dev/da0?

	---Mike
--------------------------------------------------------
Mike Tancsa, Sentex communications http://www.sentex.net
Providing Internet Access since 1994
mike@sentex.net, (http://www.tancsa.com)
_______________________________________________
2006-10-09 21:53:47
Mike Tancsa wrote:
> On Mon, 09 Oct 2006 14:47:25 -0600, in sentex.lists.freebsd.fs you wrote:
>
>>this is only a crude hack.  I get around this right now by not using a
>>disklabel or fdisk table on arrays where I value speed.  For those, I
>>just put a filesystem directly on the array, and boot off of a small
>>system disk.
>
> Hi Scott,
> 	How is that done?  Just newfs -O2 -U /dev/da0?
>
> 	---Mike

Yup.

Scott
_______________________________________________
2006-10-09 23:13:36
On Mon, 9 Oct 2006, Scott Long wrote:

> Bruce Evans wrote:
>> ...
>> I suspect the problems are that the 64K-block i/o is usually perfectly
>> misaligned unless the fs itself has 64K-blocks and the fs's partition
>> starts on a 64K-block boundary, and that some hardware or firmware
>> (mainly RAIDs) want the blocks to be aligned.  I'm not very familiar
>> ...
>
> Yes, it's a well-known problem that the combination of
> fdisk+disklabel+ufs means that all FS blocks are mis-aligned in the
> worst way possible (blocks start on odd sector numbers).  This
> _horribly_ pessimizes RAID-5 on most controllers.

Apparently the internal fs block alignment/size problem is not so well
known.  I knew about the external one but didn't connect it with fs
block sizes at first.  How horribly do aligned 16K-blocks pessimize
RAID-5?  Does it help much to have misaligned 64K-blocks instead of
misaligned 16K-blocks?

> Solving it reliably
> and automatically is hard, though.  The filesystem ultimately needs to
> know the physical sector that it starts on, and compensate accordingly.
> You could cheat by having the disklabel tools always align partitions,
> but the tool would still need to know the physical address of where it
> starts in the slice.  Either way, something high up needs to get the
> logical to physical translation of the sectors.

The filesystem shouldn't need to know more than that its starting sector
is not physically misaligned.  The clustering code could use knowledge
of physical offsets and alignment requirements to fix up some cases.
My version of newfs_msdosfs(8) uses the (unimplemented) ioctl
DIOCMEDIAOFFSET to ask the system for the physical offset.  Using
this is much easier than parsing XML.

> Suggestions have been made to just put blind offsets into the disklabel
> tool that assumes the common case (mbr is present and is a known length,
> and that the disklabel is in the first slice of the MBR).  Obviously,
> this is only a crude hack.  I get around this right now by not using a
> disklabel or fdisk table on arrays where I value speed.  For those, I
> just put a filesystem directly on the array, and boot off of a small
> system disk.

I normally align FreeBSD slices and partitions manually to a "cylinder"
boundary, and this sometimes gives alignment to a large power of 2
accidentally.  I normally use a fake cylinder size of 16065 (255 fake
heads and 63 sectors per fake track).  This is just as bad for cylinder
alignment as 63 is for track alignment, but new systems only need it
for compatibility with other systems.
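
Why cylinder alignment only "sometimes" lands on a large power of 2 can be
seen with a little arithmetic: since 16065 is odd, the k-th fake cylinder
boundary is only as well aligned as k itself. A small sketch, assuming
512-byte sectors (the helper name is hypothetical):

```python
SECTOR_BYTES = 512           # assumed sector size
CYLINDER_SECTORS = 255 * 63  # 16065 sectors per fake cylinder

def alignment_bytes(start_sector):
    """Largest power-of-two byte boundary that a non-zero sector sits on."""
    # x & -x isolates the lowest set bit, i.e. the largest power of two
    # that divides x.
    return (start_sector & -start_sector) * SECTOR_BYTES

for k in (1, 2, 64, 128):
    start = k * CYLINDER_SECTORS
    print(f"cylinder {k:3d}: sector {start:7d}, "
          f"aligned to {alignment_bytes(start)} bytes")
```

Under these assumptions a slice only starts on a 64K boundary when its
starting cylinder number is a multiple of 128.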

Bruce
_______________________________________________
2006-10-11 15:53:04
On Mon, 09 Oct 2006 15:53:47 -0600, in sentex.lists.freebsd.fs you wrote:

>Mike Tancsa wrote:
>> On Mon, 09 Oct 2006 14:47:25 -0600, in sentex.lists.freebsd.fs you wrote:
>>
>>>this is only a crude hack.  I get around this right now by not using a
>>>disklabel or fdisk table on arrays where I value speed.  For those, I
>>>just put a filesystem directly on the array, and boot off of a small
>>>system disk.
>>
>> 	How is that done?  Just newfs -O2 -U /dev/da0?
>
>Yup.

Hi,
	Is this going to work in most/all cases?  In other words, how
do I make sure the file system I lay down is indeed properly /
optimally aligned with the underlying structure?

	---Mike
--------------------------------------------------------
Mike Tancsa, Sentex communications http://www.sentex.net
Providing Internet Access since 1994
mike@sentex.net, (http://www.tancsa.com)
_______________________________________________
2006-10-11 16:55:18
Mike Tancsa wrote:
> On Mon, 09 Oct 2006 15:53:47 -0600, in sentex.lists.freebsd.fs you wrote:
>
>>Mike Tancsa wrote:
>>>>this is only a crude hack.  I get around this right now by not using a
>>>>disklabel or fdisk table on arrays where I value speed.  For those, I
>>>>just put a filesystem directly on the array, and boot off of a small
>>>>system disk.
>>>
>>>	How is that done?  Just newfs -O2 -U /dev/da0?
>>
>>Yup.
>
> Hi,
> 	Is this going to work in most/all cases?  In other words, how
> do I make sure the file system I lay down is indeed properly /
> optimally aligned with the underlying structure?
>
> 	---Mike

UFS1 skips the first 8k of its space to allow for
bootstrapping/partitioning data.  UFS2 skips the first 64k.
Blocks are then aligned to that skip.  64K is a good alignment
for most RAID cases.  But understanding exactly how RAID-5 works
will help you make appropriate choices.

(Note that in the following write-up I'm actually describing RAID-4.
The only difference between RAID-4 and 5 is that in RAID-5 the parity
data is spread out across all of the disks instead of being kept all
on a single disk.  However, this is just a performance detail, and it's
easier to describe how things work if you ignore it.)

As you might know, RAID-4/5 takes N disks and writes data to N-1 of
them while computing and writing a parity calculation to the Nth
disk.  That parity calculation is a logical XOR of the data disks.
One of the neat properties of XOR is that it's a reversible algorithm;
you can take the final answer and re-run the XOR using all but one of
the original components and get an answer that represents the data of
the missing component.
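
The reversibility of XOR described above can be tried out on a toy example.
This is only an illustrative sketch: the four byte strings stand in for
equal-sized pieces of four hypothetical data disks, and xor_blocks is a
made-up helper, not anything a real controller exposes.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]  # four hypothetical data disks
parity = xor_blocks(data)                     # what the parity disk would hold

# Pretend the third disk died: XOR the parity with the surviving data disks.
survivors = data[:2] + data[3:]
rebuilt = xor_blocks(survivors + [parity])
assert rebuilt == data[2]
```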

The array is divided into 'stripes', each stripe containing an equal
subsection of each data disk plus the parity disk.  When we talk about
'stripe size', what we are referring to is the size of one of those
subsections.  A 64K stripe size means that each disk is divided into
64K subsections.  The total amount of data in a stripe is then a
function of the stripe size and the number of disks in the array.  If
you have 5 disks in your array and have set a stripe size of 64K, each
stripe will hold a total of 256K of data (4 data disks and 1 parity
disk).
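
As a small worked example of that arithmetic (a sketch only; the function
name is hypothetical, and it assumes 'stripe size' means the per-disk
subsection, as defined above):

```python
KIB = 1024

def stripe_data_bytes(num_disks, stripe_size):
    """Data held by one full stripe on RAID-4/5: one subsection per data disk."""
    return (num_disks - 1) * stripe_size

# 5 disks with a 64K stripe size, the example from the text.
print(stripe_data_bytes(5, 64 * KIB) // KIB)   # 256

# 8 disks with a 128K stripe size, assuming the array benchmarked earlier
# in this thread uses the same definition of stripe size.
print(stripe_data_bytes(8, 128 * KIB) // KIB)  # 896
```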

Every time you write to a RAID-5 array, parity needs to be updated.
As everything operates in terms of the stripes, the most
straightforward way to do this is to read all of the data from the
stripe, replace the portion that is being written, recompute the
parity, and then write out the updates.  This is also the slowest way
to do it.

An easy optimization is to buffer the writes and look for situations
where all of the data in a stripe is being written sequentially.  If
all of the data in the stripe is being replaced, there is no need to
read any of the old data.  Just collect all of the writes together,
compute the parity, and write everything out all at once.

Another optimization is to recognize when only one member of the stripe
is being updated.  For that, you read the parity, read the old data, and
then XOR out the old data and XOR in the new data.  You still have the
latency of waiting for a read, but on a busy system you reduce head
movement on all of the disks, which is a big win.
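
The single-member update can be checked against the slow full-stripe
recomputation with a few lines of code. This is a sketch under the same
simplified RAID-4 model as the text; the helper names and the sample bytes
are invented for illustration.

```python
def xor(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def full_parity(chunks):
    """Parity computed the slow way, from every data chunk in the stripe."""
    parity = bytes(len(chunks[0]))  # all zeros
    for chunk in chunks:
        parity = xor(parity, chunk)
    return parity

def updated_parity(old_parity, old_data, new_data):
    """XOR the old data out of the parity, then XOR the new data in."""
    return xor(xor(old_parity, old_data), new_data)

stripe = [b"\x01\x02", b"\x10\x20", b"\x0f\xf0", b"\x33\xcc"]  # hypothetical chunks
parity = full_parity(stripe)

new_chunk = b"\xaa\x55"
shortcut = updated_parity(parity, stripe[1], new_chunk)  # needs only one data disk
stripe[1] = new_chunk
assert shortcut == full_parity(stripe)
```

The shortcut reads one data chunk and the parity, no matter how many disks
are in the array, which is where the saving in head movement comes from.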

Both of these optimizations rely on the writes having a certain amount
of alignment.  If your stripe size is 64k and your writes are 64k, but
they all start at an 8k offset into the stripe, you lose.  Each 64K
write will have to touch 56k of one disk and 8k of the next disk.  But,
an 8k offset can be made to work if you reduce your stripe size to 8k.
It then becomes an exercise in balancing the parameters of FS block
size and array stripe size to give you the best performance for your
needs.  The 64k offset in UFS2 gives you more room to work here, so
that's why I say at the beginning that it's a good value.  In any case,
you want to choose parameters that result in each block write covering
either a single disk or a whole stripe.
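
The 56k/8k split mentioned above falls out of a few lines of arithmetic.
The sketch below is illustrative only: split_write is a hypothetical helper
that numbers the per-disk subsections consecutively and ignores where the
parity lives.

```python
KIB = 1024

def split_write(offset, length, stripe_size):
    """Return [(subsection_index, bytes_written), ...] for a single write."""
    pieces = []
    end = offset + length
    while offset < end:
        index = offset // stripe_size
        boundary = (index + 1) * stripe_size   # where this subsection ends
        n = min(end, boundary) - offset
        pieces.append((index, n))
        offset += n
    return pieces

# 64K write at an 8K offset with a 64K stripe size: 56K on one disk, 8K on the next.
print(split_write(8 * KIB, 64 * KIB, 64 * KIB))

# The same write, aligned: exactly one whole subsection.
print(split_write(64 * KIB, 64 * KIB, 64 * KIB))

# The 8K offset again, but with an 8K stripe size: eight whole subsections.
print(split_write(8 * KIB, 64 * KIB, 8 * KIB))
```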

Where things really go bad for BSD is when a _63_ sector offset gets
introduced for the MBR.  Now everything is offset to an odd,
non-power-of-2 value, and there isn't anything that you can tweak in the
filesystem or array to compensate.  The best you can do is to manually
calculate a compensating offset in the disklabel for each partition.
But at that point, it often becomes easier to just ditch all of that and
put the filesystem directly on the disk.
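
To put numbers on the 63-sector problem, the sketch below counts how many
filesystem blocks end up crossing a subsection boundary, and computes the
kind of compensating offset mentioned above. It assumes 512-byte sectors
and follows the simplified model in this message (partition start plus the
UFS2 64k skip, ignoring any further disklabel offsets); all names are
hypothetical.

```python
SECTOR_BYTES = 512  # assumed sector size
KIB = 1024

def straddling_blocks(first_block_offset, block_bytes, stripe_bytes, count=1024):
    """Count how many of the first `count` blocks cross a subsection boundary."""
    crossing = 0
    for i in range(count):
        start = first_block_offset + i * block_bytes
        last = start + block_bytes - 1
        if start // stripe_bytes != last // stripe_bytes:
            crossing += 1
    return crossing

def compensating_sectors(start_sector, stripe_bytes):
    """Sectors to add so that start_sector lands on a subsection boundary."""
    return -start_sector % (stripe_bytes // SECTOR_BYTES)

ufs2_skip = 64 * KIB
bare = ufs2_skip                             # filesystem directly on the array
behind_mbr = 63 * SECTOR_BYTES + ufs2_skip   # 63-sector offset in front

for block_kib in (16, 64):
    block = block_kib * KIB
    print(block_kib, "K blocks:",
          straddling_blocks(bare, block, 64 * KIB), "crossing on the bare array,",
          straddling_blocks(behind_mbr, block, 64 * KIB), "behind the MBR offset")

print("pad by", compensating_sectors(63, 64 * KIB), "sectors")
```

With a 64K stripe size, one in four 16K blocks and every single 64K block
crosses a boundary behind the 63-sector offset, while none do on the bare
array. Padding by 65 sectors moves the start to sector 128, which is a 64K
boundary.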

Scott
_______________________________________________
2006-10-11 19:41:19
On 10/11/06 11:55, Scott Long wrote:
> ...


Scott,

Just wanted to say thanks for such a well put explanation on this, with
all the right details.

Eric



-- 
------------------------------------------------------------------------
Eric Anderson        Sr. Systems Administrator        Centaur Technology
Anything that works is better than anything that doesn't.
------------------------------------------------------------------------