List Info

Thread: Re: fs/udf: vm pages "overlap" while reading large dir




Re: fs/udf: vm pages "overlap" while reading large dir
country flaguser name
Ukraine
2008-02-06 10:29:22
on 04/02/2008 17:55 Scott Long said the following:
> Andriy Gapon wrote:
[snip]
>> After some code reading and run-time debugging,
here are some facts
>> about udf directory reading:
>> 1. bread-ing is done via device vnode (as opposed
to directory vnodes),
>> as a consequence udf_strategy is not involved in
directory reading
>> 2. the device vnode has a non-NULL v_object, this
means that VMIO is used
[strike out incorrect guess]
>> 4. bread is done on as much data as possible, up to
MAXBSIZE [and even
>> slightly more in some cases] (i.e. code determines
directory data size
>> and attempts to read in everything in one go)
>> 5. physical sector number adjusted to DEV_BSIZE
(512 bytes) sectors is
>> passed to bread - this is because of #1 above and
peculiarities of buf
>> system
>> 6. directory reading and file reading are quite
different, because the
>> latter does reading via file vnode
>>
>> Let's a consider a simple case of "directory
reading" 5 * PAGE_SIZE (4K)
>> bytes starting from a physical sector N with N%2 =
0 from media with
>> sector size of 2K (most common CD/DVD case):
>> 1. we want to read 12 KB = 3 pages = 6 sectors
starting from sector N,
>> N%2 = 0
>> 2. 4*N is passed as a sector number to bread
>> 3. bo_bsize of the corresponding bufobj is a media
sector size, i.e. 2K
>> 4. actual read in bread will happen from b_iooffset
of 4*N*DEV_BSIZE =
>> N*2K, i.e. correct offset within the device
>> 5. for a fresh read getblk will be called
>> 6. getblk will set b_offset to
blkno*bo_bsize=4*N*2K, i.e. 4 times the
>> correct byte offset within the device
>> 7. allocbuf will allocate pages using B_VMIO case
>> 8. allocbuf calculates base page index as follows:
pi = b_offset/PAGE_SIZE
>> this means the following:
>> sectors N and N+1 will be bound to a page with
index 4*N*2K/4K = 2*N
>> sectors N+2 and N+3 will be bound to the next page
2*N +1
>> sectors N+4 and N+5 is tied to the next page 2*N +
2
[insert] note that bstrategy will get correct params as
b_iooffset is
set to blkno*DEV_BSIZE, i.e. correct byte offset.
>>
>> Now let's consider a "directory read" of
2 sectors (1 page) starting at
>> physical sector N+1.
>> Using the same calculations as above, we see that
the sector will be
>> tied to a page 2*(N+1) = 2*N + 2.
>> And this page already contains valid cached data
from the previous read,
>> *but* it contains data from sectors N+4 and N+5
instead of N+1 and N+2.
>>
>> So, we see, that because of b_offset being 4 times
the actual byte
>> offset, we get incorrect page indexing. Namely,
sector X gets associated
>> with different pages depending on what sector is
used as a starting
>> sector for bread. If bread starts at sector N, then
data of sector N+1
>> is associated with (second half of) page with index
2*N; but if bread
>> starts at sector N+1 then it gets associated with
(the first half of)
>> page with index 2*(N+1).
[snip]
> I think you forgot to attach the patch =-)  This is an
excellent 
> write-up, and I think you're on the right track.  It's
been nearly 8 
> years since I wrote most of this code, so my memory is
a bit fuzzy, but
> I think the reason that I used the device vnode is
because I was having
> trouble with overlapping buffers when using the
directory vnode, so this
> was the easiest way to avoid that at the time.  Since
it sounds like 
> this is no longer a viable solution, I definitely
support your ideas.

[additional analysis of the problem, with new facts that
might give rise
to a different solution]

Small summary of the above long description.
For directory reading fs/udf performs bread() on a
(underlying) device
vnode. It passes block number as if block size was 512 bytes
(i.e.
byte_offset_within_dev/512). On the other hand vm page index
calculation
code uses the following formula:
(blkno*bo_bsize)/PAGE_SIZE.
For CD/DVD devices bo_bsize is set to 2048. This results in
page index
overlap when reading sufficiently many blocks from adjacent
starting blocks.
That is, it seems that in "normal" cases page
index gets calculated as
byte_offset/PAGE_SIZE, but in this case page index gets to
be
4*byte_offset/PAGE_SIZE. More details are in the quoted
above text.

With a lot of help from Bruce Evance I found this commit
which seems to
have made a big difference:
http://www.freebsd.org/cg
i/cvsweb.cgi/src/sys/kern/vfs_bio.c.diff?r1=1.458;r2=1.459;f
=h

Before this change page index for device vnodes was always
(blkno*512)/PAGE_SIZE. This is because of the vn_isdisk()
case.

udf_mountfs device vnode is passed to g_vfs_open() (in
geom_vfs.c) and
the latter has the following line of code:
bo->bo_bsize = pp->sectorsize;
Where pp is geom provider for the device in question.

I am not sure if the mentioned above vfs_bio.c change broke
something
else in reading from device vnodes or if it fixed something
for that. I
am also not sure what would be the consequences of setting
bo_bsize to
512 for vnodes of "disk" devices. It would most
probably fix current
code in UDF, but I am not sure if it might break anything
else.

Just wanted to let the more knowledgeable people know and
make a decision.

P.S. bo_bsize seems to be referenced only in a handful of
files and in
most of them it seems that it is related to "file"
vnodes (as opposed to
"device" vnodes).

-- 
Andriy Gapon
_______________________________________________
freebsd-fsfreebsd.org mailing list

http://lists.freebsd.org/mailman/listinfo/freebsd-fs
To unsubscribe, send any mail to
"freebsd-fs-unsubscribefreebsd.org"

Re: fs/udf: vm pages "overlap" while reading large dir
country flaguser name
Ukraine
2008-02-11 16:23:04
on 06/02/2008 18:29 Andriy Gapon said the following:
> Small summary of the above long description.
> For directory reading fs/udf performs bread() on a
(underlying) device
> vnode. It passes block number as if block size was 512
bytes (i.e.
> byte_offset_within_dev/512). On the other hand vm page
index calculation
> code uses the following formula:
(blkno*bo_bsize)/PAGE_SIZE.
> For CD/DVD devices bo_bsize is set to 2048. This
results in page index
> overlap when reading sufficiently many blocks from
adjacent starting blocks.
> That is, it seems that in "normal" cases page
index gets calculated as
> byte_offset/PAGE_SIZE, but in this case page index gets
to be
> 4*byte_offset/PAGE_SIZE. More details are in the quoted
above text.
> 
> With a lot of help from Bruce Evance I found this
commit which seems to
> have made a big difference:
> http://www.freebsd.org/cg
i/cvsweb.cgi/src/sys/kern/vfs_bio.c.diff?r1=1.458;r2=1.459;f
=h
> 
> Before this change page index for device vnodes was
always
> (blkno*512)/PAGE_SIZE. This is because of the
vn_isdisk() case.
> 
> udf_mountfs device vnode is passed to g_vfs_open() (in
geom_vfs.c) and
> the latter has the following line of code:
> bo->bo_bsize = pp->sectorsize;
> Where pp is geom provider for the device in question.
> 
> I am not sure if the mentioned above vfs_bio.c change
broke something
> else in reading from device vnodes or if it fixed
something for that. I
> am also not sure what would be the consequences of
setting bo_bsize to
> 512 for vnodes of "disk" devices. It would
most probably fix current
> code in UDF, but I am not sure if it might break
anything else.
> 
> Just wanted to let the more knowledgeable people know
and make a decision.
> 
> P.S. bo_bsize seems to be referenced only in a handful
of files and in
> most of them it seems that it is related to
"file" vnodes (as opposed to
> "device" vnodes).
> 

Paul,

do you have any comment or opinion on the above?
[sorry if clamped too much of the previous context, but it's
all in
archives]

And a little continuation:
Just for the sake of experiment I tried to emulated the
previous
behavior by changing geom_vfs.c, function g_vfs_open(), so
that the
relevant code reads as follows:
        if (vn_isdisk(vp, NULL))
                bo->bo_bsize = DEV_BSIZE;
        else
                bo->bo_bsize = pp->sectorsize;

It didn't break anything for me and it re-enabled current
(CVS) fs/udf
code to work as before. (patch from kern/78987 still has to
be applied
for large directories to be handled).
So udf (on DVD) is fixed, ufs (on HDD), msdosfs (on HDD) and
cd9660 (on
CD) are not broken. Of course, the change should have
affected only
filesystems on CD/DVD (where sector/block size is not 512
bytes).
Of course, unlike Bruce I don't use msdosfs on CD/DVD media
(e.g.
DVD-RAM), so my test is somewhat incomplete.

-- 
Andriy Gapon
_______________________________________________
freebsd-fsfreebsd.org mailing list

http://lists.freebsd.org/mailman/listinfo/freebsd-fs
To unsubscribe, send any mail to
"freebsd-fs-unsubscribefreebsd.org"

Re: fs/udf: vm pages "overlap" while reading large dir
country flaguser name
Ukraine
2008-02-11 16:38:12
on 06/02/2008 18:29 Andriy Gapon said the following:
> Small summary of the above long description.
> For directory reading fs/udf performs bread() on a
(underlying) device
> vnode. It passes block number as if block size was 512
bytes (i.e.
> byte_offset_within_dev/512). On the other hand vm page
index calculation
> code uses the following formula:
(blkno*bo_bsize)/PAGE_SIZE.
> For CD/DVD devices bo_bsize is set to 2048. This
results in page index
> overlap when reading sufficiently many blocks from
adjacent starting blocks.
> That is, it seems that in "normal" cases page
index gets calculated as
> byte_offset/PAGE_SIZE, but in this case page index gets
to be
> 4*byte_offset/PAGE_SIZE. More details are in the quoted
above text.
> 
> With a lot of help from Bruce Evance I found this
commit which seems to
> have made a big difference:
> http://www.freebsd.org/cg
i/cvsweb.cgi/src/sys/kern/vfs_bio.c.diff?r1=1.458;r2=1.459;f
=h
> 
> Before this change page index for device vnodes was
always
> (blkno*512)/PAGE_SIZE. This is because of the
vn_isdisk() case.
> 
> udf_mountfs device vnode is passed to g_vfs_open() (in
geom_vfs.c) and
> the latter has the following line of code:
> bo->bo_bsize = pp->sectorsize;
> Where pp is geom provider for the device in question.
> 
> I am not sure if the mentioned above vfs_bio.c change
broke something
> else in reading from device vnodes or if it fixed
something for that. I
> am also not sure what would be the consequences of
setting bo_bsize to
> 512 for vnodes of "disk" devices. It would
most probably fix current
> code in UDF, but I am not sure if it might break
anything else.
> 
> Just wanted to let the more knowledgeable people know
and make a decision.
> 
> P.S. bo_bsize seems to be referenced only in a handful
of files and in
> most of them it seems that it is related to
"file" vnodes (as opposed to
> "device" vnodes).
> 

Poul-Henning,

I am sorry for this duplicate post and very sorry for
misspelling your
name in the first one.

do you have any comment or opinion on the above?
[sorry if clamped too much of the previous context, but it's
all in
archives]

And a little continuation:
Just for the sake of experiment I tried to emulated the
previous
behavior by changing geom_vfs.c, function g_vfs_open(), so
that the
relevant code reads as follows:
        if (vn_isdisk(vp, NULL))
                bo->bo_bsize = DEV_BSIZE;
        else
                bo->bo_bsize = pp->sectorsize;

It didn't break anything for me and it re-enabled current
(CVS) fs/udf
code to work as before. (patch from kern/78987 still has to
be applied
for large directories to be handled).
So udf (on DVD) is fixed, ufs (on HDD), msdosfs (on HDD) and
cd9660 (on
CD) are not broken. Of course, the change should have
affected only
filesystems on CD/DVD (where sector/block size is not 512
bytes).
Of course, unlike Bruce I don't use msdosfs on CD/DVD media
(e.g.
DVD-RAM), so my test is somewhat incomplete.

-- 
Andriy Gapon
_______________________________________________
freebsd-fsfreebsd.org mailing list

http://lists.freebsd.org/mailman/listinfo/freebsd-fs
To unsubscribe, send any mail to
"freebsd-fs-unsubscribefreebsd.org"

[1-3]

about | contact  Other archives ( Real Estate discussion Medical topics )