|
List Info
Thread: getting back to mmap
|
|
| getting back to mmap |

|
2008-04-29 14:57:09 |
On Tue, Apr 29, 2008 at 11:51 AM, Marvin Humphrey
<marvin rectangular.com> wrote:
> On Apr 28, 2008, at 6:37 PM, Nathan Kurz wrote:
> > We've been off the mmap theme for a while, but I
think it's still very
> > relevant to KinoSearch.
> >
>
> Thing is, we never resolved the problems of how to
balance data
> compression, file sizes over 4 gb, mmap, portability,
and Posting's
> architecture. There is Lucene's MMapDirectory as a
point of reference. But
> all that's better discused on-list.
To my recollection, portability to Windows is the only real
problem
with this approach. On Linux (and I presume OS X) mmap
already
underlies the existing file system. All one is doing is
stripping out
some unnecessary copies and duplication of effort between
KinoSearch
and the system. The goal would be to have C structures
with
elements pointing directly to the system buffers, and to let
the
system handle all the paging and buffering issues.
Compression would
be dealt with one Posting at a time (efficient to do it at
the last
minute to keep things in processor cache) and memory limits
aren't a
problem to my knowledge (if the file system allows we can
handle it
too).
It's quite possible that this can be done on Windows as
well, using
some other technique I'm not familiar with. But despite
the
similarity in name, I don't think that MMapDirectory is
particularly
relevant, as I don't think Java allows direct pointers to
system
buffers. I think they are merely using mmap to open the
files, and
then doing all the copies in userland, which doesn't buy
much.
One wouldn't be to change over to using all mmap'ed IO,
rather just
design the file structures and internal API's so it is
possible to use
them in a more efficient fashion. From what I can tell,
you are
already moving in this direction, making the lower levels
(scorers)
less aware of the upper ones (posting lists). Then at some
point,
someone (me once I have time?) would write a Linux specific
mmap
version that we can test for performance. If it works, we
can
integrate it better with the core. Then ideally, someone
with more
knowledge of Windows internals can do something parallel.
> You may be interested in an ongoing dialog between
Mike McCandless and
> myself on java-dev lucene.apache.org about PostingList and
the postings file
> format. There's some stuff in there about phrase
scorers, too. In addition
> to many other contributions to Lucene such as the
lockless-commits file
> format innovation, Mike's applied a bunch of concepts
from KS.
> http://www.nabble.c
om/Pooling-of-posting-objects-in-DocumentsWriter-tt16565743.
html#a16596031
Thanks! I only read through it quickly, but there are a lot
of good
ideas there (most of which flew over my head).
Given my comments above about working with the file system,
a few of
the parts about bulk reads, buffered writes, and disk seeks
made me
cringe a little. As the architecture article mentioned,
'flushing'
to 'disk' when your 'memory' is full might not work as well
as one
hopes. I'd guess that one could do something simpler but
just as
efficient by just calling 'write' each time, and letting the
system
decide when to commit the least recently used page to the
physical
disk. And I'd bet that (at least on Linux) one could do
something
considerably more efficient by using a file backed mmap and
letting
the system handle the details. Unfortunately, I won't be
able to back
that up with code any time soon. :(
Nathan Kurz
nate verse.com
_______________________________________________
KinoSearch mailing list
KinoSearch rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
|
|
| Re: getting back to mmap |
  United Kingdom |
2008-04-29 16:28:38 |
On Tue, Apr 29, 2008 at 01:57:09PM -0600, Nathan Kurz said:
> All one is doing is stripping out some unnecessary
copies and
> duplication of effort between KinoSearch and the
system. The goal
> would be to have C structures with elements pointing
directly to the
> system buffers, and to let the system handle all the
paging and
> buffering issues.
As a datum to this you might be interested to read
http://varnish.projects.linpro.no/wiki/ArchitectNotes
which has some interesting notes on how Varnish, a web
proxy, is built
with regards to techniques like this.
_______________________________________________
KinoSearch mailing list
KinoSearch rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
|
|
| Re: getting back to mmap |
  United States |
2008-05-02 00:42:35 |
On Apr 29, 2008, at 2:28 PM, Simon Wistow wrote:
> As a datum to this you might be interested to read
>
> http://varnish.projects.linpro.no/wiki/ArchitectNotes
>
> which has some interesting notes on how Varnish, a web
proxy, is built
> with regards to techniques like this.
That article was actually what spurred Nathan to contact me
off-list.
Marvin Humphrey
Rectangular Research
http://www.rectangular.co
m/
_______________________________________________
KinoSearch mailing list
KinoSearch rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
|
|
| Re: getting back to mmap |
  United States |
2008-05-02 00:43:08 |
On Apr 29, 2008, at 12:57 PM, Nathan Kurz wrote:
>>> We've been off the mmap theme for a while, but
I think it's still
>>> very
>>> relevant to KinoSearch.
I agree that the potential is there. Some restrictions
intentionally
built into KS should make things easier:
1) All files used at search-time are guaranteed to be
read-only.
2) OutStreams cannot seek; files are always written
linearly from
top to tail.
> To my recollection, portability to Windows is the only
real problem
> with this approach.
I've been doing a little research and I think we'll be able
to emulate
what we need to. Even if we can't, the current
implementation will
continue to work just fine.
> On Linux (and I presume OS X) mmap already
> underlies the existing file system. All one is doing
is stripping out
> some unnecessary copies and duplication of effort
between KinoSearch
> and the system.
If you didn't already know, you may be amused to hear that
KS still
uses standard C buffered IO (FILE*, fread, fwrite) because
it's the
lowest common denominator for portability -- so there are
presumably
multiple copies going on. However,
at least
KinoSearch::Store::FSFileDes uses setvbuf() to turn off
buffering for
output. Strangely, I wasn't able to turn off input
buffering using
setvbuf() because doing so had a significant negative impact
on the
indexing performance benchmark across multiple operating
systems.
FWIW, a version of FSFileDes which used unbuffered IO
(open(), read(),
and write()) only purchased an improvement of a couple
percent.
Probably we'll finish an mmap version first and that patch
will never
be applied.
I think the real gains are likely to come from tweaking how
we do
caching at search time.
> The goal would be to have C structures with
> elements pointing directly to the system buffers, and
to let the
> system handle all the paging and buffering issues.
I suspect we will still want to encapsulate IO functions
within
InStream and OutStream... We'll still have code that looks
like this:
u32_t doc_num = InStream_Read_C32(instream);
However, InStream will "refill" its
"buffer" using mmap/munmap rather
than by reading 1k worth of data at a shot from a file via
fread() or
read(). In other words, it will be InStream's buffer that
is the C
structure element pointing directly to the system buffer as
you
describe above.
> One wouldn't be to change over to using all mmap'ed IO,
rather just
> design the file structures and internal API's so it is
possible to use
> them in a more efficient fashion.
If I recall correctly, you were talking about having higher
level
objects such as Posting hold pointers to the system buffers.
The
thing is, the on-disk files use a lot of compression, so you
can't
read them as e.g. arrays of u32_t.
But what if we tried? Say you were implementing the
simplest kind of
posting list, which is just a list of document numbers. You
could
theoretically write uncompressed 32-bit document numbers to
disk, mmap
the file to a "doc_nums" array member within
"MMapPosting" and then
implement it along these lines:
MMapPosting_get_doc_num(MMapPosting *self)
{
return self->doc_nums[self->tick];
}
However, if you did that, it would increase the file size
and i/o
requirements by 2-4x. I don't think that's likely to yield
a net gain.
>> You may be interested in an ongoing dialog between
Mike McCandless
>> and
>> myself on java-dev lucene.apache.org about
PostingList and the
>> postings file
>> format. There's some stuff in there about phrase
scorers, too. In
>> addition
>> to many other contributions to Lucene such as the
lockless-commits
>> file
>> format innovation, Mike's applied a bunch of
concepts from KS.
>> http://www.nabble.c
om/Pooling-of-posting-objects-in-DocumentsWriter-tt16565743.
html#a16596031
>
> Thanks! I only read through it quickly, but there are
a lot of good
> ideas there (most of which flew over my head).
>
> Given my comments above about working with the file
system, a few of
> the parts about bulk reads, buffered writes, and disk
seeks made me
> cringe a little.
Well, did you at least notice that we were designing the
file format
with SSDs in mind when you were scoring our discussion for
buzzword
compliance? ;)
> As the architecture article mentioned, 'flushing'
> to 'disk' when your 'memory' is full might not work as
well as one
> hopes.
Flushing to disk was discussed in the context of the
external sorter,
used during indexing. For the external sorter,
"flush" doesn't mean
just "write buffer to disk", it means "sort
objects in cache, then
write their data to disk." Since it's not a simple
write op, we don't
have the option of handing things off to the kernel.
Also, it would be unusual for an indexing application to
behave like
squid. Indexers tend to be either active and dominating the
computer,
or not running. Swapping out of pages because the indexing
app has
been idle for a spell, well... that's just not a common
enough use
case that we need to worry about optimizing for it, IMO.
> I'd guess that one could do something simpler but just
as
> efficient by just calling 'write' each time, and
letting the system
> decide when to commit the least recently used page to
the physical
> disk.
Mike and I weren't discussing OutStream, but it's true that
the each
OutStream object maintains its own 1k buffer. It would be
nice to
ditch that, because there are clearly some useless copy ops
-- but I
anticipate achieving only minor, incremental gains for that
trouble.
What would be really great is if we could optimize term
dictionaries
and sort caches for mmap.
> Unfortunately, I won't be able to back
> that up with code any time soon. :(
Well, that just means both communiques and code will emerge
from me
more slowly.
Nevertheless, I've been pleased with how our design
discussions have
gone. It was your insight that just solved the multi-field
query
parsing problem which has been vexing me for a long time,
Compiler
finally has a half-decent interface, and I feel pretty
confident that
ANDQuery, ORQuery, etc will code up well. Let's keep things
going.
I just wish we could get you, Dave Balmain, Mike McCandless
and myself
all together hashing out a file format in the same forum at
the same
time.
Marvin Humphrey
Rectangular Research
http://www.rectangular.co
m/
_______________________________________________
KinoSearch mailing list
KinoSearch rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
|
|
| Re: getting back to mmap |

|
2008-05-05 17:48:32 |
On Thu, May 1, 2008 at 11:43 PM, Marvin Humphrey
<marvin rectangular.com> wrote:
> I suspect we will still want to encapsulate IO
functions within InStream
> and OutStream... We'll still have code that looks like
this:
>
> u32_t doc_num = InStream_Read_C32(instream);
While I think this makes good sense for the current format,
I view
this as an implementation detail that should be internal to
the
default Index format. For an uncompressed format using
mmap (which I
agree is useful only in certain limited circumstances), I'd
really be
able to knock the whole thing down to "posting +=
posting->length"
(with some appropriate bounds checks).
For compressed formats, I still would want to be able to use
format
specific decompressions. For example, I've been tempted to
use
SQLite as a data store, and access its BTrees directly.
While I guess
one could do this with a specially subclassed InStream, I'd
rather
have the external interface be at some higher level.
Note that I'm not looking to index text content stored in
SQLite,
rather to use SQLite (or at least its BTree implementation)
to manage
the binary PostingLists used by KinoSearch. I'm also
interested in
using my own custom file formats, as well as trying a
balanced file
system like Reiser. Thus I would like to clearly defined
interface
between the Scorer and the Index that doesn't make
assumptions about
how the Index will fulfill the 'Posting_next' request.
> Well, did you at least notice that we were designing
the file format with
> SSDs in mind when you were scoring our discussion for
buzzword compliance?
> ;)
I did notice, and this was one of the sections that made me
uneasy.
I'd guess that Kinosearch is not currently being limited by
disk seek
times, but by sustained transfer rates --- or at least I'd
hope so.
My goal would be to have it limited by memory bandwidth,
which is many
times faster than an SSD accessed as a disk.
Am I missing something here that would give SSD's a real
advantage?
Or reasons that they benefit from being treated differently?
My
instinct is that they'd be of more benefit for truly random
access or
for cases where reads and writes are more balanced. I'm not
familiar
with them in practice.
> > Unfortunately, I won't be able to back
> > that up with code any time soon. :(
> >
> Well, that just means both communiques and code will
emerge from me more
> slowly.
I understand, and am appreciative that the responses come
at all.
> I just wish we could get you, Dave Balmain, Mike
McCandless and myself all
> together hashing out a file format in the same forum at
the same time.
If I'm useful at all in my currently addled state, I'd be
happy to try.
Nathan Kurz
nate verse.com
_______________________________________________
KinoSearch mailing list
KinoSearch rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
|
|
[1-5]
|
|