Thread: serializing safely
| serializing safely |
2007-06-13 10:50:32 |
I'm going to be using KS in a persistent environment (a fastcgi). The
Searcher/IndexReader docs recommend caching the searcher for better
performance.

Because my fastcgi (as so many others) has multiple children, my first
instinct is to store the searcher in the session so that any of the children
can get at it as needed. I know that a bunch of the KS guts are XS or C,
though; what things can be safely Storabled and put into a database, and
what things will either blow up or silently not work?
hdp.
_______________________________________________
KinoSearch mailing list
KinoSearch rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
| Re: serializing safely |
2007-06-13 13:57:21 |
On Wed, Jun 13, 2007 at 11:50:32AM -0400, Hans Dieter Pearcey wrote:
> Because my fastcgi (as so many others) has multiple children, my first
> instinct is to store the searcher in the session so that any of the
> children can get at it as needed. I know that a bunch of the KS guts are
> XS or C, though; what things can be safely Storabled and put into a
> database, and what things will either blow up or silently not work?

I answered at least part of my own question, namely that Searcher can't be
stored. How do people usually handle this sort of thing? My first thought is
to write something kind of like SearchServer and do simple RPC to it from my
application.
hdp.
| Re: serializing safely |
2007-06-13 23:48:49 |
On Jun 13, 2007, at 11:57 AM, Hans Dieter Pearcey wrote:
> On Wed, Jun 13, 2007 at 11:50:32AM -0400, Hans Dieter Pearcey wrote:
>> Because my fastcgi (as so many others) has multiple children, my
>> first instinct is to store the searcher in the session so that any
>> of the children can get at it as needed. I know that a bunch of the
>> KS guts are XS or C, though; what things can be safely Storabled and
>> put into a database, and what things will either blow up or silently
>> not work?
>
> I answered at least part of my own question, namely that Searcher
> can't be stored. How do people usually handle this sort of thing? My
> first thought is to write something kind of like SearchServer and do
> simple RPC to it from my application.
This is the usual way to cache a Searcher with FastCGI:

    use CGI::Fast;
    use KinoSearch::Searcher;

    # load searcher once, outside loop
    my $searcher = KinoSearch::Searcher->new(
        invindex => Schema->open('/path/to/invindex'),
    );

    while ( my $cgi = CGI::Fast->new ) {
        process_search();
    }

If that doesn't work for you, can you please illustrate how your app
differs?
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
| Re: serializing safely |
2007-06-14 06:50:30 |
On Wed, Jun 13, 2007 at 09:48:49PM -0700, Marvin Humphrey wrote:
>     # load searcher once, outside loop
>     my $searcher = KinoSearch::Searcher->new(
>         invindex => Schema->open('/path/to/invindex'),
>     );
>
>     while ( my $cgi = CGI::Fast->new ) {
>         process_search();
>     }
>
> If that doesn't work for you, can you please illustrate how your
> app differs?

I had been thinking of putting it into an Apache::Session.

Will your suggestion survive a fork usefully? I don't know what's in
Searcher's guts.

My app primarily differs in that I was planning on having many invindexes,
two or three per user, so opening them all at program start would probably
be inefficient (there are several hundred of them).
hdp.
| Re: serializing safely |
2007-06-14 09:09:12 |
On Jun 14, 2007, at 4:50 AM, Hans Dieter Pearcey wrote:
> I had been thinking of putting it into an Apache::Session.

Serializing a Searcher so that the state of the *Searcher* *object* can be
preserved between requests? That wouldn't aid performance, even if Searcher
could be serialized.

What you're describing is analogous to serializing a filehandle -- you can
write code to do it, but you probably don't want to. You can record the
filehandle's file position. In theory you can even serialize the bytes held
in the filehandle's read buffer, though that's a bizarre thing to do, since
the buffer is just an in-memory cache that spares you from having to access
the disk with every read op.

But what will you get when you deserialize that filehandle? Does the file
even exist anymore? Is the data from the old read buffer still valid? Is the
file the same length? Why would you ever do something like serialize and
restore a filehandle, rather than just open the file again?
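
The filehandle analogy can be made concrete with a minimal sketch in core Perl. It is only an illustration and has nothing to do with KinoSearch internals: the helper names `save_position` and `restore_position` and the temp file are hypothetical. The only state worth persisting is a path and a byte offset, and "restoring" amounts to checking that the file is still usable and opening it again.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Record where a reader has got to: the path plus a byte offset.
# (Hypothetical helper; the read buffer itself is deliberately not saved.)
sub save_position {
    my ( $fh, $path ) = @_;
    return { path => $path, offset => tell($fh) };
}

# "Restore" by opening the file again and seeking, after checking that
# the file still exists and is at least as long as the saved offset.
sub restore_position {
    my ($state) = @_;
    return unless -e $state->{path};
    return if ( -s $state->{path} ) < $state->{offset};
    open my $fh, '<', $state->{path} or return;
    seek( $fh, $state->{offset}, 0 ) or return;
    return $fh;
}

# Demonstration with a throwaway file.
my ( $tmp, $path ) = tempfile( UNLINK => 1 );
print {$tmp} "line one\nline two\nline three\n";
close $tmp;

open my $fh, '<', $path or die "open: $!";
my $first = <$fh>;                          # consume the first line
my $state = save_position( $fh, $path );    # plain data, easy to store
close $fh;

my $again = restore_position($state) or die "file changed or vanished";
my $next  = <$again>;                       # continues at the second line
print $next;
```

Everything the sketch saves is a small hash of plain data; the expensive part (the in-memory buffer, or for a Searcher its caches) is rebuilt from disk either way, which is why serializing the object buys nothing.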
I think you may have been misled by the phrase, "caching a Searcher", which
appears in the KS documentation. The point is to cache a Searcher *in* *RAM*,
so that you don't pay the startup costs of reading a bunch of data off disk
and into RAM over and over with each new search.

> Will your suggestion survive a fork usefully?

You won't get memory errors, but you can't use the Searcher in both
processes. Searchers keep several filehandles open. If both parent and child
attempt to read from the shared file descriptors after the fork, they'll
interfere with each other.
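
The interference comes from the fact that a descriptor inherited across fork() shares a single file offset between parent and child. A minimal sketch in core Perl on a Unix-like system, using a throwaway file rather than a real Searcher (the function name `parent_read_after_child` is made up for the illustration):

```perl
use strict;
use warnings;
use Fcntl qw(O_RDONLY);
use File::Temp qw(tempfile);
use POSIX ();

# Returns the 5 bytes the parent reads after a forked child has already
# read 5 bytes through the same inherited descriptor.
sub parent_read_after_child {
    my ( $tmp, $path ) = tempfile( UNLINK => 1 );
    print {$tmp} "AAAAABBBBB";
    close $tmp;

    # Unbuffered descriptor, opened before the fork.
    sysopen( my $fh, $path, O_RDONLY ) or die "sysopen: $!";

    my $pid = fork();
    die "fork: $!" unless defined $pid;
    if ( $pid == 0 ) {
        # Child: reading advances the offset shared with the parent.
        sysread( $fh, my $child_buf, 5 );
        POSIX::_exit(0);    # leave without running END blocks
    }
    waitpid( $pid, 0 );

    # Parent: has not read yet, but no longer starts at offset 0.
    sysread( $fh, my $buf, 5 );
    close $fh;
    return $buf;
}

print parent_read_after_child(), "\n";    # prints "BBBBB"
```

Here the two processes take turns, so the result is predictable. With both reading at once, each would see whatever bytes the other happened to leave behind.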
Because Searchers have a large RAM footprint due to all the caching, yet you
can't use duped Searchers because of IO sync issues, you probably want to
avoid creating them in parent processes.

> My app primarily differs in that I was planning on having many
> invindexes, two or three per user, so opening them all at program start
> would probably be inefficient (there are several hundred of them).

OK. With that architecture, you'll need to factor in the time it takes to
begin reading from any one of those invindexes.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
| Re: serializing safely |
2007-06-14 09:39:01 |
On Thu, Jun 14, 2007 at 07:09:12AM -0700, Marvin Humphrey wrote:
> On Jun 14, 2007, at 4:50 AM, Hans Dieter Pearcey wrote:
>
> > I had been thinking of putting it into an Apache::Session.
>
> Serializing a Searcher so that the state of the *Searcher* *object*
> can be preserved between requests? That wouldn't aid performance,
> even if Searcher could be serialized.
>
> What you're describing is analogous to serializing a filehandle --

That's more or less what I figured.

> > My app primarily differs in that I was planning on having many
> > invindexes, two or three per user, so opening them all at program
> > start would probably be inefficient (there are several hundred of
> > them).
>
> OK. With that architecture, you'll need to factor in the time it
> takes to begin reading from any one of those invindexes.

It may be a stupid architecture; I'm not really very experienced with
invindexes. I want to index about 250G of email, which seems like a lot to
me, so I'm assuming that partitions will be useful (since each user only
searches their own email). Am I prematurely optimizing?
hdp.
| Re: serializing safely |
2007-06-14 19:42:28 |
On 6/14/07, Hans Dieter Pearcey <hdp pobox.com> wrote:
> On Thu, Jun 14, 2007 at 07:09:12AM -0700, Marvin Humphrey wrote:
> > > My app primarily differs in that I was planning on having many
> > > invindexes, two or three per user, so opening them all at program
> > > start would probably be inefficient (there are several hundred of
> > > them).
> >
> > OK. With that architecture, you'll need to factor in the time it
> > takes to begin reading from any one of those invindexes.
>
> It may be a stupid architecture; I'm not really very experienced with
> invindexes. I want to index about 250G of email, which seems like a lot
> to me, so I'm assuming that partitions will be useful (since each user
> only searches their own email). Am I prematurely optimizing?
Hi Hans ---

I've been thinking about some similar architectural issues, and while I
don't have any experience with corpus sizes as large as the one you are
dealing with, I thought I'd jump in.

First, your architecture sounds reasonable to me: if searches are never
going to cross indexes, keeping them separate for each user seems like a
reasonable idea. Yes, the initialization costs of each Searcher object will
be expensive, but I think the smaller size of each index is going to offset
this. Starting with this architecture strikes me as good forethought, and
not premature.

Worrying about caching hot Searcher objects to those indexes does strike me
as premature, or possibly misguided. The thing that takes the most time (I'm
guessing) is reading the index from the disk, thus caching the object to
disk isn't going to help you a lot. To get a real advantage, you are going
to need it hanging around in RAM, and given the size of your corpus this is
going to require finesse.

Presuming you are running Linux, most extra RAM on the system will be used
to cache recently read files so that they can be read from relatively fast
memory rather than waiting for the relatively very slow disk. The more you
cache big objects, the less space is available for the system to cache
files. It's a trade: if you know you are going to reuse the object, it's a
win, but if you don't, you are probably better off letting the system do its
thing. I'd wait and measure.
If disk IO does turn out to be a bottleneck (and it will with heavy enough
usage) the easiest solution may be to partition the search off to separate
machines, each handling only a subset of your users. Rather than thinking
about caching Searcher objects within the FastCGI, you could prepare for
this eventuality by running your search in an external server process,
either on the same machine or another. This process could then cache
Searchers for the indexes of the most recent users and use the appropriate
one for the search.

Alternatively, you could cache a small number of Searcher objects in each
FastCGI process, and then come up with a way of preferentially directing
users to the same process they used on the previous request. Historically,
there have been some affinity patches for mod_fastcgi that did this, but I
don't know if they have been updated. But in general, I don't think there is
going to be any good way for multiple processes or threads to share a single
Searcher object.
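
Keeping "a small number of Searcher objects" per process could be sketched as a least-recently-used cache, roughly as below. The `SearcherCache` package, its `max` and `opener` parameters, and the strings standing in for Searcher objects are all hypothetical. In a real application the `opener` callback would presumably wrap the `KinoSearch::Searcher->new` call shown earlier in the thread, and nothing here is part of KinoSearch itself.

```perl
use strict;
use warnings;

package SearcherCache;

# max    - how many Searchers to keep warm in this process
# opener - code ref that builds a Searcher for a given key (e.g. a user)
sub new {
    my ( $class, %args ) = @_;
    return bless {
        max    => $args{max} || 4,
        opener => $args{opener},
        cache  => {},    # key => searcher
        order  => [],    # keys, least recently used first
        pid    => $$,    # process that filled the cache
    }, $class;
}

sub get {
    my ( $self, $key ) = @_;

    # Conservative assumption: anything inherited across a fork is
    # dropped and reopened in the process that actually uses it.
    if ( $self->{pid} != $$ ) {
        @$self{qw(cache order pid)} = ( {}, [], $$ );
    }

    # Move (or add) the key to the most-recently-used end.
    @{ $self->{order} } = ( ( grep { $_ ne $key } @{ $self->{order} } ), $key );
    $self->{cache}{$key} ||= $self->{opener}->($key);

    # Evict least recently used entries beyond the limit.
    while ( @{ $self->{order} } > $self->{max} ) {
        delete $self->{cache}{ shift @{ $self->{order} } };
    }
    return $self->{cache}{$key};
}

package main;

my $opened = 0;
my $cache  = SearcherCache->new(
    max    => 2,
    opener => sub { my ($user) = @_; $opened++; return "searcher for $user" },
);
$cache->get($_) for qw(alice bob alice carol alice);
print "opened $opened searchers\n";    # prints "opened 3 searchers"
```

Five requests cost three opens because alice stays warm throughout. Whether that saving is worth the RAM is exactly the trade-off described above, so the limit is something to measure rather than guess.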
I'd start by sticking with the separate indexes, skipping the caching, and
seeing how it goes.
Hope this helps,
Nathan Kurz
nate verse.com
| Re: serializing safely |
2007-06-14 20:07:06 |
On Thu, Jun 14, 2007 at 06:42:28PM -0600, Nathan Kurz wrote:
> But in general, I don't think there is going to be any good way for
> multiple processes or threads to share a single Searcher object.

Can Searchers be treated analogously to file handles, i.e. shared between
processes (opened in a parent, shared between children) as long as only one
process uses it at a time, or is there per-process state that will get
screwed up?

This doesn't really help with the question at hand, since I don't plan on
preloading the searchers, but it's an interesting thing to keep in mind.
hdp.
| Re: serializing safely |
2007-06-14 21:32:36 |
On Jun 14, 2007, at 6:07 PM, Hans Dieter Pearcey wrote:
> Can Searchers be treated analogously to file handles, i.e. shared
> between processes (opened in a parent, shared between children) as long
> as only one process uses it at a time, or is there per-process state
> that will get screwed up?

I believe that will work. In general, I wouldn't recommend doing things that
way with large indexes because you'll end up wasting a lot of RAM... but
that may not matter here.

FWIW, the optional read-locking mechanism needed for use with NFS will break
-- since it uses lock files that remember their pids -- but it's off by
default.

> This doesn't really help with the question at hand, since I don't plan
> on preloading the searchers, but it's an interesting thing to keep in
> mind.

I don't think you should rule out pre-loading. KinoSearch is heavily
optimized for the use case of running many queries against a single view of
an index.

The costs for warming a Searcher vary linearly with index size, and get
significantly higher if you perform sorting or range operations. To put
things in perspective, for very large indexes (larger than you're likely to
see for any one individual's email), it can conceivably take several seconds
to warm up a Searcher, then a fraction of a second to process the query.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
| Re: serializing safely |
2007-06-14 21:42:28 |
On Jun 14, 2007, at 5:42 PM, Nathan Kurz wrote:
> First, your architecture sounds reasonable to me: if searches are
> never going to cross indexes, keeping them separate for each user
> seems like a reasonable idea.

I fully agree.

You want to avoid processing hits that you know can't match. Definitely,
break up the indexes if you know you will never have to multiplex search
results across them.

Search costs are dominated by the time that it takes to process the matches
for common terms. If you're looking for 'orpheus', that's probably cheap;
'+black +orpheus' will be more expensive in comparison, assuming that
'black' is a more common term in the corpus. Even though the intersection of
the set that matches 'black' and the set that matches 'orpheus' is small,
you still have to iterate over *all* the matches for both terms.
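
The cost argument can be illustrated with a naive merge of two sorted lists of document numbers. This is a sketch of the general idea only, not KinoSearch's actual scoring code. The `intersect` function and the made-up posting lists are assumptions chosen for the example.

```perl
use strict;
use warnings;

# Intersect two sorted lists of document numbers by walking both in step.
# Returns the matching docs and the number of comparison steps taken.
sub intersect {
    my ( $a_list, $b_list ) = @_;
    my ( $i, $j, $steps ) = ( 0, 0, 0 );
    my @both;
    while ( $i < @$a_list && $j < @$b_list ) {
        $steps++;
        my $cmp = $a_list->[$i] <=> $b_list->[$j];
        if    ( $cmp < 0 ) { $i++ }
        elsif ( $cmp > 0 ) { $j++ }
        else               { push @both, $a_list->[$i]; $i++; $j++ }
    }
    return ( \@both, $steps );
}

my @black   = ( 1 .. 1000 );    # common term: matches 1000 documents
my @orpheus = ( 250, 990 );     # rare term: matches 2 documents

my ( $hits, $steps ) = intersect( \@black, \@orpheus );
print scalar(@$hits), " hits after $steps steps\n";    # 2 hits after 990 steps
```

A search for 'orpheus' alone would touch two postings. Adding the common term yields the same two hits but walks nearly the whole list for 'black', which is where the time goes.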
OTOH, if you knew you had to multiplex results from time to time, searching
several indexes is more expensive, particularly in terms of disk i/o. In a
single index, all the information about any given term will be relatively
concentrated. With multiple indexes, the information is more scattered, so
the disk has to seek a lot more.

> the easiest solution may be to partition the search off to separate
> machines, each handling only a subset of your users. Rather than
> thinking about caching Searcher objects within the FastCGI, you could
> prepare for this eventuality by running your search in an external
> server process, either on the same machine or another. This process
> could then cache Searchers for the indexes of the most recent users and
> use the appropriate one for the search.

This is a good plan.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/