Thread: Re: lucene indexing and merge process




Re: lucene indexing and merge process
user name
2007-10-19 06:14:44
It seems like there are (at least) two angles here for getting better
performance from FieldCache:

  1) Be incremental: with reopen() we should only have to update a
     subset of the array in the FieldCache, according to the changed
     segments.  This is what Hoss is working on and Mark was referring
     to and I think it's very important!

  2) Parsing is slow (?): I'm guessing one of the reasons that John
     added the _X.udt file was because it's much faster to load an
     array of already-parsed ints than to ask FieldCache to populate
     itself.

Even if we do #1, I think #2 could be a big win (in addition)?  John
do you have any numbers of how much faster it is to load the array of
ints from the _X.udt file vs having FieldCache populate itself?
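
For what it's worth, a rough way to get a first feel for point 2 outside of Lucene might be the sketch below. It is only an assumption-laden stand-in: the class PreparsedIntsSketch, its methods, and the on-disk layout (a length prefix followed by raw 4-byte ints) are hypothetical and are not the actual .udt format, and parseInts() only imitates the string-to-int parsing step, not the term enumeration and I/O that FieldCache also performs. The timing in main() is crude (no warmup), so treat any numbers as indicative at best.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical illustration; not Lucene code and not the real .udt layout.
final class PreparsedIntsSketch {

    // Write the array as a length prefix followed by raw 4-byte ints.
    static void writeInts(Path file, int[] values) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            out.writeInt(values.length);
            for (int v : values) {
                out.writeInt(v);
            }
        }
    }

    // Load the array back: no string decoding and no Integer.parseInt.
    static int[] readInts(Path file) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(Files.newInputStream(file)))) {
            int[] values = new int[in.readInt()];
            for (int i = 0; i < values.length; i++) {
                values[i] = in.readInt();
            }
            return values;
        }
    }

    // Stand-in for the parsing step: one term text per document is parsed to an int.
    static int[] parseInts(String[] termTexts) {
        int[] values = new int[termTexts.length];
        for (int i = 0; i < termTexts.length; i++) {
            values[i] = Integer.parseInt(termTexts[i]);
        }
        return values;
    }

    public static void main(String[] args) throws IOException {
        int n = 1_000_000;
        int[] uids = new int[n];
        String[] texts = new String[n];
        for (int i = 0; i < n; i++) {
            uids[i] = i * 7;
            texts[i] = Integer.toString(uids[i]);
        }

        Path file = Files.createTempFile("sketch", ".udt");
        writeInts(file, uids);

        long t0 = System.nanoTime();
        int[] loaded = readInts(file);
        long t1 = System.nanoTime();
        int[] parsed = parseInts(texts);
        long t2 = System.nanoTime();

        System.out.println("load ms:  " + (t1 - t0) / 1_000_000);
        System.out.println("parse ms: " + (t2 - t1) / 1_000_000);
        System.out.println("same contents: " + Arrays.equals(loaded, parsed));
        Files.delete(file);
    }
}
```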

Also on the original question of "can we open up SegmentReader,
FieldsWriter, etc.", I think that's a good idea?  At least we can make
things protected instead of private/final?

Mike

"Ning Li" <ning.li.ligmail.com> wrote:
> I see what you mean by 2) now. What Mark said should work for you when
> it's done.
> 
> Cheers,
> Ning
> 
> On 10/18/07, John Wang <john.wanggmail.com> wrote:
> > Hi Ning:
> >     That is essentially what field cache does. Doing this for each docid in
> > the result set will be slow if the result set is large. But loading it in
> > memory when opening the index can also be slow if the index is large and is
> > updated often.
> >
> > Thanks
> >
> > -John
> >
> > On 10/18/07, Ning Li <ning.li.ligmail.com> wrote:
> > >
> > > Make all documents have a term, say "ID:UID", and for each document,
> > > store its UID in the term's payload. You can read off this posting
> > > list to create your array. Will this work for you, John?
> > >
> > > Cheers,
> > > Ning
> > >
> > >
> > > On 10/18/07, Erik Hatcher <erikehatchersolutions.com> wrote:
> > > > Forwarding this to java-dev per request.  Seems like the best place
> > > > to discuss this topic.
> > > >
> > > >         Erik
> > > >
> > > >
> > > > Begin forwarded message:
> > > >
> > > > > From: "John Wang" <john.wanggmail.com>
> > > > > Date: October 17, 2007 5:43:29 PM EDT
> > > > > To: erikehatchersolutions.com
> > > > > Subject: lucene indexing and merge process
> > > > >
> > > > > Hi Erik:
> > > > >
> > > > >     We are revamping our search system here at LinkedIn. And we are
> > > > > using Lucene.
> > > > >
> > > > >     One issue we ran across is that we store a UID in Lucene which
> > > > > we map to the DB storage. So given a docid, to look up its UID, we
> > > > > have the following solutions:
> > > > >
> > > > > 1) Index it as a Stored field and get it from reader.document (very
> > > > > slow if recall is large)
> > > > > 2) Load/Warmup the FieldCache (for large corpus, loading up the
> > > > > indexreader can be slow)
> > > > > 3) construct it using the FieldCache and persist it on disk
> > > > > every time the index changes. (not suitable for real time indexing,
> > > > > e.g. this process will degrade as # of documents gets large)
> > > > >
> > > > >     None of the above solutions turn out to be adequate for our
> > > > > requirements.
> > > > >
> > > > >      What we end up doing is to modify Lucene code by changing
> > > > > SegmentReader, DocumentWriter, and FieldsWriter classes by taking
> > > > > advantage of the Lucene Segment/merge process. E.g:
> > > > >
> > > > >      For each segment, we store a .udt file, which is an int[]
> > > > > array, (by changing the FieldsWriter class)
> > > > >
> > > > >      And SegmentReader will load the .udt file into an array.
> > > > >
> > > > >      And merge happens seamlessly.
> > > > >
> > > > >      Because of the tight encapsulation around these classes, e.g.
> > > > > private and final methods, it is very difficult to extend Lucene
> > > > > while avoiding branching into our own version. Is there a way we can
> > > > > open up and make these classes extensible? We'd be happy to
> > > > > contribute what we have done.
> > > > >
> > > > >      I guess to tackle the problem from a different angle: is there
> > > > > a way to incorporate FieldCache into the segments (it is strictly
> > > > > in memory now), and build disk versions while indexing?
> > > > >
> > > > >
> > > > >      Hope I am making sense.
> > > > >
> > > > >     I did not send this out to the mailing list because I wasn't
> > > > > sure if this is a dev question or a user question, feel free to
> > > > > either forward it to the right mailing list or let me know and I
> > > > > can forward it.
> > > > >
> > > > >
> > > > > Thanks
> > > > >
> > > > > -John
> > > > >
> > > >
> > > >
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: java-dev-unsubscribelucene.apache.org
> > > > For additional commands, e-mail: java-dev-helplucene.apache.org
> > > >
> > > >
> 
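
Ning's payload suggestion quoted above (one shared term such as "ID:UID", with each document's UID stored in that term's payload) could be pictured roughly as follows. This is a plain-Java sketch under stated assumptions, not Lucene code: UidPayloadSketch, Posting, and buildUidArray are hypothetical names, the 4-byte big-endian encoding is just one possible choice, and the list of postings stands in for whatever the real posting-list iteration would return.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins; a real index would supply the postings and payload bytes.
final class UidPayloadSketch {

    // Encode a UID as a 4-byte big-endian payload (an assumed encoding).
    static byte[] encode(int uid) {
        return ByteBuffer.allocate(4).putInt(uid).array();
    }

    // Decode the payload bytes back into the UID.
    static int decode(byte[] payload) {
        return ByteBuffer.wrap(payload).getInt();
    }

    // One entry in the posting list of the shared "ID:UID" term.
    static final class Posting {
        final int docId;
        final byte[] payload;

        Posting(int docId, byte[] payload) {
            this.docId = docId;
            this.payload = payload;
        }
    }

    // Walk the posting list once and fill the docid -> UID array.
    static int[] buildUidArray(Iterable<Posting> postings, int maxDoc) {
        int[] uids = new int[maxDoc];
        for (Posting p : postings) {
            uids[p.docId] = decode(p.payload);
        }
        return uids;
    }

    public static void main(String[] args) {
        List<Posting> postings = Arrays.asList(
                new Posting(0, encode(1001)),
                new Posting(1, encode(1002)),
                new Posting(2, encode(2048)));
        System.out.println(Arrays.toString(buildUidArray(postings, 3)));
    }
}
```

As John points out in his reply, this still walks every document when the array is built, so it moves the cost rather than removing it.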



Re: lucene indexing and merge process
user name
2007-10-19 10:13:34
Hi Mike:

     This is an excellent analysis.

     To do 2), we tried computing the field cache at indexing time to avoid
"parsing" at search time. But what we found was that this degrades
indexing (because it computes the entire FieldCache, not per segment),
which was not acceptable for our project either.
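
To make the per-segment idea concrete, here is a minimal sketch of what "in segments" could mean, assuming one int[] per segment and a lookup that maps a global docid to (segment, local offset). Everything in it is hypothetical: PerSegmentUidCache, SegmentLoader, and the segment names are illustrative stand-ins, not Lucene classes and not our actual patch. The point is only that a reopen needs to load arrays for segments it has not seen before, and that a merged segment simply shows up as one new name.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; segment names and the loader are stand-ins, not Lucene APIs.
final class PerSegmentUidCache {

    // Loads the UID array for one segment, e.g. from a per-segment file.
    interface SegmentLoader {
        int[] load(String segmentName);
    }

    private final Map<String, int[]> bySegment = new HashMap<>();
    private int[][] arrays = new int[0][];
    private int[] docBase = new int[0];
    int loads = 0; // number of segment arrays actually loaded so far

    // Rebuild the view for the current segment list, reusing arrays already held.
    // Assumes every segment holds at least one document.
    void reopen(List<String> segments, SegmentLoader loader) {
        Map<String, int[]> next = new HashMap<>();
        arrays = new int[segments.size()][];
        docBase = new int[segments.size()];
        int base = 0;
        for (int i = 0; i < segments.size(); i++) {
            String name = segments.get(i);
            int[] a = bySegment.get(name);
            if (a == null) { // only new or merged segments hit the loader
                a = loader.load(name);
                loads++;
            }
            next.put(name, a);
            arrays[i] = a;
            docBase[i] = base;
            base += a.length;
        }
        bySegment.clear(); // drop arrays of segments that no longer exist
        bySegment.putAll(next);
    }

    // Map a global docid to its UID via the segment's starting docid.
    int uid(int docId) {
        int i = Arrays.binarySearch(docBase, docId);
        if (i < 0) {
            i = -i - 2; // insertion point minus one: the segment containing docId
        }
        return arrays[i][docId - docBase[i]];
    }

    public static void main(String[] args) {
        Map<String, int[]> disk = new HashMap<>();
        disk.put("_0", new int[] {10, 11});
        disk.put("_1", new int[] {20});
        disk.put("_2", new int[] {30, 31, 32});

        PerSegmentUidCache cache = new PerSegmentUidCache();
        cache.reopen(Arrays.asList("_0", "_1"), disk::get);
        cache.reopen(Arrays.asList("_0", "_1", "_2"), disk::get);
        System.out.println("segment loads: " + cache.loads);
        System.out.println("uid of doc 4: " + cache.uid(4));
    }
}
```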

     I can try to get some numbers for loading an int[] array vs.
FieldCache.getInts().

Thanks

-John

On 10/19/07, Michael McCandless <lucenemikemccandless.com> wrote:
>
> It seems like there are (at least) two angles here for getting better
> performance from FieldCache:
>
>   1) Be incremental: with reopen() we should only have to update a
>      subset of the array in the FieldCache, according to the changed
>      segments.  This is what Hoss is working on and Mark was referring
>      to and I think it's very important!
>
>   2) Parsing is slow (?): I'm guessing one of the reasons that John
>      added the _X.udt file was because it's much faster to load an
>      array of already-parsed ints than to ask FieldCache to populate
>      itself.
>
> Even if we do #1, I think #2 could be a big win (in addition)?  John
> do you have any numbers of how much faster it is to load the array of
> ints from the _X.udt file vs having FieldCache populate itself?
>
> Also on the original question of "can we open up SegmentReader,
> FieldsWriter, etc.", I think that's a good idea?  At least we can make
> things protected instead of private/final?
>
> Mike