Thread: Re: lucene indexing and merge process

Re: lucene indexing and merge process
John Wang
2007-10-18 11:57:19
Hi Ning:
    That is essentially what the field cache does. Doing this for each
docid in the result set will be slow if the result set is large. But
loading it into memory when opening the index can also be slow if the
index is large and is updated often.

Thanks

-John

On 10/18/07, Ning Li <ning.li.ligmail.com> wrote:
>
> Make all documents have a term, say "ID:UID", and for each document,
> store its UID in the term's payload. You can read off this posting
> list to create your array. Will this work for you, John?
>
> Cheers,
> Ning
>
>
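A minimal sketch of the payload idea Ning describes, written in plain Java rather than against Lucene classes. It assumes each document has exactly one posting for the term ID:UID, that the payload is the UID as a 4-byte big-endian int, and that UIDs are non-negative so -1 can mark a missing entry. `Posting`, `encodeUid`, `decodeUid`, and `buildUidArray` are hypothetical names for illustration, not Lucene API.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.List;

// Sketch only: Posting stands in for one entry of the "ID:UID" posting list.
final class UidPayloadSketch {

    static final class Posting {
        final int docId;
        final byte[] payload; // assumed layout: 4-byte big-endian UID

        Posting(int docId, byte[] payload) {
            this.docId = docId;
            this.payload = payload;
        }
    }

    // Encode a UID as the payload stored with the document's ID:UID posting.
    static byte[] encodeUid(int uid) {
        return ByteBuffer.allocate(4).putInt(uid).array();
    }

    // Decode a payload back into the UID.
    static int decodeUid(byte[] payload) {
        return ByteBuffer.wrap(payload).getInt();
    }

    // Walk the posting list once and fill uids[docId].
    // Docids without a posting keep the sentinel -1.
    static int[] buildUidArray(List<Posting> postings, int maxDoc) {
        int[] uids = new int[maxDoc];
        Arrays.fill(uids, -1);
        for (Posting p : postings) {
            uids[p.docId] = decodeUid(p.payload);
        }
        return uids;
    }
}
```

The array is still built by touching every document once, so this changes where the UIDs are read from, not how much has to be read when a reader is opened.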
> On 10/18/07, Erik Hatcher <erikehatchersolutions.com> wrote:
> > Forwarding this to java-dev per request.  Seems like the best place
> > to discuss this topic.
> >
> >         Erik
> >
> >
> > Begin forwarded message:
> >
> > > From: "John Wang" <john.wanggmail.com>
> > > Date: October 17, 2007 5:43:29 PM EDT
> > > To: erikehatchersolutions.com
> > > Subject: lucene indexing and merge process
> > >
> > > Hi Erik:
> > >
> > > >     We are revamping our search system here at LinkedIn, and we are
> > > > using Lucene.
> > >
> > > >     One issue we ran across is that we store a UID in Lucene, which
> > > > we map to the DB storage. So given a docid, to look up its UID, we
> > > > have the following solutions:
> > >
> > > > 1) Index it as a stored field and get it from reader.document (very
> > > > slow if recall is large)
> > > > 2) Load/warm up the FieldCache (for a large corpus, loading up the
> > > > IndexReader can be slow)
> > > > 3) Construct it using the FieldCache and persist it on disk every
> > > > time the index changes (not suitable for real-time indexing, e.g.
> > > > this process will degrade as the number of documents gets large)
> > >
> > > >     None of the above solutions turned out to be adequate for our
> > > > requirements.
> > >
> > > >      What we ended up doing was modifying the Lucene code, changing
> > > > the SegmentReader, DocumentWriter, and FieldWriter classes to take
> > > > advantage of the Lucene segment/merge process. E.g.:
> > >
> > > >      For each segment, we store a .udt file, which is an int[]
> > > > array (by changing the FieldWriter class).
> > >
> > > >      And SegmentReader will load the .udt file into an array.
> > >
> > > >      And merge happens seamlessly.
> > >
> > > >      Because of the tight encapsulation around these classes, e.g.
> > > > private and final methods, it is very difficult to extend Lucene
> > > > without branching into our own version. Is there a way we can
> > > > open up these classes and make them extensible? We'd be happy to
> > > > contribute what we have done.
> > >
> > > >      I guess to tackle the problem from a different angle: is there
> > > > a way to incorporate FieldCache into the segments (it is strictly
> > > > in memory now), and build disk versions while indexing?
> > >
> > >
> > >      Hope I am making sense.
> > >
> > > >     I did not send this out to the mailing list because I wasn't
> > > > sure if this is a dev question or a user question. Feel free to
> > > > either forward it to the right mailing list or let me know and I
> > > > can forward it.
> > >
> > >
> > > Thanks
> > >
> > > -John
> > >
> >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-dev-help@lucene.apache.org
> >
> >
>
>
>
>
Re: lucene indexing and merge process
Ning Li
2007-10-18 13:07:18
I see what you mean by 2) now. What Mark said should work for you when
it's done.

Cheers,
Ning

