I see what you mean by 2) now. What Mark said should work for you when
it's done.
Cheers,
Ning
On 10/18/07, John Wang <john.wang gmail.com> wrote:
> Hi Ning:
> That is essentially what field cache does. Doing this for each docid in
> the result set will be slow if the result set is large. But loading it in
> memory when opening index can also be slow if the index is large and
> updates often.
>
> Thanks
>
> -John
>
> On 10/18/07, Ning Li <ning.li.li gmail.com> wrote:
> >
> > Make all documents have a term, say "ID:UID", and for each document,
> > store its UID in the term's payload. You can read off this posting
> > list to create your array. Will this work for you, John?
> >
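The payload idea quoted above can be sketched in plain Java without
Lucene itself. This is only a minimal sketch under assumptions: the
Posting class and the encodeUid/buildUidArray helpers are hypothetical
stand-ins for walking the real posting list of the "ID:UID" term, and the
4-byte big-endian payload encoding is an illustrative choice, not
something the thread specifies.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class UidPayloadSketch {
    /** Hypothetical stand-in for one posting of the term "ID:UID". */
    static final class Posting {
        final int docId;
        final byte[] payload; // UID encoded as 4 big-endian bytes (assumed encoding)

        Posting(int docId, byte[] payload) {
            this.docId = docId;
            this.payload = payload;
        }
    }

    /** Index time: encode the document's UID as the term's payload. */
    static byte[] encodeUid(int uid) {
        return ByteBuffer.allocate(4).putInt(uid).array();
    }

    /** Open time: one sequential pass over the posting list fills docid -> UID. */
    static int[] buildUidArray(List<Posting> postings, int maxDoc) {
        int[] uids = new int[maxDoc];
        Arrays.fill(uids, -1); // -1 marks docids that had no posting
        for (Posting p : postings) {
            uids[p.docId] = ByteBuffer.wrap(p.payload).getInt();
        }
        return uids;
    }

    public static void main(String[] args) {
        List<Posting> postings = new ArrayList<>();
        postings.add(new Posting(0, encodeUid(1001)));
        postings.add(new Posting(1, encodeUid(1007)));
        postings.add(new Posting(2, encodeUid(1042)));
        int[] uids = buildUidArray(postings, 3);
        System.out.println(uids[1]); // prints 1007
    }
}
```

The array is still built in memory when the index is opened, but from a
single sequential read of one posting list rather than a per-document
stored-field lookup.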
> > Cheers,
> > Ning
> >
> >
> > On 10/18/07, Erik Hatcher <erik ehatchersolutions.com> wrote:
> > > Forwarding this to java-dev per request. Seems like the best place
> > > to discuss this topic.
> > >
> > > Erik
> > >
> > >
> > > Begin forwarded message:
> > >
> > > > From: "John Wang" <john.wang gmail.com>
> > > > Date: October 17, 2007 5:43:29 PM EDT
> > > > To: erik ehatchersolutions.com
> > > > Subject: lucene indexing and merge process
> > > >
> > > > Hi Erik:
> > > >
> > > > We are revamping our search system here at LinkedIn. And we are
> > > > using Lucene.
> > > >
> > > > One issue we ran across is that we store a UID in Lucene which
> > > > we map to the DB storage. So given a docid, to look up its UID, we
> > > > have the following solutions:
> > > >
> > > > 1) Index it as a Stored field and get it from reader.document (very
> > > > slow if recall is large)
> > > > 2) Load/Warmup the FieldCache (for large corpus, loading up the
> > > > indexreader can be slow)
> > > > 3) construct it using the FieldCache and persist it on disk
> > > > every time the index changes. (not suitable for real time indexing,
> > > > e.g. this process will degrade as # of documents get large)
> > > >
> > > > None of the above solutions turn out to be adequate for our
> > > > requirements.
> > > >
> > > > What we ended up doing is modifying the Lucene code, changing the
> > > > SegmentReader, DocumentWriter, and FieldWriter classes to take
> > > > advantage of the Lucene segment/merge process. E.g.:
> > > >
> > > > For each segment, we store a .udt file, which is an int[]
> > > > array (by changing the FieldWriter class).
> > > >
> > > > And SegmentReader will load the .udt file into an array.
> > > >
> > > > And merge happens seamlessly.
> > > >
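The per-segment .udt scheme quoted above can be sketched roughly as
follows. This is a minimal sketch, not the code John describes: the file
layout (an int count followed by the UIDs), the SegmentUidSketch class and
its writeUdt/readUdt/merge methods are all assumptions made for
illustration, and the BitSet of deleted docids only mimics how a segment
merge drops deleted documents and renumbers the survivors.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.BitSet;

class SegmentUidSketch {
    /** Write one segment's docid -> UID array (assumed layout: count, then ints). */
    static void writeUdt(Path file, int[] uids) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            out.writeInt(uids.length);
            for (int uid : uids) {
                out.writeInt(uid);
            }
        }
    }

    /** Load a segment's array back when the segment is opened. */
    static int[] readUdt(Path file) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(Files.newInputStream(file)))) {
            int[] uids = new int[in.readInt()];
            for (int i = 0; i < uids.length; i++) {
                uids[i] = in.readInt();
            }
            return uids;
        }
    }

    /** Merge: concatenate in segment order, skipping deleted docids. */
    static int[] merge(int[][] segments, BitSet[] deleted) {
        int total = 0;
        for (int[] seg : segments) {
            total += seg.length;
        }
        int[] merged = new int[total];
        int next = 0; // new docid in the merged segment
        for (int s = 0; s < segments.length; s++) {
            for (int doc = 0; doc < segments[s].length; doc++) {
                if (!deleted[s].get(doc)) {
                    merged[next++] = segments[s][doc];
                }
            }
        }
        return Arrays.copyOf(merged, next);
    }

    public static void main(String[] args) throws IOException {
        Path a = Files.createTempFile("seg_a", ".udt");
        Path b = Files.createTempFile("seg_b", ".udt");
        writeUdt(a, new int[] {10, 11, 12});
        writeUdt(b, new int[] {20, 21});

        BitSet deletedA = new BitSet();
        deletedA.set(1); // docid 1 of the first segment was deleted
        int[] merged = merge(
            new int[][] {readUdt(a), readUdt(b)},
            new BitSet[] {deletedA, new BitSet()});
        System.out.println(Arrays.toString(merged)); // prints [10, 12, 20, 21]
    }
}
```

Because each array belongs to one segment, only newly written or merged
segments need their file written, which is what makes this cheaper than
re-persisting a whole-index cache on every change.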
> > > > Because of the tight encapsulation around these classes, e.g.
> > > > private and final methods, it is very difficult to extend Lucene
> > > > without branching into our own version. Is there a way we can
> > > > open up and make these classes extensible? We'd be happy to
> > > > contribute what we have done.
> > > >
> > > > I guess to tackle the problem from a different angle: is there
> > > > a way to incorporate FieldCache into the segments (it is strictly
> > > > in memory now), and build disk versions while indexing?
> > > >
> > > >
> > > > Hope I am making sense.
> > > >
> > > > I did not send this out to the mailing list because I wasn't
> > > > sure if this is a dev question or a user question, feel free to
> > > > either forward it to the right mailing list or let me know and I
> > > > can forward it.
> > > >
> > > >
> > > > Thanks
> > > >
> > > > -John
> > > >
> > >
> > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-dev-unsubscribe lucene.apache.org
> > > For additional commands, e-mail: java-dev-help lucene.apache.org
> > >
> > >
> >
> >