|
List Info
Thread: Fwd: lucene indexing and merge process
|
|
| Fwd: lucene indexing and merge process |
  United States |
2007-10-18 09:38:38 |
Forwarding this to java-dev per request. Seems like the best place
to discuss this topic.
Erik
Begin forwarded message:
> From: "John Wang" <john.wang gmail.com>
> Date: October 17, 2007 5:43:29 PM EDT
> To: erik ehatchersolutions.com
> Subject: lucene indexing and merge process
>
> Hi Erik:
>
> We are revamping our search system here at LinkedIn, and we are
> using Lucene.
>
> One issue we ran across is that we store a UID in Lucene which we
> map to the DB storage. So given a docid, to look up its UID, we
> have the following solutions:
>
> 1) Index it as a stored field and get it from reader.document (very
> slow if recall is large)
> 2) Load/warm up the FieldCache (for a large corpus, loading up the
> IndexReader can be slow)
> 3) Construct it using the FieldCache and persist it on disk every
> time the index changes (not suitable for real-time indexing, e.g.
> this process will degrade as the number of documents gets large)
>
> None of the above solutions turned out to be adequate for our
> requirements.
>
> What we ended up doing is modifying the Lucene code: we changed the
> SegmentReader, DocumentWriter, and FieldWriter classes to take
> advantage of the Lucene segment/merge process. E.g.:
>
> For each segment, we store a .udt file, which is an int[] array
> (by changing the FieldWriter class).
>
> And SegmentReader will load the .udt file into an array.
>
> And merge happens seamlessly.
>
> Because of the tight encapsulation around these classes, e.g.
> private and final methods, it is very difficult to extend Lucene
> while avoiding branching into our own version. Is there a way we
> can open up these classes and make them extensible? We'd be happy
> to contribute what we have done.
>
> I guess to tackle the problem from a different angle: is there a
> way to incorporate FieldCache into the segments (it is strictly in
> memory now), and build disk versions while indexing?
>
>
> Hope I am making sense.
>
> I did not send this out to the mailing list because I wasn't sure
> if this is a dev question or a user question. Feel free to either
> forward it to the right mailing list or let me know and I can
> forward it.
>
>
> Thanks
>
> -John
>
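The per-segment scheme John describes can be pictured with a small standalone sketch. This is not the LinkedIn patch and it does not use any Lucene classes: the class name UidSegmentSketch, the file layout (a count followed by the UIDs) and the use of a BitSet for deleted documents are all assumptions made for illustration. It only shows why a docid-indexed int[] fits the merge process: merging concatenates segments in order and drops deleted documents, so the UID arrays can be combined the same way.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.BitSet;

// Hypothetical sketch, not Lucene code: one UID file per segment,
// where position i in the array holds the UID of docid i.
final class UidSegmentSketch {

    // Assumed file layout: an int count, followed by that many int UIDs.
    static void write(Path file, int[] uids) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            out.writeInt(uids.length);
            for (int uid : uids) {
                out.writeInt(uid);
            }
        }
    }

    // Load the whole file back into a docid-indexed array.
    static int[] load(Path file) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(Files.newInputStream(file)))) {
            int[] uids = new int[in.readInt()];
            for (int i = 0; i < uids.length; i++) {
                uids[i] = in.readInt();
            }
            return uids;
        }
    }

    // Merge segments in order, skipping deleted docs, so the merged
    // array is indexed by the renumbered docids of the new segment.
    static int[] merge(int[][] segments, BitSet[] deleted) {
        int total = 0;
        for (int[] seg : segments) {
            total += seg.length;
        }
        int[] merged = new int[total];
        int next = 0;
        for (int s = 0; s < segments.length; s++) {
            for (int doc = 0; doc < segments[s].length; doc++) {
                if (!deleted[s].get(doc)) {
                    merged[next++] = segments[s][doc];
                }
            }
        }
        return Arrays.copyOf(merged, next);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("segment", ".udt");
        write(file, new int[] {101, 102, 103});
        BitSet deletedInFirst = new BitSet();
        deletedInFirst.set(1); // docid 1 of the first segment is deleted
        int[] merged = merge(
                new int[][] {load(file), {201, 202}},
                new BitSet[] {deletedInFirst, new BitSet()});
        System.out.println(Arrays.toString(merged));
        Files.delete(file);
    }
}
```

Loading such a file is a single sequential read per segment, which is presumably what makes it cheaper than warming the FieldCache from the term dictionary.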
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe lucene.apache.org
For additional commands, e-mail: java-dev-help lucene.apache.org
|
|
| Re: lucene indexing and merge process |

|
2007-10-18 10:27:46 |
Make all documents have a term, say "ID:UID", and for each document,
store its UID in the term's payload. You can read off this posting
list to create your array. Will this work for you, John?
Cheers,
Ning
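A rough sketch of the bookkeeping behind Ning's suggestion, without any Lucene dependency: the posting list is modeled as a plain list of (docid, payload) pairs rather than read through Lucene's payload API, and the names UidPayloadSketch, Posting and buildArray, the 4-byte big-endian encoding, and the -1 marker for documents without a posting are all assumptions chosen for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not Lucene code: every document carries one
// posting for the shared term, and the payload holds that doc's UID.
final class UidPayloadSketch {

    // Stand-in for one entry of the shared term's posting list.
    static final class Posting {
        final int doc;
        final byte[] payload;

        Posting(int doc, byte[] payload) {
            this.doc = doc;
            this.payload = payload;
        }
    }

    // Encode a UID as a 4-byte big-endian payload (assumed encoding).
    static byte[] encode(int uid) {
        return new byte[] {
            (byte) (uid >>> 24), (byte) (uid >>> 16),
            (byte) (uid >>> 8), (byte) uid
        };
    }

    // Decode the 4-byte payload back into the UID.
    static int decode(byte[] p) {
        return ((p[0] & 0xFF) << 24) | ((p[1] & 0xFF) << 16)
                | ((p[2] & 0xFF) << 8) | (p[3] & 0xFF);
    }

    // Walk the posting list once and fill a docid-indexed UID array.
    // Docids with no posting keep -1 (an assumed "missing" marker).
    static int[] buildArray(Iterable<Posting> postings, int maxDoc) {
        int[] uids = new int[maxDoc];
        Arrays.fill(uids, -1);
        for (Posting p : postings) {
            uids[p.doc] = decode(p.payload);
        }
        return uids;
    }

    public static void main(String[] args) {
        List<Posting> postings = Arrays.asList(
                new Posting(0, encode(42)),
                new Posting(2, encode(7)));
        System.out.println(Arrays.toString(buildArray(postings, 3)));
    }
}
```

Because every document shares the same term, building the array is one pass over a single posting list, and the payloads travel with the postings when segments merge.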
|
|
| Re: Fwd: lucene indexing and merge
process |
  United States |
2007-10-18 11:27:44 |
Erik Hatcher wrote:
>> 2) Load/Warmup the FieldCache (for large corpus, loading up the
>> indexreader can be slow)
With the new IndexReader#reopen(), the cost of opening a new IndexReader
is much reduced. However, loading a FieldCache is not that much faster,
so that may or may not be enough to make this approach viable.
Doug
|
|
| Re: lucene indexing and merge process |
  United States |
2007-10-20 06:57:56 |
John,
For case 1, can you describe your document structure? Do you have a
lot of other fields besides the UID field? Most importantly, do you
have some large fields?
Did you give the FieldSelector mechanism a try?
In fact, I think you may even be able to create a caching
FieldSelector implementation. We could add a FieldSelectorResult,
something like LOAD_AND_CACHE, that then caches the info for that
(Doc, Field) combination. Would have to investigate further, but it
seems like it might work.
Just thinking out loud...
-Grant
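To make the caching idea concrete, here is a minimal standalone sketch of a cache keyed by the (Doc, Field) combination. It is not a FieldSelector implementation, and LOAD_AND_CACHE does not exist in Lucene; the class CachingFieldLoaderSketch, the FieldLoader interface (a stand-in for whatever actually reads the stored field) and the LRU eviction policy are assumptions made only to illustrate the shape of the proposal.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not Lucene code: cache stored-field values
// per (docid, field name) in front of a slower loader.
final class CachingFieldLoaderSketch {

    // Stand-in for the slow path, e.g. reading a stored field from disk.
    interface FieldLoader {
        String load(int doc, String field);
    }

    private final FieldLoader delegate;
    private final Map<String, String> cache;

    CachingFieldLoaderSketch(FieldLoader delegate, final int maxEntries) {
        this.delegate = delegate;
        // An access-ordered LinkedHashMap gives simple LRU eviction.
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > maxEntries;
            }
        };
    }

    // Return the cached value, loading and caching it on a miss.
    String get(int doc, String field) {
        String key = doc + "\u0000" + field;
        String value = cache.get(key);
        if (value == null) {
            value = delegate.load(doc, field);
            if (value != null) {
                cache.put(key, value);
            }
        }
        return value;
    }

    public static void main(String[] args) {
        CachingFieldLoaderSketch loader = new CachingFieldLoaderSketch(
                new FieldLoader() {
                    public String load(int doc, String field) {
                        System.out.println("loading " + field + " for doc " + doc);
                        return field + "#" + doc;
                    }
                }, 1000);
        loader.get(5, "uid"); // goes to the delegate
        loader.get(5, "uid"); // served from the cache
    }
}
```

Since docids are only stable for a given reader, a cache like this would need to be discarded whenever the underlying reader is replaced.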
------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/
|
|
|
|