|
List Info
Thread: Re: Per-document Payloads
|
|
| Re: Per-document Payloads |
  United States |
2007-10-20 03:09:54 |
John Wang wrote:
> Hi Michael:
> Thanks for the info.
>
> I haven't played with payloads. Can you give me an
example or point me
> to how it is used to solve this problem?
>
Hi John,
I (quickly) put together a class that is able to store UIDs
as payloads.
I believe the type of your UIDs is Integer?
To add the UID to a document use this method:
/**
* Adds a unique docId (UID) to a document as a payload.
*/
public void addUID(Document doc, int uid);
You can either load the UID from the disk or use a cache:
/** Returns the UID for the passed-in docId.
* If you use this method in a Hitcollector and your
Query
* contains OR-terms, then try to set
* BooleanQuery.setAllowDocsOutOfOrder(false) to improve
performance.
*/
public int getUID(IndexReader reader, int docId) throws
IOException;
/** Fills the passed-in array with UID-values. If the
given array is
* null or too small, then a new array is created with
* cache.length = reader.maxDoc()
*/
public int[] getCachedIds(IndexReader reader, int[] cache,
int offset)
throws IOException;
I put a little test program in main and it seems to work
fine. However,
it's not thoroughly tested yet...
You might want to try it without using the cache first. The
performance
might be good enough for your needs. If not, try using the
cache, it
should fill up much faster compared to the FieldCache.
Another comment: If you're using Lucene 2.2, you need to
replace the
lines "tp.seek(UID_TERM);" (see comments in the
code below). Or use the
latest trunk version, it has a fix for this bug.
Let me know please if this improves your performance! Have
fun...
- Michael
And here is the code:
import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import
org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Payload;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermPositions;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
public class PerDocPayloadReaderWriter {
public static final Term UID_TERM = new
Term("_ID", "_UID");
private SinglePayloadTokenStream singlePayloadTokenStream
= new
SinglePayloadTokenStream();
private TermPositions tp;
private byte[] payloadBuffer = new byte[4];
public static void main(String args[]) throws IOException
{
PerDocPayloadReaderWriter pdp = new
PerDocPayloadReaderWriter();
Directory dir = new RAMDirectory();
IndexWriter writer = new IndexWriter(dir, new
StandardAnalyzer());
for (int i = 0; i < 10; i++) {
Document d = new Document();
// create dummy doc
d.add(new Field("test", "This is a
test.", Store.NO,
Index.TOKENIZED));
pdp.addUID(d, 100 + i);
writer.addDocument(d);
}
writer.close();
IndexReader reader = IndexReader.open(dir);
int[] uids = pdp.getCachedIds(reader, null, 0);
System.out.println("Caching:");
System.out.println("ID --> UID");
for (int i = 0; i < uids.length; i++) {
System.out.println(i + " --> " +
uids[i]);
}
System.out.println("nDirect access:");
System.out.println("ID --> UID");
for (int i = 0; i < uids.length; i++) {
System.out.println(i + " --> " +
pdp.getUID(reader, i));
}
reader.close();
}
/** Fills the passed-in array with UID-values. If the
given array is
null or too small, then
* a new array is created with cache.length =
reader.maxDoc()
*/
public int[] getCachedIds(IndexReader reader, int[] cache,
int offset)
throws IOException {
int maxDoc = reader.maxDoc();
if (cache == null || cache.length - offset > maxDoc)
{
cache = new int[maxDoc];
offset = 0;
}
if (tp == null) {
tp = reader.termPositions(UID_TERM);
} else {
// if using Lucene 2.2 replace the following line
with
// tp = reader.termPositions(UID_TERM);
tp.seek(UID_TERM);
}
while (tp.next()) {
assert tp.doc() < maxDoc;
if (!reader.isDeleted(tp.doc())) {
tp.nextPosition();
tp.getPayload(payloadBuffer, 0);
cache[tp.doc() + offset] =
bytesToInt(payloadBuffer);
}
}
return cache;
}
/** Returns the UID for the passed-in docId.
* If you use this method in a Hitcollector and your
Query contains
OR-terms,
* then try to set
BooleanQuery.setAllowDocsOutOfOrder(false) to
improve performance.
*/
public int getUID(IndexReader reader, int docId) throws
IOException {
if (tp == null) {
tp = reader.termPositions(UID_TERM);
} else if (tp.doc()> docId) {
// if using Lucene 2.2 replace the following line
with
// tp = reader.termPositions(UID_TERM);
tp.seek(UID_TERM);
}
tp.skipTo(docId);
tp.nextPosition();
tp.getPayload(payloadBuffer, 0);
return bytesToInt(payloadBuffer);
}
/**
* Adds a unique docId (UID) to a document as a payload.
*/
public void addUID(Document doc, int uid) {
singlePayloadTokenStream.setUID(uid);
doc.add(new Field(UID_TERM.field(),
singlePayloadTokenStream));
}
private int bytesToInt(byte[] bytes) {
return ((bytes[3] & 0xFF) << 24) | ((bytes[2]
& 0xFF) << 16)
| ((bytes[1] & 0xFF) << 8) | (bytes[0] &
0xFF);
}
private static class SinglePayloadTokenStream extends
TokenStream {
private Token token = new Token(UID_TERM.text(), 0, 0);
private byte[] buffer = new byte[4];
private boolean returnToken = false;
void setUID(int uid) {
buffer[0] = (byte) (uid);
buffer[1] = (byte) (uid >> 8);
buffer[2] = (byte) (uid >> 16);
buffer[3] = (byte) (uid >> 24);
token.setPayload(new Payload(buffer));
returnToken = true;
}
public Token next() throws IOException {
if (returnToken) {
returnToken = false;
return token;
} else {
return null;
}
}
}
}
> Thanks
>
> -John
>
> On 10/19/07, Michael Busch <buschmic gmail.com> wrote:
>> John Wang wrote:
>>> I can tried to get some numbers for
leading an int[] array vs
>>> FieldCache.getInts().
>> I've had a similar performance problem when I used
the FieldCache. The
>> loading performance is apparently so slow, because
each value is stored
>> as a term in the dictionary. For loading the cache
it is necessary to
>> iterate over all terms for the field in the
dictionary. And for each
>> term it's posting list is opened to check which
documents have that value.
>>
>> If you store unique docIds, then there are no two
documents that share
>> the same value. That means, that each value gets
its own entry in the
>> dictionary and to load each value it is necessary
to perform two random
>> I/O seeks (one for term lookup + one to open the
posting list).
>>
>> In my app it took for a big index several minutes
to fill the cache like
>> that.
>>
>> To speed things up I did essentially what Ning
suggested. Now I store
>> the values as payloads in the posting list of an
artificial term. To
>> fill my cache it's only necessary to perform a
couple of I/O seeks for
>> opening the posting list of the specific term, then
it is just a
>> sequential scan to load all values. With this
approach the time for
>> filling the cache went down from minutes to
seconds!
>>
>> Now this approach is already much better than the
current field cache
>> implementation, but it still can be improved. In
fact, we already have a
>> mechanism for doing that: the norms. Norms are
stored with a fixed size,
>> which means both random access and sequential scan
are optimal. Norms
>> are also cached in memory, and filling that cache
is much faster
>> compared to the current FieldCache approach.
>>
>> I was therefore thinking about adding per-document
payloads to Lucene
>> (we can also call it document-metadata). The API
could look like this:
>>
>> Document d = new Document();
>> byte[] uidValue = ...
>> d.addMetadata("uid", uidValue);
>>
>> And on the retrieval side all values could either
be loaded into the
>> field cache, or, if the index is too big, a new API
can be used:
>>
>> IndexReader reader = IndexReader.open(...);
>> DocumentMetadataIterator it =
reader.metadataIterator("uid");
>>
>> where DocumentMetadataIterator is an interface
similar to TermDocs:
>>
>> interface DocumentMetadataIterator {
>> void seek(String name);
>> boolean next();
>> boolean skipTo(int doc);
>>
>> int doc();
>> byte[] getMetadata();
>> }
>>
>> The next question would be how to store the per-doc
payloads (PDP). If
>> all values have the same length (as the unique
docIds), then we should
>> store them as efficiently as possible, like the
norms. However, we still
>> want to offer the flexibility of having
variable-length values. For this
>> case we could use a new data structure similar to
our posting list.
>>
>> PDPList --> FixedLengthPDPList |
<VariableLengthPDPList,
>> SkipList>
>> FixedLengthPDPList -->
<Payload>^SegSize
>> VariableLengthPDPList --> <DocDelta,
PayloadLength?, Payload>
>> Payload --> Byte^PayloadLength
>> PayloadLength --> VInt
>> SkipList --> see frq.file
>>
>> Because we don't have global field semantics Lucene
should automatically
>> pick the "right" data structure. This
could work like this: When the
>> DocumentsWriter writes a segment it checks whether
all values of a PDP
>> have the same length. If yes, it stores them as
FixedLengthPDPList, if
>> not, then as VariableLengthPDPList.
>> When the SegmentMerger merges two or more segments
it checks if all
>> segments have a FixedLengthPDPList with the same
length for a PDP. If
>> not, it writes a VariableLengthPDPList to the new
segment.
>>
>> I think this would be a nice new feature for
Lucene. We could then have
>> user-defined and Lucene-specific PDPs. For example,
norms would be in
>> the latter category (this way we would get rid of
the special code for
>> norms, as they could be handled as PDPs). It would
also be easy to add
>> new features in the future, like splitting the
norms into two values: a
>> norm and a boost value.
>>
>> OK lot's of thoughts, I'm sure I'll get lot's of
comments too ... ;)
>>
>> - Michael
>>
>>> Thanks
>>>
>>> -John
>>>
>>> On 10/19/07, Michael McCandless <lucene mikemccandless.com> wrote:
>>>> It seems like there are (at least) two
angles here for getting better
>>>> performance from FieldCache:
>>>>
>>>> 1) Be incremental: with reopen() we
should only have to update a
>>>> subset of the array in the FieldCache,
according to the changed
>>>> segments. This is what Hoss is
working on and Mark was referring
>>>> to and I think it's very important!
>>>>
>>>> 2) Parsing is slow (?): I'm guessing one
of the reasons that John
>>>> added the _X.udt file was because it's
much faster to load an
>>>> array of already-parsed ints than to
ask FieldCache to populate
>>>> itself.
>>>>
>>>> Even if we do #1, I think #2 could be a big
win (in addition)? John
>>>> do you have any numbers of how much faster
it is to load the array of
>>>> ints from the _X.udt file vs having
FieldCache populate itself?
>>>>
>>>> Also on the original question of "can
we open up SegmentReader,
>>>> FieldsWriter, etc.", I think that's a
good idea? At least we can make
>>>> things protected instead of private/final?
>>>>
>>>> Mike
>>>>
>>>> "Ning Li" <ning.li.li gmail.com> wrote:
>>>>> I see what you mean by 2) now. What
Mark said should work for you when
>>>>> it's done.
>>>>>
>>>>> Cheers,
>>>>> Ning
>>>>>
>>>>> On 10/18/07, John Wang
<john.wang gmail.com> wrote:
>>>>>> Hi Ning:
>>>>>> That is essentially what field
cache does. Doing this for each
>>>> docid in
>>>>>> the result set will be slow if the
result set is large. But loading
>> it
>>>> in
>>>>>> memory when opening index can also
be slow if the index is large and
>>>> updates
>>>>>> often.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> -John
>>>>>>
>>>>>> On 10/18/07, Ning Li
<ning.li.li gmail.com> wrote:
>>>>>>> Make all documents have a term,
say "ID:UID", and for each document,
>>>>>>> store its UID in the term's
payload. You can read off this posting
>>>>>>> list to create your array. Will
this work for you, John?
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Ning
>>>>>>>
>>>>>>>
>>>>>>> On 10/18/07, Erik Hatcher
<erik ehatchersolutions.com> wrote:
>>>>>>>> Forwarding this to java-dev
per request. Seems like the best
>>>> place
>>>>>>>> to discuss this topic.
>>>>>>>>
>>>>>>>> Erik
>>>>>>>>
>>>>>>>>
>>>>>>>> Begin forwarded message:
>>>>>>>>
>>>>>>>>> From: "John
Wang" <john.wang gmail.com>
>>>>>>>>> Date: October 17, 2007
5:43:29 PM EDT
>>>>>>>>> To: erik ehatchersolutions.com
>>>>>>>>> Subject: lucene
indexing and merge process
>>>>>>>>>
>>>>>>>>> Hi Erik:
>>>>>>>>>
>>>>>>>>> We are revamping
our search system here at LinekdIn. And we
>>>> are
>>>>>>>>> using Lucene.
>>>>>>>>>
>>>>>>>>> One issue we ran
across is that we store an UID in Lucene
>>>> which
>>>>>>>>> we map to the DB
storage. So given a docid, to lookup its UID,
>>>> we
>>>>>>>>> have the following
solutions:
>>>>>>>>>
>>>>>>>>> 1) Index it as a Stored
field and get it from reader.document(very
>>>>>>>>> slow if recall is
large)
>>>>>>>>> 2) Load/Warmup the
FieldCache (for large corpus, loading up the
>>>>>>>>> indexreader can be
slow)
>>>>>>>>> 3) construct it using
the FieldCache and persist it on disk
>>>>>>>>> everytime the index
changes. (not suitable for real time
>>>> indexing,
>>>>>>>>> e.g. this process will
degrade as # of documents get large)
>>>>>>>>>
>>>>>>>>> None of the above
solutions turn out to be adequate for our
>>>>>>>>> requirements.
>>>>>>>>>
>>>>>>>>> What we end up
doing is to modify Lucene code by changing
>>>>>>>>>
SegmentReader,DocumentWriter,and FieldWriter classes by
taking
>>>>>>>>> advantage of the Lucene
Segment/merge process. E.g:
>>>>>>>>>
>>>>>>>>> For each segment,
we store a .udt file, which is an int[]
>>>>>>>>> array, (by changing the
FieldWriter class)
>>>>>>>>>
>>>>>>>>> And SegmentReader
will load the .udt file into an array.
>>>>>>>>>
>>>>>>>>> And merge happens
seemlessly.
>>>>>>>>>
>>>>>>>>> Because the tight
encapsulation around these classes, e.g.
>>>>>>>>> private and final
methods, it is very difficult to extend Lucene
>>>>>>>>> while avoiding branch
into our own version. Is there a way we
>>>> can
>>>>>>>>> open up and make these
classes extensible? We'd be happy to
>>>>>>>>> contribute what we have
done.
>>>>>>>>>
>>>>>>>>> I guess to tackle
the problem from a different angle: is
>>>> there
>>>>>>>>> a way to incorporate
FieldCache into the segments (it is
>>>> strictly
>>>>>>>>> in memory now), and
build disk versions while indexing.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hope I am making
sense.
>>>>>>>>>
>>>>>>>>> I did not send this
out to the mailing list because I wasn't
>>>>>>>>> sure if this is a dev
question or an user question, feel free to
>>>>>>>>> either forward it to
the right mailing list or let me know and I
>>>>>>>>> can forward it.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>>
>>>>>>>>> -John
>>>>>>>>>
>>>>>>>>
>>>>
------------------------------------------------------------
---------
>>>>>>>> To unsubscribe, e-mail:
java-dev-unsubscribe lucene.apache.org
>>>>>>>> For additional commands,
e-mail: java-dev-help lucene.apache.org
>>>>>>>>
>>>>>>>>
>>>>
------------------------------------------------------------
---------
>>>>>>> To unsubscribe, e-mail:
java-dev-unsubscribe lucene.apache.org
>>>>>>> For additional commands,
e-mail: java-dev-help lucene.apache.org
>>>>>>>
>>>>>>>
>>>>>
------------------------------------------------------------
---------
>>>>> To unsubscribe, e-mail:
java-dev-unsubscribe lucene.apache.org
>>>>> For additional commands, e-mail:
java-dev-help lucene.apache.org
>>>>>
>>>>
------------------------------------------------------------
---------
>>>> To unsubscribe, e-mail:
java-dev-unsubscribe lucene.apache.org
>>>> For additional commands, e-mail:
java-dev-help lucene.apache.org
>>>>
>>>>
>>
>>
------------------------------------------------------------
---------
>> To unsubscribe, e-mail: java-dev-unsubscribe lucene.apache.org
>> For additional commands, e-mail: java-dev-help lucene.apache.org
>>
>>
>
------------------------------------------------------------
---------
To unsubscribe, e-mail: java-dev-unsubscribe lucene.apache.org
For additional commands, e-mail: java-dev-help lucene.apache.org
|
|
| Re: Per-document Payloads |

|
2007-10-21 02:51:42 |
Hi Michael:
I took your program and benchmarked against my setup,
here are some
numbers comparing to the other options:
Setup: 2M docs with only the id, indexed in various ways for
each method
randomly selected 50000 of the docids and do a lookup.
Comparing 3 methods:
1) load int[] from field cache
Field(fldname,uid,Stored.No,Index.Tokenized); lucene
2.2
2) using payload (your impl) lucene 2.2
3) using our custom made lucene, described by my early
email
no fields added, just doc.setUserID(uid); // since we
are adding our own
segment file. (custome lucene 2.0)
Since all three methods loads docids into an int[], the
lookup time is the
same for all three methods, what's
different are the load times:
1) 16.5 seconds, 43 MB
2) 590 milliseconds 32.5 MB
3) 186 milliseconds 26MB
I think the payload method is good enough so we don't need
to diverge from
the lucene code base. However, I feel that being able to
customize the
indexing process and store our own file is still more
efficient both in load
time and index size.
Thanks
-John
On 10/20/07, Michael Busch <buschmic gmail.com> wrote:
>
> John Wang wrote:
> > Hi Michael:
> > Thanks for the info.
> >
> > I haven't played with payloads. Can you give
me an example or point
> me
> > to how it is used to solve this problem?
> >
>
> Hi John,
>
> I (quickly) put together a class that is able to store
UIDs as payloads.
> I believe the type of your UIDs is Integer?
>
> To add the UID to a document use this method:
> /**
> * Adds a unique docId (UID) to a document as a
payload.
> */
> public void addUID(Document doc, int uid);
>
> You can either load the UID from the disk or use a
cache:
>
> /** Returns the UID for the passed-in docId.
> * If you use this method in a Hitcollector and your
Query
> * contains OR-terms, then try to set
> * BooleanQuery.setAllowDocsOutOfOrder(false) to
improve performance.
> */
> public int getUID(IndexReader reader, int docId)
throws IOException;
>
> /** Fills the passed-in array with UID-values. If the
given array is
> * null or too small, then a new array is created
with
> * cache.length = reader.maxDoc()
> */
> public int[] getCachedIds(IndexReader reader, int[]
cache, int offset)
> throws IOException;
>
> I put a little test program in main and it seems to
work fine. However,
> it's not thoroughly tested yet...
>
> You might want to try it without using the cache first.
The performance
> might be good enough for your needs. If not, try using
the cache, it
> should fill up much faster compared to the FieldCache.
>
> Another comment: If you're using Lucene 2.2, you need
to replace the
> lines "tp.seek(UID_TERM);" (see comments in
the code below). Or use the
> latest trunk version, it has a fix for this bug.
>
> Let me know please if this improves your performance!
Have fun...
> - Michael
>
> And here is the code:
>
> import java.io.IOException;
>
> import org.apache.lucene.analysis.Token;
> import org.apache.lucene.analysis.TokenStream;
> import
org.apache.lucene.analysis.standard.StandardAnalyzer;
> import org.apache.lucene.document.Document;
> import org.apache.lucene.document.Field;
> import org.apache.lucene.document.Field.Index;
> import org.apache.lucene.document.Field.Store;
> import org.apache.lucene.index.IndexReader;
> import org.apache.lucene.index.IndexWriter;
> import org.apache.lucene.index.Payload;
> import org.apache.lucene.index.Term;
> import org.apache.lucene.index.TermPositions;
> import org.apache.lucene.store.Directory;
> import org.apache.lucene.store.RAMDirectory;
>
> public class PerDocPayloadReaderWriter {
> public static final Term UID_TERM = new
Term("_ID", "_UID");
> private SinglePayloadTokenStream
singlePayloadTokenStream = new
> SinglePayloadTokenStream();
> private TermPositions tp;
> private byte[] payloadBuffer = new byte[4];
>
> public static void main(String args[]) throws
IOException {
> PerDocPayloadReaderWriter pdp = new
PerDocPayloadReaderWriter();
> Directory dir = new RAMDirectory();
> IndexWriter writer = new IndexWriter(dir, new
StandardAnalyzer());
> for (int i = 0; i < 10; i++) {
> Document d = new Document();
> // create dummy doc
> d.add(new Field("test", "This is a
test.", Store.NO,
> Index.TOKENIZED));
> pdp.addUID(d, 100 + i);
>
> writer.addDocument(d);
> }
> writer.close();
>
> IndexReader reader = IndexReader.open(dir);
> int[] uids = pdp.getCachedIds(reader, null, 0);
>
> System.out.println("Caching:");
> System.out.println("ID --> UID");
>
> for (int i = 0; i < uids.length; i++) {
> System.out.println(i + " --> " +
uids[i]);
> }
>
> System.out.println("nDirect access:");
> System.out.println("ID --> UID");
>
> for (int i = 0; i < uids.length; i++) {
> System.out.println(i + " --> " +
pdp.getUID(reader, i));
> }
> reader.close();
> }
>
>
> /** Fills the passed-in array with UID-values. If the
given array is
> null or too small, then
> * a new array is created with cache.length =
reader.maxDoc()
> */
> public int[] getCachedIds(IndexReader reader, int[]
cache, int offset)
> throws IOException {
> int maxDoc = reader.maxDoc();
> if (cache == null || cache.length - offset >
maxDoc) {
> cache = new int[maxDoc];
> offset = 0;
> }
>
> if (tp == null) {
> tp = reader.termPositions(UID_TERM);
> } else {
> // if using Lucene 2.2 replace the following line
with
> // tp = reader.termPositions(UID_TERM);
> tp.seek(UID_TERM);
> }
>
> while (tp.next()) {
> assert tp.doc() < maxDoc;
> if (!reader.isDeleted(tp.doc())) {
> tp.nextPosition();
> tp.getPayload(payloadBuffer, 0);
> cache[tp.doc() + offset] =
bytesToInt(payloadBuffer);
> }
> }
>
> return cache;
> }
>
> /** Returns the UID for the passed-in docId.
> * If you use this method in a Hitcollector and your
Query contains
> OR-terms,
> * then try to set
BooleanQuery.setAllowDocsOutOfOrder(false) to
> improve performance.
> */
> public int getUID(IndexReader reader, int docId)
throws IOException {
> if (tp == null) {
> tp = reader.termPositions(UID_TERM);
> } else if (tp.doc()> docId) {
> // if using Lucene 2.2 replace the following line
with
> // tp = reader.termPositions(UID_TERM);
> tp.seek(UID_TERM);
> }
>
> tp.skipTo(docId);
> tp.nextPosition();
> tp.getPayload(payloadBuffer, 0);
>
> return bytesToInt(payloadBuffer);
> }
>
> /**
> * Adds a unique docId (UID) to a document as a
payload.
> */
> public void addUID(Document doc, int uid) {
> singlePayloadTokenStream.setUID(uid);
> doc.add(new Field(UID_TERM.field(),
singlePayloadTokenStream));
> }
>
> private int bytesToInt(byte[] bytes) {
> return ((bytes[3] & 0xFF) << 24) |
((bytes[2] & 0xFF) << 16)
> | ((bytes[1] & 0xFF) << 8) | (bytes[0]
& 0xFF);
>
> }
>
> private static class SinglePayloadTokenStream extends
TokenStream {
> private Token token = new Token(UID_TERM.text(), 0,
0);
> private byte[] buffer = new byte[4];
> private boolean returnToken = false;
>
> void setUID(int uid) {
> buffer[0] = (byte) (uid);
> buffer[1] = (byte) (uid >> 8);
> buffer[2] = (byte) (uid >> 16);
> buffer[3] = (byte) (uid >> 24);
> token.setPayload(new Payload(buffer));
> returnToken = true;
> }
>
> public Token next() throws IOException {
> if (returnToken) {
> returnToken = false;
> return token;
> } else {
> return null;
> }
> }
> }
> }
>
>
>
> > Thanks
> >
> > -John
> >
> > On 10/19/07, Michael Busch <buschmic gmail.com> wrote:
> >> John Wang wrote:
> >>> I can tried to get some numbers for
leading an int[] array vs
> >>> FieldCache.getInts().
> >> I've had a similar performance problem when I
used the FieldCache. The
> >> loading performance is apparently so slow,
because each value is stored
> >> as a term in the dictionary. For loading the
cache it is necessary to
> >> iterate over all terms for the field in the
dictionary. And for each
> >> term it's posting list is opened to check
which documents have that
> value.
> >>
> >> If you store unique docIds, then there are no
two documents that share
> >> the same value. That means, that each value
gets its own entry in the
> >> dictionary and to load each value it is
necessary to perform two random
> >> I/O seeks (one for term lookup + one to open
the posting list).
> >>
> >> In my app it took for a big index several
minutes to fill the cache
> like
> >> that.
> >>
> >> To speed things up I did essentially what Ning
suggested. Now I store
> >> the values as payloads in the posting list of
an artificial term. To
> >> fill my cache it's only necessary to perform a
couple of I/O seeks for
> >> opening the posting list of the specific term,
then it is just a
> >> sequential scan to load all values. With this
approach the time for
> >> filling the cache went down from minutes to
seconds!
> >>
> >> Now this approach is already much better than
the current field cache
> >> implementation, but it still can be improved.
In fact, we already have
> a
> >> mechanism for doing that: the norms. Norms are
stored with a fixed
> size,
> >> which means both random access and sequential
scan are optimal. Norms
> >> are also cached in memory, and filling that
cache is much faster
> >> compared to the current FieldCache approach.
> >>
> >> I was therefore thinking about adding
per-document payloads to Lucene
> >> (we can also call it document-metadata). The
API could look like this:
> >>
> >> Document d = new Document();
> >> byte[] uidValue = ...
> >> d.addMetadata("uid", uidValue);
> >>
> >> And on the retrieval side all values could
either be loaded into the
> >> field cache, or, if the index is too big, a
new API can be used:
> >>
> >> IndexReader reader = IndexReader.open(...);
> >> DocumentMetadataIterator it =
reader.metadataIterator("uid");
> >>
> >> where DocumentMetadataIterator is an interface
similar to TermDocs:
> >>
> >> interface DocumentMetadataIterator {
> >> void seek(String name);
> >> boolean next();
> >> boolean skipTo(int doc);
> >>
> >> int doc();
> >> byte[] getMetadata();
> >> }
> >>
> >> The next question would be how to store the
per-doc payloads (PDP). If
> >> all values have the same length (as the unique
docIds), then we should
> >> store them as efficiently as possible, like
the norms. However, we
> still
> >> want to offer the flexibility of having
variable-length values. For
> this
> >> case we could use a new data structure similar
to our posting list.
> >>
> >> PDPList -->
FixedLengthPDPList | <VariableLengthPDPList,
> >> SkipList>
> >> FixedLengthPDPList -->
<Payload>^SegSize
> >> VariableLengthPDPList --> <DocDelta,
PayloadLength?, Payload>
> >> Payload -->
Byte^PayloadLength
> >> PayloadLength --> VInt
> >> SkipList --> see frq.file
> >>
> >> Because we don't have global field semantics
Lucene should
> automatically
> >> pick the "right" data structure.
This could work like this: When the
> >> DocumentsWriter writes a segment it checks
whether all values of a PDP
> >> have the same length. If yes, it stores them
as FixedLengthPDPList, if
> >> not, then as VariableLengthPDPList.
> >> When the SegmentMerger merges two or more
segments it checks if all
> >> segments have a FixedLengthPDPList with the
same length for a PDP. If
> >> not, it writes a VariableLengthPDPList to the
new segment.
> >>
> >> I think this would be a nice new feature for
Lucene. We could then have
> >> user-defined and Lucene-specific PDPs. For
example, norms would be in
> >> the latter category (this way we would get rid
of the special code for
> >> norms, as they could be handled as PDPs). It
would also be easy to add
> >> new features in the future, like splitting the
norms into two values: a
> >> norm and a boost value.
> >>
> >> OK lot's of thoughts, I'm sure I'll get lot's
of comments too ... ;)
> >>
> >> - Michael
> >>
> >>> Thanks
> >>>
> >>> -John
> >>>
> >>> On 10/19/07, Michael McCandless
<lucene mikemccandless.com> wrote:
> >>>> It seems like there are (at least) two
angles here for getting better
> >>>> performance from FieldCache:
> >>>>
> >>>> 1) Be incremental: with reopen() we
should only have to update a
> >>>> subset of the array in the
FieldCache, according to the changed
> >>>> segments. This is what Hoss is
working on and Mark was
> referring
> >>>> to and I think it's very
important!
> >>>>
> >>>> 2) Parsing is slow (?): I'm guessing
one of the reasons that John
> >>>> added the _X.udt file was because
it's much faster to load an
> >>>> array of already-parsed ints than
to ask FieldCache to populate
> >>>> itself.
> >>>>
> >>>> Even if we do #1, I think #2 could be
a big win (in addition)? John
> >>>> do you have any numbers of how much
faster it is to load the array of
> >>>> ints from the _X.udt file vs having
FieldCache populate itself?
> >>>>
> >>>> Also on the original question of
"can we open up SegmentReader,
> >>>> FieldsWriter, etc.", I think
that's a good idea? At least we can
> make
> >>>> things protected instead of
private/final?
> >>>>
> >>>> Mike
> >>>>
> >>>> "Ning Li" <ning.li.li gmail.com> wrote:
> >>>>> I see what you mean by 2) now.
What Mark said should work for you
> when
> >>>>> it's done.
> >>>>>
> >>>>> Cheers,
> >>>>> Ning
> >>>>>
> >>>>> On 10/18/07, John Wang
<john.wang gmail.com> wrote:
> >>>>>> Hi Ning:
> >>>>>> That is essentially what
field cache does. Doing this for each
> >>>> docid in
> >>>>>> the result set will be slow if
the result set is large. But loading
> >> it
> >>>> in
> >>>>>> memory when opening index can
also be slow if the index is large
> and
> >>>> updates
> >>>>>> often.
> >>>>>>
> >>>>>> Thanks
> >>>>>>
> >>>>>> -John
> >>>>>>
> >>>>>> On 10/18/07, Ning Li
<ning.li.li gmail.com> wrote:
> >>>>>>> Make all documents have a
term, say "ID:UID", and for each
> document,
> >>>>>>> store its UID in the
term's payload. You can read off this posting
> >>>>>>> list to create your array.
Will this work for you, John?
> >>>>>>>
> >>>>>>> Cheers,
> >>>>>>> Ning
> >>>>>>>
> >>>>>>>
> >>>>>>> On 10/18/07, Erik Hatcher
<erik ehatchersolutions.com> wrote:
> >>>>>>>> Forwarding this to
java-dev per request. Seems like the best
> >>>> place
> >>>>>>>> to discuss this
topic.
> >>>>>>>>
> >>>>>>>> Erik
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Begin forwarded
message:
> >>>>>>>>
> >>>>>>>>> From: "John
Wang" <john.wang gmail.com>
> >>>>>>>>> Date: October 17,
2007 5:43:29 PM EDT
> >>>>>>>>> To: erik ehatchersolutions.com
> >>>>>>>>> Subject: lucene
indexing and merge process
> >>>>>>>>>
> >>>>>>>>> Hi Erik:
> >>>>>>>>>
> >>>>>>>>> We are
revamping our search system here at LinekdIn. And we
> >>>> are
> >>>>>>>>> using Lucene.
> >>>>>>>>>
> >>>>>>>>> One issue we
ran across is that we store an UID in Lucene
> >>>> which
> >>>>>>>>> we map to the DB
storage. So given a docid, to lookup its UID,
> >>>> we
> >>>>>>>>> have the following
solutions:
> >>>>>>>>>
> >>>>>>>>> 1) Index it as a
Stored field and get it from reader.document
> (very
> >>>>>>>>> slow if recall is
large)
> >>>>>>>>> 2) Load/Warmup the
FieldCache (for large corpus, loading up the
> >>>>>>>>> indexreader can be
slow)
> >>>>>>>>> 3) construct it
using the FieldCache and persist it on disk
> >>>>>>>>> everytime the
index changes. (not suitable for real time
> >>>> indexing,
> >>>>>>>>> e.g. this process
will degrade as # of documents get large)
> >>>>>>>>>
> >>>>>>>>> None of the
above solutions turn out to be adequate for our
> >>>>>>>>> requirements.
> >>>>>>>>>
> >>>>>>>>> What we end
up doing is to modify Lucene code by changing
> >>>>>>>>>
SegmentReader,DocumentWriter,and FieldWriter classes by
taking
> >>>>>>>>> advantage of the
Lucene Segment/merge process. E.g:
> >>>>>>>>>
> >>>>>>>>> For each
segment, we store a .udt file, which is an int[]
> >>>>>>>>> array, (by
changing the FieldWriter class)
> >>>>>>>>>
> >>>>>>>>> And
SegmentReader will load the .udt file into an array.
> >>>>>>>>>
> >>>>>>>>> And merge
happens seemlessly.
> >>>>>>>>>
> >>>>>>>>> Because the
tight encapsulation around these classes, e.g.
> >>>>>>>>> private and final
methods, it is very difficult to extend Lucene
> >>>>>>>>> while avoiding
branch into our own version. Is there a way we
> >>>> can
> >>>>>>>>> open up and make
these classes extensible? We'd be happy to
> >>>>>>>>> contribute what we
have done.
> >>>>>>>>>
> >>>>>>>>> I guess to
tackle the problem from a different angle: is
> >>>> there
> >>>>>>>>> a way to
incorporate FieldCache into the segments (it is
> >>>> strictly
> >>>>>>>>> in memory now),
and build disk versions while indexing.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Hope I am
making sense.
> >>>>>>>>>
> >>>>>>>>> I did not send
this out to the mailing list because I wasn't
> >>>>>>>>> sure if this is a
dev question or an user question, feel free to
> >>>>>>>>> either forward it
to the right mailing list or let me know and I
> >>>>>>>>> can forward it.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Thanks
> >>>>>>>>>
> >>>>>>>>> -John
> >>>>>>>>>
> >>>>>>>>
> >>>>
------------------------------------------------------------
---------
> >>>>>>>> To unsubscribe,
e-mail: java-dev-unsubscribe lucene.apache.org
> >>>>>>>> For additional
commands, e-mail: java-dev-help lucene.apache.org
> >>>>>>>>
> >>>>>>>>
> >>>>
------------------------------------------------------------
---------
> >>>>>>> To unsubscribe, e-mail:
java-dev-unsubscribe lucene.apache.org
> >>>>>>> For additional commands,
e-mail: java-dev-help lucene.apache.org
> >>>>>>>
> >>>>>>>
> >>>>>
>
------------------------------------------------------------
---------
> >>>>> To unsubscribe, e-mail:
java-dev-unsubscribe lucene.apache.org
> >>>>> For additional commands, e-mail:
java-dev-help lucene.apache.org
> >>>>>
> >>>>
------------------------------------------------------------
---------
> >>>> To unsubscribe, e-mail:
java-dev-unsubscribe lucene.apache.org
> >>>> For additional commands, e-mail:
java-dev-help lucene.apache.org
> >>>>
> >>>>
> >>
> >>
------------------------------------------------------------
---------
> >> To unsubscribe, e-mail:
java-dev-unsubscribe lucene.apache.org
> >> For additional commands, e-mail:
java-dev-help lucene.apache.org
> >>
> >>
> >
>
>
>
------------------------------------------------------------
---------
> To unsubscribe, e-mail: java-dev-unsubscribe lucene.apache.org
> For additional commands, e-mail: java-dev-help lucene.apache.org
>
>
|
|
[1-2]
|
|