Thread: Re: Per-document Payloads




Re: Per-document Payloads
2007-10-20 14:49:22
Grant Ingersoll wrote:
> 
> Some randomly pieced together thoughts (I may not even be fully awake
> yet, so feel free to tell me I'm not understanding this correctly)
> 
> My first thought was how is this different from just having a binary
> field, but if I understand correctly it is to be stored in a separate file?
> 
> Now you are proposing a faster storage mechanism for them, essentially,
> since they are to be stored separately from the Documents themselves?
> But the other key is they are all stored next to each other, right, so
> the scan is a lot faster?
> 

Yes, scanning and skipping would be much faster, comparable to a posting
list. In fact, what I'm proposing is a new kind of posting list. Since
you mentioned the magic term "flexible indexing" already ;), let's take
a look at http://wiki.apache.org/lucene-java/FlexibleIndexing. Here four
kinds of posting lists are proposed:

a. <doc>+

b. <doc, boost>+

c. <doc, freq, <position>+ >+

d. <doc, freq, <position, boost>+ >+

Today, we have c. and d. already. c. is the original Lucene format, and
d. can be achieved by storing the boost as a payload.

The new format I'm proposing actually covers a. and b. If you don't
store a payload it's basically a binary posting list without freq and
positions (a.). If you store the boost as a payload, then you have b.
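
To make the a./b. distinction concrete, here is a minimal sketch in Python
(not Lucene code). Everything in it is an assumption made for illustration:
the function names, the delta-encoded doc ids, the variable-length integers,
and the per-list with_payloads flag are not taken from the proposal.

```python
from typing import Iterable, List, Optional, Tuple

Posting = Tuple[int, Optional[bytes]]  # (doc id, optional payload)


def write_vint(out: bytearray, value: int) -> None:
    """Append a non-negative int as a variable-length integer (7 bits per byte)."""
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)


def read_vint(data: bytes, pos: int) -> Tuple[int, int]:
    """Read a variable-length integer at pos; return (value, next position)."""
    value, shift = 0, 0
    while True:
        b = data[pos]
        pos += 1
        value |= (b & 0x7F) << shift
        if b < 0x80:
            return value, pos
        shift += 7


def encode_postings(postings: Iterable[Posting], with_payloads: bool) -> bytes:
    """Encode <doc>+ (format a.) or <doc, payload>+ (format b.)."""
    out = bytearray()
    last_doc = 0
    for doc, payload in postings:
        write_vint(out, doc - last_doc)  # store the gap to the previous doc id
        if with_payloads:
            payload = payload or b""
            write_vint(out, len(payload))
            out += payload
        last_doc = doc
    return bytes(out)


def decode_postings(data: bytes, with_payloads: bool) -> List[Posting]:
    """Sequentially scan the list; no freq or position data is present."""
    postings: List[Posting] = []
    pos, doc = 0, 0
    while pos < len(data):
        delta, pos = read_vint(data, pos)
        doc += delta
        payload = None
        if with_payloads:
            length, pos = read_vint(data, pos)
            payload = data[pos:pos + length]
            pos += length
        postings.append((doc, payload))
    return postings
```

With with_payloads=False the list degenerates to bare doc ids (a.); with a
one-byte boost stored as the payload it is the <doc, boost>+ shape (b.).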


> I think one of the questions that will come up from users is when should
> I use addMetadata and when should I use addField?  Why make the
> distinction to the user?  Fields have always represented metadata, all

I'd like to make a distinction because IMO these are two different use
cases. Not necessarily in terms of functionality, but in terms of
performance. You are right, you can store everything today as stored
fields, but if you want to use e.g. a stored value for scoring, then
performance is terrible. This is simply the nature of the store: it is
optimized for returning all stored fields for a document. Even a
FieldSelector doesn't help you too much, unless the docs contain very
big fields that you don't want to return. The reason is that two random
I/Os are necessary to find the stored fields of a document. After that,
only sequential I/O has to be performed. And the overhead of loading e.g.
10KB instead of 2KB is not big, much less than two random I/Os, I believe.
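
A back-of-the-envelope sketch of that argument, in Python. The seek time and
sequential throughput below are assumed, illustrative numbers (not
measurements), and the function names are hypothetical; the model also
ignores caching entirely.

```python
SEEK_MS = 10.0        # assumed cost of one random I/O, in milliseconds
SEQ_MB_PER_S = 50.0   # assumed sequential read throughput, in MB/s


def sequential_ms(num_bytes: int) -> float:
    """Time to read num_bytes sequentially under the assumed throughput."""
    return num_bytes / (SEQ_MB_PER_S * 1024 * 1024) * 1000.0


def stored_fields_ms(doc_bytes: int, random_ios: int = 2) -> float:
    """Cost of loading one document's stored fields: random I/Os, then a read."""
    return random_ios * SEEK_MS + sequential_ms(doc_bytes)


def payload_scan_ms(num_docs: int, bytes_per_doc: int) -> float:
    """Cost of scanning per-document payloads stored next to each other."""
    return SEEK_MS + sequential_ms(num_docs * bytes_per_doc)


# Extra cost of loading 10KB instead of 2KB for a single document.
extra_ms = stored_fields_ms(10 * 1024) - stored_fields_ms(2 * 1024)
```

Under these assumptions extra_ms is a small fraction of a single random I/O,
while reading a value from the stored fields of every hit pays the two random
I/Os once per document.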

Payloads are also much better in terms of cache utilization. Since they
are stored next to each other, it's very likely that big portions of that
posting list will be in the cache if they are accessed frequently (in
every search).

So the answer to the question when to use a stored field and when to use
a payload should be: use payloads when you access the data during query
evaluation/scoring, use stored fields when you need the data to
construct a search result from a hit.

> fields, right?  Perhaps in this way, if users were willing to commit to
> fixed length fields for the first level, we could also make field
> updating of these types of fields possible w/o having to reindex?????
> 

Yes, I was thinking the same. Just like norms.
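
A rough sketch of why a fixed length makes in-place updates possible, in
Python. The class and its methods are hypothetical, made up for illustration;
they do not describe how norms or any payload file are actually implemented.

```python
class FixedWidthDocValues:
    """One fixed-size slot per document, addressed directly by doc id."""

    def __init__(self, num_docs: int, width: int) -> None:
        self.num_docs = num_docs
        self.width = width
        self.data = bytearray(num_docs * width)  # all slots start zeroed

    def _offset(self, doc: int) -> int:
        if not 0 <= doc < self.num_docs:
            raise IndexError("doc id out of range: %d" % doc)
        # The slot position depends only on the doc id and the fixed width.
        return doc * self.width

    def get(self, doc: int) -> bytes:
        start = self._offset(doc)
        return bytes(self.data[start:start + self.width])

    def set(self, doc: int, value: bytes) -> None:
        if len(value) != self.width:
            raise ValueError("value must be exactly %d bytes" % self.width)
        start = self._offset(doc)
        # Overwrite in place; no other document's slot moves.
        self.data[start:start + self.width] = value
```

Because every slot has the same size, updating one document's value never
shifts the data of any other document, so nothing else has to be rewritten.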



---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Per-document Payloads
2007-10-20 15:23:49
On Oct 20, 2007, at 12:49 PM, Michael Busch wrote:

> In fact, what I'm proposing is a new kind of posting list.

http://www.rectangular.com/pipermail/kinosearch/2007-July/001096.html

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/




