Thread: Re: Per-document Payloads




Re: Per-document Payloads
2007-10-21 14:37:09
John Wang wrote:

> 
> Since all three methods load docids into an int[], the lookup time is the
> same for all three methods; what differs are the load times:
> 
> 1) 16.5 seconds, 43 MB
> 2) 590 milliseconds, 32.5 MB
> 3) 186 milliseconds, 26 MB

Good analysis! Thanks for sharing the results...

> 
> I think the payload method is good enough so we don't need to diverge from
> the Lucene code base.

Actually, I noticed that in my program in getCachedIDs() you can remove
the check
  if (!reader.isDeleted(tp.doc())) {

This should improve the performance further (not sure how much though),
because the synchronized isDeleted() call is quite expensive and not
necessary.

If you want to reduce the index size, you might want to try to encode
the Integers more efficiently, e.g. as VInts (depending on the values
of your UIDs).
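
To make the VInt suggestion concrete, here is a minimal, self-contained sketch of variable-length int encoding: 7 data bits per byte, low-order group first, with the high bit set when more bytes follow. It uses no Lucene classes; the class name VIntSketch and the sample UIDs are made up for illustration, and this is not the code from either poster's program.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper for illustration only; not part of Lucene.
final class VIntSketch {

    // Encodes each value with 7 data bits per byte, low-order bits first.
    // The high bit of a byte is set when another byte follows.
    static byte[] encode(int[] values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int v : values) {
            while ((v & ~0x7F) != 0) {
                out.write((v & 0x7F) | 0x80);
                v >>>= 7;
            }
            out.write(v);
        }
        return out.toByteArray();
    }

    // Decodes exactly `count` values written by encode().
    static int[] decode(byte[] bytes, int count) {
        int[] values = new int[count];
        int pos = 0;
        for (int i = 0; i < count; i++) {
            byte b = bytes[pos++];
            int v = b & 0x7F;
            for (int shift = 7; (b & 0x80) != 0; shift += 7) {
                b = bytes[pos++];
                v |= (b & 0x7F) << shift;
            }
            values[i] = v;
        }
        return values;
    }

    public static void main(String[] args) {
        int[] uids = {5, 127, 128, 16384, 2000000}; // sample values, assumed
        byte[] packed = encode(uids);
        System.out.println(packed.length + " bytes as VInts vs "
                + 4 * uids.length + " bytes as fixed-width ints");
    }
}
```

Small values take one byte and larger ones take more, so the saving depends on how large the UIDs are; values at or above 2^28 (and negative values) need five bytes, which is worse than a fixed four.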

> However, I feel that being able to customize the indexing process and
> store our own file is still more efficient both in load time and index
> size.
> 

Yes, the current payload implementation is not optimized for this use
case; it can be improved with a per-doc approach like the one I suggested.

-Michael


> Thanks
> 
> -John
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Per-document Payloads
2007-10-22 21:35:02
Hi Michael:
    After removing the isDeleted() check, the index loads in 430 ms.

Thanks

-john

On 10/21/07, Michael Busch <buschmicgmail.com> wrote:
>
> John Wang wrote:
>
> > Since all three methods load docids into an int[], the lookup time is the
> > same for all three methods; what differs are the load times:
> >
> > 1) 16.5 seconds, 43 MB
> > 2) 590 milliseconds, 32.5 MB
> > 3) 186 milliseconds, 26 MB
>
> Good analysis! Thanks for sharing the results...
>
> > I think the payload method is good enough so we don't need to diverge from
> > the Lucene code base.
>
> Actually, I noticed that in my program in getCachedIDs() you can remove
> the check
>   if (!reader.isDeleted(tp.doc())) {
>
> This should improve the performance further (not sure how much though),
> because the synchronized isDeleted() call is quite expensive and not
> necessary.
>
> If you want to reduce the index size, you might want to try to encode
> the Integers more efficiently, e.g. as VInts (depending on the values
> of your UIDs).
>
> > However, I feel that being able to customize the indexing process and
> > store our own file is still more efficient both in load time and index
> > size.
>
> Yes, the current payload implementation is not optimized for this use
> case; it can be improved with a per-doc approach like the one I suggested.
>
> -Michael
>
> > Thanks
> >
> > -John