List Info

Thread: Re: RLE Compressing bit vectors, just toughts




Re: RLE Compressing bit vectors, just toughts
country flaguser name
Germany
2007-08-04 13:26:09
Hi Paul, thanks for feedback

>I suppose you mean run length encoding? (I missed the
first posting about 
this.)

You are right, that is what I meant by RLE. This was the
first posting. I am just trying get some feedback to see if
there are some knock-out conditions disqualifying  this
idea.

a bit of background, 
It came up as a side effect after playing with Filter ,
better said, with your Matcher idea. We found some
*relatively* clean way to get around BitSet - Filter
marriage by avoiding Filter completely and making our own
"SlicingIndexReader extends FilterIndexReader"
that has ability to receive  Matcher(filter) from outside.
Also,  FilteredTermDocs implements TermDocs inside of it to
use this "Matcher" to do skipping.   We could not
wait  for  your nice  code that solves Filter problems  to
get on svn. I did not bother anybody with this code as I
think it solves problem with Filter on conceptually soft
ground. It provides me with capacity to make different view
on index subset defined by Matcher, but does not have
flexibility your approach with Matcher has (eg
BooleanQuery.add(Matcher, Occur)...). Also, it is fast, as
it filters out at "the source", with Scorers
totally unaware of it.    .... irrelevant, just mentioning
it as maybe someone finds this idea
 useful for something else. 

Back to the RLE, by doing all this, we came up with one
SortedIntIntervalListMatcher  as  we have some fields in
index that compress perfectly using this trick! IntervalList
is practically  the same thing as RLE, solves  the same
problem. So the idea, "can RLE save some ticks/ index
size without affecting performance in typical non-sorted
case", I would say yes, but it is good idea to ask for
feedback from people more familiar with multi interval skip
lists and bit level gurus like you and Yonik,   honestly, I
have no idea what would be relative cost of an extra
if(0xFF==b) in tight next() and skipTo() methods, as  this
determines how big slow down in the worst case will be
(totally sparse, no RLE "firing"). 

 
>You could try and use a VInt flag value (how about 0xFF
?) to start an encoded 
>run length encoded series of bits. For example 0xFF
would be followed by the
>next delta as a VInt, and by the run length as the next
VInt.

>You might also try and generalize the bytes of VInt to
nibbles (half bytes).

I will see, I need to figure out some experiments that are
not as involved as implementing it on Lucene index (change
index format and whatnot..)
 
Anyhow,  it is encouraging  not to have  immediate
"sorry, cannot work because ...."   from  you or
someone else on this list 

thanks again for feedback,
e.







     
___________________________________________________________
Yahoo! Answers - Got a question? Someone out there knows the
answer. Try it
now.
http://uk.answers.yahoo.
com/

------------------------------------------------------------
---------
To unsubscribe, e-mail: java-dev-unsubscribelucene.apache.org
For additional commands, e-mail: java-dev-helplucene.apache.org


Re: RLE Compressing bit vectors, just toughts
country flaguser name
Netherlands
2007-08-04 14:09:18
Eks,

Two things:

In the latest Matcher patches I forgot to consider your
implementation of
a Matcher on OpenBitSet. In case I don't mention this there,
please holler
there when things start moving. My implementation of that in
the latest
patches is really no more than a first attempt, and mostly
untested.

As a sort of baseline for good compression you might try and
use
the GZipInputStream in java.util.gzip. I just saw that it
even
has a skip() method over bytes of uncompressed data,
so it should also useful for skip list compression.
I have no idea how fast/slow this class is, but when it
works only
slightly better than reasonable, there is probably no need
to try
any compression tricks on its input, except for delta
encoding to reduce the total size of the uncompressed data
stream.

Regards,
Paul Elschot


On Saturday 04 August 2007 20:26, eks dev wrote:
> Hi Paul, thanks for feedback
> 
> >I suppose you mean run length encoding? (I missed
the first posting about 
> this.)
> 
> You are right, that is what I meant by RLE. This was
the first posting. I am 
just trying get some feedback to see if there are some
knock-out conditions 
disqualifying  this idea.
> 
> a bit of background, 
> It came up as a side effect after playing with Filter ,
better said, with 
your Matcher idea. We found some *relatively* clean way to
get around BitSet 
- Filter marriage by avoiding Filter completely and making
our own 
"SlicingIndexReader extends FilterIndexReader"
that has ability to receive  
Matcher(filter) from outside. Also,  FilteredTermDocs
implements TermDocs 
inside of it to use this "Matcher" to do skipping.
  We could not wait  for  
your nice  code that solves Filter problems  to get on svn.
I did not bother 
anybody with this code as I think it solves problem with
Filter on 
conceptually soft ground. It provides me with capacity to
make different view 
on index subset defined by Matcher, but does not have
flexibility your 
approach with Matcher has (eg BooleanQuery.add(Matcher,
Occur)...). Also, it 
is fast, as it filters out at "the source", with
Scorers totally unaware of 
it.    .... irrelevant, just mentioning it as maybe someone
finds this idea
>  useful for something else. 
> 
> Back to the RLE, by doing all this, we came up with one

SortedIntIntervalListMatcher  as  we have some fields in
index that compress 
perfectly using this trick! IntervalList is practically  the
same thing as 
RLE, solves  the same problem. So the idea, "can RLE
save some ticks/ index 
size without affecting performance in typical non-sorted
case", I would say 
yes, but it is good idea to ask for feedback from people
more familiar with 
multi interval skip lists and bit level gurus like you and
Yonik,   honestly, 
I have no idea what would be relative cost of an extra
if(0xFF==b) in tight 
next() and skipTo() methods, as  this determines how big
slow down in the 
worst case will be (totally sparse, no RLE
"firing"). 
> 
>  
> >You could try and use a VInt flag value (how about
0xFF ?) to start an 
encoded 
> >run length encoded series of bits. For example 0xFF
would be followed by 
the
> >next delta as a VInt, and by the run length as the
next VInt.
> 
> >You might also try and generalize the bytes of VInt
to nibbles (half 
bytes).
> 
> I will see, I need to figure out some experiments that
are not as involved 
as implementing it on Lucene index (change index format and
whatnot..)
>  
> Anyhow,  it is encouraging  not to have  immediate
"sorry, cannot work 
because ...."   from  you or someone else on this list

> 
> thanks again for feedback,
> e.
> 
> 
> 
> 
> 
> 
> 
>      
___________________________________________________________
> Yahoo! Answers - Got a question? Someone out there
knows the answer. Try it
> now.
> http://uk.answers.yahoo.
com/ 
> 
>
------------------------------------------------------------
---------
> To unsubscribe, e-mail: java-dev-unsubscribelucene.apache.org
> For additional commands, e-mail: java-dev-helplucene.apache.org
> 
> 
> 
> 

------------------------------------------------------------
---------
To unsubscribe, e-mail: java-dev-unsubscribelucene.apache.org
For additional commands, e-mail: java-dev-helplucene.apache.org


[1-2]

about | contact  Other archives ( Real Estate discussion Medical topics )