|
List Info
Thread: Re: RLE Compressing bit vectors, just toughts
|
|
| Re: RLE Compressing bit vectors, just
toughts |
  Germany |
2007-08-04 13:26:09 |
Hi Paul, thanks for feedback
>I suppose you mean run length encoding? (I missed the
first posting about
this.)
You are right, that is what I meant by RLE. This was the
first posting. I am just trying get some feedback to see if
there are some knock-out conditions disqualifying this
idea.
a bit of background,
It came up as a side effect after playing with Filter ,
better said, with your Matcher idea. We found some
*relatively* clean way to get around BitSet - Filter
marriage by avoiding Filter completely and making our own
"SlicingIndexReader extends FilterIndexReader"
that has ability to receive Matcher(filter) from outside.
Also, FilteredTermDocs implements TermDocs inside of it to
use this "Matcher" to do skipping. We could not
wait for your nice code that solves Filter problems to
get on svn. I did not bother anybody with this code as I
think it solves problem with Filter on conceptually soft
ground. It provides me with capacity to make different view
on index subset defined by Matcher, but does not have
flexibility your approach with Matcher has (eg
BooleanQuery.add(Matcher, Occur)...). Also, it is fast, as
it filters out at "the source", with Scorers
totally unaware of it. .... irrelevant, just mentioning
it as maybe someone finds this idea
useful for something else.
Back to the RLE, by doing all this, we came up with one
SortedIntIntervalListMatcher as we have some fields in
index that compress perfectly using this trick! IntervalList
is practically the same thing as RLE, solves the same
problem. So the idea, "can RLE save some ticks/ index
size without affecting performance in typical non-sorted
case", I would say yes, but it is good idea to ask for
feedback from people more familiar with multi interval skip
lists and bit level gurus like you and Yonik, honestly, I
have no idea what would be relative cost of an extra
if(0xFF==b) in tight next() and skipTo() methods, as this
determines how big slow down in the worst case will be
(totally sparse, no RLE "firing").
>You could try and use a VInt flag value (how about 0xFF
?) to start an encoded
>run length encoded series of bits. For example 0xFF
would be followed by the
>next delta as a VInt, and by the run length as the next
VInt.
>You might also try and generalize the bytes of VInt to
nibbles (half bytes).
I will see, I need to figure out some experiments that are
not as involved as implementing it on Lucene index (change
index format and whatnot..)
Anyhow, it is encouraging not to have immediate
"sorry, cannot work because ...." from you or
someone else on this list
thanks again for feedback,
e.
___________________________________________________________
Yahoo! Answers - Got a question? Someone out there knows the
answer. Try it
now.
http://uk.answers.yahoo.
com/
------------------------------------------------------------
---------
To unsubscribe, e-mail: java-dev-unsubscribe lucene.apache.org
For additional commands, e-mail: java-dev-help lucene.apache.org
|
|
| Re: RLE Compressing bit vectors, just
toughts |
  Netherlands |
2007-08-04 14:09:18 |
Eks,
Two things:
In the latest Matcher patches I forgot to consider your
implementation of
a Matcher on OpenBitSet. In case I don't mention this there,
please holler
there when things start moving. My implementation of that in
the latest
patches is really no more than a first attempt, and mostly
untested.
As a sort of baseline for good compression you might try and
use
the GZipInputStream in java.util.gzip. I just saw that it
even
has a skip() method over bytes of uncompressed data,
so it should also useful for skip list compression.
I have no idea how fast/slow this class is, but when it
works only
slightly better than reasonable, there is probably no need
to try
any compression tricks on its input, except for delta
encoding to reduce the total size of the uncompressed data
stream.
Regards,
Paul Elschot
On Saturday 04 August 2007 20:26, eks dev wrote:
> Hi Paul, thanks for feedback
>
> >I suppose you mean run length encoding? (I missed
the first posting about
> this.)
>
> You are right, that is what I meant by RLE. This was
the first posting. I am
just trying get some feedback to see if there are some
knock-out conditions
disqualifying this idea.
>
> a bit of background,
> It came up as a side effect after playing with Filter ,
better said, with
your Matcher idea. We found some *relatively* clean way to
get around BitSet
- Filter marriage by avoiding Filter completely and making
our own
"SlicingIndexReader extends FilterIndexReader"
that has ability to receive
Matcher(filter) from outside. Also, FilteredTermDocs
implements TermDocs
inside of it to use this "Matcher" to do skipping.
We could not wait for
your nice code that solves Filter problems to get on svn.
I did not bother
anybody with this code as I think it solves problem with
Filter on
conceptually soft ground. It provides me with capacity to
make different view
on index subset defined by Matcher, but does not have
flexibility your
approach with Matcher has (eg BooleanQuery.add(Matcher,
Occur)...). Also, it
is fast, as it filters out at "the source", with
Scorers totally unaware of
it. .... irrelevant, just mentioning it as maybe someone
finds this idea
> useful for something else.
>
> Back to the RLE, by doing all this, we came up with one
SortedIntIntervalListMatcher as we have some fields in
index that compress
perfectly using this trick! IntervalList is practically the
same thing as
RLE, solves the same problem. So the idea, "can RLE
save some ticks/ index
size without affecting performance in typical non-sorted
case", I would say
yes, but it is good idea to ask for feedback from people
more familiar with
multi interval skip lists and bit level gurus like you and
Yonik, honestly,
I have no idea what would be relative cost of an extra
if(0xFF==b) in tight
next() and skipTo() methods, as this determines how big
slow down in the
worst case will be (totally sparse, no RLE
"firing").
>
>
> >You could try and use a VInt flag value (how about
0xFF ?) to start an
encoded
> >run length encoded series of bits. For example 0xFF
would be followed by
the
> >next delta as a VInt, and by the run length as the
next VInt.
>
> >You might also try and generalize the bytes of VInt
to nibbles (half
bytes).
>
> I will see, I need to figure out some experiments that
are not as involved
as implementing it on Lucene index (change index format and
whatnot..)
>
> Anyhow, it is encouraging not to have immediate
"sorry, cannot work
because ...." from you or someone else on this list
>
> thanks again for feedback,
> e.
>
>
>
>
>
>
>
>
___________________________________________________________
> Yahoo! Answers - Got a question? Someone out there
knows the answer. Try it
> now.
> http://uk.answers.yahoo.
com/
>
>
------------------------------------------------------------
---------
> To unsubscribe, e-mail: java-dev-unsubscribe lucene.apache.org
> For additional commands, e-mail: java-dev-help lucene.apache.org
>
>
>
>
------------------------------------------------------------
---------
To unsubscribe, e-mail: java-dev-unsubscribe lucene.apache.org
For additional commands, e-mail: java-dev-help lucene.apache.org
|
|
[1-2]
|
|