Thread: Created: (LUCENE-550) InstanciatedIndex - faster but memory consuming index

Created: (LUCENE-550) InstanciatedIndex - faster but memory consuming index
2006-04-20 05:46:41
InstanciatedIndex - faster but memory consuming index
-----------------------------------------------------
Key: LUCENE-550
URL: http://issues.apache.org/jira/browse/LUCENE-550
Project: Lucene - Java
Type: New Feature
Components: Store
Versions: 1.9
Reporter: Karl Wettin
After fixing the bugs, it's now 4.5 to 5 times the speed.
This is true at both index and query time. Sorry if I got
your hopes up too much. There are still things to be done,
though. I might not have time to do anything with this until
next month, so here is the code if anyone wants a peek.
Not good enough for Jira yet, but if someone wants to fool
around with it, here it is. The implementation passes a
TermEnum -> TermDocs -> Fields -> TermVector comparison
against the same data in a Directory.
When it comes to features, offsets don't exist, and
positions are stored in an ugly way and have bugs.
You might notice that norms are float[] and not byte[]. That
is me refactoring it to see if it would do any good. Bit
shifting doesn't take many ticks, so I might just revert
that.
I believe the code is quite self-explanatory.
InstanciatedIndex ii = ..
ii.new InstanciatedIndexReader();
ii.addDocument(s).. replaces IndexWriter for now.
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the
administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe lucene.apache.org
For additional commands, e-mail: java-dev-help lucene.apache.org

Commented: (LUCENE-550) InstanciatedIndex - faster but memory consuming index
2006-04-20 15:06:07
[ http://issues.apache.org/jira/browse/LUCENE-550?page=comments#action_12375373 ]
Yonik Seeley commented on LUCENE-550:
-------------------------------------
Thanks Karl, it's interesting stuff...
> You might notice that norms are float[] and not byte[]. That is me who
> refactored it to see if it would do any good. Bit shifting don't take
> many ticks, so I might just revert that.
Since there are only 256 byte values, many scorers use a
simple lookup table: Similarity.getNormDecoder().
After I sped up norm decoding, a lookup table was only
marginally faster anyway (see the comments in the SmallFloat class).
So I wouldn't expect float[] norms to be measurably faster
than byte[] norms in the context of a complete search.
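To make the lookup-table point concrete, here is a minimal, self-contained sketch. It does not call Lucene. The class name NormTable and both method names are made up for illustration, and the bit layout (3 mantissa bits, 5 exponent bits, zero exponent at 15) is my understanding of the one-byte float encoding that SmallFloat uses, so treat the constants as an assumption.

```java
// Hypothetical illustration, not Lucene code.
class NormTable {
    // Decode one norm byte by bit shifting.
    // Assumed layout: 3 mantissa bits, 5 exponent bits, zero exponent at 15.
    static float decodeBits(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << (24 - 3); // move the 8 bits into float position
        bits += (63 - 15) << 24;           // rebias the exponent
        return Float.intBitsToFloat(bits);
    }

    // There are only 256 possible byte values, so precompute every decoding once.
    static final float[] TABLE = new float[256];
    static {
        for (int i = 0; i < 256; i++) {
            TABLE[i] = decodeBits((byte) i);
        }
    }

    // Decode by table lookup instead of bit shifting.
    static float decodeLookup(byte b) {
        return TABLE[b & 0xff];
    }

    public static void main(String[] args) {
        byte norm = 124;
        System.out.println(decodeBits(norm) + " " + decodeLookup(norm));
    }
}
```

Both paths return the same float for every byte. Whether either beats a pre-decoded float[] in a full search is the open question in this thread.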
> Attachments: Document.java, InstanciatedIndex.java, Term.java

Commented: (LUCENE-550) InstanciatedIndex - faster but memory consuming index
2006-04-20 16:32:06
[ http://issues.apache.org/jira/browse/LUCENE-550?page=comments#action_12375394 ]
Karl Wettin commented on LUCENE-550:
------------------------------------
> > You might notice that norms are float[] and not byte[]. That is me who
> > refactored it to see if it would do any good. Bit shifting don't take
> > many ticks, so I might just revert that.
> Since there are only 256 byte values, many scorers use a simple lookup
> table Similarity.getNormDecoder()
> After I sped up norm decoding, a lookup table was only marginally faster
> anyway (see comments in SmallFloat class). So I wouldn't expect float[]
> norms to be measurably faster than byte[] norms in the context of a
> complete search.
The hypothesis is that instantiation and unnecessary data
parsing are the bad guys. Converting bytes to floats fits that
profile, so I moved it to the IO classes (readFloat ->
readByte). I realize that for the norms alone it is a
marginal win, but if I find enough of these things it might
show in the end. I don't know if I'll find enough things to
work with, though. I've been looking at getting rid of things in
the IndexReader, as the information it returns is in many
situations already available in the information passed to the
IndexReader, but I'm afraid it might be a Pyrrhic victory,
as the JIT usually "caches" things like that automatically.
There are more obvious places to save ticks, e.g.
replacing collections with arrays.
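As a rough illustration of the "collections to arrays" idea, the sketch below copies a boxed list of document ids into an int[] once, so that later loops touch only primitives. It is not taken from the attached code. DocIdArrays, toArray and countInRange are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration, not code from the attachments.
class DocIdArrays {
    // Copy boxed doc ids into a primitive array once, e.g. when the index is committed.
    static int[] toArray(List<Integer> docIds) {
        int[] out = new int[docIds.size()];
        int i = 0;
        for (int id : docIds) {
            out[i++] = id; // unboxing happens once here, not on every query
        }
        return out;
    }

    // Count ids in [from, to) with a plain loop over primitives: no iterator, no boxing.
    static int countInRange(int[] docIds, int from, int to) {
        int n = 0;
        for (int id : docIds) {
            if (id >= from && id < to) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        int[] ids = toArray(Arrays.asList(1, 5, 9));
        System.out.println(countInRange(ids, 2, 10));
    }
}
```

How much this saves depends on how hot the loop is and on what the JIT already optimizes away, so it would need measuring.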

Commented: (LUCENE-550) InstanciatedIndex - faster but memory consuming index
2006-04-21 00:51:06
[ http://issues.apache.org/jira/browse/LUCENE-550?page=comments#action_12375453 ]
Karl Wettin commented on LUCENE-550:
------------------------------------
Due to read and write locks, this is how one must use the
extension:
InstanciatedIndex ii = new InstanciatedIndex();
IndexWriter iw = ii.new InstanciatedIndexWriter(analyzer, clear); // locks
iw.close(); // commits
IndexReader ir = ii.new InstanciatedIndexReader();
Searcher searcher = ii.getSearcher();
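For readers unfamiliar with the `ii.new ...` syntax: it creates an instance of a non-static inner class bound to the outer object ii. The toy below mimics the lock-on-open, commit-on-close pattern described above. It is only a sketch under my own assumptions, not the attached implementation: TinyIndex, Writer, Reader and numDocs are hypothetical names, and documents are plain strings.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for the index discussed in this thread.
class TinyIndex {
    private final AtomicBoolean writeLocked = new AtomicBoolean(false);
    private volatile List<String> committed = Collections.emptyList();

    // Inner class: created with `index.new Writer(clear)`.
    class Writer {
        private final List<String> pending;
        private boolean closed = false;

        Writer(boolean clear) {
            // Take the write lock; a second open writer is rejected.
            if (!writeLocked.compareAndSet(false, true)) {
                throw new IllegalStateException("index is write-locked");
            }
            pending = clear ? new ArrayList<String>() : new ArrayList<String>(committed);
        }

        void addDocument(String doc) {
            if (closed) {
                throw new IllegalStateException("writer is closed");
            }
            pending.add(doc); // buffered, not yet visible to readers
        }

        void close() {
            if (closed) {
                return;
            }
            closed = true;
            committed = Collections.unmodifiableList(pending); // commit
            writeLocked.set(false);                            // release the lock
        }
    }

    // Inner class: sees whatever was committed when it was created.
    class Reader {
        private final List<String> snapshot = committed;

        int numDocs() {
            return snapshot.size();
        }
    }

    public static void main(String[] args) {
        TinyIndex ii = new TinyIndex();
        TinyIndex.Writer iw = ii.new Writer(true); // locks
        iw.addDocument("hello");
        iw.close();                                // commits
        System.out.println(ii.new Reader().numDocs());
    }
}
```

A reader opened before close() keeps seeing the old state, which is why the reader in the usage above is created only after the writer has been closed.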

Updated: (LUCENE-550) InstanciatedIndex - faster but memory consuming index
2006-04-21 00:39:06
[ http://issues.apache.org/jira/browse/LUCENE-550?page=all ]
Karl Wettin updated LUCENE-550:
-------------------------------
Attachment: class_diagram.png
Class diagram of InstanciatedIndex.

Updated: (LUCENE-550) InstanciatedIndex - faster but memory consuming index
2006-04-21 00:37:10
[ http://issues.apache.org/jira/browse/LUCENE-550?page=all ]
Karl Wettin updated LUCENE-550:
-------------------------------
Attachment: src.tar.gz
The whole Lucene core branch.
I think I've messed something up: queries with
Directory implementations are much slower than normal.
See the class diagram to understand what I did.

Updated: (LUCENE-550) InstanciatedIndex - faster but memory consuming index
2006-04-21 21:13:06
[ http://issues.apache.org/jira/browse/LUCENE-550?page=all ]
Karl Wettin updated LUCENE-550:
-------------------------------
Attachment: class_diagram.png
This is a class diagram that explains what it will look like
when I'm done.
It is pretty much only the IndexReader that needs to be
refactored.

Updated: (LUCENE-550) InstanciatedIndex - faster but memory consuming index
2006-05-09 21:48:06
[ http://issues.apache.org/jira/browse/LUCENE-550?page=all ]
Karl Wettin updated LUCENE-550:
-------------------------------
Attachment: src_20060509.tar.gz
Some new statistics.
* A corpus of 500 documents, 1-5K of text per document.
* Ran 150 000 term and boolean queries.
* Retrieved the top <100 hits from each result.
Query alone is about 5x faster,
but 9x if you include the hits collection.
I believe that span queries will be about 10x-20x faster, as
skipTo() is really, really optimized. There is a bug in
my term position code, so I have not been able to measure it
for real yet.
I hope to have that working and an updated class diagram for
you soon.

Commented: (LUCENE-550) InstanciatedIndex - faster but memory consuming index
2006-05-09 23:27:05
[ http://issues.apache.org/jira/browse/LUCENE-550?page=comments#action_12378776 ]
Karl Wettin commented on LUCENE-550:
------------------------------------
Oops.
InstanciatedIndex:
Corpus creation took 14011 ms.
Term queries took 33608 ms.
RAMDirectory:
Corpus creation took 9144 ms.
Term queries took 1123565 ms.
That is roughly 33x the speed for term queries (1123565 ms vs. 33608 ms).
Something might be wrong, but my initial tests tell me that
it is right. I will look into this tomorrow. Need to sleep
now.

Updated: (LUCENE-550) InstanciatedIndex - faster but memory consuming index
2006-05-11 15:33:06
[ http://issues.apache.org/jira/browse/LUCENE-550?page=all ]
Karl Wettin updated LUCENE-550:
-------------------------------
Attachment: src-1.9karl1_20060611.tar.gz
There is a minor norms bug: the values differ by +-3 from the
Directory norms. Other than that it seems to work great.
Now about 40x faster than RAMDirectory on term queries and
about 20x faster on span queries.
Stats for the test: 500 documents, 1-5K of text content each.
10 000 * 5 span queries
10 000 * 13 term and boolean term queries
Collected the top 100 documents from each search result.
InstanciatedIndex running on Lucene 1.9-karl1:
Corpus creation took 14903 ms.
Span queries took 12884 ms.
Term queries took 30221 ms.
RAMDirectory running on Lucene 1.9:
Corpus creation took 9337 ms.
Span queries took 253412 ms.
Term queries took 1188492 ms.
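For anyone who wants to check the ratios, the small program below divides the RAMDirectory timings reported above by the InstanciatedIndex timings. The class and method names are made up for illustration. Only the millisecond figures come from this message.

```java
import java.util.Locale;

// Hypothetical helper: only the timings are taken from the message above.
class Speedup {
    // How many times faster the candidate is than the baseline.
    static double ratio(long baselineMs, long candidateMs) {
        return (double) baselineMs / candidateMs;
    }

    public static void main(String[] args) {
        // Arguments are: RAMDirectory ms, InstanciatedIndex ms.
        System.out.println(String.format(Locale.ROOT,
                "span queries:    %.2fx", ratio(253412, 12884)));
        System.out.println(String.format(Locale.ROOT,
                "term queries:    %.2fx", ratio(1188492, 30221)));
        System.out.println(String.format(Locale.ROOT,
                "corpus creation: %.2fx", ratio(9337, 14903)));
    }
}
```

This works out to roughly 19.7x for span queries and 39.3x for term queries, while corpus creation comes out at about 0.63x, meaning indexing into the InstanciatedIndex was slower than into the RAMDirectory in this run.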