Thank you for the reply, Ard.

The tokens exist in the index and are returned accurately, except for
the offsets. In this case I am not dealing with positions, so the term
vector is specified as 'with_offsets', and I have left the term position
increment at its default. The existing TokenStreams don't maintain
knowledge of the current position: they always generate start offsets
beginning at 0 of the current stream, and a 'proper' offset is then
generated by the DocumentWriter when indexing, based on the end of the
previous token plus 1. Nor are there any test cases for offsets. I found
a bug, opened a while ago, that deals with this issue (as well as a
related one):

https://issues.apache.org/jira/browse/LUCENE-579

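To make "start offsets beginning at 0 of the current stream" concrete, here is a minimal sketch of a whitespace tokenizer that records stream-relative offsets. It does not use Lucene at all; the class `StreamRelativeOffsets` and its `Tok` type are hypothetical names made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not part of Lucene. */
class StreamRelativeOffsets {

    /** A token whose offsets are measured from the start of its own stream. */
    static final class Tok {
        public final String text;
        public final int start; // inclusive
        public final int end;   // exclusive

        Tok(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    /** Splits on whitespace and records offsets relative to index 0 of the input. */
    static List<Tok> tokenize(String input) {
        List<Tok> out = new ArrayList<>();
        int i = 0;
        int n = input.length();
        while (i < n) {
            // Skip whitespace between tokens.
            while (i < n && Character.isWhitespace(input.charAt(i))) {
                i++;
            }
            int start = i;
            // Consume one run of non-whitespace characters.
            while (i < n && !Character.isWhitespace(input.charAt(i))) {
                i++;
            }
            if (i > start) {
                out.add(new Tok(input.substring(start, i), start, i));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (Tok t : tokenize("aaa bbb ccc")) {
            System.out.println(t.text + ", " + t.start + ", " + t.end);
        }
    }
}
```

For the input "aaa bbb ccc" this yields the offsets 0-3, 4-7 and 8-11, regardless of what was indexed before this stream.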
I am retrieving a text token's offset values using
TermPositionVector.getOffsets(), which returns TermVectorOffsetInfo[].
The offset values that were placed into the token during indexing are
not the ones being returned; they have been shifted.
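One way such a shift could arise is sketched below. This is a simplified model, not Lucene's code: it assumes the indexer keeps a running base offset per field, adds it to every token's stream-relative offsets, and advances it by the last token's end offset plus 1 after each stream. The class `OffsetShiftSketch` and the method `storedOffsets` are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

/** Simplified model of the assumed shift; names are hypothetical, not Lucene code. */
class OffsetShiftSketch {

    /**
     * Each element of streams is one token stream for the same field; each
     * inner array is {start, end} for one token, relative to its own stream.
     * Assumed rule: stored offset = base + stream-relative offset, and the
     * base grows by (end offset of the last token + 1) after each stream.
     */
    static int[][] storedOffsets(List<int[][]> streams) {
        int total = 0;
        for (int[][] s : streams) {
            total += s.length;
        }
        int[][] stored = new int[total][];
        int base = 0;
        int k = 0;
        for (int[][] stream : streams) {
            int lastEnd = -1;
            for (int[] tok : stream) {
                stored[k++] = new int[] { base + tok[0], base + tok[1] };
                lastEnd = tok[1];
            }
            if (lastEnd >= 0) {
                base += lastEnd + 1; // the "+1" applied after the previous token
            }
        }
        return stored;
    }

    public static void main(String[] args) {
        int[][] first = { {0, 3}, {4, 7}, {8, 11} };
        int[][] second = { {0, 3}, {4, 7} };
        for (int[] o : storedOffsets(Arrays.asList(first, second))) {
            System.out.println(o[0] + ", " + o[1]);
        }
    }
}
```

Under this assumption the first stream's offsets come back unchanged, while the second stream's tokens, supplied as 0-3 and 4-7, come back as 12-15 and 16-19.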

Thanks,
Shahan

Ard Schrijvers wrote:
> Hello,
>
>
>> Hi,
>> I am storing custom values in the Tokens provided by a Tokenizer, but
>> when retrieving them from the index the values don't match.
>>
>
> What do you mean by retrieving? Do you mean retrieving terms, or do
> you mean doing a search with words you know should be in the index,
> but for which you do not find a match?
>
> In the latter case, you must make sure that you are using the same
> analyzer for the search as you used for indexing.
>
>
>> I've looked in the LIA book but it's not current, since it mentioned
>> term vectors aren't stored. I'm using Lucene Nightly 146 but the
>> same thing has happened with older versions. Looking at the
>> internals, DocumentWriter seems to keep track of the end offset that
>> was placed into the index and modifies the token values (with +1),
>> but I'm not sure whether I should be concerned with it.
>> No existing analyzers are used when adding the document, so all the
>> offsets are generated manually.
>> Any suggestions of how the token offsets should be stored?
>>
>>
>
> Look at other classes that implement TokenStream. Also take a look at
> setPositionIncrement when you are putting in your own terms.
>
> Regards Ard
>
>
>> Is this valid?
>> Token, start, end
>> aaa, 0, 3
>> bbb, 4, 7
>> ccc, 8, 11
>>
>> Thanks,
>> Shahan
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe lucene.apache.org
>> For additional commands, e-mail: java-user-help lucene.apache.org