Thread: RE: Token offset values for custom Tokenizer




RE: Token offset values for custom Tokenizer
Netherlands
2007-07-16 03:13:55
Hello,

> Hi,
> I am storing custom values in the Tokens provided by a Tokenizer but
> when retrieving them from the index the values don't match.

What do you mean by retrieving? Do you mean retrieving terms, or do
you mean doing a search with words you know should be in the index,
but for which you do not find a match?

In the latter case, you must make sure that you are using the same
analyzer for the search as you used for indexing.

> I've looked in the LIA book but it's not current since it mentioned
> term vectors aren't stored. I'm using Lucene Nightly 146 but the same
> thing has happened with older versions. Looking at the internals,
> DocumentWriter seems to keep track of the end offset that was placed
> into the index and modifies the token values (with +1) but I'm not
> sure whether I should be concerned with it.
> No existing analyzers are used when adding the document so all the
> offsets are generated manually.
> Any suggestions of how the token offsets should be stored?
> 

Look at other classes that implement TokenStream. Also take a look at
setPositionIncrement when you are putting in your own terms.

Regards Ard
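
For readers unfamiliar with position increments, the following is a
minimal sketch of how per-token increments turn into absolute term
positions. It is a stand-alone illustration, not Lucene source code:
the class and method names are hypothetical, and the only assumption
taken from Lucene is that the default increment is 1, that 0 stacks a
term on the previous position, and that values above 1 leave a gap.
Positions are separate from the character offsets asked about above.

```java
import java.util.Arrays;

// Hypothetical illustration; not Lucene source code.
public class PositionSketch {

    /**
     * Converts per-token position increments into absolute positions.
     * The first token lands on position 0 when its increment is 1.
     */
    public static int[] positions(int[] increments) {
        int[] result = new int[increments.length];
        int position = -1; // so that a first increment of 1 yields position 0
        for (int i = 0; i < increments.length; i++) {
            position += increments[i];
            result[i] = position;
        }
        return result;
    }

    public static void main(String[] args) {
        // Two default increments, then a stacked token (0), then a gap (2).
        int[] p = positions(new int[] { 1, 1, 0, 2 });
        System.out.println(Arrays.toString(p)); // prints [0, 1, 1, 3]
    }
}
```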

> Is this valid?
> Token, start, end
> aaa, 0, 3
> bbb, 4, 7
> ccc, 8, 11
> 
> Thanks,
> Shahan
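
The offsets in the table above can be reproduced with a small
stand-alone sketch. It assumes the source text is "aaa bbb ccc" with
single spaces (the original message does not say so), it treats the end
offset as one past the token's last character, and it does not use any
Lucene classes; OffsetSketch and Tok are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a tokenizer; this does not use Lucene classes.
public class OffsetSketch {

    /** One token with its character span; end is exclusive. */
    public static final class Tok {
        public final String text;
        public final int start;
        public final int end;

        Tok(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return text + ", " + start + ", " + end;
        }
    }

    /** Splits on whitespace and records offsets relative to the start of the input. */
    public static List<Tok> tokenize(String input) {
        List<Tok> out = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            // Skip separator characters.
            while (i < input.length() && Character.isWhitespace(input.charAt(i))) {
                i++;
            }
            int start = i;
            // Consume the token itself.
            while (i < input.length() && !Character.isWhitespace(input.charAt(i))) {
                i++;
            }
            if (i > start) {
                out.add(new Tok(input.substring(start, i), start, i));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints "aaa, 0, 3", "bbb, 4, 7" and "ccc, 8, 11", one per line.
        for (Tok t : tokenize("aaa bbb ccc")) {
            System.out.println(t);
        }
    }
}
```

Under those assumptions the three rows in the question are consistent
with a single input string: each start offset is the previous end
offset plus one separator character.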
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
> 



Re: Token offset values for custom Tokenizer
Canada
2007-07-16 10:33:36
Thank you for the reply Ard,

The tokens exist in the index and are returned accurately, except for
the offsets. In this case I am not dealing with the positions, so the
term vector is specified as 'with_offsets'. I have left the term
position increment at its default. Looking at the existing
TokenStreams, they don't maintain knowledge of the current position;
they always generate start offsets beginning at 0 of the current
stream, and then a 'proper' offset is generated based on the +1 of the
previous token, which the DocumentWriter applies when indexing. There
are also no test cases for offsets. I found a bug that was opened a
while ago dealing with this issue (as well as a related one). It is:
https://issues.apache.org/jira/browse/LUCENE-579

I am retrieving a text token's offset values using
TermPositionVector.getOffsets(), which returns TermVectorOffsetInfo[].
The offset values that were placed into the token during indexing are
not returned unchanged; they have been shifted.
Thanks.
Shahan
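
One possible reading of the shift described above, sketched as a
stand-alone model: the indexer keeps a running base per field, adds it
to every token's offsets, and after each field value advances the base
by the last token's end offset plus one. This is an assumption drawn
from the description in this thread, not a copy of DocumentWriter; the
class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model only; names are hypothetical and this is not Lucene source code.
public class OffsetShiftSketch {

    /**
     * Each inner array holds one field value's tokens as {start, end} pairs,
     * relative to the start of that value. Returns the offsets as they would be
     * stored if a running base is added to every token and the base then
     * advances by (end offset of the value's last token + 1).
     */
    public static List<int[]> storedOffsets(int[][][] values) {
        List<int[]> stored = new ArrayList<>();
        int base = 0;
        for (int[][] value : values) {
            int lastEnd = -1;
            for (int[] tok : value) {
                stored.add(new int[] { base + tok[0], base + tok[1] });
                lastEnd = tok[1];
            }
            if (lastEnd >= 0) {
                base += lastEnd + 1; // the "+1" applied after the previous token
            }
        }
        return stored;
    }

    public static void main(String[] args) {
        int[][][] values = {
            { { 0, 3 }, { 4, 7 } }, // first value, e.g. "aaa bbb"
            { { 0, 3 } }            // second value, e.g. "ccc"; offsets restart at 0
        };
        // Prints "0, 3", "4, 7" and "8, 11", one per line.
        for (int[] o : storedOffsets(values)) {
            System.out.println(o[0] + ", " + o[1]);
        }
    }
}
```

Under this model, a tokenizer that already supplies document-wide
offsets for a later value (say 8 and 11 for the second value above)
would see them come back as 16 and 19, which would look like the shift
reported here. Whether that is what happens in this case depends on how
the document's fields are added.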


