Thread: Created: (SOLR-314) Store Analyzed token text from an incoming SolrInputDocument




Created: (SOLR-314) Store Analyzed token text from an incoming SolrInputDocument
2007-07-21 13:08:06
Store Analyzed token text from an incoming
SolrInputDocument
------------------------------------------------------------


                 Key: SOLR-314
                 URL: https://issues.apache.org/jira/browse/SOLR-314
             Project: Solr
          Issue Type: New Feature
          Components: update
            Reporter: Ryan McKinley


This is an UpdateRequestProcessor that runs incoming fields
through a field analyzer and stores each output token as a
separate field value.

For example, if you have a field type defined:

  <fieldType name="text_ws" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      </analyzer>
  </fieldType>

And send a request:
/update?store.analysis=true&f.feature.analysis=text_ws
<add> <doc>
 <field name="feature">aaa bbb ccc</field>
</doc></add>

The returned document will look like:
<doc>
 <arr name="feature">
  <str>aaa</str>
  <str>bbb</str>
  <str>ccc</str>
 </arr>
</doc>
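
A rough sketch of the behavior described above, in plain Python rather than Solr code. The names store_analysis and whitespace_tokenize are made up for illustration and are not part of the patch; the tokenizer is only a stand-in for WhitespaceTokenizerFactory.

```python
def whitespace_tokenize(text):
    # Stand-in for a whitespace tokenizer: split on runs of whitespace.
    return text.split()


def store_analysis(doc, field_analyzers):
    """Replace each configured field's values with the tokens its analyzer emits.

    doc maps field name -> list of raw values.
    field_analyzers maps field name -> callable returning a list of tokens.
    Fields without a configured analyzer are copied through unchanged.
    """
    out = {}
    for name, values in doc.items():
        analyze = field_analyzers.get(name)
        if analyze is None:
            out[name] = list(values)
        else:
            out[name] = [tok for value in values for tok in analyze(value)]
    return out


doc = {"feature": ["aaa bbb ccc"]}
result = store_analysis(doc, {"feature": whitespace_tokenize})
print(result)  # {'feature': ['aaa', 'bbb', 'ccc']}
```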



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue
online.


Updated: (SOLR-314) Store Analyzed token text from an incoming SolrInputDocument
2007-07-21 13:19:06
     [ https://issues.apache.org/jira/browse/SOLR-314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan McKinley updated SOLR-314:
-------------------------------

    Attachment: SOLR-314-StoreAnalysis.patch

This adds the StoreAnalysisProcessor to the default chain. 
It is skipped unless the request includes a parameter
"store.analysis=true"

It chooses the field type based on a field param:
f.fieldname.analyze=FieldTypeName
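
As an illustration only, here is one way the per-field parameters could be read from a query string. This is a Python sketch, not the patch's code; analysis_field_types is a hypothetical helper. The issue description writes the parameter as f.<field>.analysis and this comment writes f.<field>.analyze, so the suffix is left as an argument.

```python
from urllib.parse import parse_qs


def analysis_field_types(query_string, suffix="analyze"):
    """Return {field name: field type name} for f.<field>.<suffix> parameters.

    Returns an empty mapping unless store.analysis=true is present,
    mirroring the "skipped unless" behavior described above.
    """
    params = parse_qs(query_string)
    if params.get("store.analysis", ["false"])[0] != "true":
        return {}
    mapping = {}
    for key, values in params.items():
        parts = key.split(".")
        if len(parts) >= 3 and parts[0] == "f" and parts[-1] == suffix:
            # Rejoin the middle so field names containing dots survive.
            mapping[".".join(parts[1:-1])] = values[0]
    return mapping


mapping = analysis_field_types("store.analysis=true&f.feature.analyze=text_ws")
print(mapping)  # {'feature': 'text_ws'}
```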

I'm not totally happy with the field names.  Suggestions?

- - - - -

The one big issue I'm not sure how to deal with is stitching
a multi-valued request into a single TokenStream.

Consider the input 
<add> <doc>
 <field name="feature">aaa bbb ccc</field>
 <field name="feature">bbb ccc ddd</field>
</doc></add> 

As is, if the FieldType has a 'RemoveDuplicates' filter,
it won't remove the duplicates between the fields, because
each input field gets its own Reader.

Any ideas for a way around this?

Can I extract the Tokenizer explicitly?
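
To make the problem concrete, a small Python sketch (not Solr code). remove_duplicates is a simplified stand-in for a duplicate-removing filter, not the actual Solr filter; it only shows the difference between analyzing each value on its own and analyzing one stitched stream.

```python
def remove_duplicates(tokens):
    # Simplified stand-in for a dedupe filter: keep the first occurrence only.
    seen = set()
    out = []
    for tok in tokens:
        if tok not in seen:
            seen.add(tok)
            out.append(tok)
    return out


values = ["aaa bbb ccc", "bbb ccc ddd"]

# Each value gets its own stream, so duplicates across values survive.
per_value = [tok for v in values for tok in remove_duplicates(v.split())]

# One stitched stream over all values lets the filter see every token.
stitched = remove_duplicates(tok for v in values for tok in v.split())

print(per_value)  # ['aaa', 'bbb', 'ccc', 'bbb', 'ccc', 'ddd']
print(stitched)   # ['aaa', 'bbb', 'ccc', 'ddd']
```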





Commented: (SOLR-314) Store Analyzed token text from an incoming SolrInputDocument
2007-07-21 13:21:06
    [ https://issues.apache.org/jira/browse/SOLR-314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514429 ]

Yonik Seeley commented on SOLR-314:
-----------------------------------

I think we need to be very careful about misleading people
into thinking they need something like this to search for
separate components of a field.  Most people will be best
served either by normal analysis or by creating multiple
fields themselves, if that's what they really desire.



Commented: (SOLR-314) Store Analyzed token text from an incoming SolrInputDocument
2007-07-21 13:32:06
    [ https://issues.apache.org/jira/browse/SOLR-314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514431 ]

Yonik Seeley commented on SOLR-314:
-----------------------------------

> This adds the StoreAnalysisProcessor to the default chain

Based on my previous comments, I think I'd be against adding
it to the default chain.
I still see this as a very rare need.  The norm for stored
fields should be "what you put in, you get back out".



Commented: (SOLR-314) Store Analyzed token text from an incoming SolrInputDocument
2007-07-21 13:51:06
    [ https://issues.apache.org/jira/browse/SOLR-314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514432 ]

Ryan McKinley commented on SOLR-314:
------------------------------------

Right, the point of this is to process *stored* fields.  Any
documentation for this would make the purpose clear and
suggest that you will have more flexibility doing the
processing on the client side.

I need to find a user-configurable way to have someone
process incoming fields.  In some cases that means splitting
them into multiple tokens, but in others it means doing things
like 'toLowerCase' and removing duplicates.  Rather than build
my own interface for this, it would be great to use the
existing configurable analyzer framework.

If this is something that ought to stay out of core, I'm fine
with that.  But it does feel generally useful.





Commented: (SOLR-314) Store Analyzed token text from an incoming SolrInputDocument
2007-07-22 21:04:31
    [ https://issues.apache.org/jira/browse/SOLR-314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514541 ]

J.J. Larrea commented on SOLR-314:
----------------------------------

I agree that a stored-field pre-processor would be quite
useful, but I'm not sure the proposed scheme is the best way
to define and control it.  In particular, using
f.<field>.analysis=<fieldType> to pull the analyzer
definition out of a different fieldType seems like a fragile
and hacky construct.  It also blurs what I see as separate
concerns: (1) having pre-storage processing be part of how a
field is handled, versus (2) dynamically changing the
handling of a field.  Another valid concern you raise, (3),
is how to handle duplicate indexed values, but that should
apply whether the duplicates arose from tokenization or from
separate <field>...</field> values.

I wonder if a more robust implementation of the
pre-processing concern would simply be to add another
analyzer type, "store", to the current set of "index" and
"query" that can be defined on a fieldType; naturally it
wouldn't be in the default set.

For your example, 

  <fieldType name="text_ws" class="solr.TextField">
      <analyzer type="store,index,query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      </analyzer>
  </fieldType>

would ws-tokenize "aaa bbb ccc" and store 3 separate strings.

You raise the question of how to control the concatenation of
tokens.  It would be simple enough to create an UnTokenize
token filter that can be added to the tail of any analyzer
chain.  It could take arguments for the separator strings to
use depending on whether tokens are overlapping or not, or
better yet, printf formats for both cases.
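
A minimal Python sketch of what such an untokenizing step might do. No such filter exists in the thread beyond this proposal, so everything here is hypothetical: tokens are modeled as (text, position increment) pairs, an increment of 0 is taken to mean the token overlaps the previous one, and the argument names adjacent and overlap are only illustrative.

```python
def untokenize(tokens, adjacent=" ", overlap=" / "):
    """Join (text, position_increment) pairs back into one string.

    A position increment of 0 marks a token that overlaps the previous
    one (for example a synonym), so it is joined with the overlap
    separator; any other token is joined with the adjacent separator.
    """
    parts = []
    for i, (text, pos_inc) in enumerate(tokens):
        if i > 0:
            parts.append(overlap if pos_inc == 0 else adjacent)
        parts.append(text)
    return "".join(parts)


tokens = [("quick", 1), ("fast", 0), ("dog", 1), ("canine", 0), ("jumped", 1)]
joined = untokenize(tokens)
print(joined)  # quick / fast dog / canine jumped
```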

That would extend the store analyzer to quite different
use-cases.  For example, semicolon-delimited author strings
could be split, with each author run through your
CapitalizationFilter for storage, while for indexing
punctuation would be stripped and the text lower-cased:

	<fieldType name="text_ws" class="solr.TextField">
		<analyzer type="store">
			<tokenizer class="solr.PatternTokenizerFactory" pattern=";\s+"/>
			<filter class="solr.CapitalizationFilterFactory"
				onlyFirstWord="false"
				keep="and or the is my of for de"
				okPrefix="McK"
				forceFirstLetter="true" />
			<filter class="solr.UnTokenizerFilterFactory" adjacent="; "/>
		</analyzer>
		<analyzer type="index,query">  <!-- type="index,query" is optional -->
			<tokenizer class="solr.PatternTokenizerFactory" pattern="[,;|\s]+"/>
			...
			<filter class="solr.LowerCaseFilterFactory"/>
		</analyzer>
	</fieldType>
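
To show the intended split between the stored and the indexed form, here is a rough Python sketch. It assumes the two tokenizer patterns are meant to be ;\s+ and [,;|\s]+ (whitespace classes), and it uses a much simpler capitalization rule than the CapitalizationFilter options above (no keep list, no okPrefix handling). The function names are invented for illustration.

```python
import re


def capitalize_words(name):
    # Upper-case the first letter of each word and leave the rest untouched.
    return " ".join(w[:1].upper() + w[1:] for w in name.split())


def store_analyze(value):
    """'store' chain: split authors on ';', capitalize, rejoin with '; '."""
    authors = [a for a in re.split(r";\s+", value.strip()) if a]
    return "; ".join(capitalize_words(a) for a in authors)


def index_analyze(value):
    """'index' chain: split on punctuation and whitespace, then lower-case."""
    return [t.lower() for t in re.split(r"[,;|\s]+", value) if t]


raw = "ryan mcKinley; yonik seeley"
stored = store_analyze(raw)
indexed = index_analyze(raw)
print(stored)   # Ryan McKinley; Yonik Seeley
print(indexed)  # ['ryan', 'mckinley', 'yonik', 'seeley']
```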

In a similar example, stored values could be run through the
HyphenatedWordsFilterFactory (and then untokenized) so they
reflect what is actually being indexed.

One could even store the result of analysis (perhaps in a
CopyField) as a visual token mapping to help diagnose
indexing/analysis problems, concatenated with something on
the order of

	<filter class="solr.UnTokenizerFilterFactory"
		adjacent=" " overlap=" / " missing="&lt;null&gt;" />

e.g. "<null> quick / fast dog / canine jumped ..."

Then to address the other concern (2) of allowing
user-control of field types, one solution would be to recast
the StoreAnalysisProcessor as, say, DynamicFieldTypeProcessor,
allowing f.<field>.type=<fieldType> when it is inserted in
the chain, e.g. for language-specific analysis, etc.

(It's late, I hope this all makes sense...)





Commented: (SOLR-314) Store Analyzed token text from an incoming SolrInputDocument
2007-07-24 01:09:31
    [ https://issues.apache.org/jira/browse/SOLR-314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514860 ]

Hoss Man commented on SOLR-314:
-------------------------------

I'm on the same page with the first part of JJ's comments:
the API seems a bit awkward and forced.  Adding a new analyzer
type would be one way to go if we wanted to change things at
the schema/doc-processing level -- the approach I was
thinking about was just a new FieldType that used its index
analyzer for the stored values as well as the indexed
values.

I'm not really understanding most of the discussion about
concatenating and how that would work -- but I see it as
being largely unrelated to the main point of the issue (a
way to tokenize and process an input string), because people
may want to use an option like that even when sending
discrete values -- we should tackle the issues separately.


