[ https://issues.apache.org/jira/browse/SOLR-314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514541 ]
J.J. Larrea commented on SOLR-314:
----------------------------------
I agree that a stored-field pre-processor would be quite
useful, but I'm not sure the proposed scheme is the best way
to define and control it. In particular,
f.<field>.analysis=<fieldType>, which pulls the analyzer
definition out of a different fieldType, seems like a
fragile and hacky construct. It also blurs what I see as
separate concerns: (1) making pre-storage processing part of
how a field is handled, versus (2) dynamically changing the
handling of a field. Another valid concern you raise, (3)
how to handle duplicate indexed values, should apply whether
the duplicates arose from tokenization or from separate
<field>...</field> values.
I wonder if a more robust implementation of the
pre-processing concern would simply be to add another
analyzer type "store" to the current set
"index" and "query" which can be defined
on a fieldType; naturally it wouldn't be in the default
set.
For your example,
  <fieldType name="text_ws" class="solr.TextField" >
    <analyzer type="store,index,query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    </analyzer>
  </fieldType>
would ws-tokenize "aaa bbb ccc" and store 3
separate strings.
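As a rough illustration of what that "store" analyzer would produce, here is a minimal sketch in plain Java. It does not use Solr or Lucene; the class and method names are hypothetical, and a whitespace split stands in for WhitespaceTokenizerFactory.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a "store" analyzer chain; not Solr's actual API.
class StoreAnalyzerSketch {

    /** Splits a raw field value on whitespace and returns one stored value per token. */
    static List<String> storedValues(String raw) {
        List<String> values = new ArrayList<>();
        for (String token : raw.trim().split("\\s+")) {
            if (!token.isEmpty()) { // an empty input yields no stored values
                values.add(token);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(storedValues("aaa bbb ccc")); // prints [aaa, bbb, ccc]
    }
}
```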
You raise the question of how to control the catenation of
tokens. It would be simple enough to create an UnTokenize
token filter that can be added to the tail of any analyzer
chain. It could take arguments for the separator strings to
use based on whether tokens are overlapping or not, or
better yet, printf formats for both cases.
That would extend the store analyzer to quite different
use-cases... for example, semicolon-delimited author strings
can be split, with each author run through your
CapitalizationFilter for storage, while for indexing
punctuation would be stripped and it would be lower-cased:
  <fieldType name="text_ws" class="solr.TextField" >
    <analyzer type="store">
      <tokenizer class="solr.PatternTokenizerFactory"
                 pattern=";\s+"/>
      <filter class="solr.CapitalizationFilterFactory"
              onlyFirstWord="false"
              keep="and or the is my of for de"
              okPrefix="McK"
              forceFirstLetter="true" />
      <filter class="solr.UnTokenizerFilterFactory"
              adjacent="; "/>
    </analyzer>
    <analyzer type="index,query"> <!-- type="index,query" is optional -->
      <tokenizer class="solr.PatternTokenizerFactory"
                 pattern="[,;|\s]+"/>
      ...
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
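To make the difference between the two chains concrete, here is a simplified sketch in plain Java. It is only a model under stated assumptions: the patterns are taken to be `;\s+` and `[,;|\s]+`, the capitalization rules are a guess at how keep, okPrefix and forceFirstLetter would behave, the elided index filters are not modelled, and every name in the code is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Simplified, hypothetical model of the "store" and "index" chains; not Solr code.
class AuthorFieldSketch {
    static final Set<String> KEEP = new HashSet<>(
            Arrays.asList("and", "or", "the", "is", "my", "of", "for", "de"));
    static final String OK_PREFIX = "McK";

    /** "store" chain: split authors on ";" plus whitespace, capitalize each, rejoin with "; ". */
    static String storeValue(String raw) {
        List<String> authors = new ArrayList<>();
        for (String author : raw.trim().split(";\\s+")) {
            if (!author.isEmpty()) {
                authors.add(capitalize(author));
            }
        }
        return String.join("; ", authors);
    }

    /** Assumed rules: keep-words stay lower-case unless first; OK_PREFIX words are untouched. */
    static String capitalize(String author) {
        String[] words = author.split(" ");
        for (int i = 0; i < words.length; i++) {
            String w = words[i];
            if (w.isEmpty() || w.startsWith(OK_PREFIX)) {
                continue; // leave the word exactly as typed
            }
            String lower = w.toLowerCase(Locale.ROOT);
            if (i > 0 && KEEP.contains(lower)) {
                words[i] = lower;
            } else {
                words[i] = Character.toUpperCase(lower.charAt(0)) + lower.substring(1);
            }
        }
        return String.join(" ", words);
    }

    /** "index" chain: split on commas, semicolons, pipes and whitespace, then lower-case. */
    static List<String> indexTokens(String raw) {
        List<String> tokens = new ArrayList<>();
        for (String t : raw.split("[,;|\\s]+")) {
            if (!t.isEmpty()) {
                tokens.add(t.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String raw = "ryan McKinley; miguel de cervantes";
        System.out.println(storeValue(raw));  // prints Ryan McKinley; Miguel de Cervantes
        System.out.println(indexTokens(raw)); // prints [ryan, mckinley, miguel, de, cervantes]
    }
}
```

The stored value stays a single display string, while the indexed form is a flat list of lower-cased terms.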
In a similar example, stored values could be run through the
HyphenatedWordsFilterFactory (and then untokenized) so they
reflect what is actually being indexed.
One could even store the result of analysis (perhaps in a
CopyField) as a visual token mapping to help diagnose
indexing/analysis problems, concatenated with something on
the order of
  <filter class="solr.UnTokenizerFilterFactory"
          adjacent=" " overlap=" / "
          missing="&lt;null&gt;" />
e.g. "<null> quick / fast dog / canine jumped ..."
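The joining logic such an UnTokenize filter would need can be sketched in plain Java as below. This is not Lucene code. It assumes each token carries a position increment in the Lucene sense (0 means the token overlaps the previous one, 1 means adjacent, more than 1 means positions were skipped, for example by a stop filter). The Token class and the unTokenize method are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the UnTokenize joining logic; not an actual Lucene TokenFilter.
class UnTokenizeSketch {

    /** A token's text plus its position increment relative to the previous token. */
    static final class Token {
        final String text;
        final int positionIncrement;

        Token(String text, int positionIncrement) {
            this.text = text;
            this.positionIncrement = positionIncrement;
        }
    }

    static String unTokenize(List<Token> tokens, String adjacent, String overlap, String missing) {
        StringBuilder out = new StringBuilder();
        boolean first = true;
        for (Token t : tokens) {
            if (!first && t.positionIncrement == 0) {
                // Same position as the previous token (e.g. a synonym).
                out.append(overlap).append(t.text);
                continue;
            }
            // Emit one placeholder for every skipped position.
            for (int gap = 1; gap < t.positionIncrement; gap++) {
                if (!first) {
                    out.append(adjacent);
                }
                out.append(missing);
                first = false;
            }
            if (!first) {
                out.append(adjacent);
            }
            out.append(t.text);
            first = false;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "the quick dog jumped" with "the" removed and synonyms injected.
        List<Token> tokens = Arrays.asList(
                new Token("quick", 2), new Token("fast", 0),
                new Token("dog", 1), new Token("canine", 0),
                new Token("jumped", 1));
        System.out.println(unTokenize(tokens, " ", " / ", "<null>"));
        // prints <null> quick / fast dog / canine jumped
    }
}
```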
Then to address the other concern (2) of allowing
user-control of field types, one solution would be to recast
the StoreAnalysisProcessor as say DynamicFieldTypeProcessor,
allowing f.<field>.type=<fieldType> when it is
inserted in the chain... e.g. for language-specific
analysis, etc.
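A processor like that would first have to pull the per-field overrides out of the request parameters. A minimal sketch of that step in plain Java follows. The class name, the flat Map standing in for Solr's request parameters, and the "text_fr" field type are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for a DynamicFieldTypeProcessor; not part of Solr.
class FieldTypeOverridesSketch {
    // Matches parameter names of the form f.<field>.type
    private static final Pattern PARAM = Pattern.compile("f\\.(.+)\\.type");

    /** Returns a map from field name to the fieldType requested for it. */
    static Map<String, String> overrides(Map<String, String> params) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : params.entrySet()) {
            Matcher m = PARAM.matcher(e.getKey());
            if (m.matches()) {
                result.put(m.group(1), e.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("f.title.type", "text_fr"); // hypothetical language-specific type
        params.put("commit", "true");          // unrelated parameter, ignored
        System.out.println(overrides(params)); // prints {title=text_fr}
    }
}
```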
(It's late, I hope this all makes sense...)
> Store Analyzed token text from an incoming SolrInputDocument
> ------------------------------------------------------------
>
> Key: SOLR-314
> URL: https://issues.apache.org/jira/browse/SOLR-314
> Project: Solr
> Issue Type: New Feature
> Components: update
> Reporter: Ryan McKinley
> Attachments: SOLR-314-StoreAnalysis.patch
>
>
> This is an UpdateRequestProcessor that runs incoming
> fields through a Field Analyzer and stores the output of
> each token as a field value.
> For example, if you have a field type defined:
>   <fieldType name="text_ws" class="solr.TextField" >
>     <analyzer>
>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     </analyzer>
>   </fieldType>
> And send a request:
>   /update?store.analysis=true&f.feature.analysis=text_ws
>   <add> <doc>
>     <field name="feature">aaa bbb ccc</field>
>   </doc></add>
> The returned document will look like:
>   <doc>
>     <arr name="feature">
>       <str>aaa</str>
>       <str>bbb</str>
>       <str>ccc</str>
>     </arr>
>   </doc>