Are we sure about KeywordAnalyzer here? It is supposed to "tokenize" the
entire stream as a single token (useful for data like zip codes, ids,
and some product names).
In the scenario we are discussing, U.S. is just one token within the
text, and we would still like to use StandardAnalyzer for all the
other goodies. I am sorry for the incomplete setup in my previous
message.
More or less, I expect there is somewhere we can instruct
StandardTokenizer.jj that U.S. is a special token (even though it is
indeed an ACRONYM) and that we prefer to index it as U.S., as is.
Can we do that?
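The desired behaviour can be sketched in plain Java as follows. This is only an illustration of the idea and not Lucene code: the class ProtectedTokenSketch, the PROTECTED set, and the analyze method are hypothetical names, and the normalization (split on whitespace, drop non-alphanumerics, lowercase) is a rough stand-in for what StandardAnalyzer does, without stop-word removal. Tokens on the protected list are kept verbatim, everything else is normalized.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch in plain Java; none of these names are Lucene API.
final class ProtectedTokenSketch {

    // Assumed list of tokens that should be indexed exactly as written.
    static final Set<String> PROTECTED =
            new HashSet<>(Arrays.asList("U.S.", "U.K.", "U.N.", "U.V."));

    // Splits on whitespace, keeps protected tokens as-is, and normalizes
    // the rest by dropping non-alphanumeric characters and lowercasing.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            if (raw.isEmpty()) {
                continue; // empty input yields a single empty string
            }
            if (PROTECTED.contains(raw)) {
                tokens.add(raw); // special token: index verbatim
                continue;
            }
            String normalized = raw
                    .replaceAll("[^\\p{L}\\p{N}]", "")
                    .toLowerCase(Locale.ROOT);
            if (!normalized.isEmpty()) {
                tokens.add(normalized);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "U.S." survives unchanged, while the pronoun "us." becomes "us".
        System.out.println(analyze("The U.S. economy matters to us."));
    }
}
```

With this approach "U.S." and "us" end up as distinct tokens. The match is exact, so a form such as "U.S.," with trailing punctuation would not be recognized; a real solution would need to handle that inside the tokenizer grammar.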
Charlie
Otis Gospodnetic wrote:
> Use KeywordAnalyzer to leave "U.S." as-is and index it as-is.
>
> Otis
> --
> Lucene Consulting -- http://lucene-consulting.com/
>
>
> ----- Original Message ----
> From: crspan <crspan gmail.com>
> To: java-user lucene.apache.org
> Sent: Saturday, July 14, 2007 5:18:59 PM
> Subject: index U.K. U.S. U.N. U.V.
>
> Would you please advise on the best practice for indexing:
>
> U.S.
>
> The standard analyzer will transform it to "us", which collides with
> "us" (we).
>
> Thanks,
>
> Charlie