List Info

Thread: Mixing Case and Case-Insensitive Searching




Mixing Case and Case-Insensitive Searching
user name
2007-04-17 10:29:24
I've run into a case where we want to search for the acronym
'LET',
however this three letter word occurs very frequently in
quite a
number of documents.

What I'm looking to do is a query that's case insensitive
_except_ for
that specific term.

And, it appears to do so, things get very ugly, very
quickly.

According to http://www.gossamer-threads.com/lists/luce
ne/java-user/28131?page=last
(I'm doing my homework first), it appears that one must keep
a
case-sensitive and case-insensitive version in the index, if
not two
separate indexes.

While this makes sense up to a point, allowing me to search
one field
or the other; I'm looking for distances between words, it
seems things
get far more complicated.

Because, should I store both versions and happen to know
that the
tokens are the same and that their positions are identical,
is there a
way do things like:

"LET organization"  (where LET is case sensitive,
but part of the phrase)

"company LET"~10  (again, where LET is case
sensitive, near the term
company which is case insensitive)

Would love to get some thoughts on how to go about
approaching this.

Thanks,
-Walt Stoneburner

------------------------------------------------------------
---------
To unsubscribe, e-mail: java-user-unsubscribelucene.apache.org
For additional commands, e-mail: java-user-helplucene.apache.org


Re: Mixing Case and Case-Insensitive Searching
user name
2007-04-17 12:06:09
Would it work to index the upper-case LET as something else?
For
instance, index it as '$let'

Now, all your searches are on one index, but you have to
substitute
'$let' where you want to find LET in your query.. This won't
match your
other occurrences of let...

You'll have to watch to be sure your displays are
appropriate. That is,
depending upon how you store your data for display you'd
have to
translate '$let' back into 'LET'......

This should also work for span queries, etc.....

This won't foul  you up if you use wildcards as long as you
don't
allow the wildcards to appear in the first character
position...

Best
Erick

On 4/17/07, Walt Stoneburner <walt.stoneburnergmail.com> wrote:
>
> I've run into a case where we want to search for the
acronym 'LET',
> however this three letter word occurs very frequently
in quite a
> number of documents.
>
> What I'm looking to do is a query that's case
insensitive _except_ for
> that specific term.
>
> And, it appears to do so, things get very ugly, very
quickly.
>
> According to
> http://www.gossamer-threads.com/lists/luce
ne/java-user/28131?page=last
> (I'm doing my homework first), it appears that one must
keep a
> case-sensitive and case-insensitive version in the
index, if not two
> separate indexes.
>
> While this makes sense up to a point, allowing me to
search one field
> or the other; I'm looking for distances between words,
it seems things
> get far more complicated.
>
> Because, should I store both versions and happen to
know that the
> tokens are the same and that their positions are
identical, is there a
> way do things like:
>
> "LET organization"  (where LET is case
sensitive, but part of the phrase)
>
> "company LET"~10  (again, where LET is case
sensitive, near the term
> company which is case insensitive)
>
> Would love to get some thoughts on how to go about
approaching this.
>
> Thanks,
> -Walt Stoneburner
>
>
------------------------------------------------------------
---------
> To unsubscribe, e-mail: java-user-unsubscribelucene.apache.org
> For additional commands, e-mail: java-user-helplucene.apache.org
>
>
Re: Mixing Case and Case-Insensitive Searching
country flaguser name
United States
2007-04-17 12:48:56
: I've run into a case where we want to search for the
acronym 'LET',
: however this three letter word occurs very frequently in
quite a
: number of documents.
:
: What I'm looking to do is a query that's case insensitive
_except_ for
: that specific term.

it sounds like you need to create a customized bastard
stepshild of
StopFilter and LowercaseFilter ... take in a dictionary of
known
capitalized acronimes in the constructor, then for each
Token lowercase it
unless:
   - it's in all caps
   - it's in your acronym list.



-Hoss


------------------------------------------------------------
---------
To unsubscribe, e-mail: java-user-unsubscribelucene.apache.org
For additional commands, e-mail: java-user-helplucene.apache.org


Re: Mixing Case and Case-Insensitive Searching
user name
2007-04-17 13:59:05
Yeah, what Hoss said. That's a much more elegant solution
than I suggested. If you use the same filter for indexing
and searching,
it'll all "just happen" for you.

Erick

On 4/17/07, Chris Hostetter <hossman_lucenefucit.org> wrote:
>
>
> : I've run into a case where we want to search for the
acronym 'LET',
> : however this three letter word occurs very frequently
in quite a
> : number of documents.
> :
> : What I'm looking to do is a query that's case
insensitive _except_ for
> : that specific term.
>
> it sounds like you need to create a customized bastard
stepshild of
> StopFilter and LowercaseFilter ... take in a dictionary
of known
> capitalized acronimes in the constructor, then for each
Token lowercase it
> unless:
>    - it's in all caps
>    - it's in your acronym list.
>
>
>
> -Hoss
>
>
>
------------------------------------------------------------
---------
> To unsubscribe, e-mail: java-user-unsubscribelucene.apache.org
> For additional commands, e-mail: java-user-helplucene.apache.org
>
>
[1-4]

about | contact  Other archives ( Real Estate discussion Medical topics )