Thread: Term frequency




Term frequency
user name
2007-04-11 14:23:24
Hi,
I've just started using Lucene. Can anybody assist me in calculating
the term frequencies of the terms (words) that occur in a document
(*.txt) when a particular doc is submitted?

Say when I submit sample.txt, I should first analyze the document
with a standard analyzer, then the term frequencies should be
calculated for each and every term in that document.

Thanks in advance
-- 
சாய் Hari
Unicode Normalization
user name
United States
2007-04-11 15:00:40
Hi.

I have encountered a problem searching in my application
because of inconsistent Unicode normalization forms in the
corpus (and the queries). I would like to normalize to form
NFKD in an analyzer (I think). I was thinking about creating
a filter similar to the LowerCaseFilter that would do the
Unicode normalization. Then I will add that filter to my
existing Snowball analyzer. I am about to embark on creating
said analyzer/filter using the ICU (http://icu-project.org/)
icu4j jar.

Is this already accounted for in standard Lucene somewhere
and I'm just missing it?

Anything similar out there?

Any other advice?

Thanks,
Dave Wooodward
Library of Congress


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Term frequency
user name
United States
2007-04-11 15:29:19
Add Term Vectors to your Field during indexing.  See the Field
constructors.  To get a Term Vector out, see the
IndexReader.getTermFreqVector method.

-Grant

On Apr 11, 2007, at 3:23 PM, sai hariharan wrote:

> Hi,
> I've just started using Lucene. Can anybody assist me in calculating
> the term frequencies of the terms(words) that occur in a document
> (*.txt),
> when a particular doc is submitted.
>
> Say when i submit sample.txt , i should first analyze the document
> with a standard analyzer, then the term frequencies should be
> calculated
> for each and every term in that document.
>
> Thanks in advance
> --
> சாய் Hari

--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at
http://wiki.apache.org/jakarta-lucene/LuceneFAQ





Re: Unicode Normalization
user name
United States
2007-04-11 17:48:03
: I have encountered a problem searching in my application because of
: inconsistent unicode normalization forms in the corpus (and the
: queries). I would like to normalize to form NFKD in an analyzer (I
: think). I was thinking about creating a filter similar to the

i'm very naive to the multitudes of issues with charsets and
charencodings, but isn't that a problem best solved when first
constructing the java String or Reader object -- either from a file
on disk or from a network socket of some kind?

or am i misunderstanding your meaning of the word Normalization?  at
first i thought you might be talking about something like the
ISOLatin1AccentFilter but then i looked at the ICU url you mentioned
and it seems to be all about byte=>character issues ... that doesn't
sound like something you would really want to be doing in an Analyzer.




-Hoss




Re: Unicode Normalization
user name
2007-04-11 21:24:32
On 4/11/07, Chris Hostetter <hossman_lucene@fucit.org> wrote:
>
> : I have encountered a problem searching in my application because of
> : inconsistent unicode normalization forms in the corpus (and the
> : queries). I would like to normalize to form NFKD in an analyzer (I
> : think). I was thinking about creating a filter similar to the
>
> i'm very naive to the multitudes of issues with charsets and
> charencodings, but isn't that a problem best solved when first
> constructing the java String or Reader object -- either from a file
> on disk or from a network socket of some kind?
>
> or am i misunderstanding your meaning of the word Normalization?  at
> first i thought you might be talking about something like the
> ISOLatin1AccentFilter but then i looked at the ICU url you mentioned
> and it seems to be all about byte=>character issues ... that doesn't
> sound like something you would really want to be doing in an Analyzer.

Unfortunately, there is a whole level of unicode "encoding" issues
above the level of byte encoding.  Unicode characters do not map
precisely to code points:  a single character can often be represented
via a single codepoint or a combination of two (surrogate pair).  I
have no idea how java's String class handles this--I doubt it does any
intelligent normalization.

-Mike



Re: Unicode Normalization
user name
2007-04-11 22:02:31
On 4/11/07, Mike Klaas <mike.klaas@gmail.com> wrote:
> Unicode characters do not map
> precisely to code points:  a single character can often be represented
> via a single codepoint or a combination of two (surrogate pair).

I normally hear surrogates in the context of UTF-16 after the code
point space became too large for UTF-16 to represent.  AFAIK it's more
of an encoding thing, not a code point thing... for example, you would
never see the surrogates if you encoded in UTF8 (although the
surrogates are still code points since they needed to be reserved).

But there do seem to be groups of code points that map to a single
character:
http://en.wikipedia.org/wiki/Combining_character

> have no idea how java's String class handles this--I doubt it does any
> intelligent normalization.

UTF-16 surrogates are handled as of Java5.
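
To make the distinction above concrete, here is a minimal Java sketch,
assuming Java 7 or later for StandardCharsets. The class name
CodePointSketch and its helper methods are made up for illustration.
It counts UTF-16 code units, code points, and UTF-8 bytes for a
supplementary character (which needs a surrogate pair in UTF-16) and
for a base letter followed by a combining accent (two code points, no
surrogates involved).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of Lucene or the JDK.
final class CodePointSketch {
    // Number of UTF-16 code units (Java chars) in the string.
    static int utf16Units(String s) {
        return s.length();
    }

    // Number of Unicode code points; a surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of bytes when the string is encoded as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    static void report(String label, String s) {
        System.out.println(label + ": utf16Units=" + utf16Units(s)
                + " codePoints=" + codePoints(s)
                + " utf8Bytes=" + utf8Bytes(s));
    }

    public static void main(String[] args) {
        // U+1D11E (musical G clef) lies outside the Basic Multilingual
        // Plane, so UTF-16 stores it as a surrogate pair.
        String clef = new String(Character.toChars(0x1D11E));
        // Letter e followed by U+0301, the combining acute accent.
        String combining = "e\u0301";

        report("supplementary character", clef);
        report("combining sequence", combining);
    }
}
```

The surrogate pair shows up only in the UTF-16 count, while the
combining sequence stays two code points in every encoding, which is
the case that normalization is meant to address.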

-Yonik



Re: Unicode Normalization
user name
2007-04-11 22:14:47
Yonik Seeley wrote:
>> have no idea how java's String class handles this--I doubt it does any
>> intelligent normalization.
> 
> UTF-16 surrogates are handled as of Java5.

And as of Java6 we have the java.text.Normalizer utility.
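
A minimal sketch of that utility, assuming Java 6 or later. The class
NormalizeSketch and its method nfkd are hypothetical names; only
java.text.Normalizer and Normalizer.Form.NFKD come from the JDK. It
shows two spellings of the same word that differ as raw strings but
match after NFKD normalization, which is the kind of inconsistency
described at the start of this thread.

```java
import java.text.Normalizer;

// Hypothetical helper class for illustration only.
final class NormalizeSketch {
    // Normalize to compatibility decomposition (NFKD).
    static String nfkd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKD);
    }

    public static void main(String[] args) {
        // Same word: precomposed e-acute (U+00E9) versus
        // plain e followed by the combining acute accent (U+0301).
        String composed = "caf\u00e9";
        String decomposed = "cafe\u0301";

        // The raw strings are not equal.
        System.out.println(composed.equals(decomposed));             // false
        // After NFKD both are in the same decomposed form.
        System.out.println(nfkd(composed).equals(nfkd(decomposed))); // true

        // NFKD also folds compatibility characters, for example the
        // single-character "fi" ligature (U+FB01) becomes two letters.
        System.out.println(nfkd("\uFB01").length());                 // 2
    }
}
```

Applying the same call to both the indexed text and the query text,
before either reaches the analyzer, is one way to follow the earlier
suggestion of normalizing when the String is first constructed.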

Daniel



-- 
Daniel Noll

Nuix Pty Ltd
Suite 79, 89 Jones St, Ultimo NSW 2007, Australia
Ph: +61 2 9280 0699    Fax: +61 2 9212 6902
Web: http://nuix.com/




Re: Unicode Normalization
user name
2007-04-11 22:32:45
On 4/11/07, Yonik Seeley <yonik@apache.org> wrote:
> On 4/11/07, Mike Klaas <mike.klaas@gmail.com> wrote:
> > Unicode characters do not map
> > precisely to code points:  a single character can often be represented
> > via a single codepoint or a combination of two (surrogate pair).
>
> I normally hear surrogates in the context of UTF-16 after the code
> point space became too large for UTF-16 to represent.  AFAIK it's more
> of an encoding thing, not a code point thing... for example, you would
> never see the surrogates if you encoded in UTF8 (although the
> surrogates are still code points since they needed to be reserved).

You're right.  Bringing up surrogate pairs just muddles the discussion.

> But there do seem to be groups of code points that map to a single
> character:
> http://en.wikipedia.org/wiki/Combining_character
>
> > have no idea how java's String class handles this--I doubt it does any
> > intelligent normalization.
>
> UTF-16 surrogates are handled as of Java5.

And it seems that character composition and normalization are built in
to Java 6:
http://weblogs.java.net/blog/joconner/archive/2007/02/normalization_c.html

-Mike



Re: Term frequency
user name
2007-04-12 02:12:06
Hi,
Thanks for replying. In my scenario I'm not going to index any of my
docs. So is there a way to find out term frequencies of the terms in a
doc without doing the indexing part?

Thanks in advance,
Hari

On 4/12/07, Grant Ingersoll <gsingers@apache.org> wrote:
>
> Add Term Vectors to your Field during indexing.  See the Field
> constructors.  To get a Term Vector out, see
> IndexReader.getTermFreqVector method.
>
> -Grant
>
> On Apr 11, 2007, at 3:23 PM, sai hariharan wrote:
>
> > Hi,
> > I've just started using Lucene. Can anybody assist me in calculating
> > the term frequencies of the terms(words) that occur in a document
> > (*.txt),
> > when a particular doc is submitted.
> >
> > Say when i submit sample.txt , i should first analyze the document
> > with a standard analyzer, then the term frequencies should be
> > calculated
> > for each and every term in that document.
> >
> > Thanks in advance
> > --
> > சாய் Hari


-- 
சாய் Hari
Re: Term frequency
user name
Sweden
2007-04-12 02:25:47
On 12 Apr 2007, at 09:12, sai hariharan wrote:

> Thanx for replying. In my scenario i'm not going to index any of my
> docs.
> So is there a way to find out term frequencies of the terms in a doc
> without doing the indexing part?

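Using an analyzer (TokenStream) and a Map<String, Integer>?

// ts is the TokenStream returned by analyzer.tokenStream(field, reader)
Map<String, Integer> map = new HashMap<String, Integer>();
Token t;
while ((t = ts.next()) != null) {
   // Token.termText() returns the analyzed term text
   Integer tf = map.get(t.termText());
   if (tf == null) {
     tf = 1;
   } else {
     tf++;
   }
   map.put(t.termText(), tf);
}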
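
The same counting idea can be tried without any Lucene classes. The
sketch below is a stand-in under stated assumptions, not Lucene code:
the class TermFrequencySketch is hypothetical, and its tokenizer
(lowercase, then split on anything that is not a letter, combining
mark, or digit) only approximates what an analyzer such as
StandardAnalyzer would emit, with no stop-word removal or stemming.
It assumes Java 8 or later for Map.merge.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for the analyzer + Map approach.
final class TermFrequencySketch {
    // Count how often each term occurs in the given text.
    static Map<String, Integer> termFrequencies(String text) {
        // LinkedHashMap keeps terms in order of first appearance.
        Map<String, Integer> counts = new LinkedHashMap<>();
        // Crude tokenizer: lowercase, then split on runs of characters
        // that are not letters, combining marks, or digits.
        String[] tokens = text.toLowerCase(Locale.ROOT)
                .split("[^\\p{L}\\p{M}\\p{N}]+");
        for (String token : tokens) {
            if (token.isEmpty()) {
                continue; // split yields "" when text starts with a delimiter
            }
            // Insert 1 for a new term, otherwise add 1 to the old count.
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> tf =
                termFrequencies("The cat saw the other cat.");
        // Prints {the=2, cat=2, saw=1, other=1}
        System.out.println(tf);
    }
}
```

For a file such as sample.txt, the text would first be read into a
String and then passed to termFrequencies in the same way.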


-- 
karl



