|
Thread: Term frequency
|
|
| Term frequency |

|
2007-04-11 14:23:24 |
Hi,
I've just started using Lucene. Can anybody help me calculate
the term frequencies of the terms (words) that occur in a
document (*.txt) when a particular doc is submitted?
Say I submit sample.txt: I should first analyze the document
with a standard analyzer, and then the term frequencies should
be calculated for each and every term in that document.
Thanks in advance
--
சாய் Hari
|
|
| Unicode Normalization |
  United States |
2007-04-11 15:00:40 |
Hi.
I have encountered a problem searching in my application
because of inconsistent Unicode normalization forms in the
corpus (and the queries). I would like to normalize to form
NFKD in an analyzer (I think). I was thinking about creating
a filter, similar to the LowerCaseFilter, that would do the
Unicode normalization. Then I will add that filter to my
existing Snowball analyzer. I am about to embark on creating
said analyzer/filter using the ICU (http://icu-project.org/)
icu4j jar.
Is this already accounted for somewhere in standard Lucene,
and I'm just missing it?
Is there anything similar out there?
Any other advice?
Thanks,
Dave Wooodward
Library of Congress
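
The filter idea described above can be sketched without Lucene or ICU4J on the classpath. This is only an illustration under stated assumptions: the JDK's java.text.Normalizer (Java 6 and later) is swapped in for ICU4J, and NormalizingFilter is a hypothetical name that wraps a plain Iterator<String> as a stand-in for a Lucene TokenFilter. The point it shows is that running the same filter over corpus tokens and query tokens makes differently encoded forms of a word compare equal.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a Lucene TokenFilter: wraps a token source
// and rewrites every token to Unicode normalization form NFKD.
class NormalizingFilter implements Iterator<String> {
    private final Iterator<String> input;

    NormalizingFilter(Iterator<String> input) {
        this.input = input;
    }

    public boolean hasNext() {
        return input.hasNext();
    }

    public String next() {
        // JDK normalizer used here instead of ICU4J (an assumption of this sketch).
        return Normalizer.normalize(input.next(), Normalizer.Form.NFKD);
    }

    public void remove() {
        throw new UnsupportedOperationException();
    }

    // Convenience helper: run the filter over some tokens and collect the result.
    static List<String> analyze(String... tokens) {
        List<String> out = new ArrayList<String>();
        Iterator<String> it = new NormalizingFilter(Arrays.asList(tokens).iterator());
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> corpus = analyze("Caf\u00E9");  // precomposed e-acute
        List<String> query = analyze("Cafe\u0301");  // e + combining acute accent
        System.out.println(corpus.equals(query));
    }
}
```

In a real analyzer chain the same step would sit next to the lower-casing filter, so that indexing and query parsing both pass through it.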
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe lucene.apache.org
For additional commands, e-mail: java-user-help lucene.apache.org
|
|
| Re: Term frequency |
  United States |
2007-04-11 15:29:19 |
Add Term Vectors to your Field during indexing. See the Field
constructors. To get a Term Vector out, see
IndexReader.getTermFreqVector method.
-Grant
On Apr 11, 2007, at 3:23 PM, sai hariharan wrote:
> Hi,
> I've just started using Lucene. Can anybody assist me in calculating
> the term frequencies of the terms (words) that occur in a document
> (*.txt) when a particular doc is submitted.
>
> Say when i submit sample.txt, i should first analyze the document
> with a standard analyzer, then the term frequencies should be
> calculated for each and every term in that document.
>
> Thanks in advance
> --
> சாய் Hari
--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org
Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ
|
|
| Re: Unicode Normalization |
  United States |
2007-04-11 17:48:03 |
: I have encountered a problem searching in my application because of
: inconsistent unicode normalization forms in the corpus (and the
: queries). I would like to normalize to form NFKD in an analyzer (I
: think). I was thinking about creating a filter similar to the

i'm very naive about the multitude of issues with charsets and
character encodings, but isn't that a problem best solved when
first constructing the Java String or Reader object -- either from
a file on disk or from a network socket of some kind?

or am i misunderstanding your meaning of the word Normalization? at
first i thought you might be talking about something like the
ISOLatin1AccentFilter, but then i looked at the ICU url you mentioned
and it seems to be all about byte=>character issues ... that doesn't
sound like something you would really want to be doing in an Analyzer.
-Hoss
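
The byte=>character step mentioned above can be illustrated with a small sketch. It assumes Java 7 or later for StandardCharsets; CharsetDecodeDemo and decode are hypothetical names used only for this illustration. It shows that the charset has to be named when the String is built, and also that a correctly decoded String can still differ from another one in normalization form.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetDecodeDemo {
    // Decodes raw bytes into a String using an explicitly named charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "caf" followed by a precomposed e-acute, encoded as UTF-8 (5 bytes).
        byte[] utf8 = "caf\u00E9".getBytes(StandardCharsets.UTF_8);

        String right = decode(utf8, StandardCharsets.UTF_8);
        String wrong = decode(utf8, StandardCharsets.ISO_8859_1);

        System.out.println(right.equals("caf\u00E9")); // decoded with the right charset
        System.out.println(wrong.equals("caf\u00E9")); // decoded with the wrong charset
        // Same visible word written as e + combining accent: a different String.
        System.out.println(right.equals("cafe\u0301"));
    }
}
```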
|
|
| Re: Unicode Normalization |

|
2007-04-11 21:24:32 |
On 4/11/07, Chris Hostetter <hossman_lucene fucit.org> wrote:
>
> : I have encountered a problem searching in my application because of
> : inconsistent unicode normalization forms in the corpus (and the
> : queries). I would like to normalize to form NFKD in an analyzer (I
> : think). I was thinking about creating a filter similar to the
>
> i'm very naive about the multitude of issues with charsets and
> character encodings, but isn't that a problem best solved when
> first constructing the Java String or Reader object -- either from
> a file on disk or from a network socket of some kind?
>
> or am i misunderstanding your meaning of the word Normalization? at
> first i thought you might be talking about something like the
> ISOLatin1AccentFilter, but then i looked at the ICU url you mentioned
> and it seems to be all about byte=>character issues ... that doesn't
> sound like something you would really want to be doing in an Analyzer.

Unfortunately, there is a whole level of Unicode "encoding" issues
above the level of byte encoding. Unicode characters do not map
precisely to code points: a single character can often be represented
via a single code point or a combination of two (surrogate pair). I
have no idea how Java's String class handles this--I doubt it does any
intelligent normalization.
-Mike
|
|
| Re: Unicode Normalization |

|
2007-04-11 22:02:31 |
On 4/11/07, Mike Klaas <mike.klaas gmail.com> wrote:
> Unicode characters do not map
> precisely to code points: a single character can often be represented
> via a single codepoint or a combination of two (surrogate pair).

I normally hear about surrogates in the context of UTF-16, after the
code point space became too large for UTF-16 to represent. AFAIK it's
more of an encoding thing, not a code point thing... for example, you
would never see the surrogates if you encoded in UTF-8 (although the
surrogates are still code points, since they needed to be reserved).

But there do seem to be groups of code points that map to a single
character:
http://en.wikipedia.org/wiki/Combining_character

> have no idea how java's String class handles this--I doubt it does any
> intelligent normalization.

UTF-16 surrogates are handled as of Java5.
-Yonik
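
The distinction drawn above can be checked with a short sketch. It assumes Java 7 or later (for StandardCharsets; the code point methods themselves date from Java 5), and CodePointDemo is a hypothetical name. A character outside the Basic Multilingual Plane takes two UTF-16 code units but is one code point, while a base letter plus a combining accent is genuinely two code points.

```java
import java.nio.charset.StandardCharsets;

class CodePointDemo {
    // Number of UTF-16 code units, which is what String.length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points (surrogate pairs count once).
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of bytes when the string is encoded as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // U+1D11E MUSICAL SYMBOL G CLEF: one code point, stored as a surrogate pair.
        String clef = new String(Character.toChars(0x1D11E));
        // Latin small letter e followed by U+0301 COMBINING ACUTE ACCENT.
        String eAcute = "e\u0301";

        System.out.println(codeUnits(clef) + " units, " + codePoints(clef)
                + " code point(s), " + utf8Bytes(clef) + " UTF-8 bytes");
        System.out.println(codeUnits(eAcute) + " units, " + codePoints(eAcute)
                + " code point(s), " + utf8Bytes(eAcute) + " UTF-8 bytes");
    }
}
```

Surrogates are an artifact of the UTF-16 encoding, whereas the combining sequence is a normalization question that exists in every encoding.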
|
|
| Re: Unicode Normalization |

|
2007-04-11 22:14:47 |
Yonik Seeley wrote:
>> have no idea how java's String class handles this--I doubt it does any
>> intelligent normalization.
>
> UTF-16 surrogates are handled as of Java5.

And as of Java6 we have the java.text.Normalizer utility.
Daniel
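
A minimal sketch of the java.text.Normalizer utility mentioned above, assuming Java 6 or later. NormalizeDemo and nfkd are hypothetical names chosen for the example; Normalizer.normalize and Normalizer.Form.NFKD are the JDK's own.

```java
import java.text.Normalizer;

class NormalizeDemo {
    // Returns the NFKD (compatibility decomposition) form of the input.
    static String nfkd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKD);
    }

    public static void main(String[] args) {
        String composed = "\u00E9";     // e-acute as a single code point
        String decomposed = "e\u0301";  // e followed by a combining acute accent

        // The raw strings differ even though they render the same.
        System.out.println(composed.equals(decomposed));             // false
        // After normalization they are identical.
        System.out.println(nfkd(composed).equals(nfkd(decomposed))); // true
        // NFKD also unfolds compatibility characters such as the "fi" ligature U+FB01.
        System.out.println(nfkd("\uFB01"));                          // fi
    }
}
```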
--
Daniel Noll
Nuix Pty Ltd
Suite 79, 89 Jones St, Ultimo NSW 2007, Australia
Ph: +61 2 9280 0699
Web: http://nuix.com/
Fax: +61 2 9212 6902
|
|
| Re: Unicode Normalization |

|
2007-04-11 22:32:45 |
On 4/11/07, Yonik Seeley <yonik apache.org> wrote:
> On 4/11/07, Mike Klaas <mike.klaas gmail.com> wrote:
> > Unicode characters do not map
> > precisely to code points: a single character can often be represented
> > via a single codepoint or a combination of two (surrogate pair).
>
> I normally hear surrogates in the context of UTF-16 after the code
> point space became too large for UTF-16 to represent. AFAIK it's more
> of an encoding thing, not a code point thing... for example, you would
> never see the surrogates if you encoded in UTF8 (although the
> surrogates are still code points since they needed to be reserved).

You're right. Bringing up surrogate pairs just muddles the discussion.

> But there do seem to be groups of code points that map to a single
> character:
> http://en.wikipedia.org/wiki/Combining_character
>
> > have no idea how java's String class handles this--I doubt it does any
> > intelligent normalization.
>
> UTF-16 surrogates are handled as of Java5.

And it seems that character composition and normalization are built
into Java 6:
http://weblogs.java.net/blog/joconner/archive/2007/02/normalization_c.html
-Mike
|
|
| Re: Term frequency |

|
2007-04-12 02:12:06 |
Hi,
Thanks for replying. In my scenario I'm not going to index any of my
docs. So is there a way to find out the term frequencies of the terms
in a doc without doing the indexing part?
Thanks in advance,
Hari
On 4/12/07, Grant Ingersoll <gsingers apache.org> wrote:
>
> Add Term Vectors to your Field during indexing. See the Field
> constructors. To get a Term Vector out, see
> IndexReader.getTermFreqVector method.
>
> -Grant
>
> On Apr 11, 2007, at 3:23 PM, sai hariharan wrote:
>
> > Hi,
> > I've just started using Lucene. Can anybody assist me in calculating
> > the term frequencies of the terms (words) that occur in a document
> > (*.txt) when a particular doc is submitted.
> >
> > Say when i submit sample.txt, i should first analyze the document
> > with a standard analyzer, then the term frequencies should be
> > calculated for each and every term in that document.
> >
> > Thanks in advance
> > --
> > சாய் Hari
>
> --------------------------
> Grant Ingersoll
> Center for Natural Language Processing
> http://www.cnlp.org
>
> Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ
--
சாய் Hari
|
|
| Re: Term frequency |
  Sweden |
2007-04-12 02:25:47 |
On 12 Apr 2007, at 09:12, sai hariharan wrote:
> Thanx for replying. In my scenario i'm not going to index any of my
> docs.
> So is there a way to find out term frequencies of the terms in a doc
> without doing the indexing part?

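Using an analyzer (TokenStream) and a Map<String, Integer>?

    // ts is the TokenStream your analyzer returns for the document text;
    // map is a Map<String, Integer> from term text to its count so far.
    Token t;
    while ((t = ts.next()) != null) {
        Integer tf = map.get(t.termText());
        if (tf == null) {
            tf = 1;   // first occurrence of this term
        } else {
            tf++;
        }
        map.put(t.termText(), tf);
    }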
--
karl
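
The same counting loop can be tried without Lucene. The sketch below is an illustration under assumptions, not the approach from the reply itself: TermFrequencies and count are hypothetical names, and lower-casing plus splitting on anything that is not a letter or digit is a crude stand-in for a real analyzer (StandardAnalyzer does considerably more).

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class TermFrequencies {
    // Counts how often each term occurs in the text.
    // Tokenization here is a simple stand-in for a Lucene analyzer.
    static Map<String, Integer> count(String text) {
        Map<String, Integer> freq = new TreeMap<String, Integer>();
        String[] terms = text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+");
        for (String term : terms) {
            if (term.isEmpty()) {
                continue; // split() can yield a leading empty string
            }
            Integer tf = freq.get(term);
            freq.put(term, tf == null ? 1 : tf + 1);
        }
        return freq;
    }

    public static void main(String[] args) {
        // prints {dog=1, end=1, fox=1, lazy=1, quick=1, the=3}
        System.out.println(count("The quick fox. The lazy dog, the end."));
    }
}
```

To process sample.txt, read the file into a String with the correct charset first and pass it to count.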
|
|
|
|