
Thread: does anyone know of a 'smart' categorizing text pattern finder?




does anyone know of a 'smart' categorizing text pattern finder?
user name
2006-09-26 01:49:31
Hi,

I wonder if anyone here knows if there is a 'smart' text
pattern finder, ideally written in Java. The library I'm
looking for should be able to 'guess' the category of the
particular text on the page, most probably by finding
similarities between the bulk of the pages and a set of
templates.

E.g., many forums are powered by phpBB, which structures 99%
of the pages (except for some title pages & user profile
pages) in a very similar fashion (the page is broken into
blocks, each block is broken into further blocks, etc.). By
comparing many pages with each other (e.g., from the same
domain root: forum.springframework.org) it should be
possible to detect the common parts ('template decorations')
and the page-specific parts (actual content, like 'user
name' and 'posting body'). After that it should further be
possible, by comparing the 'template decorations' parts to a
set of templates, to 'guess' the nature of each of the
'page-specific' blocks (e.g., 'Vladimir Olenin' in the
left-side column will be marked as 'name', while whatever is
adjacent to this column is the post body).
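
A minimal sketch of the page comparison described above, assuming
each page has already been split into text blocks. The class and
method names are hypothetical, the sample page contents are made
up for illustration, and exact string equality stands in for real
similarity matching:

```java
import java.util.*;

/** Hypothetical sketch: separate shared template blocks from page-specific ones. */
class TemplateBlocks {

    /** Blocks that occur on at least minShare of the pages count as template decoration. */
    static Set<String> templateBlocks(List<List<String>> pages, double minShare) {
        Map<String, Integer> pageCount = new HashMap<>();
        for (List<String> page : pages) {
            // Count each block once per page, however often it repeats there.
            for (String block : new HashSet<>(page)) {
                pageCount.merge(block, 1, Integer::sum);
            }
        }
        Set<String> template = new HashSet<>();
        for (Map.Entry<String, Integer> e : pageCount.entrySet()) {
            if (e.getValue() >= minShare * pages.size()) {
                template.add(e.getKey());
            }
        }
        return template;
    }

    /** Whatever is not template decoration is treated as page-specific content. */
    static List<String> pageSpecific(List<String> page, Set<String> template) {
        List<String> content = new ArrayList<>();
        for (String block : page) {
            if (!template.contains(block)) {
                content.add(block);
            }
        }
        return content;
    }

    public static void main(String[] args) {
        // Made-up pages: the labels repeat everywhere, names and bodies do not.
        List<List<String>> pages = Arrays.asList(
            Arrays.asList("Forum Index", "Author", "Vladimir Olenin", "Message", "First post body", "Powered by phpBB"),
            Arrays.asList("Forum Index", "Author", "Another Poster", "Message", "Second post body", "Powered by phpBB"),
            Arrays.asList("Forum Index", "Author", "Third Poster", "Message", "Third post body", "Powered by phpBB"));
        Set<String> template = templateBlocks(pages, 0.8);
        // prints: [Vladimir Olenin, First post body]
        System.out.println(pageSpecific(pages.get(0), template));
    }
}
```

This only separates decoration from content; deciding which of the
remaining blocks is the 'name' and which is the 'posting body'
would still need the second step, matching against known templates.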

So, I wonder if anyone knows of a package capable of such
things. The primary goal, though, is simpler: to be able to
parse out just the posters' names from message boards.
Although the 'block category' can sometimes be derived from
the CSS class name of the tags around the text, very often
that is not the case.

Might Nutch have similar functionality built into their
crawler?

Thanks.

Vlad
does anyone know of a 'smart' categorizing text pattern finder?
user name
2006-11-21 22:46:01
Vladimir Olenin wrote:
> Hi,
>
> I wonder if anyone here knows if there is a 'smart'
> text pattern finder, ideally written in Java. The library
> I'm looking for should be able to 'guess' the category of
> the particular text on the page, most probably by finding
> similarities between the bulk of the pages and a set of
> templates.

This is another problem you can actually do pretty
well in Lucene itself.  Either index with your usual
analyzer or use the n-gram analyzers we wrote about
in LingPipe in Action.

Then create an index with a single pseudo-document
per topic, containing all the text you want to use
to train the topic.

Then run the document to classify as a query against
the index, and the highest scoring pseudo-document
is the most likely category according to token
match.
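
As a rough illustration of the token-match idea above, here is a
sketch in plain Java rather than Lucene itself. The class name,
method names and sample topics are all hypothetical, and a simple
count of shared tokens stands in for Lucene's scoring, which also
weights terms:

```java
import java.util.*;

/** Sketch only: scores a document against one pseudo-document per topic by token overlap. */
class TopicMatch {

    /** Lowercases and splits on non-letters; a crude stand-in for a real analyzer. */
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    /** Returns the topic whose pseudo-document shares the most query tokens, or null if none match. */
    static String classify(Map<String, String> pseudoDocs, String document) {
        List<String> query = tokens(document);
        String best = null;
        int bestScore = 0;
        for (Map.Entry<String, String> e : pseudoDocs.entrySet()) {
            Set<String> topicTokens = new HashSet<>(tokens(e.getValue()));
            int score = 0;
            for (String q : query) {
                if (topicTokens.contains(q)) {
                    score++;
                }
            }
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // One pseudo-document per topic, holding all of that topic's training text.
        Map<String, String> pseudoDocs = new LinkedHashMap<>();
        pseudoDocs.put("search", "index query analyzer token score ranking");
        pseudoDocs.put("cooking", "recipe oven flour butter sugar bake");
        // prints: search
        System.out.println(classify(pseudoDocs, "Which analyzer should I use to index my tokens?"));
    }
}
```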

You could also check out our more probabilistic
classifiers.  For instance, we have a classification
by topic demo:

http://www.alias-i.com/lingpipe/demos/tutorial/classify/read-me.html

And just about every other natural language platform
and most machine learning platforms do classification
(e.g., Mallet and MinorThird, both in Java).  For
general structured classification problems, you
might want to check out Weka.

- Bob Carpenter
   Alias-i

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

does anyone know of a 'smart' categorizing text pattern finder?
user name
2006-11-22 00:03:26
On Nov 21, 2006, at 5:46 PM, Bob Carpenter wrote:
> LingPipe in Action.

Now that's a book I'd love to own!






does anyone know of a 'smart' categorizing text pattern finder?
user name
2006-11-24 08:22:33
Does this book really exist? I googled and didn't find any
introduction to it.

2006/11/22, Erik Hatcher <erikehatchersolutions.com>:
>
>
> On Nov 21, 2006, at 5:46 PM, Bob Carpenter wrote:
> > LingPipe in Action.
>
> Now that's a book I'd love to own!
does anyone know of a 'smart' categorizing text pattern finder?
user name
2006-11-24 10:55:25
On Nov 24, 2006, at 3:22 AM, Jin Yiqing wrote:
> Does this book really exist? I googled and didn't find any
> introduction to it.


No, I'm sure Bob meant to say "Lucene in Action", to which he
contributed a wonderful case study on bits of LingPipe.

	Erik



