Thread: Parsing extra fields from an html page in the web.....




Parsing extra fields from an html page in the web.....
user name
2007-09-27 08:13:01
Hi,
We are working on an Indian-language search engine and are using
nutch-0.9 as the basic framework.

When HTML pages are parsed during the fetching phase, the HTML parser
that runs on each page extracts the title text, the meta tags, and the
outlinks. What do I need to do to add more fields, such as <author>,
<language>, and <script>, to the segments extracted from the web page?
If the data is unavailable in the page, we can load some default values.

Do I need to touch the actual parser code (the parser used here is a
neko-html parser, if I am not wrong), or can the additions be done from
within the Nutch code?

It would be a great help if you could guide me through this.

-- 
Pratyush Banerjee
SPO, CLIA
IIT Kharagpur
Re: Parsing extra fields from an html page in the web.....
user name
Poland
2007-09-27 14:29:26
In brief: you need to write an HtmlParseFilter, then an IndexingFilter
and a QueryFilter. You register them through extension points. Search
the USER (not dev) group; there are answers there already.
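
A rough sketch of what the registration step could look like in a
plugin's plugin.xml is given below. The three extension point names are
the ones Nutch 0.9 is understood to define; the plugin id, the jar name,
and the org.example.extrafields classes are hypothetical placeholders,
and the "fields" parameter on the query filter is an assumption modelled
on the bundled query plugins. Compare with a shipped plugin such as
creativecommons before relying on any of it.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical plugin descriptor: ids, jar name and classes are placeholders. -->
<plugin id="extra-fields" name="Extra Fields (example)"
        version="1.0.0" provider-name="example.org">

  <runtime>
    <!-- Jar that would contain the three filter classes. -->
    <library name="extra-fields.jar">
      <export name="*"/>
    </library>
  </runtime>

  <requires>
    <import plugin="nutch-extensionpoints"/>
  </requires>

  <!-- Parse time: read author/language/script from the page, or fall back to defaults. -->
  <extension id="org.example.extrafields.parse" name="Extra Fields Parse Filter"
             point="org.apache.nutch.parse.HtmlParseFilter">
    <implementation id="ExtraFieldsParseFilter"
                    class="org.example.extrafields.ExtraFieldsParseFilter"/>
  </extension>

  <!-- Index time: copy the parsed values into index fields. -->
  <extension id="org.example.extrafields.index" name="Extra Fields Indexing Filter"
             point="org.apache.nutch.indexer.IndexingFilter">
    <implementation id="ExtraFieldsIndexingFilter"
                    class="org.example.extrafields.ExtraFieldsIndexingFilter"/>
  </extension>

  <!-- Search time: allow queries such as author:someone (parameter is an assumption). -->
  <extension id="org.example.extrafields.query" name="Extra Fields Query Filter"
             point="org.apache.nutch.searcher.QueryFilter">
    <implementation id="ExtraFieldsQueryFilter"
                    class="org.example.extrafields.ExtraFieldsQueryFilter">
      <parameter name="fields" value="author"/>
    </implementation>
  </extension>

</plugin>
```

A plugin laid out like this would also need its id added to the
plugin.includes property in conf/nutch-site.xml, otherwise Nutch will not
load it.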

BTW, this question is asked over and over. It seems to be a good
subject to write up on the wiki.

Marcin


