Thread: Extracting html pages from db




Extracting html pages from db
LoneEagle70
2007-10-17 07:53:07
Hi,

I was able to install Nutch 0.9, crawl a site, and use the web page to
do full-text search of my db.

But we need to extract information from all the HTML pages.

So, is there a way to extract HTML pages from the db?
-- 
View this message in context: http://www.nabble.com/Extracting-html-pages-from-db-tf4640373.html#a13253122
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: Extracting html pages from db
Dennis Kubes
2007-10-17 11:40:25
It depends on what you are trying to do.  Content in segments stores
the full content (HTML, etc.) of each page.  The cached.jsp page
displays the full content.

Dennis Kubes



Re: Extracting html pages from db
LoneEagle70
2007-10-17 12:20:31
I do not want to do it through the web app.

Is there a way to extract all the HTML files into a directory from the
command line, like displaying stats? I tried the dump, but it was not
what I wanted. I really want only the HTML pages so I can take
information from them.

Here is my problem: we are looking for a program that will do web
crawling, but it must be customized for each site we need, because
those pages are generated based on parameters. We also need to extract
information (product, price, manufacturer, ...). So, if you have
experience with Nutch, you could help me out. Can I customize it
through hooks? What can and can't I do?

Thanks for your help!



Re: Extracting html pages from db
Dennis Kubes
2007-10-17 12:30:05
Pulling out specific information for each site could be done through
HtmlParseFilter implementations.  Look at
org.apache.nutch.parse.HtmlParseFilter and its implementations.  The
specific fields you extract can be stored in MetaData in ParseData.
You can then access that information in other jobs such as the
indexer.  Hope this helps.

Dennis Kubes
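
[Editor's sketch] Before committing to a filter plugin, the per-site
extraction rules can be prototyped outside Nutch on a saved page. The
Python below is only an illustration of that idea and does not use the
Nutch API; the class names in FIELD_CLASSES and the function
extract_fields are hypothetical and would have to match each site's
actual markup.

```python
from html.parser import HTMLParser

# Hypothetical mapping from a CSS class on the page to an output field name.
# Real sites will use different markup; adjust per site.
FIELD_CLASSES = {
    "product-name": "product",
    "price": "price",
    "brand": "manufacturer",
}


class FieldExtractor(HTMLParser):
    """Collects the text that follows a tag whose class names a known field."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._current = None  # field being captured right now, if any

    def handle_starttag(self, tag, attrs):
        # A valueless class attribute is reported as None, hence the "or".
        classes = (dict(attrs).get("class") or "").split()
        for cls in classes:
            if cls in FIELD_CLASSES:
                self._current = FIELD_CLASSES[cls]
                self.fields.setdefault(self._current, "")

    def handle_data(self, data):
        if self._current is not None:
            self.fields[self._current] += data

    def handle_endtag(self, tag):
        # Simplification: capturing stops at the next closing tag of any kind.
        self._current = None


def extract_fields(html):
    """Return a dict of field name to trimmed text for one HTML page."""
    parser = FieldExtractor()
    parser.feed(html)
    return {name: text.strip() for name, text in parser.fields.items()}
```

Once the rules work on sample pages, the same logic could be ported
into an HtmlParseFilter implementation so the fields end up in the
parse metadata, as described above.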


Re: Extracting html pages from db
LoneEagle70
2007-10-17 12:42:40
Do you have any idea how to extract all the HTML files stored in the
db from the command line?



Re: Extracting html pages from db
Dennis Kubes
2007-10-17 12:51:25
There is currently no way to do that.  You would need to write a map
job to pull the data from Content within Segments.

Dennis Kubes


Re: Extracting html pages from db
Jim
2007-10-17 14:23:34
Hello-

    I've done this. I think it is

    nutch readseg -dump <segment_dir> <dumpfile>

to dump all the HTML of everything in a segment.  You can also specify
which URL you are interested in; type nutch readseg for details.

                        see you
                            -Jim
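
[Editor's sketch] The dump produced this way is one text file covering
every record, so getting one HTML file per page takes a small
post-processing step. The Python below is a minimal sketch of that
step. It assumes each record starts with a "Recno:: " line followed by
a "URL:: " line, and that the raw page follows a line reading
"Content:" until the next section header or record. Those markers, and
the helper names split_dump and write_pages, are assumptions to check
against your own dump file.

```python
import re
from pathlib import Path

# Assumed markers in the readseg text dump; verify against a real dump.
RECORD_START = "Recno:: "
URL_PREFIX = "URL:: "
BODY_START = "Content:"
SECTION_HEADERS = {"CrawlDatum::", "Content::", "ParseData::", "ParseText::"}


def split_dump(dump_text):
    """Return {url: html} for each record found in the dump text."""
    pages = {}
    url = None
    body = None  # list of lines while inside a page body, otherwise None
    for line in dump_text.splitlines():
        if line.startswith(RECORD_START) or line.strip() in SECTION_HEADERS:
            # A new record or section ends any page body in progress.
            if url is not None and body is not None:
                pages[url] = "\n".join(body).strip()
            body = None
            if line.startswith(RECORD_START):
                url = None
        elif body is not None:
            body.append(line)
        elif line.startswith(URL_PREFIX):
            url = line[len(URL_PREFIX):].strip()
        elif line.strip() == BODY_START:
            body = []
    if url is not None and body is not None:
        pages[url] = "\n".join(body).strip()
    return pages


def write_pages(pages, out_dir):
    """Write each page to out_dir under a file name derived from its URL."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for url, html in pages.items():
        # Replace characters that are unsafe in file names; long or very
        # similar URLs can collide after this step.
        name = re.sub(r"[^A-Za-z0-9._-]+", "_", url).strip("_")[:150]
        (out / (name + ".html")).write_text(html, encoding="utf-8")
```

Typical use would be to read the dump file written by readseg, pass its
text to split_dump, and hand the result to write_pages with a target
directory. Binary content and pages in other encodings would need more
care than this sketch takes.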



