Thread: Use CrawlDb as a metadata Db?




Use CrawlDb as a metadata Db?
user name
2006-08-22 17:38:31
If I am not wrong, the segments produced by the Generator contain some
sort of CrawlDatum.
I am putting metadata in the CrawlDb (information that never changes),
and I think it is copied to the segments by the Generator.

But now I want to access that metadata at the parsing or indexing step,
to put some of it in the ParseData that was extracted (or directly in
the index).

I can't find a way to reassociate the "Content" and the Parse object
with their respective CrawlDb/segment entries.

Basically, I am trying to use the CrawlDb as a database of metadata for
every URL, and I want to use that metadata at the indexing step to
enrich the ParseData so that I can search against it later on.

Silly example: I know this URL is associated with the color "blue", but
that information is not in the page the URL points to. "blue" would be
kept in the CrawlDb metadata, the generate/fetch/parse steps would run
as usual, and at indexing time "blue" should be reassociated with the
ParseData extracted from the page.

Is this feasible without changing anything in Nutch? (I use Nutch more
or less as a library and avoid changing it; I prefer writing my own
injector/generator/fetcher/parser, formats, etc. if needed.)

I am going through the different classes in Nutch/Hadoop now to
understand where things are stored, whether they are read, and what
kind of object they end up in.
Any pointer to shorten my reading is very welcome ;)

Thanks!

Use CrawlDb as a metadata Db?
user name
2006-08-30 08:06:36
HUYLEBROECK Jeremy RD-ILAB-SSF wrote:
> If I am not wrong, the segments produced by the Generator contain some
> sort of CrawlDatum.
> I am putting metadata in the CrawlDb (information that never changes),
> and I think it is copied to the segments by the Generator.
>
> But now I want to access that metadata at the parsing or indexing step,
> to put some of it in the ParseData that was extracted (or directly in
> the index).
>
> I can't find a way to reassociate the "Content" and the Parse object
> with their respective CrawlDb/segment entries.
>
> Basically, I am trying to use the CrawlDb as a database of metadata for
> every URL, and I want to use that metadata at the indexing step to
> enrich the ParseData so that I can search against it later on.
>
> Silly example: I know this URL is associated with the color "blue", but
> that information is not in the page the URL points to. "blue" would be
> kept in the CrawlDb metadata, the generate/fetch/parse steps would run
> as usual, and at indexing time "blue" should be reassociated with the
> ParseData extracted from the page.
>
> Is this feasible without changing anything in Nutch? (I use Nutch more
> or less as a library and avoid changing it; I prefer writing my own
> injector/generator/fetcher/parser, formats, etc. if needed.)
>
> I am going through the different classes in Nutch/Hadoop now to
> understand where things are stored, whether they are read, and what
> kind of object they end up in.
> Any pointer to shorten my reading is very welcome ;)
>
> Thanks!
>
hi,

The CrawlDatum keeps crawl status information about every URL that is
fetched. The class has a metadata field, an instance of MapWritable,
which behaves much like a HashMap. I have used the metadata field for
similar purposes. For example, in the fetcher you can set a property
like this:

datum.getMetaData().put(<key>,<value>);

and then in the indexing plugin you could retrieve it with:
 
datum.getMetaData().get(<key>);
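
To make the put/get pattern above concrete, here is a minimal,
self-contained sketch using the "blue" example from the original
question. It does not use the Nutch classes: Datum is a hypothetical
stand-in for CrawlDatum, a plain HashMap stands in for MapWritable, and
tag, buildIndexFields and the "title" field are illustrative names only.
With the real MapWritable, keys and values would presumably need to be
Writable objects rather than Strings, so check that against your Nutch
version.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

class MetadataFlowSketch {

    // Hypothetical stand-in for CrawlDatum: only the metadata map is modeled.
    // Assumption: a HashMap replaces Nutch's MapWritable for illustration.
    static class Datum {
        private final Map<String, String> metaData = new HashMap<>();

        Map<String, String> getMetaData() {
            return metaData;
        }
    }

    // Earlier step (injection or fetch): attach a per-URL fact to the datum.
    static void tag(Datum datum, String key, String value) {
        datum.getMetaData().put(key, value);
    }

    // Indexing step: copy selected metadata keys next to the parsed fields.
    // The returned map stands in for the index document (an assumption).
    static Map<String, String> buildIndexFields(Datum datum, String parsedTitle,
                                                String... keys) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("title", parsedTitle);
        for (String key : keys) {
            String value = datum.getMetaData().get(key);
            if (value != null) { // skip keys this URL never carried
                fields.put(key, value);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        Datum datum = new Datum();
        tag(datum, "color", "blue");
        // "shape" was never set, so only "title" and "color" end up in the result.
        Map<String, String> fields =
                buildIndexFields(datum, "Some page title", "color", "shape");
        System.out.println(fields);
    }
}
```

The point of the sketch is only that the value travels with the datum
for the URL, so the indexing code does not need to look it up in the
page content.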





Fetch error
user name
2006-08-30 08:17:21
I updated Hadoop, but now I get the following error at the fetch step
(reduce):

06/08/29 08:31:20 INFO mapred.TaskTracker: task_0003_r_000000_3 0.33333334% reduce > copy (6 of 6 at 11.77 MB/s)
06/08/29 08:31:20 WARN /: /getMapOutput.jsp?map=task_0003_m_000002_0&reduce=1:
java.lang.IllegalStateException
        at org.mortbay.jetty.servlet.ServletHttpResponse.getWriter(ServletHttpResponse.java:561)
        at org.apache.jasper.runtime.JspWriterImpl.initOut(JspWriterImpl.java:122)
        at org.apache.jasper.runtime.JspWriterImpl.flushBuffer(JspWriterImpl.java:115)
        at org.apache.jasper.runtime.PageContextImpl.release(PageContextImpl.java:190)
        at org.apache.jasper.runtime.JspFactoryImpl.internalReleasePageContext(JspFactoryImpl.java:115)
        at org.apache.jasper.runtime.JspFactoryImpl.releasePageContext(JspFactoryImpl.java:75)
        at org.apache.jsp.getMapOutput_jsp._jspService(getMapOutput_jsp.java:100)
        at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:94)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
        at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:324)
        at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:292)
        at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:236)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
        at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:427)
        at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:475)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:567)
        at org.mortbay.http.HttpContext.handle(HttpContext.java:1565)
        at org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext.java:635)
        at org.mortbay.http.HttpContext.handle(HttpContext.java:1517)
        at org.mortbay.http.HttpServer.service(HttpServer.java:954)
        at org.mortbay.http.HttpConnection.service(HttpConnection.java:814)
        at org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:981)
        at org.mortbay.http.HttpConnection.handle(HttpConnection.java:831)
        at org.mortbay.http.SocketListener.handleConnection(SocketListener.java:244)
        at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:357)
        at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:534)


How can I fix this? The generate step works fine, but in the fetch
reduce phase I get this error and the task fails.



Fetch error
user name
2006-08-30 11:18:14
The previous error came from the tasktracker log. In the jobtracker log
I now see the following error:

06/08/30 01:04:07 INFO mapred.TaskInProgress: Error from task_0001_r_000000_1: java.lang.AbstractMethodError:
org.apache.nutch.fetcher.FetcherOutputFormat.getRecordWriter(Lorg/apache/hadoop/fs/FileSystem;Lorg/apache/hadoop/mapred/JobConf;Ljava/lang/String;Lorg/apache/hadoop/util/Progressable;)Lorg/apache/hadoop/mapred/RecordWriter;
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:297)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1075)







