HUYLEBROECK Jeremy RD-ILAB-SSF wrote:
> If I am not wrong, segments generated by the Generator are some sort of
> CrawlDatum.
> I am putting metadata in the CrawlDb (I keep information that never
> changes) and I think it is copied to the segments by the Generator.
>
> But now I want to access that metadata at the parsing or indexing step
> to put some of it in the ParseData that was extracted (or directly in
> the index).
>
> I can't find a way to reassociate the "Content" and the Parse object to
> their respective CrawlDb/Segment.
>
> Basically, I am trying to use the CrawlDb as a database of metadata for
> every URL and want to use it at the indexing step to enrich the
> ParseData and then be able to search against it later on.
>
> Stupid example: I know this URL is associated with the color "blue", but
> I don't have this information in the page pointed to by this URL. Blue
> would be kept in the metadata of the CrawlDb, then the
> generator/fetch/parse steps are done as usual, but when indexing, blue
> should be reassociated with the ParseData that has been extracted from
> the page.
>
> Is it feasible without changing anything in Nutch? (I use Nutch as a
> library, more or less, and avoid changing stuff in it; I prefer redoing
> my own injector/generator/fetcher/parser and formats etc. if needed.)
>
> I am going through all the different classes in Nutch/Hadoop now to
> understand where things are, whether they are read, and in what kind of
> object they are put.
> Any pointer to shorten my reading is very welcome ;)
>
> Thanks!
>
>
>
Hi,
The CrawlDatum keeps crawl status information about every URL that is
fetched. The class has a metadata field, which is an instance of
MapWritable and behaves much like a HashMap. I have used the metadata
field for similar purposes. For example, in the fetcher you can set a
property like this:
datum.getMetaData().put(<key>,<value>);
and then in the indexing plugin you can retrieve it with:
datum.getMetaData().get(<key>);
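To make the pattern concrete, here is a minimal, self-contained sketch of the "blue" example. It deliberately does not use the Nutch classes: Datum is a hypothetical stand-in for CrawlDatum, a plain HashMap stands in for MapWritable (whose keys and values are Writable objects rather than Strings), and fetchStep, indexStep and the "color" key are made-up names for illustration only.

```java
import java.util.HashMap;
import java.util.Map;

class MetadataSketch {
    // Hypothetical stand-in for CrawlDatum: only the metadata map is modeled.
    static class Datum {
        private final Map<String, String> metaData = new HashMap<>();

        Map<String, String> getMetaData() {
            return metaData;
        }
    }

    // Early step (e.g. the fetcher): attach a value that is known
    // outside the page itself.
    static void fetchStep(Datum datum, String color) {
        datum.getMetaData().put("color", color);
    }

    // Later step (e.g. indexing): copy the value, if present, into the
    // fields that would be added to the index.
    static Map<String, String> indexStep(Datum datum) {
        Map<String, String> fields = new HashMap<>();
        String color = datum.getMetaData().get("color");
        if (color != null) {
            fields.put("color", color);
        }
        return fields;
    }

    public static void main(String[] args) {
        Datum datum = new Datum();
        fetchStep(datum, "blue");
        System.out.println(indexStep(datum));
    }
}
```

With the real classes you would wrap keys and values in Writable types before calling put, and the indexing side only sees the value if the same datum actually reaches it.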