List Info

Thread: Scoring API issues (LONG)




Scoring API issues (LONG)
country flaguser name
Poland
2007-09-13 10:44:32
Hi all,

I've been working recently on a custom scoring plugin, and I
found out 
some issues with the scoring API that severely limit the way
we can 
calculate static page scores. I'd like to restart the
discussion about 
this API, and propose some changes. Any comments or
suggestions are welcome!


Currently we use the ScoringFilters API in several places in
the code, 
in order to adjust page scores during different operations
(such as 
generating, parsing, updating the db, and indexing). The API
can be 
divided into two distinct parts - the part that prioritizes
pages to be 
re-fetched, and the part that calculates static page scores
that convey 
the idea of query-independent page "importance" or
"quality". Of these 
two, the second one is more complex and requires better
support in the code.

One general comment: all of the issues below could be
handled by 
special-purpose tools that execute several map-reduce jobs
to read the 
data, build intermediate databases, and update crawldb /
linkdb as 
necessary. However, it would be good to investigate if we
can come up 
with an abstraction that is flexible enough to do this for
different 
scoring algorithms, and still doesn't require running many
additional jobs.

1. Partial link graph issue
---------------------------
Historically speaking, this API was abstracted from the
existing Nutch 
code, which implemented the OPIC-like scoring. As such, it
works on the 
premise that it's possible to correctly calculate a static
page score 
given only the following information:

* page history recorded in CrawlDb,
* new page information from the current segment,
* partial inlink information available in the segments
currently being 
processed (in updatedb).

This works well for the OPIC (if even there - the jury's
still out ;) ), 
but not for many other scoring algorithms, including the
most popular 
ones like PageRank or HITS, or even the inlink-degree
algorithm from 
Nutch 0.7. Those algorithms require processing of the
complete web graph 
information (i.e. the inlink / outlink info) that's been
collected so far.

I propose the following change: let's update the LinkDb in
the same step 
as CrawlDb, so that both are up-to-date with the most
current 
information after finishing the updatedb step. In terms of
the input / 
output data involved, this change does increase the amount
of data to be 
processed, by the size of LinkDb. However, the benefit of
this change is 
that for each page to be updated we have access to its
complete inlink 
information, which enables us to use other scoring
algorithms that 
require this data. Also, we don't have to run invertlinks
anymore.

So, the updatedb process would look like this:

INPUT:
	- CrawlDb: <Text url, CrawlDatum dbDatum>
	- LinkDb:  <Text url, Inlinks inlinks>
         - segment/crawl_fetch:
		   <Text url, CrawlDatum fetchDatum>
         - segment/crawl_parse:
		   <Text url, CrawlDatum inlink>

MAP: simply collects the input data in records that are able
to 
accommodate all this information:

	<Text url, <Inlinks oldInlinks, List newInlinks,
		CrawlDatum dbDatum, CrawlDatum fetchDatum> >

REDUCE: uses a modified version of CrawlDbReducer, which
first collapses 
all incoming records to a single record in the above format,
i.e. it 
collects all incoming records and fills in the slots for
dbDatum, 
fetchDatum, oldInlinks and newInlinks. The we pretty much
reuse the rest 
of the existing logic in CrawlDbReducer - but at the end of
the reduce() 
we can pass the full inlink information to
ScoringFilters.updateDbScore. 
Finally, we aggregate all inlink information and output the
following 
record:

	<Text url, <Inlinks newInlinks, CrawlDatum
newDbDatum> >

OUTPUT: we use a special OutputFormat that splits output
records into 
<url, newInlinks> and <url, newDbDatum> and
creates new versions of both 
CrawlDb and LinkDb.

2. Lack of global properties
----------------------------
Neither CrawlDb, nor LinkDb, nor segments keep around the
most basic 
global statistics about them. Currently, in order to tell
how many pages 
we have in the db it's necessary to run a mapred job (readdb
-stats) - 
even though this information is static and could've been
calculated in 
advance for a given generation of db-s or for each segment.
This 
complicates even simple tasks such as readseg -list, and
makes it 
difficult to keep around global score-related statistics for
db-s and 
segments.

So, I propose to add a metadata file located in a well-known
location 
inside the db or segment directory. The file would be based
on a single 
MapWritable that contains arbitrary keys/value, including
predefined 
ones such as the number of records, last update time etc. We
would need 
to maintain it for each db and each segment. Each operation
that changes 
a db or a segment would update this information.

In practial terms, I propose to add static methods to
CrawlDbReader, 
LinkDbReader and SegmentReader, which can retrieve and / or
update this 
information.

3. Initialization of scoring plugins with global
information
------------------------------------------------------------

Current scoring API works only with local properties of the
page (I'm 
not taking into account plugins that use external
information sources - 
that's outside of the scope of the API). It doesn't have any
built-in 
facilities to collect and calculate global properties useful
for PR or 
HITS calculation, such as e.g. the number of dangling nodes
(ie. pages 
without outlinks), their total score, the number of inlinks,
etc. It 
doesn't have the facility to output this collected global
information at 
the end of the job. Neither has it any facility to
initialize scoring 
plugins with such information if one exists.

I propose to add the following methods to scoring plugins,
so that they 
can modify the job configuration right before the job is
started, so 
that later on the plugins could use this information when
scoring 
filters are initialized in each task. E.g:

public void prepareInjectorConfig(Path crawlDb, Path urls,
Configuration 
config);
public void prepareGeneratorConfig(Path crawlDb,
Configuration config);
public void prepareIndexerConfig(Path crawlDb, Path linkDb,
Path[] 
segments, Configuration config);
public void prepareUpdateConfig(Path crawlDb, Path[]
segments, 
Configuration config);

Example: to properly implement the OPIC scoring, it's
necessary to 
collect the total number of dangling nodes, and the total
score from 
these nodes. Then, in the next step it's necessary to spread
this total 
score evenly among all other nodes in the crawldb. Currently
this is not 
possible unless we run additional jobs, and create
additional files to 
keep this data around between the steps. It would be more
convenient to 
keep this data in CrawlDb metadata (see above) and make
relevant values 
available in the job context (Configuration).


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||/|  Information Retrieval, Semantic Web
___|||__||  |  ||  |  Embedded Unix, System Integration
http://www.sigram.com 
Contact: info at sigram dot com


Re: Scoring API issues (LONG)
user name
2007-09-18 14:40:23
Hi,

I think the ideas here are brilliant. A big +1 from me. I
have one
minor suggestion that I detail below.

On 9/13/07, Andrzej Bialecki <abgetopt.org> wrote:
> Hi all,
>
> I've been working recently on a custom scoring plugin,
and I found out
> some issues with the scoring API that severely limit
the way we can
> calculate static page scores. I'd like to restart the
discussion about
> this API, and propose some changes. Any comments or
suggestions are welcome!
>
>
> Currently we use the ScoringFilters API in several
places in the code,
> in order to adjust page scores during different
operations (such as
> generating, parsing, updating the db, and indexing).
The API can be
> divided into two distinct parts - the part that
prioritizes pages to be
> re-fetched, and the part that calculates static page
scores that convey
> the idea of query-independent page
"importance" or "quality". Of these
> two, the second one is more complex and requires better
support in the code.
>
> One general comment: all of the issues below could be
handled by
> special-purpose tools that execute several map-reduce
jobs to read the
> data, build intermediate databases, and update crawldb
/ linkdb as
> necessary. However, it would be good to investigate if
we can come up
> with an abstraction that is flexible enough to do this
for different
> scoring algorithms, and still doesn't require running
many additional jobs.
>
> 1. Partial link graph issue
> ---------------------------
> Historically speaking, this API was abstracted from the
existing Nutch
> code, which implemented the OPIC-like scoring. As such,
it works on the
> premise that it's possible to correctly calculate a
static page score
> given only the following information:
>
> * page history recorded in CrawlDb,
> * new page information from the current segment,
> * partial inlink information available in the segments
currently being
> processed (in updatedb).
>
> This works well for the OPIC (if even there - the
jury's still out ;) ),
> but not for many other scoring algorithms, including
the most popular
> ones like PageRank or HITS, or even the inlink-degree
algorithm from
> Nutch 0.7. Those algorithms require processing of the
complete web graph
> information (i.e. the inlink / outlink info) that's
been collected so far.
>
> I propose the following change: let's update the LinkDb
in the same step
> as CrawlDb, so that both are up-to-date with the most
current
> information after finishing the updatedb step. In terms
of the input /
> output data involved, this change does increase the
amount of data to be
> processed, by the size of LinkDb. However, the benefit
of this change is
> that for each page to be updated we have access to its
complete inlink
> information, which enables us to use other scoring
algorithms that
> require this data. Also, we don't have to run
invertlinks anymore.
>
> So, the updatedb process would look like this:
>
> INPUT:
>         - CrawlDb: <Text url, CrawlDatum dbDatum>
>         - LinkDb:  <Text url, Inlinks inlinks>
>          - segment/crawl_fetch:
>                    <Text url, CrawlDatum
fetchDatum>
>          - segment/crawl_parse:
>                    <Text url, CrawlDatum inlink>
>
> MAP: simply collects the input data in records that are
able to
> accommodate all this information:
>
>         <Text url, <Inlinks oldInlinks, List
newInlinks,
>                 CrawlDatum dbDatum, CrawlDatum
fetchDatum> >
>
> REDUCE: uses a modified version of CrawlDbReducer,
which first collapses
> all incoming records to a single record in the above
format, i.e. it
> collects all incoming records and fills in the slots
for dbDatum,
> fetchDatum, oldInlinks and newInlinks. The we pretty
much reuse the rest
> of the existing logic in CrawlDbReducer - but at the
end of the reduce()
> we can pass the full inlink information to
ScoringFilters.updateDbScore.
> Finally, we aggregate all inlink information and output
the following
> record:
>
>         <Text url, <Inlinks newInlinks,
CrawlDatum newDbDatum> >
>
> OUTPUT: we use a special OutputFormat that splits
output records into
> <url, newInlinks> and <url, newDbDatum> and
creates new versions of both
> CrawlDb and LinkDb.
>
> 2. Lack of global properties
> ----------------------------
> Neither CrawlDb, nor LinkDb, nor segments keep around
the most basic
> global statistics about them. Currently, in order to
tell how many pages
> we have in the db it's necessary to run a mapred job
(readdb -stats) -
> even though this information is static and could've
been calculated in
> advance for a given generation of db-s or for each
segment. This
> complicates even simple tasks such as readseg -list,
and makes it
> difficult to keep around global score-related
statistics for db-s and
> segments.
>
> So, I propose to add a metadata file located in a
well-known location
> inside the db or segment directory. The file would be
based on a single
> MapWritable that contains arbitrary keys/value,
including predefined
> ones such as the number of records, last update time
etc. We would need
> to maintain it for each db and each segment. Each
operation that changes
> a db or a segment would update this information.
>
> In practial terms, I propose to add static methods to
CrawlDbReader,
> LinkDbReader and SegmentReader, which can retrieve and
/ or update this
> information.
>
> 3. Initialization of scoring plugins with global
information
>
------------------------------------------------------------
> Current scoring API works only with local properties of
the page (I'm
> not taking into account plugins that use external
information sources -
> that's outside of the scope of the API). It doesn't
have any built-in
> facilities to collect and calculate global properties
useful for PR or
> HITS calculation, such as e.g. the number of dangling
nodes (ie. pages
> without outlinks), their total score, the number of
inlinks, etc. It
> doesn't have the facility to output this collected
global information at
> the end of the job. Neither has it any facility to
initialize scoring
> plugins with such information if one exists.
>
> I propose to add the following methods to scoring
plugins, so that they
> can modify the job configuration right before the job
is started, so
> that later on the plugins could use this information
when scoring
> filters are initialized in each task. E.g:
>
> public void prepareInjectorConfig(Path crawlDb, Path
urls, Configuration
> config);
> public void prepareGeneratorConfig(Path crawlDb,
Configuration config);
> public void prepareIndexerConfig(Path crawlDb, Path
linkDb, Path[]
> segments, Configuration config);
> public void prepareUpdateConfig(Path crawlDb, Path[]
segments,
> Configuration config);

Should we really pass Path-s to methods? IMHO, opening a
file and
reading from it looks a bit cumbersome. I would suggest that
the
relevant job would read the file then pass the data
(MapWritable) to
the method. For example, prepareGeneratorConfig would look
like this:

public void prepareGeneratorConfig(MapWritable crawlDbMeta,
Configuration config);

>
> Example: to properly implement the OPIC scoring, it's
necessary to
> collect the total number of dangling nodes, and the
total score from
> these nodes. Then, in the next step it's necessary to
spread this total
> score evenly among all other nodes in the crawldb.
Currently this is not
> possible unless we run additional jobs, and create
additional files to
> keep this data around between the steps. It would be
more convenient to
> keep this data in CrawlDb metadata (see above) and make
relevant values
> available in the job context (Configuration).
>
>
> --
> Best regards,
> Andrzej Bialecki     <><
>   ___. ___ ___ ___ _ _  
__________________________________
> [__ || __|__/|__||/|  Information Retrieval, Semantic
Web
> ___|||__||  |  ||  |  Embedded Unix, System
Integration
> http://www.sigram.com 
Contact: info at sigram dot com
>
>


-- 
Doğacan Güney
Re: Scoring API issues (LONG)
country flaguser name
Poland
2007-09-18 15:12:27
Doğacan Güney wrote:

>> public void prepareInjectorConfig(Path crawlDb,
Path urls, Configuration
>> config);
>> public void prepareGeneratorConfig(Path crawlDb,
Configuration config);
>> public void prepareIndexerConfig(Path crawlDb, Path
linkDb, Path[]
>> segments, Configuration config);
>> public void prepareUpdateConfig(Path crawlDb,
Path[] segments,
>> Configuration config);
> 
> Should we really pass Path-s to methods? IMHO, opening
a file and
> reading from it looks a bit cumbersome. I would suggest
that the
> relevant job would read the file then pass the data
(MapWritable) to
> the method. For example, prepareGeneratorConfig would
look like this:
> 
> public void prepareGeneratorConfig(MapWritable
crawlDbMeta,
> Configuration config);

What about the segment's metadata in prepareUpdateConfig?
Following your 
idea, we would have to pass a Map<String segmentName,
MapWritable 
metaData> ...

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||/|  Information Retrieval, Semantic Web
___|||__||  |  ||  |  Embedded Unix, System Integration
http://www.sigram.com 
Contact: info at sigram dot com


Re: Scoring API issues (LONG)
user name
2007-09-19 01:09:29
On 9/18/07, Andrzej Bialecki <abgetopt.org> wrote:
> Doğacan Güney wrote:
>
> >> public void prepareInjectorConfig(Path
crawlDb, Path urls, Configuration
> >> config);
> >> public void prepareGeneratorConfig(Path
crawlDb, Configuration config);
> >> public void prepareIndexerConfig(Path crawlDb,
Path linkDb, Path[]
> >> segments, Configuration config);
> >> public void prepareUpdateConfig(Path crawlDb,
Path[] segments,
> >> Configuration config);
> >
> > Should we really pass Path-s to methods? IMHO,
opening a file and
> > reading from it looks a bit cumbersome. I would
suggest that the
> > relevant job would read the file then pass the
data (MapWritable) to
> > the method. For example, prepareGeneratorConfig
would look like this:
> >
> > public void prepareGeneratorConfig(MapWritable
crawlDbMeta,
> > Configuration config);
>
> What about the segment's metadata in
prepareUpdateConfig? Following your
> idea, we would have to pass a Map<String
segmentName, MapWritable
> metaData> ...

Yeah, I think it looks good but I guess you disagree?

>
> --
> Best regards,
> Andrzej Bialecki     <><
>   ___. ___ ___ ___ _ _  
__________________________________
> [__ || __|__/|__||/|  Information Retrieval, Semantic
Web
> ___|||__||  |  ||  |  Embedded Unix, System
Integration
> http://www.sigram.com 
Contact: info at sigram dot com
>
>


-- 
Doğacan Güney
Re: Scoring API issues (LONG)
country flaguser name
Poland
2007-09-19 04:50:26
Doğacan Güney wrote:
> On 9/18/07, Andrzej Bialecki <abgetopt.org> wrote:
>> Doğacan Güney wrote:


>>> public void prepareGeneratorConfig(MapWritable
crawlDbMeta,
>>> Configuration config);
>> What about the segment's metadata in
prepareUpdateConfig? Following your
>> idea, we would have to pass a Map<String
segmentName, MapWritable
>> metaData> ...
> 
> Yeah, I think it looks good but I guess you disagree?

 Each
variant looks a bit like a hack ... The variant with Path[]

doesn't require that you retrieve all segments' metadata
first, even if 
your plugins don't use it.

On the other hand the variant with Map<String,
MapWritable> avoids the 
pesky issue of doing file I/O in the plugins. So perhaps
it's a little 
better ...


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||/|  Information Retrieval, Semantic Web
___|||__||  |  ||  |  Embedded Unix, System Integration
http://www.sigram.com 
Contact: info at sigram dot com


Re: Scoring API issues (LONG)
country flaguser name
Finland
2007-10-18 11:21:32
Andrzej Bialecki wrote:
> Hi all,
> 
> I've been working recently on a custom scoring plugin,
and I found out
> some issues with the scoring API that severely limit
the way we can
> calculate static page scores. I'd like to restart the
discussion about
> this API, and propose some changes. Any comments or
suggestions are
> welcome!

Hi,

In practice I have found out that sometimes it's just easier
(and even
more efficient) to write a custom mr job (yes, an additional
phase into
the process) to calculate the scores for urls.

By using this strategy it would give users more freedom in
selecting the
data (and algorithm) required and same time keep the other
parts of the
process more slim.

-- 
 Sami Siren

Re: Scoring API issues (LONG)
country flaguser name
Poland
2007-10-18 11:40:05
Sami Siren wrote:
> Andrzej Bialecki wrote:
>> Hi all,
>>
>> I've been working recently on a custom scoring
plugin, and I found out
>> some issues with the scoring API that severely
limit the way we can
>> calculate static page scores. I'd like to restart
the discussion about
>> this API, and propose some changes. Any comments or
suggestions are
>> welcome!
> 
> Hi,
> 
> In practice I have found out that sometimes it's just
easier (and even
> more efficient) to write a custom mr job (yes, an
additional phase into
> the process) to calculate the scores for urls.

Same here. E.g. PageRank calc. requires running a separate
job. Other 
scoring techniques that use a post-processed linkgraph also
require 
running a separate MR job.

> By using this strategy it would give users more freedom
in selecting the
> data (and algorithm) required and same time keep the
other parts of the
> process more slim.

Right .. except the main (supposed) benefit of OPIC was that
it would be 
possible to avoid running an additional analysis step - the
scores were 
supposed to be re-calculated online as a part of other
steps. It's not 
worked out this way, as we know, but this was the main
motivation for 
introducing the scoring API ... although it seems more and
more that 
this API is just a glorified OPIC, and it's not sufficiently
re-usable 
to benefit other scoring algorithms ...


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||/|  Information Retrieval, Semantic Web
___|||__||  |  ||  |  Embedded Unix, System Integration
http://www.sigram.com 
Contact: info at sigram dot com


[1-7]

about | contact  Other archives ( Real Estate discussion Medical topics )