|
List Info
Thread: More fetcher speed increases
|
|
| More fetcher speed increases |

|
2006-11-16 16:30:32 |
Hi, folks,
I, too, was slowed down by reduce operations in fetch. Some benchmarking
showed that in my case, the limiting operation was filtering (though a
distant second was the time spent calculating Levenshtein distances,
presumably part of the spellchecking that Sami just removed to speed
things up, though I haven't looked at it yet).

I've fixed the problem, and my reduce speed is better by about a factor
of three. However, the fix is limited to certain usage patterns.

In my case, I have tens of thousands of sites and subsites I'm crawling,
and I'm using a combination of PrefixURLFilter + AutomatonURLFilter. I
essentially use the prefix filter to limit to the set of sites, and then
automaton to pattern-match within those sites. I only have subsite
matches on < 10% of the sites, however, so I was clearly wasting a lot
of time running the automaton patterns on URLs that didn't need them.
And automaton, though much faster than RegexURLFilter, is still dog-slow
with that many patterns.

A simple fix was to extend the current "AND all the filters together"
model to have the notion of a "short-circuit" match, which allows a
filter to say "let this URL through and DON'T run the other filters" by
returning a special token to URLFilters. Now I have a version of
PrefixURLFilter that can return both "normal" matches and
"short-circuit" matches, and only returns "normal" matches for those
sites that need to run subsite patterns. It seems to work well, the
overhead is negligible when not in use, and the speedup is massive for
my usage pattern.
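
A minimal sketch of the short-circuit idea described above, not the
actual patch. The UrlFilter interface, the ShortCircuitChain class, and
the SHORT_CIRCUIT marker string are all hypothetical names chosen for
illustration; the only convention borrowed from the thread is that
filters are ANDed together and that a filter can signal "accept and
skip the rest" through a special token.

```java
import java.util.List;

/** Hypothetical filter contract: return null to reject, a URL to accept. */
interface UrlFilter {
    String filter(String url);
}

/** Hypothetical chain that ANDs filters but honors a short-circuit token. */
final class ShortCircuitChain {
    /** Assumed marker a filter prepends to mean "accept, skip other filters". */
    static final String SHORT_CIRCUIT = "\u0000SC:";

    private final List<UrlFilter> filters;

    ShortCircuitChain(List<UrlFilter> filters) {
        this.filters = filters;
    }

    /** Returns the accepted URL, or null if any filter that ran rejected it. */
    String filter(String url) {
        String current = url;
        for (UrlFilter f : filters) {
            String result = f.filter(current);
            if (result == null) {
                return null; // normal AND semantics: one rejection is final
            }
            if (result.startsWith(SHORT_CIRCUIT)) {
                // Accept immediately; later (slower) filters never run.
                return result.substring(SHORT_CIRCUIT.length());
            }
            current = result; // normal match: keep going down the chain
        }
        return current;
    }
}
```

With a chain like this, a prefix filter placed first can return the
marker for sites with no subsite patterns, so the expensive pattern
filter only runs on the small fraction of URLs that need it. When no
filter ever returns the marker, the only extra cost per filter is one
startsWith check.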

I'd like to contribute it back, if people would find this useful (not
that it's rocket science!).

First, is there anyone out there besides me who would find this useful?

Second, I've been thinking about the best way to handle PrefixURLFilter
configuration. I can see a few options:

1. Have two different config files, one for "normal" matches, and one
   for "short-circuit" matches.
2. Have one config file, with a syntax to say "make this pattern a
   short-circuit match," and make the default be a "normal" match, so it
   is backwards compatible with the current version.
3. Make a new type of filter which internally combines Prefix and
   Automaton, takes one config file, and decides internally which
   patterns should generate automaton inputs vs "normal" or
   "short-circuit" prefix matches.

Approach #3 requires no changes to the URLFilter model, and makes it
difficult to screw up by making config files which are inconsistent
(e.g. forgetting to put in a prefix pattern for one of the automaton
patterns). It is also the least flexible, requires the most code, and
introduces yet another kind of filter.

I tend to like the changed URLFilter model; it's more flexible, even if
it requires a little more care in configuration (a simple Perl script,
in my case, to generate the config files correctly and consistently).
I'm leaning towards approach #2. I'm thinking something simple,
syntax-wise, like putting SHORTCIRCUIT: before the patterns which should
short-circuit. Any suggestions for a better syntax? Or reasons why I
should consider a different approach?
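
To make approach #2 concrete, here is a rough sketch of how a single
config file with a SHORTCIRCUIT: marker might be read. The PrefixRules
class and its fields are hypothetical, and skipping blank lines and
lines starting with # is an assumption for the example rather than
something stated in this thread. Lines without the marker are treated
as "normal" prefixes, which is what keeps existing files working
unchanged.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical holder for the two kinds of prefix read from one file. */
final class PrefixRules {
    /** Marker proposed in the thread for short-circuit prefixes. */
    static final String MARKER = "SHORTCIRCUIT:";

    final List<String> normal = new ArrayList<>();
    final List<String> shortCircuit = new ArrayList<>();

    /** Splits config lines into normal and short-circuit prefixes. */
    static PrefixRules parse(List<String> lines) {
        PrefixRules rules = new PrefixRules();
        for (String raw : lines) {
            String line = raw.trim();
            // Assumption: blank lines and '#' comments are ignored.
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            if (line.startsWith(MARKER)) {
                rules.shortCircuit.add(line.substring(MARKER.length()).trim());
            } else {
                // No marker: default to a normal match (backwards compatible).
                rules.normal.add(line);
            }
        }
        return rules;
    }
}
```

A file for this parser could list http://a.example/ on one line and
SHORTCIRCUIT:http://b.example/ on the next; only URLs under the first
prefix would then go on to the remaining filters.
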
Doug
--
View this message in context: http://www.nabble.com/More-fetcher-speed-increases-tf2644170.html#a7381430
Sent from the Nutch - Dev mailing list archive at Nabble.com.
|
|
| More fetcher speed increases |

|
2006-11-22 04:40:08 |
Hi Doug,
Your idea about combining PrefixURLFilter and AutomatonURLFilter sounds
interesting. Could you please attach the patch to JIRA? Thanks.

- Scott
|
|
| More fetcher speed increases |

|
2006-11-26 00:20:24 |
Done. See http://issues.apache.org/jira/browse/NUTCH-409

This is my first Nutch contribution, so hopefully I've got it right. Any
suggestions/questions/feedback welcome.

Hope this is useful to others.

D
--
View this message in context: http://www.nabble.com/More-fetcher-speed-increases-tf2644170.html#a7543634
Sent from the Nutch - Dev mailing list archive at Nabble.com.
|
|
|
|