[ http://issues.apache.org/jira/browse/NUTCH-233?page=comments#action_12453919 ]
Sean Dean commented on NUTCH-233:
---------------------------------
Could I suggest that this change, from
".*(/.+?)/.*?\1/.*?\1/" to
".*(/[^/]+)/[^/]+\1/[^/]+\1/", be committed to at
least trunk for the time being?
I recently created a segment with exactly 1M URLs. I ran the
fetch and it did indeed stall on the reduce part of the
operation due to the regex filter. This was verified with a
thread dump (kill -3 <pid>) on FreeBSD.
I then made the suggested change in the config file and
re-fetched the exact same segment. It completed without
issue.
I'm aware we might be losing some filtering functionality
with this new expression, but is that not better than knowing
there is always a chance that your whole-web crawl fetch will
fail because of this?
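
To make the cost difference concrete, here is a minimal sketch (not Nutch code) that counts how many characters java.util.regex reads while trying each expression against a URL with no repeated path segment. It assumes the backreference in both expressions is written as `\1`; the class name `BacktrackCount`, the helper `sampleUrl`, and the `example.com` URL are made up for illustration.

```java
import java.util.regex.Pattern;

public class BacktrackCount {
    // Both expressions as discussed in this issue, assuming the backreference is \1.
    static final String OLD = ".*(/.+?)/.*?\\1/.*?\\1/";
    static final String NEW = ".*(/[^/]+)/[^/]+\\1/[^/]+\\1/";

    /** CharSequence wrapper that counts every character the regex engine reads. */
    static final class CountingSeq implements CharSequence {
        private final String s;
        long reads = 0;
        CountingSeq(String s) { this.s = s; }
        @Override public int length() { return s.length(); }
        @Override public char charAt(int i) { reads++; return s.charAt(i); }
        @Override public CharSequence subSequence(int a, int b) { return s.subSequence(a, b); }
        @Override public String toString() { return s; }
    }

    /** Runs find() once and returns the number of charAt calls it took. */
    static long reads(String regex, String input) {
        CountingSeq seq = new CountingSeq(input);
        Pattern.compile(regex).matcher(seq).find();
        return seq.reads;
    }

    /** Hypothetical URL with distinct path segments, so neither expression matches. */
    static String sampleUrl(int segments) {
        StringBuilder sb = new StringBuilder("http://example.com");
        for (int i = 0; i < segments; i++) {
            sb.append("/p").append(i);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String url = sampleUrl(16);
        System.out.println("old reads: " + reads(OLD, url));
        System.out.println("new reads: " + reads(NEW, url));
    }
}
```

The old expression lets `.+?` and `.*?` run across slashes, so the engine retries far more split points before giving up; the `[^/]+` form confines each piece to a single path segment. The exact counts depend on the JDK version and the input, so treat them as relative indicators only.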
> wrong regular expression hang reduce process for ever
> -----------------------------------------------------
>
> Key: NUTCH-233
> URL: http://issues.apache.org/jira/browse/NUTCH-233
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 0.8
> Reporter: Stefan Groschupf
> Priority: Blocker
> Fix For: 0.9.0
>
>
> It looks like the expression ".*(/.+?)/.*?\1/.*?\1/" in
> regex-urlfilter.txt is not compatible with java.util.regex, which
> is what the regex URL filter actually uses.
> Maybe updating it was missed when the regular expression package
> was changed.
> The problem was that, while reducing a fetch map output, the
> reducer hung forever because the output format was applying the
> URL filter to a URL that triggers the hang.
> 060315 230823 task_r_3n4zga at java.lang.Character.codePointAt(Character.java:2335)
> 060315 230823 task_r_3n4zga at java.util.regex.Pattern$Dot.match(Pattern.java:4092)
> 060315 230823 task_r_3n4zga at java.util.regex.Pattern$Curly.match1(Pattern.java:
> I changed the regular expression to
> ".*(/[^/]+)/[^/]+\1/[^/]+\1/" and now the fetch job works.
> (Thanks to Grant and Chris B. for helping to find the new regex.)
> However, maybe people can review it and suggest improvements.
> The old regex would match "abcd/foo/bar/foo/bar/foo/", and so
> does the new one. But the old regex would also match
> "abcd/foo/bar/xyz/foo/bar/foo/", which the new regex will not
> match.
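
The two example strings from the description can be checked directly with java.util.regex. This is a small sketch, assuming the backreference in both expressions is `\1` and that the filter tests URLs with `find()`; the class name `LoopRegexCompare` and the helper `hits` are made up for illustration.

```java
import java.util.regex.Pattern;

public class LoopRegexCompare {
    // Old and new expressions from this issue, assuming the backreference is \1.
    static final Pattern OLD = Pattern.compile(".*(/.+?)/.*?\\1/.*?\\1/");
    static final Pattern NEW = Pattern.compile(".*(/[^/]+)/[^/]+\\1/[^/]+\\1/");

    /** True if the pattern is found anywhere in the URL. */
    static boolean hits(Pattern p, String url) {
        return p.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] urls = {
            "abcd/foo/bar/foo/bar/foo/",      // segment repeats with one segment in between
            "abcd/foo/bar/xyz/foo/bar/foo/"   // first repeat is two segments away
        };
        for (String url : urls) {
            System.out.println(url + " old=" + hits(OLD, url) + " new=" + hits(NEW, url));
        }
    }
}
```

The second URL is missed by the new expression because `[^/]+` can only cover one path segment between repeats, while `.*?` in the old expression could span "bar/xyz".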
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the
administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira