Hi Sriram,
In regex, . matches any single character, and following . with a *
matches any character zero or more times. That is, .* in combination
is a wildcard match.
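
A quick way to see the difference between . and .* is to try them in
Python's re module. This is only an illustration of the regex behaviour,
not of Nutch itself; the helper name matches_whole is made up for the
example.

```python
import re


def matches_whole(pattern: str, text: str) -> bool:
    """Return True only if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None


# "." stands for exactly one character.
one_char = matches_whole(r"Special:.", "Special:W")
too_long = matches_whole(r"Special:.", "Special:Watchlist")

# ".*" stands for any run of characters, including an empty run.
empty_tail = matches_whole(r"Special:.*", "Special:")
long_tail = matches_whole(r"Special:.*", "Special:Watchlist")

print(one_char, too_long, empty_tail, long_tail)
```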
So modifying your regex to:
-^http://wiki.mydomain.com/index.php/Special:.*
should fix the problem.
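
To try a rule list before running a crawl, here is a rough Python
sketch. It is not Nutch code: it assumes rules are checked top to
bottom, that the first rule whose regex is found in the URL decides,
and that a URL matching no rule is skipped. The names RULES, HTTP_ONLY
and accepts are hypothetical. It also assumes the rule's scheme has to
agree with the URLs being fetched, so the first list uses https? to
cover both http and https; the second list keeps plain http to show
what happens against the https URLs from the fetcher log.

```python
import re

# Hypothetical stand-in for the rule file: (sign, regex) pairs in file order.
RULES = [
    ("-", r"^https?://wiki\.mydomain\.com/index\.php/Special:.*"),
    ("+", r"^https://wiki\.mydomain\.com/index\.php"),
    ("-", r"."),
]

# Same list, but the exclusion rule only covers http:// URLs.
HTTP_ONLY = [
    ("-", r"^http://wiki\.mydomain\.com/index\.php/Special:.*"),
    ("+", r"^https://wiki\.mydomain\.com/index\.php"),
    ("-", r"."),
]


def accepts(url: str, rules=RULES) -> bool:
    """Return True if the first matching rule is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched: treat the URL as skipped


special = "https://wiki.mydomain.com/index.php/Special:Watchlist"
normal = "https://wiki.mydomain.com/index.php/Telecom"

print(accepts(special))             # excluded by the first rule
print(accepts(normal))              # allowed by the "+" rule
print(accepts(special, HTTP_ONLY))  # http-only rule misses the https URL
```

In this sketch the http-only exclusion never matches an https URL, so
the Special: page falls through to the + rule and is accepted.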
- Ravi Chintakunta
On 3/22/07, SriramG <sgopalan etrade.com> wrote:
>
> I am trying to crawl a wikipedia site.
>
> I want to skip any url which has the term Special:
>
> Eg:
> https://wiki.mydomain.com/index.php/Special:Whatlinkshere/Main_Page
> https://wiki.mydomain.com/index.php/Special:Recentchangeslinked/Main_Page
> https://wiki.mydomain.com/index.php/Special:Watchlist
> https://wiki.mydomain.com/index.php/Special:Contributions/SName
> https://wiki.mydomain.com/index.php/Special:Recentchanges
>
> This is my crawl-urlfilter.txt
> -^http://wiki.mydomain.com/index.php/Special
> -^http://wiki.mydomain.com/index.php/Special /
> -^http://wiki.mydomain.com/index.php/Special /*
> -^https://wiki.mydomain.com/index.php/Special:Upload
> +^https://wiki.mydomain.com/index.php
> -.
>
> But I still see these URLs in the fetcher logs.
>
> 2007-03-22 12:52:15,387 INFO fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php
> 2007-03-22 12:52:32,128 INFO fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Telecom
> 2007-03-22 12:52:32,159 INFO fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Special:Contributions/SName
> 2007-03-22 12:52:32,159 INFO fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Special:Watchlist
> 2007-03-22 12:52:32,179 INFO fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Special:Preferences
> 2007-03-22 12:52:32,198 INFO fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Special:Recentchanges
> 2007-03-22 12:52:32,322 INFO fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Talk:Main_Page
> 2007-03-22 12:52:32,323 INFO fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Special:Whatlinkshere/Main_Page
> 2007-03-22 12:52:32,326 INFO fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/BCP
> 2007-03-22 12:52:32,339 INFO fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Special:Recentchangeslinked/Main_Page
> 2007-03-22 12:52:32,343 INFO fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Network_Engineering
>
>
> Not sure what's wrong in my regular expression.
>
> Any help please.
>
>
> --
> View this message in context: http://www.nabble.com/Need-Help-with-crawl-urlfilter.txt-tf3450339.html#a9623983
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
>