
Thread: Need Help with crawl-urlfilter.txt




Need Help with crawl-urlfilter.txt
SriramG
2007-03-22 16:00:26
I am trying to crawl a wikipedia site.

I want to skip any URL that contains the term Special:

Eg:
https://wiki.mydomain.com/index.php/Special:Whatlinkshere/Main_Page
https://wiki.mydomain.com/index.php/Special:Recentchangeslinked/Main_Page
https://wiki.mydomain.com/index.php/Special:Watchlist
https://wiki.mydomain.com/index.php/Special:Contributions/SName
https://wiki.mydomain.com/index.php/Special:Recentchanges

This is my crawl-urlfilter.txt
-^http://wiki.mydomain.com/index.php/Special
-^http://wiki.mydomain.com/index.php/Special/
-^http://wiki.mydomain.com/index.php/Special/*
-^https://wiki.mydomain.com/index.php/Special:Upload
+^https://wiki.mydomain.com/index.php
-.

But I still see these URLs in the fetcher logs:

2007-03-22 12:52:15,387 INFO  fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php
2007-03-22 12:52:32,128 INFO  fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Telecom
2007-03-22 12:52:32,159 INFO  fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Special:Contributions/SName
2007-03-22 12:52:32,159 INFO  fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Special:Watchlist
2007-03-22 12:52:32,179 INFO  fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Special:Preferences
2007-03-22 12:52:32,198 INFO  fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Special:Recentchanges
2007-03-22 12:52:32,322 INFO  fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Talk:Main_Page
2007-03-22 12:52:32,323 INFO  fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Special:Whatlinkshere/Main_Page
2007-03-22 12:52:32,326 INFO  fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/BCP
2007-03-22 12:52:32,339 INFO  fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Special:Recentchangeslinked/Main_Page
2007-03-22 12:52:32,343 INFO  fetcher.Fetcher - fetching https://wiki.mydomain.com/index.php/Network_Engineering


Not sure what's wrong with my regular expressions.

Any help please.


-- 
View this message in context: http://www.nabble.com/Need-Help-with-crawl-urlfilter.txt-tf3450339.html#a9623983
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: Need Help with crawl-urlfilter.txt
Ravi Chintakunta
2007-03-22 21:51:39
Hi Sriram,

In a regex, . matches any single character, and following the . with a *
matches zero or more of any character. That is, .* in combination is a
wildcard match.
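As a small illustration of that point, here is a sketch in Python. That choice is an assumption for convenience: Nutch itself uses Java regular expressions, but . and * mean the same thing in both for patterns this simple. The page names are hypothetical and only serve to contrast /* with .*.

```python
import re

# Hypothetical page name, used only to show the two quantifiers.
page = "Special:Watchlist"

# "/*" is "zero or more slashes": it cannot consume ":Watchlist".
slash_star = re.fullmatch(r"Special/*", page) is not None

# ".*" is "zero or more of any character": it consumes the whole tail.
dot_star = re.fullmatch(r"Special:.*", page) is not None

# "/*" only helps when the tail really is made of slashes.
only_slashes = re.fullmatch(r"Special/*", "Special///") is not None

print(slash_star, dot_star, only_slashes)
```

fullmatch is used here so that the whole string has to be consumed; a filter that only looks for a match somewhere in the URL would treat a prefix pattern differently.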

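So modifying your regex to:

-^https://wiki.mydomain.com/index.php/Special:.*

should fix the problem. Note the https scheme, which matches the URLs in
your fetcher log; the Special rules in your file start with http://, so
they never match those URLs.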

- Ravi Chintakunta
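
For anyone who wants to check a rule set before running a crawl, the following is a rough sketch of how such a filter file could be evaluated. It rests on assumptions: that the first rule whose pattern is found anywhere in the URL decides (+ accepts, - rejects) and that a URL matching no rule is rejected, which is my understanding of Nutch's regex URL filter rather than something stated in this thread. It also uses Python's re module in place of Java regex. The names parse_rules and accepts are hypothetical helpers, and the "revised" rule set is only an illustration. The Special rules in the original post use http:// while the fetcher log shows https:// URLs, and the sketch makes that mismatch visible.

```python
import re

def parse_rules(text):
    """Parse lines of the form '+regex' or '-regex' into (include, pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules

def accepts(rules, url):
    """Assumed semantics: the first rule found in the URL decides; no match rejects."""
    for include, pattern in rules:
        if pattern.search(url):
            return include
    return False

# Rules in the spirit of the original post: the Special exclusion uses http://.
original = parse_rules("""
-^http://wiki.mydomain.com/index.php/Special
-^https://wiki.mydomain.com/index.php/Special:Upload
+^https://wiki.mydomain.com/index.php
-.
""")

# Hypothetical revision: same scheme as the crawled URLs, colon after Special.
revised = parse_rules("""
-^https://wiki.mydomain.com/index.php/Special:
+^https://wiki.mydomain.com/index.php
-.
""")

urls = [
    "https://wiki.mydomain.com/index.php/Telecom",
    "https://wiki.mydomain.com/index.php/Special:Watchlist",
    "https://wiki.mydomain.com/index.php/Special:Upload",
]

for url in urls:
    print(url, accepts(original, url), accepts(revised, url))
```

Under these assumptions, the exclusion has to appear before the broader + rule, because the first matching rule ends the evaluation.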


