Thread: Windows Share Crawling/searching




2007-08-13 03:07:33
Hi all

I am new to Nutch.

I have downloaded Nutch 0.9.


I want to crawl my local network (Windows shares and Linux shares).

I used this link as a reference:
http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch


1) Downloaded the protocol-smb plugin:

http://issues.apache.org/jira/browse/NUTCH-427

2) Made the following changes in crawl-urlfilter.txt:

# skip http:, ftp:, & mailto: urls
-^(http|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|mpg|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-[?*!=]

# skip everything else
# -.

# accept anything else 
+.*


3) Made the following changes in nutch-site.xml:

<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-smb|protocol-file|urlfilter-regex|parse-(text|html|js|pdf|msword|zip|mspowerpoint|msexcel)|index-basic|query-(basic|site|url)</value>
  <description></description>
</property>



4) The urls file consists of entries of the form smb://hostname/share

5) The Windows login details (username, password, IP address, etc.) are entered in smb.properties

6) bin/nutch crawl urls -dir localcrawl gives this error:

smb://192.168.0.1/:java.net.MalformedURLException: unknown protocol: smb

7) Tried crawling local files, but got the following error:

file:///var/test.txt failed with: org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=file
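
For context on the error in step 6: that message comes from java.net.URL itself, which only accepts schemes that have a stream handler registered in the running JVM. The following is a minimal sketch in plain Java, with no Nutch or SMB library on the classpath; the class name SmbUrlCheck and the helper urlError are made up for illustration. It shows that a stock JVM rejects smb: while it accepts file:. The error in step 7 is different in kind: it is raised by Nutch rather than by the JDK, which suggests (as an assumption, not something confirmed in this thread) that no protocol plugin for file was loaded.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

public class SmbUrlCheck {
    // Returns null if java.net.URL accepts the string,
    // otherwise the message of the MalformedURLException.
    @SuppressWarnings("deprecation") // new URL(String) is deprecated on recent JDKs
    public static String urlError(String spec) {
        try {
            new URL(spec);
            return null;
        } catch (MalformedURLException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // No handler for smb: is registered in a stock JVM.
        System.out.println("smb:  " + urlError("smb://192.168.0.1/"));
        // file: has a built-in handler, so java.net.URL accepts it.
        System.out.println("file: " + urlError("file:///var/test.txt"));
        // java.net.URI only parses the syntax and needs no handler.
        System.out.println("URI scheme: " + URI.create("smb://192.168.0.1/").getScheme());
    }
}
```

If the smb: line reports "unknown protocol: smb" here as well, the message in step 6 is the JVM's default behaviour, and the question becomes whether the SMB handler from the plugin was ever registered.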

Are the above settings correct for crawling local Windows shares?

Can someone guide me on what to do? Where am I going wrong?

Thanks,

Bikram
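
The filter rules from step 2 can be sanity-checked outside Nutch. This is a rough sketch, assuming the rules are tried top to bottom, the first regex found anywhere in the URL decides, and a URL matching no rule is rejected (my understanding of the urlfilter-regex plugin, not verified against the 0.9 source). UrlFilterSketch and fromPost are hypothetical names, and the suffix rule is written here with an escaped dot.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class UrlFilterSketch {
    // One rule: '+' means accept, '-' means reject, followed by a regex.
    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, String regex) {
            this.accept = accept;
            this.pattern = Pattern.compile(regex);
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    // Parses "+regex" / "-regex" lines; blank lines and '#' comments are skipped.
    public UrlFilterSketch(String... lines) {
        for (String line : lines) {
            String s = line.trim();
            if (s.isEmpty() || s.startsWith("#")) {
                continue;
            }
            char sign = s.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("bad rule: " + s);
            }
            rules.add(new Rule(sign == '+', s.substring(1)));
        }
    }

    // ASSUMPTION: first matching rule wins; no match means rejected.
    public boolean accepts(String url) {
        for (Rule r : rules) {
            if (r.pattern.matcher(url).find()) {
                return r.accept;
            }
        }
        return false;
    }

    // The active rules from step 2 of the post.
    public static UrlFilterSketch fromPost() {
        return new UrlFilterSketch(
            "-^(http|ftp|mailto):",
            "-\\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|mpg|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$",
            "-[?*!=]",
            "+.*");
    }

    public static void main(String[] args) {
        UrlFilterSketch filter = fromPost();
        String[] urls = {
            "smb://192.168.0.1/",
            "file:///var/test.txt",
            "http://example.com/",
            "smb://192.168.0.1/share/logo.gif"
        };
        for (String url : urls) {
            System.out.println(url + " -> " + (filter.accepts(url) ? "accepted" : "rejected"));
        }
    }
}
```

Under these assumptions both seed URLs from the post pass the filter, so the filter file is unlikely to be what produces the two errors.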



Re: Windows Share Crawling/searching
China
2007-08-17 00:27:59
Hi

protocol-smb is a Nutch plugin; see the following link for help:

http://wiki.apache.org/nutch/WritingPluginExample-0.9


Remember to run ant after you add this protocol to Nutch.

To check whether the plugin has been activated, use the command:

bin/nutch plugin protocol-smb org.apache.nutch.protocol.smb.[class name here!]
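
Whether a plugin is activated also depends on plugin.includes in nutch-site.xml matching its id. A minimal sketch, assuming Nutch treats the property value as one regular expression that must match the whole plugin id (an assumption about the plugin loader, not taken from this thread); PluginIncludesCheck and isIncluded are hypothetical names. It also shows that a value with a stray line break inside it, for example from copy and paste, would no longer match protocol-file.

```java
import java.util.regex.Pattern;

public class PluginIncludesCheck {
    // The plugin.includes value from the first post, written on a single line.
    public static final String INCLUDES =
        "nutch-extensionpoints|protocol-smb|protocol-file|urlfilter-regex|"
        + "parse-(text|html|js|pdf|msword|zip|mspowerpoint|msexcel)|"
        + "index-basic|query-(basic|site|url)";

    // ASSUMPTION: a plugin is loaded only if the regex matches its full id.
    public static boolean isIncluded(String includesRegex, String pluginId) {
        return Pattern.compile(includesRegex).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        // Same value, but with a line break inside "protocol-file".
        String broken = INCLUDES.replace("protocol-file", "protocol-fil\ne");
        String[] ids = {"protocol-smb", "protocol-file", "protocol-http", "parse-pdf"};
        for (String id : ids) {
            System.out.println(id
                + ": clean=" + isIncluded(INCLUDES, id)
                + " broken=" + isIncluded(broken, id));
        }
    }
}
```

If protocol-file drops out in the "broken" column, it is worth confirming that the value in the real nutch-site.xml sits on one line with no embedded whitespace.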



Re: Windows Share Crawling/searching
United States
2007-08-17 07:52:22
Hi 熊泽法,

Thanks for the reply. I did everything explained in the plugin tutorial:

http://wiki.apache.org/nutch/WritingPluginExample-0.9

But it is not working :( I don't know where I am going wrong. Can anyone please help me out here?

Thanks,
Bikram

PS: Sorry for the double post.
-- 
View this message in context: http://www.nabble.com/Windows-Share-Crawling-searching-tf4281025.html#a12198994
Sent from the Nutch - User mailing list archive at Nabble.com.

