First I set conf/crawl-urlfilter as follows:
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-.(png|PNG|ico|ICO|css|sit|eps|wmf|zip|mpg|gz|rpm|tgz|mov|MOV|exe|bmp|BMP)$
# skip URLs containing certain characters as probable queries, etc.
-[?*! =]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/
# accept hosts in MY.DOMAIN.NAME
#+^http://([a-z0-9]*.)*MY.DOMAIN.NAME/
# skip everything else
+.
I can crawl "http://guide.kapook.com" but I can't crawl "http://www.kapook.com". Some web pages cannot be crawled at all, and I want to know why.
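One way to check whether the filter itself is rejecting these URLs is to replay the rules by hand. The following is a minimal Python sketch, not Nutch's own code. It assumes the rules are tried from top to bottom, that the first pattern found anywhere in the URL decides (+ accepts, - rejects), and that a URL matching no rule is rejected. The patterns are written with the usual backslash escapes, which is an assumption about the original file, and `RULES`, `url_allowed` and the `index.php?id=1` URL are made up for illustration.

```python
import re

# (sign, pattern) pairs in file order; '+' accepts, '-' rejects.
# Backslash escapes (\. and \1) are assumed, see the note above.
RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"\.(png|PNG|ico|ICO|css|sit|eps|wmf|zip|mpg|gz|rpm|tgz|mov|MOV|exe|bmp|BMP)$"),
    ("-", r"[?*! =]"),
    ("-", r".*(/.+?)/.*?\1/.*?\1/"),
    ("+", r"."),
]


def url_allowed(url, rules=RULES):
    """Return True if the first rule whose pattern is found in url is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    # Assumption: a URL that matches no rule at all is rejected.
    return False


if __name__ == "__main__":
    for url in (
        "http://guide.kapook.com/",
        "http://www.kapook.com/",
        "http://www.kapook.com/index.php?id=1",  # hypothetical query URL
    ):
        print(url, "->", "accepted" if url_allowed(url) else "rejected")
```

If both hosts come out as accepted in such a check, the filter rules are probably not what keeps http://www.kapook.com from being crawled, although links on that site that contain `?` or `=` would still be skipped by the third rule.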
After the crawl, the index is not complete. It does not have a segments file; it has only:
/user/nutch/crawld/indexes/part-00000/_0.fdt <r 1> 365
/user/nutch/crawld/indexes/part-00000/_0.fdx <r 1> 8
/user/nutch/crawld/indexes/part-00000/_0.fnm <r 1> 66
/user/nutch/crawld/indexes/part-00000/_0.frq <r 1> 370
/user/nutch/crawld/indexes/part-00000/_0.nrm <r 1> 9
/user/nutch/crawld/indexes/part-00000/_0.prx <r 1> 611
/user/nutch/crawld/indexes/part-00000/_0.tii <r 1> 135
/user/nutch/crawld/indexes/part-00000/_0.tis <r 1> 10553
/user/nutch/crawld/indexes/part-00000/index.done <r 1> 0
/user/nutch/crawld/indexes/part-00000/segments.gen <r 1> 20
/user/nutch/crawld/indexes/part-00000/segments_2 <r 1> 41
/user/nutch/crawld/indexes/part-00001/index.done <r 1> 0
/user/nutch/crawld/indexes/part-00001/segments.gen <r 1> 20
/user/nutch/crawld/indexes/part-00001/segments_1 <r 1> 20
How can I solve it?
--
View this message in context: http://www.nabble.com/nutch-crawl-and-index-problem-tp14703815p14703815.html
Sent from the Hadoop Users mailing list archive at Nabble.com.