Thread: nutch crawl and index problem




nutch crawl and index problem
2008-01-08 20:07:31
First I set conf/crawl-urlfilter like this:
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(png|PNG|ico|ICO|css|sit|eps|wmf|zip|mpg|gz|rpm|tgz|mov|MOV|exe|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/

# accept hosts in MY.DOMAIN.NAME
#+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

# skip everything else
+.

I can crawl "http://guide.kapook.com" but I can't crawl
"http://www.kapook.com". Some web pages can't be crawled at all.
I want to know why. After the crawl, the index is not complete:
it has no segments file, only these files:

/user/nutch/crawld/indexes/part-00000/_0.fdt          <r 1>   365
/user/nutch/crawld/indexes/part-00000/_0.fdx          <r 1>   8
/user/nutch/crawld/indexes/part-00000/_0.fnm          <r 1>   66
/user/nutch/crawld/indexes/part-00000/_0.frq          <r 1>   370
/user/nutch/crawld/indexes/part-00000/_0.nrm          <r 1>   9
/user/nutch/crawld/indexes/part-00000/_0.prx          <r 1>   611
/user/nutch/crawld/indexes/part-00000/_0.tii          <r 1>   135
/user/nutch/crawld/indexes/part-00000/_0.tis          <r 1>   10553
/user/nutch/crawld/indexes/part-00000/index.done      <r 1>   0
/user/nutch/crawld/indexes/part-00000/segments.gen    <r 1>   20
/user/nutch/crawld/indexes/part-00000/segments_2      <r 1>   41

/user/nutch/crawld/indexes/part-00001/index.done      <r 1>   0
/user/nutch/crawld/indexes/part-00001/segments.gen    <r 1>   20
/user/nutch/crawld/indexes/part-00001/segments_1      <r 1>   20

How do I solve it?
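
For anyone checking a filter file like the one quoted above, here is a rough sketch in Python of first-match rule evaluation. It assumes that the rules are tried in file order, that the first pattern found anywhere in the URL decides ("+" accepts, "-" rejects), and that a URL matching no rule is rejected. It also assumes the archive dropped the backslashes from the patterns (\. and \1). The function name accepts and the two extra test URLs are made up for illustration; this is not Nutch's own code.

```python
import re

# Rules in file order as (sign, pattern). The backslashes are an assumption:
# the archived post shows the patterns without them.
RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"\.(png|PNG|ico|ICO|css|sit|eps|wmf|zip|mpg|gz|rpm|tgz|mov|MOV|exe|bmp|BMP)$"),
    ("-", r"[?*!=]"),
    ("-", r".*(/.+?)/.*?\1/.*?\1/"),
    ("+", r"."),
]


def accepts(url):
    """Return True if the first rule whose pattern is found in url is a '+' rule."""
    for sign, pattern in RULES:
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched: treat the URL as rejected


if __name__ == "__main__":
    for url in [
        "http://guide.kapook.com",
        "http://www.kapook.com",
        "http://www.kapook.com/logo.png",      # hypothetical URL
        "http://www.kapook.com/search?q=1",    # hypothetical URL
    ]:
        print(url, "accepted" if accepts(url) else "rejected")
```

Under these assumptions both start URLs pass the filter, so the filter rules alone would not explain why one site is crawled and the other is not. Note that the "-[?*!=]" rule rejects any link with a query string, which can remove many pages from a site.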
-- 
View this message in context: http://www.nabble.com/nutch-crawl-and-index-problem-tp14703815p14703815.html
Sent from the Hadoop Users mailing list archive at Nabble.com.


Re: nutch crawl and index problem
2008-01-10 03:36:58
Now I have:

/user/nutch/crawld/indexes/part-00000/index.done      <r 1>   0
/user/nutch/crawld/indexes/part-00000/segments.gen    <r 1>   20
/user/nutch/crawld/indexes/part-00000/segments_1      <r 1>   20

/user/nutch/crawld/indexes/part-00001/_0.fdt          <r 1>   144
/user/nutch/crawld/indexes/part-00001/_0.fdx          <r 1>   8
/user/nutch/crawld/indexes/part-00001/_0.fnm          <r 1>   66
/user/nutch/crawld/indexes/part-00001/_0.frq          <r 1>   31
/user/nutch/crawld/indexes/part-00001/_0.nrm          <r 1>   9
/user/nutch/crawld/indexes/part-00001/_0.prx          <r 1>   32
/user/nutch/crawld/indexes/part-00001/_0.tii          <r 1>   31
/user/nutch/crawld/indexes/part-00001/_0.tis          <r 1>   757
/user/nutch/crawld/indexes/part-00001/index.done      <r 1>   0
/user/nutch/crawld/indexes/part-00001/segments.gen    <r 1>   20
/user/nutch/crawld/indexes/part-00001/segments_2      <r 1>   41

There is no segments file, which is important for Nutch search,
so I ran the command
"bin/nutch merge /user/nutch/crawld/index /user/nutch/crawld/indexes".
After that I listed /d01/local/crawld/index, which contains:

-rw-r--r-- 1 nutch users 144 ม.ค. 10 16:24 _0.fdt
-rw-r--r-- 1 nutch users   8 ม.ค. 10 16:24 _0.fdx
-rw-r--r-- 1 nutch users  66 ม.ค. 10 16:24 _0.fnm
-rw-r--r-- 1 nutch users  31 ม.ค. 10 16:24 _0.frq
-rw-r--r-- 1 nutch users   9 ม.ค. 10 16:24 _0.nrm
-rw-r--r-- 1 nutch users  32 ม.ค. 10 16:24 _0.prx
-rw-r--r-- 1 nutch users  31 ม.ค. 10 16:24 _0.tii
-rw-r--r-- 1 nutch users 757 ม.ค. 10 16:24 _0.tis
-rw-r--r-- 1 nutch users  41 ม.ค. 10 16:24 segments_2
-rw-r--r-- 1 nutch users  20 ม.ค. 10 16:24 segments.gen

It still has no segments file. Did I use "bin/nutch merge"
incorrectly? If so, how should I use this command?





Re: nutch crawl and index problem
2008-01-14 01:48:28
I still cannot solve it.

jibjoice wrote:
> 
> Now I have:
> 
> /user/nutch/crawld/indexes/part-00000/index.done      <r 1>   0
> /user/nutch/crawld/indexes/part-00000/segments.gen    <r 1>   20
> /user/nutch/crawld/indexes/part-00000/segments_1      <r 1>   20
> 
> /user/nutch/crawld/indexes/part-00001/_0.fdt          <r 1>   144
> /user/nutch/crawld/indexes/part-00001/_0.fdx          <r 1>   8
> /user/nutch/crawld/indexes/part-00001/_0.fnm          <r 1>   66
> /user/nutch/crawld/indexes/part-00001/_0.frq          <r 1>   31
> /user/nutch/crawld/indexes/part-00001/_0.nrm          <r 1>   9
> /user/nutch/crawld/indexes/part-00001/_0.prx          <r 1>   32
> /user/nutch/crawld/indexes/part-00001/_0.tii          <r 1>   31
> /user/nutch/crawld/indexes/part-00001/_0.tis          <r 1>   757
> /user/nutch/crawld/indexes/part-00001/index.done      <r 1>   0
> /user/nutch/crawld/indexes/part-00001/segments.gen    <r 1>   20
> /user/nutch/crawld/indexes/part-00001/segments_2      <r 1>   41
> 
> There is no segments file, which is important for Nutch search,
> so I ran the command
> "bin/nutch merge /user/nutch/crawld/index /user/nutch/crawld/indexes".
> After that I listed /d01/local/crawld/index, which contains:
> 
> -rw-r--r-- 1 nutch users 144 ม.ค. 10 16:24 _0.fdt
> -rw-r--r-- 1 nutch users   8 ม.ค. 10 16:24 _0.fdx
> -rw-r--r-- 1 nutch users  66 ม.ค. 10 16:24 _0.fnm
> -rw-r--r-- 1 nutch users  31 ม.ค. 10 16:24 _0.frq
> -rw-r--r-- 1 nutch users   9 ม.ค. 10 16:24 _0.nrm
> -rw-r--r-- 1 nutch users  32 ม.ค. 10 16:24 _0.prx
> -rw-r--r-- 1 nutch users  31 ม.ค. 10 16:24 _0.tii
> -rw-r--r-- 1 nutch users 757 ม.ค. 10 16:24 _0.tis
> -rw-r--r-- 1 nutch users  41 ม.ค. 10 16:24 segments_2
> -rw-r--r-- 1 nutch users  20 ม.ค. 10 16:24 segments.gen
> 
> It still has no segments file. Did I use "bin/nutch merge"
> incorrectly? If so, how should I use this command?
> 
> 
> 


