Thread: Compression issue ?




Compression issue ?
2007-10-07 10:01:37
I ran a large crawl over 2M URLs and noticed that my crawl folder was using
more than 40 GB.
I found that very odd, so I decided to run a simple test: the same crawl,
twice, with different parameters.
1- First crawl: I added the following property to hadoop-site.xml
<property>
  <name>io.seqfile.compression.type</name>
  <value>BLOCK</value>
  <description></description>
</property>

I launched a crawl over 50000 URLs/pages.

I checked the disk space used and found:
$ du -ks /data/nutch/local/ctsite51/segments/20071007151116/*
301400  /data/nutch/local/ctsite51/segments/20071007151116/content
376     /data/nutch/local/ctsite51/segments/20071007151116/crawl_fetch
1580    /data/nutch/local/ctsite51/segments/20071007151116/crawl_generate
1488    /data/nutch/local/ctsite51/segments/20071007151116/crawl_parse
2424    /data/nutch/local/ctsite51/segments/20071007151116/parse_data
17060   /data/nutch/local/ctsite51/segments/20071007151116/parse_text

2- Second crawl: I removed the property from hadoop-site.xml and
launched the same crawl over 50000 URLs/pages.

I checked the disk space used and found:
$ du -ks /data/nutch/local/ctsite52/segments/20071007184837/*
318568  /data/nutch/local/ctsite52/segments/20071007184837/content
1536    /data/nutch/local/ctsite52/segments/20071007184837/crawl_fetch
1580    /data/nutch/local/ctsite52/segments/20071007184837/crawl_generate
41668   /data/nutch/local/ctsite52/segments/20071007184837/crawl_parse
9596    /data/nutch/local/ctsite52/segments/20071007184837/parse_data
17152   /data/nutch/local/ctsite52/segments/20071007184837/parse_text

It looks like the data in the content and parse_text folders is not
compressed.
Is this normal? Do you have the same issue?
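
For reference, the per-directory difference between the two runs can be worked out directly from the du figures. This is only a minimal Python sketch: the names block_run, default_run and savings are made up for illustration, and the numbers are copied from the two listings (sizes in KB).

```python
# Sizes in KB, copied from the two `du -ks` listings in this post.
# block_run: crawl with io.seqfile.compression.type=BLOCK (ctsite51)
# default_run: crawl with the property removed (ctsite52)
block_run = {
    "content": 301400,
    "crawl_fetch": 376,
    "crawl_generate": 1580,
    "crawl_parse": 1488,
    "parse_data": 2424,
    "parse_text": 17060,
}
default_run = {
    "content": 318568,
    "crawl_fetch": 1536,
    "crawl_generate": 1580,
    "crawl_parse": 41668,
    "parse_data": 9596,
    "parse_text": 17152,
}


def savings(with_block, without):
    """Return {directory: percent of space saved by the BLOCK run}."""
    return {
        name: 100.0 * (without[name] - with_block[name]) / without[name]
        for name in without
    }


if __name__ == "__main__":
    for name, pct in sorted(savings(block_run, default_run).items()):
        print(f"{name:15s} {pct:6.1f}%")
```

With these figures, crawl_parse shrinks by about 96%, crawl_fetch and parse_data by roughly 75% each, while content shrinks by only about 5% and parse_text by less than 1%. crawl_generate is identical in both runs.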

E
Re: Compression issue ?
2007-10-07 10:14:34
Emmanuel wrote:
> It looks like the data within the folder CONTENT and PARSE_TEXT are not
> compressed.
> Is it normal ? do you have the same issue ?

What you see is the difference between RECORD and BLOCK compression. Nutch
uses RECORD compression for content/ and parse_text/, which is less
efficient in terms of maximum compression ratio but preserves record
boundaries, which is crucial for good random-access performance.
BLOCK-compressed data causes performance problems for random access,
because the data has to be read from the nearest sync mark (or from the
beginning of the file) and decompressed.
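
To illustrate the trade-off, here is a rough Python sketch using zlib. It is not Hadoop's SequenceFile format or API; the function names and the sample records are invented for the example. It only shows the general idea: compressing each record separately keeps every record independently readable, while compressing many records together gives a better ratio but forces you to decompress the whole block to reach one record.

```python
import zlib


def record_compress(records):
    """RECORD-style: compress each record on its own.

    Any single record can be decompressed without touching the others.
    """
    return [zlib.compress(r) for r in records]


def block_compress(records):
    """BLOCK-style: compress all records together as one unit.

    Redundancy shared between records is exploited, so the result is
    usually smaller, but record boundaries vanish inside the stream.
    """
    # Length-prefix each record so boundaries can be recovered later.
    payload = b"".join(len(r).to_bytes(4, "big") + r for r in records)
    return zlib.compress(payload)


def read_from_block(blob, index):
    """Fetch one record from a block: decompress everything, then scan."""
    data = zlib.decompress(blob)
    pos = 0
    for _ in range(index):
        # Skip over the records that precede the one we want.
        pos += 4 + int.from_bytes(data[pos:pos + 4], "big")
    size = int.from_bytes(data[pos:pos + 4], "big")
    return data[pos + 4:pos + 4 + size]


if __name__ == "__main__":
    # Hypothetical sample data: many small, similar "pages".
    records = [
        f"<html><head><title>Page {i}</title></head>"
        f"<body>Some shared boilerplate text</body></html>".encode()
        for i in range(200)
    ]
    per_record = record_compress(records)
    one_block = block_compress(records)
    print("raw bytes:   ", sum(len(r) for r in records))
    print("RECORD bytes:", sum(len(c) for c in per_record))
    print("BLOCK bytes: ", len(one_block))
    # Random access: RECORD needs one small decompress, BLOCK needs all of it.
    assert zlib.decompress(per_record[7]) == records[7]
    assert read_from_block(one_block, 7) == records[7]
```

On small, repetitive records like these, the BLOCK-style output comes out much smaller than the sum of the RECORD-style outputs, while only the RECORD-style layout lets you decompress record 7 by itself.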


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||/|  Information Retrieval, Semantic Web
___|||__||  |  ||  |  Embedded Unix, System Integration
http://www.sigram.com 
Contact: info at sigram dot com

