Emmanuel wrote:
> I ran a large crawl over 2M URLs and noticed that my folder was using
> more than 40 GB.
> I found that very odd, so I decided to run a simple test. I ran the
> same crawl twice with different parameters:
> 1- First crawl: I added the following property to hadoop-site.xml
> <property>
> <name>io.seqfile.compression.type</name>
> <value>BLOCK</value>
> <description></description>
> </property>
>
> I launched a crawl over 50000 URLs/pages.
>
> I checked the disk space used and found:
> $ du -ks /data/nutch/local/ctsite51/segments/20071007151116/*
> 301400 /data/nutch/local/ctsite51/segments/20071007151116/content
> 376    /data/nutch/local/ctsite51/segments/20071007151116/crawl_fetch
> 1580   /data/nutch/local/ctsite51/segments/20071007151116/crawl_generate
> 1488   /data/nutch/local/ctsite51/segments/20071007151116/crawl_parse
> 2424   /data/nutch/local/ctsite51/segments/20071007151116/parse_data
> 17060  /data/nutch/local/ctsite51/segments/20071007151116/parse_text
>
> 2- Second crawl: I removed the property from hadoop-site.xml
> and launched the same crawl over 50000 URLs/pages.
>
> I checked the disk space used and found:
> $ du -ks /data/nutch/local/ctsite52/segments/20071007184837/*
> 318568 /data/nutch/local/ctsite52/segments/20071007184837/content
> 1536   /data/nutch/local/ctsite52/segments/20071007184837/crawl_fetch
> 1580   /data/nutch/local/ctsite52/segments/20071007184837/crawl_generate
> 41668  /data/nutch/local/ctsite52/segments/20071007184837/crawl_parse
> 9596   /data/nutch/local/ctsite52/segments/20071007184837/parse_data
> 17152  /data/nutch/local/ctsite52/segments/20071007184837/parse_text
>
> It looks like the data in the content and parse_text folders is not
> compressed.
> Is this normal? Do you see the same issue?
What you see is the difference between RECORD and BLOCK compression. Nutch
uses RECORD compression for content/ and parse_text/, which is less
efficient in terms of maximum compression ratio, but preserves record
boundaries, which is crucial for good random-access performance. This is
why those two directories are nearly the same size in both of your crawls.
BLOCK-compressed data causes performance issues for random access, because
the data has to be read from the nearest sync mark (or from the beginning
of the file) and decompressed.
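
To illustrate the trade-off, here is a minimal sketch in Python. It uses
zlib rather than Hadoop's SequenceFile, so it is only an analogy and not
how Nutch stores segments. All function names, the sample records and the
block size of 50 are assumptions made for this example.

```python
import zlib

BLOCK_SIZE = 50  # hypothetical number of records grouped into one block


def make_records(n=200):
    # Hypothetical stand-in for page content: many small, similar records.
    return [
        f"<html><head><title>Page {i}</title></head>"
        f"<body>Hello from page {i}</body></html>".encode()
        for i in range(n)
    ]


def record_compress(records):
    # RECORD style: every record is compressed on its own,
    # so any single record can be decompressed in isolation.
    return [zlib.compress(r) for r in records]


def block_compress(records, block_size=BLOCK_SIZE):
    # BLOCK style: records are grouped and each group is compressed as a
    # unit. A 4-byte length prefix keeps record boundaries inside a block.
    blocks = []
    for start in range(0, len(records), block_size):
        chunk = records[start:start + block_size]
        payload = b"".join(len(r).to_bytes(4, "big") + r for r in chunk)
        blocks.append(zlib.compress(payload))
    return blocks


def read_record(store, i):
    # Random access: decompress exactly one record.
    return zlib.decompress(store[i])


def read_from_block(blocks, i, block_size=BLOCK_SIZE):
    # Random access: the whole enclosing block must be decompressed,
    # then scanned from its start up to the wanted record.
    payload = zlib.decompress(blocks[i // block_size])
    offset = 0
    for _ in range(i % block_size):
        length = int.from_bytes(payload[offset:offset + 4], "big")
        offset += 4 + length
    length = int.from_bytes(payload[offset:offset + 4], "big")
    return payload[offset + 4:offset + 4 + length]


records = make_records()
record_store = record_compress(records)
block_store = block_compress(records)
record_total = sum(len(c) for c in record_store)
block_total = sum(len(c) for c in block_store)
print("record-compressed bytes:", record_total)
print("block-compressed bytes: ", block_total)
```

On data like this the block-compressed total comes out smaller, because
redundancy shared between records is exploited, while reading one record
from the block store means decompressing and scanning a whole block.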
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||/| Information Retrieval, Semantic Web
___|||__|| | || | Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com