List Info

Thread: Seeing what's occupying all the space in the index




Seeing what's occupying all the space in the index
user name
2006-05-26 17:37:41
It seems odd to me that if you are using the CFS format, why
you would 
have the .fdt, .frq and .prx files in addition to the .cfs
files.  My 
understanding is all files (except deletable and segment)
get put inside 
of the CFS file.  Looking at my indices, I only have the CFS
file.  Are 
you optimizing your indices after you are done indexing? 
Are you 
turning off compound file format?

Can you try a smaller sample in a clean directory and see
what size it 
is (so that it doesn't take as long to index)?

Rob Staveley (Tom) wrote:
>> Is there anything I can learn from the index
directory's file listing?
>>     
>
> Running this nasty little BASH one-liner...
>
> $ for i in `ls * | perl -nle 'if (/^.+(\..+)/) {print
$1;}' | sort |
> uniq`;do ls -l *$i | awk '{SUM = SUM + $5} END {if
(SUM > 1e10) {print
> "'$i': ", SUM}}'; done       
>
> ... I see....
>
> 	.cfs:  1.23155e+10
> 	.fdt:  5.06108e+10
> 	.frq:  1.27472e+10
> 	.prx:  1.3444e+10
>
> That means I have 98 GB of files, with: 
>
> 	51 GB devoted to field data (.fdt), 
> 	13 BG devoted to term positions (.prx)
> 	13 BG devoted to term frequencies (.frq)
> 	12 BG devoted to compound files for the field index
(.cfs)
>
> Does that seem reasonable, bearing in mind I have only
indexed 4.3 million
> Lucene documents? That's 22.8 kB per Lucene document,
and apart from a 300
> character synopsis the fields are all much less than
100 characters long,
> and yet this suggests that the index is providing 600
bytes per field.
>
>   

-- 

Grant Ingersoll 
Sr. Software Engineer 
Center for Natural Language Processing 
Syracuse University 
School of Information Studies 
335 Hinds Hall 
Syracuse, NY 13244 

http://www.cnlp.org 
Voice:  315-443-5484 
Fax: 315-443-6886 


------------------------------------------------------------
---------
To unsubscribe, e-mail: java-user-unsubscribelucene.apache.org
For additional commands, e-mail: java-user-helplucene.apache.org

[1]

about | contact  Other archives ( Real Estate discussion Medical topics )