Author: Aragorn
Email: qg_tech noos.fr
Message:
I was wondering about performance issues.
First, I indexed a 50 MB plain text file. On an
Intel(R) Pentium(R) IV CPU 2.00GHz (Linux), it took 1 hour.
Is that normal?
Is HTML parsing far slower than plain text parsing? I can
see that even small files take a long time to be indexed.
I am indexing a web site with about 400,000 files, about
30 GB in total:
(1) 1/4 of the files are .doc, .pdf, .txt, .html, .ppt, .xls.
(2) 3/4 of the files are zip archives, containing the same
kinds of files as in (1), plus other zip files.
I have written 2 parser scripts:
- the first calls binary converters (antiword,
pdftotext, etc.) or the second script, which is dedicated to
zip files
- the second recursively unzips files and calls the first
script for each recognized file format
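For reference, the logic of the two scripts can be sketched roughly
as below. This is a minimal Python sketch, not the actual scripts:
the function name to_text, the CONVERTERS table and the max_depth
limit are made up for illustration, and it assumes antiword,
pdftotext and lynx are installed and write plain text to stdout.

```python
import io
import os
import subprocess
import tempfile
import zipfile

# Hypothetical table: file extension -> command line for a file path.
# Assumes each tool prints the converted plain text on stdout.
CONVERTERS = {
    ".doc": lambda path: ["antiword", path],
    ".pdf": lambda path: ["pdftotext", path, "-"],
    ".html": lambda path: ["lynx", "-dump", path],
}


def to_text(name, data, depth=0, max_depth=5):
    """Return plain text for one file, given its name and raw bytes."""
    ext = os.path.splitext(name)[1].lower()
    if ext == ".txt":
        return data.decode("utf-8", errors="replace")
    if ext == ".zip":
        if depth >= max_depth:
            return ""  # guard against very deeply nested archives
        parts = []
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            for info in archive.infolist():
                if not info.is_dir():
                    # Recurse: a member may itself be a zip archive.
                    parts.append(to_text(info.filename, archive.read(info),
                                         depth + 1, max_depth))
        return "\n".join(part for part in parts if part)
    if ext in CONVERTERS:
        # The converters want a file path, so write the bytes to a temp file.
        with tempfile.NamedTemporaryFile(suffix=ext, delete=False) as tmp:
            tmp.write(data)
        try:
            result = subprocess.run(CONVERTERS[ext](tmp.name),
                                    capture_output=True, check=False)
            return result.stdout.decode("utf-8", errors="replace")
        finally:
            os.unlink(tmp.name)
    return ""  # unrecognized format: contributes no text
```

Everything is handled in memory except the temporary file handed to
the external converters, so a nested zip never has to be unpacked to
disk.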
All of this works perfectly well: zip files are converted to
plain text (so .zip is associated with text/plain in the config
file) very quickly (all steps are logged to syslog, so I can
see that even big zip files are quickly converted to text), and
even .html files contained in zips are converted to plain text
by a "lynx -dump" command.
What is slow is the indexing of the results: the indexer
process takes 99% CPU for a very long time before going on to
the next URL.
I launched the indexer almost 2 weeks ago and it has
not finished yet. Unzipped, the whole web site must be about
100 GB, which is not so big, is it?
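As a rough sanity check, here is the extrapolation from the 50 MB
test to the whole site. It is only a back-of-the-envelope sketch: it
assumes the indexing rate stays constant, that the real data behaves
like the test file, and it uses decimal units (1 MB = 10^6 bytes).
The function name is made up for illustration.

```python
MB = 10**6
GB = 10**9


def extrapolate_hours(sample_bytes, sample_seconds, total_bytes):
    """Hours needed for total_bytes at the rate seen on the sample."""
    rate = sample_bytes / sample_seconds  # bytes per second
    return total_bytes / rate / 3600


rate_kb_per_s = (50 * MB / 3600) / 1000           # rate seen in the 50 MB test
hours = extrapolate_hours(50 * MB, 3600, 100 * GB)
days = hours / 24
two_weeks_gb = 14 * 24 * 3600 * (50 * MB / 3600) / GB

print(f"rate: {rate_kb_per_s:.1f} KB/s")
print(f"whole site: {hours:.0f} hours, about {days:.0f} days")
print(f"indexed in two weeks at that rate: {two_weeks_gb:.1f} GB")
```

At roughly 14 KB/s, 100 GB would take about 2000 hours, i.e. close
to 3 months, so two weeks of indexing would only cover about 17 GB.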
So, do these figures look "normal", or is
there something wrong with this computer/configuration?
Regards,
Aragorn.
Reply: <http://www.mnogosearch.org/board/message.php?id=17597>
---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe mnogosearch.org
For additional commands, e-mail: general-help mnogosearch.org