Sami Siren-2 wrote:
>
> George Weller wrote:
>> Hi all,
>>
>> First I note in the logs that a large number of PDF documents have been
>> fetched, and yet only two have been indexed, and indeed only these two
>> appear in search results. The content limit is set high enough to allow
>> these documents to be indexed, so I can't think why this should be.
>
> Are there any related errors in the log?
>
>> Secondly for those documents that ARE indexed, rather bizarrely, the
>> document titles in the search results have a '.xls' extension. I can even
>> search for all PDF documents just by using the query 'xls'. Note that this
>> suffix is most definitely NOT in the actual title of those files. I also
>> chanced upon a site that seems to use Nutch (no affiliation - I just
>> googled) and found the same problem...
>
> In the examples from your site the title is extracted from the pdf
> metadata so it just uses the title stored within the pdf doc.
>
> --
> Sami Siren
>
>
Thanks for the reply.
Yes, you're absolutely right! I did a sample crawl on our production server
and I notice that it also returns some PDFs with ".doc" in the title... I
can now see that this is due to whatever software was used to convert the
XLS or DOC documents to PDF format in the first place!
I couldn't spot any other errors in the log, but I think I managed to solve
the other problem too. I had the content limit set to around 1.6MB IIRC,
which after a quick survey of common documents I concluded would be enough to
allow indexing of the main docs that people would search for (most of which
were a couple of hundred kilobytes), but it seems that it wasn't enough. I
have now set it to be unlimited (i.e. -1), and I'm getting proper results.
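For reference, a change like the one described above might look as follows in conf/nutch-site.xml. This is only a sketch: it assumes the PDFs were fetched over HTTP, in which case the relevant property would be http.content.limit. The message does not say which protocol was in use; file.content.limit and ftp.content.limit are the counterparts for the other protocols.

```xml
<!-- conf/nutch-site.xml: local overrides for values in nutch-default.xml -->
<configuration>
  <property>
    <!-- Assumed property: applies only to content fetched over HTTP. -->
    <name>http.content.limit</name>
    <!-- A negative value disables truncation of fetched content. -->
    <value>-1</value>
  </property>
</configuration>
```

Content longer than the limit is truncated rather than skipped, and a truncated PDF is an incomplete file that a PDF parser will often fail to read. That would be consistent with PDFs showing up as fetched in the logs but missing from the index.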
Now I just need to find out what "more.jsp" does, and how to get it going...
Back to the wiki I think!
Thanks again,
George
--
View this message in context: http://www.nabble.com/PDF-problems%2C-inc.-documents-returned-with-XLS-extension-tf4671286.html#a13381606
Sent from the Nutch - User mailing list archive at Nabble.com.