I've been trying to index .pdf and .doc documents in v. 3.2.0b with doc2html/catdoc/pdf2html.
Both types appear to be indexed fine (though I'm not sure why the log doesn't show which words and tags have been indexed). See below:
pick: devserverxxx.com, # servers = 1
>devserverxxx.com with a traditional HTTP connection
316:33:2:https://devserverxxx.com/library/ADJA/docs/portlet-1_0-fr-spec.pdf
Header line: HTTP/1.1 200 OK
Header line: Date: Wed, 09 May 2007 21:19:01 GMT
Header line: Server: Apache
Header line: Last-Modified: Mon, 07 May 2007 14:08:26 GMT
Header line: ETag: "1f841c-6af5b-d5aeea80"
Discarded header line: ETag: "1f841c-6af5b-d5aeea80"
Header line: Accept-Ranges: bytes
Discarded header line: Accept-Ranges: bytes
Header line: Content-Type: application/pdf
Header line: Content-Length: 438107
Header line: Via: 1.1 ichainserver.devserverxxx.com (iChain 2.3.345)
Discarded header line: Via: 1.1 ichainserver.devserverxxx.com (iChain 2.3.345)
Retrieving document /library/ADJA/docs/portlet-1_0-fr-spec.pdf on host: devserverxxx.com:443
Http version : HTTP/1.1
Server : HTTP/1.1
Status Code : 200
Reason : OK
Access Time : Wed, 09 May 2007 21:19:01 GMT
Modification Time : Mon, 07 May 2007 14:08:26 GMT
Content-type : application/pdf
Request time: 0 secs
size = 438107
pick: devserverxxx.com, # servers = 1
>devserverxxx.com with a traditional HTTP connection
96:39:2:https://devserverxxx.com/library/ADJA/forms/Indexing_Form.doc
Header line: HTTP/1.1 200 OK
Header line: Date: Wed, 09 May 2007 21:18:28 GMT
Header line: Server: Apache
Header line: Last-Modified: Tue, 30 Aug 2005 20:19:58 GMT
Header line: ETag: "224003-6a00-55fc3780"
Discarded header line: ETag: "224003-6a00-55fc3780"
Header line: Accept-Ranges: bytes
Discarded header line: Accept-Ranges: bytes
Header line: Content-Type: application/msword
Header line: Content-Length: 27136
Header line: Via: 1.1 ichainserver.devserverxxx.com (iChain 2.3.345)
Discarded header line: Via: 1.1 ichainserver.devserver.com (iChain 2.3.345)
Retrieving document /library/ADJA/forms/Indexing_Form.doc on host: devserverxxx.com:443
Http version : HTTP/1.1
Server : HTTP/1.1
Status Code : 200
Reason : OK
Access Time : Wed, 09 May 2007 21:18:28 GMT
Modification Time : Tue, 30 Aug 2005 20:19:58 GMT
Content-type : application/msword
Request time: 0 secs
size = 27136
After indexing, I searched for some terms which are definitely in both the pdf and doc documents, but got no hits!
So I tried using parse_doc.pl instead of doc2html, with the same paths to catdoc and pdftotext/pdfinfo. That indexed fine and returned some valid hits.
Could anyone help me figure out why I get no hits with doc2html?
By the way, here is my config file:
# for doc2html
external_parsers: application/msword->text/html /path/to/doc2html.pl
                  application/pdf->text/html /path/to/doc2html.pl
# for parse_doc.pl
#external_parsers: application/msword /path/to/parse_doc.cgi
#                  application/pdf /path/to/parse_doc.cgi
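
One way to narrow this down is to run the converter by hand on a saved copy of a document and check whether the expected term shows up in its output. The following is only a minimal sketch in Python, under stated assumptions: it assumes the external parser is invoked with four arguments (input file, content type, URL, configuration file), and the helper names build_parser_command, run_parser and contains_term are made up for illustration. All paths are placeholders.

```python
import subprocess


def build_parser_command(parser, infile, content_type, url, config_file):
    """Return the argv list for one external-parser invocation.

    Assumption: the parser takes four arguments in this order:
    input file, content type, URL, configuration file.
    """
    return [parser, infile, content_type, url, config_file]


def contains_term(text, term):
    """Case-insensitive check that a search term appears in the output."""
    return term.lower() in text.lower()


def run_parser(parser, infile, content_type, url, config_file):
    """Run the converter and return whatever it writes to stdout."""
    cmd = build_parser_command(parser, infile, content_type, url, config_file)
    # check=False: a non-zero exit status is not raised as an exception here;
    # an empty result then simply means the converter produced no output.
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout


# Example use (placeholder paths, not taken from the real setup):
#   html = run_parser("/path/to/doc2html.pl", "/tmp/sample.pdf",
#                     "application/pdf",
#                     "https://devserverxxx.com/sample.pdf",
#                     "/path/to/htdig.conf")
#   print("term found" if contains_term(html, "some term") else "term missing")
```

If the output is empty or lacks the term, the problem is probably in the converter or its helper paths rather than in the indexing step itself.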
_______________________________________________
ht://Dig general mailing list: <htdig-general lists.sourceforge.net>
ht://Dig FAQ: http://htdig.sourceforge.net/FAQ.html
List information (subscribe/unsubscribe, etc.)
https://lists.sourceforge.net/lists/listinfo/htdig-general