Thread: doc2html - indexed but no hits

2007-05-10 07:43:10
I've been trying to index .pdf and .doc documents in v. 3.2.0b with
doc2html/catdoc/pdf2html. I can see that both types are indexed fine
(though I'm not sure why the log doesn't show which words and tags
have been indexed). See below:

pick: devserverxxx.com, # servers = 1
>devserverxxx.com with a traditional HTTP connection
316:33:2:https://devserverxxx.com/library/ADJA/docs/portlet-1_0-fr-spec.pdf
Header line: HTTP/1.1 200 OK
Header line: Date: Wed, 09 May 2007 21:19:01 GMT
Header line: Server: Apache
Header line: Last-Modified: Mon, 07 May 2007 14:08:26 GMT
Header line: ETag: "1f841c-6af5b-d5aeea80"
Discarded header line: ETag: "1f841c-6af5b-d5aeea80"
Header line: Accept-Ranges: bytes
Discarded header line: Accept-Ranges: bytes
Header line: Content-Type: application/pdf
Header line: Content-Length: 438107
Header line: Via: 1.1 ichainserver.devserverxxx.com (iChain 2.3.345)
Discarded header line: Via: 1.1 ichainserver.devserverxxx.com (iChain 2.3.345)
Retrieving document /library/ADJA/docs/portlet-1_0-fr-spec.pdf on host: devserverxxx.com:443
Http version      : HTTP/1.1
Server            : HTTP/1.1
Status Code       : 200
Reason            : OK
Access Time       : Wed, 09 May 2007 21:19:01 GMT
Modification Time : Mon, 07 May 2007 14:08:26 GMT
Content-type      : application/pdf
Request time: 0 secs
size = 438107

pick: devserverxxx.com, # servers = 1
>devserverxxx.com with a traditional HTTP connection
96:39:2:https://devserverxxx.com/library/ADJA/forms/Indexing_Form.doc
Header line: HTTP/1.1 200 OK
Header line: Date: Wed, 09 May 2007 21:18:28 GMT
Header line: Server: Apache
Header line: Last-Modified: Tue, 30 Aug 2005 20:19:58 GMT
Header line: ETag: "224003-6a00-55fc3780"
Discarded header line: ETag: "224003-6a00-55fc3780"
Header line: Accept-Ranges: bytes
Discarded header line: Accept-Ranges: bytes
Header line: Content-Type: application/msword
Header line: Content-Length: 27136
Header line: Via: 1.1 ichainserver.devserverxxx.com (iChain 2.3.345)
Discarded header line: Via: 1.1 ichainserver.devserver.com (iChain 2.3.345)
Retrieving document /library/ADJA/forms/Indexing_Form.doc on host: devserverxxx.com:443
Http version      : HTTP/1.1
Server            : HTTP/1.1
Status Code       : 200
Reason            : OK
Access Time       : Wed, 09 May 2007 21:18:28 GMT
Modification Time : Tue, 30 Aug 2005 20:19:58 GMT
Content-type      : application/msword
Request time: 0 secs
size = 27136


After indexing, I searched for some terms that definitely appear in
both the PDF and DOC documents, but got no hits!

So I tried using parse_doc.pl instead of doc2html, with the same paths
to catdoc and pdftotext/pdfinfo. That indexed fine and returned some
valid hits.

Could anyone help me figure out why I get no hits with doc2html?

By the way, here is my config file:

# for doc2html
external_parsers:   application/msword->text/html /path/to/doc2html.pl
                    application/pdf->text/html /path/to/doc2html.pl

# for parse_doc.pl
#external_parsers:   application/msword /path/to/parse_doc.cgi
#                    application/pdf /path/to/parse_doc.cgi
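
One way to narrow this down might be to run the converter by hand and
check whether the expected terms show up in its HTML output at all. The
sketch below is Python and rests on an assumption: that htdig calls an
external converter with four arguments (path to a temporary copy of the
document, MIME type, URL, and config file path) and reads the converted
HTML from standard output. The function names and any paths passed in
are hypothetical, not part of ht://Dig or doc2html.

```python
import subprocess
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects the character data found between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def visible_text(html):
    """Return the text content of an HTML string, whitespace-normalized."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return " ".join(" ".join(collector.chunks).split())


def converter_command(script, doc_path, mime_type, url, config_path):
    """Build the argument list in the order htdig is ASSUMED to pass it."""
    return [script, doc_path, mime_type, url, config_path]


def converted_text(script, doc_path, mime_type, url, config_path):
    """Run the converter and return the plain text of its HTML output.

    Raises subprocess.CalledProcessError if the converter exits non-zero.
    """
    result = subprocess.run(
        converter_command(script, doc_path, mime_type, url, config_path),
        capture_output=True,
        text=True,
        check=True,
    )
    return visible_text(result.stdout)
```

Calling converted_text() with /path/to/doc2html.pl, a local copy of one
of the documents, its MIME type, its URL, and the htdig config path,
then testing whether a known term is in the returned string, would show
whether the words are lost in the conversion step or later, during
indexing or searching.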
