Thread: Re: doc2html - indexed but no hits




2007-05-10 14:29:37
Mike,

I figured it out.
It was a simple mistake on my side: I should have looked at doc2html and
pdf2html themselves. Some of the scripts run inside a jail and called
other scripts outside the jail, so htdig couldn't index the .doc and
.pdf files properly.

I appreciate your time and help!
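
A minimal sketch of how the jail problem described above could be checked: given the jail's root directory on the host and the absolute paths that the converter scripts call, report which helpers are not present inside the jail. The function names, the jail root, and the helper paths are all hypothetical placeholders, not details from this setup.

```python
import os
from typing import List


def resolve_in_jail(jail_root: str, path: str) -> str:
    """Map an absolute path as seen from inside the jail to its host path."""
    return os.path.join(jail_root, path.lstrip("/"))


def missing_in_jail(jail_root: str, helper_paths: List[str]) -> List[str]:
    """Return the helper paths that are not executable files inside the jail.

    Limitation: absolute symlinks are followed on the host, so a link that
    points outside the jail can look fine here and still fail in the jail.
    """
    missing = []
    for path in helper_paths:
        host_path = resolve_in_jail(jail_root, path)
        if not (os.path.isfile(host_path) and os.access(host_path, os.X_OK)):
            missing.append(path)
    return missing
```

Any path this returns is one that a script running inside the jail would fail to find, which is the kind of failure that leaves .doc and .pdf files fetched but never converted.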


>From: "CHUN KI SHIN" <ckshin0121hotmail.com>
>To: htdig-generallists.sourceforge.net
>Subject: Re: [htdig] doc2html - indexed but no hits
>Date: Thu, 10 May 2007 13:31:43 -0500
>
>Mike,
>
>It looks like you are right. I reindexed the docs with the -i -s -v
>options and got the following:
>
>>From: <michael.brockingtonbt.com>
>>To: <htdig-generallists.sourceforge.net>
>>Subject: Re: [htdig] doc2html - indexed but no hits
>>Date: Thu, 10 May 2007 16:21:17 +0100
>>
>>  In this case I can be fairly sure they were not called!
>>Note the line that says 'not changed'? Not sure how extensive your
>>indexes are, or if you are in a production status, but you may want to
>>add the -i flag to do an index from scratch. From memory, the -s flag
>>turns on a set of summary statistics, which may include useful info.
>>During a normal run at the correct level, you should see a line like
>>++++---++-
>>for each file that you index. www.htdig.org can reveal what these
>>symbols mean - I can't remember offhand, but this helps to indicate
>>what is actually found inside a document. Check also that htmerge is
>>running at a similar verbosity setting.
>>
>>On my system, doc2html etc. is called from an intermediate DOS batch
>>file, which is an easy place to put in an extra bit of logging.
>>Alternatively, you may be brave enough to put a debug line into doc2html
>>itself - it is just a bit of Perl if I remember correctly.
>>
>>Mike
>>NB, I have copied this back to the list - not sure if you meant to send
>>this direct, I get that wrong all the time!
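
The idea of logging from an intermediate wrapper, as suggested in the quoted message above, could be sketched in Python roughly as follows. This is an illustration only, not the batch file mentioned there: `run_logged`, the log path, and the converter path are hypothetical names. The function appends a line to a log each time it is invoked, runs the real converter with the same arguments, and hands back the converter's exit code and output unchanged.

```python
import subprocess
import time


def run_logged(real_cmd, args, log_path):
    """Log one invocation, run the real converter, return (exit code, stdout)."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(f"{stamp} called with {list(args)!r}\n")

    # Forward the arguments untouched; the converter's stdout is captured
    # so the caller can pass it on to htdig byte for byte.
    result = subprocess.run([real_cmd, *args], capture_output=True)

    with open(log_path, "a", encoding="utf-8") as log:
        log.write(f"  exit={result.returncode} stdout_bytes={len(result.stdout)}\n")
    return result.returncode, result.stdout


# Hypothetical wiring for a wrapper script placed in front of the converter:
#   code, out = run_logged("/path/to/doc2html.pl", sys.argv[1:], "/tmp/wrapper.log")
#   sys.stdout.buffer.write(out); sys.exit(code)
```

If the log stays empty after an indexing run, the wrapper was never invoked, which would point at the htdig configuration rather than at the converter itself. A logged call with `stdout_bytes=0` would instead suggest the converter ran but produced nothing to index.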
>>
>> > -----Original Message-----
>> > From: CHUN KI SHIN [mailto:ckshin0121hotmail.com]
>> > Sent: Thursday, May 10, 2007 4:12 PM
>> > To: Brockington,MJ,Michael,JPGA4X R
>> > Subject: Re: [htdig] doc2html - indexed but no hits
>> >
>> > Thanks for the quick response, Mike.
>> >
>> > Ok, I ran the script with -vv, and I don't know what I'm
>> > looking for from the index log. Only thing I can see is the
>> > following:
>> >
>> > pick: devserverxxx.com, # servers = 1
>> > 234:31:2:https://devserverxxx.com/library/ADJA/docs/portlet-1_0-fr-spec.pdf:
>> >   not changed
>> >
>> > and the same for .doc.
>> >
>> > Could you tell me how to make sure doc2html is being called?
>> >
>> > Also, what do you mean by 'statistics' in htdig?
>> >
>> > Thanks for your time and help!
>> >
>> > >From: <michael.brockingtonbt.com>
>> > >To: <htdig-generallists.sourceforge.net>
>> > >Subject: Re: [htdig] doc2html - indexed but no hits
>> > >Date: Thu, 10 May 2007 14:14:59 +0100
>> > >
>> > >Can you tell if doc2html is actually being called by htdig? Just
>> > >because htdig is downloading the document, it does not guarantee
>> > >that it is being passed over for conversion to an indexable format.
>> > >It might be worth decreasing the number of v's you are using by one
>> > >or two so that you can see what is being found in each document.
>> > >Not sure if you have the 'statistics' turned on?
>> > >Regards,
>> > >Mike
>> > >
>> > > > -----Original Message-----
>> > > > From: htdig-general-bounceslists.sourceforge.net
>> > > > [mailto:htdig-general-bounceslists.sourceforge.net] On
>> > > > Behalf Of CHUN KI SHIN
>> > > > Sent: Thursday, May 10, 2007 1:43 PM
>> > > > To: htdig-generallists.sourceforge.net
>> > > > Subject: [htdig] doc2html - indexed but no hits
>> > > >
>> > > > I've been trying to index .pdf and .doc documents in v. 3.2.0b
>> > > > with doc2html/catdoc/pdf2html.
>> > > > I can see both types indexed fine (though I'm not sure why
>> > > > log doesn't tell which words and tags have been indexed). See
>> > > > below:
>> > > >
_______________________________________________
ht://Dig general mailing list: <htdig-generallists.sourceforge.net>
ht://Dig FAQ: http://htdig.sourceforge.net/FAQ.html
List information (subscribe/unsubscribe, etc.)
https://lists.sourceforge.net/lists/listinfo/htdig-general