Brilliant - I see the problem now, thank you very much.
As you say, in my first example I was calling StringReader directly
with whatever read() returned, which is not necessarily UTF-8. My own
StringReader class didn't specify UTF-8 either.
I've simply added
uniText=unicode(textString, 'utf8','ignore' )
and passed this into PyLucene.StringReader and getBestFragment, and
it works!
Thanks again,
Phil.
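
For readers on Python 3, where the unicode() builtin used above no
longer exists, the same idea can be sketched as below. This is only an
illustration under stated assumptions: the helper name to_unicode and
the sample bytes are hypothetical, and nothing here calls PyLucene.

```python
def to_unicode(raw_bytes, encoding="utf-8"):
    """Decode raw file bytes, silently dropping bytes that are not
    valid in the given encoding (the 'ignore' error handler)."""
    return raw_bytes.decode(encoding, "ignore")


# Hypothetical sample: Latin-1 bytes, which are not valid UTF-8.
raw = "caf\u00e9 highlight".encode("latin-1")

uni_text = to_unicode(raw)
# The undecodable byte is dropped instead of raising an error.
print(repr(uni_text))
```

Note that 'ignore' discards the offending bytes, so the decoded text
can differ from what is in the file. If the real encoding of the files
is known, decoding with that encoding is likely the safer choice.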
-----Original Message-----
From: pylucene-dev-bounces osafoundation.org
[mailto:pylucene-dev-bounces osafoundation.org] On Behalf Of Andi Vajda
Sent: 28 November 2006 17:05
To: 'pylucene-dev osafoundation.org'
Subject: Re: [pylucene-dev] Problems with StringReader()
On Tue, 28 Nov 2006, BEADLING, Philip, GBM wrote:
> def highlight( self, searchText, searchResultFilenames ):
>     for filename in searchResultFilenames:
>         # Find text directory from documents directory and convert
>         # network fileshare to local mount
>         textFile = filename.replace("\\Documents\\", "\\Text\\") + ".txt"
>         textFile = textFile.replace("\\", "/")
>         textFile = textFile.replace("//networkshare/IRDcaf/Documentation",
>                                     "/Documentation")
>
>         print "<br>", searchText, "<br>", textFile
>         if os.path.isfile( textFile ):
>             filen = open( textFile, 'r' )
>             textString = filen.read()
>             filen.close()
>             term = Term( "field", searchText )
>             termQuery = TermQuery( term )
>             scorer = QueryScorer( termQuery )
>             highlighter = Highlighter( scorer )
>             simpAn = SimpleAnalyzer()
>             # PROBLEM IS HERE!!!!
>             reader = PyLucene.StringReader( textString )
>             tokenStream = simpAn.tokenStream("field", reader )
>             print highlighter.getBestFragment( tokenStream, textString )
>
At first quick glance, it doesn't look like 'textString' is going to be
of type 'unicode' in the above code sample. What comes out of a Python
file's read method is an object of type 'str'. I believe PyLucene will
try to convert the 'str' into a 'unicode' object by assuming 'utf-8'
encoding. If your 'str' is not 'utf-8' encoded then that is going to
fail.
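
The failure described here can be reproduced without PyLucene. A
minimal sketch, assuming Python 3 (where raw file contents are 'bytes'
rather than the Python 2 'str' discussed in this thread) and a made-up
Latin-1 sample; whether PyLucene's own conversion fails in exactly this
way is an assumption, not something verified here.

```python
# Hypothetical sample: text saved as Latin-1, so the byte for the
# accented letter is not a valid UTF-8 sequence.
raw = "na\u00efve".encode("latin-1")

try:
    raw.decode("utf-8")          # strict decoding, assuming UTF-8
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False           # the assumption of UTF-8 was wrong

# Decoding with the encoding the bytes were actually written in works.
text = raw.decode("latin-1")
print(decoded_ok, repr(text))
```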
If you send in a piece of code that runs (with the required data) and
reproduces the problem you're experiencing, I might be able to help you
better.
Andi..
_______________________________________________
pylucene-dev mailing list
pylucene-dev osafoundation.org
http://lists.osafoundation.org/mailman/listinfo/pylucene-dev