Hi!
I'm facing a similar problem. Some HTML docs are indexed correctly
and others are simply rejected, even though I encoded all the
problematic HTML tags as Thorsten suggested.
In the following example, my_doc.xml is a valid XML file that
complies with the fields of my Solr schema:
$ java -jar post.jar ./my_doc.xml
SimplePostTool: version 1.2
SimplePostTool: WARNING: Make sure your XML documents are encoded in UTF-8, other encodings are not currently supported
SimplePostTool: POSTing files to http://localhost:8983/solr/update..
SimplePostTool: POSTing file solrdoc
SimplePostTool: FATAL: Connection error (is Solr running at http://localhost:8983/solr/update ?): java.io.IOException: Server returned HTTP response code: 500 for URL: http://localhost:8983/solr/update
Is there any way to make Solr more verbose than that?
Do I need to go into the Java code to understand what happens?
I'm looking for a simple solution.
Thanks in advance
cheers
Y.
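
One possible cause of the 500 (an assumption, this output does not confirm it) is that the posted file is not well-formed XML, for example because one tag or ampersand was left unencoded. A minimal sketch in Python, independent of Solr, that reports where parsing fails; the function name check_well_formed and the sample documents are hypothetical:

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return None if xml_text parses as XML, else a short error message."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        # The exception text includes the line and column of the failure.
        return f"not well-formed: {err}"
    return None


if __name__ == "__main__":
    # Hypothetical Solr add documents, used only for illustration.
    good = '<add><doc><field name="id">1</field></doc></add>'
    bad = '<add><doc><field name="body"><br></field></doc></add>'
    print(check_well_formed(good))
    print(check_well_formed(bad))
```

For a file on disk, ET.parse("my_doc.xml") raises the same ParseError, so the check can be run on the file before handing it to post.jar.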
----Original message----
>From: "steve.christin gmail.com"
>Subject: Re: Problem with html code inside xml
>Date: Tue, 2 Oct 2007 16:15:26 +0200
>To: solr-user lucene.apache.org
>
>Thanks
>
>I use this solution:
>
>I put <![CDATA[ my HTML code here ]]> in the XML to be indexed, and
>it works; nothing needs to change in the XSL.
>
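
One caveat with the CDATA approach: the HTML must not itself contain the sequence ]]>, which would end the section early. A minimal sketch in Python that splits such a sequence across two CDATA sections; the helper name to_cdata and the sample HTML are illustrative assumptions, not something from this thread:

```python
import xml.etree.ElementTree as ET


def to_cdata(text):
    """Wrap text in CDATA, splitting any embedded "]]>" so it cannot close the section early."""
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


if __name__ == "__main__":
    # Sample HTML containing the problematic "]]>" sequence on purpose.
    html = '<div class="paragraph">a ]]> b<br/></div>'
    doc = '<add><doc><field name="storyFullText">' + to_cdata(html) + "</field></doc></add>"
    # Parse the document back and compare the field text with the input HTML.
    field = ET.fromstring(doc).find("doc/field")
    print(field.text == html)
```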
>In the schema I use this fieldType
>
><fieldType name="html" class="solr.TextField" positionIncrementGap="100">
>  <analyzer>
>    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>    <filter class="solr.LowerCaseFilterFactory"/>
>    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>    <filter class="solr.ISOLatin1AccentFilterFactory"/>
>    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>  </analyzer>
></fieldType>
>
>----------
>Now my question:
>I created a field to index only the text of this HTML code.
>
>I created a field type:
>
><fieldType name="htmlTxt" class="solr.TextField" positionIncrementGap="100">
>  <analyzer>
>    <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
>    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>    <filter class="solr.LowerCaseFilterFactory"/>
>    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>    <filter class="solr.ISOLatin1AccentFilterFactory"/>
>    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>  </analyzer>
></fieldType>
>
>Everything works (the div and p tags are removed), but some
><strong>nnn</strong> or <br/> tags are still in the text after
>indexing.
>
>If you have any idea how to solve this problem, that would be great.
>
>Thanks
>
>S. Christin
>
>
>
>-------------
>
>
>On 25 Sept. 07 at 13:14, Thorsten Scherler wrote:
>
>> On Tue, 2007-09-25 at 12:06 +0100, Jérôme Etévé wrote:
>>> If I understand, you want to keep the raw HTML code in Solr like this
>>> (in your posting XML file):
>>>
>>> <field name="storyFullText">
>>> <html></html>
>>> </field>
>>>
>>> I think you should encode your content to protect these XML entities:
>>> < -> &lt;
>>> > -> &gt;
>>> " -> &quot;
>>> & -> &amp;
>>>
>>> If you use Perl, have a look at HTML::Entities.
>>
>> AFAIR you cannot use tags; they always get transformed to
>> entities. The solution is to have an XSL transformation after the
>> response that transforms the entities back to tags.
>>
>> Have a look at the thread
>> http://marc.info/?t=116775837900001&r=1&w=2
>> and especially at
>> http://marc.info/?l=solr-user&m=116782664828926&w=2
>>
>> HTH
>>
>> salu2
>>
>>>
>>>
>>> On 9/25/07, steve.christin gmail.com <steve.christin gmail.com>
>>> wrote:
>>>> Hello,
>>>>
>>>> I have a problem with HTML code that is embedded in an XML file:
>>>>
>>>> Sample source:
>>>>
>>>> <content>
>>>> <stories>
>>>> <div class="storyTitle">
>>>> Les débats
>>>> </div>
>>>> <div class="storyIntroductionText">
>>>> Le premier tour des élections fédérales se déroulera le 21
>>>> octobre prochain. D'ici là, La 1ère vous propose plusieurs
>>>> rendez-vous, dont plusieurs grands débats à l'enseigne de Forums.
>>>> </div>
>>>> <div class="paragraph">
>>>> <div class="paragraphTitle"/>
>>>> <div class="paragraphText">
>>>> my para textehere
>>>> <br/>
>>>> <br/>
>>>> Vous trouverez sur cette page toutes les dates et les heures de
>>>> ces différents rendez-vous ainsi que le nom et les partis des
>>>> débatteurs. De plus, vous pourrez également écouter ou réécouter
>>>> l'ensemble de ces émissions.
>>>> </div>
>>>> </div>
>>>> ....
>>>> ---------
>>>> When I make a query on Solr, I get something like this in the
>>>> source of the XML result:
>>>>
>>>> <td xmlns="http://www.w3.org/1999/xhtml">
>>>> <span class="markup">&lt;</span>
>>>> <span class="start-tag">div</span>
>>>> <span class="attribute-name">class</span>
>>>> <span class="markup">=</span>
>>>> <span class="attribute-value">"paragraph"</span>
>>>> <span class="markup">&gt;</span><div class="expander-content">
>>>> <div class="indent"><span class="markup">&lt;</span>
>>>> <span class="start-tag">div</span>
>>>> <span class="attribute-name">class</span>
>>>> <span class="markup">=</span>
>>>> <span class="attribute-value">"paragraphTitle"</span>
>>>> <span class="markup">/&gt;</span></div><table><tr>
>>>> <td class="expander">−<div class="spacer"/>
>>>> </td><td><span class="markup">&lt;</span>
>>>> ...
>>>>
>>>> This is not exactly what I want. I want to keep the HTML tags as
>>>> they are, without any formatting.
>>>>
>>>> The br tags and a tags are well formed in the XML and JSON results,
>>>> but the div tags are not kept.
>>>> ---------
>>>> In schema.xml I have this for the HTML content:
>>>>
>>>> <fieldType name="html" class="solr.TextField" />
>>>>
>>>> <field name="storyFullText" type="html" indexed="true"
>>>> stored="true" multiValued="true"/>
>>>>
>>>> ---------
>>>>
>>>> Any help would be appreciated.
>>>>
>>>> Thanks in advance.
>>>>
>>>> S. Christin
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>> --
>> Thorsten Scherler
>> thorsten.at.apache.org
>> Open Source Java consulting, training and solutions
>>
>