
Thread: Re: Re: Problem with html code inside xml




Re: Re: Problem with html code inside xml
France
2007-10-02 18:01:26
Hi !

I'm facing a similar problem. Some HTML docs are indexed correctly
and others are simply rejected, even though I encoded all the
problematic HTML tags as Thorsten suggested.

In the following example, "my_doc.xml" is a valid XML file,
compliant with my Solr schema fields:

$ java -jar post.jar ./my_doc.xml 

SimplePostTool: version 1.2
SimplePostTool: WARNING: Make sure your XML documents are encoded in UTF-8, other encodings are not currently supported
SimplePostTool: POSTing files to http://localhost:8983/solr/update..
SimplePostTool: POSTing file solrdoc
SimplePostTool: FATAL: Connection error (is Solr running at http://localhost:8983/solr/update ?): java.io.IOException: Server returned HTTP response code: 500 for URL: http://localhost:8983/solr/update

Is there any way to make Solr more verbose than that?
Do I need to go into the Java code to understand what happens?
I'm looking for a simple solution.

Thanks in advance

cheers
Y.

----Original message----
>From: "steve.christingmail.com" 
>Subject: Re: Problem with html code inside xml
>Date: Tue, 2 Oct 2007 16:15:26 +0200
>To: solr-userlucene.apache.org
>
>Thanks
>
>I use this solution:
>
>Put <![CDATA[ Here my html code ]]> in the XML to be indexed and
>it works; nothing to change in the XSL.
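
A minimal Python sketch of the CDATA approach described above. It is not taken from the thread: the helper name `cdata` and the sample HTML are assumptions for illustration, while the field name `storyFullText` is the one used in the original question. The one caveat it handles is that a CDATA section ends at the first `]]>`, so that sequence has to be split across two sections if it ever occurs in the HTML.

```python
import xml.etree.ElementTree as ET


def cdata(text):
    """Wrap text in a CDATA section, splitting any embedded ']]>' terminator."""
    # ']]>' would close the section early, so end the section after ']]'
    # and reopen a new one that starts with '>'.
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


# Hypothetical HTML payload for illustration only.
html = '<div class="paragraph">my para text<br/></div>'
update_xml = (
    '<add><doc><field name="storyFullText">'
    + cdata(html)
    + "</field></doc></add>"
)

# Sanity check: an XML parser hands back the original HTML unchanged.
parsed_field = ET.fromstring(update_xml).find("doc/field")
```

After parsing, `parsed_field.text` is the original HTML string, which is what the XML parser on the receiving side would see as the field value.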
>
>In the schema I use this fieldType
>
><fieldType name="html" class="solr.TextField" positionIncrementGap="100">
>  <analyzer>
>    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>    <filter class="solr.WordDelimiterFilterFactory"
>            generateWordParts="1" generateNumberParts="1" catenateWords="1"
>            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>    <filter class="solr.LowerCaseFilterFactory"/>
>    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>    <filter class="solr.ISOLatin1AccentFilterFactory"/>
>    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>  </analyzer>
></fieldType>
>
>----------
>Now question:
>I created a field to index only the text of this HTML code.
>
>I created a field type:
>
><fieldType name="htmlTxt" class="solr.TextField" positionIncrementGap="100">
>  <analyzer>
>    <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
>    <filter class="solr.WordDelimiterFilterFactory"
>            generateWordParts="1" generateNumberParts="1" catenateWords="1"
>            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>    <filter class="solr.LowerCaseFilterFactory"/>
>    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>    <filter class="solr.ISOLatin1AccentFilterFactory"/>
>    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>  </analyzer>
></fieldType>
>
>Everything works (the div and p tags are removed), but some
><strong>nnn</strong> or <br/> tags are still in the text after
>indexing.
>
>If you have any idea how to solve this problem, that would be
>great.
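
One point that may be relevant here: an analyzer only affects the indexed terms, not the stored value that comes back in results, so tags can remain visible in the stored text whatever the tokenizer does. If the goal is a stored value without tags, one possible workaround is to strip the markup on the client before posting. The following is only a sketch of that idea using Python's standard `html.parser`; it is not what `HTMLStripWhitespaceTokenizerFactory` does internally, and the names `strip_tags` and `_TextExtractor` are made up for illustration.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text content and drops every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Insert a space so words on both sides of a tag stay separated.
        self.parts.append(" ")

    def handle_endtag(self, tag):
        self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html):
    """Return the text of an HTML fragment with tags removed."""
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()  # flush any buffered trailing text
    # Collapse the runs of whitespace left behind by removed tags.
    return " ".join("".join(extractor.parts).split())


plain = strip_tags("<p>Hello <strong>nnn</strong><br/>world &amp; more</p>")
# plain == "Hello nnn world & more"
```

The stripped string could then be posted to the text-only field, while the raw HTML goes to the other field. Note that replacing each tag with a space is a design choice: it also splits words that had inline markup in the middle.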
>
>Thanks
>
>S. Christin
>
>
>
>-------------
>
>
>On 25 Sept. 07 at 13:14, Thorsten Scherler wrote:
>
>> On Tue, 2007-09-25 at 12:06 +0100, Jérôme Etévé wrote:
>>> If I understand, you want to keep the raw html code in solr like that
>>> (in your posting xml file):
>>>
>>> <field name="storyFullText">
>>>   <html></html>
>>> </field>
>>>
>>> I think you should encode your content to protect these xml entities:
>>> <  ->  &lt;
>>> >  ->  &gt;
>>> "  ->  &quot;
>>> &  ->  &amp;
>>>
>>> If you use perl, have a look at HTML::Entities.
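
For readers not using Perl, the same escaping can be sketched in Python with the standard library. This is an illustration, not code from the thread: the helper name `build_add_doc` and the sample HTML are assumptions, and the `<add><doc><field>` wrapper is the usual shape of a Solr XML update message. `xml.sax.saxutils.escape` handles `&`, `<` and `>` by default; the double quote is added through its optional entities argument.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def build_add_doc(field_name, html):
    """Build a one-field Solr <add> message with the HTML entity-escaped."""
    # escape() replaces & first, then < and >, then the extra entities given.
    escaped = escape(html, {'"': "&quot;"})
    # field_name is assumed to be a plain identifier and is not escaped here.
    return '<add><doc><field name="%s">%s</field></doc></add>' % (
        field_name,
        escaped,
    )


# Hypothetical payload for illustration only.
raw_html = '<div class="paragraph">a & b<br/></div>'
message = build_add_doc("storyFullText", raw_html)

# An XML parser on the receiving side recovers the original HTML string.
recovered = ET.fromstring(message).find("doc/field").text
```

Escaping `&` first matters: doing it last would turn the `&lt;` produced for `<` into `&amp;lt;`.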
>>
>> AFAIR you cannot use tags, they always are getting transformed to
>> entities. The solution is to have a xsl transformation after the
>> response that transforms the entities back to tags.
>>
>> Have a look at the thread
>> http://marc.info/?t=116775837900001&r=1&w=2
>> and especially at
>> http://marc.info/?l=solr-user&m=116782664828926&w=2
>>
>> HTH
>>
>> salu2
>>
>>>
>>>
>>> On 9/25/07, steve.christingmail.com <steve.christingmail.com> wrote:
>>>> Hello,
>>>>
>>>> I have a problem with HTML code that is embedded in an XML file:
>>>>
>>>> Sample source:
>>>>
>>>> <content>
>>>>         <stories>
>>>>                 <div class="storyTitle">
>>>>                          Les débats
>>>>                 </div>
>>>>                 <div class="storyIntroductionText">
>>>>                         Le premier tour des élections fédérales se déroulera le 21
>>>>                         octobre prochain. D'ici là, La 1ère vous propose plusieurs rendez-vous,
>>>>                         dont plusieurs grands débats à l'enseigne de Forums.
>>>>                 </div>
>>>>                 <div class="paragraph">
>>>>                         <div class="paragraphTitle"/>
>>>>                         <div class="paragraphText">
>>>>                                 my para textehere
>>>>                                 <br/>
>>>>                                 <br/>
>>>>                                 Vous trouverez sur cette page toutes les dates et les heures de
>>>>                                 ces différents rendez-vous ainsi que le nom et les partis des
>>>>                                 débatteurs. De plus, vous pourrez également écouter ou réécouter
>>>>                                 l'ensemble de ces émissions.
>>>>                         </div>
>>>>                 </div>
>>>> ....
>>>> ---------
>>>> When I make a query on Solr, I get something like this in the
>>>> source code of the XML result:
>>>>
>>>> <td xmlns="http://www.w3.org/1999/xhtml">
>>>> <span class="markup">&lt;</span>
>>>> <span class="start-tag">div</span>
>>>> <span class="attribute-name">class</span>
>>>> <span class="markup">=</span>
>>>> <span class="attribute-value">"paragraph"</span>
>>>> <span class="markup">&gt;</span><div class="expander-content">
>>>> <div class="indent"><span class="markup">&lt;</span>
>>>> <span class="start-tag">div</span>
>>>> <span class="attribute-name">class</span>
>>>> <span class="markup">=</span>
>>>> <span class="attribute-value">"paragraphTitle"</span>
>>>> <span class="markup">/&gt;</span></div><table><tr>
>>>> <td class="expander">−<div class="spacer"/>
>>>> </td><td><span class="markup">&lt;</span>
>>>> ...
>>>>
>>>> That is not exactly what I want. I want to keep the HTML tags as
>>>> they are, without any formatting.
>>>>
>>>> The br and a tags are well formed in the XML and JSON results,
>>>> but the div tags are not kept.
>>>> ---------
>>>> In the schema.xml I've got this for the html content:
>>>>
>>>> <fieldType name="html" class="solr.TextField" />
>>>>
>>>>   <field name="storyFullText" type="html" indexed="true"
>>>>          stored="true" multiValued="true"/>
>>>>
>>>> ---------
>>>>
>>>> Any help would be appreciated.
>>>>
>>>> Thanks in advance.
>>>>
>>>> S. Christin
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>> -- 
>> Thorsten Scherler                                 thorsten.at.apache.org
>> Open Source Java                      consulting, training and solutions
>>
>


Re: Re: Problem with html code inside xml
2007-10-02 18:20:03
: SimplePostTool: FATAL: Connection error (is Solr running at http://localhost:8983/solr/update ?): java.io.IOException: Server returned HTTP response code: 500 for URL: http://localhost:8983/solr/update
: 
: Is there any way to let "Solr" to be more verbose than that ?

Solr outputs all errors using whatever default error page format your
servlet container uses; it also logs all errors to the servlet container's
logging system.

This specific error indicates that post.jar could not connect to Solr at
all (hence the "FATAL: Connection error" and the hint that perhaps Solr
isn't actually running at the URL post.jar is trying to contact).

If you are using the example Jetty setup that comes with Solr, and you
send a document that triggers a Solr error, post.jar will output something
like this (in this specific case, the problem is that the document
being posted is total gibberish, and not XML at all)...

SimplePostTool: FATAL: Solr returned an error:
ParseError_at_rowcol11_Message_only_whitespace_content_allow
ed_before_start_tag_and_not___javaxxmlstreamXMLStreamExcepti
on_ParseError_at_rowcol11_Message_only_whitespace_content_al
lowed_before_start_tag_and_not___at_combeaxmlstreamMXParserp
arsePrologMXParserjava2044__at_combeaxmlstreamMXParsernextIm
plMXParserjava1947__at_combeaxmlstreamMXParsernextMXParserja
va1333__at_orgapachesolrhandlerXmlUpdateRequestHandlerproces
sUpdateXmlUpdateRequestHandlerjava148__at_orgapachesolrhandl
erXmlUpdateRequestHandlerhandleRequestBodyXmlUpdateRequestHa
ndlerjava123__at_orgapachesolrhandlerRequestHandlerBasehandl
eRequestRequestHandlerBasejava78__at_orgapachesolrcoreSolrCo
reexecuteSolrCorejava807__at_orgapachesolrservletSolrDispatc
hFilterexecuteSolrDispatchFilterjava206__at_orgapachesolrser
vletSolrDispatchFilterdoFilterSolrDispatchFilterjava174__at_
orgmortbayjettyservletServletHandler$CachedChaindoFilterServ
letHandlerjava1089__at_orgmortbayjettyservletServletHandlerh
andleServletHandlerjava365__at_orgmortbayjettysecuritySecuri
tyHandlerhandleSecurityHandlerjava216__at_orgmortbayjettyser
vletSessionHandlerhandleSessionHandlerjava181__at_orgmortbay
jettyhandlerContextHandlerhandleContextHandlerjava712__at_or
gmortbayjettywebappWebAppContexthandleWebAppContextjava405__
at_orgmortbayjettyhandlerContextHandlerCollectionhandleConte
xtHandlerCollectionjava211__at_orgmortbayjettyhandlerHandler
CollectionhandleHandlerCollectionjava114__at_orgmortbayjetty
handlerHandlerWrapperhandleHandlerWrapperjava139__at_orgmort
bayjettyServerhandleServerjava285__at_orgmortbayjettyHttpCon
nectionhandleRequestHttpConnectionjava502__at_orgmortbayjett
yHttpConnection$RequestHandlercontentHttpConnectionjava835__
at_orgmortbayjettyHttpParserparseNextHttpParserjava641__at_o
rgmortbayjettyHttpParserparseAvailableHttpParserjava202__at_
orgmortbayjettyHttpCo
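
Since errors come back as the servlet container's error page, one way to see more than post.jar prints is to send the same file with a small script and dump the response body. The following Python sketch assumes the update URL from the output quoted earlier in this thread; the function name `post_xml` is made up for illustration and is not part of Solr. Because that output mentions an HTTP response code of 500, the response body may carry the underlying message.

```python
import urllib.error
import urllib.request


def post_xml(url, xml_bytes):
    """POST an XML update message; return (status, body) even on HTTP errors."""
    request = urllib.request.Request(
        url,
        data=xml_bytes,
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request) as response:
            return response.status, response.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as error:
        # The server answered with a 4xx/5xx status; its body is the
        # container's error page, which usually names the real problem.
        return error.code, error.read().decode("utf-8", "replace")
```

Usage, assuming Solr is listening on the default example port: read `my_doc.xml` in binary mode, call `post_xml("http://localhost:8983/solr/update", data)`, and print the returned body. If nothing is listening at that address, `urlopen` raises `urllib.error.URLError` instead, which distinguishes a real connection failure from a server-side error. The servlet container's own log remains the other place to look.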



-Hoss
