Thread: Indexing XML




Indexing XML
user name
2007-10-05 02:44:13
Hi,

 

I wish to index well-formed XML documents as they are.

I have a database filled with MARCXML records. An example of
these looks like this:

 

        <record
            ns0:schemaLocation="http://www.loc.gov/MARC21/slim http://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd"
            xmlns="http://www.loc.gov/MARC21/slim"
            xmlns:ns0="http://www.w3.org/2001/XMLSchema-instance">
            <leader>00000nam  22      a 4500</leader>
            <controlfield tag="001">000500000</controlfield>
            <controlfield tag="005">20050826220257.0</controlfield>
            <controlfield tag="008">000710s1998    xx      r     000 0 dut d</controlfield>
            <datafield ind1=" " ind2=" " tag="040">
                <subfield code="a">Univ</subfield>
            </datafield>
            <datafield ind1="1" ind2=" " tag="100">
                <subfield code="a">van Wetten, J. W.</subfield>
            </datafield>
            <datafield ind1="1" ind2="3" tag="245">
                <subfield code="a">De positie van vrouwen in de asielprocedure /</subfield>
                <subfield code="c">J.W. van Wetten, N. Dijkhof, F. Heide.</subfield>
            </datafield>
        </record>

 

The idea is to create Lucene indexes on specific MARC fields
and store the complete MARC record in Lucene 'as is'. In the
presentation layer of my application I would then have this
complete MARC record at hand, and as such have full
flexibility on which MARC fields to display. So I want to
create the following record through XSLT and feed this to
SOLR. 

 

<doc>
<field name="title">De positie van vrouwen in de asielprocedure</field>
<field name="author">van Wetten, J. W.</field>
...
<field name="originalRecord">
  <record
      ns0:schemaLocation="http://www.loc.gov/MARC21/slim http://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd"
      xmlns="http://www.loc.gov/MARC21/slim"
      xmlns:ns0="http://www.w3.org/2001/XMLSchema-instance">
      <leader>00000nam  22      a 4500</leader>
      <controlfield tag="001">000500000</controlfield>
      <controlfield tag="005">20050826220257.0</controlfield>
      <controlfield tag="008">000710s1998    xx      r     000 0 dut d</controlfield>
      <datafield ind1=" " ind2=" " tag="040">
          <subfield code="a">UGent</subfield>
      </datafield>
      <datafield ind1="1" ind2=" " tag="100">
          <subfield code="a">van Wetten, J. W.</subfield>
      </datafield>
      <datafield ind1="1" ind2="3" tag="245">
          <subfield code="a">De positie van vrouwen in de asielprocedure /</subfield>
          <subfield code="c">J.W. van Wetten, N. Dijkhof, F. Heide.</subfield>
      </datafield>
  </record>
</field>
</doc>

 

I have the following in my schema.xml:

 

<field name="author" type="text" indexed="true" stored="true" termVectors="true"/>
<field name="title" type="text" indexed="true" stored="true" termVectors="true"/>
<field name="originalRecord" type="text" indexed="false" stored="true"/>

 

 

SOLR has of course a problem with the XML in the
'originalRecord' field. 

Is there a solution to this? Has anyone done this before? 

 

Thanks a lot.

Benoit.

 

 

=============================

PAUWELS Benoit

Université Libre de Bruxelles - Libraries

Head of Automation

Av. F.D. Roosevelt 50, CP 180

1050 BRUSSELS

Belgium

Tel: + 32 2 650 23 91

Fax: + 32 2 650 23 91

=============================

 

 

Re: Indexing XML
user name
2007-10-05 05:42:13
> SOLR has of course a problem with the XML in the 'originalRecord' field.
> Is there a solution to this? Has anyone done this before?


I would suggest changing the field type of "originalRecord" to "string"
rather than "text", and if you're still having trouble with the XML data,
simply encapsulate the data in a CDATA section:

<field name="originalRecord"><![CDATA[ ... ]]></field>
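
An alternative to CDATA is to escape the record so that it reaches Solr as ordinary character data. The following is only a minimal sketch in Python, not something used in this thread: the helper name build_add_message and the shortened record are illustrative assumptions, and it relies on the standard library's ElementTree escaping markup characters in text nodes when it serializes. Unlike a CDATA section, escaped text also survives a value that happens to contain "]]>".

```python
import xml.etree.ElementTree as ET

# Shortened MARCXML record, taken from the example earlier in this thread.
MARC_RECORD = (
    '<record xmlns="http://www.loc.gov/MARC21/slim">'
    '<controlfield tag="001">000500000</controlfield>'
    '<datafield ind1="1" ind2=" " tag="100">'
    '<subfield code="a">van Wetten, J. W.</subfield>'
    '</datafield>'
    '</record>'
)


def build_add_message(fields):
    """Return an <add><doc>...</doc></add> update message as a string.

    `fields` is a list of (field name, value) pairs. Values are set as
    text nodes, so ElementTree escapes '<', '>' and '&' on serialization.
    (Hypothetical helper, for illustration only.)
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields:
        field = ET.SubElement(doc, "field", {"name": name})
        field.text = value
    return ET.tostring(add, encoding="unicode")


message = build_add_message([
    ("title", "De positie van vrouwen in de asielprocedure"),
    ("author", "van Wetten, J. W."),
    ("originalRecord", MARC_RECORD),
])

# An XML parser reading the message gets the record back as plain text.
stored = ET.fromstring(message).find("doc/field[@name='originalRecord']").text
print(stored == MARC_RECORD)  # True
```

The presentation layer would then parse the stored string itself whenever it needs individual MARC fields.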

cheers,
Piete
Re: Indexing XML
user name
United States
2007-10-05 08:21:12
Hello Benoit,

An additional thing to check out is the work being done on fac-back-opac.
They have a parser that will parse native MARC records.

I would assume that if you can extract your records in MARC XML you can
extract them in native MARC.

I've used the parser and it works well.

al

-- 
Alan Rykhus
PALS, A Program of the Minnesota State Colleges and
Universities 
(507)389-1975
alan.rykhusmnsu.edu



Re: Indexing XML
user name
2007-10-05 09:22:17
Solr is not an XML engine (or a MARC engine). It uses XML as an input format
for fielded data. It does not index or search arbitrary XML. You need to
convert your XML into Solr's format.

I would recommend expressing MARC in a Solr schema, then working on the
input XML. The input XML depends on the schema.
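
As a rough illustration of what "expressing MARC in a Solr schema" can mean on the input side, here is a sketch that flattens a MARCXML record into named fields. It assumes Python's standard xml.etree.ElementTree rather than the XSLT route from the original post; FIELD_MAP and marc_to_fields are hypothetical names, and the only mappings shown (100$a to author, 245$a to title) are the ones implied by the example record.

```python
import xml.etree.ElementTree as ET

MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Hypothetical mapping: Solr field name -> (MARC tag, subfield code).
FIELD_MAP = {
    "author": ("100", "a"),
    "title": ("245", "a"),
}

# Part of the example record from this thread.
MARC_RECORD = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield ind1="1" ind2=" " tag="100">
    <subfield code="a">van Wetten, J. W.</subfield>
  </datafield>
  <datafield ind1="1" ind2="3" tag="245">
    <subfield code="a">De positie van vrouwen in de asielprocedure /</subfield>
    <subfield code="c">J.W. van Wetten, N. Dijkhof, F. Heide.</subfield>
  </datafield>
</record>"""


def marc_to_fields(marc_xml):
    """Extract the mapped subfields of one MARCXML record into a flat dict."""
    record = ET.fromstring(marc_xml)
    fields = {}
    for solr_name, (tag, code) in FIELD_MAP.items():
        path = f"marc:datafield[@tag='{tag}']/marc:subfield[@code='{code}']"
        node = record.find(path, MARC_NS)
        if node is not None and node.text:
            # Drop trailing ISBD punctuation such as " /" after the title.
            fields[solr_name] = node.text.rstrip(" /")
    return fields


print(marc_to_fields(MARC_RECORD))
```

Each key in the resulting dict corresponds to one field declared in schema.xml, so the mapping and the schema have to be kept in step.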

If you need an XML engine, I'd recommend MarkLogic (commercial), a very
good product.

wunder
