Benoit,
Are you familiar with the VuFind project (http://www.vufind.org)? You can
look at the PHP code in the import folder to see how the indexing works
(there's an XSL transformation that then updates the index).
I've also written some initial code that uses embedded Solr to do this
indexing directly from MARC format files, including holding the entire
MARCXML record in the index.
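For what it's worth, one way to keep the whole MARCXML record in a stored field is to send it as escaped text rather than as nested XML, since a Solr field holds text and not child elements. A minimal sketch in Python follows. It is not the VuFind import code or the embedded Solr code mentioned above. The field names title, author and originalRecord come from Benoit's schema; subfield and build_solr_add are hypothetical helper names, and SAMPLE is a cut-down copy of his record.

```python
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"

# Cut-down copy of the record from the original question (illustrative only).
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield ind1="1" ind2=" " tag="100">
    <subfield code="a">van Wetten, J. W.</subfield>
  </datafield>
  <datafield ind1="1" ind2="3" tag="245">
    <subfield code="a">De positie van vrouwen in de asielprocedure /</subfield>
  </datafield>
</record>"""


def subfield(record, tag, code):
    """Return the text of the first matching MARC subfield, or None."""
    path = (
        f"{{{MARC_NS}}}datafield[@tag='{tag}']/"
        f"{{{MARC_NS}}}subfield[@code='{code}']"
    )
    node = record.find(path)
    return node.text if node is not None else None


def build_solr_add(marcxml):
    """Build a Solr <add><doc> message carrying the record as escaped text."""
    record = ET.fromstring(marcxml)
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    fields = (
        ("title", subfield(record, "245", "a")),
        ("author", subfield(record, "100", "a")),
        # The raw MARCXML string is assigned as text, so the serializer
        # escapes "<", ">" and "&" instead of nesting elements.
        ("originalRecord", marcxml),
    )
    for name, value in fields:
        if value is not None:
            field = ET.SubElement(doc, "field", name=name)
            field.text = value
    return ET.tostring(add, encoding="unicode")
```

Posting the string returned by build_solr_add(SAMPLE) to the update handler should store the record verbatim. The presentation layer can then parse the stored value of originalRecord as XML again. Wrapping the record in a CDATA section in the XSLT output is another option, as long as the record itself contains no "]]>".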
You can contact me off-list if you have questions...
Wayne
Walter Underwood wrote:
> Solr is not an XML engine (or a MARC engine). It uses XML as an input format
> for fielded data. It does not index or search arbitrary XML. You need to
> convert your XML into Solr's format.
>
> I would recommend expressing MARC in a Solr schema, then working on the
> input XML. The input XML depends on the schema.
>
> If you need an XML engine, I'd recommend MarkLogic (commercial), a very
> good product.
>
> wunder
>
> On 10/5/07 12:44 AM, "PAUWELS Benoit" <Benoit.Pauwels ulb.ac.be> wrote:
>
>> Hi,
>>
>> I wish to index well-formed XML documents as they are.
>>
>> I have a database filled with MARCXML records. An example of these looks like
>> this:
>>
>> <record
>>     ns0:schemaLocation="http://www.loc.gov/MARC21/slim
>>     http://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd"
>>     xmlns="http://www.loc.gov/MARC21/slim"
>>     xmlns:ns0="http://www.w3.org/2001/XMLSchema-instance">
>>   <leader>00000nam 22 a 4500</leader>
>>   <controlfield tag="001">000500000</controlfield>
>>   <controlfield tag="005">20050826220257.0</controlfield>
>>   <controlfield tag="008">000710s1998 xx r 000 0 dut d</controlfield>
>>   <datafield ind1=" " ind2=" " tag="040">
>>     <subfield code="a">Univ</subfield>
>>   </datafield>
>>   <datafield ind1="1" ind2=" " tag="100">
>>     <subfield code="a">van Wetten, J. W.</subfield>
>>   </datafield>
>>   <datafield ind1="1" ind2="3" tag="245">
>>     <subfield code="a">De positie van vrouwen in de asielprocedure /</subfield>
>>     <subfield code="c">J.W. van Wetten, N. Dijkhof, F. Heide.</subfield>
>>   </datafield>
>> </record>
>>
>> The idea is to create Lucene indexes on specific MARC fields and store the
>> complete MARC record in Lucene 'as is'. In the presentation layer of my
>> application I would then have this complete MARC record at hand, and as such
>> have full flexibility on which MARC fields to display. So I want to create the
>> following record through XSLT and feed this to SOLR.
>>
>> <doc>
>>   <field name="title">De positie van vrouwen in de asielprocedure</field>
>>   <field name="author">van Wetten, J. W.</field>
>>   ...
>>   <field name="originalRecord">
>>     <record
>>         ns0:schemaLocation="http://www.loc.gov/MARC21/slim
>>         http://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd"
>>         xmlns="http://www.loc.gov/MARC21/slim"
>>         xmlns:ns0="http://www.w3.org/2001/XMLSchema-instance">
>>       <leader>00000nam 22 a 4500</leader>
>>       <controlfield tag="001">000500000</controlfield>
>>       <controlfield tag="005">20050826220257.0</controlfield>
>>       <controlfield tag="008">000710s1998 xx r 000 0 dut d</controlfield>
>>       <datafield ind1=" " ind2=" " tag="040">
>>         <subfield code="a">UGent</subfield>
>>       </datafield>
>>       <datafield ind1="1" ind2=" " tag="100">
>>         <subfield code="a">van Wetten, J. W.</subfield>
>>       </datafield>
>>       <datafield ind1="1" ind2="3" tag="245">
>>         <subfield code="a">De positie van vrouwen in de asielprocedure /</subfield>
>>         <subfield code="c">J.W. van Wetten, N. Dijkhof, F. Heide.</subfield>
>>       </datafield>
>>     </record>
>>   </field>
>> </doc>
>>
>> I have the following in my schema.xml:
>>
>> <field name="author" type="text" indexed="true" stored="true"
>>        termVectors="true"/>
>> <field name="title" type="text" indexed="true" stored="true"
>>        termVectors="true"/>
>> <field name="originalRecord" type="text" indexed="false" stored="true"/>
>>
>> SOLR has of course a problem with the XML in the 'originalRecord' field.
>>
>> Is there a solution to this? Has anyone done this before?
>>
>> Thanks a lot.
>>
>> Benoit.
>>
>> =============================
>> PAUWELS Benoit
>> Université Libre de Bruxelles - Libraries
>> Head of Automation
>> Av. F.D. Roosevelt 50, CP 180
>> 1050 BRUSSELS
>> Belgium
>> Tel: + 32 2 650 23 91
>> Fax: + 32 2 650 23 91
>> =============================
>
--
/**
* Wayne Graham
* Earl Gregg Swem Library
* PO Box 8794
* Williamsburg, VA 23188
* 757.221.3112
* http://swem.wm.edu/blogs/waynegraham/
*/