Thread: searching repeated and untokenized fields
2006-05-01 02:44:36
I have a couple of questions regarding indexing and searching a
document that has repeated values for the same field (specifically,
the authors of a document, in this case):

Firstly, I'm adding the repeated field with this code:

    for creator in creators:
        doc.add(Field('creator', creator, Field.Store.YES,
                      Field.Index.UN_TOKENIZED))

but can't find a way to read those fields back out from the index. If
I use

    for author in hits[i]["creator"]:
        print author

then just the first "creator" entry is returned for that document and
gets split into a list of individual letters - in other words,
hits[i]["creator"] is a string and not a list.

Secondly, it doesn't seem to be possible (in PyLucene 1.9.1) to
search an untokenized field using a term that contains spaces. For a
document that has a creator "Doe J", the query

    creator:"Doe J"

doesn't return any results, and

    creator:Doe J

doesn't match what it needs to.

Has anyone found solutions to these problems already? For the second I
could just replace spaces with underscores during the indexing, but
that wouldn't be the ideal solution.

alf.
_______________________________________________
pylucene-dev mailing list
pylucene-dev osafoundation.org
http://lists.osafoundation.org/mailman/listinfo/pylucene-dev

2006-05-01 06:53:50
On Sun, 30 Apr 2006, Alf Eaton wrote:
> I have a couple of questions regarding indexing and searching a document that
> has repeated values for the same field (specifically, the authors of a
> document, in this case):
>
> Firstly, I'm adding the repeated field with this code:
>
>     for creator in creators:
>         doc.add(Field('creator', creator, Field.Store.YES,
>                       Field.Index.UN_TOKENIZED))
>
> but can't find a way to read those fields back out from the index. If I use
>
>     for author in hits[i]["creator"]:
>         print author

I'm not sure I understand what you're trying to do in the code above.
In PyLucene 1.9.1, the way to iterate hits is:

    for i, doc in hits:
        print doc['creator']

If there is more than one field called 'creator', then you might want to try:

    for i, doc in hits:
        for creator in doc.getFields('creator'):
            print creator

In PyLucene 2.0rc1, you can also say:

    for hit in hits:
        for creator in hit.getDocument().getFields('creator'):
            print creator

If this doesn't work, please send in code that illustrates the problem (that
would help in understanding and fixing the potential bug(s)).

> Secondly, it doesn't seem to be possible (in PyLucene 1.9.1) to search an
> untokenized field using a term that contains spaces. For a document that has
> a creator "Doe J", the query
>     creator:"Doe J"
> doesn't return any results, and
>     creator:Doe J
> doesn't match what it needs to.

Again, please send in code that reproduces the problem. If you can make sure
that what you're trying to do works in Java Lucene, that's a plus.
Ideally, your sample code would be organized as unit tests.

Thanks!

Andi..

2006-05-01 13:20:13
On 01 May 2006, at 02:53, Andi Vajda wrote:
>
> On Sun, 30 Apr 2006, Alf Eaton wrote:
>
>> I have a couple of questions regarding indexing and searching a
>> document that has repeated values for the same field
>> (specifically, the authors of a document, in this case):
>>
>> Firstly, I'm adding the repeated field with this code:
>>
>>     for creator in creators:
>>         doc.add(Field('creator', creator, Field.Store.YES,
>>                       Field.Index.UN_TOKENIZED))
>>
>> but can't find a way to read those fields back out from the index.
>> If I use
>>
>>     for author in hits[i]["creator"]:
>>         print author
>
> I'm not sure I understand what you're trying to do in the code above.
> In PyLucene 1.9.1, the way to iterate hits is:
>
>     for i, doc in hits:
>         print doc['creator']
>
> If there is more than one field called 'creator', then you might
> want to try:
>
>     for i, doc in hits:
>         for creator in doc.getFields('creator'):
>             print creator

Great, that was almost it:

    for i, doc in hits:
        for a in doc.getFields('creator'):
            author = a.stringValue()

I'll work on a proper example of my other problem.

alf.
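
For anyone reading this without an index to hand, the behaviour worked out in this exchange can be sketched in plain Python. This is only a toy model and does not use PyLucene: ToyField and ToyDocument are hypothetical stand-ins, and only the method names getFields() and stringValue() are taken from the messages above. It shows why indexing the document by field name gives a single string (which then iterates letter by letter), while getFields() gives one field object per repeated value.

```python
# Toy stand-ins for illustration only; these are NOT the PyLucene classes.
class ToyField(object):
    def __init__(self, name, value):
        self.name = name
        self.value = value

    def stringValue(self):
        # Return the stored text of this one field instance.
        return self.value


class ToyDocument(object):
    def __init__(self):
        self.fields = []

    def add(self, field):
        # Repeated names are allowed; each add() keeps a separate field.
        self.fields.append(field)

    def __getitem__(self, name):
        # Assumption, modelled on the behaviour reported in the thread:
        # doc[name] returns only the first stored value, as a string.
        for field in self.fields:
            if field.name == name:
                return field.stringValue()
        return None

    def getFields(self, name):
        # One field object per repeated value, in insertion order.
        return [f for f in self.fields if f.name == name]


doc = ToyDocument()
for creator in ["Doe J", "Smith A"]:
    doc.add(ToyField("creator", creator))

first_only = doc["creator"]    # a single string
letters = list(first_only)     # iterating a string yields its characters
authors = [f.stringValue() for f in doc.getFields("creator")]

print(first_only)
print(letters)
print(authors)
```

The last line is the pattern from the message above: iterate the field objects and call stringValue() on each one rather than iterating the string that doc['creator'] returns.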

2006-05-02 01:42:12
On 01 May 2006, at 02:53, Andi Vajda wrote:
>> Secondly, it doesn't seem to be possible (in PyLucene 1.9.1) to
>> search an untokenized field using a term that contains spaces. For
>> a document that has a creator "Doe J", the query
>>
>>     creator:"Doe J"
>>
>> doesn't return any results, and
>>
>>     creator:Doe J
>>
>> doesn't match what it needs to.
>
> Again, please send in code that reproduces the problem. If you can
> make sure that what you're trying to do works in Java Lucene, that's
> a plus.
> Ideally, your sample code would be organized as unit tests.

Good idea to do the tests: I realised that StandardAnalyzer was
converting the search terms to lowercase when used in QueryParser,
but not when adding untokenized fields to the document using
IndexWriter, so the two weren't matching. Fixed now, thanks (and it's
presumably not a PyLucene problem).

alf.
--------
#!/usr/bin/env python

from PyLucene import *

# Create a fresh index in the "test" directory.
filestore = FSDirectory.getDirectory("test", True)
analyzer = StandardAnalyzer()
filewriter = IndexWriter(filestore, analyzer, True)

# Index the same author with and without a space, tokenized and untokenized.
doc = Document()
doc.add(Field('author-space', "Doe J", Field.Store.YES,
              Field.Index.UN_TOKENIZED))
doc.add(Field('author-space-tok', "Doe J", Field.Store.YES,
              Field.Index.TOKENIZED))
doc.add(Field('author-underscore', "Doe_J", Field.Store.YES,
              Field.Index.UN_TOKENIZED))
doc.add(Field('author-underscore-tok', "Doe_J", Field.Store.YES,
              Field.Index.TOKENIZED))
filewriter.addDocument(doc)
filewriter.close()

searcher = IndexSearcher("test")

# Try each query string against each of the four fields.
for q in ("Doe J", "Doe_J"):
    for f in ("author-space", "author-space-tok",
              "author-underscore", "author-underscore-tok"):
        #query = QueryParser.parse(q, f, analyzer) # only works for tokenized fields
        query = TermQuery(Term(f, q)) # only works for untokenized fields
        hits = searcher.search(query)
        print "\nQ: %s\nQuery: %s\n" % (q, query)
        for i, doc in hits:
            print "Result: %s\n" % doc[f]
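
The mismatch described in this message can also be reproduced without Lucene. What follows is a rough sketch in plain Python, not the PyLucene API: toy_analyze, add_untokenized, add_tokenized, term_query and parsed_query are all hypothetical names, and the "analyzer" only splits on whitespace and lowercases, which is a deliberate simplification of what StandardAnalyzer does. It assumes, as reported above, that untokenized values are stored verbatim while parsed queries are analyzed first.

```python
# Toy inverted index for illustration only; none of this is Lucene code.
index = {}  # maps (field, term) -> set of document ids


def toy_analyze(text):
    # Simplified analyzer (assumption): whitespace split plus lowercasing.
    return [token.lower() for token in text.split()]


def add_untokenized(doc_id, field, value):
    # The whole value is stored as one term, case and spaces intact.
    index.setdefault((field, value), set()).add(doc_id)


def add_tokenized(doc_id, field, value):
    # Each analyzed token becomes its own term.
    for token in toy_analyze(value):
        index.setdefault((field, token), set()).add(doc_id)


def term_query(field, text):
    # Exact term lookup: the query text is not analyzed at all.
    return index.get((field, text), set())


def parsed_query(field, text):
    # Analyze the query text, then require every token to match.
    tokens = toy_analyze(text)
    if not tokens:
        return set()
    return set.intersection(*[index.get((field, t), set()) for t in tokens])


add_untokenized(1, "author-space", "Doe J")
add_tokenized(1, "author-space-tok", "Doe J")

exact_untok = term_query("author-space", "Doe J")         # matches
parsed_untok = parsed_query("author-space", "Doe J")      # no match
exact_tok = term_query("author-space-tok", "Doe J")       # no match
parsed_tok = parsed_query("author-space-tok", "Doe J")    # matches

print(exact_untok, parsed_untok, exact_tok, parsed_tok)
```

In this model the analyzed query looks for the terms "doe" and "j", which never appear in the untokenized field because it holds the single term "Doe J". That is the same shape as the two comments in the script: the parsed query only finds the tokenized field, and the exact term query only finds the untokenized one.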