Thread: searching repeated and untokenized fields




searching repeated and untokenized fields
user name
2006-05-01 02:44:36
I have a couple of questions regarding indexing and searching a
document that has repeated values for the same field (specifically,
the authors of a document, in this case):

Firstly, I'm adding the repeated field with this code:

for creator in creators:
    doc.add(Field('creator', creator, Field.Store.YES,
                  Field.Index.UN_TOKENIZED))

but can't find a way to read those fields back out from the index.
If I use

for author in hits[i]["creator"]:
    print author

then just the first "creator" entry is returned for that document and
gets split into a list of individual letters - in other words,
hits[i]["creator"] is a string and not a list.
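The letter-by-letter behaviour follows from how Python iterates over a
string, and can be reproduced without an index. The sketch below uses
stand-in values only ("Smith A" is a made-up second author, not from the
original post); it shows the difference between looping over one string
and looping over a list of strings.

```python
# Stand-in values; in the thread these would come from the index.
single_value = "Doe J"             # one string, as hits[i]["creator"] returned
all_values = ["Doe J", "Smith A"]  # a list of names, as the poster expected

# Iterating over a string yields its individual characters.
letters = [ch for ch in single_value]

# Iterating over a list of strings yields whole names.
names = [name for name in all_values]

print(letters)
print(names)
```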


Secondly, it doesn't seem to be possible (in PyLucene 1.9.1) to
search an untokenized field using a term that contains spaces. For a
document that has a creator "Doe J", the query
creator:"Doe J"
doesn't return any results, and
creator:Doe J
doesn't match what it needs to.


Has anyone found solutions to these problems already? For the second
I could just replace spaces with underscores during the indexing, but
that wouldn't be the ideal solution.
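For illustration, the underscore workaround mentioned above might look
roughly like this. normalize_creator is a hypothetical helper name, not
part of PyLucene; the assumption is that the same function is applied to
the value at indexing time and to the search term at query time, so the
two always agree.

```python
def normalize_creator(name):
    """Replace spaces with underscores so the value is one space-free term."""
    return name.replace(" ", "_")


# Apply the same normalization on both sides so the terms match.
indexed_value = normalize_creator("Doe J")   # when adding the field
query_value = normalize_creator("Doe J")     # when building the query

print(indexed_value)
```

The drawback is that the stored value no longer matches the original
author name unless it is converted back for display.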

alf.

_______________________________________________
pylucene-dev mailing list
pylucene-dev@osafoundation.org
http://lists.osafoundation.org/mailman/listinfo/pylucene-dev
searching repeated and untokenized fields
user name
2006-05-01 06:53:50
On Sun, 30 Apr 2006, Alf Eaton wrote:

> I have a couple of questions regarding indexing and searching a
> document that has repeated values for the same field (specifically,
> the authors of a document, in this case):
>
> Firstly, I'm adding the repeated field with this code:
>
> for creator in creators:
>     doc.add(Field('creator', creator, Field.Store.YES,
>                   Field.Index.UN_TOKENIZED))
>
> but can't find a way to read those fields back out from the index.
> If I use
>
> for author in hits[i]["creator"]:
>     print author

I'm not sure I understand what you're trying to do in the code above.
In PyLucene 1.9.1, the way to iterate hits is:

   for i, doc in hits:
       print doc['creator']

If there is more than one field called 'creator', then you might want
to try:

   for i, doc in hits:
       for creator in doc.getFields('creator'):
           print creator

In PyLucene 2.0rc1, you can also say:

   for hit in hits:
       for creator in hit.getDocument().getFields('creator'):
           print creator

If this doesn't work, please send in code that illustrates the problem
(that would help in understanding and fixing the potential bug(s)).

> Secondly, it doesn't seem to be possible (in PyLucene 1.9.1) to
> search an untokenized field using a term that contains spaces. For a
> document that has a creator "Doe J", the query
> creator:"Doe J"
> doesn't return any results, and
> creator:Doe J
> doesn't match what it needs to.

Again, please send in code that reproduces the problem. If you can make
sure that what you're trying to do works in Java Lucene, that's a plus.

Ideally, your sample code would be organized as unit tests.

Thanks !

Andi..
searching repeated and untokenized fields
user name
2006-05-01 13:20:13
On 01 May 2006, at 02:53, Andi Vajda wrote:
>
> On Sun, 30 Apr 2006, Alf Eaton wrote:
>
>> I have a couple of questions regarding indexing and searching a
>> document that has repeated values for the same field
>> (specifically, the authors of a document, in this case):
>>
>> Firstly, I'm adding the repeated field with this code:
>>
>> for creator in creators:
>>     doc.add(Field('creator', creator, Field.Store.YES,
>>                   Field.Index.UN_TOKENIZED))
>>
>> but can't find a way to read those fields back out from the index.
>> If I use
>>
>> for author in hits[i]["creator"]:
>>     print author
>
> I'm not sure I understand what you're trying to do in the code above.
> In PyLucene 1.9.1, the way to iterate hits is:
>
>   for i, doc in hits:
>       print doc['creator']
>
> If there is more than one field called 'creator', then you might
> want to try:
>
>   for i, doc in hits:
>       for creator in doc.getFields('creator'):
>           print creator

Great, that was almost it:

for i, doc in hits:
    for a in doc.getFields('creator'):
        author = a.stringValue()
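As a rough sketch, the loop above can be wrapped in a small helper that
returns every value of a repeated field as a list. field_values,
FakeField and FakeDocument are hypothetical names: the two classes are
stand-ins that only mimic the getFields() and stringValue() calls used
in this thread, so the example runs without PyLucene installed. It
assumes getFields() returns something iterable, which may not hold for
a field name that is absent from a real document.

```python
def field_values(doc, name):
    """Return the string value of every field called `name` in `doc`."""
    return [field.stringValue() for field in doc.getFields(name)]


# Hypothetical stand-ins for the PyLucene classes (illustration only).
class FakeField(object):
    def __init__(self, name, value):
        self._name = name
        self._value = value

    def name(self):
        return self._name

    def stringValue(self):
        return self._value


class FakeDocument(object):
    def __init__(self):
        self._fields = []

    def add(self, field):
        self._fields.append(field)

    def getFields(self, name):
        # Return all fields with a matching name, in insertion order.
        return [f for f in self._fields if f.name() == name]


doc = FakeDocument()
for creator in ("Doe J", "Smith A"):   # made-up author names
    doc.add(FakeField("creator", creator))

authors = field_values(doc, "creator")
print(authors)
```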

I'll work on a proper example of my other problem.

alf.
searching repeated and untokenized fields
user name
2006-05-02 01:42:12
On 01 May 2006, at 02:53, Andi Vajda wrote:

>> Secondly, it doesn't seem to be possible (in PyLucene 1.9.1) to
>> search an untokenized field using a term that contains spaces. For
>> a document that has a creator "Doe J", the query
>> creator:"Doe J"
>> doesn't return any results, and
>> creator:Doe J
>> doesn't match what it needs to.
>
> Again, please send in code that reproduces the problem. If you can
> make sure that what you're trying to do work in Java Lucene, that's
> a plus.
> Ideally, your sample code would be organized as unit tests.

Good idea to do the tests: I realised that StandardAnalyzer was
converting the search terms to lowercase when used in QueryParser,
but not when adding untokenized fields to the document using
IndexWriter, so the two weren't matching. Fixed now, thanks (and it's
presumably not a PyLucene problem).

alf.

--------

#!/usr/bin/env python

from PyLucene import *

# Create a fresh index in the "test" directory
filestore = FSDirectory.getDirectory("test", True)
analyzer = StandardAnalyzer()
filewriter = IndexWriter(filestore, analyzer, True)

doc = Document()

# The same author, written with a space and with an underscore,
# indexed both untokenized and tokenized
doc.add(Field('author-space', "Doe J",
              Field.Store.YES, Field.Index.UN_TOKENIZED))
doc.add(Field('author-space-tok', "Doe J",
              Field.Store.YES, Field.Index.TOKENIZED))
doc.add(Field('author-underscore', "Doe_J",
              Field.Store.YES, Field.Index.UN_TOKENIZED))
doc.add(Field('author-underscore-tok', "Doe_J",
              Field.Store.YES, Field.Index.TOKENIZED))

filewriter.addDocument(doc)
filewriter.close()

searcher = IndexSearcher("test")

# Run each query string against each of the four fields
for q in ("Doe J", "Doe_J"):
    for f in ("author-space", "author-space-tok",
              "author-underscore", "author-underscore-tok"):
        # only works for tokenized fields:
        #query = QueryParser.parse(q, f, analyzer)
        # only works for untokenized fields:
        query = TermQuery(Term(f, q))
        hits = searcher.search(query)
        print "\nQ: %s\nQuery: %s\n" % (q, query)
        for i, doc in hits:
            print "Result: %s\n" % doc[f]
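The mismatch described in this message can be modelled without PyLucene.
The sketch below is a deliberately simplified toy, not how Lucene works
internally: the three functions are hypothetical, and the "analyzer"
here only splits on whitespace and lowercases, which is an assumption
standing in for what StandardAnalyzer does to query text. It shows why
an exact, unanalyzed term finds the untokenized value while analyzed
query terms do not.

```python
def index_untokenized(value):
    """Toy untokenized field: the whole value is one term, kept verbatim."""
    return {value}


def index_tokenized(value):
    """Toy tokenized field: split on whitespace, lowercase each token."""
    return {token.lower() for token in value.split()}


def analyze_query(text):
    """Toy lowercasing analyzer applied to the query text."""
    return [token.lower() for token in text.split()]


untokenized_terms = index_untokenized("Doe J")
tokenized_terms = index_tokenized("Doe J")

# An exact term (like TermQuery) matches the verbatim untokenized value.
exact_term_hit = "Doe J" in untokenized_terms

# Analyzed query terms ("doe", "j") are not in the untokenized field...
analyzed_hits_untokenized = all(
    term in untokenized_terms for term in analyze_query("Doe J"))

# ...but they are all present in the tokenized field.
analyzed_hits_tokenized = all(
    term in tokenized_terms for term in analyze_query("Doe J"))

print(exact_term_hit, analyzed_hits_untokenized, analyzed_hits_tokenized)
```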