Thread: Fwd: NO_NORMS and TOKENIZED?




Fwd: NO_NORMS and TOKENIZED?
2007-02-16 19:42:49
A recent e-mail from Mr. KinoSearch to java-user has a quote that I
wanted to point out here:

Begin forwarded message:
> KS 0.20 doesn't even have Document or Field classes. They've
> been eliminated, and native Perl hashes are now used to transport
> document data.

I think we could simplify (wow, even at this early stage) the solrb
code a bit by simply representing a document as a Hash. For
multiValued data, the values would be arrays. Do we really need any
other semantics at the solrb level, or does a Hash convey it all?
Just thinking out loud here, so feel free to ignore me.
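
To make the idea concrete, here is a minimal sketch of a Hash document
being turned into Solr's <add> update XML, with array values emitted as
repeated fields. The names xml_escape and doc_to_add_xml are hypothetical
helpers invented for illustration; they are not part of solrb.

```ruby
# Hypothetical helpers (not solrb API): serialize a Hash document
# into Solr's <add><doc>...</doc></add> update message.
XML_ESCAPES = { '&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;' }.freeze

# Escape the characters that are unsafe in XML text and attribute values.
def xml_escape(text)
  text.to_s.gsub(/[&<>"]/, XML_ESCAPES)
end

# Each key becomes a field name; an Array value yields one <field>
# element per entry, which is how a multiValued field is sent.
def doc_to_add_xml(doc)
  fields = doc.flat_map do |name, value|
    Array(value).map do |v|
      %(<field name="#{xml_escape(name)}">#{xml_escape(v)}</field>)
    end
  end
  "<add><doc>#{fields.join}</doc></add>"
end

doc = { :id => '500', :name => 'ruby test doc', :cat => ['search', 'ruby'] }
puts doc_to_add_xml(doc)
```

The point of the sketch is only that a plain Hash, with arrays for
multiValued data, carries enough information to build the update
message without a Document or Field class.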

Marvin makes some other great points about fixed schemas, which I
believe map to the schema.xml facility of Solr. I am interested in
exploring how field-name mapping, along with client knowledge of
Solr's schema.xml structure, can make for an elegant API.

	Erik

p.s. I've queued up several ruby-dev e-mails to respond to over the
next several days. I'm happily swamped and eager to keep the
momentum of solrb and Flare going strong.

Re: NO_NORMS and TOKENIZED?
2007-02-16 19:57:06
On 2/16/07, Erik Hatcher <erikehatchersolutions.com> wrote:
> A recent e-mail from Mr. KinoSearch to java-user has a quote that I
> wanted to point out here:
>
> Begin forwarded message:
> > KS 0.20 doesn't even have Document or Field classes. They've
> > been eliminated, and native Perl hashes are now used to transport
> > document data.
>
> I think we could simplify (wow, even at this early stage) the solrb
> code a bit by simply representing a document as a Hash. For
> multiValued data, the values would be arrays.

That's how it works with the simple python client... I like to use
natives when possible.

Example usage from solr.py:
from solr import *

c = SolrConnection(host='localhost:8983', persistent=True)
c.add(id='500',name='python test doc')
c.delete('123')
c.commit()
print c.search(q='id:[* TO *]', wt='python', rows='10', indent='on')

The separate params to add is just syntactic sugar for a map (I
believe ruby has the same thing). Adding multiple documents is done
as an array-of-map.
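
On the Ruby side, the same calling convention might look like the
sketch below: a brace-less hash for one document, and an array of
hashes for several. SketchConnection is a hypothetical stand-in that
only collects documents; it is not the solrb connection class and it
does not talk to a server.

```ruby
# Hypothetical stand-in for a client connection (not solrb API).
# It only queues documents so the calling convention can be shown.
class SketchConnection
  attr_reader :pending

  def initialize
    @pending = []
  end

  # Accepts a single Hash (one document) or an Array of Hashes.
  def add(docs)
    docs = [docs] if docs.is_a?(Hash)
    @pending.concat(docs)
    self
  end
end

c = SketchConnection.new
c.add(:id => '500', :name => 'ruby test doc')   # braces omitted: still one Hash
c.add([{ :id => '501' }, { :id => '502' }])     # array-of-map for several docs
puts c.pending.size
```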

> Do we really need any
> other semantics at the solrb level, or does a Hash convey it all?

boosts?
Perhaps you could still use a hash, but the value could optionally be
a boosted value... is it possible to annotate any value in Ruby, or is
a separate BoostedValue class needed?
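
One way the wrapper option could look is sketched below. BoostedValue
and field_xml are hypothetical names, and the sketch assumes the
field-level boost attribute of Solr's XML update format. Ruby can
attach instance variables or singleton methods to most objects, but
not to immediates such as small integers or symbols, which is one
reason a small wrapper is used here instead of annotating the value.

```ruby
# Hypothetical wrapper (not solrb API): a value plus an index-time boost.
BoostedValue = Struct.new(:value, :boost)

# Render one field, adding a boost attribute only for wrapped values.
# XML escaping is omitted to keep the sketch short.
def field_xml(name, value)
  if value.is_a?(BoostedValue)
    %(<field name="#{name}" boost="#{value.boost}">#{value.value}</field>)
  else
    %(<field name="#{name}">#{value}</field>)
  end
end

doc = { :id => '500', :title => BoostedValue.new('Lucene in Action', 2.0) }
doc.each { |name, value| puts field_xml(name, value) }
```

Plain values stay plain, so the common case of a Hash of strings is
unaffected by the wrapper.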

Might want to keep in mind updateable documents for the future. Not
sure how you would want them represented, but it looks like there will
be a separate param (or parameters) telling solr how to update
different fields (append, overwrite, increment, remove, etc).

-Yonik

Re: NO_NORMS and TOKENIZED?
2007-02-16 19:59:39
On Feb 16, 2007, at 5:42 PM, Erik Hatcher wrote:

> I think we could simplify (wow, even at this early stage) the solrb
> code a bit by simply representing a document as a Hash.

FWIW, "documents as hashes" was something that Dave Balmain and I
discussed at length on the Ferret list and found ourselves in
complete agreement about.

> Marvin makes some other great points about fixed schemas, which
> maps to the schema.xml facility of Solr I believe.

You are kind. The funny thing is, I know very little about Solr.
(I'm an accidental legacy subscriber to this list.) I didn't even
realize it had a "schema" package until I was preparing to post and
went researching.

The KinoSearch::Schema class was actually inspired by the raft of
object-relational mappers on CPAN: Class::DBI (by Tony Bowden, a
primary developer of Plucene), DBIx::Class, etc. I thought of it as
an ORM without the underlying SQL table definition.

If there is convergence between KS and Solr on this issue, though, I
wouldn't be surprised.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



RE: NO_NORMS and TOKENIZED?
2007-02-16 21:54:52
I guess I'm in the minority here. I tend to use custom classes because I
can tweak the API to make it easier for people writing to the API. I'll
duck type all over the place to make things act like what someone might
expect, but I don't like forcing people to create raw data structures
when there are semantics that can make things easier.

In the case at hand, I was going to say that making a doc a hash makes
adding fields sequentially difficult. But ruby's nice enough that it
simply makes it a little ugly:
	(doc[:contents] ||= []) << data

I like being able to write
	doc[:contents] = data
where I know there's no other data for that field, but I also like being
able to write
	doc << { :contents => data }
A doc can certainly act like a hash in some cases, but it's more than a
hash.
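
A minimal sketch of such a "more than a hash" document follows,
supporting both assignment and sequential appends. SketchDoc is a
hypothetical class written for illustration; it is not taken from
solrb or from the jruby interface mentioned in this message.

```ruby
# Hypothetical document class (illustration only): hash-like access
# plus << for appending values to fields one at a time.
class SketchDoc
  def initialize
    @fields = {}
  end

  # doc[:contents] = data replaces whatever the field held.
  def []=(name, value)
    @fields[name] = Array(value).dup
  end

  # Returns the field's values as an Array, or nil if the field is unset.
  def [](name)
    @fields[name]
  end

  # doc << { :contents => data } appends to each named field.
  def <<(hash)
    hash.each { |name, value| (@fields[name] ||= []).concat(Array(value)) }
    self
  end

  def to_h
    @fields.dup
  end
end

doc = SketchDoc.new
doc[:contents] = 'To be,'
doc << { :contents => 'or not ' } << { :contents => ['to be'] }
p doc[:contents]
```

Because << returns the document itself, appends can be chained, while
[]= keeps the simple single-value case as short as it is with a Hash.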

One of the first tests I wrote (a long time ago, with less ruby under my
belt) for the jruby interface for adding docs was:

  def test_add_lists
    # Trailing << keeps the chain one expression across lines.
    index << [ :contents, "the quick brown fox jumped over the lazy dog" ] <<
             [ :contents, "Alas poor Yorick,",
               :contents, "I knew him Horatio" ] <<
             [ :contents, [ "To be,", "or not ", "to be" ] ]
  end

Was never too happy with the middle case. Maybe an extra list level?

This doesn't use an explicit doc object but it makes one internally.
It relies on that nice little cool (pronounced "scary") ability in ruby
to say

class Array
  def to_lucene_doc
    ...
  end
end

class Hash
  def to_lucene_doc
    ...
  end
end

module Lucene
  class Document
    def to_lucene_doc
      self
    end
  end

  class Index
    def <<(*docs)
      docs.map! { |doc| doc.to_lucene_doc }
      ...
    end
  end
end
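
The elided bodies above are not given in this message. As a
self-contained sketch of the same duck-typing pattern, the following
fills them in with assumed behavior: an Array is read as alternating
name/value pairs and a Hash as name => value(s). The SketchLucene
module and everything inside it are hypothetical, chosen so the sketch
does not collide with a real Lucene binding.

```ruby
# Hypothetical classes illustrating the to_lucene_doc pattern.
module SketchLucene
  class Document
    attr_reader :fields

    def initialize
      @fields = {}
    end

    # Append one value, or each value of an Array, to the named field.
    def add(name, value)
      (@fields[name] ||= []).concat(Array(value))
      self
    end

    # A Document is already a document.
    def to_lucene_doc
      self
    end
  end

  class Index
    attr_reader :docs

    def initialize
      @docs = []
    end

    # Anything that responds to to_lucene_doc can be added.
    def <<(doc)
      @docs << doc.to_lucene_doc
      self
    end
  end
end

# Reopening core classes affects the whole program (the "scary" part).
class Array
  # Assumed convention: [name, value, name, value, ...]
  def to_lucene_doc
    each_slice(2).each_with_object(SketchLucene::Document.new) do |(name, value), doc|
      doc.add(name, value)
    end
  end
end

class Hash
  def to_lucene_doc
    each_with_object(SketchLucene::Document.new) do |(name, value), doc|
      doc.add(name, value)
    end
  end
end

index = SketchLucene::Index.new
index << [:contents, "the quick brown fox"] <<
         [:contents, "Alas poor Yorick,", :contents, "I knew him Horatio"] <<
         { :contents => ["To be,", "or not ", "to be"] }
puts index.docs.size
```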

I haven't had as much time as I would like, either, to follow solrb (and
I guess I'm also in the minority thinking that solrb was a clever name).

