
Thread: Re: add CJKTokenizer to solr




2007-06-22 06:22:11
Tokenizers are not thread safe (I made a mistake yesterday
saying they are - I don't know what I was thinking).
This is why:

public abstract class Tokenizer extends TokenStream {
  /** The text source for this Tokenizer. */
  protected Reader input;   // <---- oops :(
  ...

public abstract class CharTokenizer extends Tokenizer {
  public CharTokenizer(Reader input) {
    super(input);
  }
  ...
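
To make the problem concrete, here is a minimal sketch. It is not Lucene code:
SketchWhitespaceTokenizer and SketchTokenizerFactory are hypothetical names made
up for illustration. Like the classes above, the tokenizer keeps its Reader in an
instance field, so each instance is tied to one stream and should not be shared
between threads. The factory holds no state, so one factory can be shared and
simply hands out a fresh tokenizer per call:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a Lucene Tokenizer: it keeps its Reader in an
// instance field, so one instance can only serve one input stream.
class SketchWhitespaceTokenizer {
    private final Reader input; // per-instance state: do not share across threads

    SketchWhitespaceTokenizer(Reader input) {
        this.input = input;
    }

    // Returns the next whitespace-delimited token, or null at end of input.
    String next() {
        try {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = input.read()) != -1) {
                if (Character.isWhitespace(c)) {
                    if (sb.length() > 0) {
                        return sb.toString();
                    }
                } else {
                    sb.append((char) c);
                }
            }
            return sb.length() > 0 ? sb.toString() : null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

// Hypothetical factory: it has no fields, so a single instance can be shared.
class SketchTokenizerFactory {
    // Every call returns a new tokenizer bound to the caller's own Reader.
    SketchWhitespaceTokenizer create(Reader input) {
        return new SketchWhitespaceTokenizer(input);
    }

    // Convenience helper: tokenizes one string with a fresh tokenizer.
    List<String> tokenize(String text) {
        SketchWhitespaceTokenizer tokenizer = create(new StringReader(text));
        List<String> tokens = new ArrayList<>();
        for (String t = tokenizer.next(); t != null; t = tokenizer.next()) {
            tokens.add(t);
        }
        return tokens;
    }
}
```

With this shape, each thread calls create() (or tokenize()) on the shared
factory and keeps the returned tokenizer to itself.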

Otis
 
--
Lucene Consulting -- http://lucene-consulting.com/


----- Original Message ----
From: Daniel Alheiros <Daniel.Alheirosbbc.co.uk>
To: solr-userlucene.apache.org
Sent: Friday, June 22, 2007 12:43:50 PM
Subject: Re: add CJKTokenizer to solr

Sorry, I've confused things a bit... Thread safety has to be
considered only for the Tokenizers, not for the factories. So are the
Tokenizers thread safe?

Regards,
Daniel


On 22/6/07 11:36, "Daniel Alheiros" <Daniel.Alheirosbbc.co.uk> wrote:

> Hi Hoss.
> 
> I've done a few tests using reflection to instantiate a simple object, and
> the results vary a lot depending on the JVM. Because the JVM optimizes code
> as it is executed, they also vary depending on the usage, but I think we have
> something to consider:
> 
> I ran 1,000 samples (5 clean runs x a loop of 200), each sample creating
> 100,000 objects, and the results were:
> 
> With reflection:
>     - Average                      : 0.0005418
>     - Worst (first clean execution): 0.0007760
> 
> Without reflection:
>     - Average                      : 0.0000469
>     - Worst (first clean execution): 0.0002140
> 
> So, comparing these numbers, I can see that using reflection in the average
> case costs roughly 10 times more (about 11.5x by these averages) than
> creating the object without reflection.
> 
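
A comparison along these lines could be measured with something like the sketch
below. It is not the code behind the figures above (that code was not posted).
SketchSimpleObject and the method names are hypothetical, and the sketch assumes
that "reflection" means looking the class up by name and calling its no-arg
constructor on every iteration. A naive timing loop like this is sensitive to
JIT warm-up, so its numbers are rough indications only:

```java
import java.lang.reflect.Constructor;

// Hypothetical "simple object" used only for this measurement.
class SketchSimpleObject {
    int value = 42;
}

class SketchReflectionBenchmark {
    // Instantiates the class by name through reflection (lookup on every call).
    static SketchSimpleObject createReflectively(String className) {
        try {
            Constructor<?> ctor = Class.forName(className).getDeclaredConstructor();
            return (SketchSimpleObject) ctor.newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Creates `count` objects with plain `new`; returns elapsed nanoseconds.
    static long timeDirect(int count) {
        long sink = 0; // use each object so the loop is not trivially dead code
        long start = System.nanoTime();
        for (int i = 0; i < count; i++) {
            sink += new SketchSimpleObject().value;
        }
        long elapsed = System.nanoTime() - start;
        if (sink != 42L * count) {
            throw new IllegalStateException("unexpected result");
        }
        return elapsed;
    }

    // Creates `count` objects reflectively; returns elapsed nanoseconds.
    static long timeReflective(int count) {
        String name = SketchSimpleObject.class.getName();
        long sink = 0;
        long start = System.nanoTime();
        for (int i = 0; i < count; i++) {
            sink += createReflectively(name).value;
        }
        long elapsed = System.nanoTime() - start;
        if (sink != 42L * count) {
            throw new IllegalStateException("unexpected result");
        }
        return elapsed;
    }

    public static void main(String[] args) {
        int count = 100_000; // objects per sample, as in the description above
        for (int run = 0; run < 5; run++) {
            System.out.printf("run %d: direct=%d ns, reflective=%d ns%n",
                    run, timeDirect(count), timeReflective(count));
        }
    }
}
```

The first run includes class loading and interpreter time, which is one reason
a "first clean execution" tends to be the worst case.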
> But my question is: do we need to create factories so frequently, or are
> they just created once and re-used (are they thread safe)? The term Factory
> made me think of a class that is responsible for building instances of other
> classes, so usually they can be singletons... If they don't need to be
> created all the time, the reflection cost will not really matter, and it will
> give extra flexibility in terms of incorporating new Tokenizers (it would
> make it easier to keep Solr/Lucene versions less coupled).
> 
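
If the factory is created once and reused, the by-name class lookup only has to
happen at configuration time. A minimal sketch of that idea follows, assuming
every tokenizer has a public constructor taking a Reader. All class names here
are hypothetical, and this is not Solr's actual factory API:

```java
import java.io.Reader;
import java.lang.reflect.Constructor;

// Hypothetical tokenizer base type for this sketch (not Lucene's Tokenizer).
abstract class SketchTokenizer {
    protected final Reader input;

    protected SketchTokenizer(Reader input) {
        this.input = input;
    }
}

// Hypothetical concrete tokenizer with the (Reader) constructor the factory expects.
class SketchCjkTokenizer extends SketchTokenizer {
    public SketchCjkTokenizer(Reader input) {
        super(input);
    }
}

// Hypothetical reflection-based factory: the class and constructor are
// resolved once, when the factory is built, not on every create() call.
final class SketchUberFactory {
    private final Constructor<? extends SketchTokenizer> ctor;

    SketchUberFactory(String className) {
        try {
            Class<? extends SketchTokenizer> cls =
                    Class.forName(className).asSubclass(SketchTokenizer.class);
            this.ctor = cls.getDeclaredConstructor(Reader.class);
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new IllegalArgumentException(
                    "Cannot use tokenizer class: " + className, e);
        }
    }

    // The factory keeps no per-call state; each call returns a new tokenizer
    // bound to the caller's own Reader.
    SketchTokenizer create(Reader input) {
        try {
            return ctor.newInstance(input);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Each create() call still goes through Constructor.newInstance, so it is not
free, but the lookup by class name is not repeated.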
> Environment:
> java version "1.5.0_07"
> Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_07-164)
> Java HotSpot(TM) Client VM (build 1.5.0_07-87, mixed mode, sharing)
> Heap size: 256M
> Running on a PowerPC - Mac OS X 10.4.9 with 1.5 GB RAM
> 
> Regards,
> Daniel
> 
> 
> On 21/6/07 20:39, "Chris Hostetter" <hossman_lucenefucit.org> wrote:
> 
>> 
>> : Why don't we instead create an UbberFactory that takes the Tokenizer
>> : class as a parameter and instantiates the proper Tokenizer?
>> 
>> The idea has come up before ... and there's really no reason why it
>> wouldn't be okay to include a reflection based factory like this in Solr
>> -- it just hasn't been done yet.
>> 
>> One of the reasons is that there are some performance costs associated
>> with the reflection, so we wouldn't want to completely replace the existing
>> "configuration via factory name" model with a "configure via class name
>> and an uber factory does the reflection quietly in the background" model
>> because it's the kind of approach that would really only make sense for
>> simple prototypes -- in any system where you are really concerned about
>> performance, reflection on every analyzer call would probably be pretty
>> expensive.  (although i'd love to see benchmarks prove me wrong)
>> 
>> Another question in my mind is "why doesn't solr provide an optional jar
>> with factories for every tokenizer/tokenfilter in the lucene contribs?"
>> ... the only answer to that is that no one has bothered to crank out a
>> patch that does it.
>> 
>> http://www.nabble.com/Re%3A-making-schema.xml-nicer-to-read-use-p5939980.html
>> http://www.nabble.com/foo-tf1737025.html#a4720545
>> 
>> 
>> -Hoss
>> 
> 
> 
> http://www.bbc.co.uk/
> This e-mail (and any attachments) is confidential and may contain personal
> views which are not the views of the BBC unless specifically stated.
> If you have received it in error, please delete it from your system.
> Do not use, copy or disclose the information in any way nor act in reliance on
> it and notify the sender immediately.
> Please note that the BBC monitors e-mails sent or received.
> Further communication will signify your consent to this.
> 


                    



