Thread: Re: Field constructor, avoiding String.intern()




Re: Field constructor, avoiding String.intern()
2007-02-23 12:28:19
True. However, in the case where you are processing Documents one at a
time and discarding them (e.g. we use a HitCollector to process all
documents from a search), or where memory is not an issue, it would be
nice to have the ability to disable the interning for performance's sake.




Robert Engels wrote:
>
> I don't think it is just the performance gain of equals() where
> intern() matters.
>
> It also reduces memory consumption dramatically when working with
> large collections of documents in memory - although this could also
> be done with constants, there is nothing in Java to enforce it (thus
> the use of intern()).
> 
> 
> On Feb 23, 2007, at 12:02 PM, James Kennedy wrote:
> 
>>
>> In our case, we're trying to optimize document() retrieval and we
>> found that disabling the String interning in the Field constructor
>> improved performance dramatically. I agree that interning should be
>> an option on the constructor.
>> For document retrieval, at least for a small number of fields, the
>> performance gain of using equals() on interned strings is no match
>> for the performance loss of interning the field name of each field.
>>
>>
>>
>> Wolfgang Hoschek-2 wrote:
>>>
>>> I noticed that, too, but in my case the difference was often much
>>> more extreme: it was one of the primary bottlenecks on indexing. This
>>> is the primary reason why MemoryIndex.addField(...) navigates around
>>> the problem by taking a parameter of type "String fieldName" instead
>>> of type "Field":
>>>
>>>     public void addField(String fieldName, TokenStream stream) {
>>>         /*
>>>          * Note that this method signature avoids having a user call new
>>>          * o.a.l.d.Field(...) which would be much too expensive due to the
>>>          * String.intern() usage of that class.
>>>          */
>>>
>>> Wolfgang.
>>>
>>> On Feb 14, 2006, at 1:42 PM, Tatu Saloranta wrote:
>>>
>>>> After profiling in-memory indexing, I noticed that calls to
>>>> String.intern() showed up surprisingly high, especially the one
>>>> from the Field() constructor. This is understandable due to the
>>>> overhead String.intern() has (being a native and synchronized
>>>> method; overhead incurred even if the String is already interned),
>>>> and the fact that this essentially gets called once per
>>>> document+field combination.
>>>>
>>>> Now, it would be quite easy to improve things a bit (in theory),
>>>> such that most intern() calls could be avoided, transparently to
>>>> the calling app; for example, for each IndexWriter() one could use
>>>> a simple HashMap() for caching interned Strings. This approach is
>>>> more than twice as fast as directly calling intern(). One could
>>>> also use a per-thread cache, or a global one; all of which would
>>>> probably be faster.
>>>> However, the Field constructor hard-codes the call to intern(), so
>>>> it would be necessary to add a new constructor that indicates that
>>>> the field name is known to be interned.
>>>> And there would also need to be a way to invoke the new optional
>>>> functionality.
>>>>
>>>> Has anyone tried this approach to see if the speedup is worth the
>>>> hassle (in my case it'd probably be something like 2 - 3%,
>>>> assuming the profiler's 5% for intern() is accurate)?
>>>>
>>>> -+ Tatu +-
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>>>
>> --
>> View this message in context: http://www.nabble.com/Field-constructor%2C-avoiding-String.intern%28%29-tf1123597.html#a9123600
>> Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
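
The HashMap cache that Tatu describes in the quoted message could look roughly like the sketch below. FieldNameCache is a hypothetical helper, not a Lucene class: it calls String.intern() only the first time a given name is seen and serves later lookups from the map. It is not thread-safe, which matches the per-IndexWriter or per-thread use suggested in the quote; a global cache would need synchronization. Whether it is actually faster than calling intern() directly depends on the JVM, so the speedup quoted above should be measured, not assumed.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical helper (not part of Lucene): caches the canonical
 * instance of each field name so that String.intern() is called at
 * most once per distinct name.
 */
final class FieldNameCache {
    // Maps a field name to its interned (canonical) instance.
    private final Map<String, String> cache = new HashMap<String, String>();

    /** Returns the interned instance of name; intern() runs only on a cache miss. */
    String intern(String name) {
        String canonical = cache.get(name);
        if (canonical == null) {
            canonical = name.intern();
            cache.put(canonical, canonical);
        }
        return canonical;
    }

    /** Number of distinct names seen so far. */
    int size() {
        return cache.size();
    }
}
```

Because the cached value is the JVM-interned instance, code that compares field names by identity keeps working as before.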

-- 
View this message in context: http://www.nabble.com/Field-constructor%2C-avoiding-String.intern%28%29-tf1123597.html#a9124055
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.




Re: Field constructor, avoiding String.intern()
2007-02-23 12:54:26
James Kennedy wrote:
> True. However, in the case where you are processing Documents one at a
> time and discarding them (e.g. we use a HitCollector to process all
> documents from a search), or where memory is not an issue, it would be
> nice to have the ability to disable the interning for performance's sake.

Accessing documents from a hit-collector is not advised.  It is
generally best to compose queries and filters to reduce the number of
matches.  When that's not feasible, a hit collector that uses a
FieldCache to filter by or collect field values is much faster than
accessing documents.
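
A dependency-free sketch of that last suggestion follows. CachedValueCollector is a hypothetical stand-in, not Lucene's HitCollector, and the String[] it reads from stands in for the per-document array that, to my understanding, FieldCache.DEFAULT.getStrings(reader, field) returns (one entry per document number). The point is only that collect() does an array lookup instead of loading a stored document, so no Field objects are constructed and no field names are interned per hit.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Stand-in collector (hypothetical; not Lucene's HitCollector).
 * It reads field values from a preloaded per-document array instead
 * of loading each stored document.
 */
final class CachedValueCollector {
    private final String[] valuesByDoc;   // index = document number
    private final List<String> collected = new ArrayList<String>();

    CachedValueCollector(String[] valuesByDoc) {
        this.valuesByDoc = valuesByDoc;
    }

    /** Called once per matching document; score is unused in this sketch. */
    void collect(int doc, float score) {
        String value = valuesByDoc[doc];
        if (value != null) {          // skip documents without a value
            collected.add(value);
        }
    }

    /** Values gathered so far, in collection order. */
    List<String> collected() {
        return collected;
    }
}
```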

Doug



Re: Field constructor, avoiding String.intern()
2007-02-23 17:04:51
On Feb 23, 2007, at 10:28 AM, James Kennedy wrote:

> True. However, in the case where you are processing Documents one at a
> time and discarding them (e.g. we use a HitCollector to process all
> documents from a search), or where memory is not an issue, it would be
> nice to have the ability to disable the interning for performance's sake.

I don't know how much it would increase overall throughput in a
variety of use cases, but one approach could be to add a
copy-like-this factory method such as Field.createField(Reader) to
Field.java, analogous to the method Term.createTerm(String text) that
was added to Term.java some time ago for a similar reason.

This would guarantee that the name continues to be interned, yet it
would avoid the interning overhead in use cases where a field with the
same parametrization (yet a different content String/Reader) is
constructed many times, which is probably the most common case where
intern() overhead might matter.

For example, something like

Field f1 = ...
Field f2 = f1.createSimilarField(Reader);

   /**
    * Optimized construction of new Terms by reusing same field as this Term
    * - avoids field.intern() overhead
    * @param text The text of the new term (field is implicitly same as
    *             this Term instance)
    * @return A new Term
    */
   public Term createTerm(String text)
   {
       return new Term(field,text,false);
   }
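
Applied to fields, the same pattern might look like the sketch below. SketchField is a hypothetical stand-in, not Lucene's Field, and it holds a String value where the proposal above passes a Reader, purely to keep the example short. The package-private constructor interns the name once; createSimilarField goes through a private constructor that skips intern() because the name it passes on is already the interned instance.

```java
/**
 * Stand-in for a Field-like class (hypothetical; not Lucene's Field).
 * The name is interned once in the two-argument constructor and then
 * reused by createSimilarField without another intern() call.
 */
final class SketchField {
    private final String name;    // always the interned instance
    private final String value;

    /** Regular constructor: interns the name. */
    SketchField(String name, String value) {
        this(name, value, true);
    }

    /** Private constructor: skips intern() when the name is known to be interned. */
    private SketchField(String name, String value, boolean intern) {
        this.name = intern ? name.intern() : name;
        this.value = value;
    }

    /** Creates a field with the same (already interned) name and a new value. */
    SketchField createSimilarField(String newValue) {
        return new SketchField(name, newValue, false);
    }

    String name() { return name; }
    String value() { return value; }
}
```

Since the non-interning constructor is private, callers cannot hand in a name that was never interned, which is the guarantee described above.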

Wolfgang.


