Thread: Term pollution from binary data




Term pollution from binary data
2007-11-06 18:56:40
Hi All,

We are experiencing OOMs when binary data contained in text files
(e.g., a base64 section of a text file) is indexed.  We have extensive
recognition of file types but have encountered binary sections inside
otherwise normal text files.

We are using the default value of 128 for termIndexInterval.  The
problem arises because binary data generates a large set of random
tokens, leading to totalTerms/termIndexInterval terms stored in memory.
Increasing the -Xmx is not viable as it is already maxed.
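
As a rough illustration of the totalTerms/termIndexInterval relationship
above, the sketch below computes how many terms would be held in memory
for a few interval values.  The term count of 100 million and the
alternative intervals are hypothetical figures chosen for the example,
not measurements from this index, and the class and method names are
made up for illustration.

```java
class TermIndexEstimate {
    // Approximate number of terms kept in memory, using the
    // totalTerms / termIndexInterval relationship from the message.
    static long indexedTerms(long totalTerms, int termIndexInterval) {
        if (termIndexInterval <= 0) {
            throw new IllegalArgumentException("termIndexInterval must be positive");
        }
        return totalTerms / termIndexInterval;
    }

    public static void main(String[] args) {
        // Hypothetical number of unique terms in a polluted index.
        long totalTerms = 100_000_000L;
        // 128 is the default mentioned above; the others are for comparison only.
        for (int interval : new int[] {128, 512, 2048}) {
            System.out.println("termIndexInterval=" + interval
                    + " -> in-memory terms ~ " + indexedTerms(totalTerms, interval));
        }
    }
}
```

The estimate counts terms only; the memory used per term depends on
term length and per-entry overhead, which this sketch does not model.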

Does anybody know of a better solution to this problem than writing some
kind of binary section recognizer/filter?

It appears that termIndexInterval is factored into the stored index and
thus cannot be changed dynamically to work around the problem after an
index has become polluted.  Other than identifying the documents
containing binary data, deleting them, and then optimizing the whole
index, has anybody found a better way to recover from this problem?

Thanks for any insights or suggestions,

Chuck


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Term pollution from binary data
2007-11-06 19:15:45
I think the binary section recognizer is probably your best bet.

If you write an analyzer that ignores terms that consist only of
hexadecimal digits and contain embedded digits, you will probably
reduce the pollution quite a bit.  It is trivial to write and not too
expensive to check.
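
A minimal sketch of the check described above, written as a plain Java
predicate so that it does not depend on any particular Lucene version.
It reads the rule as "every character is a hexadecimal digit and at
least one is a decimal digit", which is one interpretation of the
suggestion.  The class name, method names, and the minLength parameter
are hypothetical additions for illustration; minLength is not part of
the original suggestion and exists only so that short numbers such as
years are not discarded.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class BinaryTokenCheck {
    static boolean isHexDigit(char c) {
        return (c >= '0' && c <= '9')
                || (c >= 'a' && c <= 'f')
                || (c >= 'A' && c <= 'F');
    }

    // True if the term is made up only of hexadecimal digits, contains at
    // least one decimal digit, and is at least minLength characters long.
    // minLength is an assumption; pass 1 to apply the rule to every term.
    static boolean looksLikeBinaryToken(String term, int minLength) {
        if (term.isEmpty() || term.length() < minLength) {
            return false;
        }
        boolean sawDecimalDigit = false;
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (!isHexDigit(c)) {
                return false; // not a pure hexadecimal term
            }
            if (c >= '0' && c <= '9') {
                sawDecimalDigit = true;
            }
        }
        return sawDecimalDigit;
    }

    // Returns the terms that survive the check, preserving order.
    static List<String> keepNormalTerms(List<String> terms, int minLength) {
        List<String> kept = new ArrayList<>();
        for (String term : terms) {
            if (!looksLikeBinaryToken(term, minLength)) {
                kept.add(term);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("lucene", "3f9a0c7be1", "decade", "2007", "deadbeef01");
        System.out.println(keepNormalTerms(terms, 8));
    }
}
```

Inside an analyzer, a predicate like this would be called from a token
filter that drops matching terms before they reach the index.  Base64
runs use a wider alphabet than hexadecimal, so this check alone would
not catch them.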


On Nov 6, 2007, at 6:56 PM, Chuck Williams wrote:

> Hi All,
>
> We are experiencing OOMs when binary data contained in text files
> (e.g., a base64 section of a text file) is indexed.  We have
> extensive recognition of file types but have encountered binary
> sections inside otherwise normal text files.
>
> We are using the default value of 128 for termIndexInterval.  The
> problem arises because binary data generates a large set of random
> tokens, leading to totalTerms/termIndexInterval terms stored in
> memory.  Increasing the -Xmx is not viable as it is already maxed.
>
> Does anybody know of a better solution to this problem than writing
> some kind of binary section recognizer/filter?
>
> It appears that termIndexInterval is factored into the stored index
> and thus cannot be changed dynamically to work around the problem
> after an index has become polluted.  Other than identifying the
> documents containing binary data, deleting them, and then optimizing
> the whole index, has anybody found a better way to recover from this
> problem?
>
> Thanks for any insights or suggestions,
>
> Chuck




