I think the binary section recognizer is probably your best bet.

If you write an analyzer that ignores terms that consist only of
hexadecimal digits and contain embedded digits, you will probably
reduce the pollution quite a bit. It is trivial to write and not too
expensive to check.
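
Something along these lines is what I have in mind for the check itself.
This is only a rough sketch: the class and method names (HexNoise,
looksLikeBinaryNoise) are made up for illustration and are not part of
Lucene, and I am assuming "contain embedded digits" means at least one
decimal digit anywhere in the term. You would call it from your own
token filter and drop any term for which it returns true.

```java
// Rough sketch only: HexNoise and looksLikeBinaryNoise are hypothetical
// names, not Lucene classes. Wire the check into your own token filter.
public final class HexNoise {
    private HexNoise() {}

    /**
     * Returns true if the term consists only of hexadecimal digits
     * (0-9, a-f, A-F) and contains at least one decimal digit.
     */
    public static boolean looksLikeBinaryNoise(String term) {
        if (term.length() == 0) {
            return false;
        }
        boolean sawDecimalDigit = false;
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (c >= '0' && c <= '9') {
                sawDecimalDigit = true;
            } else if (!((c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F'))) {
                // Any non-hex character means this is an ordinary term.
                return false;
            }
        }
        return sawDecimalDigit;
    }
}
```

Requiring a digit keeps ordinary words that happen to be spelled with
a-f (such as "deadbeef" or "cafe"). Note that, as written, purely
numeric terms such as "2007" would also be treated as noise, so you may
want to exempt short terms if numbers matter to you.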
On Nov 6, 2007, at 6:56 PM, Chuck Williams wrote:
> Hi All,
>
> We are experiencing OOM's when binary data contained in text files
> (e.g., a base64 section of a text file) is indexed. We have
> extensive recognition of file types but have encountered binary
> sections inside of otherwise normal text files.
>
> We are using the default value of 128 for termIndexInterval. The
> problem arises because binary data generates a large set of random
> tokens, leading to totalTerms/termIndexInterval terms stored in
> memory. Increasing the -Xmx is not viable as it is already maxed.
>
> Does anybody know of a better solution to this problem than writing
> some kind of binary section recognizer/filter?
>
> It appears that termIndexInterval is factored into the stored index
> and thus cannot be changed dynamically to work around the problem
> after an index has become polluted. Other than identifying the
> documents containing binary data, deleting them, and then
> optimizing the whole index, has anybody found a better way to
> recover from this problem?
>
> Thanks for any insights or suggestions,
>
> Chuck
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe lucene.apache.org
> For additional commands, e-mail: java-dev-help lucene.apache.org