
Thread: MemoryError on Solaris - GC memory leak ?




MemoryError on Solaris - GC memory leak ?
Germany
2007-05-22 07:33:34
Hi,

We're using PyLucene to create an index of our Python-based database (for
searching). All works fine until the indexer is tested on a larger-scale
database (one with > 2,000,000 documents). First we got an "IOError:
Too many open files" during index creation, which we could fix by using the
compound file format (writer.setUseCompoundFile(True)).
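For anyone hitting the same "Too many open files" error, here is a rough Python sketch for comparing the per-process descriptor limit with what a merge might need. It assumes that a merge opens every file of each segment being merged, and that a non-compound segment consists of several files while a compound-format segment is a single file. The figure of 8 files per segment is only an illustrative guess, not a measured value, and the merge factor of 100 is the one from the log at the end of this message.

```python
import resource  # Unix-only standard library module


def estimated_open_files(merge_factor, files_per_segment):
    """Rough count of files open while merging merge_factor segments at once."""
    return merge_factor * files_per_segment


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print("descriptor limit (soft/hard):", soft, hard)
    # 8 files per segment is an assumed figure for the non-compound format.
    print("non-compound estimate:", estimated_open_files(100, 8))
    # With the compound file format each segment is assumed to be one file.
    print("compound estimate:", estimated_open_files(100, 1))
```

If the non-compound estimate is near or above the soft limit, that would be consistent with the error going away once the compound format is enabled.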

Next we observed "GC Warning" messages during index creation:
 GC Warning: Repeated allocation of very large block (appr. size 1466368):
             May lead to memory leak and poor performance.

And all of a sudden (well, after 8 hours of indexing) the process crashes with a
[Python] MemoryError.

We're using Python 2.5.1 with PyLucene 2.0.0.8 on SunOS Solaris.
PyLucene was built with gcc version 4.1.2 (gcc/sparc-sun-solaris2.9/4.1.2).

'make test' succeeds, though there is one GC warning (on different tests):
 GC Warning: Large stack limit(2147479552): only scanning 8 MB

Has anyone suffered similar problems or knows a solution?

We've been trying to track down the problem and even tried frequent
"manual cleanup" [by calling Python's gc.collect() and PyLucene's
System.gc()], but that didn't seem to help, as the process still grows
constantly during indexing (more details are given below).

I've gone through most of the mailing list, including the thread on
"pylucene and 2gb limit of files", and found that similar problems have been
observed already. Usually the suggestion is to use a different (or patched)
version of GCC, though this doesn't seem to fix the problem in all cases. I
don't think the "2gb limit of files" could be a cause here: the index (on
disk) was actually only about 400 MB when the process died.


So next we'd like to test with GCC 4.2, if that is suggested. Is there a
recommended/stable install config (PyLucene + GCC/GCJ version) for Sun
Solaris?

Are the problems with GCC/GCJ discussed on this list fixed in the current
4.2.0 release, and is it suggested to use this version for PyLucene 2.0?

If there's any other way to get rid of the GC Warning (and memory leak), that
would be of interest of course...


Any help would be appreciated.


Kind regards

Thomas Koch
--
OrbiTeam Software GmbH & Co. KG     
http://www.orbiteam.de
Bonn (Germany)         

--
Note:

Indexing is done in a batch process: the process reads the whole database
and puts a new document into the index for each object in the database.

 open index writer
 Merge factor:  100
 Max buffered docs: 100
 Max merge docs:9999999
 12:56:13 running database scan...
 memory usage: 42.69 MB
...
[now while documents are added to the index the memory 
 consumption of the python process grows and grows...]
 13:23:18 added to index: 44691
 memory usage: 161.851562 MB
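
To put the two memory readings above into perspective, here is a small back-of-the-envelope calculation in Python. It assumes, purely for illustration, that memory keeps growing linearly with the number of documents added, which the log does not prove. Under that assumption the growth is roughly 2.7 KB per document, which would put a full run over 2,000,000 documents at several GB.

```python
def growth_per_doc_mb(start_mb, later_mb, docs_added):
    """Average memory growth per indexed document, in MB."""
    return (later_mb - start_mb) / docs_added


def projected_memory_mb(start_mb, later_mb, docs_added, total_docs):
    """Linear extrapolation of memory use to total_docs documents.

    Linear growth is an assumption made for this estimate only.
    """
    return start_mb + growth_per_doc_mb(start_mb, later_mb, docs_added) * total_docs


if __name__ == "__main__":
    # Values taken from the log above.
    start_mb, later_mb, docs_added = 42.69, 161.851562, 44691
    per_doc_kb = growth_per_doc_mb(start_mb, later_mb, docs_added) * 1024
    print("growth per document: %.2f KB" % per_doc_kb)
    print("projected at 2,000,000 docs: %.0f MB"
          % projected_memory_mb(start_mb, later_mb, docs_added, 2000000))
```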

_______________________________________________
pylucene-dev mailing list
pylucene-dev@osafoundation.org
http://lists.osafoundation.org/mailman/listinfo/pylucene-dev

Re: MemoryError on Solaris - GC memory leak ?
United States
2007-05-22 10:23:11
On Tue, 22 May 2007, Thomas Koch wrote:

> We're using PyLucene to create an index of our python-based database (for
> searching). All works fine until the indexer is tested on a larger scale
> database (like one with > 2.000.000 documents). First we got an "IOError:
> Too many open files" during index creation which we could fix by using the
> compound file format (writer.setUseCompoundFile(True)).
>
> Next we observed "GC Warning" messages during index creation :
> GC Warning: Repeated allocation of very large block (appr. size 1466368):
>             May lead to memory leak and poor performance.
>
> And all of a sudden (well after 8 hours indexing) the process crashes with a
> [Python] MemoryError.
>
> We're using Python-2.5.1 with PyLucene 2.0.0.8 on SunOS Solaris.
> PyLucene was built with gcc version 4.1.2 (gcc/sparc-sun-solaris2.9/4.1.2)
>
> 'make test' succeeds - though there is one GC warning (on different tests):
> GC Warning: Large stack limit(2147479552): only scanning 8 MB
>
> Has anyone suffered similar problems or knows a solution?
>
> We've been trying to track the problem and even tried to use frequent
> "manual cleanup" [by calling Python's gc.collect() and PyLucene's
> System.gc()] but that didnt seem to help as the process still grows
> constantly during indexing (more details are given below).
>
> I've gone through most of the mailing list - including the thread on
> "pylucene and 2gb limit of files" and found that similar problems have been
> observed already - usually the suggestion is to use a different (or patched)
> version of GCC though this doesnt seem to fix the problem in all cases. I
> don't think the "2gb limit of files" could be a cause here - the Index (on
> disc) actually was only about 400 MB when the process died.
>
> So next we'd like to test with GCC 4.2 - if that is suggested. Is there a
> recommended/stable install-config (PyLucene+GCC/GCJ version) for Sun
> Solaris?
>
> Are the (in this list) discussed problems with GCC/GCJ fixed in its current
> 4.2.0 release and is it suggested to use this version for PyLucene 2.0?
>
> If there's any other way to get rid of the GC Warning (and memory leak) that
> would be of interest of course...
>

I have never done anything with gcj or PyLucene on Solaris, and I don't have
access to a Solaris machine either. There are ways to tune the libgcj garbage
collector via environment variables, but I'm not too familiar with them. I'm
unsure about the status of the Solaris port of gcj; it 'may' be incomplete or
in some state of disrepair. I simply don't know.

The java@gcc.gnu.org mailing list (about all things gcj) may have more
information about the Solaris port. I remember someone asking about it
recently. For problems with the garbage collector you may also want to ask the
gc@linux.hpl.hp.com mailing list.

Andi..


Re: MemoryError on Solaris - GC memory leak ?
United States
2007-05-22 11:55:55
On Tue, May 22, 2007 at 02:33:34PM +0200, Thomas Koch wrote:
> Hi,
>
> We're using PyLucene to create an index of our python-based database (for
> searching).

...

> Next we observed "GC Warning" messages during index creation :
>  GC Warning: Repeated allocation of very large block (appr. size 1466368):
>              May lead to memory leak and poor performance.

We had to rebuild GCJ 3.4.6 with LARGE_CONFIG defined to avoid this
message.  I checked GCJ 4.2.0, and LARGE_CONFIG still doesn't seem to
be defined by default.  The comment from 4.2.0's Makefile.direct still
reads:

"# -DLARGE_CONFIG tunes the collector for unusually large heaps.
 #   Necessary for heaps larger than about 500 MB on most machines.
 #   Recommended for heaps larger than about 64 MB.
"

It's possible I'm missing something about the 4.2 build process
which sets LARGE_CONFIG, of course.

Also, the "Large stack limit" message comes from
boehm-gc/solaris_threads.c in gcj, so that warning seems
Solaris-specific.  You might be able to avoid it by setting your
maximum stack size lower than 8M with ulimit (the number reported is
2G?)
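
The usual route would be the shell's ulimit before starting the indexer. As an alternative, here is a hedged Python sketch that lowers the soft stack limit only for a child process, using the standard resource module (Unix-only). The 4 MB cap and the helper names are just illustrative choices, and whether this actually silences the warning on Solaris is untested.

```python
import resource
import subprocess
import sys

STACK_CAP_BYTES = 4 * 1024 * 1024  # illustrative value below 8 MB


def capped_limits(soft, hard, cap):
    """Return (soft, hard) with the soft limit lowered to at most cap."""
    inf = resource.RLIM_INFINITY
    new_soft = cap if soft == inf else min(soft, cap)
    if hard != inf:
        new_soft = min(new_soft, hard)
    return new_soft, hard


def run_with_stack_cap(argv, cap=STACK_CAP_BYTES):
    """Run argv as a child process whose soft stack limit is capped."""
    def lower_stack_limit():
        # Runs in the child just before exec, so the parent is unaffected.
        soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
        resource.setrlimit(resource.RLIMIT_STACK, capped_limits(soft, hard, cap))
    return subprocess.call(argv, preexec_fn=lower_stack_limit)


if __name__ == "__main__":
    # Stand-in for the real indexer: the child just reports its own limit.
    child = [sys.executable, "-c",
             "import resource; print(resource.getrlimit(resource.RLIMIT_STACK)[0])"]
    run_with_stack_cap(child)
```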

> If there's any other way to get rid of the GC Warning (and memory leak) that
> would be of interest of course...

You could probably divide up your documents, and index, say, 50K in
one process, exit, do the next 50K in a new process, etc., tuning
the batch sizes as needed.  Inelegant, but it'd probably work.
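
A minimal sketch of that batching idea in Python, using only the standard library. The worker here is a stub that just prints its range. A real worker (hypothetical, not shown) would open the existing index for appending, add the documents in its range, close the writer and exit, so that whatever the process leaked is returned to the OS.

```python
import subprocess
import sys

BATCH_SIZE = 50000  # the 50K figure suggested above; tune as needed

# Stub worker for illustration only; it reports the range it was given.
WORKER_STUB = ("import sys; start, stop = map(int, sys.argv[1:3]); "
               "print('indexed', start, stop)")


def batch_ranges(total, batch_size):
    """Yield half-open (start, stop) ranges covering 0..total."""
    for start in range(0, total, batch_size):
        yield start, min(start + batch_size, total)


def run_batches(total, batch_size=BATCH_SIZE, worker_argv=None):
    """Index each batch in a fresh process; return the completed ranges."""
    if worker_argv is None:
        worker_argv = [sys.executable, "-c", WORKER_STUB]
    done = []
    for start, stop in batch_ranges(total, batch_size):
        # check_call raises if a worker fails, so a bad batch stops the run.
        subprocess.check_call(worker_argv + [str(start), str(stop)])
        done.append((start, stop))
    return done


if __name__ == "__main__":
    run_batches(120000)
```

The list of completed ranges is about the only state that needs saving between batches, assuming documents can be addressed by a stable position or key.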

    Aaron Lav (asl2@pobox.com)

AW: MemoryError on Solaris - GC memory leak ?
Germany
2007-05-23 06:48:17
> I have never done anything with gcj or PyLucene on Solaris. I
> don't have access to a Solaris machine either. There are ways
> to tune the libgcj garbage collector via environment
> variables that I'm not too familiar with. I'm unsure about
> the status of the Solaris port of gcj, it 'may' be incomplete
> or in some state of disrepair. I simply don't know.
> 
In general GNU ports to Solaris are quite stable. I don't know about GCJ
either, though they explicitly mention Solaris support at
http://gcc.gnu.org/java/faq.html#1_5

> The java@gcc.gnu.org mailing list (about all things gcj) may
> have more information about the Solaris port. I remember
> someone asking about it recently.

Having checked the mailing list, I mainly found Solaris-specific build
problems. The GC memory leak issue is also discussed on that list, but
doesn't seem to be a platform-specific issue.

What I found regarding tuning of the libgcj garbage collector via
environment variables mainly relates to debug info. For example, you may set
GC_LARGE_ALLOC_WARN_INTERVAL to some value to get only every n-th warning
message of the type "Repeated allocation of very large block may lead to poor GC
performance and memory leak." I don't think this is a real fix :-(

For testing purposes I did some "tuning" and set

GC_MAXIMUM_HEAP_SIZE 64000000 (64 MB)
GC_PRINT_STATS 1
GC_LARGE_ALLOC_WARN_INTERVAL 1
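
For reference, one way to apply these settings from Python is to start the indexer as a child process with the variables in its environment. This sketch assumes the collector reads them when it initialises, which is why they are set before the child starts rather than from inside the running process. The helper names are made up for this example.

```python
import os
import subprocess
import sys

# The three settings listed above.
GC_SETTINGS = {
    "GC_MAXIMUM_HEAP_SIZE": "64000000",
    "GC_PRINT_STATS": "1",
    "GC_LARGE_ALLOC_WARN_INTERVAL": "1",
}


def gc_environment(base=None, settings=GC_SETTINGS):
    """Return a copy of the base environment with the GC settings added."""
    env = dict(os.environ if base is None else base)
    env.update(settings)
    return env


def run_indexer(argv, settings=GC_SETTINGS):
    """Run argv as a child process with the GC settings in its environment."""
    return subprocess.call(argv, env=gc_environment(settings=settings))


if __name__ == "__main__":
    # Stand-in for the real indexer script: show what the child sees.
    run_indexer([sys.executable, "-c",
                 "import os; print(os.environ['GC_MAXIMUM_HEAP_SIZE'])"])
```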

Running the PyLucene indexing code again now results in the following GC
messages:
    
GC Warning: Out of Memory!  Trying to continue ...
GC Warning: Out of Memory!  Trying to continue ...
GC Warning: Out of Memory!  Trying to continue ...
GC Warning: Out of Memory!  Returning NIL!
Abort

The GC is consuming a lot of heap, so the process quickly uses up 60 MB of
heap space (total memory usage at that time: 131 MB). This happens after
about 20 minutes of indexing, when 28,000 objects have been visited/indexed (the
whole batch job mainly consists of a loop calling writer.addDocument(doc)).


GC stats show this consumption:
 --%<--
Initiating full world-stop collection 300 after 14331640 allocd bytes
--> Marking for collection 300 after 14331640 allocd bytes + 386496 wasted bytes
Collection 300 finished ---> heapsize = 58777600 bytes
World-stopped marking took 150 msecs
Complete collection took 180 msecs
 --%<--
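
To follow the heap size over a whole run instead of reading single entries, the "Collection N finished" lines can be pulled out of the captured GC_PRINT_STATS output. A small Python sketch, assuming the line format is exactly as in the excerpt above (other collector versions may print it differently):

```python
import re

# Matches lines like: Collection 300 finished ---> heapsize = 58777600 bytes
HEAPSIZE_RE = re.compile(r"Collection (\d+) finished ---> heapsize = (\d+) bytes")


def heap_sizes(lines):
    """Return [(collection_number, heapsize_bytes), ...] found in lines."""
    result = []
    for line in lines:
        match = HEAPSIZE_RE.search(line)
        if match:
            result.append((int(match.group(1)), int(match.group(2))))
    return result


SAMPLE = """\
Initiating full world-stop collection 300 after 14331640 allocd bytes
--> Marking for collection 300 after 14331640 allocd bytes + 386496 wasted bytes
Collection 300 finished ---> heapsize = 58777600 bytes
World-stopped marking took 150 msecs
Complete collection took 180 msecs
"""

if __name__ == "__main__":
    for number, size in heap_sizes(SAMPLE.splitlines()):
        print("collection %d: heap %d bytes" % (number, size))
```

A heap size that keeps rising from one collection to the next would point at memory that stays reachable (or appears reachable to the collector) rather than at garbage that is merely collected late.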

However, it's unclear how far this info really helps to get around the
initial (memory leak) problem.

The main problem I see for poor [Py]Lucene users is that it's difficult to
find out what's going on "behind the scenes", and how to influence it. I
understand that PyLucene is mainly a "Python wrapper" for the Java Lucene
code, which is GCJ-compiled, so PyLucene will "inherit" any problems
with memory allocation from the Java code base. Furthermore, it obviously
suffers from any performance problems that the GCJ compilation may introduce.

I guess the main "magic" of Lucene is in the way the automatic merging of
segments is handled. This code probably maintains large data structures and
may create/dispose of quite a number of objects. On the GCJ mailing list I read
that one problem with the libgcj GC is a performance and memory penalty
compared to the Java 'builtin' GC. So the main question now is whether the
reported (memory leak) problem in PyLucene is caused by the libgcj GC (i.e.
is a GCJ bug) or by the Lucene Java code base...

> For problems with the
> garbage collector you also may want to ask the
> gc@linux.hpl.hp.com mailing list.
>
Thanks, will do.

Regards
Thomas


AW: MemoryError on Solaris - GC memory leak ?
Germany
2007-05-23 07:00:07
> We had to rebuild GCJ 3.4.6 with LARGE_CONFIG defined to
> avoid this message.  I checked GCJ 4.2.0, and LARGE_CONFIG
> still doesn't seem to be defined by default.  The comment
> from 4.2.0's Makefile.direct still
> reads:
> ...

Aaron,

Thanks for the hint. We will try this (we're currently running a build).

> > If there's any other way to get rid of the GC Warning (and memory
> > leak) that would be of interest of course...
>
> You could probably divide up your documents, and index, say,
> 50K in one process, exit, do the next 50K in a new process,
> etc., tuning the batch sizes as needed.  Inelegant, but it'd
> probably work.
>

Well, that would be an option, but it would of course require some more "batch
overhead" (like saving state and the like).

Regards
Thomas

