Thread: ¿leak




¿leak
user name
2007-01-05 15:06:57
Hi all,

I am using libxml2 for parsing HTML in Python. I was thinking that libxml2 could be involved, so I modified one of the website's Python examples in order to process a relevant number of HTML files while I checked the memory consumption with the top command.
And... yes! The program's memory consumption keeps increasing until it finishes.

Am I forgetting something in the code? Or is there something wrong with the Python bindings?

Thank you, Cesar

Note1: I do nothing in the Callback
Note2: I have tried to use the cleanup functions after the 'ctxt = None' with the same results.

****************************************] The Code [****************************************

#!/usr/bin/python -u
import libxml2

#------------------------------------------------------------------------------


# Memory debug specific
libxml2.debugMemory(1)

#------------------------------------------------------------------------------

class callback:
    def startDocument(self):
        print "."

    def endDocument(self):
        pass

    def startElement(self, tag, attrs):
        pass

    def endElement(self, tag):
        pass

    def characters(self, data):
        pass

    def warning(self, msg):
        pass

    def error(self, msg):
        pass

    def fatalError(self, msg):
        pass
       
#------------------------------------------------------------------------------
#------------------------------------------------------------------------------       
import os
import sys

programName = os.path.basename(sys.argv[0])

if len(sys.argv) != 2:
  print "Usage: %s <dir html files>" % programName
  sys.exit(1)

inputPath = sys.argv[1]

if not os.path.exists(inputPath):
  print "Error: directory does not exist"
  sys.exit(1)

inputFileNames = []
dirContent = os.listdir(inputPath)
for fichero in dirContent:
  # Keep only names whose last dot starts an .htm/.html extension
  extension1 = fichero.rfind(".htm")
  extension2 = fichero.rfind(".html")
  dot = fichero.rfind(".")
  extension = max(extension1, extension2)
  if extension != -1 and extension == dot:
    inputFileNames.append(fichero)

if len(inputFileNames) == 0:
  print "Error: no input files"
  sys.exit(1)

handler = callback()
NUM_ITERS = 5
for i in range(NUM_ITERS):
  for inputFileName in inputFileNames:
    print inputFileName
    # os.path.join supplies the separator that plain concatenation would omit
    inputFilePath = os.path.join(inputPath, inputFileName)
    f = open(inputFilePath)
    data = f.read()
    f.close()

    # Push the whole file through the SAX handler in a single chunk
    ctxt = libxml2.htmlCreatePushParser(handler, "", 0, inputFileName)
    ctxt.htmlParseChunk(data, len(data), 1)
    ctxt = None

# Memory debug specific
libxml2.cleanupParser()
if libxml2.debugMemory(1) == 0:
    print "OK"
else:
    print "Memory leak %d bytes" % (libxml2.debugMemory(1))
    libxml2.dumpMemory()

# Other cleanup functions
#libxml2.cleanupCharEncodingHandlers()
#libxml2.cleanupEncodingAliases()
#libxml2.cleanupGlobals()
#libxml2.cleanupInputCallbacks()
#libxml2.cleanupOutputCallbacks()
#libxml2.cleanupPredefinedEntities()
¿leak
user name
2007-01-05 15:26:39
On Fri, Jan 05, 2007 at 04:06:57PM +0100, Cesar Ortiz wrote:
> Hi all,
> 
> I am using libxml2 for parsing html in python. I was thinking that libxml2
> could be involved, so I modified one of the website python examples in order
> to process a relevant number of html files while I checked the memory
> consumption with the top command.

  Which is just a very wrong way to try to assert a memory leak.

> And... yes! the program does increase the memory consumption until it finishes.

  Can be perfectly normal to some point.

> libxml2.cleanupParser()
> if libxml2.debugMemory(1) == 0:
>    print "OK"
> else:
>    print "Memory leak %d bytes" % (libxml2.debugMemory(1))
>    libxml2.dumpMemory()

Libxml2-wise, that's the only serious way to check for leaks;
check that output.

Daniel

-- 
Red Hat Virtualization group http://redhat.com/virtualization/
Daniel Veillard      | virtualization library  http://libvirt.org/
veillardredhat.com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ |
Rpmfind RPM search engine  http://rpmfind.net/
_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
xmlgnome.org
http://mail.gnome.org/mailman/listinfo/xml
¿leak
user name
2007-01-05 16:07:25
On Fri, 2007-01-05 at 10:26 -0500, Daniel Veillard wrote:
> On Fri, Jan 05, 2007 at 04:06:57PM +0100, Cesar Ortiz wrote:
> > Hi all,
> > 
> > I am using libxml2 for parsing html in python. I was thinking that libxml2
> > could be involved, so I modified one of the website python examples in order
> > to process a relevant number of html files while I checked the memory
> > consumption with the top command.
> 
>   Which is just a very wrong way to try to assert a memory leak.
> 
> > And... yes! the program does increase the memory consumption until it finishes.
> 
>   Can be perfectly normal to some point.
> 
> > libxml2.cleanupParser()
> > if libxml2.debugMemory(1) == 0:
> >    print "OK"
> > else:
> >    print "Memory leak %d bytes" % (libxml2.debugMemory(1))
> >    libxml2.dumpMemory()
> 
> Libxml2-wise, that's the only serious way to check for leaks;
> check that output.

Daniel is right on the money here. A few other comments:

The Python bindings for libxml2 are not "pythonic" in the sense that
they do not automatically manage the lifetime of Python objects. You
must explicitly free some of the libxml2 objects, which is something
Python programmers are not used to and may consequently overlook,
producing excessive memory use and leaks.
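
As a rough illustration of what explicit freeing looks like, here is a
minimal sketch, not the code from this thread. It assumes the libxml2
Python bindings are installed (the import is guarded so the snippet does
nothing harmful without them), it uses Python 3 print() rather than the
Python 2 syntax of the original post, and the helper name
root_name_then_free is made up for the example. A tree returned by
htmlParseDoc is not released when the Python variable goes away; it
needs a freeDoc() call.

```python
try:
    import libxml2  # libxml2 Python bindings; not in the standard library
except ImportError:
    libxml2 = None


def root_name_then_free(html):
    """Parse an HTML string, read the root element name, then free the tree.

    Returns None when the libxml2 bindings are not available.
    """
    if libxml2 is None:
        return None
    doc = libxml2.htmlParseDoc(html, None)
    try:
        return doc.getRootElement().name
    finally:
        # The bindings do not free the underlying C tree automatically;
        # dropping the Python reference without this call leaks it.
        doc.freeDoc()


if __name__ == "__main__" and libxml2 is not None:
    libxml2.debugMemory(1)  # turn on libxml2's own allocation accounting
    for _ in range(5):
        root_name_then_free("<html><body><p>hi</p></body></html>")
    libxml2.cleanupParser()
    # Bytes libxml2 still considers allocated after cleanup
    print("bytes still allocated:", libxml2.debugMemory(1))
```

Commenting out the freeDoc() call and comparing the final debugMemory(1)
figure is one way to see the effect of a forgotten free.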

Top and ps are very poor ways to evaluate memory usage; they often
contain misleading information for a host of reasons. You're better
off reading the /proc filesystem. Here is a tool which will format
that information in a pleasant way:
http://people.redhat.com/berrange/mem-monitor/
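
For a quick look at /proc from inside the Python program itself, a
minimal sketch follows. It assumes a Linux system where
/proc/self/status exists and returns None elsewhere. The function names
parse_status_kb and current_rss_kb are made up for this example, and it
is written for Python 3.

```python
import os


def parse_status_kb(status_text, field="VmRSS"):
    """Return a field of /proc/<pid>/status in kB, or None if it is absent."""
    for line in status_text.splitlines():
        if line.startswith(field + ":"):
            # Lines look like "VmRSS:      1234 kB"
            return int(line.split()[1])
    return None


def current_rss_kb(pid="self"):
    """Read the resident set size from /proc; None where /proc is missing."""
    path = "/proc/%s/status" % pid
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return parse_status_kb(f.read())


if __name__ == "__main__":
    print("RSS (kB):", current_rss_kb())
```

Calling current_rss_kb() once per pass over the input files shows
whether the figure keeps climbing or levels off after the first pass.
Even then, it measures the whole process, so it cannot say whether
libxml2 or Python itself holds the memory.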


One can also be fooled when investigating memory usage with Python, as
Python creates everything on the heap, not just objects. Every piece of
Python code which gets loaded directly and indirectly (including all
the doc strings) is allocated as an object. Python programs use a lot
of memory, and large memory use is not an indicator of leaks.
-- 
John Dennis <jdennisredhat.com>
