Thread: Parsing email with large attachment




Parsing email with large attachment
Singapore
2007-09-03 23:27:04
Hi,

I want to use the email package to parse emails with attachments of up to
1GB. However, Python crashes with a MemoryError traceback while parsing an
email with even a 300MB attachment, at this point:

self._cur.set_payload(EMPTYSTRING.join(lines))    --> feedparser.py

I have the email contents in a file, and the code looks like this
(on Python 2.5, WinXP):

self.msg = email.message_from_file(self.stream)
...
...

         #Check if any attachments at all
         if self.msg.get_content_maintype() != 'multipart':
             print 'No attachments in message'
             return

         counter = 1    # initialised once, so each unnamed part gets its own number
         for part in self.msg.walk():
             # multipart/* are just containers
             if part.get_content_maintype() == 'multipart':
                 continue

             # Parts without a Content-Disposition header are treated as body text
             is_attachment = part.get('Content-Disposition')
             if is_attachment is None :
                 #body = part.get_payload(decode=True)
                 #print 'Body' , body
                 continue

             filename = part.get_filename()
             print 'Filename' , filename
             if not filename:
                 filename = 'part-%03d%s' % (counter, '.bin')
                 counter += 1
             att_path = os.path.join(detach_dir, filename)
             #Check if its already there
             if not os.path.isfile(att_path) :
                 fp = open(att_path, 'wb')
                 fp.write(part.get_payload(decode=True))
                 fp.close()


My machine has 2GB of RAM, so memory should not be a problem; it seems
Python tries to allocate one large chunk of memory while doing the list
concatenation. It also seems that the peak memory used for parsing and
extracting the attachment is three times the attachment size:
1) 2x used for parsing
2) 1x used for extracting it

The only way to fix this seems to be rewriting the parser so that it does
not load the attachment into memory at all: write it to a file, pass the
file pointer to set_payload, and decode the attachment in small chunks in
get_payload instead of loading the entire file. That means subclassing
Message to accept a file pointer in set_payload, etc...
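
A minimal sketch of the "decode in small chunks" part of this idea, written for modern Python 3 rather than the 2.5 used in this thread. It is not part of the email package: the function name decode_base64_stream is hypothetical, and it assumes the still-encoded attachment body has already been written to its own file as line-wrapped base64 with padding only at the very end. Locating that body inside the raw message is not covered here.

```python
import base64
import io


def decode_base64_stream(src, dst):
    """Decode line-wrapped base64 from binary file `src` into `dst`.

    Hypothetical helper: only one line plus at most three leftover
    characters is held in memory at any time.
    """
    pending = b""
    for line in src:
        # Drop newlines and any other whitespace from the encoded line.
        pending += b"".join(line.split())
        # base64 decodes in groups of 4 characters; keep the remainder.
        usable = len(pending) - (len(pending) % 4)
        if usable:
            dst.write(base64.b64decode(pending[:usable]))
            pending = pending[usable:]
    if pending:
        raise ValueError("truncated base64 data")


if __name__ == "__main__":
    original = bytes(range(256)) * 10
    encoded = io.BytesIO(base64.encodebytes(original))
    decoded = io.BytesIO()
    decode_base64_stream(encoded, decoded)
    print(decoded.getvalue() == original)
```

With real files, src and dst would be opened in 'rb' and 'wb' mode, so memory use stays independent of the attachment size.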

Is there any other way to fix it, maybe by compiling Python with some
flags that allow list concatenation to access a larger amount of memory?

Thanks,
Vijay


_______________________________________________
Email-SIG mailing list
Email-SIG@python.org
Your options: http://mail.python.org/mailman/options/email-sig/nessto%40sharedlog.com

Re: Parsing email with large attachment
United States
2007-09-04 06:51:53
On Sep 4, 2007, at 12:27 AM, Vijay Rao wrote:

> The only way to fix this seems to be rewriting the parser to not load
> the attachment into memory at all and maybe write it to a file , pass
> the file pointer to set_payload and decode the attachment in small
> chunks in get_payload instead of loading the entire file.
> Subclass message to accept a file pointer in set_payload, etc...

We've long talked about adding an API to allow the parser to store
attachment data externally instead of in memory.  We've never gotten
past the "yes, that would be a good idea" stage though.  Care to
propose an API and work on an implementation?

-Barry


Re: Parsing email with large attachment
Singapore
2007-09-05 21:44:15
Hi,

Yes, I would like to propose an API and work on the implementation.
Any pointers on where to get started?

Vijay


At 07:51 PM 9/4/2007, Barry Warsaw wrote:
>On Sep 4, 2007, at 12:27 AM, Vijay Rao wrote:
>
>>The only way to fix this seems to be rewriting the parser to not load
>>the attachment into memory at all and maybe write it to a file , pass
>>the file pointer to set_payload and decode the attachment in small
>>chunks in get_payload instead of loading the entire file.
>>Subclass message to accept a file pointer in set_payload, etc...
>
>We've long talked about adding an API to allow the parser to store
>attachment data externally instead of in memory.  We've never gotten
>past the "yes, that would be a good idea" stage though.  Care to
>propose an API and work on an implementation?
>
>-Barry