Posted here: http://wsgi.org/wsgi/Specifications/avoiding_serializa
tion
Text copied below for discussion:
:Title: Avoiding Serialization When Stacking Middleware
:Author: Ian Bicking <ianb colorstudy.com>
iscussio
ns-To: Python Web-SIG <web-sig python.org>
:Status: Proposed
:Created: 06-03-2007
.. contents::
Abstract
--------
This proposal gives a strategy for avoiding unnecessary
serialization
and deserialization of request and response bodies. It does
so by
attaching attributes to ``wsgi.input`` and the ``app_iter``,
as well as
a new environment key ``x-wsgiorg.want_parsed_response``.
Rationale
---------
Output-transforming middleware often has to parse the
upstream content,
transform it, then serialize it back to a string for output.
The
original output may have already been in the parsed form
that the
middleware wanted. Or there may be more middleware that
does similar
transformations on the same kind of objects.
The same things apply to the parsing of ``wsgi.input``,
specifically
parsing form data. A similar strategy is presented to avoid
unnecessarily reparsing that data.
Specification
-------------
WSGI applications (or middleware) can return an app_iter
that not only
serializes the output, but also has extra attributes. An
attribute is
given here, ``app_iter.x_wsgiorg_parsed_response`` which is
a
function/method that takes one argument, the
"type" of object that you
want to receive. It may return that type of object, or None
(meaning it
cannot produce that type of object). Consumers should fall
back on
normal parsing of the response if the method does not exist,
or returns
None.
Similarly the ``environ['wsgi.input']`` object may have the
same method,
with the same meaning.
WSGI applications that want to lazily serialize their output
have a
problem: they probably cannot calculate ``Content-Length``
without doing
the actual serialization. Browsers typically want to know
about
``Content-Length``, but WSGI middleware seldom cares, since
it just can
get the content from app_iter regardless of its length.
WSGI middleware
that will transform the output can set
``environ['x-wsgiorg.want_parsed_response'] = True`` to give
this hint
to the application. Applications are thus encouraged to
only lazily
serialize their output when that key is present and true.
(There is no
equivalent concept for ``wsgi.input``.)
The object returned by ``.x_wsgiorg_parsed_response()`` may
be modified
in-place by the WSGI middleware using that object.
Producers should
make a copy if they do not want consumers modifying the
object.
Example
--------
Two examples are provided: one for output, and one for
input.
The output transformation parses the page with
``lxml.etree.HTML`` (from
the `lxml <http://codespeak.net/l
xml/>`_ library) and replaces all
``<i>`` tags with ``<em>`` tags. First we show
the middleware::
import lxml.etree
class EmTagMiddleware(object):
def __init__(self, app):
self.app = app
def __call__(self, environ, start_response):
parent_wants_parsed =
environ.get('x-wsgiorg.want_parsed_response')
environ['x-wsgiorg.want_parsed_response'] =
True
written_output = []
captured_headers = []
def repl_start_response(status, headers,
exc_info=None):
if exc_info:
raise exc_info[0], exc_info[1],
exc_info[2]
captured_headers[:] = [status, headers]
return written_output.append
app_iter = self.app(environ,
repl_start_response)
parsed = None
if captured_headers and not written_output:
method = getattr(app_iter,
'x_wsgiorg_parsed_response',
None)
if method:
parsed = method(lxml.etree._Element)
if parsed is None:
# Have to manually parse, because:
# a) start_response was called lazily
# b) the start_response writer was used
# c) app_iter.x_wsgiorg_parsed_response
didn't exist
# d) that method returned None
try:
for item in app_iter:
written_output.append(item)
finally:
if hasattr(app_iter, 'close'):
app_iter.close()
parsed =
self.parse_body(''.join(written_output))
status, headers = captured_headers
new_body = self.transform_body(parsed)
for i in range(len(headers)):
if headers[i][0].lower() ==
'content-length':
del headers[i]
break
if parent_wants_parsed:
new_app_iter =
self.make_app_iter(new_body)
else:
serialized_body = serialize(new_body)
headers.append(('Content-Length',
str(len(serialized_body))))
new_app_iter = [serialized_body]
return new_app_iter
def parse_body(self, body):
return lxml.etree.HTML(body)
def transform_body(self, root):
for el in root.xpath('//i'):
el.tag = 'em'
return root
def make_app_iter(self, body):
return LazyLXML(body)
def serialize(element):
return lxml.etree.tostring(element)
class LazyLXML(object):
def __init__(self, body):
self.body = body
self.have_yielded = False
def __iter__(self):
return self
def next(self):
if self.have_yielded:
raise StopIteration
self.have_yielded = True
return serialize(self.body)
def x_wsgiorg_parsed_response(self, type):
if type is lxml.etree._Element:
return self.body
return None
Here's a simpler example for parsing normal form inputs in
``wsgi.input``::
import cgi
import urllib
from cStringIO import StringIO
def parse_form(environ):
content_type = environ.get('CONTENT_TYPE', '')
assert content_type in
['application/x-www-form-urlencoded',
'multipart/form-data']
wsgi_input = environ['wsgi.input']
method = getattr(wsgi_input,
'x_wsgiorg_parsed_response', None)
if method:
parsed = method(cgi.FieldStorage)
if parsed is not None:
return parsed
form = cgi.FieldStorage(fp=wsgi_input,
environ=environ,
keep_blank_values=True)
environ['wsgi.input'] = FakeFormInput(form)
return form
class FakeFormInput(object):
def __init__(self, form):
self.form = form
self.serialized = None
def x_wsgiorg_parsed_response(self, type):
if type is cgi.FieldStorage:
return self.form
return None
def read(self):
if self.serialized is None:
self._serialize()
return self.serialized.read()
def readline(self, *args):
if self.serialized is None:
self._serialize()
return self.serialized.readline(*args)
def readlines(self, *args):
if self.serialized is None:
self._serialize()
return self.serialized.readlines(*args)
def __iter__(self):
if self.serialized is None:
self._serialize()
return iter(self.serialized)
def _serialize(self):
# XXX: Doesn't deal with file uploads, and
multipart/form-data generally
data = urllib.urlencode(self.form.list, True)
self.serialized = StringIO(data)
Problems
--------
Obviously the code is not simple, but this is the nature of
WSGI
output-transforming middleware. Ideally a framework of some
sort would
be used to construct this kind of middleware.
Something that replaces ``wsgi.input`` (like the example)
may change the
``CONTENT_LENGTH`` of the request; normalization alone may
change the
length, even if the data is the same (e.g., there are
multiple ways to
urlencode a string). However, there's no way without
actually
serializing to determine the proper length. Ideally
requests like this
should allow simply reading to the end of the object,
without needing a
``CONTENT_LENGTH`` restriction (this is not true for socket
objects).
Ideally something like ``CONTENT_LENGTH="-1"``
would indicate this
situation (simply a missing ``CONTENT_LENGTH`` generally
means ``0``).
Another option is to set it to 1 and simply return the
entire serialized
response all at once. ``cgi.FieldStorage`` actually
protects against
this. Or set it to a very very large value, and allow
reading past the
end (returning ``""``). This is likely to work
with most consumers.
I'm not sure what effect -1 will have on different code.
Other Possibilities
-------------------
* You could simply parse everything ever time.
* You could pass data through callbacks in the environment
(but this can
break non-aware middleware).
* You can make custom methods and keys for each case.
* You can use something other than WSGI.
I think this specification offers advantages over all these
options.
Open Issues
-----------
Should "type" be the class object? A string
describing the type?
Things like ``lxml.etree._Element`` are a little unclean,
since the
*actual* class isn't a public object (only the factory
function
``lxml.etree.Element``). Also, there are occasionally times
when
multiple classes implement the same interface.
The boolean ``environ['x-wsgiorg.want_parsed_response']``
doesn't really
give any idea of what *kind* of object you want. This is
actually
something of a problem, because sometimes it's impossible to
give that
kind of object. For instance, if you want to transform
images you might
want the PIL object for the image. But if the response is
HTML there's
no way to give this type. Similarly if you are transforming
HTML then
images don't mean anything to you, and you probably *do*
want them to
come out as normal. And potentially *both* a image
transformer and an
HTML transformer are in the stack. Should that key actually
hold a list
of types that are of interest?
``x_wsgiorg_parsed_response`` isn't a very good name for the
method on
``wsgi.input``, as it's not a response.
_______________________________________________
Web-SIG mailing list
Web-SIG python.org
Web SIG: http://www.python.
org/sigs/web-sig
Unsubscribe: http://mail.python.org/mailman/options/web-sig/bo
nd%40yahoo.com
|
At 08:08 PM 3/6/2007 -0600, Ian Bicking wrote:
>Posted here: http://wsgi.org/wsgi/Specifications/avoiding_serializa
tion
>
>Text copied below for discussion:
>
>
>:Title: Avoiding Serialization When Stacking Middleware
>:Author: Ian Bicking <ianb colorstudy.com>
> iscussio
ns-To: Python Web-SIG <web-sig python.org>
>:Status: Proposed
>:Created: 06-03-2007
>
>.. contents::
>
>Abstract
>--------
>
>This proposal gives a strategy for avoiding unnecessary
serialization
>and deserialization of request and response bodies. It
does so by
>attaching attributes to ``wsgi.input`` and the
``app_iter``, as well as
>a new environment key
``x-wsgiorg.want_parsed_response``.
>
>Rationale
>---------
>
>Output-transforming middleware often has to parse the
upstream content,
>transform it, then serialize it back to a string for
output. The
>original output may have already been in the parsed form
that the
>middleware wanted. Or there may be more middleware that
does similar
>transformations on the same kind of objects.
HTTP already includes a mechanism for specifying what types
are accepted by
a content consumer: the "Accept" header. You can
always add other values
to it to indicate the parsed values you can accept.
Of course, this doesn't really work well with WSGI - you
want the result to
actually *be* WSGI... so you can use the WSGI way of doing
this, which is
to have a standard wrapper for the specific content type you
want to use.
The wrapper (as with the wsgi "file wrapper")
simply puts a WSGI face on a
non-WSGI result body, converting it to an iterator of
strings, and holding
other attributes known to the middleware or other
application object.
This could be implemented as an environ key containing a
mapping from types
to wrapper functions. Middleware that wants a type just
copies the mapping
and overwrites any entries it cares about. Applications
that want to
return a non-serialized result just look up the type (using
__mro__ order)
to find an applicable wrapper.
Notice that this approach doesn't require any special
protocol for these
wrappers -- just WSGI. It's simpler to specify, and simpler
to implement
than what you propose, while addressing some of the open
issues.
Yes, it does have some problems with interface vs.
implementation. ISTM
that trying to solve that problem is effectively asking to
revive or
reinvent PEP 246, however. But we could explicitly allow
the use of type
names instead of the actual types.
>The same things apply to the parsing of ``wsgi.input``,
specifically
>parsing form data. A similar strategy is presented to
avoid
>unnecessarily reparsing that data.
I would rather offer an optional 'get_file_storage()' method
or some such
as a blessed WSGI extension, than have such an open-ended
"get whatever you
want from the input object" concept floating around. A
strategy which
reinvents half of PEP 246 (the *old* PEP 246, before it
became almost as
complicated as WSGI) seems like overkill to me.
>Obviously the code is not simple, but this is the nature
of WSGI
>output-transforming middleware.
Something I'd like to fix in WSGI 2.0, by getting rid of
both
"start_response" and "write", but that's
a discussion for another time.
>Other Possibilities
>-------------------
>
>* You could simply parse everything ever time.
>* You could pass data through callbacks in the
environment (but this can
>break non-aware middleware).
>* You can make custom methods and keys for each case.
>* You can use something other than WSGI.
And you can use the established WSGI method for adding
semantics to a
response, using a middleware-supplied wrapper. I think this
is actually
the best alternative.
In truth, it could be as simple as using the class's
fully-qualified name
as an environ key (perhaps with a prefix or suffix), with
the value being a
wrapper for objects implementing that protocol. No
x-foobar-wsgiorg-whatchamacallit cruft needed.
And, it's lightweight enough of a concept to be expressed as
a simple "best
practice" design pattern.
_______________________________________________
Web-SIG mailing list
Web-SIG python.org
Web SIG: http://www.python.
org/sigs/web-sig
Unsubscribe: http://mail.python.org/mailman/options/web-sig/bo
nd%40yahoo.com
|