From: Robert Sparks <rjsparks nostrum.com>
The BNF in 3261 says the following:
extension-header = header-name HCOLON header-value
header-value = *(TEXT-UTF8char / UTF8-CONT / LWS)
This is intended to be the catch-all field for all future extensions -
older parsers working against this BNF shouldn't barf when we
introduce a new header field.
Now, we may have new fields in the future that look like:
NewHeader = new-header-name HCOLON quoted-string
And down inside quoted-string, we get:
quoted-string = SWS DQUOTE *(qdtext / quoted-pair ) DQUOTE
qdtext = LWS / %x21 / %x23-5B / %x5D-7E / UTF8-NONASCII
quoted-pair = "\" (%x00-09 / %x0B-0C / %x0E-7F)
The whole situation is rather icky. I can see five problems:

1. header-value generates solo UTF8-CONT, the extension bytes of UTF-8
characters, which are the range x80-BF. Why this is so is unclear: the
syntax cannot generate a solo UTF-8 initial byte which would govern
the extension byte, and the syntax also does not admit the
(single-byte) encodings of a lot of the characters in the ISO-8859-*
character sets, so it does not permit embedding the one-byte ISO-8859
encodings either. It appears to me that the inclusion of UTF8-CONT in
the production is unintended.

2. quoted-string admits (most of) x00-1F, by way of quoted-pair, even
though extension-header does not.

3. Since quoted-string is used in many defined headers, we are already
in the position of having defined headers that cannot be parsed as
extension-header as a catch-all mechanism.

4. Given that there is no common character encoding within which all
of these productions can be uniformly interpreted, the only overall
description that can be given of the encoding of SIP headers is
"*OCTET". And yet SIP is not intended to be a binary protocol.

5. In quoted-string, a backslash is permitted to quote any ASCII
character, but not any Unicode character x80 or higher. (Even so, the
backslash is not used to quote a solo UTF-8 initial byte.) This leads
to the peculiar result that some letters used in (e.g.) French can be
preceded by backslash in quoted-string, but others cannot.
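
To make problems 1, 2 and 5 concrete, here is a minimal Python sketch
that transcribes the RFC 3261 productions quoted above into byte-level
regular expressions. The function and variable names are my own, the
UTF8-NONASCII lead-byte ranges are my reading of the RFC 3261 ABNF,
and header-name and HCOLON are not modelled, so treat it as an
illustration rather than a reference parser.

```python
import re

# Byte-level regex fragments transcribed from the RFC 3261 ABNF.
# All names here are illustrative, not taken from any SIP library.
CONT = rb"[\x80-\xBF]"  # UTF8-CONT
LEADS = [(rb"[\xC0-\xDF]", 1), (rb"[\xE0-\xEF]", 2), (rb"[\xF0-\xF7]", 3),
         (rb"[\xF8-\xFB]", 4), (rb"[\xFC-\xFD]", 5)]
# UTF8-NONASCII: a lead byte followed by the right number of UTF8-CONT.
NONASCII = b"(?:" + b"|".join(
    lead + CONT + b"{%d}" % n for lead, n in LEADS) + b")"
TEXT = rb"(?:[\x21-\x7E]|" + NONASCII + b")"  # TEXT-UTF8char
LWS = rb"(?:(?:[ \t]*\r\n)?[ \t]+)"           # [*WSP CRLF] 1*WSP
QDTEXT = b"(?:" + LWS + rb"|[\x21\x23-\x5B\x5D-\x7E]|" + NONASCII + b")"
QUOTED_PAIR = rb"(?:\\[\x00-\x09\x0B\x0C\x0E-\x7F])"

# header-value = *(TEXT-UTF8char / UTF8-CONT / LWS)
HEADER_VALUE = re.compile(b"(?:" + TEXT + b"|" + CONT + b"|" + LWS + b")*")
# quoted-string = SWS DQUOTE *(qdtext / quoted-pair) DQUOTE
QUOTED_STRING = re.compile(
    b"(?:" + LWS + b')?"(?:' + QDTEXT + b"|" + QUOTED_PAIR + b')*"')


def is_header_value(value: bytes) -> bool:
    """True if the whole byte string matches header-value."""
    return HEADER_VALUE.fullmatch(value) is not None


def is_quoted_string(value: bytes) -> bool:
    """True if the whole byte string matches quoted-string."""
    return QUOTED_STRING.fullmatch(value) is not None


if __name__ == "__main__":
    control_pair = b'"a\\\x01b"'                   # backslash + x01
    e_acute = '"\\\u00e9"'.encode("utf-8")         # backslash + e-acute
    # Problem 2/3: a valid quoted-string that is not a header-value.
    print(is_quoted_string(control_pair), is_header_value(control_pair))
    # Problem 1: a solo continuation byte passes, a solo lead byte fails.
    print(is_header_value(b"\x80"), is_header_value(b"\xC3"))
    # Problem 5: backslash may precede "e" but not e-acute.
    print(is_quoted_string(b'"\\e"'), is_quoted_string(e_acute))
```

The regular expressions are not hardened against pathological input;
they are only meant to show which byte strings each production admits.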
Based on RFC 3261 section 7:

SIP is a text-based protocol and uses the UTF-8 charset
(RFC 2279 [7]).

my understanding is that the intention for SIP headers is that they
are sequences of Unicode characters encoded using UTF-8. I see no
reason to abandon that principle and I've not heard of any instance
where anyone has done so deliberately.
To hold to that principle and clean up the above problems, the BNF
would need to be revised to be:

extension-header = header-name HCOLON header-value
header-value = *(TEXT-UTF8char / LWS)
quoted-string = SWS DQUOTE *(qdtext / quoted-pair ) DQUOTE
qdtext = LWS / %x21 / %x23-5B / %x5D-7E / UTF8-NONASCII
quoted-pair = "\" (%x20-7E / UTF8-NONASCII)

or equivalently

quoted-pair = "\" (SP / TEXT-UTF8char)
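
As a rough check of the revised rules, the following Python sketch
encodes them as byte-level regular expressions and tries a few sample
values. The names, the sample values, and the UTF8-NONASCII lead-byte
ranges (my reading of the RFC 3261 ABNF) are assumptions of the
sketch. It only spot-checks that the two spellings of quoted-pair
agree and that these quoted-strings also parse as header-value; it
does not prove either claim.

```python
import re

# Fragments shared by both grammars; names are illustrative only.
CONT = rb"[\x80-\xBF]"  # UTF8-CONT
LEADS = [(rb"[\xC0-\xDF]", 1), (rb"[\xE0-\xEF]", 2), (rb"[\xF0-\xF7]", 3),
         (rb"[\xF8-\xFB]", 4), (rb"[\xFC-\xFD]", 5)]
NONASCII = b"(?:" + b"|".join(
    lead + CONT + b"{%d}" % n for lead, n in LEADS) + b")"
TEXT = rb"(?:[\x21-\x7E]|" + NONASCII + b")"  # TEXT-UTF8char
LWS = rb"(?:(?:[ \t]*\r\n)?[ \t]+)"           # [*WSP CRLF] 1*WSP
QDTEXT = b"(?:" + LWS + rb"|[\x21\x23-\x5B\x5D-\x7E]|" + NONASCII + b")"

# The two spellings of the revised quoted-pair rule.
PAIR_A = rb"(?:\\(?:[\x20-\x7E]|" + NONASCII + b"))"  # %x20-7E / UTF8-NONASCII
PAIR_B = rb"(?:\\(?:\x20|" + TEXT + b"))"             # SP / TEXT-UTF8char


def compile_quoted_string(pair: bytes) -> "re.Pattern[bytes]":
    """quoted-string = SWS DQUOTE *(qdtext / quoted-pair) DQUOTE."""
    return re.compile(
        b"(?:" + LWS + b')?"(?:' + QDTEXT + b"|" + pair + b')*"')


QUOTED_STRING_A = compile_quoted_string(PAIR_A)
QUOTED_STRING_B = compile_quoted_string(PAIR_B)
# Revised header-value = *(TEXT-UTF8char / LWS), without UTF8-CONT.
HEADER_VALUE = re.compile(b"(?:" + TEXT + b"|" + LWS + b")*")


def check(value: bytes) -> tuple:
    """Return (quoted-string A, quoted-string B, header-value) results."""
    return (QUOTED_STRING_A.fullmatch(value) is not None,
            QUOTED_STRING_B.fullmatch(value) is not None,
            HEADER_VALUE.fullmatch(value) is not None)


SAMPLES = [
    b'"plain"',
    '"\\\u00e9"'.encode("utf-8"),  # backslash + e-acute, now allowed
    b'"\\ "',                      # backslash + SP
    b'"a\\\x01b"',                 # backslash + control byte, now rejected
    b"\x80",                       # solo UTF8-CONT, now rejected
]

if __name__ == "__main__":
    for sample in SAMPLES:
        print(sample, check(sample))
```

For each sample the first two results should agree, and any sample
accepted as quoted-string should also be accepted as header-value.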

Relative to other proposals, in a sense I'm proposing that
extension-header and quoted-string be contracted so that they
coincide, as I think that makes all the rules consistent and
conceptually coherent and does not exclude any current usage. This
leaves me in the dark as to why interoperability problems have been
seen.
Robert, can you show us some examples?
Dale
_______________________________________________
Sip mailing list https://www1.ietf.org/mailman/listinfo/sip
This list is for NEW development of the core SIP Protocol
Use sip-implementors cs.columbia.edu for questions on current sip
Use sipping ietf.org for new developments on the application of sip