Thread: Proposal for supporting WC file content encoding

2006-03-26 19:52:17
Dear Subversion-dev,

I'm proposing to add functionality for "handling" encoding in the
text content which Subversion handles.

I've read the discussion on UTF-16 support (as referenced in
http://subversion.tigris.org/issues/show_bug.cgi?id=2194), and I've
been lacking locale-aware content encoding myself (in fact, I assumed
it was present already...), and I think the feature can be implemented
without changing too much of the client code. I hope I'm not just
stating the obvious, but I think it could be done in these few steps:

1) Property support for specifying text encoding
2) ASCII-based encoding conversion support in the WC library (along
   with current EOL and keyword handling in 'subst')
3) Extend support to "non-ASCII-based" encodings (like UTF-16, EBCDIC)
4) Adding auto-property support for text encoding

Each step is independent from the next, so that we could stop at any
time; no "big bang" is necessary. The proposal is based on four main
principles:

* It's free if you don't need it
* Don't surprise the user
* UTF-8 is the new ASCII ("1112060 code points should be enough for
  everybody")
* Must be backwards compatible

The idea is to normalize all text which uses the feature to UTF-8 and
then only convert to/from specified encodings at the outermost level
of the WC handling, like this:

  WC file contents (user-specified encoding or locale's)
     ||
  Enriched contents (UTF-8, w/keywords and/or EOL trans)
     ||
  Pristine contents (UTF-8, as stored in FS)

'svn diff' between WC and pristine would convert the WC file up to
the "enriched" level before feeding it to the diff libraries. (I'm
not sure how this would be handled for external diff packages; it
might have to save to a temp. file.)

The server (RA level) would only see the UTF-8 versions and would not
need any changes. The client would detect the encoding by looking at
the property, and act accordingly. Old clients would not know about
this and would only see a UTF-8 file.
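
A minimal Python sketch of the outermost conversion layer described
above, assuming the encoding name is already known. The function names
are hypothetical and not part of any Subversion API, and the
keyword/EOL ("enriched") step is left out entirely:

```python
def wc_to_repository(wc_bytes: bytes, wc_encoding: str) -> bytes:
    """Decode working-copy bytes and re-encode them as UTF-8."""
    return wc_bytes.decode(wc_encoding).encode("utf-8")


def repository_to_wc(repo_bytes: bytes, wc_encoding: str) -> bytes:
    """Decode normalized UTF-8 bytes and re-encode them for the working copy."""
    return repo_bytes.decode("utf-8").encode(wc_encoding)


# Round trip for a file whose working-copy encoding is ISO-8859-1.
latin1_bytes = "M\u00f8ller".encode("iso-8859-1")
normalized = wc_to_repository(latin1_bytes, "iso-8859-1")
round_tripped = repository_to_wc(normalized, "iso-8859-1")
```

Only the client-side functions ever see the ISO-8859-1 bytes; the
normalized form is what the server would store.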

Further details about the four steps:

Ad 1) Property support for specifying text encoding:

I propose that we introduce a new property for text files called
svn:text-encoding (or maybe svn:text-encoding-style, or perhaps just
svn:encoding?).

This can take three kinds of values:

- The name of a specific encoding, like ISO-8859-1 or UTF-8
- The special value 'native'
- Empty or missing (the default)

The idea is that IF svn:text-encoding is specified, then the WC
library and the clients in general are responsible for converting to
and from the specified format (with 'native' being the system's
default encoding), and that the RA level only ever sees UTF-8 for
these text resources. The encoding is said to be "managed".

This follows the style of svn:eol-style and needs support in roughly
the same places.

The 'native' mode is interesting for the case where text files (like
Java source files) do not carry the encoding with them (like e.g. XML
does).

If the text-encoding is not set, then the encoding is "unmanaged" in
that it works like it does today.
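
The three kinds of property values could be resolved roughly as in
the following Python sketch. The function name is hypothetical, and
using Python's codec registry and locale module to stand in for
"the system's default encoding" is an assumption made purely for
illustration:

```python
import codecs
import locale
from typing import Optional


def resolve_text_encoding(prop_value: Optional[str]) -> Optional[str]:
    """Map a (proposed) svn:text-encoding value to a codec name.

    Returns None when the encoding is "unmanaged", i.e. the property
    is empty or missing.
    """
    if prop_value is None or prop_value.strip() == "":
        return None
    value = prop_value.strip()
    if value == "native":
        # Assumption: the locale's preferred encoding plays the role of
        # the system's default encoding.
        value = locale.getpreferredencoding(False)
    # codecs.lookup raises LookupError for unknown encoding names.
    return codecs.lookup(value).name
```

An unknown name raises LookupError here; a real client would have to
decide whether that is an error or a fallback to "unmanaged".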

Ad 2) ASCII-based encoding conversion support in the WC library:

The first step in supporting this would be to add the support into
the WC and client libraries. For 8-bit (ASCII-based) encodings, the
basic support of this doesn't touch the diff support, which at this
point already makes some assumptions about the encoding, as far as I
can tell.

I think the "streamy" API in svn_subst.c can be layered with the
encoding support. Also, diff output should reflect these encoding
changes, to show "encoding only" changes:

Index: cool-stuff/todo.txt
===================================================================
--- cool-stuff/todo.txt (revision 42, ISO-8859-1)
+++ cool-stuff/todo.txt (working copy, UTF-8)
svn:text-encoding = UTF-8
Property changes on: cool-stuff/todo.txt
___________________________________________________________________
Name: svn:text-encoding
- ISO-8859-1
+ UTF-8

There are some edge cases to be considered, when the text-encoding
changes from "unmanaged" to "managed" (or back), where the diff engine
would pick up all kinds of "bogus" text changes. This may need special
attention.

Another edge case: Some commit logic should be present to check that
a "managed" file being checked in is in fact valid in the said
encoding (so that careless handling of file encoding won't
inadvertently break the repository data).
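
A commit-time validity check of this kind could look like the
following Python sketch. The helper name is hypothetical; it only
shows strict decoding as one possible validity test:

```python
from typing import Optional


def invalid_offset(wc_bytes: bytes, encoding: str) -> Optional[int]:
    """Return the byte offset of the first undecodable sequence, or None."""
    try:
        wc_bytes.decode(encoding, errors="strict")
    except UnicodeDecodeError as err:
        return err.start
    return None


# An ISO-8859-1 byte (0xE9) in a file declared as UTF-8 is rejected.
bad = invalid_offset(b"caf\xe9", "utf-8")
good = invalid_offset(b"caf\xc3\xa9", "utf-8")
```

Note that every byte sequence is valid ISO-8859-1, so a check like
this can only catch mistakes for encodings such as UTF-8 or UTF-16
that have invalid sequences.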

Ad 3) Extend this support to "non-ASCII-based" encodings (like UTF-16,
EBCDIC):

Actually this may not be a big issue at all, if the conversions are
added at the right level, since the main diffing engine would always
work on UTF-8 (in fact, it would always work on 8-bit oriented
streams separated by LFs, just like it does now).

The only change I can think of right now is the fixed-width keyword
substitution, which today works on bytes, but could work fine on
characters if the knowledge was there.
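
The difference between byte-based and character-based fixed widths
can be shown with a small Python sketch. It does not reproduce
Subversion's actual substitution code; both helpers are hypothetical
and only demonstrate why the unit of the width matters for UTF-8:

```python
def truncate_bytes(value: str, width: int) -> bytes:
    """Cut the UTF-8 form at a byte count; may split a multi-byte character."""
    return value.encode("utf-8")[:width]


def truncate_chars(value: str, width: int) -> bytes:
    """Cut at a character count, then encode; always yields valid UTF-8."""
    return value[:width].encode("utf-8")


name = "J\u00f8rgen"
by_bytes = truncate_bytes(name, 2)   # ends in the middle of the 2-byte character
by_chars = truncate_chars(name, 2)   # keeps the whole second character
```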

Ad 4) Adding auto-property support for text encoding:

Plenty of options exist: BOM detection, detection of UTF-8
leading/trailing bytes, checking for XML declarations, etc. There
should also be a configuration setting for preferring the native
encoding over the detected one (if the detector sees a file encoded
with the encoding which is also the current native encoding).
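
Two of these options, BOM detection and a UTF-8 validity check, can
be sketched in Python as follows. The function name and the returned
labels are assumptions for illustration; XML declaration sniffing and
the native-encoding preference are not covered:

```python
import codecs
from typing import Optional

# Longer BOMs first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]


def guess_encoding(data: bytes) -> Optional[str]:
    """Guess an encoding from a BOM, else from UTF-8 validity, else None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return None
    # Plain ASCII also lands here, since ASCII is a subset of UTF-8.
    return "UTF-8"
```

A None result means the detector has no opinion, which is where a
configured default (such as the native encoding) could take over.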
How does this sound?
-Jesper
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe subversion.tigris.org
For additional commands, e-mail: dev-help subversion.tigris.org

2006-03-28 13:16:05
Jesper Steen Møller wrote:
>
> I'm proposing to add functionality for "handling" encoding in the
> text content which Subversion handles.

This proposal looks generally quite promising, with the potential to
introduce some useful and practical behaviours, but I'm not exactly
sure what you are aiming to achieve. You wrote about the
implementation method that you have chosen, but did not say what you
want users to be able to do, or why. What are the user-oriented
goals? To help describe the goals, it might be helpful to include
some "use cases", i.e. realistic concrete examples (like transcripts)
that demonstrate the various ways in which the user can interact with
this feature.

[...]

> 'svn diff' between WC and pristine would convert the WC file up to
> the "enriched" level before feeding to the diff libraries (Not sure
> how this would be handled for external diff packages, it might have
> to save to a temp. file)

So 'svn diff' would display its output in UTF-8 regardless of the
encoding of the files. I can see how this could be useful for people
wanting a visual display of changes, especially when the diff
includes files with different encodings. Was that one of your goals?
However, people often want to use the output of "svn diff" as the
input to a standard "patch" program, and this would prevent that from
working.

There are already other ways in which diff output best suited for
viewing is not the best output for using with "patch", such as
whether to display a file-rename as an all-lines-deleted diff and an
all-lines-added diff, or just as a statement saying that the file was
renamed. Maybe we need to introduce a mode switch for "svn diff":
human-readable mode versus "patch" mode, or preferably "svn patch"
mode versus "conventional patch" mode.

> The server (RA level) would only see the UTF-8 versions and would
> not need any changes.

When the RA method uses HTTP, I imagine some people will want the
server to be able to serve the file to generic HTTP clients (web
browsers) in its native (non-UTF8) encoding.

- Julian

2006-03-28 23:45:46
Julian Foad <julianfoad btopenworld.com> writes:

> Jesper Steen Møller wrote:
>> I'm proposing to add functionality for "handling" encoding in the
>> text content which Subversion handles.
>
> This proposal looks generally quite promising, with the potential to
> introduce some useful and practical behaviours, but I'm not exactly
> sure what you are aiming to achieve. You wrote about the
> implementation method that you have chosen, but did not say what you
> want users to be able to do, or why. What are the user-oriented
> goals? To help describe the goals, it might be helpful to include
> some "use cases", i.e. realistic concrete examples (like transcripts)
> that demonstrate the various ways in which the user can interact with
> this feature.

For example, if someone were to use svn:encoding="iso-8859-1" to
produce a working file in iso-8859-1, the file in the working copy
would be exactly the same as it is today without your new feature.

The svn:encoding="native" would have some effect, but it's not clear
to me how useful it would be. You mentioned Java source; I don't know
a great deal about Java, but ISO C source code can also, in theory, be
written in any encoding. While such source can be converted from one
encoding to another automatically, it usually requires human review to
ensure that the meaning of the code is preserved.

--
Philip Martin

2006-03-29 05:52:28
Julian Foad wrote:
> Jesper Steen Møller wrote:
>
>> I'm proposing to add functionality for "handling" encoding in the
>> text content which Subversion handles.
>
> This proposal looks generally quite promising, with the potential to
> introduce some useful and practical behaviours, but I'm not exactly
> sure what you are aiming to achieve. You wrote about the
> implementation method that you have chosen, but did not say what you
> want users to be able to do, or why. What are the user-oriented
> goals? To help describe the goals, it might be helpful to include
> some "use cases", i.e. realistic concrete examples (like transcripts)
> that demonstrate the various ways in which the user can interact with
> this feature.

Sure enough. The case which made me think about this in the first
place is in fact a problem seen with CVS in the Eclipse WTP project,
where most developers were working with their Java source files on
Windows and some developers (and in the concrete example, a build
environment) were using some Unix/Linux with UTF-8 as the native
charset. The Java compiler expects to see the native encoding, and
the build failed. Many other applications (like GCC, etc.) expect
this behaviour. This is a situation that is not likely to go away
just yet.

While I was drafting a proposal for adding a property just for
svn:text-encoding-style = native (mimicking the EOL stuff), it
occurred to me that I was just dealing with a specialized case, and I
saw that people were also requesting text support for UTF-16 and
UTF-32, but that it was argued that Subversion basically dealt with
text as byte-oriented character data.

By allowing svn:text-encoding-style = native | <encoding-name>, this
would come almost for free, since current diff/merge functionality
would be pretty much retained (since we'd normalize to UTF-8 before
operating on the files).

> [...]
>
>> 'svn diff' between WC and pristine would convert the WC file up to
>> the "enriched" level before feeding to the diff libraries (Not sure
>> how this would be handled for external diff packages, it might have
>> to save to a temp. file)
>
> So 'svn diff' would display its output in UTF-8 regardless of the
> encoding of the files. I can see how this could be useful for people
> wanting a visual display of changes, especially when the diff
> includes files with different encodings. Was that one of your goals?
> However, people often want to use the output of "svn diff" as the
> input to a standard "patch" program, and this would prevent that
> from working.

It could encode back into the desired text format (on output), so
you'd have the same result as when diffing two WC versions of the
file. You'd get an ambiguity when diffing between a "managed" and
"unmanaged" encoding, though.

> There are already other ways in which diff output best suited for
> viewing is not the best output for using with "patch", such as
> whether to display a file-rename as an all-lines-deleted diff and an
> all-lines-added diff, or just as a statement saying that the file
> was renamed. Maybe we need to introduce a mode switch for "svn
> diff": human-readable mode versus "patch" mode, or preferably "svn
> patch" mode versus "conventional patch" mode.

Yes, that's one useful approach. I will have a look at the most
important corner cases.

>> The server (RA level) would only see the UTF-8 versions and would
>> not need any changes.
>
> When the RA method uses HTTP, I imagine some people will want the
> server to be able to serve the file to generic HTTP clients (web
> browsers) in its native (non-UTF8) encoding.

Yes, even that could be improved:

Today: Everything is just marked with some default (is it Apache's
default? I seem to get ISO-8859-1 for everything).

The proposal:

Simple solution:
  If svn:text-encoding is not set, send some default like today.
  If svn:text-encoding is set, set charset=UTF-8 and it will work
  (since that's how it's stored).

Advanced solution:
  If svn:text-encoding is not set, send some default like today.
  If svn:text-encoding is set:
  1. Obey the client's setting of Accept-Charset
  2. If svn:text-encoding is set to an encoding that the server
     supports, convert (if required) and send that encoding.
  3. If the native encoding is requested, allow the server to decide
     which that would be (I don't really see the sense in this).

-Jesper

2006-03-30 21:26:05
Hi Subversion-developers

Philip Martin wrote:

> The svn:encoding="native" would have some effect, but it's not clear
> to me how useful it would be. You mentioned Java source; I don't
> know a great deal about Java but ISO C source code can also, in
> theory, be written in any encoding. While such source can be
> converted from one encoding to another automatically it usually
> requires human review to ensure that the meaning of the code is
> preserved.

I'll add one additional pointer to a use case for this proposal
(particularly the native bit):

<https://bugs.eclipse.org/bugs/show_bug.cgi?id=133239>

The desire to move to Subversion has come up a number of times, it
seems.

-Jesper