
Thread: Proposal for supporting WC file content encoding




Proposal for supporting WC file content encoding
user name
2006-03-26 19:52:17
Dear Subversion-dev,

I'm proposing to add functionality for "handling" encoding in the text
content which Subversion handles.
I've read the discussion on UTF-16 support (as referenced in
http://subversion.tigris.org/issues/show_bug.cgi?id=2194), and I've been
lacking locale-aware content encoding myself (in fact, I assumed it was
present already...), and I think the feature can be implemented without
changing too much of the client code. I hope I'm not just stating the
obvious, but I think it could be done in these few steps:

1) Property support for specifying text encoding
2) ASCII-based encoding conversion support in the WC library (along with
   current EOL and keyword handling in 'subst')
3) Extend support to "non-ASCII-based" encodings (like UTF-16, EBCDIC)
4) Adding auto-property support for text encoding

Each step is independent of the next, so that we could stop at any
time; no "big bang" is necessary. The proposal is based on four main
principles:
 * It's free if you don't need it
 * Don't surprise the user
 * UTF-8 is the new ASCII ("1112060 code points should be enough for
   everybody")
 * Must be backwards compatible

The idea is to normalize all text which uses the feature to UTF-8 and
then only convert to/from specified encodings at the outermost level of
the WC handling, like this:

   WC file contents (user specified encoding or locale's)
         ||
  Enriched contents (UTF-8, w/keywords and/or EOL trans)
         ||
   Pristine contents (UTF-8, as stored in FS)
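
The outermost conversion layer in the diagram above can be sketched
roughly as follows. This is only an illustration under stated
assumptions: the function names wc_to_pristine and pristine_to_wc are
hypothetical (they are not Subversion APIs), and the keyword and EOL
translation of the "enriched" level is left out entirely.

```python
def wc_to_pristine(wc_bytes: bytes, wc_encoding: str) -> bytes:
    """Normalize working-copy bytes to the UTF-8 form stored in the FS.

    Hypothetical helper: decoding is strict, so bytes that are not
    valid in wc_encoding raise UnicodeDecodeError instead of being
    silently mangled.
    """
    text = wc_bytes.decode(wc_encoding)
    return text.encode("utf-8")


def pristine_to_wc(pristine_bytes: bytes, wc_encoding: str) -> bytes:
    """Convert stored UTF-8 back to the encoding used in the WC file.

    Raises UnicodeEncodeError if the text contains characters that
    wc_encoding cannot represent.
    """
    text = pristine_bytes.decode("utf-8")
    return text.encode(wc_encoding)
```

A file that round-trips through both functions should come back
byte-for-byte identical, which is what keeps the feature from
surprising the user.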

'svn diff' between WC and pristine would convert the WC file up to the
"enriched" level before feeding it to the diff libraries. (I'm not sure
how this would be handled for external diff packages; it might have to
save to a temp. file.)

The server (RA level) would only see the UTF-8 versions and would not
need any changes. The client would detect the encoding by looking at the
property, and act accordingly. Old clients would not know this and would
only see a UTF-8 file.
Further details about the four steps:

Ad 1) Property support for specifying text encoding:

I propose that we introduce a new property for text files called
svn:text-encoding (or maybe svn:text-encoding-style, or perhaps just
svn:encoding?).
This can take three kinds of values:
 - The name of a specific encoding, like ISO-8859-1 or UTF-8
 - The special value 'native'
 - Empty or missing (the default)

The idea is that IF svn:text-encoding is specified, then the WC library
and the clients in general are responsible for converting to and from
the specified format (with 'native' being the system's default
encoding), and that the RA level only ever sees UTF-8 for these text
resources. The encoding is said to be "managed".
This follows the style of svn:eol-style and needs support in roughly the
same places.
The 'native' mode is interesting for the case where text files (like
Java source files) do not carry the encoding with them (as e.g. XML
does).
If the text-encoding is not set, then the encoding is "unmanaged" in
that it works like it does today.
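
As a rough illustration of the three kinds of values, a client could
map the property to an effective encoding along these lines. The
function name effective_encoding is hypothetical, and using Python's
codec registry to validate encoding names is an assumption of this
sketch, not something the proposal specifies.

```python
import codecs
import locale
from typing import Optional


def effective_encoding(prop_value: Optional[str]) -> Optional[str]:
    """Map a (proposed) svn:text-encoding value to a codec name.

    Returns None when the encoding is "unmanaged" (property empty or
    missing), otherwise a normalized codec name. Unknown encoding
    names raise LookupError.
    """
    if prop_value is None or prop_value.strip() == "":
        return None  # unmanaged: behave exactly as today
    value = prop_value.strip()
    if value == "native":
        # The system's default encoding, taken from the locale.
        value = locale.getpreferredencoding(False)
    return codecs.lookup(value).name
```

With this shape, a missing property costs a single comparison, in line
with the "it's free if you don't need it" principle.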

Ad 2) ASCII-based encoding conversion support in the WC library:

The first step in supporting this would be to add the support to the
WC and client libraries. For 8-bit (ASCII-based) encodings, the basic
support doesn't touch the diff support, which at this point already
makes some assumptions about the encoding, as far as I can tell.
I think the "streamy" API in svn_subst.c can be layered with the
encoding support. Also, diff output should reflect these encoding
changes, to show "encoding only" changes:

Index: cool-stuff/todo.txt
===================================================================
--- cool-stuff/todo.txt (revision 42, ISO-8859-1)
+++ cool-stuff/todo.txt (working copy, UTF-8)
svn:text-encoding = UTF-8

Property changes on: cool-stuff/todo.txt
___________________________________________________________________
Name: svn:text-encoding
   - ISO-8859-1
   + UTF-8

There are some edge cases to be considered when the text-encoding
changes from "unmanaged" to "managed" (or back), where the diff engine
would pick up all kinds of "bogus" text changes. This may need special
attention.
Another edge case: Some commit logic should be present to check that a
"managed" file being checked in is in fact valid in the said encoding
(so that careless handling of file encoding won't inadvertently break
the repository data).
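
The commit-time check could look something like the sketch below. The
name check_managed_content is hypothetical. Note that such a check can
only catch some mistakes: every byte sequence is valid ISO-8859-1, for
instance, so a mislabeled file in that encoding would pass.

```python
from typing import Optional


def check_managed_content(wc_bytes: bytes, declared_encoding: str) -> Optional[str]:
    """Return None if wc_bytes is valid in declared_encoding,
    otherwise a short description of the first problem found.

    Hypothetical pre-commit check; it only verifies that the bytes
    decode strictly, not that the text is what the author intended.
    """
    try:
        wc_bytes.decode(declared_encoding)
    except UnicodeDecodeError as err:
        return "invalid %s data at byte offset %d: %s" % (
            declared_encoding, err.start, err.reason)
    return None
```

A client would refuse the commit (or at least warn) whenever the
function returns a message.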

Ad 3) Extend this support to "non-ASCII-based" encodings (like UTF-16,
EBCDIC):

Actually this may not be a big issue at all, if the conversions are
added at the right level, since the main diffing engine would always
work on UTF-8 (in fact, it would always work on 8-bit oriented streams
separated by LFs, just like it does now).
The only change I can think of right now is the fixed-width keyword
substitution, which today works on bytes, but could work fine on
characters if the knowledge was there.
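
To make the bytes-versus-characters distinction concrete, here is a
small sketch of fitting a value into a fixed-width field both ways.
Both helper names are hypothetical, and the plain space padding is a
simplification for illustration; it does not reproduce Subversion's
actual keyword syntax.

```python
def fit_chars(value: str, width: int) -> str:
    """Pad or truncate so the result is exactly `width` characters."""
    return value[:width].ljust(width)


def fit_utf8_bytes(value: str, width: int) -> bytes:
    """Pad or truncate so the result is exactly `width` bytes of
    UTF-8, without ever cutting a multi-byte character in half."""
    out = b""
    for ch in value:
        encoded = ch.encode("utf-8")
        if len(out) + len(encoded) > width:
            break  # the next character would not fit completely
        out += encoded
    return out.ljust(width, b" ")
```

For a name like "Møller", the character-based field holds six
characters while the byte-based field needs seven bytes of UTF-8, so
the two approaches pad and truncate at different points.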

Ad 4) Adding auto-property support for text encoding:

Plenty of options exist: BOM detection, detection of UTF-8
leading/trailing bytes, checking for XML declarations, etc. There should
also be a configuration setting for preferring the native encoding
over the detected one (if the detector sees a file encoded with the
encoding which is also the current native encoding).
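
A minimal sketch of such a detector, combining the three heuristics
mentioned above, might look like this. The function name
guess_encoding and the order of the checks are assumptions made for
illustration; a real detector would need more care (for example with
XML declarations in non-ASCII-compatible encodings).

```python
import codecs
import re
from typing import Optional

# UTF-32 BOMs must be tested before UTF-16, because the UTF-32 LE BOM
# begins with the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

_XML_DECL = re.compile(
    rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']')


def guess_encoding(data: bytes) -> Optional[str]:
    """Guess an encoding for auto-props, or return None for 'no opinion'."""
    # 1. Byte order mark.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # 2. XML declaration at the very start of the file.
    match = _XML_DECL.match(data[:256])
    if match:
        return match.group(1).decode("ascii")
    # 3. Non-ASCII bytes that form valid UTF-8 sequences.
    if any(byte >= 0x80 for byte in data):
        try:
            data.decode("utf-8")
            return "utf-8"
        except UnicodeDecodeError:
            return None
    return None  # pure ASCII: nothing to detect
```

Returning None for pure-ASCII or undecidable input leaves room for the
configuration setting that prefers the native encoding.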

How does this sound?

-Jesper

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribesubversion.tigris.org
For additional commands, e-mail: dev-helpsubversion.tigris.org

Proposal for supporting WC file content encoding
user name
2006-03-28 13:16:05
Jesper Steen Møller wrote:
> 
> I'm proposing to add functionality for "handling" encoding in the text
> content which Subversion handles.

This proposal looks generally quite promising, with the potential to
introduce some useful and practical behaviours, but I'm not exactly sure
what you are aiming to achieve.  You wrote about the implementation
method that you have chosen, but did not say what you want users to be
able to do, or why.  What are the user-oriented goals?  To help describe
the goals, it might be helpful to include some "use cases", i.e.
realistic concrete examples (like transcripts) that demonstrate the
various ways in which the user can interact with this feature.

[...]
> 'svn diff' between WC and pristine would convert the WC file up to the
> "enriched" level before feeding to the diff libraries (Not sure how
> this would be handled for external diff packages, it might have to
> save to a temp. file)

So 'svn diff' would display its output in UTF-8 regardless of the
encoding of the files.  I can see how this could be useful for people
wanting a visual display of changes, especially when the diff includes
files with different encodings.  Was that one of your goals?  However,
people often want to use the output of "svn diff" as the input to a
standard "patch" program, and this would prevent that from working.

There are already other ways in which diff output best suited for
viewing is not the best output for using with "patch", such as whether
to display a file-rename as an all-lines-deleted diff and an
all-lines-added diff, or just as a statement saying that the file was
renamed.  Maybe we need to introduce a mode switch for "svn diff":
human-readable mode versus "patch" mode, or preferably "svn patch" mode
versus "conventional patch" mode.


> The server (RA level) would only see the UTF-8 versions and would not
> need any changes.

When the RA method uses HTTP, I imagine some people will want the server
to be able to serve the file to generic HTTP clients (web browsers) in
its native (non-UTF8) encoding.


- Julian


Proposal for supporting WC file content encoding
user name
2006-03-28 23:45:46
Julian Foad <julianfoadbtopenworld.com> writes:

> Jesper Steen Møller wrote:
>> I'm proposing to add functionality for "handling" encoding in the
>> text content which Subversion handles.
>
> This proposal looks generally quite promising, with the potential to
> introduce some useful and practical behaviours, but I'm not exactly
> sure what you are aiming to achieve.  You wrote about the
> implementation method that you have chosen, but did not say what you
> want users to be able to do, or why.  What are the user-oriented
> goals?  To help describe the goals, it might be helpful to include
> some "use cases", i.e. realistic concrete examples (like transcripts)
> that demonstrate the various ways in which the user can interact with
> this feature.

For example if someone were to use svn:encoding="iso-8859-1" to
produce a working file in iso-8859-1 the file in the working copy
would be exactly the same as it is today without your new feature.

The svn:encoding="native" would have some effect, but it's not clear
to me how useful it would be.  You mentioned Java source; I don't know
a great deal about Java but ISO C source code can also, in theory, be
written in any encoding.  While such source can be converted from one
encoding to another automatically it usually requires human review to
ensure that the meaning of the code is preserved.

-- 
Philip Martin


Proposal for supporting WC file content encoding
user name
2006-03-29 05:52:28
Julian Foad wrote:

> Jesper Steen Møller wrote:
>
>> I'm proposing to add functionality for "handling" encoding in the
>> text content which Subversion handles.
>
> This proposal looks generally quite promising, with the potential to
> introduce some useful and practical behaviours, but I'm not exactly
> sure what you are aiming to achieve.  You wrote about the
> implementation method that you have chosen, but did not say what you
> want users to be able to do, or why.  What are the user-oriented
> goals?  To help describe the goals, it might be helpful to include
> some "use cases", i.e. realistic concrete examples (like transcripts)
> that demonstrate the various ways in which the user can interact with
> this feature.

Sure enough. The case which made me think about this in the first place
is in fact a problem seen with CVS in the Eclipse WTP project, where
most developers were working with their Java source files on Windows and
some developers (and in the concrete example, a build environment) were
using some Unix/Linux with UTF-8 as the native charset. The Java
compiler expects to see the native encoding, and the build failed. Many
other applications (like GCC, etc) expect this behaviour.

This is a situation that is not likely to go away just yet.

While I was drafting a proposal for adding a property just for
svn:text-encoding-style = native (mimicking the EOL stuff), it occurred
to me that I was just dealing with a specialized case, and I saw that
people were also requesting text support for UTF-16 and UTF-32, but that
it was argued that Subversion basically dealt with text as byte-oriented
character data.

By allowing svn:text-encoding-style = native | <encoding-name> this
would come, almost for free, since current diff/merge functionality
would be pretty much retained (since we'd normalize to UTF-8 before
operating on the files).

> [...]
>
>> 'svn diff' between WC and pristine would convert the WC file up to the
>> "enriched" level before feeding to the diff libraries (Not sure how
>> this would be handled for external diff packages, it might have to
>> save to a temp. file)
>
> So 'svn diff' would display its output in UTF-8 regardless of the
> encoding of the files.  I can see how this could be useful for people
> wanting a visual display of changes, especially when the diff includes
> files with different encodings.  Was that one of your goals?  However,
> people often want to use the output of "svn diff" as the input to a
> standard "patch" program, and this would prevent that from working.

It could encode back into the desired text format (on output), so you'd
have the same result as when diffing two WC versions of the file.
You'd get an ambiguity when diffing between a "managed" and "unmanaged"
encoding, though.

> There are already other ways in which diff output best suited for
> viewing is not the best output for using with "patch", such as whether
> to display a file-rename as an all-lines-deleted diff and an
> all-lines-added diff, or just as a statement saying that the file was
> renamed.  Maybe we need to introduce a mode switch for "svn diff":
> human-readable mode versus "patch" mode, or preferably "svn patch"
> mode versus "conventional patch" mode.

Yes, that's one useful approach. I will have a look at the most
important corner cases.

>> The server (RA level) would only see the UTF-8 versions and would not
>> need any changes.
>
> When the RA method uses HTTP, I imagine some people will want the
> server to be able to serve the file to generic HTTP clients (web
> browsers) in its native (non-UTF8) encoding.

Yes, even that could be improved:
Today: Everything is just marked with some default (is it Apache's
default? I seem to get ISO-8859-1 for everything).
The proposal:

Simple solution:
If svn:text-encoding is not set, send some default like today.
If svn:text-encoding is set, set charset=UTF-8 and it will work (since
that's how it's stored).

Advanced solution:
If svn:text-encoding is not set, send some default like today.
If svn:text-encoding is set:
1. Obey the client's setting of Accept-Charset
2. If svn:text-encoding is set to an encoding that the server supports,
   convert (if required) and send that encoding.
3. If the native encoding is requested, allow the server to decide which
   that would be (I don't really see the sense in this).
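
Step 1 of the advanced solution could be sketched as below. Both
function names are hypothetical, and the handling of "*" is simplified
(it just picks the server's first candidate), so this is an
illustration of the negotiation idea rather than a complete
implementation of HTTP's rules.

```python
from typing import List, Optional, Tuple


def parse_accept_charset(header: str) -> List[Tuple[str, float]]:
    """Parse an Accept-Charset value into (charset, q) pairs,
    highest preference first; header order breaks ties."""
    prefs = []
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        name, _, params = item.partition(";")
        quality = 1.0  # default when no q parameter is given
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0  # treat a malformed q as "not acceptable"
        prefs.append((name.strip().lower(), quality))
    return sorted(prefs, key=lambda pref: -pref[1])  # sorted() is stable


def choose_charset(header: str, candidates: List[str]) -> Optional[str]:
    """Pick the charset to send from `candidates` (server's preferred
    order), or None if the client accepts none of them."""
    for name, quality in parse_accept_charset(header):
        if quality <= 0:
            continue
        if name == "*":
            return candidates[0]
        for candidate in candidates:
            if candidate.lower() == name:
                return candidate
    return None
```

Here `candidates` would be the value of svn:text-encoding followed by
UTF-8, so that the stored form is always available as a fallback.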

-Jesper


Proposal for supporting WC file content encoding
user name
2006-03-30 21:26:05
Hi Subversion-developers

Philip Martin wrote:

> The svn:encoding="native" would have some effect, but it's not clear
> to me how useful it would be.  You mentioned Java source; I don't know
> a great deal about Java but ISO C source code can also, in theory, be
> written in any encoding.  While such source can be converted from one
> encoding to another automatically it usually requires human review to
> ensure that the meaning of the code is preserved.
>
I'll add one additional pointer to a use case for this proposal
(particularly the native bit):

<https://bugs.eclipse.org/bugs/show_bug.cgi?id=133239>


The desire to move to Subversion has come up a number of times, it seems.

-Jesper

