This page can be safely skipped by
readers concerned only with ASCII source texts.
Background
Because, at the time of this writing, we do not yet have any personal
experience with non-ASCII Unicode characters, we considered specifying
that source text for E 0.9 be restricted to ASCII, so that all
non-ASCII source characters could be expressed only using Backslash-u
Decoding. However, this would be too great a burden on
non-English-based programmers wishing to use E. Instead, we obtain a
very similar effect indirectly.
The following text from the Java
Language Specification (the JLS) effectively defines a character
encoding form for Unicode:
The Java programming language specifies a standard way
of transforming a program written in Unicode into ASCII that changes
a program into a form that can be processed by ASCII-based tools. The
transformation involves converting any Unicode escapes in the source
text of the program to ASCII by adding an extra u -- for example, \uxxxx
becomes \uuxxxx -- while simultaneously converting non-ASCII characters
in the source text to a \uxxxx escape containing a single u.
The JLS effectively defines a Unicode escape as
'\\' 'u'+ <hexDigit> <hexDigit> <hexDigit> <hexDigit>
where the total number of backslashes (if any) immediately preceding
this sequence is even.
[Spec] The following bug MUST
be fixed: what if a non-ASCII character occurs immediately after an
odd number of backslashes? The above encoding will produce a Unicode
escape sequence immediately following that odd number of backslashes,
so the result will no longer be recognized as an actual Unicode escape.
Is this also a bug in the JLS?
Written out at one byte per resulting ASCII character, this
encoding form also defines a character
encoding scheme. We call this encoding form/scheme UTF-J2,
since the Unicode escape defined above can only represent a 16-bit
(2-byte) code point. The same section of the JLS also defines two ways of
decoding such text back into a sequence of 16-bit code points. The first
reverses the above encoding with no loss of information:
The exact Unicode source can later be restored from this ASCII form
by converting each escape sequence where multiple u's are present to
a sequence of Unicode characters with one fewer u, while simultaneously
converting each escape sequence with a single u to the corresponding
single Unicode character.
The other decoding method simply decodes each Unicode escape into the
Unicode code point it encodes. The first decoding method would be used
to preserve the appearance of the source for those using Unicode editors
and mixing Unicode characters with Unicode escape sequences. We call this
first decoding method a UTF-J2 presentational decode, and consider
it no further. The second, which we call simply a UTF-J2 decode,
would be used prior to all other forms of further processing.
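To make the transformation concrete, here is a minimal Python sketch
(utf_j2_encode is our own hypothetical name, not part of any E
implementation, and the sketch is BMP-only, matching UTF-J2's 16-bit
limit). It adds a 'u' to each existing Unicode escape and turns each
non-ASCII character into a single-u escape. Note that it faithfully
reproduces the [Spec] bug flagged above: a non-ASCII character
immediately following an odd number of backslashes yields an escape
that a decoder will not recognize.

    def utf_j2_encode(text: str) -> str:
        # Sketch of the JLS transformation described above.
        out = []
        i, n = 0, len(text)
        while i < n:
            ch = text[i]
            if ch == '\\':
                # Copy the whole run of backslashes.
                j = i
                while j < n and text[j] == '\\':
                    j += 1
                run = j - i
                out.append('\\' * run)
                # The final backslash begins a Unicode escape only when
                # preceded by an even number of backslashes (run is odd).
                if run % 2 == 1 and j < n and text[j] == 'u':
                    out.append('u')  # \u... becomes \uu...
                i = j
            elif ord(ch) < 0x80:
                out.append(ch)
                i += 1
            else:
                # If this character follows an odd run of backslashes,
                # the escape emitted here is exactly the bug noted above.
                out.append('\\u%04x' % ord(ch))
                i += 1
        return ''.join(out)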
UTF-J4
To handle any Unicode character, we extend the above scheme by defining
a Unicode escape to be a sequence of characters accepted either by the
above pattern, or:
'\\' 'u'+ '{' '0' 'x' <hexDigit>+ '}'
We call this extended encoding scheme UTF-J4. A UTF-J4 encode,
when generating a Unicode escape for a non-ASCII code point, SHOULD always
use the first form for 16-bit code points, and SHOULD always use the
shortest encoding in the second form for supplementary characters.
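For concreteness, here is a Python sketch of a UTF-J4 decode (the
helper name utf_j4_decode and the regex are ours, not from any
published implementation). It accepts both escape forms, honors the
even-backslash rule, and, per the JLS, lets any number of u's denote
the same code point:

    import re

    # Tail of a UTF-J4 escape after its backslash: one or more u's,
    # then either exactly four hex digits or '{0x' <hexDigit>+ '}'.
    _TAIL = re.compile(r'u+(?:([0-9a-fA-F]{4})|\{0x([0-9a-fA-F]+)\})')

    def utf_j4_decode(text: str) -> str:
        out = []
        i, n = 0, len(text)
        while i < n:
            if text[i] != '\\':
                out.append(text[i])
                i += 1
                continue
            # Count the run of backslashes starting here.
            j = i
            while j < n and text[j] == '\\':
                j += 1
            run = j - i
            m = _TAIL.match(text, j)
            # The final backslash starts an escape only when preceded by
            # an even number of backslashes, i.e. when the run is odd.
            if m and run % 2 == 1:
                out.append('\\' * (run - 1))  # even prefix passes through
                out.append(chr(int(m.group(1) or m.group(2), 16)))
                i = m.end()
            else:
                out.append(text[i:j])
                i = j
        return ''.join(out)

For example, utf_j4_decode(r'\uu0041') yields 'A', while
utf_j4_decode(r'\\u0041') leaves its input unchanged, since the
escape's backslash is itself escaped.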
Rationale: Pleasing Regularities
In the second form of Unicode escape, we include the '0' 'x'
prefix so the string between the curlies will appear to be a numeric
literal. This leaves us open to eventually allowing, for example,
a character name to appear between the curlies instead of a hex code
point.
For purposes of specification, we assume the following functions:
- utfJ4Encode(CodePoint[]) -> AsciiByte[]
- utfJ4Decode(CodePoint[]) -> CodePoint[]
- utf8Decode(UTF8Byte[]) -> CodePoint[]
[Src] The octet sequence input to utf8Decode MAY
begin with the UTF-8 BOM sequence 0xEF 0xBB 0xBF, which
utf8Decode MUST skip.
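A minimal sketch of this rule, using our own helper name:

    def utf8_decode(octets: bytes) -> str:
        # Skip the optional UTF-8 BOM, then decode as standard UTF-8.
        if octets[:3] == b'\xef\xbb\xbf':
            octets = octets[3:]
        return octets.decode('utf-8')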
Since ASCII is the 7-bit byte subset of both UTF-8 code units and
Unicode code points, we consider AsciiByte[] to be a subtype
of both CodePoint[] and UTF8Byte[].
For all sequences of Unicode code points u:
utfJ4Decode(u) == utfJ4Decode(utfJ4Encode(u)) == ... # and so
on, for any number of UTF-J4 encodings prior to the UTF-J4 decoding.
Therefore, given that we're going to do a utfJ4Decode prior
to further processing, we don't care whether our input is the true
source, or is a UTF-J4 encoding of the source. (If we change the spec
below to track source positions on one of the representations prior
to the utfJ4Decode, then these alternatives would no longer
be strictly equivalent, so under some circumstances we would care.)
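With the hypothetical sketches above (utf_j2_encode serves as the
UTF-J4 encode for BMP-only text), the property can be spot-checked:

    s = 'caf\u00e9 \\u0041'  # 'café' followed by a literal \u0041 escape
    assert utf_j4_decode(utf_j2_encode(s)) == utf_j4_decode(s) == 'café A'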
From Bytes (Octets) to Source Text
The double decode utfJ4Decode(utf8Decode(f)) yields the same result as
utfJ4Decode(utfJ4Encode(utf8Decode(f))).
If f is in ASCII, then utfJ4Decode(f) also yields the
same result.
- [Src] When f is a sequence of octets to be decoded into
source, utfJ4Encode(utf8Decode(f)) SHOULD be in Wysiwyg-ASCII
Format.
- [Src] When a source language's grammar uses matched brackets to indicate
nesting structure, source text in this language SHOULD use spaces
for indentation to signal this nesting structure accurately to the
human eye. Further, source text SHOULD NOT include any tab characters
at all.
- When rendering text in a fixed-width font, tab characters SHOULD
be rendered as whitespace extending to the next modulo-8 tab stop.
[Advisor] An advisor therefore SHOULD alert reviewers to violations
of the above Src RECOMMENDATIONS.
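For instance, an advisor might locate tab characters along these lines
(a sketch with our own naming; line and column numbering follow the
convention given below: first line 1, first column 0):

    def tab_violations(source: str) -> list:
        # Report the (line, column) of every tab character, so that
        # reviewers can be alerted per the Src recommendations above.
        hits = []
        for lineno, line in enumerate(source.split('\n'), start=1):
            col = line.find('\t')
            while col != -1:
                hits.append((lineno, col))
                col = line.find('\t', col + 1)
        return hits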
Depending on the density of Unicode escape sequences, the UTF-J4
encoding of the source may or may not be adequately readable for a
review. If this format is adequately readable, reviewers are advised
to look at a rendering of this encoding in a font in which ASCII printing
characters may be easily distinguished. For example, the following
are distinct ASCII printing characters, and should each be unambiguously
recognizable:
1l|!oO0`'
If Raven the reviewer is looking at a readable UTF-J4 encoding of
conforming sources in Wysiwyg-ASCII format, in a font in which all
ASCII printing characters are unambiguously recognizable, then Raven
has grounds for some confidence that the appearance of the text encodes
all the meaning of the text as it will be interpreted by a conforming
language processor. Of course, Arthur the author can still write code
that will confuse Raven the reviewer. But we hope we've made it hard
for Arthur to also confuse Raven about whether she's confused. If
Raven knows she's confused, she can simply reject Arthur's code.
Newline Canonicalization
Once we have source text that passes the above checks, the
following transformations are applied, logically in order, to create
the source text used for lexical analysis:
- MS-DOS Newline Canonicalization. All occurrences of the sequence
'\r' '\n' (CRLF) are replaced with '\n' (LF).
- Mac OS <= 9 Newline Canonicalization. All remaining occurrences of
'\r' (CR) are replaced with '\n' (LF).
- Line and Column Numbering. Line and column numbers designate
positions in the source text after the above steps. The first line is
line number 1. The first column is column number 0.
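The order matters: CRLF pairs must be collapsed before lone CRs are
rewritten, or each CRLF would turn into two LFs. A one-line sketch
(hypothetical helper name):

    def canonicalize_newlines(text: str) -> str:
        # CRLF first, then any remaining lone CR.
        return text.replace('\r\n', '\n').replace('\r', '\n')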
(The JLS also says that newline canonicalization
happens after interpreting Unicode escapes. Is this really true? It seems
silly, but I'd rather follow Java's lead on this than try reversing
the order. What does Java do about source positions? Does it say anywhere?)
Only BMP Characters
- [Src] Following the above double decode, the source text MUST consist
only of a sequence of Unicode encoded characters.
- [Src] As of E 0.9, source text MUST contain only BMP characters,
i.e., only those Unicode encoded characters whose code points fit
within 16 bits. (From this, it would seem that UCS-2 characters might
be what I mean, but I'm not sure.)
(Is this too strict? Should we say instead only that source text MUST
contain only 16-bit code points and MUST NOT contain surrogate code points?
Should we demote the other RULES to RECOMMENDATIONS? That would seem to
be the minimal restriction needed to satisfy the following issue.)
Rationale: Indecision is the mother of convention
Unicode has had a complex but understandable history.
As of the Unicode 3.0 standard or so, it was thought that Unicode
could fit all the world's characters into a 16-bit character set.
Based on this, the Java and Python languages defined a "char"
as 16 bits. Java provided good support for handling Unicode, and became
a leading platform for developing Unicode-ready software. Unfortunately,
the Unicode consortium found that 16 bits was too tight, and expanded
Unicode into a 21-bit character set. It was then unclear what to do
about legacy formerly-Unicode-ready libraries. The litmus test is
indexing: How does one interpret a source position? What is a counting
unit for determining the length of a string? Currently, the dominant
approaches are:
Java further defines and uses "Modified
UTF-8" rather than standard UTF-8. In Java's modified
UTF-8, a supplementary characters is represented by UTF-8 encoding
each of the surrogate code points in the UTF-16 encoding of
the character. This is explicitly forbidden by the Unicode
spec (D36):
Because surrogate code points are not Unicode scalar
values, any UTF-8 byte sequence that would otherwise map to
code points D800..DFFF is ill-formed.
We thanks David Hopwood for pointing
this out. |
-
The XPath and
Python way (see PEP
0263, PEP
261): A counting unit is a Unicode encoded character.
-
The DOM
and Java
1.5 way: A counting unit is a UTF-16 code unit. A Java char
no longer represents a character -- it represents a UTF-16 code
unit.
-
IBM's ICU library supports
both, although it's heavily biased towards the Java way.
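To make the indexing litmus test concrete, here is a small Python
illustration: Python's len counts encoded characters, while dividing
the UTF-16 byte length by two counts UTF-16 code units, as Java's
String.length() does.

    s = '\U0001D11E'  # MUSICAL SYMBOL G CLEF, a supplementary character
    print(len(s))                            # 1: one encoded character
    print(len(s.encode('utf-16-be')) // 2)   # 2: two UTF-16 code units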
Although the XPath and Python approach is clearly more
right (and is recommended by CharMod), we wish to postpone choosing
sides until it's clear who the winner will be. Therefore:
- [Spec] The E 0.9 specs must be downward compatible from any of the
above choices.
- [Producer][Validator][Advisor] Until a decision is made, programs
written to handle text SHOULD be compatible with any of these
choices being made in the future.
The E 0.9 requirement that source text MUST contain only BMP
characters implies that it MUST NOT contain any
- supplementary characters -- characters whose code points do not fit
within 16 bits, i.e., are in the range 0x1_0000..0x10_FFFF.
- surrogate code points -- code points in the range 0xD800 through
0xDFFF. The general category of these is "Cs".
- undesignated code points -- also called reserved or unassigned code
points. These are either noncharacters, or code points whose
interpretation is not yet specified as of that version of Unicode. The
general category of these is "Cn".
- private-use code points -- those whose interpretation will not be
specified by the Unicode consortium. The general category of these is
"Co".
A validator MUST therefore statically reject source text containing
code points that are not encodings of BMP characters.
? pragma.syntax("0.8")
? def makeChar := <import:java.lang.makeCharacter>
? def isBMPChar(codePoint :(0..0x10_FFFF)) :boolean {
>
>
> if (codePoint > 0xFFFF) {
>
>
> return false
> }
>
> def ch := codePoint.asChar()
> def cat := ch.getCategory()
>
>
>
> > return !(["Cs", "Cn", "Co"].contains(cat))
> }
# value: <isBMPChar>
Source Text SHOULD be in NFC
[Src] Source text SHOULD conform to CharMod
and CharNorm. In particular,
it SHOULD be in Unicode Normalized Form C (NFC), and SHOULD NOT
contain Characters not Suitable for use With Markup.
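As an illustration, a validator or advisor could check the NFC
recommendation with Python's standard unicodedata module (the helper
name is ours):

    import unicodedata

    def is_nfc(text: str) -> bool:
        # Text is in NFC iff normalizing to NFC leaves it unchanged.
        return unicodedata.normalize('NFC', text) == text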
(Should we further recommend that source text be include-normalized
or fully-normalized? What would these mean in this context?)
Rationale: Caught in the Web
E is a distributed programming language. E code is often mobile code.
Therefore, it can be considered a kind of web content, even though it
is not a kind of markup. For possible ease of integration with other
tools, and to reduce the number of cases such tools must handle, it
would be good to stay within the W3C's character model.