ERights Home data / common-syntax 
Back to: Representing Characters On to: Lexing

The Wysiwyg-ASCII Format


This document specifies what it means for a sequence of ASCII characters to be in Wysiwyg-ASCII Format. This spec is used as a component of the E family Common Syntactic Elements spec and adopts its Conformance language.

Typical use: For E family languages, the check that text conforms to this Wysiwyg-ASCII Format is expected to occur after UTF-J4 Encoding (which produces ASCII containing Unicode escape sequences) and before UTF-J4 Decoding and newline canonicalization.

No Invisible Control Characters

[Src] Wysiwyg-ASCII text MAY contain the whitespace characters:

  • ' ' (space),
  • '\n' (linefeed or newline), and
  • '\r' (carriage return)

but MUST NOT contain other control codes (characters whose general category is "Cc"). In particular, Wysiwyg-ASCII text MUST NOT contain any '\t' (tab) characters.
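
As an illustration only, the rule above might be checked along the following lines in Python. The function name find_invisible_controls is hypothetical and not part of this spec; the sketch relies on the standard unicodedata module to look up each character's general category.

```python
import unicodedata

# The two control codes the rule explicitly permits. Space is also
# permitted, but its general category is "Zs", so it never matches "Cc".
ALLOWED_CONTROLS = {"\n", "\r"}


def find_invisible_controls(text):
    """Return (index, character) pairs for each disallowed control code.

    Hypothetical helper: a character is reported when its general
    category is "Cc" and it is neither linefeed nor carriage return.
    """
    return [
        (index, ch)
        for index, ch in enumerate(text)
        if unicodedata.category(ch) == "Cc" and ch not in ALLOWED_CONTROLS
    ]
```

An empty result means the text passes this particular check; a tab, for instance, is reported together with its position.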

Ephemeral Newline Canonicalization

In the typical use of this spec, carriage returns will disappear later during newline canonicalization. Therefore, unfortunately, we perform essentially[1] the same newline canonicalization calculation here, and throw away its results once conformance to Wysiwyg-ASCII is determined. The following steps occur after this ephemeral canonicalization.
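
This page does not itself define the canonicalization rule. As a rough sketch, assuming the conventional rule in which a carriage return followed by a linefeed, or a lone carriage return, each become a single linefeed, the ephemeral step could look like this in Python. The function name canonicalize_newlines is hypothetical.

```python
def canonicalize_newlines(text):
    """Return a copy of text with every line ending written as "\n".

    Assumed rule (not stated on this page): "\r\n" and a lone "\r"
    each become "\n". The pair is replaced first so that it yields one
    newline rather than two.
    """
    return text.replace("\r\n", "\n").replace("\r", "\n")
```

The result would be used only for the conformance checks that follow; the original text, carriage returns included, is what gets passed on.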

No Trailing Whitespace

[Src] Wysiwyg-ASCII text MUST NOT contain the sequence ' ' '\n' (space, immediately followed by newline).

[Src] Wysiwyg-ASCII text MUST end with a '\n'.

[Src] This last newline MUST NOT be immediately preceded by whitespace.
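
The three rules above can be sketched as a single Python check. This is one possible reading, not a normative implementation: the function name trailing_whitespace_problems and the message strings are hypothetical, and the input is assumed to have already been through ephemeral newline canonicalization, so that the only whitespace left is space and newline.

```python
def trailing_whitespace_problems(text):
    """Return a list of human-readable problems; empty means conformant.

    Assumes carriage returns have already been canonicalized away.
    """
    problems = []
    # Rule 1: no space immediately followed by a newline.
    if " \n" in text:
        problems.append("space immediately followed by newline")
    # Rule 2: the text must end with a newline.
    if not text.endswith("\n"):
        problems.append("text does not end with a newline")
    # Rule 3: the final newline must not be preceded by whitespace.
    # Read literally, this also rules out a trailing blank line.
    elif len(text) >= 2 and text[-2] in " \n":
        problems.append("final newline is preceded by whitespace")
    return problems
```

Under this reading, "a\nb\n" passes, while "a\n\n" fails the third rule because the final newline is preceded by another newline.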

Rationale: If you can't see it, you probably don't want it

For text that is known to pass these checks, when rendered in a fixed-width font in which each ASCII printing character is distinctly recognizable, a reviewer can know from the rendering precisely what the text contains. As a good litmus test, if we render the text in an OCRable font, an accurate OCR of the printed form should yield exactly the original Wysiwyg-ASCII text (after ephemeral newline canonicalization).


[1] It doesn't necessarily give the same results as the actual newline canonicalization, since the actual one is performed after UTF-J4 decoding, which can introduce, for example, new carriage return characters. [Src] Source text SHOULD NOT engage in this practice. Therefore, an advisor SHOULD issue an informative warning for all such cases. (This all lends further weight to the argument that newline canonicalization should happen between UTF-J4 encoding and decoding. Does Java really do it after decoding Unicode escapes?)

 
Unless stated otherwise, all text on this page which is either unattributed or by Mark S. Miller is hereby placed in the public domain.