Text column support for UTF-8, UTF-16 and other multi-byte text encodings (variable width encodings)

Wishlists for new functionality and features.
Maël
Site Admin
Posts: 1454
Joined: 12 Mar 2005 14:15

Re: Text column support for UTF-8, UTF-16 and other multi-byte text encodings (variable width encodings)

Post by Maël »

Charmap or simple style rendering

For the charmap or simple style rendering, one solution is to call ScriptShape with psa->s.fDisplayZWG = True. This reliably tells us whether glyphs are missing for a codepoint, and it also handles precomposed characters that might be rendered with several glyphs. There is then no need to look up the cmap table: since we draw each codepoint individually anyway, we can simply filter out control and other non-printable codepoints.

Complex shaping and bidi rendering should not cause problems either, since, again, we handle each codepoint individually and therefore call ScriptShape once per codepoint. So no interactions between codepoints can occur.
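To handle each codepoint individually in a UTF-16 text column, the code units first have to be grouped into codepoints. A minimal sketch in Python, assuming the input is a sequence of 16-bit code units; the function name iter_codepoints is hypothetical, and the actual per-codepoint ScriptShape call is not shown:

```python
from typing import Iterator, Sequence, Tuple


def iter_codepoints(units: Sequence[int]) -> Iterator[Tuple[int, int]]:
    """Yield (codepoint, number_of_code_units) for UTF-16 code units.

    Unpaired surrogates are yielded unchanged (as one unit), so the caller
    can still draw them individually, e.g. as control characters.
    """
    i = 0
    n = len(units)
    while i < n:
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        if is_high and i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
            low = units[i + 1]
            # Combine a valid surrogate pair into one supplementary codepoint.
            yield 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00), 2
            i += 2
        else:
            yield unit, 1
            i += 1
```

Each yielded codepoint would then be shaped and drawn on its own, and the unit count tells how many bytes of the hex column it covers (two bytes per unit).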

We might, however, not get positional forms of glyphs, which could look odd in Arabic. But positional forms without bidi rendering would give odd results anyway.

Additionally, we should use GetGlyphOutline to get the real metrics of a glyph, such as the combining diaeresis U+0308 in Calibri. ScriptShape will return dimensions that are slightly off. We would also need to correct for any negative offsets. Refer to thoughts and exchange with the author of BabelMap.

Finally, since we can get more than one glyph for a codepoint (as mentioned above for precomposed characters like ë), using GetGlyphOutline will not work in that case. It would seem a fairly safe assumption that combining marks are always single glyphs -- but then again, I just found a font (estre.ttf -- Estrangelo Edessa Standard) that decomposes such a combining mark (one dot above and one dot below the base character) into two glyphs.
https://docs.microsoft.com/en-us/typogr ... es_ae#ccmp
In Syriac, the character 0x0732 is a combining mark that has a dot above AND a dot below the base character. To avoid multiple glyph variants to fit all base glyphs, the character is decomposed into two glyphs...a dot above and a dot below. These two glyphs can then be correctly placed using GPOS.
Inspecting the GSUB table in this font, there is a rule that replaces 0x0732 with 0x073C and 0x073F (the first entry in the 'GSUB' lookup list).

So the better solution would be to check whether the first glyph (glyphs are always given in visual order, i.e., left to right) has a negative offset, and then compensate for this offset as usual, using the formula inspired by BabelMap. The following glyphs would not need compensation, since the first glyph would now (after the correction) have an advance width larger than zero and essentially work like a base glyph.

Glyph reordering should also not be a problem, since this would already have been done by ScriptShape for the glyph array it returns (i.e., glyphs would be in visual order, i.e., ltr).

Complex rendering

A possible solution is to implement code that manually looks up a code point in the cmap table of a font (selected into a device context), with the table data retrieved via GetFontData. To avoid problems with unusual encodings, such as those of Symbol fonts or Macintosh, uncommon Arabic encodings, etc. (as mentioned in the OpenType documentation), do this only for Unicode characters outside the BMP, and rely on ScriptGetCMap otherwise.
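As an illustration of the manual lookup for codepoints outside the BMP, here is a minimal sketch in Python. It assumes the raw bytes of a format 12 cmap subtable (segmented coverage, the format commonly used for supplementary-plane characters) have already been extracted, e.g. from the data GetFontData returns. Locating the subtable via the cmap header and its encoding records is not shown, and the function name lookup_format12 is hypothetical:

```python
import struct
from typing import Optional


def lookup_format12(subtable: bytes, codepoint: int) -> Optional[int]:
    """Return the glyph index for codepoint, or None if no group covers it.

    Layout (big-endian): uint16 format, uint16 reserved, uint32 length,
    uint32 language, uint32 numGroups, then numGroups records of
    (uint32 startCharCode, uint32 endCharCode, uint32 startGlyphID).
    """
    fmt, _reserved, _length, _language, num_groups = struct.unpack_from(
        ">HHIII", subtable, 0
    )
    if fmt != 12:
        raise ValueError("not a format 12 cmap subtable")

    # Groups are sorted by startCharCode, so a binary search is possible.
    lo, hi = 0, num_groups - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        start, end, start_glyph = struct.unpack_from(">III", subtable, 16 + 12 * mid)
        if codepoint < start:
            hi = mid - 1
        elif codepoint > end:
            lo = mid + 1
        else:
            return start_glyph + (codepoint - start)
    return None
```

A caller would use this only for codepoints above U+FFFF and keep relying on ScriptGetCMap for the BMP, as described above. A result of None corresponds to a missing glyph.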

But the initial problem remains that ScriptShape may return wrong glyph indices if psa->s.fDisplayZWG = False. We could explicitly render text by removing control characters first, but they have a formatting effect or an effect on bidi, and therefore would alter the result in other unwanted ways.
This is fine for the hex editor text column, but not for the single line edit control I wanted to make (and use for the search dialog or the data inspector). We would need to at least differentiate the type of control characters that are drawn individually like in SciTE. So maybe CR/LF/NUL/HT etc. could be drawn like this, but not special Unicode control characters that affect bidi direction, such as RLO and PDF.

When looking at the general Unicode categories, the only problematic ones are
Z = Separator -- Zs | Zl | Zp
C = Other -- Cc | Cf | Cs | Co | Cn
Separators are simply:
  • Zs = Space_Separator -- a space character (of various non-zero widths)
    • But it might still be useful to display them optionally as control chars, to make their meaning more obvious, e.g., NO-BREAK SPACE cannot be distinguished visually from a normal space, and the same holds for the non-breaking FIGURE SPACE.
      I.e., a kind of "paragraph" mode like in word processors, which shows space characters explicitly, could be helpful (but besides the additional informational value, rendering of space characters is unproblematic).
  • Zl = Line_Separator (U+2028 LINE SEPARATOR only)
  • Zp = Paragraph_Separator (U+2029 PARAGRAPH SEPARATOR only)
I.e., all of the Z chars should be rendered like SciTE does for control chars (but it should be only optional, for Zs chars).

Cc is C0 and C1, so the "classical" control chars of Latin1. Cs = Surrogate, Co = PrivateUse, Cn = Unassigned. So again all things I would want to render like in SciTE. The only category that remains is Cf = Format, which needs to be looked at in more detail.
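The category rules above can be sketched as a small classifier. This is only an illustration in Python: it uses the Unicode data bundled with the Python interpreter (not necessarily version 8.0), and the function name render_mode and the three mode names are made up for this example:

```python
import unicodedata

# Categories that should always be drawn like control chars (as SciTE does).
ALWAYS_AS_CONTROL = {"Zl", "Zp", "Cc", "Cs", "Co", "Cn"}


def render_mode(ch: str, show_spaces: bool = False) -> str:
    """Return 'control', 'format' or 'glyph' for a single codepoint.

    'control' -> draw a SciTE-style placeholder
    'format'  -> category Cf, needs the more detailed handling
    'glyph'   -> shape and draw normally
    """
    category = unicodedata.category(ch)
    if category in ALWAYS_AS_CONTROL:
        return "control"
    if category == "Zs":
        # Space separators are only drawn as control chars on request.
        return "control" if show_spaces else "glyph"
    if category == "Cf":
        return "format"
    return "glyph"
```

For example, render_mode("\u00a0") gives "glyph" by default and "control" with show_spaces=True, which corresponds to the optional "paragraph" mode mentioned above.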

TODO: used wrong file => incomplete list of characters, instead use https://www.unicode.org/Public/8.0.0/uc ... deData.txt
Cf again falls into several categories (see https://www.unicode.org/Public/8.0.0/ucd/PropList.txt ):
  • ZWNBSP or BOM
    • FEFF;ZERO WIDTH NO-BREAK SPACE;Cf;0;BN;;;;;N;BYTE ORDER MARK;;;;
  • Bidi_Control
    • 200E..200F ; Pattern_White_Space # Cf [2] LEFT-TO-RIGHT MARK..RIGHT-TO-LEFT MARK (contained in the above, but this time Pattern_White_Space)
  • Join_Control, which is ZWNJ and ZWJ
    • codepoints whose rendering would be activated with psa->s.fDisplayZWG = True but also affect neighboring glyphs
      • SCRIPT_CONTROL.fFlags:
        • fLinkStringBefore -- Equivalent to presence of ZWJ before string
        • fLinkStringAfter -- Equivalent to presence of ZWJ after string
      • SCRIPT_ANALYSIS.fFlags:
        • fLinkBefore -- Implies there was a ZWJ before this item
        • fLinkAfter -- Implies there is a ZWJ following this item.
      • SCRIPT_VISATTR.fFlags:
        • fZeroWidth -- Blank, ZWJ, ZWNJ etc, with no width
    • 200C..200D ; Other_Grapheme_Extend # Cf ZERO WIDTH NON-JOINER..ZERO WIDTH JOINER (same as just above! but this time Other_Grapheme_Extend)
  • Hyphen -- only SOFT HYPHEN (U+00AD)
  • 2061..2064 ; Other_Math # Cf [4] FUNCTION APPLICATION..INVISIBLE PLUS
  • Some deprecated ones
Considering the categories above, Bidi_Control is not relevant here: it is handled by ScriptItemize and when ordering runs, but it no longer matters in ScriptShape, which only deals with runs in which every codepoint has the same reading order and script.

Soft Hyphen should just be rendered as an individual codepoint, and so should Other_Math (we don't support math typesetting in a normal edit control, obviously). The deprecated ones no longer have a function, so they can be ignored (or rather, rendered as control chars as in SciTE).
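The grouping of Cf codepoints discussed so far could be sketched like this in Python. The list is knowingly incomplete (see the TODO above). The Bidi_Control ranges beyond 200E..200F are added from the Bidi_Control property in PropList.txt, the deprecated group only covers U+206A..U+206F, and all names in the code are hypothetical:

```python
def classify_format_char(cp: int) -> str:
    """Sort a Cf codepoint into the groups discussed in this post."""
    if cp == 0xFEFF:
        return "bom"
    if cp in (0x200C, 0x200D):  # ZWNJ, ZWJ
        return "join_control"
    if (
        cp == 0x061C
        or 0x200E <= cp <= 0x200F
        or 0x202A <= cp <= 0x202E
        or 0x2066 <= cp <= 0x2069
    ):
        return "bidi_control"
    if cp == 0x00AD:
        return "soft_hyphen"
    if 0x2061 <= cp <= 0x2064:
        return "other_math"
    if 0x206A <= cp <= 0x206F:  # assumption: only this deprecated range
        return "deprecated"
    return "other"


# Groups that can simply be drawn as individual control chars.
DRAW_AS_CONTROL = {"soft_hyphen", "other_math", "deprecated"}
```

Bidi_Control codepoints are left to ScriptItemize and run ordering, while "join_control" is the group that still needs a solution.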

The only remaining issue is the Join_Control codepoints: ZWJ and ZWNJ (German Wikipedia link, because it explains the most intuitive use in Arabic).

The use of ZWJ is to enable ligatures even at run boundaries or to enable joining glyphs but without ligatures, using a combination of ZWJ and ZWNJ.
This should be possible to achieve by breaking runs up (ZWNJ) and using a combination of the flags that have an effect equivalent to ZWJ (as mentioned above), or by breaking up runs to remove control codepoints (so they don't get rendered), but enabling the ZWJ flag where necessary.

But if ZWJ would cause ligatures, breaking up runs would disable them, no matter if the ZWJ flags are set or not, since ligatures means one glyph or at least different glyphs that overlay, i.e., cannot be broken up into two runs that are rendered individually. See Devanagari and Kannada examples of ligatures in: https://en.wikipedia.org/wiki/Zero-width_joiner
So this remains an incomplete solution.
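To make the run-splitting idea concrete, here is a rough sketch in Python. The Run class and its link_before/link_after fields are stand-ins for the fLinkBefore/fLinkAfter flags of SCRIPT_ANALYSIS; no actual Uniscribe call is made, and, as said above, this approach cannot preserve ligatures that a ZWJ would create across the split:

```python
from dataclasses import dataclass
from typing import List

ZWNJ = "\u200c"
ZWJ = "\u200d"


@dataclass
class Run:
    text: str
    link_before: bool = False  # stand-in for fLinkBefore
    link_after: bool = False   # stand-in for fLinkAfter


def split_at_join_controls(text: str) -> List[Run]:
    """Split text at ZWJ/ZWNJ, dropping them and recording ZWJ as link flags."""
    runs: List[Run] = []
    current = Run("")
    for ch in text:
        if ch in (ZWJ, ZWNJ):
            joined = ch == ZWJ
            current.link_after = joined
            if current.text:  # empty runs are not emitted
                runs.append(current)
            current = Run("", link_before=joined)
        else:
            current.text += ch
    if current.text:
        runs.append(current)
    return runs
```

For "ab" + ZWJ + "cd" + ZWNJ + "ef" this yields three runs, where only the boundary between "ab" and "cd" carries the link flags.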

So a few control characters would remain in the string, even when calling ScriptShape.
TODO: think about this and look up the mozilla bug report linked above, and follow the links to find the final solution with cmap, which makes no sense when considering the ë to multiple glyphs issue. So maybe their solution is simply incomplete, as you would need a more complex handling of the ttf/otf file.

Post by Maël »

Filtering out invalid text, e.g., for Thai, is not done by default by Uniscribe and just a possibility (from http://archives.miloush.net/michkap/arc ... 06789.html ):
The code that "filters" these characters sits in code called by the EDIT control that checks for two things:

  • Does the script of the given text "disallow illegal sequences", i.e., is SCRIPT_PROPERTIES->fRejectInvalid (from ScriptGetProperties) TRUE, and
  • is the SCRIPT_LOGATTR->fInvalid of the given character also TRUE?

If both are, while you are typing, this code that is not in Uniscribe itself will fail the attempt to insert the text, and it will beep.

Obviously that doesn't work so well for text that is already present (how do you scold someone for illegal text already typed?), so in that case Uniscribe will just do as it is told. And it will of course include the "empty circle" that implies a missing base character.

Post by Maël »

from https://www.adobe.com/products/type/ado ... ssary.html
Terminology
  • Face or font face or typeface
  • Family or font family or typeface family
    • Adobe: A collection of faces that were designed and intended to be used together. For example, the Garamond family consists of roman and italic styles, as well as regular, semi-bold, and bold weights. Each of the style and weight combinations is called a face.
    • CSS: https://developer.mozilla.org/de/docs/W ... ont-family
      • A family name or generic family name
  • Font
    • Adobe: One weight, width, and style of a typeface. Before scalable type, there was little distinction between the terms font, face, and family. Font and face still tend to be used interchangeably, although the term face is usually more correct.
    • Wikipedia: In metal typesetting, a font was a particular size, weight and style of a typeface.
    • Me: Same as face, but with an additional size, and under Windows (LOGFONT) with possibly added decorations, such as underline and strikeout.
  • Typeface
    • Adobe: The letters, numbers, and symbols that make up a design of type. [Added note: The term type is used generally to mean letters and other characters assembled into pages for printing or other means of reproduction.] A typeface is often part of a type family of coordinated designs. The individual typefaces are named after the family and are also specified with a designation, such as italic, bold or condensed.
    • Wikipedia: A typeface is the overall design of lettering; the design can include variations, such as extra bold, bold, regular, light, italic, condensed, extended, etc. Each of these variations of the typeface is a font.
  • Font name or face name or typeface name
    • In the WinAPI or at least LOGFONT, face name and font name indeed refer to a specific font face, not to the family. But it will revert to using the family name if the facename cannot be found.
    • After testing: facename is indeed NOT the family name in LOGFONT. It can distinguish "Calibri" and "Calibri Italic" properly. But will also treat it as family name, when no facename matches. For example, there is no facename/fontname "Yu Gothic", only "Yu Gothic Regular", "Yu Gothic Light", and "Yu Gothic Bold". Yet when using "Yu Gothic" as facename it properly selects "Yu Gothic Regular". But it also properly selects "Yu Gothic Bold", the bold version of the "Yu Gothic" family, when the full facename "Yu Gothic Bold" is entered (no need to additionally set the lfWeight property).
    • https://docs.microsoft.com/en-us/typogr ... _font_list The table in this link validates that fontname means font face name, since another column is explicitly called family.
    • HTML4.01: Has a font tag with an attribute face that expects a font name. Functionally, it searches for font family names, not font face names, i.e., it searches for "Calibri", not "Calibri Italic". Tested under Firefox, which indeed cannot find "Calibri Italic": it keeps the default font, i.e., Times New Roman, when "Calibri Italic" is specified as font face, but changes the font as expected when "Calibri" is entered.
      It seems face name, and font name stood for font family name in the past, for HTML.
This overview shows that the terms are really used in slightly varying ways, and the confusion about the exact meaning is not by chance.
Typeface is viewed by Adobe as synonym for font face, but by Wikipedia as synonym for font family. Deprecated HTML is somewhere in-between and uses facename and fontname to really mean font family, possibly due to the interpretation of typeface as family (similar to Wikipedia).
MS seems to make the same mistake as HTML, but really uses facename and fontname to refer to a specific face, such as "Calibri Italic". However, if no facename matches, it falls back to treating the name as a font family name, which creates this misunderstanding about what lfFaceName means.

In short, the clearest terms, even if they are not used consistently, are these: a font is a specific instance where everything is specified (usually the size, too, but not always). A font face is always without the size specified, and a font family is a collection of font faces. Finally, font name or face name refers to a font face, but some systems mean the font family (font tag in HTML), while others (WinAPI) really do mean the font face, but fall back to the font family name if no font face name matches.
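The lfFaceName behaviour observed in the tests can be modelled roughly as follows. This is only a Python sketch of the observed results, not the actual GDI font mapper. The Face type, the resolve_face function and the rule "prefer the upright face with weight closest to 400" for the family fallback are assumptions:

```python
from typing import List, NamedTuple, Optional


class Face(NamedTuple):
    family: str
    face_name: str
    weight: int    # 400 = regular, 700 = bold
    italic: bool


def resolve_face(name: str, faces: List[Face]) -> Optional[Face]:
    """Match name as a face name first, then fall back to the family name."""
    wanted = name.lower()
    for face in faces:
        if face.face_name.lower() == wanted:
            return face
    family = [f for f in faces if f.family.lower() == wanted]
    if not family:
        return None
    # Assumed fallback rule: upright face with weight closest to regular.
    return min(family, key=lambda f: (f.italic, abs(f.weight - 400)))


FACES = [
    Face("Yu Gothic", "Yu Gothic Light", 300, False),
    Face("Yu Gothic", "Yu Gothic Regular", 400, False),
    Face("Yu Gothic", "Yu Gothic Bold", 700, False),
    Face("Calibri", "Calibri", 400, False),
    Face("Calibri", "Calibri Italic", 400, True),
]
```

With this list, "Yu Gothic" resolves to "Yu Gothic Regular" via the family fallback, while "Yu Gothic Bold" and "Calibri Italic" match directly as face names.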

Post by Maël »

Thanks for the idea, but as you can read in the link, it is just a direct text output without editing capability. It also does not solve the many problems mentioned in this forum topic, that result from such a simple display approach, when you really would need to synch the byte array (left column) and text display.