For the charmap or simple style rendering, a solution could be to use ScriptShape with psa->s.fDisplayZWG = True to reliably know if glyphs are missing for a codepoint and also handle precomposed characters that might be rendered with several glyphs. No need to look up the cmap table then, since we can simply filter out control / non-printable codepoints, as we draw each codepoint individually, anyways.
Complex shaping and bidi rendering should not cause problems either, since, again, we handle each codepoint individually and therefore also call ScriptShape for each codepoint individually. So no interactions between codepoints would occur.
We might however not get positional forms of glyphs, which could look weird in Arabic. However having positional forms without bidi rendering would cause weird results anyways.
Additionally, we should use GetGlyphOutline to get the real metrics of a glyph, like the diaeresis U+0308 in Calibri. ScriptShape will return dimensions that are slightly off. We would also need to correct for any negative offsets. Refer to thoughts and exchange with author of BabelMap.
Finally, since we can get more than one glyph for a codepoint (as mentioned above for precomposed characters like ë), using GetGlyphOutline in this case will not work. But it is a pretty safe assumption that combining marks will always be single glyphs -- then again, I just found a font (estre.ttf -- Estrangelo Edessa Standard) where it decomposes such a combining mark (one dot above and below the base character).
https://docs.microsoft.com/en-us/typogr ... es_ae#ccmp
Inspecting the GSUB table in this font, there is a rule that replaces 0x0732 with 0x073C and 0x073F (the first entry in the 'GSUB' lookup list). In Syriac, the character 0x0732 is a combining mark that has a dot above AND a dot below the base character. To avoid needing multiple glyph variants to fit all base glyphs, the character is decomposed into two glyphs: a dot above and a dot below. These two glyphs can then be correctly placed using GPOS.
So the better solution would be to check if the first glyph has a negative offset (glyphs are always given in visual order, i.e., left to right), and then compensate this offset as usual using the formula inspired from BabelMap. Following glyphs would not need to be compensated for, since the first glyph would now (after the correction) have an advance width larger than zero, and essentially work like a base glyph.
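The first-glyph compensation could be sketched roughly as follows. This is not the BabelMap-inspired formula (which is not spelled out here), only an illustration under assumptions: `GlyphBox` is a hypothetical stand-in for the GLYPHMETRICS values that GetGlyphOutline reports, and `layout_cell` is a made-up helper name.

```python
from dataclasses import dataclass

@dataclass
class GlyphBox:
    # Hypothetical stand-in for the GLYPHMETRICS fields used in this sketch.
    origin_x: int     # like gmptGlyphOrigin.x: left ink edge relative to the pen
    black_box_x: int  # like gmBlackBoxX: width of the inked area
    advance: int      # like gmCellIncX: horizontal advance

def layout_cell(glyphs):
    """Return (pen_start, cell_width) for the glyphs of one codepoint.

    Glyphs are expected in visual order. Only the first glyph is checked
    for a negative offset; the pen is shifted right by that amount so the
    ink starts at x = 0 (assumption: this is the intended compensation).
    """
    if not glyphs:
        return 0, 0
    first = glyphs[0]
    shift = -first.origin_x if first.origin_x < 0 else 0
    pen = shift
    ink_right = 0
    for g in glyphs:
        ink_right = max(ink_right, pen + g.origin_x + g.black_box_x)
        pen += g.advance
    # The cell must cover both the final pen position and all ink.
    return shift, max(pen, ink_right)

# A zero-advance combining mark drawn 8 units left of the pen:
print(layout_cell([GlyphBox(origin_x=-8, black_box_x=6, advance=0)]))
```

After the shift, the first glyph occupies a cell of positive width, so any following glyphs are positioned relative to it without further correction.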
Glyph reordering should also not be a problem, since this would already have been done by ScriptShape for the glyph array it returns (i.e., glyphs would be in visual order, i.e., ltr).
Complex rendering
A possible solution is to implement code to manually look up a code point using the cmap table of a font (selected into a device context) using GetFontData. To avoid problems with weird encodings, such as those of Symbol fonts, Macintosh, or uncommon Arabic encodings (as mentioned in OpenType documentation), do this only for Unicode characters outside the BMP, and rely on ScriptGetCMap otherwise.
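A minimal sketch of the manual lookup for codepoints outside the BMP, assuming the raw bytes of a format 12 cmap subtable have already been fetched (e.g. via GetFontData) and located. Finding the subtable through the cmap header is omitted, and `lookup_format12` is a hypothetical helper name.

```python
import struct

def lookup_format12(subtable: bytes, codepoint: int) -> int:
    """Map a codepoint to a glyph index using a format 12 cmap subtable.

    Layout (big-endian): uint16 format, uint16 reserved, uint32 length,
    uint32 language, uint32 numGroups, then numGroups records of
    (startCharCode, endCharCode, startGlyphID), each uint32.
    Returns 0 (the missing glyph) if the codepoint is not covered.
    """
    fmt, _reserved, _length, _language, num_groups = struct.unpack_from(
        ">HHIII", subtable, 0)
    if fmt != 12:
        raise ValueError("not a format 12 cmap subtable")
    lo, hi = 0, num_groups - 1
    while lo <= hi:  # groups are sorted by startCharCode, so bisect
        mid = (lo + hi) // 2
        start, end, start_glyph = struct.unpack_from(
            ">III", subtable, 16 + 12 * mid)
        if codepoint < start:
            hi = mid - 1
        elif codepoint > end:
            lo = mid + 1
        else:
            return start_glyph + (codepoint - start)
    return 0
```

This only covers the non-BMP path; BMP codepoints would still go through ScriptGetCMap as described above.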
But the initial problem remains that ScriptShape may return wrong glyph indices if psa->s.fDisplayZWG = False. We could explicitly render text by removing control characters first, but they have a formatting effect or an effect on bidi, and therefore would alter the result in other unwanted ways.
This is fine for the hex editor text column, but not for the single line edit control I wanted to make (and use for the search dialog or the data inspector). We would need to at least differentiate the type of control characters that are drawn individually like in SciTE. So maybe CR/LF/NUL/HT etc. could be drawn like this, but not special Unicode control characters that affect bidi direction, such as RLO and PDF.
When looking at the general Unicode categories, the only problematic ones are
Z = Separator -- Zs | Zl | Zp
C = Other -- Cc | Cf | Cs | Co | Cn
Separators are simply:
- Zs = Space_Separator -- a space character (of various non-zero widths)
  - But it might still be useful to display them optionally as control chars, to make their meaning more obvious, e.g., NO-BREAK SPACE cannot be distinguished from a normal space visually, same holds for the non-breaking FIGURE SPACE. I.e., a kind of "paragraph" mode like in word processors, which shows space characters explicitly, could be helpful. Besides the additional informational value, rendering of space characters is unproblematic.
- Zl = Line_Separator (U+2028 LINE SEPARATOR only)
- Zp = Paragraph_Separator (U+2029 PARAGRAPH SEPARATOR only)
Cc is C0 and C1, so the "classical" control chars of Latin1. Cs = Surrogate, Co = PrivateUse, Cn = Unassigned. So again all things I would want to render like in SciTE. The only category that remains is Cf = Format, which needs to be looked at in more detail.
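The category split above could be sketched like this, using Python's `unicodedata` only to illustrate the classification. The labels "box", "space", "format" and "text" are made up for this sketch, and treating Zl/Zp like the control chars is my reading of the notes, not a settled decision.

```python
import unicodedata

# Categories that would be drawn as individual control boxes, like in SciTE.
# Including Zl and Zp here is an assumption.
BOXED = {"Cc", "Cs", "Co", "Cn", "Zl", "Zp"}

def render_class(ch: str) -> str:
    """Classify one codepoint by its Unicode general category."""
    cat = unicodedata.category(ch)
    if cat in BOXED:
        return "box"      # draw individually as a control char
    if cat == "Zs":
        return "space"    # unproblematic; optionally shown explicitly
    if cat == "Cf":
        return "format"   # needs a closer look
    return "text"         # everything else goes through normal shaping

for ch in ("\x00", "\u00a0", "\u2028", "\u200d", "a"):
    print(f"U+{ord(ch):04X}", render_class(ch))
```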
TODO: used wrong file => incomplete list of characters, instead use https://www.unicode.org/Public/8.0.0/uc ... deData.txt
Cf again falls into several categories (see https://www.unicode.org/Public/8.0.0/ucd/PropList.txt ):
- ZWNBSP or BOM
  - FEFF;ZERO WIDTH NO-BREAK SPACE;Cf;0;BN;;;;;N;BYTE ORDER MARK;;;;
- Bidi_Control
  - 200E..200F ; Pattern_White_Space # Cf [2] LEFT-TO-RIGHT MARK..RIGHT-TO-LEFT MARK (contained in the above, but this time Pattern_White_Space)
- Join_Control, which is ZWNJ and ZWJ
  - codepoints whose rendering would be activated with psa->s.fDisplayZWG = True but also affect neighboring glyphs
    - SCRIPT_CONTROL.fFlags:
      - fLinkStringBefore -- Equivalent to presence of ZWJ before string
      - fLinkStringAfter -- Equivalent to presence of ZWJ after string
    - SCRIPT_ANALYSIS.fFlags:
      - fLinkBefore -- Implies there was a ZWJ before this item
      - fLinkAfter -- Implies there is a ZWJ following this item.
    - SCRIPT_VISATTR.fFlags:
      - fZeroWidth -- Blank, ZWJ, ZWNJ etc, with no width
  - 200C..200D ; Other_Grapheme_Extend # Cf ZERO WIDTH NON-JOINER..ZERO WIDTH JOINER (same as just above, but this time Other_Grapheme_Extend)
- Hyphen -- only SOFT HYPHEN (U+00AD)
- 2061..2064 ; Other_Math # Cf [4] FUNCTION APPLICATION..INVISIBLE PLUS
- Some deprecated ones
Soft Hyphen and the Other_Math codepoints should just be rendered as individual codepoints (we don't support math typesetting in a normal edit control, obviously). The deprecated ones have no more function, so they can be ignored (or actually, rendered as control chars like in SciTE).
The only remaining issue is the Join_Control codepoints: ZWJ and ZWNJ (German Wikipedia link, because it explains the most intuitive use in Arabic).
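As a rough sketch, the Cf sub-groups listed above can be told apart with a few hard-coded ranges. The Bidi_Control and Join_Control sets follow PropList.txt as far as I know them; the deprecated set is deliberately partial (only 206A..206F), and the returned labels are made up for illustration.

```python
import unicodedata

BOM = 0xFEFF
SOFT_HYPHEN = 0x00AD
BIDI_CONTROL = {0x061C, 0x200E, 0x200F,
                *range(0x202A, 0x202F),   # LRE, RLE, PDF, LRO, RLO
                *range(0x2066, 0x206A)}   # LRI, RLI, FSI, PDI
JOIN_CONTROL = {0x200C, 0x200D}           # ZWNJ, ZWJ
OTHER_MATH_CF = set(range(0x2061, 0x2065))
DEPRECATED_CF = set(range(0x206A, 0x2070))  # partial list (assumption)

def cf_kind(ch: str):
    """Return a label for a format character, or None if it is not one."""
    cp = ord(ch)
    if cp == BOM:
        return "bom"
    if cp == SOFT_HYPHEN:
        return "soft_hyphen"
    if cp in BIDI_CONTROL:
        return "bidi_control"
    if cp in JOIN_CONTROL:
        return "join_control"
    if cp in OTHER_MATH_CF:
        return "other_math"
    if cp in DEPRECATED_CF:
        return "deprecated"
    # Any other Cf codepoint is not covered by the list above.
    return "other_format" if unicodedata.category(ch) == "Cf" else None

for ch in ("\u00ad", "\u202e", "\u200d", "\u2062", "\u206a", "a"):
    print(f"U+{ord(ch):04X}", cf_kind(ch))
```

With such a table, "soft_hyphen", "other_math" and "deprecated" would be drawn individually, "bidi_control" has to stay in the string, and "join_control" is the open case.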
The use of ZWJ is to enable ligatures even at run boundaries or to enable joining glyphs but without ligatures, using a combination of ZWJ and ZWNJ.
This should be possible to achieve by breaking runs up (ZWNJ) and using a combination of the flags that have an equivalent effect to ZWJ (as mentioned above), or by breaking up runs to remove control codepoints (so they don't get rendered), but enabling the ZWJ flag where necessary.
But if ZWJ would cause ligatures, breaking up runs would disable them, no matter if the ZWJ flags are set or not, since a ligature means one glyph or at least different glyphs that overlay, i.e., it cannot be broken up into two runs that are rendered individually. See Devanagari and Kannada examples of ligatures in: https://en.wikipedia.org/wiki/Zero-width_joiner
So this remains an incomplete solution.
So a few control characters would remain in the string, even when calling ScriptShape.
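The run-splitting idea can be sketched as below. `Run` and `split_at_join_controls` are hypothetical names; `link_before` / `link_after` only mirror the meaning of fLinkBefore / fLinkAfter, and no Uniscribe call is made. As noted above, this cannot preserve ligatures that a ZWJ would form across the split.

```python
from dataclasses import dataclass

ZWNJ, ZWJ = "\u200c", "\u200d"

@dataclass
class Run:
    text: str
    link_before: bool = False  # like fLinkBefore: a ZWJ preceded this run
    link_after: bool = False   # like fLinkAfter: a ZWJ follows this run

def split_at_join_controls(text: str):
    """Split text at ZWJ/ZWNJ, dropping them and recording ZWJ as flags.

    Within a stretch of several join controls, the first one decides
    link_after of the preceding run and the last one decides link_before
    of the following run (assumption made for this sketch).
    """
    runs = []
    current = ""
    link_before = False
    in_gap = False
    for ch in text:
        if ch in (ZWNJ, ZWJ):
            if not in_gap:
                if current:
                    runs.append(Run(current, link_before, ch == ZWJ))
                current = ""
                in_gap = True
            link_before = (ch == ZWJ)
        else:
            in_gap = False
            current += ch
    if current:
        runs.append(Run(current, link_before, False))
    return runs

print(split_at_join_controls("ab\u200d\u200c\u200dcd"))
```

The ZWJ + ZWNJ + ZWJ sequence in the usage line is the "joining forms without ligature" case: both neighbouring runs get their link flag set, while the break itself stands in for the ZWNJ.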
TODO: think about this and look up the mozilla bug report linked above, and follow the links to find the final solution with cmap, which makes no sense when considering the ë to multiple glyphs issue. So maybe their solution is simply incomplete, as you would need a more complex handling of the ttf/otf file.