Text column support for UTF-8, UTF-16 and other multi-byte text encodings (variable width encodings)

Maël
Site Admin
Posts: 1454
Joined: 12 Mar 2005 14:15

Re: Text column support for UTF-8, UTF-16 and other multi-byte text encodings (variable width encodings)

Post by Maël »

Some other useful references regarding Uniscribe:

https://microsoft.public.win32.programm ... -uniscribe
Explains the exact parameters to have proper font fallback and linking.

https://web.archive.org/web/20150824200 ... ilang.aspx
Uniscribe subdivides strings of characters into items (a character string having all the same script and direction attributes), runs (portions of an item that have continuous formatting attributes), and clusters (script-defined, indivisible character groupings). The client builds runs based on its own stored formatting attributes and on the item boundaries obtained by calling the Uniscribe ScriptItemize API.
Complex script languages are broken into clusters by ScriptShape. Character reordering always occurs within cluster boundaries. The clusters themselves are guaranteed to advance monotonically in the reading order [=visual order].
Only scripts that have the property fComplex should be shaped with the script returned by ScriptItemize. All other runs may be merged and shaped with SCRIPT_UNDEFINED. If there are characters not supported by the font, SCRIPT_UNDEFINED will not fail with USP_E_SCRIPT_NOT_IN_FONT. Missing glyphs will usually be displayed as an empty rectangle. An application can determine if a codepoint is supported by a font by calling ScriptGetFontProperties to obtain the default glyph index, and ScriptGetCMap to look up font glyphs for Unicode codepoints.
Note caveats though regarding surrogate pairs and characters made of several glyphs.
If this fallback strategy fails, the sample application restores the original style and changes the script field of the itemization analysis to SCRIPT_UNDEFINED (the only publicized script number). SCRIPT_UNDEFINED causes ScriptShape to bypass shaping and use the 1:1 codepoint to glyph mappings from the font CMAP table. Most likely this will display the missing glyph for each character in the run. (The missing glyph is usually represented as an empty rectangle.)
Lines containing one or more runs are constructed by measuring the runs in logical order until a run causes the line to overflow. The overflowing run is passed to BreakRun, which determines a suitable wordbreak position. BreakRun uses ScriptGetLogicalWidths to convert the glyph widths returned by ScriptPlace into character widths. ScriptGetLogicalWidths returns virtual character widths ordered one for one with the logical character buffer. These widths are summed to identify the physical end of line as a logical character position.
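To make the line-breaking description above more concrete, here is a minimal sketch in Python (not Uniscribe code). It assumes the per-character widths have already been obtained (as ScriptGetLogicalWidths would return them) and that a list of flags says where a line may start (similar in spirit to the fSoftBreak attribute from ScriptBreak). The function name and parameters are made up for illustration.

```python
from typing import Sequence


def find_line_break(logical_widths: Sequence[int],
                    soft_breaks: Sequence[bool],
                    max_width: int) -> int:
    """Return how many characters (in logical order) fit on the line.

    logical_widths[i] is the width of character i.
    soft_breaks[i] is True if a new line may start in front of character i.
    """
    total = 0
    last_break = 0
    for i, width in enumerate(logical_widths):
        if i > 0 and soft_breaks[i]:
            last_break = i  # remember the most recent allowed break position
        if total + width > max_width:
            # Overflow: prefer the last word break; otherwise break right
            # here, but always keep at least one character on the line.
            return last_break if last_break > 0 else max(i, 1)
        total += width
    return len(logical_widths)  # everything fits
```

For example, with ten characters of width 10, a break opportunity in front of character 4, and a line width of 65, the function returns 4: the line is cut at the word boundary rather than at the overflowing character.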
Pictures explaining the meaning of GLYPHMETRICS:
https://www.cnblogs.com/shangdawei/arch ... 66762.html
http://marupeke296.com/WINT_GetGlyphOutline.html

Post by Maël »

Useful tools (each of them has different info, or presents it better in some way; listed in order of usefulness for obtaining detailed glyph info):
DTL OTMaster 3.7 Light for viewing glyph properties and TTF/OTF tables.
SIL ViewGlyph
Type light 3.2
BabelPad
FontForge
BirdFont is really slow
MS VOLT does not seem to load tables properly

Essential OpenType tables:

https://docs.microsoft.com/en-us/typogr ... c/chapter2
https://docs.microsoft.com/en-us/typogr ... /spec/glyf
https://docs.microsoft.com/en-us/typogr ... /spec/gpos
https://docs.microsoft.com/en-us/typogr ... /spec/gsub
https://docs.microsoft.com/en-us/typogr ... /spec/base

https://docs.microsoft.com/en-us/typogr ... /spec/hmtx
https://docs.microsoft.com/en-us/typogr ... 80/os2ver2

https://scripts.sil.org/IWS-Chapter08
Changing the advance width will alter the positioning of every glyph on a line after the changed glyph. Changing the left side bearing will only alter the placement of an individual glyph. In right to left scripts, glyphs still are described using a left to right coordinate system.
https://scripts.sil.org/IWS-AppendixC
https://scripts.sil.org/CatRenderingPrinciples

https://docs.microsoft.com/en-us/typogr ... maxcontext
usMaxContext: The maximum length of a target glyph context for any feature in this font. For example, a font which has only a pair kerning feature should set this field to 2. If the font also has a ligature feature in which the glyph sequence ‘f f i’ is substituted by the ligature ‘ffi’, then this field should be set to 3. This field could be useful to sophisticated line-breaking engines in determining how far they should look ahead to test whether something could change that effects the line breaking. For chaining contextual lookups, the length of the string (covered glyph) + (input sequence) + (lookahead sequence) should be considered.
Overall text rendering
An essential insight is that as a first step the cmap table is used to map codepoints to glyphs, no matter if we have a complex script with ligatures, contextual shaping (e.g., positional glyph alternatives), or glyph reordering and similar.
Some Unicode APIs have bugs and limitations regarding the cmap table, e.g., they only support UCS-2 instead of UTF-16 and therefore cannot look up codepoints outside of the BMP. But if the font supports certain codepoints, they exist in the cmap, whether the API is buggy or not.

In the following steps sequences of glyphs get replaced with substitutes, if necessary, for example for the fi ligature, or to get the right positional forms of an Arabic letter (start, mid or end of a word).

As such, a proper character table display, or simple rendering of text with no reading order consideration, would simply use the cmap, which always returns the default glyph for each codepoint (even though the font file may have alternatives that can be selected/found using other tables), and chain the glyphs next to each other.

The only issue is that some fonts will not behave as expected, and still try to combine glyphs. For example, U+0308 (combining diaeresis) has an advance width of 0 but a negative left side bearing, and will therefore be drawn over the glyph on its left. When we want a character-table-like display, we have to compensate for that: measure the glyph's black box instead, and then add some spacing around it (or just center it in the space given for each cell in the grid).

When rendering normal text as a simple sequence of glyphs, though, we cannot center in some arbitrary cell. But since the glyph lacks proper left and right spacing (its left side bearing was "abused" for offsetting the glyph over the neighboring glyph), we need another way to determine what spacing should be added around the glyph.

TODO: how best to do that without choosing an arbitrary spacing remains to be seen.

In the simple text drawing mode, we would also ignore variation selectors on purpose, since they should show up as control characters and not affect the rendering of the preceding characters. This mode should make it easy to debug text strings!
https://docs.microsoft.com/en-us/typogr ... -sequences

For Arabic it would probably be useful to have a mode with and one without ligatures, but both should still use positional substitutes. Uniscribe offers options to disable ligatures while still doing positional substitution. TODO: more testing is needed to see which features get disabled, by checking OpenType features and when they disappear depending on the options set in the Uniscribe APIs. Unfortunately, disabling some features triggers bugs in Uniscribe, at least in old versions.

With a better understanding of OpenType, it may be possible to draw glyphs individually, by looking them up myself using a parsed cmap, to avoid triggering the bugs.

Post by Maël »

From https://docs.microsoft.com/en-us/typogr ... /spec/hmtx
In a font with TrueType outlines, xMin and xMax values for each glyph are given in the 'glyf' table. The advance width (“aw”) and left side bearing (“lsb”) can be derived from the glyph “phantom points”, which are computed by the TrueType rasterizer; or they can be obtained from the 'hmtx' table. In a font with CFF or CFF2 outlines, xMin (= left side bearing) and xMax values can be obtained from the CFF / CFF2 rasterizer. From those values, the right side bearing (“rsb”) is calculated as follows:

rsb = aw - (lsb + xMax - xMin)
Adapting this to GLYPHMETRICS (which is in pixels, so the equals sign (=) below really means "corresponds to"):
rsb = gmCellIncX - gmBlackBoxX - gmptGlyphOrigin.X
where
lsb = gmptGlyphOrigin.X
aw = gmCellIncX
xMax - xMin = gmBlackBoxX
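As a small sanity check of the mapping above, here is a sketch in Python. The class only mirrors the three GLYPHMETRICS fields used in the formula; it is a stand-in for illustration, not the Win32 structure itself, and the sample values are made up.

```python
from dataclasses import dataclass


@dataclass
class GlyphMetrics:
    """The three GLYPHMETRICS fields used above, in pixels."""
    black_box_x: int  # gmBlackBoxX: width of the glyph's black box (xMax - xMin)
    origin_x: int     # gmptGlyphOrigin.X: left side bearing (lsb)
    cell_inc_x: int   # gmCellIncX: advance width (aw)


def right_side_bearing(gm: GlyphMetrics) -> int:
    # rsb = aw - (lsb + xMax - xMin)
    return gm.cell_inc_x - gm.origin_x - gm.black_box_x


# An ordinary glyph: a little whitespace on both sides.
print(right_side_bearing(GlyphMetrics(black_box_x=10, origin_x=1, cell_inc_x=12)))
# A zero-advance glyph with negative lsb: rsb is only what is left
# after the black box has been shifted over the previous glyph.
print(right_side_bearing(GlyphMetrics(black_box_x=6, origin_x=-8, cell_inc_x=0)))
```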


Again, see the quote from the post above (https://scripts.sil.org/IWS-Chapter08):
Changing the advance width will alter the positioning of every glyph on a line after the changed glyph. Changing the left side bearing will only alter the placement of an individual glyph.
It means that rsb is not needed to compute the position of a glyph or the spacing; only lsb and aw are needed: lsb positions the glyph relative to the current position, and aw updates the current position. rsb merely helps to determine the whitespace after the glyph (and can be computed as shown above); it might be useful when rendering from right to left. But knowing the whitespace is not really needed, since the position of the next glyph follows from nextpos = currentpos + aw.

Maybe that info helps to compute proper whitespace left and right of a glyph, when aw=0, as mentioned in a TODO in the post above.
My conclusions so far, after looking through several glyphs in Calibri (using "Type light 3.2"): diaeresis (the non-combining version U+00A8) has proper lsb and rsb (both 132 FUnits). Other characters have very little lsb and rsb, such as "x" (41).
I should identify characters with 0 aw (gmCellIncX), then replace it with gmBlackBoxX and add a space of about 160 FUnits to its left and right (seems to be a good average from looking at chars in Calibri). Also, lsb and rsb should be the same, since the glyph is shown in isolation (it is no longer treated as a combining or zero-width glyph) and should not suggest being closer to the left or the right, unlike perhaps a comma (,) or a period (.), which sit closer to the glyph on their left.
Additionally, I have to neutralize any negative lsb by increasing currentpos by -lsb (or leaving it unchanged if lsb >= 0). Then increase currentpos by 160 FUnits, draw the glyph, and finally increase currentpos by gmBlackBoxX + 160 FUnits.
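A rough sketch of that idea in Python, as one possible reading of the steps above. It tracks the left edge of the ink (black box) rather than the glyph origin, so that the padding comes out the same on both sides regardless of the sign of lsb; that simplification is an assumption of this sketch. All names are made up, the 160 FUnits padding is just the Calibri estimate from above, and all values are in FUnits.

```python
PADDING = 160  # FUnits, rough average taken from looking at Calibri


def layout_isolated(glyphs, padding=PADDING):
    """Place glyphs one after another, giving zero-advance glyphs their own space.

    glyphs: list of (lsb, aw, black_box_width) tuples.
    Returns (origins, end_pos): the x origin at which to draw each glyph,
    and the pen position after the last glyph.
    """
    pen = 0
    origins = []
    for lsb, aw, black_box in glyphs:
        if aw == 0:
            # Zero-advance glyph (e.g. a combining mark): show it in isolation.
            ink_left = pen + padding          # where the black box should start
            origins.append(ink_left - lsb)    # neutralizes lsb, whatever its sign
            pen = ink_left + black_box + padding
        else:
            # Ordinary glyph: lsb is applied by the renderer, aw moves the pen.
            origins.append(pen)
            pen += aw
    return origins, pen


# "x"-like glyph followed by a diaeresis-like mark with negative lsb
print(layout_isolated([(41, 900, 800), (-700, 0, 500)]))
```

The open TODO questions (negative lsb with nonzero aw, negative rsb) are deliberately not handled here.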

That should work out nicely. TODO: what if lsb is negative but aw is not 0? What if lsb=0 but aw < gmBlackBoxX (i.e., rsb < 0) ? Should we allow overlapping glyphs? What if it is just a kerning effect or similar?

https://docs.microsoft.com/en-us/typogr ... -to-pixels
Converting FUnits to pixels

Values in the em square are converted to values in the pixel coordinate system by multiplying them by a scale. This scale is:

pointSize * resolution / ( 72 points per inch * units_per_em )

where pointSize is the size at which the glyph is to be displayed, and resolution is the resolution of the output device. The 72 in the denominator reflects the number of points per inch.

For example, assume that a glyph feature is 550 FUnits in length on a 72 dpi screen at 18 point. There are 2048 units per em. The following calculation reveals that the feature is 4.83 pixels long.

550 * 18 * 72 / ( 72 * 2048 ) = 4.83
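The same conversion as a tiny Python helper, mostly to have the formula in executable form (the function name is made up):

```python
def funits_to_pixels(funits: float, point_size: float,
                     resolution_dpi: float, units_per_em: int) -> float:
    """Convert a length in FUnits to pixels.

    scale = pointSize * resolution / (72 points per inch * units_per_em)
    """
    return funits * point_size * resolution_dpi / (72 * units_per_em)


# The example from the specification: 550 FUnits, 18 pt, 72 dpi, 2048 units per em
print(round(funits_to_pixels(550, 18, 72, 2048), 2))  # 4.83
```

This is also the conversion needed for the 160 FUnits padding mentioned above, since GLYPHMETRICS values are in pixels.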

Post by Maël »

Glyph positioning

For TrueType 1.0
gpos_fig4c[1].gif
For OpenType:
From https://docs.microsoft.com/en-us/typogr ... h-opentype
X and Y values specified in OpenType fonts for placement operations are always within the typical Cartesian coordinate system (origin at the baseline of the left side), regardless of the writing direction.

Post by Maël »

General Uniscribe terminology and concepts

From https://docs.microsoft.com/de-de/window ... -uniscribe
Then the application shapes the code points for each range into glyphs, which it can subsequently position and render.
That means shaping (ScriptShape) is the translation from codepoints to glyphs, including possible substitutions of individual glyphs or glyph runs depending on context (= surrounding glyphs), to obtain ligatures or positional forms (letter glyphs that vary with the in-word position, as in Arabic).
ScriptPlace positions glyphs, by computing advance widths and offsets (for combining marks). And finally, ScriptTextOut renders the glyphs.

From http://archives.miloush.net/michkap/arc ... 54279.html
There is a help topic on MSDN that captures the essence of the complexity entitled About Complex Scripts, which starts off with a very succinct summary:
A complex script has at least one of the following attributes:
  • Allows bidirectional rendering.
  • Has contextual shaping.
  • Has combining characters.
  • Has specialized word-breaking and justification rules.
  • Filters out illegal character combinations.
  • Is not supported in the core Windows fonts and therefore might require font fallback.
In some complex scripts, the order of the glyphs might be quite different from the order of the underlying Unicode characters they represent. See About Complex Scripts for more detail.
Now the above is great if you are familiar with a language that uses the Hebrew, Thai, or Arabic script, but is not so useful if you only know about a language like English; it therefore goes on and gives other examples, some of which a wider range of people can identify with:
Bidirectional rendering refers to the script's ability to handle text that reads both left-to-right and right-to-left. For example, in the bidirectional rendering of Arabic, the default reading direction for text is right-to-left, but for some numbers, it is left-to-right. Processing a complex script must account for the difference between the logical (keystroke) order and the visual order of the glyphs. In addition, processing must properly deal with caret movement and hit testing. The mapping between screen position and a character index for, say, selection of text or caret display requires knowledge of the layout algorithms.

Contextual shaping occurs when a script's characters change shape depending on the characters that surround them. This occurs in English cursive writing when a lowercase "l" changes shape depending on the character that precedes it such as an "a" (connects low to the "l") or an "o" (connects high). Arabic is a script that exhibits contextual shaping.

Combining characters or ligatures are characters that join into one character when placed together. One example is the "ae" combination in English; it is sometimes represented by a single character. Arabic is a script that has many combining characters.

Specialized word break and justification refers to scripts that have complex rules for dividing words between lines or justifying text on a line. Thai is such a script.

Filtering out illegal character combinations occurs when a language does not allow certain character combinations. Thai is such a script.
And suddenly people kind of like that there are controls (like RichEdit) and libraries like (Uniscribe and GDI+) and operating systems (like Windows) that deal with most of these issues automatically. There are some basic conventions upon which RichEdit, Uniscribe, and GDI+ basically agree:
  • When moving through text with an arrow key, move through one text element at a time, since the user often thinks of them as a single character.
  • When selecting text, select entire text elements rather than pieces of them.
  • When forced to break lines, try to break at word boundaries; if that is not possible then at least try to break at text element boundaries so that the integrity of what the user thinks of as a character is not destroyed.
  • When hitting the DELETE button to delete in front of the cursor, delete the entire text element.
  • When hitting the BACKSPACE button to delete behind the cursor, usually delete just the code point since the user may have typed it in that way and may be unhappy in the case of typos to lose multiple code points (though more sophisticated processors like RichEdit will properly delete surrogate pairs in their entirety since they were almost certainly not typed separately).
Non-relevant but true comment from the link above:
I have also known people who speak one of the languages that is impacted (such as Arabic or Hindi or Tamil) who think of it as being somehow insulting to call their language "complex". But I can promise that no insult is intended -- this is just a recognition that some languages use scripts that require more effort to support correctly. I find certain complex scripts to appear to me to be the most aesthetically pleasing, enough that I am a little afraid to learn how to read a language like Thai since not knowing how to read it allows even something as mundane as a grocery list to appear beautiful to me.

Potential problem:
Filtering out illegal character combinations occurs when a language does not allow certain character combinations. Thai is such a script.
If Uniscribe filters such character combinations, this could interfere with HxD, since it allows arbitrary character sequences that don't necessarily result in meaningful text; the original data might not even be text (just an attempt at interpreting a sequence of bytes as such).
If possible, such filtering should be disabled.

Post by Maël »

Terminal emulators have similar requirements to code editors or hex editors (from http://behdad.org/text/ ):
Terminal emulators with support for complex text are very weird hybrids. On the one hand terminal emulators have to lay text out in a predefined grid in a predefined way, which is in conflict with many aspects and requirements of complex text, on the other hand users demand support for complex text in their terminals. It gets uglier when you think about bidirectional text, say, inside a console text editor. Nonetheless, it is fair to say that such hybrids do not put any new demands on the shaping engine. gnome-terminal currently has no support for complex text other than combining marks. Konsole has bidirectional text support. Apple's Terminal App has at least bidi support as well as Arabic shaping support, not sure about other complex text. Update (Jan 18, 2010): The terminal mode (term and ansi-term) in recent versions of Emacs can render complex text, including Indic.

Post by Maël »

https://docs.microsoft.com/en-us/window ... peopentype
If the eScript member of SCRIPT_ANALYSIS is set to SCRIPT_UNDEFINED, shaping is disabled. In this case, ScriptShapeOpenType displays the glyph that is in the font cmap table. If no glyph is in the table, the function indicates that glyphs are missing.
However it still seems to use positional glyphs for scripts such as Arabic, but not ligatures (as it wouldn't look them up in substitution tables). Why does it look up positional glyphs, though?

Since uBidiLevel and fRTL may sometimes disagree (see test examples in PrintAndHandleGraphemeClusters.dproj), it is good to know which one to rely on. The documentation for ScriptShapeOpenType is very clear about it and says fRTL is what matters for shaping (so I assume for placing as well):
ScriptShapeOpenType sequences clusters uniformly within the run, and sequences glyphs uniformly within a cluster. It uses the value of the fRTL member of SCRIPT_ANALYSIS, from ScriptItemizeOpenType, to identify if sequencing is left-to-right or right-to-left.

Post by Maël »

Subdividing a sequence of codepoints into clusters and lines/paragraphs

From https://docs.microsoft.com/en-us/window ... -uniscribe
The breakdown of an item into ranges is somewhat arbitrary, although a range should consist of one or more consecutive script-defined, indivisible character groupings, called "clusters." For European languages, a cluster typically corresponds to a single code page character or Unicode code point, and consists of a single glyph. However, in languages such as Thai, a cluster is a grouping of glyphs and corresponds to multiple consecutive characters or code points. For example, a Thai cluster might contain a consonant, a vowel, and a tone mark. So that it does not break clusters, the application typically should either use the longest ranges it can or use its own lexical information to break between ranges in places that are not in the middle of a cluster.
It is not clear if a grapheme cluster and a Uniscribe cluster are the same. Since a "Thai cluster might contain a consonant, a vowel, and a tone mark" I am not sure if that is really the same as a grapheme cluster, which usually only considers combining marks, CR/LF pairs, Hangul syllables, SpacingMark, Extend codepoint categories.

Edit: checking https://en.wikipedia.org/wiki/Thai_script#Vowels, some of the combining vowels, at least, are listed in GraphemeBreakProperty-8.0.0.txt as having the Extend property and the general category Mn (Nonspacing Mark). The codepoint THAI CHARACTER SARA AM has the SpacingMark property and the general category Lo (Other Letter).

Given that both Extend and SpacingMark are prevented from breaking in the GraphemeClusterBreak algorithm, by rules GB9 and GB9a, it seems that grapheme clusters and Uniscribe clusters mean the same, at least for Thai.
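The general categories of the Thai codepoints in question can be double-checked with Python's unicodedata module. Note that unicodedata does not expose the Grapheme_Cluster_Break property (Extend, SpacingMark), so this only confirms the general category part; the property values still have to be looked up in GraphemeBreakProperty.txt. The helper name is made up.

```python
import unicodedata


def general_categories(text: str):
    """Return (codepoint, general category) for each character of text."""
    return [(f"U+{ord(ch):04X}", unicodedata.category(ch)) for ch in text]


# KO KAI (consonant), SARA I (vowel above), MAI EK (tone mark), SARA AM
for cp, cat in general_categories("\u0e01\u0e34\u0e48\u0e33"):
    print(cp, cat, unicodedata.name(chr(int(cp[2:], 16))))
```

The combining vowel and the tone mark come out as Mn, while the consonant and SARA AM are both Lo.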


Understanding what clusters are is important to find "paragraphs" that can be passed to Uniscribe. In reality these will not be paragraphs and not even lines, but text segments that hopefully can be shaped without additional context. However, it might cause problems with bidi layout if we don't have an entire paragraph. This will always be a problem, though, since in degenerate cases a paragraph may span the entire file (since files are potentially random data).

The Unicode line breaking algorithm may still be useful:
https://www.unicode.org/reports/tr14/tr ... #Algorithm

Potential problems by breaking in the wrong place:
from https://docs.microsoft.com/de-de/window ... -uniscribe
An application that uses complex scripts has the following problems with a simple approach to layout and display.
  • The width of a complex script character depends on its context. It is not possible to save the widths in simple tables.
  • Breaking between words in scripts such as Thai requires dictionary support. For example, no separator character is used between Thai words.
  • Arabic, Hebrew, Persian, Urdu, and other bidirectional text languages require reordering before display.
  • Some form of font association is often required to easily use complex scripts.
The fact that Uniscribe uses the paragraph as the display unit helps the application deal adequately with these complex script issues.

Post by Maël »

from https://docs.microsoft.com/en-us/typogr ... yout-fonts
Text processing with OpenType Layout fonts

A text-processing client follows a standard process to convert the string of characters entered by a user into positioned glyphs. To produce text with OpenType Layout fonts:
  1. Using the 'cmap' table in the font, the client converts the character codes into a string of glyph indices.
  2. Using information in the GSUB table, the client modifies the resulting string, substituting positional or vertical glyphs, ligatures, or other alternatives as appropriate.
  3. Using positioning information in the GPOS table and baseline offset information in the BASE table, the client then positions the glyphs.
  4. Using design coordinates the client determines device-independent line breaks. Design coordinates are high-resolution and device-independent.
  5. Using information in the JSTF table, the client justifies the lines, if the user has specified such alignment.
  6. The operating system rasterizes the line of glyphs and renders the glyphs in device coordinates that correspond to the resolution of the output device.
Throughout this process the text-processing client keeps track of the association between the character codes for the original text and the glyph indices of the final, rendered text. In addition, the client may save language and script information within the text stream to clearly associate character codes with typographical behavior.
Judging from the above description and the Uniscribe API documentation (especially about the generated output of each function), ScriptShape takes care of points 1 and 2, ScriptPlace takes care of point 3, ScriptJustify takes care of point 5, and finally, ScriptTextOut handles point 6.
Point 4 is not handled by ScriptBreak in the described way. Instead, it has to be implemented manually using information returned by ScriptPlace, such as advance widths, and possible breaking opportunities as given by ScriptBreak. The computation will also not be done in design coordinates, but in device coordinates (screen pixels).

Itemizing a character string based on the script is also not listed above; the description assumes everything will be rendered in one font. I.e., it only describes handling one item returned by ScriptItemize.

Post by Maël »

Possibly, this makes it possible to coax Uniscribe into merely using a direct character code to glyph mapping via the cmap table, given SCRIPT_UNDEFINED is used to completely disable complex shaping. But it's not completely clear if this holds.

It may also be possible to enable Uniscribe to do some glyph substitution, but not others, such as substituting positional forms, but not ligatures, nor base characters fused with their combining marks.

Post by Maël »

Clusters in shaping <> grapheme clusters
from https://harfbuzz.github.io/clusters.htm ... nd-shaping
In text shaping, a cluster is a sequence of characters that needs to be treated as a single, indivisible unit. A single letter or symbol can be a cluster of its own. Other clusters correspond to longer subsequences of the input code points — such as a ligature or conjunct form — and require the shaper to ensure that the cluster is not broken during the shaping process.

A cluster is distinct from a grapheme, which is the smallest unit of meaning in a writing system or script.


The definitions of the two terms are similar. However, clusters are only relevant for script shaping and glyph layout. In contrast, graphemes are a property of the underlying script, and are of interest when client programs implement orthographic or linguistic functionality.

For example, two individual letters are often two separate graphemes. When two letters form a ligature, however, they combine into a single glyph. They are then part of the same cluster and are treated as a unit by the shaping engine — even though the two original, underlying letters remain separate graphemes.
An example would be the fi ligature, which would be one cluster in script shaping, but still two graphemes (f and i). And it is also not a grapheme cluster, as can be tested with IsGraphemeClusterBoundaryInbetween().

Post by Maël »

Reliably detecting if a codepoint can be represented by a font

After reading many articles/posts/bug reports, including larger parts of the OpenType specification and font rendering techniques, and inspecting and testing font files with "DTL OTMaster 3.7 Light" (and some other font tools), I finally get a clearer picture of how it should be done.

Each TrueType (or OpenType) font has a codepoint to glyph map, called cmap, available in various encodings. As mentioned in posts higher up in this thread, the first step of rendering text is to look up each codepoint in this cmap table to get a glyph id. These glyph ids may then be replaced by other glyph ids, using the font's substitution table and its rules. For example, a sequence of glyph ids can be replaced by a single glyph id, in case of ligatures. Or in the case of a precomposed character, it may first be mapped to a single glyph id through cmap, then mapped to several glyph ids, which when rendered together represent the precomposed character properly.

For example, ë (e with a trema), might not be available in the font, but the individual glyphs for e and trema exist, and so a codepoint ë would get mapped to two glyphs, one for e and one for ◌̈. That's why simply looking up a codepoint in cmap is not enough, and why ScriptGetCMap will be insufficient.

But the post from Michael S. Kaplan, "Determining if a font is gonna get it done", is somewhat confusing about the whole topic.

It describes four problems:
  1. Testing if a font contains the necessary glyph or not is not reliable
    • We (Mozilla/Firefox) are using Uniscribe for text rendering now and everything works pretty well except in one case where we're trying to determine if a font had all the glyphs to render the string or not.

      Certain fonts are giving us trouble. Constantia being a good offender.
      https://bugzilla.mozilla.org/show_bug.cgi?id=376300
  2. The workaround of setting psa->s.fDisplayZWG = True introduces other problems
  3. ScriptGetCMap does not work with surrogate pairs (it treats each surrogate individually), which means that in this case we need to look up codepoints in cmap with another function
  4. Some codepoints may be rendered as a combination of glyphs, such as precomposed characters like ë, so we still need to use ScriptShape in those cases.
    • From ScriptGetCMap documentation:
      Some code points can be rendered by a combination of glyphs, as well as by a single glyph, for example, 00C9; LATIN CAPITAL LETTER E WITH ACUTE. In this case, if the font supports the capital E glyph and the acute glyph, but not a single glyph for 00C9, ScriptGetCMap shows that 00C9 is unsupported. To determine the font support for a string that contains these kinds of code points, the application can call ScriptShape. If the function returns S_OK, the application should check the output for missing glyphs.
According to http://archives.miloush.net/michkap/arc ... 50680.html "GetGlyphIndices does not handle chars outside BMP (true in XP SP2 and an older Vista build)". The suggested solution to use ScriptGetCMap is invalid, since the documentation (maybe it was updated) states that:
The function [ScriptGetCMap] assumes a 1:1 relationship between the elements in the input and output arrays. However, the function does not support this relationship for UTF-16 surrogate pairs. For a surrogate pair, the function does not retrieve the glyph index for the supplementary-plane character.
In other words, ScriptGetCMap really only supports UCS-2 not UTF-16.
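For reference, this is the step a UCS-2-only API skips: combining a high and a low surrogate into one supplementary-plane codepoint before the cmap lookup. A minimal Python sketch (the function name is made up); lone surrogates are passed through unchanged, which matters for a hex editor where the data may not be valid UTF-16 at all.

```python
def utf16_to_codepoints(units):
    """Combine UTF-16 code units into codepoints.

    Valid surrogate pairs become one codepoint >= 0x10000;
    unpaired surrogates are returned as they are.
    """
    result = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            high = unit - 0xD800
            low = units[i + 1] - 0xDC00
            result.append(0x10000 + (high << 10) + low)
            i += 2
        else:
            result.append(unit)
            i += 1
    return result


# "A" followed by U+1F600, encoded as the surrogate pair D83D DE00
print([hex(cp) for cp in utf16_to_codepoints([0x0041, 0xD83D, 0xDE00])])
```

An API that treats each code unit on its own would look up D83D and DE00 separately, and neither exists in any cmap.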

An interesting side note of http://archives.miloush.net/michkap/arc ... 50680.html is:
in truth GDI handles nothing off the BMP anyway, so it is hardly an artificial split -- if you are using supplementary characters, you are using complex scripts as far as GDI is concerned
While not always true, it hints at having to carefully check for correct handling of codepoints in the supplementary plane.

So while GetGlyphIndices is explicitly said to lack UTF-16 support in the link above, this may apply to GetGlyphOutline as well.

Post by Maël »

So when considering this whole chaos, the solution is a bit complicated. Neither ScriptShape nor ScriptGetCMap will properly look up a glyph (or glyph sequence) for each codepoint. And GetGlyphOutline and GetGlyphIndices are limited to the BMP.

We have two goals:
  • Finding missing (=default) glyphs in a sequence of glyphs for a run, so we can find a substitution font
    • This has another sub-problem, as there was a Mozilla bug that said disabling complex shaping, which is necessary when unable to find another font which has the missing glyphs, would trigger a bug in Uniscribe. TODO: find this bug report and integrate here
  • Find a glyph or glyph sequence to properly render a codepoint in the non-complex rendering mode or "charmap mode"
    • Precomposed codepoints may be rendered by a font just fine, but require several glyphs -- we still need to use ScriptShape (but the bug mentioned above might cause problems)
    • Render control chars or non printable chars -- we have a basic (if unfinished) implementation of that
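To illustrate the second goal, here is a rough Python sketch of a lookup that first tries the cmap directly and then falls back to the Unicode canonical decomposition (e.g. ë as e + U+0308). This is only an approximation and an assumption of the sketch: what a font or Uniscribe actually does for precomposed characters is decided by the font's substitution tables and the shaper, not by Unicode normalization. The cmap is assumed to be an already parsed codepoint to glyph id dictionary, and the glyph ids in the example are made up.

```python
import unicodedata


def glyphs_for(codepoint: int, cmap: dict):
    """Return a list of glyph ids for codepoint, or None if not representable.

    cmap: parsed font cmap as {codepoint: glyph id} (hypothetical input).
    """
    if codepoint in cmap:
        return [cmap[codepoint]]
    # Fallback: try the canonical decomposition (base letter + combining marks).
    decomposed = unicodedata.normalize("NFD", chr(codepoint))
    if len(decomposed) > 1 and all(ord(ch) in cmap for ch in decomposed):
        return [cmap[ord(ch)] for ch in decomposed]
    return None  # would need the missing (default) glyph or another font


# A font that has "e" and the combining diaeresis, but no precomposed ë
cmap = {0x0065: 72, 0x0308: 500}
print(glyphs_for(0x00EB, cmap))  # two glyph ids: e + combining diaeresis
print(glyphs_for(0x4E2D, cmap))  # None: not covered at all
```

Since the lookup works on codepoints rather than UTF-16 code units, it is not affected by the surrogate pair limitation of ScriptGetCMap.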