Text column support for UTF-8, UTF-16 and other multi-byte text encodings (variable width encodings)

Wishlists for new functionality and features.
Maël
Site Admin
Posts: 1454
Joined: 12 Mar 2005 14:15

Re: Text column support for UTF-8, UTF-16 and other multi-byte text encodings (variable width encodings)

Post by Maël »

As in theory there can be arbitrarily many combining characters following a base code point, a reasonable limit must be introduced, especially in a hex editor that might interpret random data, or data that is not actually text. This is important for performance reasons, but also to avoid strange cases making the hex editor unresponsive due to endless parsing/searching for the end of a combining character sequence.

"Abuse" of combining sequences lead to so called Zalgo text: https://stackoverflow.com/questions/657 ... -text-work

The following Unicode report deals with exactly this issue, and suggests a limit of 30 combining characters (or more correctly, 30 non-starter code points), which results in a grapheme cluster of 31 code points when also counting the starting character/code point:
http://unicode.org/reports/tr15/#Stream ... ext_Format

The following link has a brief recap of Unicode, with the most relevant points, and was the first source I found regarding the defined limit of combining characters:
https://mathias.gaunard.com/unicode/doc ... _sequences
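
To illustrate, here is a minimal sketch (Python, not HxD code) of enforcing that limit, approximating non-starters by a non-zero canonical combining class. Note that the Stream-Safe Text Process in UAX #15 inserts U+034F (CGJ) to break overlong sequences rather than dropping marks, so this is only an approximation for display purposes:

Code: Select all

import unicodedata

MAX_NON_STARTERS = 30  # limit suggested by UAX #15 (Stream-Safe Text Format)

def truncate_combining_sequences(text: str) -> str:
    # Drop every non-starter (canonical combining class != 0) beyond
    # the 30th one after a starter, so a cluster has at most 31 code
    # points when counting the starting code point.
    out = []
    run = 0  # non-starters seen since the last starter
    for ch in text:
        if unicodedata.combining(ch) != 0:
            run += 1
            if run > MAX_NON_STARTERS:
                continue  # cut off excess combining marks
        else:
            run = 0
        out.append(ch)
    return "".join(out)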

Zalgo text or similar use of many combining characters might cause rendering issues in HxD, even when we limit the combining sequence to the number mentioned above, because a line's height might exceed the normal/average line height. Somehow we have to limit, clip, or otherwise handle this. Simply visually clipping the combining characters would falsify the information, though. On the other hand, a hex editor should provide a clear and controllable view, without ambiguous or unpredictable behavior.
Weird effects/bugs in script engines might cause combining characters to appear before the first code point that was printed, e.g., they will decorate the hex column that comes before the text column. Therefore the text column needs to be clipped. We have to make sure clipping does not cut off italic or bold text, or cause similar issues where the text is slightly larger than expected. We have to test whether TextWidth() and similar functions return the correct dimensions for styled text (or multi-script / multi-font text).
An example that causes buggy horizontal rendering (i.e., decoration before the x position of character "a"):

Code: Select all

ä̈̈̈̈̈̈̈̈̈̈̈̈̈̈̈̈̈̈̈̈̈̈̈̈̈̈̈̈̈̚
(In Firefox it just shows a long list of appended diaereses, but GDI TextOut will stack them upon each other and place the last combining char U+031A far in front of the letter "a", roughly 31/32 average character widths in front of it.)

Possible solutions for Zalgo text as implemented in gedit, SciTE, and the console/terminal: https://stackoverflow.com/questions/510 ... o-use-them

The idea is essentially to limit how much room combining marks can take up as glyphs. The limit is the font height, plus maybe some small padding, though usually fonts already reserve space for diacritical marks. Since a hex editor should have a fixed line height, we can use the one defined by the font (for a given font size) and possibly add a slight margin on top of that (the margin should be relative to the font size). Everything else will be cut off or, depending on the text renderer (and not controllable by us), drawn on top of each other, as shown in the pictures in the link above.
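
A trivial sketch of that line height computation; the padding factor is an assumption, not a measured value:

Code: Select all

def fixed_line_height(font_height_px: int, font_size_pt: float) -> int:
    # Fixed line height = the font's own height plus a small margin
    # relative to the font size; combining-mark glyphs stacking beyond
    # this height get clipped.
    margin = round(0.15 * font_size_pt)  # padding factor is an assumption
    return font_height_px + margin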

Possibly useful Uniscribe links:
https://stackoverflow.com/questions/tagged/uniscribe
https://stackoverflow.com/questions/540 ... y-the-font
http://clootie.ru/delphi/download_vcl.html#usp10

Complex string processing examples (HxD will not support bidi text and will have selection go only in one direction, as mentioned in earlier posts, but these examples might be useful to test grapheme cluster handling):
http://www.winedt.com/uniscribe.html

Displaying text in Uniscribe (all the necessary steps): https://docs.microsoft.com/de-de/window ... -uniscribe

Uniscribe versions:
https://scripts.sil.org/cms/scripts/pag ... beVersions

"Uniscribe: The Missing Documentation & Examples":
https://maxradi.us/documents/uniscribe/

Uniscribe in Wine: https://wiki.winehq.org/Uniscribe
Lists which scripts are complex (CJK is not, but Hangul is), and how font fallback is implemented for ScriptString*.
Might be worth a look to learn how to implement font fallback when using the "manual" non-ScriptString* APIs from Uniscribe, which don't do any font substitution.
See also: https://docs.microsoft.com/de-de/window ... t-fallback
WebKit's solution: https://stackoverflow.com/questions/168 ... t-language
https://superuser.com/questions/396160/ ... t-fallback
Maël
Site Admin
Posts: 1454
Joined: 12 Mar 2005 14:15

Re: Text column support for UTF-8, UTF-16 and other multi-byte text encodings (variable width encodings)

Post by Maël »

Combination and Application of Combining Marks (in the Unicode standard) have more detail on grapheme clusters and how they differ from combining character sequences, and also specify how to render and handle text segmentation.
The grapheme cluster represents a horizontally segmentable unit of text, consisting of some grapheme base (which may consist of a Korean syllable) together with any number of nonspacing marks applied to it.
A grapheme cluster is similar, but not identical to a combining character sequence. A combining character sequence starts with a base character and extends across any subsequent sequence of combining marks, nonspacing or spacing. A combining character sequence is most directly relevant to processing issues related to normalization, comparison, and searching.
Using grapheme base characters to find the start of a cluster boundary is not reliable, see notes under https://www.unicode.org/reports/tr29/tr29-27.html#GB10:
The Grapheme_Base and Grapheme_Extend properties predated the development of the Grapheme_Cluster_Break property. The set of characters with Grapheme_Extend=Yes is the same as the set of characters with Grapheme_Cluster_Break=Extend. However, the Grapheme_Base property proved to be insufficient for determining grapheme cluster boundaries. Grapheme_Base is no longer used by this specification.
Warning: Even though these notes come from the official Unicode documentation, they are not entirely correct if you review the normative definition of grapheme break properties: https://www.unicode.org/Public/UCD/late ... operty.txt
There, spacing as well as non-spacing code points are listed as extending code points (which define a non-boundary), while the quotes above claim this to be a difference between combining character sequences and grapheme clusters.

Therefore it is best to rely only on the algorithm outlined in https://www.unicode.org/reports/tr29/tr29-27.html

The limit of 31 code points for a combining character sequence (including the starter code point) is not directly applicable to grapheme cluster length. But we still make this approximation, as it seems a reasonable limit in general for creating complex composed characters. (It allows picking reasonable buffer sizes, and should cater for most linguistic and technical uses, as mentioned in the link in the post above that talks about combining character sequence limits.)
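
For testing grapheme cluster handling, here is a sketch using the third-party Python regex module, which implements \X (extended grapheme cluster per UAX #29), with the 31-code-point cap applied on top:

Code: Select all

import regex  # third-party module; the built-in re does not support \X

MAX_CLUSTER = 31  # starter + 30 non-starters, as approximated above

def clusters(text: str):
    # Split the text into extended grapheme clusters per UAX #29, then
    # cap each cluster's length so pathological input stays bounded.
    for m in regex.finditer(r"\X", text):
        yield m.group()[:MAX_CLUSTER]

print(list(clusters("a\u0308b")))  # ['ä', 'b'] - the mark stays with its base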
j7n
Posts: 11
Joined: 28 Jan 2019 18:26

Re: Text column support for UTF-8, UTF-16 and other multi-byte text encodings (variable width encodings)

Post by j7n »

Inserting bytes (changing the file size) is usually not an option if I am, for example, editing/translating a program file. I'd have to keep track of how many bytes were deleted or inserted and compensate for it, which could lead to a mistake. And if the file is large, it means writing many megabytes to disk.

Maybe variable-length UTF overwrite could be made to work over a selection, possibly spanning multiple rows. The editor would then be free to shift bytes within this selection; typing starts at the beginning of the selection (don't require it to be drawn backwards) and is possible until the selection is full. No such restriction in Insert mode. This is related to the other plan for a structural view, but the text editing should be very quick, without creating any template definition in advance if I never plan to edit this data again. This editing mode could even work in 8-bit mode, to make deletions faster. Currently, typing over a selection does nothing except clear it.

The name of Unicode is indeed not important. BMP only is fine. Maybe I'm wrong, but I don't think you need to consider which code points are defined, apart from surrogates, which have a special meaning, and maybe combining characters. This is more a concern for text converters and font editors. If the font has a symbol for the code, let it show. An option to disable combining character grouping in UTF-8 would also be useful.

I get the impression that Unicode was very useful in practice at the start, unifying all the Windows and DOS encodings. But now they have run out of things to do, and just add complexity for fun: dingbats, reversed text, smiley faces for the web.
Maël
Site Admin
Posts: 1454
Joined: 12 Mar 2005 14:15

Re: Text column support for UTF-8, UTF-16 and other multi-byte text encodings (variable width encodings)

Post by Maël »

j7n wrote: 20 Feb 2019 02:57 This is related to the other plan for a structural view, but the text editing should be very quick, without creating any template definition in advance if I never plan to edit this data again. This editing mode could even work in 8-bit mode, to make deletions faster. Currently, typing over a selection does nothing except clear it.
This could be either a string entry in the data inspector, or what I actually thought of as part of a structure view:
You just have a toolbar where you can define string synchronization points (the start position of a string for Shift-JIS, or start and length, to limit how many bytes can be modified, with optional padding), similar to how you format text in a word processor, i.e., setting the caret to a position or selecting a region, then pressing a button.
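
A rough sketch of how such a synchronization point could be modeled (Python, all names hypothetical):

Code: Select all

from dataclasses import dataclass

@dataclass
class StringRegion:
    # A "string synchronization point": a fixed byte region whose text
    # may be re-edited in place without changing the file size.
    start: int           # byte offset in the file
    length: int          # fixed size of the region in bytes
    encoding: str        # e.g. "shift_jis" or "utf-8"
    pad: bytes = b"\x00"

    def overwrite(self, data: bytearray, text: str) -> None:
        encoded = text.encode(self.encoding)
        if len(encoded) > self.length:
            raise ValueError("text does not fit into the fixed region")
        # Pad the remainder so the overall file size never changes.
        data[self.start:self.start + self.length] = (
            encoded + self.pad * (self.length - len(encoded)))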

Similarly, a quick and easy way to highlight portions and assign them datatypes could be implemented. So you don't formally define a whole structure like with a programming language, but do it in place, iteratively, and only where you need it.

Such a defined section allows re-editing strings if you made mistakes, without relying on a selection that goes away as soon as you make a new one.
With shortcut keys added, this could be even more efficient, and barely different from starting to edit right away. For example, a heuristic could try to guess the string length and let you accept it with another button press, or change it before confirming. A bit like template autocomplete in IDEs, where you fill in the template parameters by pressing the tab key, or just accept the guessed defaults.
j7n wrote: 20 Feb 2019 02:57 I get the impression that Unicode was very useful in practice at the start, unifying all the Windows and DOS encodings. But now they have run out of things to do, and just add complexity for fun: dingbats, reversed text, smiley faces for the web.
Unicode has grown very complex, indeed. I won't support emoticons for now, and probably no RTL text either. I am currently implementing it, and seeing what makes most sense as I go.
Tak_HK
Posts: 1
Joined: 23 Jun 2019 13:43

Re: Text column support for UTF-8, UTF-16 and other multi-byte text encodings (variable width encodings)

Post by Tak_HK »

Hi,
As a Chinese user of Big5/GB18030, I would like to suggest a hex/character presentation for your consideration:

Original text:
港交所(00388)與 MSCI 簽授權協議 擬推MSCI中國A股指數期貨

UTF-8:

Code: Select all

Offset (h)
00000000  E6B8AF E4BAA4 E68980 28 30 30 33 38 38 29     港交所(00388)
00000010  E88887 20 4D 53 43 49 20 E7B0BD E68E88        與 MSCI 簽授
0000001F  E6AC8A E58D94 E8ADB0 20 E693AC E68EA8         權協議 擬推
0000002F  4D 53 43 49 E4B8AD E59C8B 41 E882A1 E68C87    MSCI中國A股指
00000040  E695B8 E69C9F E8B2A8                          數期貨
UTF-16:

Code: Select all

Offset (h)
00000000  6e2f 4ea4 6240 0028 0030 0030 0033 0038     港交所(0038
00000010  0038 0029 8207 0020 004d 0053 0043 0049     8)與 MSCI
00000020  0020 7c3d 6388 6b0a 5354 8b70 0020 64ec      簽授權協議 擬
00000030  63a8 004d 0053 0043 0049 4e2d 570b 0041     推MSCI中國A
00000040  80a1 6307 6578 671f 8ca8                    股指數期貨
Personally, easy cross-referencing of characters to hex codes is more important than a grid layout (which I believe also existed for easy cross-referencing of single-byte characters). The UTF-8 example (not aligned because of variable-width characters) shows a nominal 16-byte layout with space-separated byte group(s) corresponding to the characters. Each line may have 15/16/17 bytes depending on the occurrence of multi-byte characters. This makes it easy to edit, because the user can see clearly when he/she is editing a multi-byte character.
For fixed-width encodings like UTF-16, the grid pattern can be maintained.
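
A rough sketch of the per-character grouping this layout implies (Python; continuation bytes are not validated here):

Code: Select all

def group_utf8_bytes(data: bytes) -> list[str]:
    # Split a UTF-8 stream into per-character byte groups, so each hex
    # group lines up with one character in the text column.
    groups, i = [], 0
    while i < len(data):
        b = data[i]
        if b < 0x80:            n = 1
        elif b >> 5 == 0b110:   n = 2
        elif b >> 4 == 0b1110:  n = 3
        elif b >> 3 == 0b11110: n = 4
        else:                   n = 1  # stray/invalid byte: show it alone
        groups.append(data[i:i + n].hex().upper())
        i += n
    return groups

print(group_utf8_bytes("港(".encode("utf-8")))  # ['E6B8AF', '28']
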
Thank you.

Regards,
Tak

Moderation: Put in code tags to preserve alignment.
Maël
Site Admin
Posts: 1454
Joined: 12 Mar 2005 14:15

Re: Text column support for UTF-8, UTF-16 and other multi-byte text encodings (variable width encodings)

Post by Maël »

I made a prototype for rendering that uses separator lines instead of removing spaces. I think the hex pairs become hard to read/separate when they are grouped together per grapheme ("character") and there are no spaces between them.

traditional-chinese-utf-8-segmentation.png

You could change the separator line colors, or maybe tune the display a bit. But this representation has the advantage of keeping the grid-like layout of the hex column.

Maybe it would be better to add a separator line at the end of each hex column line, when a grapheme ends there as well. For example, in the first line, "29" is a single byte encoding ")". Since there is a separator line on the next hex column line, I didn't want to have one as well directly after "29", to not be redundant. But maybe it's visually easier to read, since you don't need to go to the next line to be sure if the bytes encoding the grapheme end there or not.

A grapheme encoding that wraps around can be seen at the end of the 2nd line: "E6 AC 8A", with only "E6" being on the 2nd line and the rest on the 3rd line. Since there are no separator lines, it's visually clear that all those hex pairs form one unit.

Let me know what you think.

P.S.: Don't mind the . for the space, this is just a mockup, not a complete rendering handling all the details, where a space would be rendered as a white block (="empty space").
Maël
Site Admin
Posts: 1454
Joined: 12 Mar 2005 14:15

Re: Text column support for UTF-8, UTF-16 and other multi-byte text encodings (variable width encodings)

Post by Maël »

Here it is with the aforementioned vertical line delimiters, also at the end of line (compared to appearing at the start of the next line only, as in the previous post).
traditional-chinese-utf-8-segmentation-end.png
Maël
Site Admin
Posts: 1454
Joined: 12 Mar 2005 14:15

Re: Text column support for UTF-8, UTF-16 and other multi-byte text encodings (variable width encodings)

Post by Maël »

Self-synchronizing properties of UTF-8 (and lack of it in UTF-16)

UTF-8 is self-synchronizing at byte-level. UTF-16 is not self-synchronizing at byte-level, but at code-unit-level (i.e., UInt16 units).
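
A small sketch of why UTF-8 can resynchronize at the byte level: continuation bytes always match the pattern 10xxxxxx, which no lead byte does, so from an arbitrary byte offset we can skip forward to the next character start:

Code: Select all

def utf8_next_start(data: bytes, pos: int) -> int:
    # Skip continuation bytes (10xxxxxx) until the next lead byte.
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos

data = "港".encode("utf-8")      # E6 B8 AF
print(utf8_next_start(data, 1))  # 3: offsets 1 and 2 are continuation bytes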

Here is a simple example of this issue for UTF-16LE (assuming a file made of 4 bytes, given in hexadecimal notation):

Code: Select all

10 20 20 30
  • If you start at offset 0, the two bytes (10 20) result in the valid Unicode code point U+2010, which is a hyphen: ‐
    • Then the next valid code unit is at offset 2 (20 30), which is U+3020: 〠
  • If you start at offset 1, the two bytes (20 20) result in the valid Unicode code point U+2020, which is a dagger: †
    • Then the next code unit would start at offset 3 (30), but one byte alone does not make a valid UTF-16 code unit. So we would drop this byte or represent it with a replacement character, indicating this encoding error.
Quick and dirty summary:
10 20 20 30
can be interpreted as
‐〠 with a starting offset of 0, or as
†� with a starting offset of 1.
Both options are entirely valid interpretations regarding the UTF-16LE encoding alone. Which one was intended depends on how UTF-16LE strings are aligned in memory/file/disk, and this can keep changing throughout the stream.
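
The ambiguity can be reproduced directly, for example with Python's built-in codecs:

Code: Select all

data = bytes([0x10, 0x20, 0x20, 0x30])

# Alignment 0: (10 20) -> U+2010, (20 30) -> U+3020
print(data.decode("utf-16-le"))                        # '‐〠'

# Alignment 1: (20 20) -> U+2020, lone trailing byte 30 -> error
print(data[1:].decode("utf-16-le", errors="replace"))  # '†�'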

Problems that may arise (and solutions)
  • Regex searching with PCRE will require a UTF-16 encoded string as input, so a starting alignment needs to be set before searching.
    • Solution: A second search would have to be performed with a one-byte offset compared to the starting position of the first search (see the sketch after this list).
  • Similarly, for display there needs to be an offset option, because the string alignment is unknown and can change throughout the file.
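
A sketch of that double search, with Python's re standing in for PCRE; note the byte-offset mapping ignores surrogate pairs for simplicity:

Code: Select all

import re

def search_utf16le(haystack: bytes, pattern: str) -> list[int]:
    # Run the search once per byte alignment, since the alignment of
    # UTF-16LE strings within the file is unknown.
    offsets = []
    for align in (0, 1):
        text = haystack[align:].decode("utf-16-le", errors="replace")
        for m in re.finditer(pattern, text):
            # Map back to byte offsets (ignores surrogate pairs).
            offsets.append(align + 2 * m.start())
    return sorted(offsets)

data = bytes([0x10, 0x20, 0x20, 0x30])
print(search_utf16le(data, "\u2020"))  # [1]: the dagger starts at byte 1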
Maël
Site Admin
Posts: 1454
Joined: 12 Mar 2005 14:15

Re: Text column support for UTF-8, UTF-16 and other multi-byte text encodings (variable width encodings)

Post by Maël »

Zalgo text exposes glyph layout rendering bugs under Windows
Under Windows, no matter if using GDI's TextOut(), Uniscribe, or DirectWrite, there will be significant glyph layout bugs for Zalgo text when using certain fonts. Notably, the commonly used fixed-pitch font Consolas triggers those bugs.
Firefox, which has its own glyph layout algorithms (probably using Pango for glyph layout and HarfBuzz for glyph shaping, but Windows APIs for the actual text drawing), does not suffer from the same errors, even when using Consolas.

All programs under Windows, be it Visual Studio (its code editor), SciTE (Scintilla), Notepad, or WordPad, have issues with erroneously placing glyphs for diacritical marks in horizontally wrong positions. Firefox only clips diacritical marks if the available vertical space is limited, but does not err on glyph layout (the above-mentioned programs will clip, but also put "overflowing" glyphs for diacritical marks in front of the actual base character (really, its glyph)).

The following shows a rendering of the word "Claudio" where each letter C, l, a, u, d, i, and o has several diacritical marks applied -- so many that there will not be enough vertical space to print them all when using the default line height for the given font size. The correct behavior is to clip overflowing glyphs, or to assign more vertical space. Firefox renders without glyph layout bugs when enough vertical space is available, and does clipping when necessary due to space constraints, such as when selecting text.

Correct (Consolas 12pt, Firefox) - no clipping, and no glyph layout bugs:
Claudio-Consolas-12pt-Firefox.png
Incorrect (Consolas 12pt, WordPad) - no clipping, but glyph layout bugs:
Claudio-Consolas-12pt-WordPad.png
The drawing bugs in WordPad, SciTE etc. are less obvious when there are no leading spaces, because the misplaced leading glyphs will then just be clipped.
The only correct glyph layout of the diacritical marks is to stack them vertically, not to prepend them or otherwise make them overflow horizontally.

Since we are making a hex editor, we want a fixed line height, so when there are too many diacritical marks stacked upon each other to fit in one line, they should be clipped, so as not to overflow into the lines below.

Unfortunately, there is no way to fix the glyph layout bugs without pulling in large dependencies such as Pango or HarfBuzz, which means text rendering will suffer from misplaced diacritical marks that overlay the text or space preceding them, instead of stacking upon the base characters they were meant for. I.e., it will result in messy, barely readable or unreadable text.

While this cannot be fixed, its effect can be limited by clipping the text column so that no diacritical marks overflow into the hexadecimal column.

No clipping (TextOut()), HxD prototype (Consolas 11pt):
Claudio-Consolas-11pt-HxD-Prototype-No-Clipping.png
Clipping (TextRect()), HxD prototype (Consolas 11pt):
Claudio-Consolas-11pt-HxD-Prototype-Clipping.png
Looking closely, it can be seen that there is also vertical overflow; for example, the fourth line ("d") hides part of the first line ("au"), because the line height is not sufficient. However, the line height is chosen based on the font's suggested line height, and since we don't want a variable line height like in word processors, the only option would be to increase the overall line height, and otherwise also clip vertically.

The second picture does clip horizontally and vertically.

Interestingly, clipping also changes the glyph layout (a Windows text rendering bug). Notice how in the second line, the "l" has additional pixels directly under it, as opposed to the non-clipped version. Zooming in makes it obvious that those additional pixels are not from the line below (they are too close to the "l").

Another related issue is that text will render differently if we have a line break in front of each of the base characters (C, l, a, u, d, i, o), because clipping will remove some diacritical marks (or rather, their glyphs), while when rendering an entire line, the misplaced diacritical marks will remain and overlay/stack upon the wrong base characters/glyphs.
Claudio-Consolas-11pt-HxD-Entire-Line-Vert-And-Horizontal-Clipping.png
The entire-line rendering has additional pixels on top of the base characters/glyphs, and looks similar to the WordPad rendering above, but has clipping applied, cutting off overflowing glyphs (as intended).
It is apparent how this rendering differs from the one with line breaks, even though with proper diacritical layout it shouldn't differ.

Chosen fonts affect rendering bugs
As rendering bugs depend on the chosen font, a simple solution would be to require fonts that don't trigger any bugs. Most variable-pitch fonts seem to work well, such as Arial, Calibri, or Segoe UI. Others do not work well. MS Sans Serif or Lucida Bright (but also its fixed-pitch cousin Lucida Console) will render black/empty boxes for the diacritical marks, instead of stacking them on the base glyph. But that's acceptable, since at least it doesn't create unreadable text or wrong overlays. Segoe (without the UI) misplaces some glyphs, but at least creates no wrong overlays; rendering will however vary slightly when selecting in WordPad (this seems to be due to different clipping when selecting vs. when printing the whole text -- maybe a Uniscribe effect, where the rendering is decomposed using ScriptItemize and ScriptPlace). Fortunately, Consolas does not seem to render differently when selecting.

At first Liberation Mono seems to work, but it has selection drawing issues. (See below, it seems to affect every font.)
Source Code Pro has OK rendering (no erroneous overlays making reading hard, though some diacritical marks are misplaced), but shows many empty boxes.

Microsoft Sans Serif works well, besides the selection clipping issues. But maybe they can be mitigated when using Uniscribe and drawing the glyphs with clipping disabled (only clipping for the entire line, not individual glyphs or groups of glyphs).

Roboto Cn completely ignores diacritical marks and renders the text simply as "Claudio", without any layout hints or other signs that some glyphs might be missing.

Tahoma works well too, but has selection clipping issues, as do Segoe UI, Arial, and Calibri after further checks. Hopefully this can be fixed with manual use of Uniscribe, instead of relying on TextOut or TextRect and friends.

All this analysis makes clear how important proper font selection is.

It might also be desirable to render the hex column and the text column with different fonts. Variable-pitch fonts seem to have better Unicode support in general, and might allow for more natural rendering of some scripts, such as Arabic. Therefore, the hex column could be drawn with a fixed-pitch font, while the text column could be drawn with a variable-pitch font.
Maël
Site Admin
Posts: 1454
Joined: 12 Mar 2005 14:15

Re: Text column support for UTF-8, UTF-16 and other multi-byte text encodings (variable width encodings)

Post by Maël »

Rendering combining characters separately
As can be seen from the previous explanations, text rendering is quite complex, and the association between code points and rendered output is not visually obvious. In hex editors this association is often desired, since a hex editor is seen as a sort of table, where a cell from one "sub-table" (hexadecimal) should match the cell from the other "sub-table" (text).

This general notion is challenged by Unicode due to many conceptual changes, such as variable-width encodings, variable-width glyphs ("double width" glyphs in Chinese vs. "single width" glyphs in Latin vs. Arabic with its ligatures), but also combining characters, as mentioned above.

While the caret can be moved to detect segments such as grapheme clusters (e.g., ä, which is made of "a" and a diaeresis), it cannot be used to navigate the combining characters themselves. Normal text editing is not designed for such fine-grained navigation, and only allows deleting combining characters individually with the backspace key.

A possible solution would be to provide navigation similar to the ones available in visual formula editors (horizontal as well as vertical navigation to go into all sub- and superscript parts). However, this would interfere with easily navigating between lines, and make navigation in general quite complex. As many diacritical marks will be clipped off anyway, due to the fixed line height constraint, this makes even less sense.

A good solution would be to provide an alternative rendering mode, where each code point is rendered individually, instead of grapheme clusters being rendered as combined glyphs. This would almost restore the table assumption of hex editors, except for the variable-width encoding issues, which I addressed with gray vertical separators (see posts above).

This could be done by drawing code points individually, or adding a dotted circle (https://en.wikipedia.org/wiki/Dotted_circle) for combining characters, for clarification. Orphan combining characters (those more than 30 code points away from the base character) should definitely be drawn in their own "character" cell, with or without a dotted circle as reference (for small font sizes, without is probably best). This ensures proper individual rendering and avoids accidentally combining them with other glyphs (such as other preceding diacritics), depending on selection for example.
Uniscribe needs to be used for that, with ScriptItemize or ScriptPlace.
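
As a sketch of the data side of this mode (the actual drawing would go through Uniscribe), combining marks could be given a dotted circle as a neutral base; checking the general category "M" is an approximation of "combining character":

Code: Select all

import unicodedata

DOTTED_CIRCLE = "\u25CC"

def codepoint_cells(text: str) -> list[str]:
    # One display cell per code point; combining marks get a dotted
    # circle as a neutral base so they cannot attach to a neighbor.
    cells = []
    for ch in text:
        if unicodedata.category(ch).startswith("M"):
            cells.append(DOTTED_CIRCLE + ch)
        else:
            cells.append(ch)
    return cells

print(codepoint_cells("a\u0308"))  # ['a', '◌̈'] - base and mark separated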

Having two rendering options (visually combined grapheme clusters and ligatures vs. individually drawn code points) allows for readable output (with grapheme clusters rendered as one entity/combined glyph), but also for detailed editing and a precise understanding of the underlying data or possible errors in files, which can then be clearly diagnosed. Switching back and forth between these renderings would further a clear understanding and ease of use.

An issue might be RTL text, such as Arabic. I believe ligatures can only be applied when rendering right to left. But that will have to be seen, later.
Maël
Site Admin
Posts: 1454
Joined: 12 Mar 2005 14:15

Re: Text column support for UTF-8, UTF-16 and other multi-byte text encodings (variable width encodings)

Post by Maël »

Advantage of solving UTF-8 first

When all the issues are solved for UTF-8, from text output to the screen, to navigation, editing, searching/replacing and printing, most of the solutions will be applicable to other encodings. This is because many problems will be solved on the code point level, not the code unit level.

So only the code unit layer must be adapted for each encoding, while the rest is solved generally, at the Unicode level.

Especially for searching, that would mean there is a layer that transforms a raw byte stream into a sequence of code points, and the search is then performed on these code points. Alternatively, because PCRE requires either UTF-8 or UTF-16 (and not UTF-32), there would be an on-the-fly conversion from the source encoding to UTF-8, upon which the other operations are performed.
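
A sketch of that conversion layer, with Python's codecs standing in for the actual decoder:

Code: Select all

import codecs

def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    # Decode the source encoding to code points, then re-encode as
    # UTF-8 so a UTF-8 based engine such as PCRE can search the result.
    decoder = codecs.getincrementaldecoder(source_encoding)(errors="replace")
    text = decoder.decode(raw, final=True)
    return text.encode("utf-8")

print(to_utf8(b"\x10\x20\x20\x30", "utf-16-le"))  # b'\xe2\x80\x90\xe3\x80\xa0'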
Maël
Site Admin
Posts: 1454
Joined: 12 Mar 2005 14:15

Re: Text column support for UTF-8, UTF-16 and other multi-byte text encodings (variable width encodings)

Post by Maël »

From http://archives.miloush.net/michkap/arc ... 11340.html:
Fun wrappers around text rendering like Uniscribe, TabbedTextOutW, and DrawTextExW, will end up being treated in XP SP2, Vista, and other recent platforms as a complex script, while both the simpler (e.g. SetWindowTextW) and lower level (e.g. ExtTextOutW) functions will treat it like it is not.
So DrawTextExW and ExtTextOutW render text differently, at least when the linked post was made; newer Windows versions may differ. The point remains that DrawTextExW should be tested against, too, to see which rendering is correct or faulty, or which is less faulty and can be relied on more...

In the end I'll use Uniscribe anyway, but for tests comparing rendering, like a few posts up, it would make sense to include DrawTextExW.
Maël
Site Admin
Posts: 1454
Joined: 12 Mar 2005 14:15

Re: Text column support for UTF-8, UTF-16 and other multi-byte text encodings (variable width encodings)

Post by Maël »

Interesting problem for Arabic text:
More on cursor movement
وِوِوِوِوِوِوِوِوِوِوِوِ
Notepad (in current Windows versions, at least 8 and up) behaves correctly regarding the backspace and delete keys, as you would expect: the delete key removes an entire grapheme cluster, while backspace removes each code point/combining character individually.

However, Firefox and Notepad disagree on bidi handling. Pos1/Home and End behave differently in FF and Notepad (then again, in the text box in the post linked above, FF seems to behave the same; maybe it depends on the default bidi setting, which FF keeps as LTR even if the whole text in the search box is Arabic, yet sets to RTL for the test text box in the linked post). Prefer the Notepad logic, since HxD is a Windows application.

Also from the linked post:
in all cases other than the BACKSPACE character (for the reasons I describe here), you would want to have movement jump the text element boundaries, which would be those two characters you mentioned....
The ligature fi mentioned in the comments is a single code point, not a composed one, and can only be decomposed (compatibility) into the ASCII characters f and i. So fi would be removed entirely by one stroke of backspace or delete, and f and i would be removed normally, like any other ASCII code point. In other words, such ligatures don't need special handling, since they are individual code points without a canonical decomposition.
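
This can be checked with Python's unicodedata: U+FB01 has only a compatibility decomposition, so canonical normalization leaves it alone:

Code: Select all

import unicodedata

fi = "\uFB01"  # LATIN SMALL LIGATURE FI, a single code point
print(unicodedata.decomposition(fi))      # '<compat> 0066 0069'
print(unicodedata.normalize("NFC", fi))   # 'ﬁ' - no canonical decomposition
print(unicodedata.normalize("NFKC", fi))  # 'fi' - compatibility only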

Are ligatures supposed to be thought of as 'single characters' ?
Offhand I might really expect NO in this case (since there really is a difference between a base character/its attached combing characters and two base characters that happen to have a ligature defined in a particular font).
Caret movement:
The Windows logical-caret is by far easier to implement than the visual caret apparently recommended by Apple (in Mac OS), Sun (in Java) and by the independent i18n teams for projects like GTK+. It's not impossible that they're all relying on an incorrect intuition, but it does seem more likely that the early Windows BiDi code was just lazy...
the left arrow moves right, and vice versa.
Confirmed. Notepad inverts the meaning of the left and right arrow keys for RTL text, while Firefox does not. However, Pos1/Home and End behave the same in Notepad and FF, which makes sense. Currently, I prefer Firefox's behavior; it is less confusing. I have yet to test the behavior for mixed bidi text.

See also: http://archives.miloush.net/michkap/arc ... 11194.html
Maël
Site Admin
Posts: 1454
Joined: 12 Mar 2005 14:15

Re: Text column support for UTF-8, UTF-16 and other multi-byte text encodings (variable width encodings)

Post by Maël »

More interesting comments from James Brown from catch22:
Well I'm still confused about this whole thing :-) I'm now looking at Word 2003 and notice that it has the same behaviour as Notepad. I am using the following Unicode code-point from the Arabic script:

064a 064f 0633 0627 0648 0650 064a

I have included a link to 2 GIFs on my site which illustrate this string rendered in Word2003:

http://www.catch22.net/tuts/img/a1.gif

For the image above, I set the "Show Diacritics" option in Word's Complex-Scripts option-page tab. The diacritics (or whatever they are!!) are shown in orange.

http://www.catch22.net/tuts/img/a2.gif

This image shows the mouse-selection has moved half-way into one of the glyph clusters which contains the diacritic.

With Word, the cursor-keys move cluster-by-cluster. However the mouse allows the caret to be placed in the middle of clusters.

The low-level Uniscribe functions (ScriptShape, ScriptXtoCP) allow the caret to be placed mid-cluster. I still can't figure out if this is right or not. I'm trying to replicate this behaviour in my unicode text-editor - it's easy to get the caret placed mid-cluster (because that's what Uniscribe always does, I can't seem to tell it not to). The difficulty is rendering the glyph-cluster to appear as two "characters" even though it is really one (using two code-points though)...I have to draw the cluster twice, over the top of each other, using clipping to get the desired effect...its nasty. But someone at Microsoft obviously thinks its right because Word, and any app which uses ScriptString API (that includes Notepad) exhibits this behaviour. Help!

Michael, would you like to comment on this further?
and
ok I'm getting closer to understanding how the caret is getting placed in the middle of a cluster. This is a quote from the MSDN docs on Uniscribe in the section 'Notes on ScriptXtoCP and ScriptCPtoX':

"Cluster information in the logical cluster array is used to share the width of a cluster of glyphs equally among the logical characters they represent....For Arabic and Hebrew, caret positions are interpolated with clusters."

So what this means is, for certain scripts (Arabic+Hebrew I guess!), the caret position is obtained by dividing the cluster-width by the number of code-points which make up that cluster, and the caret is then "snapped" to one of these finer-grain boundaries. I guess this is also how text-renderers know how to draw a selection-highlight part-way through a cluster.

well I understand the mechanism, but still don't understand why it is necessary to split a discrete cluster into segments like this. Presumably it only makes sense for Arabic+Hebrew.
Maël
Site Admin
Posts: 1454
Joined: 12 Mar 2005 14:15

Re: Text column support for UTF-8, UTF-16 and other multi-byte text encodings (variable width encodings)

Post by Maël »

Side-note: UTF-8 Everywhere is not a good solution

While doing research I stumbled once again on the UTF-8 everywhere wish: http://utf8everywhere.org
std::string and char* variables are considered UTF-8, anywhere in the program.
Among the many reasons why UTF-8 does not solve the issues of transparently supporting Unicode, the assumption above is a major one.
A lot of legacy software assumes a single-byte encoding, not a multi-byte one. As such, it will perform text processing on such strings in ways that break a UTF-8 string. This issue can be seen on many websites, which shows that the implicit assumption of UTF-8 everywhere does not work (especially after converting legacy data by merely assuming all byte strings are UTF-8 encoded).
In practice, every piece of byte string handling and data has to be verified individually; assuming opaque data often ends in wrong assumptions and bugs.
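
A tiny illustration of such silent breakage, where byte-oriented legacy code cuts a UTF-8 string in the middle of a multi-byte sequence:

Code: Select all

name = "café".encode("utf-8")  # 5 bytes: 63 61 66 C3 A9

# Legacy code that assumes one byte per character and cuts the string
# to a fixed buffer size splits the multi-byte sequence silently:
truncated = name[:4]
print(truncated.decode("utf-8", errors="replace"))  # 'caf�'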

This is true for a lot of Latin-1 and Windows-1252 encoded text, which is/was quite widespread in the Western world. Forcing a conversion to UTF-16 will make bad handling of strings obvious, and will not hide it by assuming most text is ASCII anyway.

ASCII will fail visibly, too, which is good. UTF-8 favors failing silently. Admittedly, the issues with UTF-16 support that date back to treating it as UCS-2 are a problem, but less of a problem than badly treating single-byte character sets, since UCS-2 code is less common than code assuming plain ASCII or single-byte character sets/encodings. The latter code will be forced to be corrected, due to a datatype change for char and the string type, and rendering errors if strings/chars are still treated as single-byte encodings.

Assuming that most code just treats strings as an opaque datatype does not reflect the reality of most programs. Unix has case-sensitive file systems, Windows does not (NTFS stores the case, but the Windows API ignores it). Whether UTF-16 or UTF-8, in both cases a notion of case conversion is necessary, i.e., it is not an opaque data type. Similarly for some programming languages and their identifiers.

Also, more modern programming languages, such as Julia, use non-ASCII characters for operators, i.e., in the part of the program that is not just "user data". HTML has ASCII-only tags, so UTF-8 may have an advantage there. In the future, this restriction of non-ASCII text to user data only will decrease.