Text column support for UTF-8, UTF-16 and other multi-byte text encodings (variable width encodings)

Wishlists for new functionality and features.
Maël
Site Admin
Posts: 1124
Joined: 12 Mar 2005 14:15

Post by Maël » 16 Feb 2019 22:01

In theory, arbitrarily many combining characters can follow a base code point, so a reasonable limit must be introduced, especially in a hex editor that might interpret random data, or data that is not actually text. This is important for performance reasons, but also to avoid strange cases that make the hex editor unresponsive because it keeps parsing/searching for the end of the combining character sequence.

"Abuse" of combining sequences leads to so-called Zalgo text: https://stackoverflow.com/questions/657 ... -text-work

The following Unicode report deals with exactly this issue, and suggests a limit of 30 combining characters (or more correctly, 30 non-starter code points), which results in a grapheme cluster of 31 code points when also counting the starting character/code point:
http://unicode.org/reports/tr15/#Stream ... ext_Format
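
A minimal sketch of such a limit, written in Python only for brevity. It assumes that a "non-starter" is a code point whose canonical combining class is non-zero, which is what `unicodedata.combining()` reports. The function name and its parameters are made up for illustration and are not part of HxD.

```python
import unicodedata

MAX_NON_STARTERS = 30  # limit suggested by the Stream-Safe Text Format


def combining_sequence_end(text, start, max_non_starters=MAX_NON_STARTERS):
    """Return the index one past the combining sequence beginning at `start`.

    The scan stops at the first starter (combining class 0) or after
    `max_non_starters` non-starters, whichever comes first, so the amount
    of work per sequence is bounded even for random data.
    """
    end = start + 1
    limit = min(len(text), start + 1 + max_non_starters)
    while end < limit and unicodedata.combining(text[end]) != 0:
        end += 1
    return end


# "a" followed by 40 combining diaereses, then a plain "b".
zalgo = "a" + "\u0308" * 40 + "b"
first_end = combining_sequence_end(zalgo, 0)
second_end = combining_sequence_end(zalgo, first_end)
```

With this input the first sequence is cut off after 31 code points (the starter plus 30 non-starters), and the remaining diaereses form a second, shorter sequence.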

The following link has a brief recap of Unicode, with the most relevant points, and was the first source I found regarding the defined limit of combining characters:
https://mathias.gaunard.com/unicode/doc ... _sequences

Zalgo text, or similar use of many combining characters, might cause rendering issues in HxD even when we limit the combining sequence to the number mentioned above, because a line's height might exceed the normal/average line height. Somehow we have to limit this, clip it, or otherwise handle it. Simply clipping the combining characters visually would falsify the information, though. On the other hand, a hex editor should provide a clear and controllable view, without ambiguous or unpredictable behavior.
Weird effects/bugs in script engines might cause combining characters to appear before the first code point that was printed, e.g., they would decorate the hex column that comes before the text column. Therefore the text column needs to be clipped. We have to make sure that clipping does not cut off italic or bold text, or cause similar issues with text that is slightly larger than expected. We have to test whether TextWidth() and similar functions return the correct dimensions for styled text (or multi-script / multi-font text).
An example that causes buggy horizontal rendering (i.e., decoration before the x position of character "a"):

ä̈̈̈̈̈̈̈̈̈̈̈̈̈̈̈̈̈̈̈̈̈̈̈̈̈̈̈̈̈̚
(In Firefox it just shows a long list of appended diaereses, but GDI TextOut will stack them upon each other and place the last combining char, U+031A, far in front of the letter "a", roughly 31/32 average-width characters in front of it.)

Possible solutions for Zalgo text as implemented in gedit, SciTE, and the console/terminal: https://stackoverflow.com/questions/510 ... o-use-them

The idea is essentially to limit how much room combining marks can take up as glyphs. The limit is the font height, plus maybe some small padding, but usually fonts reserve space for diacritical marks. Since a hex editor should have a fixed line height, we can use the one defined by the font (for a certain font size) and possibly add some slight margin on top of that (margin should be relative to font size). Everything else will be cut off, or, depending on the text renderer (and not controllable by us), be drawn on top of each other, as shown in the pictures in the link above.
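
As a rough illustration of the fixed line height idea: the clip rectangle of a text row can be derived from the font height plus a margin that scales with the font size. This is only a sketch in Python; the function name, the parameters, and the default margin ratio are assumptions, and real code would take the font height from the font metrics of the text renderer.

```python
def row_clip_rect(row, left, width, font_height, font_size, margin_ratio=0.1):
    """Return (left, top, right, bottom) of the clip rectangle for one row.

    Every row gets the same height: the font's own height plus a small
    margin relative to the font size. Anything a combining mark draws
    outside this rectangle would be cut off.
    """
    margin = round(font_size * margin_ratio)
    line_height = font_height + margin
    top = row * line_height
    return (left, top, left + width, top + line_height)


# Third row (index 2) of a text column that is 200 pixels wide.
rect = row_clip_rect(row=2, left=100, width=200,
                     font_height=16, font_size=12, margin_ratio=0.25)
```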

Possibly useful Uniscribe links:
https://stackoverflow.com/questions/tagged/uniscribe
https://stackoverflow.com/questions/540 ... y-the-font
http://clootie.ru/delphi/download_vcl.html#usp10

Complex string processing examples (HxD will not support bidi text, and selection will only go in one direction, as mentioned in earlier posts, but these examples might be useful to test grapheme cluster handling):
http://www.winedt.com/uniscribe.html

Displaying text in Uniscribe (all the necessary steps): https://docs.microsoft.com/de-de/window ... -uniscribe

Uniscribe versions:
https://scripts.sil.org/cms/scripts/pag ... beVersions

"Uniscribe: The Missing Documentation & Examples":
https://maxradi.us/documents/uniscribe/

Uniscribe in Wine: https://wiki.winehq.org/Uniscribe
Lists of which scripts are complex (CJK is not, but Hangul is), and how font fallback is implemented for ScriptString*.
Might be worth a look to know how to implement font fallback when using the "manual" non-ScriptString* APIs from Uniscribe, that don't do any font substitution.
See also: https://docs.microsoft.com/de-de/window ... t-fallback
WebKit's solution: https://stackoverflow.com/questions/168 ... t-language
https://superuser.com/questions/396160/ ... t-fallback

Post by Maël » 17 Feb 2019 17:12

Combination and Application of Combining Marks have more detail on grapheme clusters and how they differ from combining character sequences, and also specify how to render text and handle text segmentation.
The grapheme cluster represents a horizontally segmentable unit of text, consisting of some grapheme base (which may consist of a Korean syllable) together with any number of nonspacing marks applied to it.
A grapheme cluster is similar, but not identical to a combining character sequence. A combining character sequence starts with a base character and extends across any subsequent sequence of combining marks, nonspacing or spacing. A combining character sequence is most directly relevant to processing issues related to normalization, comparison, and searching.
Using grapheme base characters to find the start of a cluster boundary is not reliable, see notes under https://www.unicode.org/reports/tr29/tr29-27.html#GB10:
The Grapheme_Base and Grapheme_Extend properties predated the development of the Grapheme_Cluster_Break property. The set of characters with Grapheme_Extend=Yes is the same as the set of characters with Grapheme_Cluster_Break=Extend. However, the Grapheme_Base property proved to be insufficient for determining grapheme cluster boundaries. Grapheme_Base is no longer used by this specification.
Warning: Even though these notes come from the official Unicode documentation, they are not entirely correct if you review the normative definition of the grapheme break properties: https://www.unicode.org/Public/UCD/late ... operty.txt
There, spacing as well as non-spacing code points are listed as extending code points (which define a non-boundary), while the quotes above claim this to be a difference between combining character sequences and grapheme clusters.

Therefore it is best to only rely on the algorithm outlined in https://www.unicode.org/reports/tr29/tr29-27.htm

The limit of 31 code points for a combining character sequence (including the starter code point) is not directly applicable to grapheme cluster length. But we still make this approximation, as it seems a reasonable limit in general for creating complex composed characters. (It allows picking reasonable buffer sizes, and should cater for most linguistic and technical uses, as mentioned in the link in the post above about combining character sequence limits.)
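
To make the buffer size argument concrete, here is a small Python calculation. It assumes the worst case of 4 bytes per code point, which holds for UTF-8 (4-byte sequences), UTF-16 (surrogate pairs) and UTF-32 alike; the names are illustrative only.

```python
MAX_CLUSTER_CODE_POINTS = 31  # 1 starter + 30 non-starters

# Worst-case number of bytes a single code point can occupy.
MAX_BYTES_PER_CODE_POINT = {"utf-8": 4, "utf-16": 4, "utf-32": 4}


def max_cluster_bytes(encoding):
    """Upper bound for the byte length of one length-limited cluster."""
    return MAX_CLUSTER_CODE_POINTS * MAX_BYTES_PER_CODE_POINT[encoding]


# Cross-check against real encoders with a code point outside the BMP.
worst_case = "\U00010000" * MAX_CLUSTER_CODE_POINTS
utf8_len = len(worst_case.encode("utf-8"))
utf16_len = len(worst_case.encode("utf-16-le"))
```

So under this approximation a buffer of 124 bytes is enough to hold any single cluster in these encodings.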

j7n
Posts: 11
Joined: 28 Jan 2019 18:26

Post by j7n » 20 Feb 2019 02:57

Inserting bytes (changing the file size) is usually not an option if I am, for example, editing/translating a program file. I'd have to keep track of how many bytes were deleted or inserted and compensate for it, which could lead to mistakes. And if the file is large, it means writing many megabytes to disk.

Maybe variable length UTF overwrite could be made to work over a selection, possibly spanning multiple rows. Then the editor is free to shift bytes within this selection, typing starts at the beginning of the selection (don't require it to be drawn backwards) and is possible until the selection is full. No such restriction in Insert mode. This is related to the other plan for a structural view, but the text editing should be very quick, without creating any template definition in advance if I never plan to edit this data again. This editing mode could even work in 8-bit mode, to make deletions faster. Currently, no other function happens if I type over a selection, and it is cleared.
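
A non-interactive sketch of this idea in Python: the new text is encoded, must fit into the selected byte range, and the rest of the range is padded, so the file size never changes. The function name, the padding byte, and the error handling are assumptions for illustration; an editor would apply the same rule keystroke by keystroke.

```python
def overwrite_in_selection(buf, sel_start, sel_len, text,
                           encoding="utf-8", pad=0x00):
    """Overwrite buf[sel_start:sel_start + sel_len] with the encoded text.

    Bytes may shift freely inside the selection, but nothing outside of it
    moves. Returns the number of padding bytes written (free space left).
    """
    if sel_start < 0 or sel_start + sel_len > len(buf):
        raise IndexError("selection lies outside the buffer")
    encoded = text.encode(encoding)
    if len(encoded) > sel_len:
        raise ValueError("text does not fit into the selection")
    padding = bytes([pad]) * (sel_len - len(encoded))
    # Same-length slice assignment: the buffer size stays unchanged.
    buf[sel_start:sel_start + sel_len] = encoded + padding
    return len(padding)


buf = bytearray(b"HEAD" + b"0123456789" + b"TAIL")
free = overwrite_in_selection(buf, 4, 10, "é漢")  # 2 + 3 bytes in UTF-8
```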

The name of Unicode is indeed not important. BMP only is fine. Maybe I'm wrong, but I don't think you need to consider which code points are defined, apart from surrogates, which have a special meaning, and maybe combining characters. This is more for text converters and font editors. If the font has a symbol for the code, let it show. An option to disable combining character grouping, also in UTF-8, would be useful.

I get the impression that Unicode was practically very useful at the start, unifying all the windows- and dos- encodings. But now they have run out of things to do, and just add complexity for fun: dingbats, reversed text, smiley faces for the web.

Post by Maël » 23 Feb 2019 16:58

j7n wrote (20 Feb 2019 02:57):
"This is related to the other plan for a structural view, but the text editing should be very quick, without creating any template definition in advance if I never plan to edit this data again. This editing mode could even work in 8-bit mode, to make deletions faster. Currently, no other function happens if I type over a selection, and it is cleared."
This could be either a string entry in the data inspector, or what I actually thought of as part of a structure view:
You just have a toolbar, where you can define string synchronization points (start pos of a string for Shift-JIS, or start and length, to limit how many bytes can be modified with optional padding), similar to how you format text in a word processor, i.e., setting the caret to a position or selecting a region, then pressing a button.

Similarly, a quick and easy way to highlight portions and assign them datatypes could be implemented. So you don't formally define a whole structure like with a programming language, but do it in place, iteratively, and only where you need it.

Such a defined section allows you to re-edit strings if you made mistakes, and you don't have to rely on a selection that goes away as soon as you make a new one.
When adding shortcut keys, this could be even more efficient, and barely different from starting to edit right away. For example, you could have a heuristic that tries to guess the string length, and allows you to accept it with another button press, or change it before confirming. A bit like template autocomplete in IDEs, where you fill in the template parameters by pressing the tab key, or just accept the guessed defaults.
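
One very simple form such a length heuristic could take, sketched in Python: scan from the synchronization point to the next terminator byte and accept the guess only if the range decodes cleanly in the chosen encoding. This assumes NUL-terminated strings in an encoding where a zero byte cannot occur inside a character (true for Shift-JIS and UTF-8, not for UTF-16); the function name and parameters are hypothetical.

```python
def guess_string_length(data, start, encoding="shift_jis", terminator=b"\x00"):
    """Guess the byte length of the string that begins at `start`.

    Returns the number of bytes up to (not including) the terminator, or
    None if those bytes are not valid text in the given encoding.
    """
    end = data.find(terminator, start)
    if end == -1:
        end = len(data)  # no terminator: assume the string runs to the end
    try:
        data[start:end].decode(encoding)
    except UnicodeDecodeError:
        return None
    return end - start


payload = "日本語".encode("shift_jis")
data = b"\x01\x02" + payload + b"\x00rest"
guess = guess_string_length(data, 2)
```

The user would then accept the proposed length or adjust it before confirming, as described above.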
j7n wrote (20 Feb 2019 02:57):
"I get the impression that Unicode was practically very useful at the start, unifying all the windows- and dos- encodings. But now they have run out of things to do, and just add complexity for fun: dingbats, reversed text, smiley faces for the web."
Unicode has grown very complex, indeed. I won't support emoticons for now, and probably no RTL text either. I am currently implementing it, and seeing what makes most sense as I go.

Tak_HK
Posts: 1
Joined: 23 Jun 2019 13:43

Post by Tak_HK » 23 Jun 2019 14:09

Hi,
As a Chinese user of Big5/GB18030, I would like to suggest a hex/character presentation for your consideration:

Original text:
港交所(00388)與 MSCI 簽授權協議 擬推MSCI中國A股指數期貨

UTF-8:

Offset (h)
00000000  E6B8AF E4BAA4 E68980 28 30 30 33 38 38 29     港交所(00388)
00000010  E88887 20 4D 53 43 49 20 E7B0BD E68E88        與 MSCI 簽授
0000001F  E6AC8A E58D94 E8ADB0 20 E693AC E68EA8         權協議 擬推
0000002F  4D 53 43 49 E4B8AD E59C8B 41 E882A1 E68C87    MSCI中國A股指
00000040  E695B8 E69C9F E8B2A8                          數期貨

UTF-16:

Offset (h)
00000000  6e2f 4ea4 6240 0028 0030 0030 0033 0038     港交所(0038
00000010  0038 0029 8207 0020 004d 0053 0043 0049     8)與 MSCI
00000020  0020 7c3d 6388 6b0a 5354 8b70 0020 64ec      簽授權協議 擬
00000030  63a8 004d 0053 0043 0049 4e2d 570b 0041     推MSCI中國A
00000040  80a1 6307 6578 671f 8ca8                    股指數期貨

Personally, I find easy cross-referencing between characters and hex codes more important than a grid layout (whose purpose, I believe, was also easy cross-referencing for single-byte characters). The UTF-8 example (not aligned, because of variable-width characters) shows a nominal 16-byte layout in which the bytes of each character are grouped together and the groups are separated by spaces. Each line may have 15, 16 or 17 bytes, depending on where multi-byte characters occur. This makes editing easy, because the user can clearly see when he/she is editing a multi-byte character.
For fixed-width code like UTF-16, the grid pattern can be maintained.
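
The UTF-8 layout above can be reproduced with a short Python sketch. The line-breaking rule is an assumption inferred from the example: a character is added to the current line only if that brings the line's byte count strictly closer to the nominal 16. The function names are made up for illustration.

```python
SAMPLE = "港交所(00388)與 MSCI 簽授權協議 擬推MSCI中國A股指數期貨"


def grouped_hex_rows(text, encoding="utf-8", nominal=16):
    """Split text into rows of whole characters, about `nominal` bytes each.

    Returns a list of (offset, [(encoded_bytes, character), ...]) tuples.
    """
    rows, current = [], []
    offset = size = 0
    for ch in text:
        encoded = ch.encode(encoding)
        # Start a new row if adding this character would not bring the
        # row length strictly closer to the nominal width.
        if current and abs(size + len(encoded) - nominal) >= abs(size - nominal):
            rows.append((offset, current))
            offset += size
            current, size = [], 0
        current.append((encoded, ch))
        size += len(encoded)
    if current:
        rows.append((offset, current))
    return rows


def format_row(offset, chars):
    """Format one row: offset, per-character hex groups, then the text."""
    hex_part = " ".join(encoded.hex().upper() for encoded, _ in chars)
    text_part = "".join(ch for _, ch in chars)
    return f"{offset:08X}  {hex_part}  {text_part}"


rows = grouped_hex_rows(SAMPLE)
lines = [format_row(offset, chars) for offset, chars in rows]
```

For the sample text this yields rows of 16, 15, 16, 17 and 9 bytes, starting at offsets 00, 10, 1F, 2F and 40, as in the example. The sketch does not pad the hex part, so the text column is not aligned.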
Thank you.

Regards,
Tak

Moderation: Put in code tags to preserve alignment.

Post by Maël » 24 Jun 2019 04:33

I made a prototype for rendering, that uses separator lines instead of removing spaces. I think it becomes hard to read/separate the hex pairs, when they are grouped together per grapheme ("character"), and there are no spaces between those hex pairs.

traditional-chinese-utf-8-segmentation.png

You could change the separator line colors, or maybe tune the display a bit. But this representation has the advantage of keeping the grid-like layout of the hex column.

Maybe it would be better to add a separator line at the end of each hex column line, when a grapheme ends there as well. For example, in the first line, "29" is a single byte encoding ")". Since there is a separator line on the next hex column line, I didn't want to have one as well directly after "29", to not be redundant. But maybe it's visually easier to read, since you don't need to go to the next line to be sure if the bytes encoding the grapheme end there or not.

A grapheme encoding that wraps around can be seen at the end of the 2nd line: "E6 AC 8A", with only "E6" being on the 2nd line and the rest on the 3rd line. Since there are no separator lines, it's visually clear that all those hex pairs form one unit.
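
Where the separator lines go in a fixed 16-byte grid can be sketched as follows (Python, for illustration only). The sketch uses code point boundaries in UTF-8, i.e. every byte that is not a continuation byte, as a stand-in for grapheme boundaries; for this sample without combining marks the two coincide. The function name is hypothetical.

```python
def separator_columns(data, bytes_per_row=16):
    """For each grid row, list the columns that get a separator on their left.

    A separator is placed before every byte that starts a UTF-8 sequence;
    continuation bytes (bit pattern 10xxxxxx) stay attached to their lead byte.
    """
    rows = []
    for row_start in range(0, len(data), bytes_per_row):
        row = data[row_start:row_start + bytes_per_row]
        rows.append([col for col, byte in enumerate(row)
                     if (byte & 0xC0) != 0x80])
    return rows


sample = "港交所(00388)與 MSCI 簽授權協議 擬推MSCI中國A股指數期貨".encode("utf-8")
separators = separator_columns(sample)
```

For the second row the last separator is at column 15, in front of "E6", and the third row has no separator before column 2, so "E6 AC 8A" reads as one unit across the line wrap.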

Let me know what you think.

P.S.: Don't mind the . for the space, this is just a mockup, not a complete rendering handling all the details, where a space would be rendered as a white block (="empty space").

Post by Maël » 28 Jun 2019 06:51

Here it is with the aforementioned vertical line delimiters, also at the end of line (compared to appearing at the start of the next line only, as in the previous post).
traditional-chinese-utf-8-segmentation-end.png
