Text column support for UTF-8, UTF-16 and other multi-byte text encodings (variable width encodings)

Wishlists for new functionality and features.
Maël
Site Admin
Posts: 1125
Joined: 12 Mar 2005 14:15


Post by Maël » 06 Feb 2019 16:49

This is going to become the summary thread for all the posts/topics related to this feature request.

Text column should support showing text decoded from UTF-8, UTF-16 and other multi-byte text encodings.

This request involves several issues regarding the implementation.

https://forum.mh-nexus.de/viewtopic.php?f=4&t=966

https://forum.mh-nexus.de/viewtopic.php?f=4&t=891

https://forum.mh-nexus.de/viewtopic.php?f=4&t=847

https://forum.mh-nexus.de/viewtopic.php?f=4&t=820

https://forum.mh-nexus.de/viewtopic.php?f=4&t=612

https://forum.mh-nexus.de/viewtopic.php?f=4&t=717

https://forum.mh-nexus.de/viewtopic.php?f=4&t=382

https://forum.mh-nexus.de/viewtopic.php?f=4&t=340

https://forum.mh-nexus.de/viewtopic.php?f=7&t=1007

Some details are here:
https://forum.mh-nexus.de/viewtopic.php?f=4&t=238

https://forum.mh-nexus.de/viewtopic.php?f=4&t=162

https://forum.mh-nexus.de/viewtopic.php?f=4&t=173


From https://forum.mh-nexus.de/viewtopic.php?f=4&t=996
j7n wrote:
28 Jan 2019 18:54
4) Selectable Unicode (UTF-16LE) encoding for the text column. If possible, avoid pulling foreign Eastern characters from other large fonts for this view, to maintain speed. Possibly offer a mode where the byte pair starts at +1 byte to decode strings starting at odd address.
j7n wrote:
30 Jan 2019 01:07
I understand that variable length multibyte strings are difficult (like UTF-8). I'm not sure I'd like the text column to change width at all. Maybe implementing just Windows NT fixed 2 byte Unicode is better than having none at all. This would permit reading strings in Windows executables and related resources, in most languages. The text column doesn't handle multi byte symbols anyway (Tab, CR-LF).
j7n wrote:
06 Feb 2019 10:13
For future Unicode decoding, please try to avoid showing symbols from Windows font substitution in Vista/Seven+, which result in cluttered and slow scrolling view:
[Image: gsL0BCx.png]
I was convinced there was a topic about multi byte characters that discussed the details on the complications, but I can't find it anymore. Maybe it was a mail... (I'll add it later when I find it.)


Re: Text column support for UTF-8, UTF-16 and other multi-byte text encodings

Post by Maël » 07 Feb 2019 10:12

The complexities of correctly implementing Unicode text editing are shown here: https://www.catch22.net/tuts/neatpad and here: https://www.catch22.net/tuts/neatpad/un ... xt-editing

This does however not handle the issues that come on top of that for a hex editor.

The non-straightforward relation between bytes (in hexadecimal) and Arabic text is shown nicely here: https://www.catch22.net/tuts/neatpad/in ... -uniscribe
[Image: editor1102[1].gif]


Re: Text column support for UTF-8, UTF-16 and other multi-byte text encodings

Post by Maël » 07 Feb 2019 10:21

Quotes from e-mails:
The main issue is that the one character = one byte assumption wouldn't hold anymore (which is the basis of hex editors), so all the drawing code, hit testing, selection etc. wouldn't work.
Because of this, I haven't yet seen a hex editor that does it properly.
Handling Unicode text properly is non-trivial, regarding the drawing as well as text editing, because it's impossible to guarantee that every character has the same width (some scripts are not made for fixed-width fonts, such as Arabic).
There are many other problems regarding how to present data in a meaningful way, because a hex editor presents x bytes per line, but sometimes a line may contain no character at all that can be shown (invalid or partial byte sequences), or the bytes of a character might be split across several lines.
Searching would need to resynchronize, because in an arbitrary file, it's not guaranteed that all of it is text, so you might encounter invalid (in utf-8 sense) byte sequences.
All of this would require significantly reworking the drawing code.
Regarding encodings/character sets: multi-byte character sets are not supported on purpose currently, because you cannot always identify at what byte a character starts if you don't know where the entire text/string starts. That causes different interpretation of the bytes just depending on the current position in the file/editor. This causes drawing bugs, but also inconsistent behavior in search, sometimes a string can be found, sometimes not.

In short: multi-byte character sets are not trivial for a hex editor. Shift-JIS is a prime example of an encoding where you need to know the first byte of a text, you cannot just jump in the middle of a byte sequence and decode it correctly. UTF-8 etc. does allow this, but again this requires some important changes.
UTF-8 and other UTF-x encodings will definitely come, regarding the other old ones I have to see if that can be made to work predictably.
At the end of this mail I added some examples of why multi-byte encodings are a bad idea for hex viewers and editors.
The typical hex editor tabular layout was invented with single byte encodings in mind, and it proves impossible to extend properly.

Therefore an alternative view for such encodings has to be created that really caters to their special needs. This probably requires a partial structural view, a way to wiggle around with the start addresses of strings until their decoding makes sense, and getting closer to the variable-length lines of text editors (both hex column and text column lines would be variable length), including line breaks etc.
Also I write an editor, not a viewer, so I have to consider the special needs that arise from that, too.
End of mail:
A simple example is this:

The word "über" encoded as UTF-8 results in the bytes "C3 BC 62 65 72".
"C3 BC" are the two bytes needed for "ü".
Now assume this byte sequence starts just at the end of the hex column as shown below.


xx xx xx xx xx xx xx xx xx xx xx xx xx xx xx C3  xxxxxxxxxxxxxxxü
BC 62 65 72 yy yy yy yy yy yy yy yy yy yy yy yy   beryyyyyyyyyyyy
Now ü is shown on the first line, but its encoding is spread over two lines, therefore either there is a space free at the start of the next line or you have this:


xx xx xx xx xx xx xx xx xx xx xx xx xx xx xx C3  xxxxxxxxxxxxxxxü
BC 62 65 72 yy yy yy yy yy yy yy yy yy yy yy yy  beryyyyyyyyyyyyz
zz
In short, it is not possible to keep the tabular layout of hex editors, and all kind of presentation oddities will occur.
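
The split can be reproduced with a few lines of Python. This is only a sketch: `decode_rows` is a made-up helper, and decoding every row independently is an assumption about how a strictly tabular text column would behave, not a description of HxD.

```python
def decode_rows(data: bytes, bytes_per_row: int = 16) -> list[str]:
    """Decode every hex row on its own, as a strictly tabular text column would."""
    rows = [data[i:i + bytes_per_row] for i in range(0, len(data), bytes_per_row)]
    # errors="replace" turns each undecodable byte run into U+FFFD
    return [row.decode("utf-8", errors="replace") for row in rows]


# 15 filler bytes, then "über" (C3 BC 62 65 72), then 12 filler bytes
data = b"x" * 15 + "über".encode("utf-8") + b"y" * 12
for text in decode_rows(data):
    print(text)
```

Both rows end up with a replacement character where the "ü" should be, because C3 closes the first row and BC opens the second.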

Another issue is that many multi-byte encodings are not self-synchronizing. UTF-8 is, but UTF-16 is not when decoding may start at an arbitrary byte. That means it is impossible to decode a string properly if you start somewhere in the middle of its byte sequence. This is already problematic for text files, but there you could still scan from the start, even if this is inefficient; with certain encodings it's the only thing you can do.
Without structural definition, a hex editor is incapable of knowing where strings start, and therefore just starts decoding at the first offset visible in the hex view.
This can result in wrong decoding, but it can also result in weird behavior. Depending on where in the file you start your search you find certain strings or not (because the decoding depends on the start position). You can also have the odd case of using the search function to find a string, but the highlighted text of the supposed match in the text view does not display the text you searched, but another due to wrong decoding.

All of this would make it very unreliable and need lots of special considerations, many of them affecting performance in ways that are hard for the user to understand.
Most hex editors just ignore such mistakes, but I like to have predictable results.
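
The dependence on the start offset can be shown with Shift-JIS. A minimal sketch, assuming Python's built-in `shift_jis` codec; the test string was picked because its trail byte 0x9F is also a valid lead byte, so decoding from the second byte does not fail immediately but produces a different character.

```python
text = "亜亜"
raw = text.encode("shift_jis")  # bytes 88 9F 88 9F

# Decoding from the true start gives back the original text.
aligned = raw.decode("shift_jis")

# Decoding one byte later pairs 9F 88 instead of 88 9F.
shifted = raw[1:].decode("shift_jis", errors="replace")

print(aligned)
print(shifted)
```

A search for "亜" in the shifted decoding would find nothing, although the bytes are in the data.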


Re: Text column support for UTF-8, UTF-16 and other multi-byte text encodings

Post by Maël » 07 Feb 2019 10:32

Comments to this were:
Right. Typically, editors I have seen would have free space on the start of the next line. When you highlight individual bytes, it will be clear what text it goes to, even if it may be on the line above. I think most would prefer the text be the portion that becomes variably aligned, while keeping the hex data in strict columns as normal.
As I mentioned earlier, the character alignment restarts any time a byte sequence is encountered that is not found in the table. So, basically it re-syncs every time it encounters data. Additionally, what we do with graphics decoding is allow two keys to adjust the encoding offset up or down by one byte until it aligns in a fashion that the user judges to decode correctly.

j7n
Posts: 11
Joined: 28 Jan 2019 18:26

Re: Text column support for UTF-8, UTF-16 and other multi-byte text encodings

Post by j7n » 07 Feb 2019 14:45

I have a Western-centric view on Unicode and feel that most of the complexity that comes with foreign scripts is not necessary in a hex editor. A "unicode 1.0 lite" would be enough (and maybe more useful). Text editors/viewers that can load 'binary' data and synchronize with any text inside it exist and can be used instead.

When it comes to characters that combine into one, such as "u" followed by a "combining diaeresis", it is more useful to see the two codes separately than a restored letter "ü" (most strings would already have Latin letters precomposed). Any invisible control characters are useful to see without interpretation, such as writing direction marks. Writing direction marks (which can occur randomly) shouldn't affect how a selection with the mouse works. This makes it possible to diagnose problems when those symbols occur in codes like filenames, which look ordinary in complete text views.

For example, in WinHex I see a very basic Unicode. Maybe it is upgraded in later versions, but aside from glitchy graphic rendering, extra spaces and inability to select a font, this is sufficient. Combining diacritics are not reconstructed. Writing direction marks are shown as arrows and not followed. Drawing a selection over the Hebrew text in the text column behaves as over any other bytes (quite frustrating in a text editor).

I guess, with UTF-8, you would have to read back a few bytes before the current offset, and invalid characters can't be avoided on binary data.
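
Reading back a few bytes could look like the sketch below. The function names are made up for illustration; the only facts used are that UTF-8 continuation bytes have the bit pattern 10xxxxxx and that a sequence has at most three of them.

```python
def expected_length(lead: int) -> int:
    """Sequence length announced by a UTF-8 lead byte, 0 if it is not a lead byte."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0  # continuation byte or invalid lead


def sequence_start(data: bytes, offset: int) -> int:
    """Offset of the lead byte whose sequence covers `offset`, else `offset` itself."""
    for back in range(4):
        pos = offset - back
        if pos < 0:
            break
        if (data[pos] & 0xC0) != 0x80:  # first non-continuation byte going backwards
            return pos if expected_length(data[pos]) > back else offset
    return offset


data = "xüber".encode("utf-8")  # 78 C3 BC 62 65 72
print(sequence_start(data, 2))  # offset 2 is BC, the second byte of "ü"
```

A stray continuation byte in binary data is simply reported as its own start, so it can be shown as an invalid character.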

Sorry for repeating some of these points (and not using image attach). I will stop now.
[Attachment: Clipboard01.png]


Re: Text column support for UTF-8, UTF-16 and other multi-byte text encodings

Post by Maël » 07 Feb 2019 21:41

Thanks for your feedback (and attaching images).

Notepad++/Scintilla has a rather consistent selection behavior (always the same direction, no changes for RTL text), yet also renders RTL text in the right order. This would probably be ideal, yet of course it would require parsing ahead (possibly indefinitely, depending on file size/length of the RTL text).
Lister seems to be buggy when selecting. I am also not sure what the criterion is for the line break.
Writing direction marks are shown as arrows and not followed.
I could not see any in your example (though the font is quite small). Do you have any?

Some quick examples for direction marks I found are https://en.wikipedia.org/wiki/Right-to-left_mark and https://en.wikipedia.org/wiki/Arabic_letter_mark. But they would probably just be rendered like any control/non-printable characters currently, as well.


Re: Text column support for UTF-8, UTF-16 and other multi-byte text encodings

Post by Maël » 08 Feb 2019 01:39

j7n wrote:
07 Feb 2019 14:45
Text editors/viewers that can load 'binary' data and synchronize with any text inside it exist and can be used instead.
Many of the feature requests I get (related to this summary) are from people who want to see text in their natural language. Especially when searching text, the found occurrences should closely resemble the search pattern.
I guess you can make some relaxing assumptions, such as regarding the direction of selection, but it should not be a raw view that is too close to a direct mapping to Code Units (in the sense of Unicode). I.e. UInt16 and UInt8 units, for UTF-16 and UTF-8 respectively, are too low level.
j7n wrote:
07 Feb 2019 14:45
When it comes to characters that combine into one, such as "u" followed by a "combining diaeresis", it is more useful to see the two codes separately than a restored letter "ü" (most strings would already have Latin letters precomposed).
Many Asian languages are not readable in the uncombined form. And I wouldn't want to read a u with an umlaut separated from it. These characters get botched up enough everywhere.

Some examples, where you can see that non-combined characters are hard to read:


ㅇㅏㄴㅈ  vs.  앉
ㅂㅏㄹㅂ  vs.  밟
ㄱㅏㅂㅅ  vs.  값
https://stackoverflow.com/questions/198 ... an-letters
http://www.programminginkorean.com/prog ... n-unicode/

Edit: I misunderstood the above examples. The Hangul jamos (glyphs on the left side) are only combined to hangul syllables (right side) through normalization, not through grapheme clustering. This is because none of the jamo code points above are of GraphemeClusterBreak type L, V, or T. The examples below that show Hangul clusters use Hangul syllables of the L(ead), V(owel), or T(rail) type, and therefore get handled as a cluster.

HxD will not normalize text, as this would be too far from the low level code unit representation and would actually transform bytes/code units, instead of treating them (selecting, caret position, etc.) and showing them as a cluster. But it will handle grapheme cluster examples of Hangul, given below (and explained above, i.e., LVT syllables).
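
Why normalization is a transformation of the bytes, and not just a display choice, can be seen in a short sketch using Python's standard `unicodedata` module (the example string is my own choice):

```python
import unicodedata

decomposed = "u\u0308"  # "u" followed by COMBINING DIAERESIS
composed = unicodedata.normalize("NFC", decomposed)

print(decomposed.encode("utf-8").hex(" "))  # 75 cc 88
print(composed.encode("utf-8").hex(" "))    # c3 bc
```

Both strings render as "ü", but the byte sequences differ in content and length, so a view that normalized silently would no longer match the hex column.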

Apparently CJK characters (sinographs/chinese characters/ideographs) are never composed of several code points: https://www.unicode.org/faq/han_cjk.html#16
TODO: this seems not to be entirely correct: https://en.wikipedia.org/wiki/Enclosed_ ... and_Months
However HxD will probably not support this.

According to GraphemeBreakProperty-8.0.0.txt there are no CJK characters that form grapheme clusters, but there are Hangul syllables that form Hangul syllable sequences (grapheme clusters). Many other Asian scripts, such as Tibetan or Devanagari, as well as Middle Eastern scripts such as Arabic or Hebrew, are listed as creating grapheme clusters.

Browsers such as Firefox, or text processing apps like WordPad/Notepad and LibreOffice, will render a grapheme cluster as one unit, but allow removing code points from the end of the code point sequence when pressing backspace. Each press of backspace deletes one code point. Pressing the del key in front of a code point sequence will delete the entire grapheme cluster at once.
Selection and caret movement also always handles the entire grapheme cluster.

Notepad and WordPad are inconsistent in handling Hangul syllable sequences:
ᄀᆢᇹ
renders as one grapheme cluster in all the apps mentioned above. But
will render as individual code points in WordPad and Notepad, yet render/edit fine in Firefox or LibreOffice. It seems Uniscribe is at fault for that, and it's not clear if it can be convinced to handle this grapheme cluster properly (with the right options/API calls). DirectWrite also cannot render it properly.
Relevant link for Win10: https://en.wikipedia.org/wiki/Uniscribe ... ing_Engine

Uniscribe rendering:
[Image: wrong-uniscribe-rendering.png]
Correct rendering:
[Image: correct-firefox-rendering.png]

Firefox uses HarfBuzz for font shaping (yet DirectWrite/Uniscribe for rendering), which apparently works better in the last example above.
https://hg.mozilla.org/mozilla-central/ ... gfx/thebes has the relevant code, especially
https://hg.mozilla.org/mozilla-central/ ... zzShaper.h
and
https://hg.mozilla.org/mozilla-central/ ... Shaper.cpp

However, it does not seem to be trivial to implement or link as a library, and might be overkill considering all the additional effort that would be needed, even if this wrong rendering is a bit annoying.

Interesting read about complex scripts and shaping engines: http://tiro.com/John/Universal_Shaping_ ... POLabs.pdf


Other examples of grapheme clusters are:
नि
and:
ä
See BytesToValidUnicodeGraphemeClusters.dpr for details.

See also information about Korean (Hangul, which is not part of CJK):
http://www.unicode.org/faq/korean.html

See also: Why doesn't the Unicode Standard adopt a compositional model for encoding Han ideographs? Wouldn't that save a large number of code points?

More on grapheme clusters (what users would consider a character) and combining characters: http://unicode.org/faq/char_combmark.html
The FAQ also shows some examples of combining characters, that we can test on.
Especially relevant is: How are characters counted when measuring the length or position of a character in a string?, which provides also a good example for combining characters, that are not simple diacritics:

[Image: char_combmark_ex1[1].png]
The image shows a grapheme cluster made of the base letter "Devanagari Letter Na (U+0928)" and the combining mark "Devanagari Vowel Sign I (U+093F)", among other code points.

A default grapheme cluster is specified in UAX #29, Unicode Text Segmentation, as well as in UTS #18, Unicode Regular Expressions.
The glossary is useful for proper text segmentation, for finding grapheme base characters (i.e., starting points), and for understanding grapheme clusters and other related terms.
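
A heavily simplified approximation of grapheme clustering, just to illustrate the idea: attach every code point with a general category of M (mark) to the preceding code point. This is not the UAX #29 algorithm (no Hangul L/V/T rules, no control character handling), and `simple_clusters` is a made-up name.

```python
import unicodedata


def simple_clusters(text: str) -> list[str]:
    """Group each base code point with the combining marks that follow it."""
    clusters: list[str] = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch  # Mn, Mc or Me: belongs to the previous base
        else:
            clusters.append(ch)
    return clusters


# Devanagari NA + VOWEL SIGN I, then "a" + COMBINING DIAERESIS
sample = "\u0928\u093f" + "a\u0308"
print(simple_clusters(sample))
```

The four code points come out as two clusters, which is what selection and caret movement would operate on.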

Information about handling regex might be relevant as well, at least for searching or handling code point sequences.





For analyzing encoding oddities, I would rather think extending the data inspector would be more useful, and extending it to a structural view. There is a UTF-8 Codepoint entry already, that will explicitly mention every encoding error. More could be done along the same line.

You could also show a different representation like browsers do, when there are encoding errors, such as using replacement characters for byte sequences, that don't result in a validly encoded string.
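
Both ideas (listing every encoding error, and showing replacement characters) can be sketched with Python's standard decoder. `invalid_ranges` is a made-up helper; it relies on the `start` and `end` attributes of `UnicodeDecodeError` and is written for clarity, not speed.

```python
def invalid_ranges(raw: bytes) -> list[tuple[int, int]]:
    """Return (start, end) byte ranges that are not valid UTF-8."""
    ranges = []
    pos = 0
    while pos < len(raw):
        try:
            raw[pos:].decode("utf-8")
            break  # the rest decodes cleanly
        except UnicodeDecodeError as err:
            ranges.append((pos + err.start, pos + err.end))
            pos += err.end  # continue right after the bad bytes
    return ranges


raw = b"ab\xffcd\xfe"
print(invalid_ranges(raw))
print(raw.decode("utf-8", errors="replace"))
```

The ranges could feed a data inspector entry, while the second call gives the browser-like rendering with U+FFFD in place of the bad bytes.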

Lastly, it might be useful to have the ghost caret encompass/surround all the bytes that correspond to the currently highlighted glyph/insertion mark position. Similarly, when the hex column is active, the byte under the caret should result in a ghost caret in the text view, that (partially) highlights the corresponding glyph (how that partial highlighting is to be done is another question to figure out).
j7n wrote:
07 Feb 2019 14:45
Any invisible control characters are useful to see without interpretation, such as writing direction marks. Writing direction marks (which can occur randomly) shouldn't affect how a selection with the mouse works. This makes it possible to diagnose problems when those symbols occur in codes like filenames, which look ordinary in complete text views.
Control characters and non-printable characters will probably just be represented like now: by a replacement character such as a dot.
For example, in WinHex I see a very basic Unicode. Maybe it is upgraded in later versions, but aside from glitchy graphic rendering, extra spaces
The WinHex authors certainly have an impressive list of features, but they are not a role model for HxD. I have different goals, and the Unicode support is a good example for that.
The GUI in the screenshot you showed really does not make much of an attempt to deal with any of the issues I mentioned. It also makes it hard to read the text, including Latin-based scripts that are spaced apart too widely.

To diagnose and fix encoding issues, it might be more valuable to have a separate tool, like the screenshot from Neatpad above, so you can truly and easily see (and possibly fix in case of errors) the correspondence of hexadecimal pairs to glyphs and vice versa. In this case you could work with a defined selection of bytes, which would avoid all the issues related to huge files, and wrongly assumed start-offsets within them.
If combined with a structure view, this could be an alternative view for a common text edit control. You could also easily play around with start offsets for multi-byte encodings like UTF-8 or SHIFT-JIS.
Implementing something like this as a global view for the entire file (i.e., the text column) would however result in many issues as mentioned earlier.

For the more global view (i.e. the text column) I'd like a representation that is reasonably well readable directly (such as printing combining characters as one glyph), without focusing too much on the byte/2-byte/etc. to glyph correspondence, which is not an obvious/regular table-like representation anyways. It should approximate a text editor view as much as reasonable.

Finding out the right level of interpretation for text rendering encodings, such as UTF-8 and SHIFT-JIS, so search (and showing search results) works, is still a thing to be figured out. Especially because the starting offset (of the search/string matching) will affect whether certain strings will be found or not. The solution is likely that the same search has to be performed in parallel from several starting offsets, especially if you start the search somewhere random in the file.

There might be an option to improve efficiency by searching only for one of the possible alignments/synchronization schemes, if you know for sure a certain alignment exists. But this is unlikely to be true in too many cases and would be a possible later optimization, not a first implementation.
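
For UTF-16LE there are only two possible alignments, so searching from several starting offsets is easy to sketch. Everything here is an assumption for illustration: the name `find_utf16le_ci`, the restriction to patterns without surrogates, and the simple per-character lowercasing as a stand-in for real case folding.

```python
def find_utf16le_ci(data: bytes, pattern: str) -> list[int]:
    """Case-insensitive search for a BMP-only pattern at both byte alignments."""
    needle = [ch.lower() for ch in pattern]
    hits = []
    for alignment in (0, 1):
        # Interpret the data as 16-bit little-endian code units from this alignment.
        units = [int.from_bytes(data[i:i + 2], "little")
                 for i in range(alignment, len(data) - 1, 2)]
        # Surrogates never match a BMP-only pattern, so mark them as None.
        chars = [None if 0xD800 <= u <= 0xDFFF else chr(u).lower() for u in units]
        for i in range(len(chars) - len(needle) + 1):
            if chars[i:i + len(needle)] == needle:
                hits.append(alignment + 2 * i)  # convert back to a byte offset
    return sorted(hits)


data = b"\x01" + "Hello".encode("utf-16-le") + b"\x00" + "HELLO".encode("utf-16-le")
print(find_utf16le_ci(data, "hello"))
```

The first string sits at an odd offset and the second at an even one, so a search at a single alignment would report only one of them.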


Re: Text column support for UTF-8, UTF-16 and other multi-byte text encodings

Post by Maël » 08 Feb 2019 02:17

If a representation is chosen that begins the next line with a "space"


xx xx xx xx xx xx xx xx xx xx xx xx xx xx xx C3  xxxxxxxxxxxxxxxü
BC 62 65 72 yy yy yy yy yy yy yy yy yy yy yy yy   beryyyyyyyyyyyy
it should not be rendered as a space, since it will not be obvious if it is a real space character or just an alignment helper. To show an alignment helper some kind of drawing that relates back to the ü in the line above it should be used. Or maybe the same glyph (ü) should be repeated in a blended out style in the next line instead of using a blank.

A better option is a rendering like this:
[Image: hex-utf-8-overhang.png]
In this case there is never a space on the left side of the hex column, so all is left-aligned, but there can be an overhang on the right side, depending on how much space the glyphs take, or cause a ragged look, if some lines "compress" to more compact visual representations than their amount of bytes would in an ASCII encoding.

Another reason for the ragged look is overlong lines, which can appear when the sequence of glyphs is wider than an average line, such as a line of 16 Latin glyphs (assuming a Latin-1 encoding and 16 bytes per line), or a line of 8 Latin glyphs (assuming a UTF-16 encoding and 16 bytes per line).

In the pic above, ü is split over two lines: C3 is still on the same line as ü, but the other byte, BC, which also belongs to ü, no longer fits and wraps over to the next line. I call these overhang bytes, since they hang over the fixed end of the (hex) line, which is set by the bytes per line. For that reason such bytes, which technically still belong to the previous line (when viewed from the character/text point of view), are shown in italics.
Leaning/hanging over the end of line cliff = slanted/leaning over font style.
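
The overhang idea can be sketched as follows. This is only an illustration under assumptions: the input is treated as UTF-8, invalid bytes are shown as U+FFFD, a character is drawn on the row where its first byte lies, and `utf8_length` and `layout_rows` are made-up names.

```python
def utf8_length(lead: int) -> int:
    """Sequence length announced by a lead byte; 1 for anything invalid."""
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 1


def layout_rows(data: bytes, bytes_per_row: int = 16) -> list[tuple[str, int]]:
    """Return (text, overhang_bytes) per hex row."""
    row_count = (len(data) + bytes_per_row - 1) // bytes_per_row
    rows = [("", 0)] * row_count
    pos = 0
    while pos < len(data):
        n = utf8_length(data[pos])
        try:
            ch = data[pos:pos + n].decode("utf-8")
        except UnicodeDecodeError:
            ch, n = "\ufffd", 1  # show one invalid byte as a replacement character
        row = pos // bytes_per_row
        # Bytes of this character that lie beyond the end of its row.
        overhang = max(0, pos + n - (row + 1) * bytes_per_row)
        rows[row] = (rows[row][0] + ch, overhang)
        pos += n
    return rows


data = b"x" * 15 + "über".encode("utf-8") + b"y" * 12
for text, overhang in layout_rows(data):
    print(text, overhang)
```

For this data the first row ends with "ü" and reports one overhang byte (BC), and the second row starts directly with "ber" without a placeholder.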

A similar principle might be used for highlighting base abstract characters and combining characters. The base characters (or rather the corresponding hex pairs) would be in normal size, but the hex pairs for combining characters would be one pt (or whatever percentage) smaller. This way it would always be clear how many bytes belong to one grapheme cluster.

To make it more obvious how long a line is and if it ends with white space characters, the text column background color and the surrounding space should be another color. This way whitespace characters will have a distinct color compared to the surrounding/other parts of the hex editor display.


Re: Text column support for UTF-8, UTF-16 and other multi-byte text encodings

Post by Maël » 08 Feb 2019 09:09

From https://en.wikipedia.org/wiki/UTF-16:
String implementations based on UTF-16 typically return lengths and allow indexing in terms of code units, not code points. Neither code points nor code units correspond to anything an end user might recognise as a “character”; the things users identify as characters may in general consist of a base code point and a sequence of combining characters (or be a sequence of code points of other kind, for example Hangul conjoining jamos) – Unicode refers to this as a grapheme cluster – and as such, applications dealing with Unicode strings, whatever the encoding, have to cope with the fact that they cannot arbitrarily split and combine strings.
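
The different ways of counting can be made concrete with a small sketch (the example string is my own; Python's `len` counts code points):

```python
# "a" + COMBINING DIAERESIS, followed by MUSICAL SYMBOL G CLEF (U+1D11E)
text = "a\u0308\U0001D11E"

code_points = len(text)
utf16_units = len(text.encode("utf-16-le")) // 2
utf8_bytes = len(text.encode("utf-8"))

print(code_points, utf16_units, utf8_bytes)  # 3 4 7
```

A user would count two characters here, which matches none of the three numbers.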
Text segmentation should follow these rules:
https://www.unicode.org/reports/tr29/tr ... Boundaries
See also:
https://stackoverflow.com/questions/358 ... ead-of-one

Delphi apparently implements Unicode 5.2.0 (obvious when looking at TUnicodeCategory = 30 categories and the name of the include file System.Character_const.5.2.0.inc which is current for Delphi 19, and still implements the same categories as in Delphi XE3).
TUnicodeCategory matches https://www.unicode.org/reports/tr44/tr ... ory_Values even if the order is slightly different.

The matching rules and files can be found here:
https://www.unicode.org/reports/tr44/tr44-4.html
https://www.unicode.org/Public/5.2.0/ucd/
https://www.unicode.org/Public/5.2.0/uc ... operty.txt

Edit: I decided to implement Unicode 8.0.0:
Category values are the same as in 5.2.0 (so it remains compatible with Delphi XE3), but additional category groups were defined, such as L (which groups all letter categories), or M (which groups all combining marks).
https://www.unicode.org/reports/tr44/tr ... ory_Values
https://www.unicode.org/reports/tr44/tr44-16.html
https://www.unicode.org/Public/8.0.0/ucd/
https://www.unicode.org/Public/8.0.0/uc ... operty.txt

Format of GraphemeBreakProperty.txt is:
  • # introduces a comment
  • ; separates columns
  • the first column specifies a codepoint or a range of codepoints in hexadecimal, for example:
    • 000D is carriage return
    • 0000..0009 is the range of control characters from 0 to 9
  • the second column specifies the Grapheme_Cluster_Break property
  • the comments specify the character category and the name of the codepoint or the two names of the codepoints defining a codepoint range
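
A minimal parser for the format described above could look like this. It is a sketch, not the code used in HxD: the function name is made up, and the sample lines are written to follow the described layout.

```python
def parse_break_properties(lines):
    """Return (first_codepoint, last_codepoint, property) tuples."""
    entries = []
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop the comment part
        if not data:
            continue  # blank line or comment-only line
        code_range, prop = (field.strip() for field in data.split(";"))
        first, _, last = code_range.partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start  # single code point: range of one
        entries.append((start, end, prop))
    return entries


sample = [
    "# comment-only line",
    "000D          ; CR # Cc       <control-000D>",
    "0000..0009    ; Control # Cc  [10] <control-0000>..<control-0009>",
    "",
]
print(parse_break_properties(sample))
```

The resulting list can be sorted by start value and looked up with a binary search to get the Grapheme_Cluster_Break property of a code point.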

When rendering text, complex text layout and bidi reordering should be disabled. Only left-to-right printing should be used (also for Arabic or Hebrew), so that characters are printed individually and strings of characters can be split across several lines with a consistent look.

https://en.wikipedia.org/wiki/Complex_text_layout

Equally text fields should render text the same way (no complex text layout or bidi support) to have the most predictable behavior.

This should be acceptable, since complex script layout is about ligatures and contextual shaping of graphemes (which makes a script harder to mentally process without additional training, not easier to read, but looks "prettier"). Diacritics and other combining characters however are never written separately in normal language use, and will feel unfamiliar.

A remaining issue though is rendering of RTL languages in LTR order. This is probably confusing for speakers/writers of those languages. But it will be difficult to implement RTL correctly, when it's not always clear where a string starts. There can be Codepoints that change the LTR order to RTL (which affects where punctuation marks are actually printed, for example) and there may be incomplete strings in the currently visible portion of a hex view. So RTL rendering of a chopped off string might look confusing and change as we scroll. Fixing this would require parsing beyond the view limits, and possibly even far beyond, thus affecting performance in quite unpredictable ways, depending on the text.

Since text editors basically parse a file from start to end, or at least line start to end, this is less of an issue. But a hex editor cannot rely on such delimitations to be present, which means that, in the worst case, the entire file has to be analyzed; and in the not so rare case, it can still be quite a large portion of the file.

So for now LTR will be used for all strings, and we will see how much of a limitation it is.

In summary: The overall requirement for printing text in a hex editor is that we have a sequence of glyphs, that can be broken up into a stack of lines, such that splitting or joining glyph sequences will only cause a change in the amount of whitespace between glyph sequences, but no change to the glyphs themselves. Splitting up a glyph sequence at any point should also not cause any reordering of the glyphs, as might happen when we handle direction marks for RTL text. So all glyphs are always rendered from left to right.
The same requirements should hold for splitting up Codeunit strings (no reordering of glyphs as a result, or change in glyph shape, except when deleting combining characters).

These requirements mean we need to look at enough Codeunits before or after the current view to find all combining characters. But these are limited to a couple of Codeunits at most, so this should not create any noticeable performance issues.
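
Looking slightly beyond the view could be sketched like this. For simplicity the sketch works on already decoded code points instead of Codeunits and treats every code point of general category M as a combining character; both are assumptions, and the function names are made up.

```python
import unicodedata


def is_mark(ch: str) -> bool:
    return unicodedata.category(ch).startswith("M")


def extend_to_clusters(text: str, start: int, end: int) -> tuple[int, int]:
    """Widen the half-open range [start, end) so it does not cut off combining marks."""
    while 0 < start < len(text) and is_mark(text[start]):
        start -= 1  # the view begins on a mark: include its base character
    while end < len(text) and is_mark(text[end]):
        end += 1  # marks right after the view belong to the last visible base
    return start, end


text = "au\u0308b"  # a, u, COMBINING DIAERESIS, b
print(extend_to_clusters(text, 2, 3))
print(extend_to_clusters(text, 0, 2))
```

In both calls the range grows by exactly one code point, so the extra work stays small.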

To fulfill the requirements of "invariance" of glyph sequence splitting (whitespace stuff above), it might be necessary to use a fixed-width font. But probably this is not strictly necessary for the text part, since it will not be rendered as a table-like representation anyways, only the hexadecimal column will. If ligatures and other fancy typesetting, such as complex text layout / support for complex scripts is switched off, we should still be able to fulfill this invariance requirement.

Similar to using proportional fonts for rendering text, https://en.wikipedia.org/wiki/Halfwidth ... idth_forms might affect how we break lines, or how wide certain glyph sequences will be. This will have to be considered as well, for proper line splitting and computing text column width or handling horizontal scrolling, or automatic setting of bytes per line (=adapt-to-window-width option).
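
How many cells a glyph sequence occupies can be estimated from the East Asian Width property, which Python exposes as `unicodedata.east_asian_width`. The sketch assumes a terminal-like model (wide and fullwidth forms take two cells, non-spacing marks take none); a real renderer would have to measure the font instead.

```python
import unicodedata


def display_cells(text: str) -> int:
    """Estimate the width of `text` in fixed-width cells."""
    cells = 0
    for ch in text:
        if unicodedata.category(ch) in ("Mn", "Me"):
            continue  # non-spacing and enclosing marks add no width
        wide = unicodedata.east_asian_width(ch) in ("W", "F")
        cells += 2 if wide else 1
    return cells


print(display_cells("abc"), display_cells("日本"), display_cells("\uff41"))
```

The three calls cover plain Latin letters, wide ideographs, and a fullwidth Latin letter.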

Variation characters or selecting variation forms might be something to consider, but that's hardly a priority: https://en.wikipedia.org/wiki/Variant_form_(Unicode)
It could probably be handled similarly to Combining characters, as mentioned above.


Re: Text column support for UTF-8, UTF-16 and other multi-byte text encodings

Post by Maël » 08 Feb 2019 09:37

I hope I haven't missed any important properties of typesetting or Unicode that could affect text rendering in hex editors (especially as mentioned in the summary in the post just above this one).

Possible points I missed are listed in this table: https://en.wikipedia.org/wiki/Template: ... navigation

If you have more input, feel free to share.


Re: Text column support for UTF-8, UTF-16 and other multi-byte text encodings (variable width encodings)

Post by Maël » 09 Feb 2019 16:00

UTF-16 is self-synchronizing on code units level (16-bit words), but not on byte level.

https://en.wikipedia.org/wiki/UTF-16#U. ... U.2B10FFFF
Since the ranges for the high surrogates (0xD800–0xDBFF), low surrogates (0xDC00–0xDFFF), and valid BMP characters (0x0000–0xD7FF, 0xE000–0xFFFF) are disjoint, it is not possible for a surrogate to match a BMP character, or for two adjacent code units to look like a legal surrogate pair. This simplifies searches a great deal. It also means that UTF-16 is self-synchronizing on 16-bit words: whether a code unit starts a character can be determined without examining earlier code units (i.e. the type of code unit can be determined by the ranges of values in which it falls). UTF-8 shares these advantages, but many earlier multi-byte encoding schemes (such as Shift JIS and other Asian multi-byte encodings) did not allow unambiguous searching and could only be synchronized by re-parsing from the start of the string (UTF-16 is not self-synchronizing if one byte is lost or if traversal starts at a random byte).
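To make the "type of code unit can be determined by the ranges of values" part concrete, here is a minimal Python sketch. It is not HxD code; the function names are made up for illustration. It only shows that a single 16-bit code unit can be classified without looking at its neighbours.

```python
def classify_code_unit(unit: int) -> str:
    """Classify a 16-bit UTF-16 code unit purely by its value range."""
    if 0xD800 <= unit <= 0xDBFF:
        return "high surrogate"   # first half of a surrogate pair
    if 0xDC00 <= unit <= 0xDFFF:
        return "low surrogate"    # second half of a surrogate pair
    return "BMP character"        # a complete character on its own


def starts_character(unit: int) -> bool:
    """A code unit starts a character unless it is a low surrogate."""
    return classify_code_unit(unit) != "low surrogate"


# U+1F600 is encoded in UTF-16 as the surrogate pair D83D DE00.
units = [0x0041, 0xD83D, 0xDE00, 0x20AC]
kinds = [classify_code_unit(u) for u in units]
```

This works on 16-bit words only. If the traversal starts at an odd byte, the values fed into the classifier are already wrong, which is the byte-level limitation mentioned above.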

Maël
Site Admin
Posts: 1125
Joined: 12 Mar 2005 14:15

Re: Text column support for UTF-8, UTF-16 and other multi-byte text encodings (variable width encodings)

Post by Maël » 09 Feb 2019 16:24

Searching for Shift-JIS (and other MBCS) can be done when the search pattern is converted to the byte sequence following the Shift-JIS encoding, and then searching for the byte sequence, disregarding any character notions. Since there is no notion of characters, but only bytes, it doesn't matter where in the byte sequence we start.
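A rough sketch of that byte-level search in Python, assuming Python's built-in "shift_jis" codec is close enough to the encoding found in the file. The helper name find_encoded is hypothetical.

```python
def find_encoded(haystack: bytes, pattern: str, encoding: str = "shift_jis") -> list:
    """Return all byte offsets at which the encoded pattern occurs."""
    needle = pattern.encode(encoding)   # the search works on bytes only
    offsets = []
    pos = haystack.find(needle)
    while pos != -1:
        offsets.append(pos)
        pos = haystack.find(needle, pos + 1)
    return offsets


# Binary junk, then a Shift-JIS string, then more junk.
data = b"\x00\xff\x10" + "日本語".encode("shift_jis") + b"\x00\x7f"
hits = find_encoded(data, "本語")
```

Since the match ignores character boundaries, a hit can in principle start on the second byte of a two-byte character. That is the price for not having to know where characters begin.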

This will mean though that special features like case-insensitive searching or searching using regexes that somehow use the notion of a character (such as a range of characters a-z or \w) are probably impossible to make work. Maybe backreferences and similar could work, as long as the regex can be expressed so it only relies on byte values. But working around the limitations of Shift-JIS is very hard if not impossible, and probably won't be a focus.

Even after thinking about it for a long time, I found no good solution for displaying text that was encoded as Shift-JIS. Not only do we have to know which byte is the first byte (in a byte sequence of 1 or 2 bytes; Shift-JIS is a DBCS), because first and second bytes in Shift-JIS often overlap in values, but even if we know the first byte reliably, we can run into issues later: if there is an encoding error along the way, we might have to skip bytes and become unsynchronized again. This can happen when correctly encoded Shift-JIS strings are interrupted by random binary data (that does not happen to "accidentally" be valid Shift-JIS), which would not be too unusual for a binary file, such as an executable or game ROM.

So we essentially have arbitrarily many positions at which we can get unsynchronized, and don't know which byte is a first byte and which is a second byte. In other words, even if we find the right synchronization offset for one string, other strings that follow can still show up wrong.
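A small Python illustration of the synchronization problem, again assuming the built-in "shift_jis" codec. The sample text is arbitrary, and what exactly shows up in the misaligned case depends on the byte values, so no particular output is claimed here.

```python
text = "日本語のテキスト"
raw = text.encode("shift_jis")

# Decoding from the true start of the string reproduces the text.
aligned = raw.decode("shift_jis", errors="replace")

# Starting one byte late treats a second byte as if it were a first byte.
# Invalid sequences are shown as U+FFFD instead of raising an error.
shifted = raw[1:].decode("shift_jis", errors="replace")
```

The decoder may or may not fall back into step after a few bytes. Which of the two happens cannot be told from the bytes alone, and that is what makes a reliable text column hard here.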

I thought about having several text columns, one for each possible correcting offset, but there are too many possibilities. If I didn't make any mistake, in the worst case there are as many possibilities as bytes currently visible in the hex view (+/- a couple before or after).

This solution would work however for synchronizing UTF-16 on the byte level, one column with zero offset, the other with an offset of one.
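The two-column idea for UTF-16 could look roughly like this in Python. This is only a sketch: utf16_columns is a made-up name, and little-endian byte order is assumed.

```python
def utf16_columns(view: bytes):
    """Decode the visible bytes as UTF-16LE at byte offsets 0 and 1."""
    columns = []
    for start in (0, 1):
        chunk = view[start:]
        chunk = chunk[: len(chunk) - len(chunk) % 2]  # drop a dangling byte
        columns.append(chunk.decode("utf-16-le", errors="replace"))
    return columns[0], columns[1]


# A string that begins at an odd address: one unrelated byte, then "Hi".
view = b"\x01" + "Hi".encode("utf-16-le")
even_column, odd_column = utf16_columns(view)
```

Here only the column with offset one shows the readable string; the zero-offset column pairs each low byte with the wrong high byte and shows unrelated characters.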

You could also have a spinedit or some other control that sets the correcting start offset for the text column. However this will prevent quick scanning of the file for interesting strings. In the case of UTF-16 that might still be useful, since there are only two options. But for Shift-JIS there are definitely more than 2 and it will quickly become awkward and cause cognitive overload.

A string extraction routine might be possible, though not very efficient either (would have to think that through more thoroughly).

The question remains if it is sensible to support anything besides UTF-8, UTF-16 (possibly UTF-32), since MBCS all have similar problems to Shift-JIS. Single byte character sets could be supported without issues.

Feedback welcome!

Maël
Site Admin
Posts: 1125
Joined: 12 Mar 2005 14:15

Re: Text column support for UTF-8, UTF-16 and other multi-byte text encodings (variable width encodings)

Post by Maël » 10 Feb 2019 08:23

Another issue is the Overwrite mode. When a character (in the user sense, so a grapheme cluster) is overwritten, usually a character that has an encoding of the same length (=number of bytes) is expected. With variable-length encodings this cannot be guaranteed.
The most sensible approach would be to replace a character with another character no matter how long their byte encodings are. So, the text column would behave differently from what it did with 1-byte encodings.

Assuming the (possibly unexpected) change in behavior above, it would be beneficial to add another option, that watches that the file size doesn't change (to catch such unexpected char size differences). The option should be (un)checkable in the toolbar and be enabled by default (to ensure expected results, i.e., no change in file size, or warnings on deviation).
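As a sketch of how such a check could work (Python, hypothetical helper name overwrite_char, UTF-8 assumed): the character is replaced as in a text editor, and the size difference is returned so that the caller can warn when it is not zero.

```python
def overwrite_char(data: bytes, offset: int, old_char: str, new_char: str,
                   encoding: str = "utf-8"):
    """Replace the encoded old_char at offset with new_char.

    Returns the new buffer and the change in size in bytes.
    """
    old_bytes = old_char.encode(encoding)
    new_bytes = new_char.encode(encoding)
    if data[offset:offset + len(old_bytes)] != old_bytes:
        raise ValueError("old_char is not located at offset")
    result = data[:offset] + new_bytes + data[offset + len(old_bytes):]
    return result, len(new_bytes) - len(old_bytes)


buf = "f\u00fcr".encode("utf-8")   # 'f', precomposed u-umlaut, 'r': 4 bytes
same, delta_same = overwrite_char(buf, 1, "\u00fc", "\u00f6")   # 2 -> 2 bytes
grew, delta_grew = overwrite_char(buf, 1, "\u00fc", "\u20ac")   # 2 -> 3 bytes
shrank, delta_shrank = overwrite_char(buf, 1, "\u00fc", "u")    # 2 -> 1 byte
```

With the option enabled, any non-zero delta would lead to the warning instead of a silent change in file size.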

When at the end of a file, overwrite mode extends the file anyways, so this would be a logical extension of this behavior. I'll have to see if this warning and the warning on overwriting single characters should be separate or not. Maybe separate is better, because again, a character that replaces another character and has a different amount of bytes in length will be unexpected for many (e.g., those who mostly use the ASCII part of UTF-8, the BMP part of UTF-16, or simple 1 byte encodings such as Latin-1), even though correct.

The readonly mode should be unselectable as well (with an explicit confirmation that cannot be clicked accidentally or confirmed by only one key press; i.e., not a default button, or shortcut that only uses one key to confirm).

j7n
Posts: 11
Joined: 28 Jan 2019 18:26

Re: Text column support for UTF-8, UTF-16 and other multi-byte text encodings

Post by j7n » 16 Feb 2019 05:32

Editing in place (overwrite) from the text column would be quite problematic in variable-width encodings. I only see it working reliably with 16-bit Unicode. If I type over the composite 'ü' (only Macintosh writes this format ü) and thus write fewer bytes, where would the padding go? If directly after the character, it should allow typing over these bytes to keep the string continuous, and make it possible to fill all the original bytes. Maybe show this null padding as a replacement character. I don't think it should shift the remaining characters (after ü) left. Text editing should not move or change any bytes after the logical ending of the string (shown random gibberish characters, possibly accidentally joined into a cluster), to avoid breaking the underlying file format. The highlighting within the hex column of the byte range we are typing over, as well as tabular, monospaced display text column, showing how much space remains, are helpful to ensure this.

Perhaps there could be a few modes/encodings: Unicode UCS-2 (simple, tabular view), UTF-8 and UTF-16 (both fully interpreted). An option to keep multiple decoded text columns on screen at the same time would be useful (to see mixed ANSI and Unicode text, or raw UCS-2 vs interpreted UTF-16). A spin edit control for selecting a starting offset for Unicode would be good. I noticed that the new Data Inspector already has WideChar decoding, I'd like to see that as one of the choices for the text column, to read more than one symbol at a time. Maybe other fixed width fields from the Inspector could be candidates for a tabular display in a column, with a spin control to set alignment.
Maël wrote:
08 Feb 2019 01:39
It also makes it hard to read the text, including latin based scripts that are spaced apart too widely.
Agreed, the Unicode implementation there is raw, and needs some polishing, overwrite is not supported; the width of the text column should halve in Unicode, and typing should replace the entire 16-bit word (or code unit).

Maël
Site Admin
Posts: 1125
Joined: 12 Mar 2005 14:15

Re: Text column support for UTF-8, UTF-16 and other multi-byte text encodings

Post by Maël » 16 Feb 2019 11:16

j7n wrote:
16 Feb 2019 05:32
Editing in place (overwrite) from the text column would be quite problematic in variable-width encodings. I only see it working reliably with 16-bit Unicode. If I type over the composite 'ü' (only Macintosh writes this format ü) and thus write fewer bytes, where would the padding go?
There would be no padding, it would work like overwriting a character in a text editor, i.e., all bytes that belong to the overwritten character (or better, grapheme cluster) will be replaced with the bytes of the new character. Depending on which character it is, this may cause changes in the file size. That's why I mentioned in another post how such file size changes would trigger a warning.
Your idea with UCS-2 encoding might solve this however.
Best would be to show surrogate halves as boxes with the letters/numbers showing the corresponding Codepoint, as with fonts that are missing certain glyphs.


UCS-2 and naming of options

I did some research. UCS-2 is Unicode below 2.0. The surrogate halves are not yet defined: https://www.unicode.org/versions/components-1.1.5.html
But the surrogate range is defined in 2.0: https://www.unicode.org/Public/2.0-Upda ... 2.0.14.txt
No characters are defined in the surrogate range for 2.0, though: https://www.unicode.org/versions/Unicod ... ch06_6.pdf

Here is the official answer, that UCS-2 indeed goes only up to Unicode 1.1:
http://unicode.org/faq/utf_bom.html#utf16-11

The BMP (basic multilingual plane) was not finalized when Unicode 2.0 came out, and UCS-2 was not formally updated afterwards, so maybe a better name would be "UTF-16 (ignoring surrogates)" or "UTF-16 (BMP only)".

Unfortunately, combining characters exist already in the BMP since Unicode 1.1: https://en.wikipedia.org/wiki/Combining ... cter_table
For the official reference, search for "0300;COMBINING GRAVE ACCENT;Mn;230;L;;;;;N;NON-SPACING GRAVE;Varia;;;" in https://www.unicode.org/Public/1.1-Upda ... -1.1.5.txt

Therefore UCS-2 does not ignore combining characters or avoid the issue of fusing them to grapheme clusters.
Also, one would probably want the modern definition of the BMP, and not the old one as still defined in UCS-2/Unicode 1.1.
UTF-16 only focusing on the BMP will not solve the grapheme clustering problem, it just guarantees (like UCS-2) that each Codepoint is made of exactly 16 bits.
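To illustrate that BMP-only text can still need clustering, here is a deliberately simplified Python sketch. It groups a base character with the combining marks that follow it, using the canonical combining class from the standard unicodedata module. Proper grapheme clustering has more rules than this, and the function name simple_clusters is made up.

```python
import unicodedata


def simple_clusters(text: str) -> list:
    """Group each base character with the combining marks that follow it.

    Simplification: only characters with a non-zero canonical combining
    class are attached; full grapheme clustering needs more rules.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


decomposed = "u\u0308ber"   # 'u' + COMBINING DIAERESIS, then "ber"
code_units = len(decomposed.encode("utf-16-le")) // 2
clusters = simple_clusters(decomposed)
grave_class = unicodedata.combining("\u0300")   # COMBINING GRAVE ACCENT
```

All five code units are plain 16-bit BMP values, yet the user sees four characters. The value 230 for U+0300 is the same combining class that appears in the UnicodeData line quoted above.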

So one might need a separate option to avoid building grapheme clusters, maybe as a normalization option or something, but this is really focused on codepoint processing, not graphical output/visual editing. So the better term would be "Enable/disable grapheme clustering".

As even Unicode 11.0 leaves some Codepoints in the BMP undefined, I think "UTF-16 (BMP only)" is a better name than UCS-2, so that possibly new Codepoints can be supported (with a logical name) as Unicode progresses.

Conclusion: "UTF-16 (BMP only)" as encoding name, and an "Enable/disable grapheme clustering" option.
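A possible reading of "UTF-16 (BMP only)" as a Python sketch, assuming little-endian byte order: every 16-bit code unit becomes exactly one cell, and surrogate halves are never paired but shown as a placeholder with their hex value. The bracket notation stands in for the box glyph mentioned earlier, and decode_bmp_only is a hypothetical name.

```python
import struct


def decode_bmp_only(data: bytes) -> list:
    """Decode UTF-16LE one code unit at a time, never pairing surrogates."""
    count = len(data) // 2
    units = struct.unpack("<%dH" % count, data[: count * 2])
    cells = []
    for unit in units:
        if 0xD800 <= unit <= 0xDFFF:
            cells.append("[%04X]" % unit)   # stand-in for a hex box glyph
        else:
            cells.append(chr(unit))
    return cells


data = "A\U0001F600B".encode("utf-16-le")   # U+1F600 needs a surrogate pair
cells = decode_bmp_only(data)
```

The number of cells always equals the number of 16-bit words, so a tabular layout stays possible. Combining characters would still occupy their own cell unless grapheme clustering is enabled.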


Other points


It might also be useful to support another mode, where not only the bytes that belong to a character are replaced, and new bytes are inserted (the way that text editors do an overwrite), but also a raw overwrite, that simply overwrites any bytes it needs, not caring if they belong to one character, or multiple ones. This could however create random gibberish as you mentioned. Introducing padding bytes would not make much sense, since the text result of a string would be corrupted as well (introducing null bytes or random replacement characters which are even more problematic because they need at least two bytes) as it would interrupt a normal string with fillers that are just artifacts, not actual text information.

Hmm, maybe this is not such a good idea after all, because it will just create weirdly changed text, and still possibly corrupt a file format by overwriting bytes if the new character is larger than the old one. I think I will stick with "text editor overwrite mode" in the text column (with file size change warning), and traditional overwrite mode in the hex column. It would need to be documented somehow in an introductory article, to grasp how hex editors behave with multi-byte and variable-length encodings.

It may also be possible to select if precomposed or combining characters should be used, if one of the two options preserves the file size, or edit an entire string so that changes even out (some characters shorter, some longer, or an automatic search for a string of the right size). But this would be an option for a future release, not a first one.
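The precomposed/combining choice could be sketched like this in Python, using the standard unicodedata.normalize function. The helper fit_to_size is hypothetical and only tries the two normalization forms NFC and NFD.

```python
import unicodedata
from typing import Optional


def fit_to_size(new_char: str, old_len: int,
                encoding: str = "utf-8") -> Optional[bytes]:
    """Return the encoded form (NFC or NFD) of new_char that occupies
    exactly old_len bytes, or None if neither form fits."""
    for form in ("NFC", "NFD"):
        candidate = unicodedata.normalize(form, new_char).encode(encoding)
        if len(candidate) == old_len:
            return candidate
    return None


# Precomposed U+00FC is 2 bytes in UTF-8; 'u' + U+0308 is 3 bytes.
two = fit_to_size("\u00fc", 2)
three = fit_to_size("\u00fc", 3)
neither = fit_to_size("\u00fc", 4)
```

When neither form fits, the editor would fall back to the file size warning, or to balancing the change elsewhere in the string as described above.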

Such a principle would be needed for a structure view, as well, where the replaced datatype might be longer/shorter than the original one, so fitting data in the available space (or adjusting length fields somewhere in a header within the file), are things that will matter. But as mentioned below, that's for another feature request.
j7n wrote:
The highlighting within the hex column of the byte range we are typing over, as well as tabular, monospaced display text column, showing how much space remains, are helpful to ensure this.
Showing which bytes correspond to which characters (depending on the encoding) is useful, agreed.
j7n wrote:
Text editing should not move or change any bytes after the logical ending of the string (shown random gibberish characters, possibly accidentally joined into a cluster), to avoid breaking the underlying file format.
This can always happen, also when you just support UCS-2 and no grapheme clusters: bytes might be randomly fused to a 16-bit character/glyph that are unrelated and part of two different structures in the file format. It would also vary how such fusings occur depending on the chosen alignment/start offset for the encoding/character (think of the spin edit). To solve such issues you need a structure view (which is another feature request).
Such issues are unavoidable with multibyte encodings, variable width or not, which is why HxD will give up the tabular idea for text encodings that don't naturally support the 1 byte = 1 character idea. Also something that will be mentioned in an introductory article.
j7n wrote:
Perhaps there could be a few modes/encodings: Unicode UCS-2 (simple, tabular view), UTF-8 and UTF-16 (both fully interpreted).
That could be an option, though some issues would remain with UCS-2 as mentioned above, even if the file size would remain unchanged.
j7n wrote:
An option to keep multiple decoded text columns on screen at the same time would be useful (to see mixed ANSI and Unicode text, or raw UCS-2 vs interpreted UTF-16). A spin edit control for selecting a starting offset for Unicode would be good.
Both certainly useful ideas.
j7n wrote:
I noticed that the new Data Inspector already has WideChar decoding, I'd like to see that as one of the choices for the text column, to read more than one symbol at a time.
Maybe other fixed width fields from the Inspector could be candidates for a tabular display in a column, with a spin control to set alignment.
Replied here: https://forum.mh-nexus.de/viewtopic.php?p=3071#p3071
