Text column support for UTF-8, UTF-16 and other multi-byte text encodings (variable width encodings)

Wishlists for new functionality and features.

Post by Maël »

Understanding bi-directional text
https://www.iamcal.com/understanding-bi ... onal-text/

There is a file order of code points (and the grapheme clusters they "agglomerate to") and a display/"user reading" order of these "characters".

For Arabic text, which is usually RTL, the codepoints will have increasing positions in the file, while on display they will have decreasing X coordinates.

In a hex editor, the main mode should definitely favor the file order, even if it makes it hard to read the text, because a clear correlation between file order and interpretation should be possible.

An additional display/reading-order mode could be offered, but should not be the default (and would need limits, so that HxD does not have to scan the whole file until it reaches the first character that actually needs to be printed first).

https://w3c.github.io/i18n-drafts/artic ... -basics.en
It is important to understand from the outset that, in all major web browsers, the order of characters in memory (logical) is not the same as the order in which they are displayed (visual).
Issues due to mixing LTR and RTL text:
https://docs.microsoft.com/en-us/dynami ... al-support

Post by Maël »

The logical order of characters in a file is displayed as left to right in a hex editor, even for RTL text. So while increasing file offsets for LTR text mean increasing horizontal pixel offsets, for RTL text increasing file offsets mean increasing horizontal pixel offsets from right to left (or decreasing offsets, when assuming an LTR coordinate system).

So both LTR and RTL text are in the correct logical order in a file; they just map to an LTR or an RTL coordinate system, which have opposing directions for the x axis.
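
A minimal sketch of these two mirrored coordinate systems, assuming a fixed-width grid of ColCount columns (LogicalIndexToColumn is a hypothetical helper, not HxD code):

Code: Select all

// Maps a character index i (file/logical order) to a display column.
// The two reading directions are just mirrored x axes.
function LogicalIndexToColumn(i, ColCount: Integer; RTL: Boolean): Integer;
begin
  if RTL then
    Result := ColCount - 1 - i  // increasing offsets move right to left
  else
    Result := i;                // increasing offsets move left to right
end;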

Therefore properly handling RTL text would also mean that the hexpair/byte display would need to be shown from right to left as well.
Essentially, LTR and RTL text would need to be split up into individual paragraphs that are also edited individually. Each paragraph would also line wrap independently. A problem would still occur with overlong lines, which would be overlong on the left side, for RTL text. This would cause problems for scrolling, since there is no clear start at the left side anymore.
Overlong text lines can occur due to variable-width characters, which are unavoidable with Unicode even for monospaced fonts (e.g., "double width" Chinese characters). They can also occur due to composed characters/grapheme clusters, which need to be displayed together, such as the COMBINING DOUBLE TILDE, which connects two characters with an overarching tilde.
A possible solution would be to draw such overlong text lines left aligned, but still keep the RTL order of the characters/glyphs, the same way that RTL text behaves and line wraps inside of an LTR container (div or p in HTML).

It's definitely not possible to mix left- and right-aligned text lines without knowing the widths of all text lines: otherwise the rightmost pixel position is unknown, which is needed to right-align to the longest (RTL) text line. Without knowing the rightmost pixel, we would require an RTL scrollbar to still be able to align all lines at the same pixel position (the rightmost pixel would always be pixel 0), and increasing x pixels would need to go to the left.

Either we have an LTR or an RTL scrollbar, so one of the two display directions must be chosen, and the other one will have the problem just described for RTL text. That's why left-aligning RTL text as well is necessary (assuming LTR is the default).

If we wrap text based on grapheme clusters, while not respecting word boundaries (or soft-break boundaries, as Uniscribe provides), we could still display text in RTL order, but left aligned (due to the scrollbar and alignment of overlong lines mentioned above). It would not be possible to provide contextual shaping, though, since breaking lines within words would change the text rendering / disable contextual shaping. Therefore it might be best to disable contextual shaping entirely, rather than having the shaping of words change randomly as we resize a window (which changes the number of glyphs that fit horizontally, and therefore the places where words are broken up).

Alternatively, the LTR display always reflects the logical order of bytes/code units/code points etc., and therefore text rendering should also reflect this order. I.e., text should always be rendered LTR, even if it is RTL text, such as Arabic. Again, we should disable contextual shaping and draw grapheme clusters individually, without contextual shaping between such "user"/abstract characters.

Rendering RTL text after LTR text requires scanning the file until the end of the RTL text run is found. This is because the last character of the RTL text run will be rendered directly after the last character of the LTR text. This is a potential performance problem with very long text runs, which can easily occur in the random files encountered in a hex editor (real text files can at least be assumed to be somewhat reasonably structured).

The paragraph option above would be conceptually clearer, but also impractical. Both approaches still require potentially scanning the entire file to determine where an LTR or RTL block ends, to allow proper rendering.

Especially, RTL text would require a different scrolling direction, which runs into the scrollbar problem described above.

Post by Maël »

Line-wrapping ("word" wrapping -- really grapheme cluster wrapping)
Independently of RTL text, line wrapping that respects grapheme cluster boundaries may render more characters per line than there are bytes (hex pairs) per line, since a grapheme cluster begun on one "hex line" will be drawn completely, as one unit, on the text line. Wrapping should therefore be based on the actual text width of both the hex line and the text line (while wrapping only at grapheme boundaries).

Should we still assume a fixed ...

Post by Maël »

https://docs.microsoft.com/en-us/window ... #abc-width

This link explains ABC widths, which would allow dealing with the overhanging-H problem (Ἧ), where the leading decoration is clipped when a normal TextOut is used. Test whether the ABC width of a text line includes this overhang, such that we can respect it when printing text and avoid clipping. ClearType has the same problem, drawing some additional pixels outside the allotted rect, as Catch22 noticed.
We should always provide enough spacing before and after the text control in HxD to allow such text to print fully, without clipping. That also means we should first draw the background and padding between the controls, then draw the text with a transparent background and no clipping.
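
A small sketch of how such overhangs could be probed with plain GDI (ScriptPlace also returns an ABC record for a whole run; GetGlyphOverhang is a hypothetical helper, and GetCharABCWidths works for TrueType fonts only):

Code: Select all

// A negative abcA means the glyph paints left of the pen position;
// a negative abcC means it paints beyond its advance width.
function GetGlyphOverhang(DC: HDC; Ch: WideChar;
  out LeftOverhang, RightOverhang: Integer): Boolean;
var
  Widths: TABC;
begin
  LeftOverhang := 0;
  RightOverhang := 0;
  Result := GetCharABCWidthsW(DC, Ord(Ch), Ord(Ch), Widths);
  if Result then
  begin
    if Widths.abcA < 0 then LeftOverhang := -Widths.abcA;
    if Widths.abcC < 0 then RightOverhang := -Widths.abcC;
  end;
end;

The two overhang values tell us how much padding to reserve around the text rect before drawing.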

What to do with Zalgo text, though, which might otherwise spill into the hexpair area? We would still need clipping, but only at about half of the space between the hex-pair control and the text control.

Note: The above glossary has many other useful definitions of the terms run, item, range, font fallback, complex script, and script ("A script is a system of written language, for example, Latin script, Arabic script, Chinese script" -- so "Schriftsystem" is the right German translation, not "Handschrift" nor "Schrift").
There is also a section About Complex Scripts that goes into more detail of contextual shaping and combining characters.

More detail on complex script processing.

Even more details on the process of text rendering are available here:
https://docs.microsoft.com/de-de/typogr ... sing-part1

The above link also contains some interesting representations for mapping hexquadruples/codepoints to glyphs/characters in a somewhat similar way to what hex editors do:
op-example5[1].gif
(This picture resembles what I wanted to do for my structure view.)


op-example8[1].gif
op-example9[1].gif
op-example10[1].gif

Finally, a perfect explanation of complex text layout (Wikipedia), using Arabic and Devanagari as examples. Here is the Arabic example (the word العربية al-arabiyyah, "the Arabic [language]"), showing three possible renderings: LTR rendering (no complex script layout), RTL (no complex script layout), and RTL (with contextual shaping).
250px-Arabicrender[1].png
Side note: Arabic alphabet.

There are 4 different variants of each letter, depending on its position in a word (initial, medial, final, or isolated):
http://www.arabic-course.com/arabic-alphabet.html (search for "Arabic letters change their shape according to their position in a word" -- no <a> anchor in the html of the page...)
See also table of letter forms (Wikipedia).

From the above link:
Each letter has a position-independent encoding in Unicode, and the rendering software can infer the correct glyph form (initial, medial, final or isolated) from its joining context. That is the current recommendation. However, for compatibility with previous standards, the initial, medial, final and isolated forms can also be encoded separately.
Other relevant information on contextual shaping and complex script rendering in general, with lots of useful pictures:
http://scripts.sil.org/cms/scripts/page ... ndExamples
Ligatures:
CmplxRndLigArabic[1].png
CmplxRndLigIPA[1].png

Contextual shaping:
CmplxRndShapingBurmese[1].png
Ligatures
A special case of deletion involves the creation of ligatures, where two or more items are replaced by a single glyph representing both or all of them. In the case of a ligature, there are visual components of the final glyph that correspond to each of the underlying characters.
From: https://scripts.sil.org/cms/scripts/pag ... tutorial14

Given the research above, Arabic seems reasonably similar to cursive writing of the Latin script (even though contextual shaping is much more drastic), such that an Arabic reader should be able to read a word even if it is composed of isolated letters, as opposed to the more connected rendering obtained by using the correct positional form of each letter. Probably, well-trained Arabic readers will more readily recognize contextually shaped words (which can look quite different from words written as a sequence of isolated letters -- see the embedded image above). But it should remain readable, if slightly awkward.

Not scientific proof, but somewhat suggestive that reading is indeed pattern matching of well-known shapes (so uncommon renderings will read awkwardly, which isolated glyphs will be -- but again, a lot of text in HxD will not be natural language, and therefore not common patterns forming words):
Microsoft researchers spent more than two years sifting through a large amount of research related to both typography and the psychology of reading. They concluded that reading is a form of pattern recognition. People become immersed in reading only when word recognition is a subconscious task and the conscious mind is free to read the text for meaning.
From https://docs.microsoft.com/de-de/typogr ... -cleartype

(Only slightly related, more of an interesting note:
No two screens are exactly the same and everyone perceives color in a slightly different way.
https://docs.microsoft.com/de-de/typogr ... my-display
)

What was discovered is that word recognition is only subconscious when typographical elements such as the shape and weight of letterforms, and the spacing between letters work together to present words as easily recognized patterns.

The isolated-character mode should definitely be available for Arabic letter sequences that do not form meaningful words, but may be random letter sequences (due to random data viewed in a hex editor) or letter sequences with non-linguistic meaning. In this case, each character should be unambiguously identifiable, and not be "obfuscated" by contextual shaping, since recognizing entire words will not work (those character sequences do not form words).

Since various letter forms are available in legacy encodings (and therefore in Unicode? -- see quote above), we might have contextually shaped letters that should still be printed in isolation, with a gap (and no ligature) to adjacent letters. Check that the text renderer respects this isolated rendering!

So if it turns out cursive script is not technically possible in a hex editor, this should still be acceptable, if not ideal. RTL, however, would be good to have, if possible. The issue of having to scan far still needs to be resolved (see the two posts above this one).
The question remains how much of this applies to Devanagari, since it also reorders letters "randomly", not just from LTR to RTL.

RTL and LTR marks (which alter the text display order) should probably be ignored, since handling them would again require scanning a large part of the document, and may cause inconsistencies when a mark is out of range (of our search) and then comes into range while scrolling, which would cause seemingly random direction changes of the text. For the same reason it is probably better to ignore RTL entirely, but reading would be very awkward this way...

Yet random text that uses random RTL character sequences and has no linguistic meaning would not benefit from RTL order; quite the opposite, since it makes it harder to correlate characters to bytes.
So at the very least, provide a forced LTR display order, and possibly, an additional RTL order.

Or maybe it is best to provide a separate text display with full support for text rendering like a text editor, including line breaks (yet it would still need to limit scanning of the file, otherwise it may become unresponsive -- or it would need to introduce artificial line breaks after too many characters -- but this would again mess up rendering due to potentially missed text display order marks...).

Post by Maël »

Practical line limits due to Uniscribe

Breaking up text into lines needs to consider an important limit imposed by Uniscribe.
From https://sourceforge.net/p/scintilla/bugs/1129/?limit=10&page=2:
In Mozilla's source code in Uniscribe related code there's this comment, which fits perfectly:

"Any item of length > 43680 will cause ScriptShape() to fail, as its mMaxGlyphs value will be greater than 65535 (43680*1.5+16>65535). So we need to break up items which are longer than that upon cluster boundaries. See bug 394751 for details."

The MSDN page for ScriptShape() recommends setting the length of the glyph buffer to 1.5 * StrLength + 16, but there is probably an upper limit of 64K, which should not be exceeded.
From MSDN:
If this function returns E_OUTOFMEMORY, the application might call ScriptShape repeatedly, with successively larger output buffers, until a large enough buffer is provided. The number of glyphs generated by a code point varies according to the script and the font. For a simple script, a Unicode code point might generate a single glyph. However, a complex script font might construct characters from components, and thus generate several times as many glyphs as characters. Also, there are special cases, such as invalid character representations, in which extra glyphs are added to represent the invalid sequence. Therefore, a reasonable guess for the size of the buffer indicated by pwOutGlyphs is 1.5 times the length of the character buffer, plus an additional 16 glyphs for rare cases, for example, invalid sequence representation.
We will assume a bit more tolerance than just the 1.5 factor in the first quote, since the second quote points out that the number of glyphs needed per character depends a lot on the actually used font and the complexity of the character. Invalid sequences should not occur, since we ensure the strings are valid, but possibly boxes with the hex code of the code point are generated (missing-glyph replacement), which might be built up of several glyphs (one for the box, several for the letters/numbers inside of it; in total 7: one for the box, 6 for the letters/numbers).
The 1.5 factor is probably due to the assumption of mostly Latin text, but in HxD, with potentially random data, this cannot be assumed.
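
As a sketch, the buffer-growth pattern from the MSDN quote could look like this (assuming a JEDI-style Usp10 import such as JwaUsp10; DC, ScriptCache, Analysis, Text and Len come from the surrounding shaping code, and all other error handling is elided):

Code: Select all

var
  Glyphs, Clusters: array of WORD;
  VisAttrs: array of TScriptVisAttr;
  MaxGlyphs, GlyphCount: Integer;
  hr: HRESULT;
begin
  MaxGlyphs := (Len * 3) div 2 + 16;  // MSDN's initial guess: 1.5 * length + 16
  repeat
    SetLength(Glyphs, MaxGlyphs);
    SetLength(VisAttrs, MaxGlyphs);
    SetLength(Clusters, Len);
    hr := ScriptShape(DC, @ScriptCache, PWideChar(Text), Len, MaxGlyphs,
      @Analysis, @Glyphs[0], @Clusters[0], @VisAttrs[0], @GlyphCount);
    MaxGlyphs := MaxGlyphs * 2;       // grow the buffer and retry
  until hr <> E_OUTOFMEMORY;
end;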

Unfortunately, there is no fixed worst case, since the number of glyphs per character really depends on how the font is constructed, and could be many glyphs per character. A factor of 8 glyphs per codepoint seems a very safe upper bound when considering the stroke density of common Asian, Arabic, or other scripts, and should cover most complexities, including missing-glyph replacement drawings.

With the 64K limit of ScriptShape() this results in 65536 / 8 = 8192 codepoints at most for the text column. The hex column has less strict requirements, since the codepoints there will always be from the set 0-9, A-F, a-f, and space characters, all of which are Latin characters and/or simple in structure. So while, in the worst case, we have to assume 3 times as many codepoints ("AA " -- a hex pair plus a space per byte, each an ASCII codepoint) as in the text column (3 * 8192 = 24576), 24576 and even 24576 * 1.5 = 36864 or 24576 * 2 = 49152 are still well below the 64K limit. Even adding an additional space between every hex pair would be no problem: 49152 + 8192 = 57344.

Therefore, when drawing the hex column and text column individually, 8192 codepoints seem a very safe limit.

This gives us a practical limit for the maximal line length we can support, no matter what the actual line length would be: whether due to a BytesPerRow setting, a line break due to word wrapping, or an explicit line break (CR/LF etc.).

There is also not much reason to have lines longer than this limit, since they would require a lot of horizontal scrolling. To avoid cutting anything off, we could introduce a "forced line wrap" symbol, showing that the line wrap was made due to technical constraints, and not due to BytesPerRow, CR/LF (or similar, like a paragraph break), nor normal word wrapping (at window boundaries).

So, in general, we must scan and subdivide text independently of Uniscribe. We cannot rely on Uniscribe to break a text extent into lines; we can at most rely on it to suggest line-wrap (soft break) points within an already limited line.

Therefore we must scan in units of grapheme clusters (made of up to 31 codepoints), as already mentioned earlier, until we have consumed at most 8192 code points, so as not to generate incomplete grapheme clusters. Additionally, any other boundary that occurs earlier, such as BytesPerRow, CR/LF etc., or window-width considerations for soft line breaks, will further limit the actual line length.
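
A sketch of that scanning loop (NextClusterLength and IsHardBreak are hypothetical helpers; Text, TextLen and LineStart come from the surrounding code):

Code: Select all

const
  MaxCodePointsPerLine = 8192;  // safe budget derived from the 64K glyph limit
var
  Index, Consumed, ClusterLen: Integer;
begin
  Consumed := 0;
  Index := LineStart;
  while Index < TextLen do
  begin
    ClusterLen := NextClusterLength(Text, Index);  // whole cluster, <= 31 codepoints
    if (Consumed + ClusterLen > MaxCodePointsPerLine) or
       IsHardBreak(Text, Index) then
      Break;  // wrap before the cluster that would overflow, or at CR/LF etc.
    Inc(Consumed, ClusterLen);
    Inc(Index, ClusterLen);
  end;
  // [LineStart, Index) is now a line that never splits a grapheme cluster.
end;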

Given this hard limit for the line length, a string can be obtained independently of Uniscribe, then measured and hit-tested with Uniscribe, to obtain the information necessary for horizontal scrolling and for mapping the mouse position to a "user perceived character" position (and vice versa).

Post by Maël »

HxD has no notion of a paragraph, but ScriptItemize() from Uniscribe requires an entire paragraph for proper bidi analysis. We cannot simply ignore bidi text, nor the Directional Formatting Characters occurring in a string, since they affect the rendered text; especially if we split a longer text at an arbitrary point, unexpected outcomes may result.

Similar to searching for grapheme cluster boundaries for a given view/window of a file (= range of bytes to be viewed), and rewinding before the view start when necessary to find a grapheme cluster start, we need to find the true boundaries of text runs of a certain directionality, such that text runs that cross view boundaries remain complete, to ensure proper text analysis.

To that end, we need to better understand Unicode's bidi algorithm, and cannot merely rely on Uniscribe:
https://unicode.org/reports/tr9/

Especially the shaping section and the Basic Display Algorithm.

Also see https://maxradi.us/documents/uniscribe/

Clarified and summarized Uniscribe terms, from the Uniscribe glossary already mentioned above:
  • paragraph = smallest text unit ScriptItemize can process
  • item = characters of multiple styles, but single script and single direction
  • run = single style, possibly variety of scripts, possibly various directions
  • range = run created by breaking up items at style boundaries (if the item has multiple styles)
    • a range has only one style (because created by breaking at style boundaries), a single script and a single direction (because items have these properties, too)
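The same terms, restated as a minimal data-model sketch (illustrative only, not HxD's actual types; a "paragraph" is simply the whole string passed to ScriptItemize):

Code: Select all

type
  TItem = record            // from ScriptItemize: single script, single direction
    Start, Length: Integer;
    BidiLevel: Byte;        // even = LTR, odd = RTL
  end;

  TStyle = record           // font, color, ...
    Font: HFONT;
  end;

  TRange = record           // an item split further at style boundaries:
    Item: TItem;            // single script and direction (inherited from the item)
    Style: TStyle;          // plus exactly one style
  end;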

Post by Maël »

Bidi Algorithm and ScriptLayout

Span is an inline element (which cannot cross block/paragraph boundaries), but it can word-wrap.
For example this HTML-code:

Code: Select all

<html><head></head><body>

<div>The following span is an <span style=" background-color:#ee3;">inline element</span>;
its background has been colored to display both the beginning and end of
the <span style=" background-color:#ee3;">inline element</span>'s influence.</div></body></html>
Will render like this, when the window is appropriately resized (see how the second yellow "inline element" word-wraps on the window/rectangle boundary):
span-word-wrap.png
span-word-wrap.png (5.21 KiB) Viewed 19809 times
In general, by its pure definition, a run may cross paragraph boundaries. But in the context of the bidi algorithm, which works at most on paragraphs, a run cannot cross a paragraph boundary. Therefore, a run is equivalent to a span in this context. Span will be used as an alias for run, so that we can use r as a shorthand for right, and s as a shorthand for span (= run).

lp = left-to-right paragraph
rp = right-to-left paragraph
ls = left-to-right run (=span)
rs = right-to-left run (=span)

lne = level n expression (a span at embedding level n)
For example, l1e is a level 1 expression (a span at embedding level 1).

Another minor extension of this notation is obtained using parentheses:
l(n)e where n can be an arithmetic expression, not just an integer, for example:
l(n-1)e is a level n-1 expression (a span at embedding level n-1).

Now we can translate the embedding levels table (the last "etc." entry in the table got lost -- MS docs have many little errors and omissions in general, since they transferred to a new help system).

Code: Select all

level	meaning

0 	A left-to-right run in a left-to-right paragraph.
1 	A right-to-left run embedded in a left-to-right run in a left-to-right paragraph.
	Alternatively, a right-to-left run, not embedded in another run, in a right-to-left paragraph.
2 	A left-to-right run embedded in a right-to-left run of type 1.
3 	A right-to-left run embedded in a left-to-right run of type 2.
etc. 	The embedding levels can continue as far as necessary
When translating this table using the abbreviations defined above, and using a forward slash to denote containment (A/B means B is contained or embedded in A; in other words, B is a subnode of A, which is why we use / like a path separator), we get the following equivalent description:
Code: Select all

level      span type               expression name   expression    expanded expression
0          left span at level 0    l0e               lp/ls
1          right span at level 1   l1e               lp/ls/rs
                                                     or rp/rs
2          left span at level 2    l2e               l1e/ls        lp/ls/rs/ls
                                                                   or rp/rs/ls
3          right span at level 3   l3e               l2e/rs        l1e/ls/rs
4          left span at level 4    l4e               l3e/ls        l2e/rs/ls
5          right span at level 5   l5e               l4e/rs        l3e/ls/rs
n is even  left span at level n    lne               l(n-1)e/ls    l(n-2)e/rs/ls
n is odd   right span at level n   lne               l(n-1)e/rs    l(n-2)e/ls/rs

Post by Maël »

Now, given this definition, you might assume that each left run/span has to be followed by a right run/span, and vice versa.
However, this is not the case. It is entirely possible to have several left-to-right scripts in sequence, for example in the following phrase, which will generate a sequence of runs (two, in the example below) of the same direction (LTR, in the example below).

Helloठऑक्षझॉ

This will be decomposed into two items by ScriptItemize():
  • Hello
  • ठऑक्षझॉ
The first item is in Latin script, the second in Devanagari. So our bidiLevel array, if constructed from the above item array will look like this (actually tested with real code):

Code: Select all

0, 0
I.e., two LTR runs (both of level 0).

It is also possible to manually construct a bidiLevel array, skipping one level, such as 0, 2, 2, 0. ScriptLayout() also handles this, but I can't think of any situation where such an array could actually be created, except for an LTR run embedding another LTR run, however that would be denoted.

So, given all these definitions and information, how do we actually embed one run into another, as opposed to simply chaining them?

One possibility is to use bidi control characters: Bidi_Control. A brief and essential overview of relevant codepoints that denote the start and end of a directionality is given in comparison of HTML markup and bidi control codepoints.
The main codepoints to start a new direction are LRE/RLE and RLO/LRO, which are all terminated by PDF.
There are also the codepoints LRI/RLI and FSI, which are terminated by PDI.

A relatively simple example is this string:
a·RLO·b·LRO·c·RLO·d·PDF·PDF·PDF
The actual text output is this (notice that the bidi control codepoints are invisible):
a‮b‭c‮d‬‬‬
In case the rendering is wrong, the intended output order (visual order) is
a, c, d, b,
while the input order (logical order) is
a, b, c, d.

The string will be converted by ScriptItemize() into this bidi array: (0, 0, 1, 1, 2, 2, 3, 3)

And when applying ScriptLayout() we get this:
VisualToLogicalMap = (0, 1, 4, 5, 7, 6, 3, 2)
LogicalToVisualMap = (0, 1, 7, 6, 2, 3, 5, 4)
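
A minimal sketch of the corresponding ScriptLayout() call (assuming a Usp10 import; the comments show the expected output quoted above):

Code: Select all

const
  Levels: array[0..7] of Byte = (0, 0, 1, 1, 2, 2, 3, 3);
var
  VisualToLogical, LogicalToVisual: array[0..7] of Integer;
begin
  if ScriptLayout(Length(Levels), @Levels[0],
       @VisualToLogical[0], @LogicalToVisual[0]) = S_OK then
  begin
    // VisualToLogical = (0, 1, 4, 5, 7, 6, 3, 2)
    // LogicalToVisual = (0, 1, 7, 6, 2, 3, 5, 4)
  end;
end;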

This is perfect for verifying that we use the right map, since in many cases both maps will have the same values, which usually makes testing harder/trickier.

Post by Maël »

Correctly positioning the caret on screen (a lot of example code gets it wrong)

The WinAPI function ScriptCPtoX has a parameter fTrailing, which influences whether the trailing edge of the cluster pointed at by iCP is used (fTrailing = True) or the leading edge is used (fTrailing = False).

The leading edge of a cluster is the leftmost edge for LTR text, and the rightmost edge for RTL text. The trailing edge is the remaining one of the two edges bounding a cluster horizontally.

ScriptCPtoX(iCP, True, ...) = ScriptCPtoX(iCP + 1, False, ...) for LTR and RTL text.
This is an exact equivalence, there is no difference, not even one pixel.

Incorrect behavior:
All of this matters for properly computing the caret position in bidi text, which is wrong in the Catch22 demo and partly wrong in Wine's code for the Windows EDIT control (wrong for single-line mode, but correct for multi-line mode).

Standard/correct behavior:
Notepad, Visual Studio 2017 (maybe earlier versions as well) and WordPad.
Arrow keys navigate in logical order; the selection changes direction according to the text run order. The directions of Backspace and Del are reversed in RTL text.
VS does, however, stretch Arabic characters to fit the width of the average/Latin characters, using kashidas and spacing, so I assume it uses ScriptJustify for this purpose. Chinese characters, however, are not justified to multiples of the average/Latin character width.

VS has two related entries on the status bar: Column (Col) and Char (Ch).
Ch seems to be based on WideChars: an "a" followed by a combining diaeresis is rendered as one character, but counted as two. It is counted as one only for Col.
𨭎 is counted as two Ch and two Col.
For Zalgo text, VS counts Cols visually, apparently. No matter how many combining characters decorate a character vertically, it counts just as one Col.
Col is definitely visual, based on average character width, since the Tab character is only counted as one Ch but as several Col (depending on how wide Tab was defined in the options).
An "a" with combining diaeresis is counted as 3 characters when the file is encoded as UTF-8. With no specified encoding, the encoding must be assumed to be UTF-16, since then it is counted as two chars. The same holds for "ANSI" text before it is saved, so internally VS must use UTF-16, which makes sense on Windows.
Col is not based on grapheme clusters since 𨭎 is counted as two (and two chars).
Finally, VS gets around the problem of how to interpret column information in compiler (console) output by simply ignoring it (or MS compilers not generating that info). The correct behavior (which is maybe implemented for other languages than C/C++) would be to use the encoding and place the caret based on the char (=code point) position.
Conclusion: Col = count of average characters, Ch = code unit count (WideChar for UTF-16 or Byte for UTF-8).

Col and Ch in the status bar of VS follow the logical order (not the visual order, as Notepad++ does).
Otherwise it seems to behave identically to Notepad.

Non-standard behavior:
SciTE / Notepad++ simply ignore bidirectional caret positioning and selection, or rather use the visual order only, never changing direction (neither for the arrow keys, nor for Backspace and Del, whose reversed direction is the most confusing part of Windows Notepad). But the column shown in the status bar is also in visual order and not logical (file) order, so I am not sure how wise that idea is. It counts in units of grapheme clusters: an "a" followed by a combining diaeresis is drawn as one character and counted as one (not as the two codepoints it is made of). Chinese characters are also counted as one. There is no justification/alignment to average character width, neither for Arabic (as opposed to VS) nor for Chinese.
Bug: More analysis shows a bug. For the text "HelloساوِيWorld", selecting with Shift+Right Arrow (pressing Right Arrow 7 times), then copying and pasting, will generate this string: "Helloسا", which is expected (for logical selection order). Yet Notepad++ will highlight it in visual order: "Helloوِي", or as a picture:
UniscribeNotepadPlusPlusSelectionBug.png
In other words, what it shows as selected and what is actually copied is not the same!

Firefox and others use visual order for caret placement, but do consider text direction during selection (and do invert the direction of Backspace and Del in RTL text, though the rules are slightly confusing, while keeping the arrow keys in visual order). Selection in Firefox works in logical order (as opposed to caret positioning with the arrow keys, which is in visual order), but Firefox hides the caret during selection, probably to avoid the confusing ways the caret would otherwise have to move.

Yet the logical-order selection is slightly different from Notepad's. Firefox selects all the RTL text except for its last character when selecting with Shift+Right Arrow (pressing Right Arrow 6 times), then progressively reduces the selection of the RTL text as Right Arrow is pressed further. The issue is that the Arabic text cannot be fully selected this way: pressing Right Arrow repeatedly will also select the W of World as soon as the whole Arabic text is selected. Only going back with Left Arrow can deselect the W while keeping the Arabic text selected.
This does not seem very practical.

Further analysis shows that mouse selection in Firefox and Notepad is identical, and that keyboard selection in Firefox is essentially the same as mouse selection. The only difference is that, when selecting English text followed by Arabic text, mouse selection selects the whole Arabic text upon touching it from the left, then gradually deselects it while progressing to the right. Keyboard selection in Firefox does the same, but instead of selecting the entire Arabic text, it selects the entire Arabic text minus the last character, then reduces the selection in the same way.
It remains problematic that the entire Arabic text cannot be selected without also selecting the first English letter that follows it (always assuming the same example text used above: "HelloساوِيWorld").

The other issue is that Firefox text selection is not "idempotent". As seen in the example above, the entire Arabic text cannot be selected without also selecting the "W" of World, but when reducing the selection with Left Arrow, it works. Windows keyboard selection is more consistent; going backwards in a keyboard selection makes no such difference. Every press of Right Arrow / Left Arrow selects one cluster / user-perceivable character. Firefox sometimes selects two, sometimes one (again, refer to the example above), which is not consistent.
Firefox could fix this, while keeping its selection order, by selecting the entire Arabic text when Shift+Right Arrow is pressed directly after "Hello" is already fully selected. But since the caret is hidden, it would not be obvious where in the selection you are. Maybe that was the reason for this non-idempotent behavior.

My goal:
Currently, my goal is to follow Notepad's conventions, since this is the Windows standard. I may deviate from this as needed, later, but as an option; a correct/standard behavior should exist first. I will not justify the text (and not expand tabs), to avoid having blanks (that may be mistaken for actual space characters) or kashidas that may likewise obscure how many actual characters there are. Additionally, Col/Ch will be in bytes, and not based on average character width, code units, or grapheme clusters. This is the best choice for a hex editor. I might add conversion functions to compute higher-level units, such as code units (= 16 bits for UTF-16) or code points, where necessary. Visual columns (Col), as in VS, do not make much sense; the highest-level unit should be a cluster as defined by ScriptXtoCP and ScriptCPtoX.
In a code editor / VS, columns may make sense for alignment purposes / "crude" formatting. But this is not a goal in a hex editor; here, complete clarity about which characters are actually in the memory/file/data stream is key.

Visual order also seems problematic because of the ambiguity of caret positioning in bidi text (see post below). Maybe there is a good solution, but I haven't seen one implemented in practice (Firefox's selection is confusing, especially when compared with the visual-only direction of caret navigation using the arrow keys), and I'd have to think about it in more detail. It is especially problematic when you consider that text selection needs to reflect the selection of the string in memory, i.e., in logical order. So selection always has to map to logical order, or the selection in memory will have gaps (see also the Notepad++ bug above). That's probably why caret positioning in Firefox works in visual order as long as you just move the caret (without selecting), but reverts to logical order when you select (and hides the caret).

Besides Notepad's convention, the only other reasonable option seems to be to render RTL text from left to right, and then also to select from left to right. This would be mostly useful for the hex editor's text column, especially when it is not obvious where an RTL run ends, and rendering therefore cannot be done in RTL order.

Post by Maël »

On to the implementation:
Assuming you want to make an edit control using Uniscribe, and this control has a property like SelStart (or CaretPos), where the first WideChar (= UTF-16 code unit) is at position 0, you need to compute the caret's X pixel position at each change of SelStart.

Pseudocode would look like this:

Code: Select all

function CharPosToX(CharPos: Integer): Integer;
begin
  Result := 0;
  for Run in Line.RunsInVisualOrder do
  begin
    // Call ScriptCPtoX only when CharPos lies inside this run, or when we
    // reached the last run (caret at the very end of the line).
    if ((CharPos >= Run.Start) and (CharPos < Run.Start + Run.Length)) or
       LastLoopIteration then
    begin
      inc(Result, Run.ScriptCPtoX(CharPos - Run.Start, ...));
      Exit;
    end
    else
      inc(Result, Run.WidthInPixels);  // accumulate widths of runs visually left of CharPos
  end;
end;
That is, the API function ScriptCPtoX will only be called if CharPos is exactly within the bounds of a run, or when it is at the end of the last run (LastLoopIteration).
This is perfectly fine if every run is of the same font and the same direction.

However, if two adjacent runs A and B meet at CharPos and differ in, e.g., direction, such as English followed by Arabic, the computation of the caret's correct pixel position is not as simple.
Take for example the text: HelloساوِيWorld

UniscribeCaretPositioning2.png
Note that when the caret is at position x, it is always in front of the character at position x. So if caret is at position 0, it would be in front of the H in "Hello". At least that's how it works for LTR text, such as English.

When there is bidi text, the meaning of "in front of" is not as clear anymore. Instead, we talk about leading and trailing edge, as defined in the previous post. For bidi text, and when following Notepad's/Windows' logic, when the caret is at position x, it is between the character at logical position x-1 and x. How that relates to horizontal pixel coordinates is deduced below.

In the example above SelStart=CharPos=5. The character that visually follows after the caret is ي, but its position in the string is 9 not 5.

Side note: Visual order is always assumed to be one direction only (usually always left to right), as seen on screen and used during text rendering. The logical order (=string order) and visual order differ for RTL text. (The visual order is obtained by processing the logically ordered string using a bidi algorithm, such that text runs can be printed one after another in strictly increasing x coordinates. See the ScriptLayout function.)

Position 5 in the string is actually س, which appears at the right-most end of the Arabic run when rendered on screen. So you would actually expect that the caret is at the leading edge of the character at position 5, which would be at the right end of س (and not its left end, since this is an RTL run, which is effectively a coordinate system where increasing coordinates go from the right towards the left).

But if the caret is at position 5 it is also next to the end of o from Hello.

Summarized: logical position 5 is at the boundary between LTR text and RTL text. Visually, a caret at CharPos=5 should be at the right end (=logically trailing edge) of "o" (CharPos=4) and the right end (=logically leading edge) of س (CharPos=5), which results in two different pixel positions. This ambiguity needs to be resolved.
Notepad prefers to assume a caret at CharPos=5 is at the right end of "o", which is the logic we follow.

The next ambiguity is at CharPos=10: here the caret should either be at the left end (=trailing edge) of the ي character (which is at CharPos=9), or the left end (=leading edge) of the character W at CharPos=10. Again, tests show that Notepad prefers to position the caret relative to the character coming first (in logical order), which is ي (CharPos=9).

In any case of ambiguity (where two runs of different reading directions meet), the caret position is defined by the trailing edge of the character with the lower character position (out of the two characters logically framing the caret).

Logical vs. visual "framing" characters: what matters are the two characters that logically frame the caret. Referring again to the picture above, for caret position 5, the logical framing characters would be at position 4 and 5, even though visually, it looks like it should be characters 4 and 9.

Since the caret is technically at the higher character position (but the text insertion point is between CaretPos-1 and CaretPos), the lower character position is CaretPos-1 (=SelStart-1) and the higher one CaretPos (=SelStart).

Therefore the correct call is: ScriptCPtoX(SelStart-1, fTrailing=True, ...), i.e., the trailing end of the character coming before SelStart (in logical order).

Therefore the proper logic is to scan the runs in visual order, adding up their widths until CharPos falls within a run.

We need to handle the corner cases where SelStart = 0 or SelStart = Length(EntireLine).

TODO: how does that fit into the entire loop, and detail the corner cases logic/reasoning.
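
A possible shape of that loop, as a sketch under the convention derived above (RunContaining, Run.VisualOrigin and Run.CPtoX are hypothetical wrappers; Run.VisualOrigin is the summed pixel width of all runs before this run in visual order):

Code: Select all

function CaretPosToX(SelStart: Integer): Integer;
var
  Run: TRun;
begin
  if SelStart = 0 then
  begin
    // Corner case: caret before the first logical character of the line;
    // use that character's leading edge.
    Run := RunContaining(0);
    Exit(Run.VisualOrigin + Run.CPtoX(0, {fTrailing=}False));
  end;
  // Resolve the ambiguity at run boundaries like Notepad does: take the
  // trailing edge of the character logically preceding the caret.
  Run := RunContaining(SelStart - 1);
  Result := Run.VisualOrigin + Run.CPtoX(SelStart - 1 - Run.Start, {fTrailing=}True);
end;

The other corner case, SelStart = Length(EntireLine), falls out naturally: SelStart - 1 is then the last logical character, and the caret lands at its trailing edge.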

Wine's code for multi-line EDIT controls seems to confirm this.

Post by Maël »

I currently have a fallback font mechanism implemented, but it is still lacking in some cases. Compare with Notepad++ and Chrome/Firefox.
However note these bugs:
https://github.com/notepad-plus-plus/no ... issues/870
https://github.com/notepad-plus-plus/no ... ssues/3747
https://github.com/notepad-plus-plus/no ... ssues/5465
https://community.notepad-plus-plus.org ... in-notepad
https://community.notepad-plus-plus.org ... ed-unicode
https://community.notepad-plus-plus.org ... iri-script
https://community.notepad-plus-plus.org ... -as-in-npp
https://stackoverflow.com/questions/157 ... -corrupted

Test of the bug:
https://bryanchain.com/2016/02/17/unico ... n-windows/

Indeed, the ⚑ will not show in N++; instead a replacement glyph appears that resembles the replacement character (a square box around a question mark). Usually the replacement glyph is an empty box, but this depends on the font (Consolas has this weird glyph that looks too similar to the replacement character).

⚑ appears correctly in Windows EDIT controls. However TextOut will render it as missing glyph (empty box). My fallback-font technique, SelectFallbackFont, seems to work in this case.
But this line (from https://community.notepad-plus-plus.org/topic/17223/why-does-appears-as-in-npp): " ₰ ₱ ₲ ₳ ₴ ₵ ₶ ₷ ₸ ₹ ₺ ₻ ₼ ₽ ₾ B" fails with no fallback font found. However, Notepad fails here too with Consolas (though it apparently tried another font as well, since the replacement glyph is an empty box => not Consolas)! WordPad misses the same glyphs with Consolas, and uses a box with a question mark (=> Consolas) as the replacement glyph.
But when trying out several fonts in WordPad, it seems to keep the fonts for those characters whose glyphs are missing in other fonts. So selecting all text and changing it all to Consolas will not actually set Consolas everywhere; for that to happen, the entire text needs to be pasted from Notepad again.
This odd behavior makes it appear as if WordPad had a better font-fallback mechanism.

Firefox renders it properly (under the exact same Windows setup).

WordPad (and AkelPad, but only because of Courier New, so it doesn't count) does slightly better than Notepad: only one missing glyph, for ₻. My Uniscribe code (with Consolas) misses all the glyphs Notepad misses. Additionally, when my SelectFallbackFont() does not find any fallback font for a glyph, it does not recover and cancels all rendering. TODO: fix this.

Unfortunately, AkelPad is no better, contrary to what the StackOverflow link claims.
Another comment in one of the links mentions RJ TextEd, but it has the same issues, and also some problems with text selection using the mouse.

Probably, several fallback fonts need to be allowed, not just one per run. But how do we itemize a run even further?
Apparently, this really has to be done: https://docs.microsoft.com/de-de/window ... lback-font
Call ScriptShape with each font in a list until it can be determined that no font will succeed. If the error code indicates that some of the characters are mapping to missing glyphs, break up the string into smaller ranges. Different fonts can be assigned to subranges so that more of the characters can be rendered.
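A sketch of that strategy (ShapeWithFallback is a hypothetical wrapper; Cache, Text, Len and the output buffers come from the surrounding code, and a complete implementation must additionally scan the output for the font's default glyph, see ScriptGetFontProperties):

Code: Select all

function ShapeWithFallback(DC: HDC; const Fonts: array of HFONT): HRESULT;
var
  i: Integer;
begin
  Result := E_FAIL;
  for i := 0 to High(Fonts) do
  begin
    SelectObject(DC, Fonts[i]);
    ScriptFreeCache(@Cache);  // the script cache is font-specific
    Result := ScriptShape(DC, @Cache, PWideChar(Text), Len, MaxGlyphs,
      @Analysis, @Glyphs[0], @Clusters[0], @VisAttrs[0], @GlyphCount);
    if Result = S_OK then
      Exit;  // this font shaped the run; otherwise (e.g.,
             // USP_E_SCRIPT_NOT_IN_FONT) try the next candidate
  end;
  // No font succeeded: split the string into smaller ranges and retry,
  // as the quote above suggests.
end;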
But also consider this caveat: https://stackoverflow.com/questions/168 ... t-language
There was also a hack to try to minimize the number of font switches, on the theory that it was better to use one backup font that covers the entire text than to switch fonts several times. As I recall, this solved some problems where strings of CJK characters would pick some glyphs from a Chinese font and others from a Japanese font. To native readers, it was better to have the entire string in one typeface when possible, even if the preferred font was in the other style.
Regarding this "same font as long as possible" issue: https://devblogs.microsoft.com/oldnewth ... 0/?p=38413

TODO: LibreOffice might be an even better reference for font fallback, because they even support SIL Graphite: graphite.sil.org

Post by Maël »

Display "non-printable" codepoints and code chart display in general
Since the text column of a hex editor is, in a way, a code chart display, it should follow the conventions for code charts. This will be one display mode of the text column; others will provide more support for complex script shaping, such as combining characters and ligatures, or bidi layout. But the code chart display is definitely one that is needed, especially for handling control characters and other "unprintable" ones that are new in Unicode. Which Unicode characters fall into this category, and how to render them, should be derivable from the material collected below.

For conventions on how to display non-printable characters, or, as Unicode calls them, characters with invisible glyphs, see here for various approaches: 24.1 Character Names List, Images in the Code Charts and Character Lists

See also IsGraphicCharacter and IsVisibleNonBlankCharacter in my XmUnicodeMisc.pas.
It may also make sense to render blanks with replacement glyphs, as is commonly done for control characters in the Catch22 demo or Notepad++, or using one of the conventions in the link above.

Looking up characters and their glyphs:
  • https://codepoints.net very flexible search and a lot of detailed information/Unicode properties.
  • https://graphemica.com is useful to lookup a lot of raw Unicode properties. Would benefit from additional human readable property names. Has a useful feature that shows which fonts contain a glyph for that character, and a useful "related characters" section.
  • https://decodeunicode.org/ helps to see a glyph for every character in Unicode, no matter if the browser has a font that supports it or not. Search is awkward though, since it requires to enter the character directly or use U+XXXXXX notation, no support for short names of characters, like CGJ. Also horrible all caps everywhere, overloaded design (this is not a magazine...), huge fonts (next to small ones) and lacking in good layout.
codepoints.net has a useful category for invisible glyph characters: "Default Ignorable Code Point (DI)".
For example the https://codepoints.net/U+034F COMBINING GRAPHEME JOINER.

The Unicode standard definition of Default_Ignorable_Code_Point:
For programmatic determination of default ignorable code points. New characters that should be ignored in rendering (unless explicitly supported) will be assigned in these ranges, permitting programs to correctly handle the default rendering of such characters when not otherwise supported. For more information, see the FAQ Display of Unsupported Characters, and Section 5.21, Ignoring Characters in Processing in [Unicode].

Generated from:
Other_Default_Ignorable_Code_Point
+ Cf (Format characters)
+ Variation_Selector
- White_Space
- FFF9..FFFB (Interlinear annotation format characters)
- 13430..13438 (Egyptian hieroglyph format characters)
- Prepended_Concatenation_Mark (Exceptional format characters that should be visible)
Many of the properties linked in the quote above can be found in PropList.txt. Or here the 8.0.0 version, since that's what we base our code on: https://www.unicode.org/Public/8.0.0/ucd/PropList.txt

Default_Ignorable_Code_Point is available in DerivedCoreProperties.txt or in version 8.0.0.

Partial quote from "5.21Ignoring Characters in Processing, Default Ignorable Code Point":
The Default_Ignorable_Code_Point property is also given to certain ranges of code points: U+2060..U+206F, U+FFF0..U+FFF8, and U+E0000..U+E0FFF, including any unassigned code points in those ranges. These ranges are designed and reserved for future encoding of format characters and similar special-use characters, to allow a certain degree of forward compatibility. Implementations which encounter unassigned code points in these ranges should ignore them for display in fallback rendering.

Surrogate code points, private-use characters, and control characters are not given the Default_Ignorable_Code_Point property. To avoid security problems, such characters or code points, when not interpreted and not displayable by normal rendering, should be displayed in fallback rendering with a fallback glyph, so that there is a visible indication of their presence in the text. For more information, see Unicode Technical Report #36, “Unicode Security Considerations.”

A small number of format characters (General_Category = Cf ) are also not given the Default_Ignorable_Code_Point property. This may surprise implementers, who often assume that all format characters are generally ignored in fallback display.
TODO: This shows we need to differentiate between (a) Unicode's idea of which characters should not be displayed / are invisible; (b) displaying non-printable chars (which should include control chars, private-use chars, and surrogate code points -- and possibly all formatting characters, including those excluded from Default_Ignorable_Code_Point; see the quote's source for details on the caveats) with replacement glyphs while disabling any formatting properties they have; and (c) a best effort for forward compatibility / rendering as well as possible for common text display, as in an EDIT control (though this last option is probably best left to Uniscribe's functions -- then again, it may become relevant if font fallback fails).
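
As a fragmentary sketch, covering only the reserved ranges named in the quote above (a complete check must be generated from DerivedCoreProperties.txt):

Code: Select all

function InReservedDefaultIgnorableRange(CP: UCS4Char): Boolean;
begin
  Result := ((CP >= $2060)  and (CP <= $206F)) or
            ((CP >= $FFF0)  and (CP <= $FFF8)) or
            ((CP >= $E0000) and (CP <= $E0FFF));
end;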

Also relevant: 5.3 Unknown and Missing Characters

Useful in general: Unicode implementation guidelines

Post by Maël »

Random note:
Agreed: even though it's a huge amount of work to make your own edit control, it's better than trying to hack the standard Windows controls (and trying to catch their hard-to-predict and possibly changing behavior):
https://stackoverflow.com/questions/195 ... it-control

Changes in new OS versions will break such an extended standard control, which is worse than having a self-made control that lacks new behavior. At least the latter ensures the implemented behavior continues working reliably, possibly with fewer bells and whistles.

Post by Maël »

Control chars and similar unprintable chars will be shown similarly to SciTE (a dark box with the character's acronym), and glyphs missing from fonts will be shown similarly to Firefox (a box with the hex digits of the codepoint).
This will make it obvious whether it is a font problem, or a character that has no visual representation, or one whose effect is strictly formatting (such as line breaks, tabs and similar -- diacritics do not count and should be displayed normally).
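
As a sketch, the decision could look like this (IsUnprintable, FontHasGlyph and the Draw* routines are hypothetical):

Code: Select all

if IsUnprintable(CodePoint) then
  DrawAcronymBox(CodePoint)    // SciTE-style: dark box with the character's acronym
else if not FontHasGlyph(Font, CodePoint) then
  DrawHexBox(CodePoint)        // Firefox-style: box with the codepoint's hex digits
else
  DrawGlyphs(CodePoint);       // normal rendering (diacritics included)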

Example of a development version rendering of a string with embedded #0 character (NUL):
UniscribeEmbeddedNUL.png
This will result in a single-line text/edit control which will be useful in many places in HxD.

It will allow searching for strings with embedded NULL or other special control characters, which is impossible (for NULL) or difficult (for control characters, many of which remain invisible in standard Windows edit controls).

The Datainspector will also be able to show WideChar, AnsiChar, or UTF-8 codepoints that are NULL characters or control characters.

This will solve the issue of the hex editor control being capable of handling such characters (even though just by representing them as a dot), while other GUI controls are not, since the standard Windows controls they are based on cannot handle them either.

Copy and pasting will work as expected as well, when using HxD's internal clipboard.

Finally, a two column mode text edit control could be useful, at least for single characters, one side showing the normal display, the other side the codepoints. Again, this will be especially useful for the Datainspector.