Correctly positioning the caret on screen (a lot of example code gets it wrong)
The WinAPI function ScriptCPtoX has a parameter fTrailing, which influences whether the trailing edge of the cluster pointed at by iCP is used (fTrailing = True) or the leading edge is used (fTrailing = False).
The leading edge of a cluster is the leftmost edge for LTR text and the rightmost edge for RTL text. The trailing edge is the other of the two edges bounding a cluster horizontally.
ScriptCPtoX(iCP, True, ...) = ScriptCPtoX(iCP + 1, False, ...) for both LTR and RTL text.
This is an exact equivalence: there is no difference, not even one pixel.
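The equivalence above can be illustrated with a small model. This is only a sketch in Python and does not call Uniscribe: `edge_x` and the advance widths are made up for illustration, and every character is assumed to form its own cluster inside a single run.

```python
def edge_x(widths, rtl, i, trailing):
    """x offset of the leading or trailing edge of cluster i in a simplified run.

    widths: advance width of each cluster in logical order (pixels, made up).
    rtl:    True if the run is laid out right to left.
    """
    # Distance from the run's starting edge to the requested edge of cluster i.
    before = sum(widths[:i + 1] if trailing else widths[:i])
    total = sum(widths)
    # An LTR run starts at its left edge (x = 0), an RTL run at its right edge.
    return total - before if rtl else before


widths = [7, 4, 9, 5]  # hypothetical advance widths

for rtl in (False, True):
    for i in range(len(widths) - 1):
        # Trailing edge of cluster i coincides with leading edge of cluster i + 1.
        assert edge_x(widths, rtl, i, True) == edge_x(widths, rtl, i + 1, False)
```

In this model the only difference between LTR and RTL is the side from which the accumulated width is measured, which is why the equality holds in both directions.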
Incorrect behavior:
All of this matters for properly computing the caret position in bidi text, which is wrong in the Catch22 demo and partly wrong in the Wine code for the Windows EDIT control (wrong in single-line mode, but correct in multi-line mode).
Standard/correct behavior:
Notepad, Visual Studio 2017 (maybe earlier versions as well) and WordPad.
Arrow keys navigate in logical order, and selection changes direction according to text run order. The directions of Backspace and Del are reversed in RTL text.
VS does, however, stretch Arabic characters to fit the width of the average/Latin characters, using Kashidas and spacing, so I assume they use ScriptJustify for this purpose. Chinese characters, on the other hand, are not justified to multiples of the average/Latin character width.
VS has two related entries on the status bar: Column (Col) and Char(Ch).
Ch seems to be based on WideChars: an "a" followed by a diaeresis is rendered as one character, but counted as two Ch. It is counted as one only for Col.
𨭎 is counted as two Ch and two Col.
For Zalgo text, VS apparently counts Cols visually. No matter how many combining characters decorate a character vertically, it counts as just one Col.
Col is definitely visual, based on average character width, since the Tab character is only counted as one Ch but as several Col (depending on how wide Tab was defined in the options).
An "a" with diaeresis is counted as 3 Ch when the file is encoded as UTF-8. With no specified encoding, the encoding must be assumed to be UTF-16, since then it is counted as two Ch. The same holds for "ANSI" text before it is saved, so internally VS must use UTF-16, which makes sense on Windows.
Col is not based on grapheme clusters since 𨭎 is counted as two (and two chars).
Finally, VS gets around the problem of how to interpret column information in compiler (console) output by simply ignoring it (or the MS compilers do not generate that info). The correct behavior (which may be implemented for languages other than C/C++) would be to use the encoding and place the caret based on the char (= code point) position.
Conclusion: Col = count of average characters, Ch = code unit count (WideChar for UTF-16 or Byte for UTF-8).
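The Ch numbers observed above can be reproduced by counting code units. This is a minimal Python sketch; `ch_count` is a hypothetical helper name and not anything VS exposes.

```python
def ch_count(text, encoding):
    """Number of code units in text: bytes for UTF-8, 16-bit units for UTF-16."""
    if encoding == "utf-8":
        return len(text.encode("utf-8"))
    if encoding == "utf-16":
        # Encode without a BOM; every code unit occupies two bytes.
        return len(text.encode("utf-16-le")) // 2
    raise ValueError("unsupported encoding: " + encoding)


a_diaeresis = "a\u0308"  # "a" followed by U+0308 COMBINING DIAERESIS
cjk = "𨭎"               # a character outside the Basic Multilingual Plane

print(ch_count(a_diaeresis, "utf-16"))  # 2
print(ch_count(a_diaeresis, "utf-8"))   # 3
print(ch_count(cjk, "utf-16"))          # 2 (a surrogate pair)
```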
Col and Ch in the status bar of VS follow the logical order (not visual order like Notepad++).
Otherwise it seems to behave identically to Notepad.
Non-standard behavior:
SciTE / Notepad++ simply ignore bidirectional caret positioning and selection, or rather use the visual order only, never changing direction (not of the arrow keys, but of the Backspace and Del keys, which is the most confusing part of Windows Notepad). But the column they show in the status bar is also in visual order and not logical (file) order, so I am not sure how wise that idea is. They count in units of grapheme clusters: an "a" followed by a diaeresis is drawn as one character and counted as one (not as the two code points it is made of). Chinese characters are also counted as one. There is no justification/alignment to average character width, neither for Arabic (as opposed to VS) nor for Chinese.
Bug: further analysis reveals a bug. For the text "HelloساوِيWorld", selecting with Shift+Right Arrow (pressing Right Arrow 7 times), then copying and pasting the selection, produces the string "Helloسا", which is expected for logical selection order. Yet the highlight is drawn in visual order: "Helloوِي", or as a picture:
[Image: UniscribeNotepadPlusPlusSelectionBug.png]
In other words, what it shows as selected and what is actually copied are not the same!
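To make the mismatch concrete, here is a rough Python sketch of how a logical selection maps to highlighted screen ranges. It is not Notepad++ or Uniscribe code: `visual_ranges` is a hypothetical helper, positions are counted in clusters, and the model assumes an LTR paragraph in which runs appear on screen in memory order and only the clusters inside an RTL run are mirrored.

```python
def visual_ranges(runs, sel_start, sel_end):
    """Map a logical selection [sel_start, sel_end) to visual highlight ranges.

    runs: list of (start, end, rtl) tuples in cluster units, in logical order.
    """
    ranges = []
    for start, end, rtl in runs:
        lo, hi = max(start, sel_start), min(end, sel_end)
        if lo >= hi:
            continue  # the selection does not touch this run
        if rtl:
            # Mirror the selected part within the run's visual extent.
            lo, hi = start + end - hi, start + end - lo
        if ranges and ranges[-1][1] == lo:
            ranges[-1] = (ranges[-1][0], hi)  # merge with the adjacent range
        else:
            ranges.append((lo, hi))
    return ranges


# "Hello" (5 clusters), the Arabic run (4 clusters), "World" (5 clusters)
runs = [(0, 5, False), (5, 9, True), (9, 14, False)]

print(visual_ranges(runs, 0, 7))  # [(0, 5), (7, 9)]
```

Under these assumptions, the 7 logically selected clusters should be highlighted as two separate pieces, visual cells 0 to 4 and 7 to 8, whereas the highlight described above covers the first 7 visual cells as one piece.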
Firefox and others use visual order for caret placement (the arrow keys stay in visual order) but do consider text direction during selection. They also invert the direction of Backspace and Del in RTL text, although the rules are slightly confusing. Selection in Firefox works in logical order, as opposed to caret positioning with the arrow keys, which is in visual order. Firefox also seems to hide the caret during selection, probably to avoid confusion about how differently the caret would behave.
Yet the logical order selection is slightly different from Notepad's. Firefox selects all the RTL text except for its last character when selecting with Shift+Right Arrow (pressing Right Arrow 6 times), then progressively reduces the selection of the RTL text as Right Arrow is pressed further. The issue is that the Arabic text cannot be fully selected this way: pressing Right Arrow several times will select the W of World, too, as soon as the whole Arabic text is selected. Only going back with Left Arrow can deselect the W and keep the Arabic text selected.
This does not seem very practical.
Further analysis shows that mouse selection in Firefox and Notepad is identical, and that keyboard selection in Firefox is essentially the same as mouse selection, with one difference. When selecting English text followed by Arabic text, mouse selection selects the whole Arabic text upon touching it from the left, then gradually deselects it while progressing to the right. Keyboard selection in Firefox does the same, but instead of selecting the entire Arabic text, it selects the Arabic text minus the last character, then reduces the selection in the same way.
It remains problematic that the entire Arabic text cannot be selected without also selecting the first English letter that follows after it (always assuming the same example text used above: "HelloساوِيWorld").
The other issue is that Firefox text selection is not "idempotent". As seen in the example above, the entire Arabic text cannot be selected without also selecting the "W" of World. But when reducing the selection with Left Arrow, it works. Windows keyboard selection is more consistent, and going backwards again during keyboard selection makes no such difference. Every press of Right Arrow / Left Arrow selects one cluster (one user-perceived character). Firefox sometimes selects two and sometimes one (again, refer to the example above), which is not consistent.
Firefox could fix this while keeping its selection order by selecting the entire Arabic text when Shift+Right Arrow is pressed directly after "Hello" is already fully selected. But since the caret is hidden, it would not be obvious where in the selection you are. Maybe that was the reason for this non-idempotent behavior.
My goal:
Currently, my goal is to follow Notepad's conventions, since this is the Windows standard. I may deviate from this later as needed, but as an option. A correct / standard behavior should exist first. I will not justify the text (and not expand tabs), to avoid blanks (which may be interpreted as actual space characters) or Kashidas, which may also cause confusion about how many actual characters there are. Additionally, Col/Ch will be in bytes, not based on average character width, code units, or grapheme clusters. This is the best choice for a hex editor. I might add conversion functions to compute higher-level units, such as code units (= 16 bits for UTF-16) or code points, where necessary. Visual columns (Col), as in VS, do not make much sense; the highest-level unit should be a cluster as defined by ScriptXtoCP and ScriptCPtoX.
In a code editor / VS, columns may make sense for alignment purposes / "crude" formatting. But this is not a goal in a hex editor; here, complete clarity about which characters are actually in memory/file/data stream is key.
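A conversion function from byte offsets to higher-level units could look roughly like the following Python sketch for UTF-16LE data. The name `byte_offset_to_units` is hypothetical, and the sketch assumes well-formed data (it simply skips low surrogates when counting code points, so a lone low surrogate would not be counted).

```python
import struct


def byte_offset_to_units(data, offset):
    """Convert a byte offset into UTF-16LE data to (code unit, code point) indices.

    The code point index is the number of code points that begin before offset.
    """
    if not 0 <= offset <= len(data):
        raise ValueError("offset out of range")
    code_unit = offset // 2
    # Decode only the 16-bit units that lie completely before the offset.
    units = struct.unpack("<%dH" % code_unit, data[:code_unit * 2])
    # A low surrogate (0xDC00..0xDFFF) continues a code point, it never starts one.
    code_point = sum(1 for u in units if not 0xDC00 <= u <= 0xDFFF)
    return code_unit, code_point


data = "a\u0308𨭎b".encode("utf-16-le")  # 10 bytes
print(byte_offset_to_units(data, 8))  # (4, 3)
```

Byte offset 8 points at the "b": four code units precede it, but only three code points, because 𨭎 occupies a surrogate pair.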
Visual order also seems problematic because of the ambiguity of caret positioning in bidi text (see post below). Maybe there is a good solution, but I haven't seen it implemented in practice (Firefox selection is confusing, especially when compared with the visual-only direction of caret navigation using arrow keys), and I'd have to think about it in more detail. It is especially problematic when you consider that text selection needs to reflect the selection of the string in memory, i.e., in logical order. So selection always has to map to logical order, or the selection in memory will have gaps (see also the Notepad++ bug above). That's probably why caret positioning in Firefox works in visual order as long as you just move the caret (without selecting), but reverts to logical order when you select (and hides the caret).
Besides Notepad's convention, the only other reasonable option seems to be to render RTL text from left to right, and then also select from left to right. This would be mostly useful for the hex editor's text column, especially when it is not obvious where an RTL run ends, and therefore rendering cannot be done in RTL order.