"Abuse" of combining sequences lead to so called Zalgo text: https://stackoverflow.com/questions/657 ... -text-work
The following Unicode report deals with exactly this issue, and suggests a limit of 30 consecutive combining characters (or more correctly, 30 non-starter code points), which results in a grapheme cluster of at most 31 code points when also counting the starting character/code point:
http://unicode.org/reports/tr15/#Stream ... ext_Format
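To make the limit concrete, here is a minimal sketch of how a run of non-starters could be measured. It assumes a non-starter is any code point with a non-zero canonical combining class, and it only looks at the code points as stored. The report defines the limit in terms of decomposed text, so this is an approximation, not the full Stream-Safe check. The names longest_non_starter_run and exceeds_limit are made up for illustration.

```python
import unicodedata

# Limit suggested by the Stream-Safe Text Format (UAX #15).
MAX_NON_STARTERS = 30


def longest_non_starter_run(text: str) -> int:
    """Return the length of the longest run of consecutive non-starters.

    Assumption: a non-starter is a code point whose canonical combining
    class is non-zero. The text is not decomposed first, so this is only
    an approximation of the check described in the report.
    """
    longest = 0
    current = 0
    for ch in text:
        if unicodedata.combining(ch) != 0:
            current += 1
            longest = max(longest, current)
        else:
            # A starter ends the current run.
            current = 0
    return longest


def exceeds_limit(text: str) -> bool:
    """True if some run of non-starters is longer than the suggested limit."""
    return longest_non_starter_run(text) > MAX_NON_STARTERS
```

A string such as "a" followed by 31 copies of U+0301 would be flagged, while "a" followed by 30 copies would not.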
The following link has a brief recap of Unicode, with the most relevant points, and was the first source I found regarding the defined limit of combining characters:
https://mathias.gaunard.com/unicode/doc ... _sequences
Zalgo text or similar use of many combining characters might cause rendering issues in HxD, even when we limit the combining sequence to the number mentioned above, because a line's height might exceed the normal/average line height. Somehow we have to limit this or clip it, or otherwise handle it. Simply visually clipping the combining characters would falsify the information, though. On the other hand, a hex editor should provide a clear and controllable view, without ambiguous or unpredictable behavior.
Weird effects/bugs in script engines might cause combining characters to appear before the first code point that was printed, e.g., they will decorate the hex column that comes before the text column. Therefore the text column needs to be clipped. We have to make sure that clipping does not cut off text in italic or bold, or cause similar issues with text that is slightly larger than expected. We have to test whether TextWidth() and similar functions return the correct dimensions for styled text (or multi-script / multi-font text).
An example that causes buggy horizontal rendering (i.e., decoration before the x position of character "a"):
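For testing, a string of this kind is easy to generate. The following is only a sketch: the base letter "a" is taken from the description above, but the particular combining marks and the repeat count are arbitrary choices for illustration and not necessarily the ones that trigger the horizontal rendering bug. The name make_zalgo is made up.

```python
import unicodedata


def make_zalgo(base: str = "a",
               marks: str = "\u0300\u0301\u0302\u0323\u0324",
               repeat: int = 8) -> str:
    """Build a test string: one base character plus many combining marks.

    The default marks (three above, two below) are an arbitrary
    illustrative selection, not a known trigger for any specific bug.
    """
    for mark in marks:
        # Make sure every mark really is a non-starter.
        if unicodedata.combining(mark) == 0:
            raise ValueError(f"U+{ord(mark):04X} is not a combining mark")
    return base + marks * repeat


# 1 base code point + 5 marks * 8 repetitions = 41 code points,
# which is also more than the 30 non-starters mentioned earlier.
sample = make_zalgo()
```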
Possible solutions for Zalgo text as implemented in gedit, SciTE, and the console/terminal: https://stackoverflow.com/questions/510 ... o-use-them
The idea is essentially to limit how much room combining marks can take up as glyphs. The limit is the font height, plus maybe some small padding, but usually fonts reserve space for diacritical marks. Since a hex editor should have a fixed line height, we can use the one defined by the font (for a certain font size) and possibly add some slight margin on top of that (margin should be relative to font size). Everything else will be cut off, or, depending on the text renderer (and not controllable by us), be drawn on top of each other, as shown in the pictures in the link above.
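As a rough sketch of that idea: the row height is derived from the font once, and every row of the text column gets a clip rectangle of exactly that height, which also keeps marks from spilling left into the hex column. All names here (Rect, row_height, text_clip_rect, margin_ratio) are hypothetical, and the default margin of 10% of the font size is just a placeholder. The actual clipping would be done by whatever drawing API is used, which is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    left: int
    top: int
    right: int
    bottom: int


def row_height(font_height: int, font_size: int,
               margin_ratio: float = 0.1) -> int:
    """Fixed row height: the font's line height plus a small margin.

    The margin is relative to the font size; margin_ratio is an
    assumed tuning value, not a recommendation.
    """
    return font_height + round(font_size * margin_ratio)


def text_clip_rect(row: int, column_left: int, column_width: int,
                   height: int) -> Rect:
    """Clip rectangle for one row of the text column.

    Anything a renderer draws outside this rectangle (stacked marks
    above or below the row, or marks left of the column) is cut off.
    """
    top = row * height
    return Rect(column_left, top, column_left + column_width, top + height)
```

With a font height of 16, a font size of 12 and a margin ratio of 0.25, the row height is 19, and row 2 of a text column starting at x = 300 with width 160 is clipped to (300, 38, 460, 57).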
Possibly useful Uniscribe links:
https://stackoverflow.com/questions/540 ... y-the-font
Complex string processing examples (HxD will not support bidi text, and selection will only go in one direction, as mentioned in earlier posts, but these examples might be useful to test grapheme cluster handling):
Displaying text in Uniscribe (all the necessary steps): https://docs.microsoft.com/de-de/window ... -uniscribe
https://scripts.sil.org/cms/scripts/pag ... beVersions
"Uniscribe: The Missing Documentation & Examples":
Uniscribe in Wine: https://wiki.winehq.org/Uniscribe
Lists of which scripts are complex (CJK is not, but Hangul is), and how font fallback is implemented for ScriptString*.
Might be worth a look to learn how to implement font fallback when using the "manual" non-ScriptString* APIs from Uniscribe, which do not do any font substitution.
See also: https://docs.microsoft.com/de-de/window ... t-fallback
WebKit's solution: https://stackoverflow.com/questions/168 ... t-language
https://superuser.com/questions/396160/ ... t-fallback