https://docs.microsoft.com/en-us/window ... #abc-width
This link explains ABC widths and could help deal with the overhanging-glyph problem (e.g., Ἧ, where the leading diacritic is clipped when plain TextOut is used). Test whether the ABC width of a text line includes this overhang, so that we can respect it when printing text and avoid clipping. ClearType has the same problem: it produces some additional pixels outside the allotted rect, as Catch22 noticed.
We should always provide enough spacing before and after the text control in HxD so that such text prints fully, without clipping. That also means we should first draw the background and the padding between the controls, and then draw the text with a transparent background and no clipping.
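To make the needed padding concrete: with GDI, GetCharABCWidths returns per-glyph (A, B, C) triples, where a negative A means the ink starts left of the pen position and a negative C means it extends past the advance. A minimal sketch of the padding computation (plain Python with the GDI values as plain data; the simple-LTR-run/no-kerning assumption and all names are mine, not from the docs):

```python
from typing import NamedTuple

class ABC(NamedTuple):
    # Mirrors GDI's ABC struct: A = left-side bearing, B = black-box
    # width, C = right-side bearing. Negative A/C means overhang.
    a: int
    b: int
    c: int

def line_overhang(abcs: list) -> tuple:
    """Return (left_pad, right_pad) in pixels needed so the line's ink
    is not clipped when drawn with its pen origin at x = 0.

    Assumes a simple LTR run without kerning, so only the first glyph
    can overhang on the left and only the last glyph on the right.
    """
    if not abcs:
        return (0, 0)
    left = max(0, -abcs[0].a)    # e.g. a leading diacritic, as in Ἧ
    right = max(0, -abcs[-1].c)  # e.g. a trailing italic overhang
    return (left, right)

# Hypothetical values: first glyph overhangs 3 px left, last 2 px right.
print(line_overhang([ABC(-3, 10, 1), ABC(1, 8, -2)]))  # (3, 2)
```

Whether the ABC sums for a whole line actually account for this overhang is exactly what remains to be tested against the real GDI output.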
What to do with Zalgo text, though, which might otherwise spill into the hexpair area? We would still need clipping, but only at about half of the gap between the hex pair control and the text control.
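One heuristic for deciding whether a cell even risks spilling: count the combining marks stacked on a single base character, since Zalgo text is exactly that. A sketch using Python's unicodedata (the threshold value is an arbitrary assumption, not a measured limit):

```python
import unicodedata

def max_combining_run(text: str) -> int:
    """Longest run of combining marks (combining class != 0) attached
    to one base character; Zalgo text produces large values."""
    longest = run = 0
    for ch in text:
        if unicodedata.combining(ch):
            run += 1
            longest = max(longest, run)
        else:
            run = 0
    return longest

def needs_clipping(text: str, threshold: int = 3) -> bool:
    # Threshold is a guess: tall stacks of marks spill above/below the
    # line box and, with some fonts, into neighboring controls.
    return max_combining_run(text) > threshold

zalgo = "H\u0315\u0300\u036f\u0316e\u0301"
print(max_combining_run(zalgo))  # 4
print(needs_clipping("hello"))   # False
```

This only flags the risk; the actual clip rect (half the gap between the two controls) would still be applied at draw time.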
The above glossary has many other useful definitions of the terms run, font fallback, complex script, and script ("A script is a system of written language, for example, Latin script, Arabic script, Chinese script" -- so "Schriftsystem" is the right German translation, not "Handschrift" nor "Schrift").
There is also a section, About Complex Scripts, that goes into more detail on contextual shaping and combining characters.
More detail on complex script processing
Even more detail on the process of text rendering is available here:
https://docs.microsoft.com/de-de/typogr ... sing-part1
The above link also contains some interesting representations for mapping hexquadruples/codepoints to glyphs/characters in a somewhat similar way to what hex editors do:
(This picture resembles what I wanted to do for my structure view.)
Finally, a perfect explanation of complex text layout (Wikipedia), using Arabic and Devanagari as examples. Here is the Arabic example (the word العربية al-arabiyyah, "the Arabic [language]" in Arabic), showing three possible renderings: LTR rendering (no complex script layout), RTL (no complex script layout), and RTL (with contextual shaping).
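The difference between the first two renderings is purely storage order vs. display order; a sketch producing the two naive glyph sequences (Python, no shaping engine involved):

```python
# The word al-arabiyyah in logical (storage) order: alif comes first.
arabic = "\u0627\u0644\u0639\u0631\u0628\u064a\u0629"  # العربية

# Rendering 1 -- LTR, no complex layout: glyphs drawn in storage
# order, so the first letter (alif) wrongly appears leftmost.
ltr_no_layout = list(arabic)

# Rendering 2 -- RTL, no complex layout: storage order reversed, so
# the first letter appears rightmost, but every glyph stays isolated.
rtl_no_layout = list(reversed(arabic))

# Rendering 3 (RTL with contextual shaping) requires a real shaping
# engine (Uniscribe/HarfBuzz); it cannot be derived per character.
print(ltr_no_layout[0] == "\u0627")   # True
print(rtl_no_layout[-1] == "\u0627")  # True
```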
Side note: Arabic alphabet
There are four different variants of each letter, depending on its position in a word (initial, medial, final, or isolated):
(search for "Arabic letters change their shape according to their position in a word" -- there is no <a> anchor in the HTML of the page...)
See also table of letter forms (Wikipedia)
From the above link:
Each letter has a position-independent encoding in Unicode, and the rendering software can infer the correct glyph form (initial, medial, final or isolated) from its joining context. That is the current recommendation. However, for compatibility with previous standards, the initial, medial, final and isolated forms can also be encoded separately.
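Those separately encoded compatibility forms live in the Arabic Presentation Forms blocks, and their Unicode decompositions record which positional form each one is. A quick check with Python's unicodedata, using SEEN (U+0633), whose four forms I take to be U+FEB1..U+FEB4 per the Unicode character database:

```python
import unicodedata

# Arabic letter SEEN (U+0633) and its four presentation forms; the
# decomposition tag names the positional variant each form encodes.
for cp in (0xFEB1, 0xFEB2, 0xFEB3, 0xFEB4):
    print(f"U+{cp:04X} {unicodedata.decomposition(chr(cp))}")
# U+FEB1 <isolated> 0633
# U+FEB2 <final> 0633
# U+FEB3 <initial> 0633
# U+FEB4 <medial> 0633
```

So a renderer (or our own preprocessing) can always recover the underlying position-independent letter from a legacy presentation form.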
Other relevant information on contextual shaping and complex script rendering in general, with lots of useful pictures:
http://scripts.sil.org/cms/scripts/page ... ndExamples
A special case of deletion involves the creation of ligatures, where two or more items are replaced by a single glyph representing both or all of them. In the case of a ligature, there are visual components of the final glyph that correspond to each of the underlying characters.
From: https://scripts.sil.org/cms/scripts/pag ... tutorial14
Given the research above, Arabic seems reasonably similar to cursive writing of the Latin script (even though contextual shaping is much more drastic), such that an Arabic reader should be able to read a word composed of isolated letters, as opposed to the more connected shape obtained from using the correct form for each letter's in-word position. Well-trained Arabic readers will probably recognize contextually shaped words more readily (they can look quite different from words written as a sequence of isolated letters -- see the embedded image above). But it should remain readable, if slightly awkward.
Not scientific proof, but somewhat suggestive that reading is indeed pattern matching of well-known shapes (so an uncommon rendering, which isolated glyphs are, will be awkward -- but again, a lot of text in HxD will not be natural language and therefore not made of the common patterns that form words):
Microsoft researchers spent more than two years sifting through a large amount of research related to both typography and the psychology of reading. They concluded that reading is a form of pattern recognition. People become immersed in reading only when word recognition is a subconscious task and the conscious mind is free to read the text for meaning.
From https://docs.microsoft.com/de-de/typogr ... -cleartype
(Only slightly related, more of an interesting note: https://docs.microsoft.com/de-de/typogr ... my-display -- "No two screens are exactly the same and everyone perceives color in a slightly different way.")
What was discovered is that word recognition is only subconscious when typographical elements such as the shape and weight of letterforms, and the spacing between letters work together to present words as easily recognized patterns.
The isolated character mode should definitely be available for Arabic letter sequences that do not form meaningful words, but may be random letter sequences (due to random data viewed in a hex editor) or letter sequences with non-linguistic meaning. In this case, each character should be unambiguously identifiable and not be "obfuscated" by contextual shaping, since recognizing entire words cannot work when the character sequences do not form words.
Since various letter forms are separately encoded in legacy encodings (and therefore in Unicode -- see the quote above), we might encounter such explicitly shaped letters, which should still be printed in isolation, with a gap (and no ligature) between adjacent letters. Check that the text renderer respects this isolated rendering!
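Should the renderer not offer a "no shaping" switch, one portable way to force isolated forms is to break the joining context: ZERO WIDTH NON-JOINER (U+200C) suppresses cursive joining under Unicode's Arabic shaping rules. A sketch of the idea (whether a given renderer honors ZWNJ is exactly what needs testing):

```python
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER: suppresses cursive joining

def isolate_letters(text: str) -> str:
    """Interleave ZWNJ so every letter loses its joining neighbors
    and a conformant shaper falls back to the isolated form."""
    return ZWNJ.join(text)

word = "\u0627\u0644\u0639\u0631\u0628\u064a\u0629"  # العربية
isolated = isolate_letters(word)
print(len(isolated) == 2 * len(word) - 1)  # True
```

Note this changes the character stream handed to the renderer only; the underlying bytes shown in the hex pane stay untouched.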
So if it turns out cursive script is technically not possible in a hex editor, this should still be acceptable, if not ideal. RTL, however, would be good to have, if possible. The issue of having to scan far back still needs to be resolved, though (see the two posts above this one).
The question remains how much of this holds for Devanagari, since it also reorders letters "randomly", not just from LTR to RTL.
RTL and LTR marks (which alter the text display order) should probably be ignored: honoring them would again require scanning a large part of the document, and could cause inconsistencies when a mark is out of range (of our search) and then comes into range while scrolling, causing seemingly random direction changes of the text. For the same reason it is probably better to ignore RTL entirely, but reading would be very awkward this way...
Yet random text that happens to contain RTL character sequences and has no linguistic meaning would not benefit from RTL order -- quite the opposite, since RTL makes it harder to correlate characters to bytes.
So at the very least, provide a forced LTR display order, and possibly an additional RTL order.
Or maybe it is best to provide a separate text display with full support for text rendering like a text editor, including line breaks (it would still need to limit scanning of the file, otherwise it may become unresponsive -- or it would need to introduce artificial line breaks after too many characters, but this would again mess up rendering due to potentially missing text display order marks...).