Control character display using font encoding (was: Several necessary features)

Posted: 30 Mar 2017 07:53
by FalseMaster
  1. Copying to clipboard current offset at <Alt+Ins> keystroke (like in OllyDbg). (Done)
  2. Replacing in selection. Switching to this mode must be automatic (detect presence of marked area). (New Topic)
  3. Block selection if <Alt> key is pressed (like in AkelPad). (Agreed it was not crucial)
  4. Show symbols with codes 0 – 31 when ANSI charset. (Discussed here)
Edit(Maël): I added notes for each feature request, to have a quick overview.

Re: Several necessary features

Posted: 30 Mar 2017 17:11
by Maël
FalseMaster wrote: 30 Mar 2017 07:53 Copying to clipboard current offset at <Alt+Ins> keystroke (like in OllyDbg).
Added, will be in next update.
FalseMaster wrote: 30 Mar 2017 07:53 Replacing in selection. Switching to this mode must be automatic (detect presence of marked area).
Maybe I am missing something, but that should be here already: The selection is replaced when you paste in overwrite mode and insert mode like in a text editor. If you want to overwrite without resizing there is ctrl+b (see edit menu).
Finally, there is an insert mode and overwrite mode, which can be switched using the Ins key, as in most editors.
If you press a key in overwrite mode, the selection is not replaced (on purpose). If you press a key in insert mode, the selection is replaced.
Since usually resizing a binary file arbitrarily will break its file format, overwrite mode is the default.
FalseMaster wrote: 30 Mar 2017 07:53 Block selection if <Alt> key is pressed (like in AkelPad).
Do you have a use case for block selection in a hex editor?
FalseMaster wrote: 30 Mar 2017 07:53 Show symbols with codes 0 – 31 when ANSI charset.
I guess the DOS/OEM character set is what you want. The Ansi code pages of Windows do not assign any visible characters in the range of 0-31.
https://en.wikipedia.org/wiki/Code_page_437

Re: Several necessary features

Posted: 02 Apr 2017 03:49
by FalseMaster
Maël wrote:Maybe I am missing something…
Yep. I mean "find and replace" operation.
Maël wrote:Do you have a use case for block selection in a hex editor?
You're right. This is rarely needed, and would require big changes in the code.
Maël wrote:The Ansi code pages of Windows do not assign any visible characters in the range of 0-31.
The visibility of characters in the range 0-31 depends on the font, not on the code page. E.g., in the "Sourier New" font the first 32 symbols (except the one with code 0) have glyphs.
49d00936d5ef.png
49d00936d5ef.png (11.31 KiB) Viewed 7984 times

Re: Several necessary features

Posted: 02 Apr 2017 08:57
by Maël
Thanks for the picture to clarify.

This is what I get with Lister under German Win 8.1 (same settings as yours, except the font, which it didn't let me change):
lister.png
lister.png (10.18 KiB) Viewed 7978 times
The major goal is to have a reproducible and predictable system, and character sets and code pages always have been a very muddy topic, so let me elaborate.

So far I have found only one reference on the web that shows the special symbols (such as smilies and dingbats) of DOS/OEM codepages for Windows CP1252: http://beacon.chebucto.ca/Back/A9908/img/vga_1252.gif
All other references do not show any glyphs for the control characters.
If you draw the control characters, too, even if they are declared by Windows as non-printable, you get this:
Windows-1251.png
Windows-1251.png (55.82 KiB) Viewed 7987 times
I checked an old Win95 virtual machine: notepad showed only box glyphs for the control characters for all fonts, and the command line always used a DOS/OEM codepage (and you cannot switch to cp1252 as you can in modern Windows).

From all this information I guess that Windows-1251 (or Windows-1252 on a Western Windows) is used, but the control characters are displayed as in a DOS/OEM codepage.
So actually it is a mix of two codepages.

All tests with Windows 8.1 cmd.exe do not show such a mixed output, neither with the DOS code page, nor the cp1252, nor did Win95, as mentioned above.

So it seems to be pretty non standard to combine the two codepages and draw as shown in the screenshot.

Regarding fonts having their own encoding, yes, that's true. For example "Terminal" is encoded as "OEM/DOS".
If you select that font, it will render all as in "OEM/DOS" encoding, instead of the one selected in HxD.

The disadvantage is, that when you try to copy a string and paste it in an edit field (such as for the search function), it will look differently from what you see displayed in the hex editor, because the Terminal font has the wrong encoding. So actually, I would have to enforce the proper encoding, to make sure all works consistently.

But this is not the case for Courier New. It is an OpenType font and is encoded as Unicode, and should render Windows-1252 as defined.

In conclusion, I think the system as it is is correct. Rendering control characters only when MultiByteToWideChar() with the MB_USEGLYPHCHARS option returns printable characters, as defined by GetStringTypeW(), will produce reliable results and create strings that can be copied and pasted properly.

Mixing code pages/charsets also does not seem to be a good idea, because it gives surprising results, and more importantly, MultiByteToWideChar does not support a reliable translation for these bytes.
MultiByteToWideChar just leaves them untouched (same byte values 0-31) and does not translate them to their corresponding Unicode codepoints (which exist). So you get a rendering that has nothing to do with the string data you provide, and therefore, it is reasonable to exclude these characters from rendering when using ANSI charsets. The result is basically random, as can be seen comparing our outputs (I also used the Russian code page).
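
A minimal sketch of the pass-through behavior described above, using Python's built-in codecs as a stand-in for MultiByteToWideChar (an assumption: the two are different implementations, although the cp1252 codec also maps bytes 0x00-0x1F straight to U+0000-U+001F). The helper names `is_printable` and `display_text` are hypothetical, and the `unicodedata` category test is only a rough analogue of a GetStringTypeW-based check, not HxD's actual code.

```python
import unicodedata


def is_printable(ch: str) -> bool:
    """Rough analogue of a printability check: treat Unicode control (Cc)
    and unassigned (Cn) code points as non-printable."""
    return unicodedata.category(ch) not in ("Cc", "Cn")


def display_text(data: bytes, encoding: str = "cp1252") -> str:
    """Decode bytes and show a dot for every non-printable character."""
    text = data.decode(encoding, errors="replace")
    return "".join(ch if is_printable(ch) else "." for ch in text)


control_bytes = bytes(range(0x20))
decoded = control_bytes.decode("cp1252")

# The codec leaves the byte values untouched: 0x00-0x1F become U+0000-U+001F.
assert [ord(ch) for ch in decoded] == list(range(0x20))
# None of them counts as printable, so a font decides what (if anything) is drawn.
assert not any(is_printable(ch) for ch in decoded)

print(display_text(b"\x01\x02Hi\x10!"))  # prints "..Hi.!"
```

Because the decoded string still contains the raw control code points, any glyph that appears for them comes from the font (or a substituted font) and not from the string data.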

What needs to be changed is dealing with fonts that do not support the requested encoding, possibly excluding them from the list of selectable fonts, or changing their encoding if possible (Windows will substitute missing glyphs).

Some more relevant links:

All font encodings map to a known codepage, see here: https://msdn.microsoft.com/en-us/library/cc194829.aspx
About differences between Unicode and ANSI versions of TextOut functions when fonts have special code pages:
http://stackoverflow.com/questions/2138 ... rset-displ
TextOutA does however print the same glyphs in our case. So it's probably dependent on what Windows version you use and possibly other locale specific things beyond code pages (of font or string).

For completeness, some screenshots with different code pages:

Under a German Windows and using Courier New (I guess that's the font you meant?) I get the following results, using the Cyrillic codepages I know, which are: Windows-1251, IBM855, cp866
If you draw all characters, including control characters (and additionally enable the MB_USEGLYPHCHARS option), this is what you get:
Windows-1251.png
Windows-1251.png (55.82 KiB) Viewed 7987 times
IBM855.png
IBM855.png (56.48 KiB) Viewed 7987 times
cp866.png
cp866.png (56.12 KiB) Viewed 7987 times

Re: Several necessary features

Posted: 02 Apr 2017 09:47
by Maël
FalseMaster wrote: 02 Apr 2017 03:49
Maël wrote:Maybe I am missing something…
Yep. I mean "find and replace" operation.
Ok. You mean that the selection should be automatically "pasted" in the search pattern text box?

Re: Several necessary features

Posted: 03 Apr 2017 02:45
by FalseMaster
Maël wrote:The disadvantage is, that when you try to copy a string and paste it in an edit field (such as for the search function), it will look differently from what you see displayed in the hex editor
Yes, but the current technique of replacing non-printable symbols with dots does not work well at all. Let at least the visual perception be convenient. Moreover, this behavior can be made optional (a checkbox in the settings). In any case, you decide.
Maël wrote:What needs to be changed is dealing with fonts that do not support the requested encoding
Nothing.
637fed83f7de.png
637fed83f7de.png (5.36 KiB) Viewed 7842 times
Or programmatically replace them with dots as usual (but now optionally).
Maël wrote:…using Courier New (I guess that's the font you meant?)…
No, no, no. Sourier New.
In Unicode mode there are no problems either:
066815b99812.png
066815b99812.png (7.24 KiB) Viewed 7839 times
Maël wrote:You mean that the selection should be automatically "pasted" in the search pattern text box?
Oh, no. Currently the replacement occurs in the whole file, but it should occur only in the selected area.
Example:
5ad8fd7aec3d.png
5ad8fd7aec3d.png (13.13 KiB) Viewed 7841 times
The "In selection" item has been checked automatically because marked text is present.
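
A minimal sketch of the requested behavior on an in-memory byte string, to make the request concrete. The function names `effective_range` and `replace_in_selection` are hypothetical and say nothing about how HxD implements searching; the half-open `(start, end)` selection is also an assumption.

```python
from typing import Optional, Tuple


def effective_range(size: int, selection: Optional[Tuple[int, int]]) -> Tuple[int, int]:
    """Use the selection when a non-empty one exists, otherwise the whole file.
    This mirrors the 'check "In selection" automatically' idea."""
    if selection is not None and selection[0] < selection[1]:
        return selection
    return (0, size)


def replace_in_selection(data: bytes, pattern: bytes, replacement: bytes,
                         selection: Optional[Tuple[int, int]] = None) -> bytes:
    """Replace occurrences of `pattern` that lie entirely inside the selection;
    bytes outside the selection are left untouched."""
    if not pattern:
        raise ValueError("pattern must not be empty")
    start, end = effective_range(len(data), selection)
    replaced = data[start:end].replace(pattern, replacement)
    return data[:start] + replaced + data[end:]


data = b"AB AB AB AB"
print(replace_in_selection(data, b"AB", b"xy", selection=(3, 8)))  # b'AB xy xy AB'
print(replace_in_selection(data, b"AB", b"xy"))                    # b'xy xy xy xy'
```

A replacement of a different length than the pattern changes the file size, so in overwrite mode an editor would probably want to restrict this to equal-length replacements.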

Re: Several necessary features

Posted: 03 Apr 2017 13:39
by Maël
Ok, I'll split the selection stuff into a new feature request, so I can keep track of it later.
This post will just be about the control characters.

Re: Control character display using font encoding (was: Several necessary features)

Posted: 03 Apr 2017 16:01
by Maël
I did further research, using several tools: ("Sourier New" maybe comes from the French word sourire, meaning to smile?)

Analyzing "Sourier New" with FontForge showed it contained the smiley and other graphical glyphs in the range of 0x00-0x1F.
The glyphs are the same as would be displayed when converting allbytes.bin interpreted in DOS encodings (such as codepage 437), using MB_USEGLYPHCHARS option in MultiByteToWideChar(), to Unicode.

sourier new in font forge.png
sourier new in font forge.png (32.23 KiB) Viewed 7945 times

FontForge also showed that "Courier New" has no glyphs assigned for the region 0x00-0x1F, not even a blank character or an empty box.

courier new in font forge.png
courier new in font forge.png (27.9 KiB) Viewed 7945 times

When rendering the characters 0x00-0x1F under Windows with "Courier New", font substitution takes place, i.e., either as font fallback or as font linking. So this is not an encoding question, and not a font encoding question either. It is just about how actually non-printable characters (0x00-0x1F) are assigned glyphs in fonts, which, strictly speaking, should not be done.

Why then are non-printable characters still printed using "Courier New"? Because it turns out, with the help of Extended CharMap, that some CJK fonts, that were not installed on older Windows versions, have glyphs for 0x00-0x1F. Information obtained this way: Clicking the entry "Find font containing glyph" from a cell's context menu, will list all the fonts that have a glyph assigned to a certain character.

The fonts under my Windows version are:
  • Batang
  • BatangChe
  • DFKai-SB
  • Dotum
  • DotumChe
  • Gulim
  • GulimChe
  • Gungsuh
  • GungsuhChe
  • Microsoft JhengHei
  • Microsoft JhengHei UI
  • MingLiU
  • MingLiU_HKSCS
  • PMingLiU
In the case of "Sourier New" the glyphs for the control character region are also used for the Unicode "Symbols and dingbats" region. So when looking at character 0x01 in this font, it is assigned the "white smilie" glyph, which is also assigned to the character U+263A.
Other fonts that have glyphs for the control character region (such as Batang) do not have this reuse of glyphs. So it is not possible in general to determine the Unicode Codepoint by using this dual/multi assignment of one glyph to several Codepoints.

That leaves the issue that transferring text that contains control characters between different text controls (such as when copying and pasting) will show text rendered differently, despite all the care taken to use Unicode and clear encodings throughout.
So rendering of this "non-printable" region is dependent on the selected font, and not just as slight style variation, but really different glyphs.
This remains a problem.

Comparing Batang and Sourier New in FontForge, to see the actually assigned glyphs ...
batang in font forge.png
batang in font forge.png (27.82 KiB) Viewed 7945 times
sourier new in font forge.png
sourier new in font forge.png (32.23 KiB) Viewed 7945 times
... and now in Extended CharMap, to show what is rendered by Windows, shows how Windows does font substitution also in Sourier New:
batang ex charmap.png
batang ex charmap.png (19.89 KiB) Viewed 7945 times
sourier new ex charmap.png
sourier new ex charmap.png (30.47 KiB) Viewed 7945 times
Notice how, for example, the character 0x10 is not using the glyph as defined in Sourier New (a filled triangle pointing to the right, see font forge image), but substituted by a glyph (a kind of cross) from another font, maybe Batang (or another CJK one).
The 0x00 character is rendered as blank, which is the same glyph as in Batang for 0x00, but should be an empty rectangle to indicate a missing glyph, if Sourier New was used for 0x00. This is another example for font substitution.

The same substitution effect is visible in HxD:
sourier new font substitution.png
sourier new font substitution.png (67.93 KiB) Viewed 7945 times
Despite rendering all the bytes from 0 to 255 without any filtering or special handling, Windows will not render all the glyphs from Sourier New, even though all the used characters have a glyph assigned in this font.


I assume that the glyphs for the control characters in the CJK fonts were chosen to match the DOS encodings prevalent in Asian countries, because my localized DOS encoding and the 437 codepage (US DOS code page) have different glyphs assigned for the control character region than those CJK fonts.
Unfortunately, there seems to be no way to determine the actual Unicode Codepoints that correspond to the glyphs that get rendered.

It is also noticeable, that the special hack in Windows' ClearType font rendering, that makes Courier New thicker, does not apply to Sourier New.

In summary:
  • It seems not possible to determine the Unicode Codepoint that corresponds to a glyph in a font for characters in the region 0x00 to 0x1F, because that information is not stored in the font.
    • Only through multi use of glyphs (several Codepoints using the same glyph), where one of the Codepoints represents a printable character, that would be possible. Sourier New has double assignments of the relevant glyphs, but this does not hold true for the CJK fonts used for drawing the control characters in Windows, usually. Also this approach would be guess work at best, and unclear if there were multiple printable Codepoints assigned to a glyph. Well, we could still use the first printable Codepoints, if there were multiple ones.
    • Copy and pasting will give confusing results, because text boxes might use different fonts, and the Unicode advantage is lost.
  • Font substitution leads to even more confusing results, mixing glyphs at "random" (even when a font contains all glyphs...), and depending on the installed fonts.
  • Control characters are indeed, non-printable characters, and their "printing" gives unpredictable results.
  • For compatibility with DOS and programs using box and smilie glyphs etc. for drawing GUIs or ASCII art, some fonts probably render control characters such, that they match the glyphs in the DOS locale of the script/region the font was intended for, originally. This concept does not translate well to true Unicode, though.
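
The "multi use of glyphs" idea from the summary can be sketched as a lookup over a font's character-to-glyph map. Everything here is illustrative: `printable_alias` is a hypothetical helper, and the two small dictionaries are made-up excerpts with invented glyph names, not data read from the real Sourier New or Batang files.

```python
import unicodedata
from typing import Dict, Optional


def printable_alias(cmap: Dict[int, str], control_cp: int) -> Optional[int]:
    """Return the lowest printable code point that shares its glyph with
    `control_cp`, or None if the glyph is not reused (or not present)."""
    glyph = cmap.get(control_cp)
    if glyph is None:
        return None
    candidates = [
        cp for cp, name in cmap.items()
        if name == glyph
        and cp != control_cp
        and unicodedata.category(chr(cp)) not in ("Cc", "Cn")
    ]
    return min(candidates, default=None)


# Hypothetical excerpt of a font that reuses glyphs (as described for Sourier New).
reusing_font = {0x01: "smileface", 0x263A: "smileface",
                0x10: "triangle_right", 0x25BA: "triangle_right", 0x41: "A"}
# Hypothetical excerpt of a font without such reuse (as described for Batang).
non_reusing_font = {0x01: "glyph00001", 0x41: "A"}

print(hex(printable_alias(reusing_font, 0x01)))    # 0x263a
print(printable_alias(non_reusing_font, 0x01))     # None
```

Taking the lowest candidate corresponds to "use the first printable Codepoint"; as noted above, this stays guesswork and fails for fonts that do not reuse glyphs.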

Conclusion about drawing/printing control characters

Posted: 03 Apr 2017 17:01
by Maël
Now it's clear what is happening: effectively a visual mix of encodings, which happens through doubtful glyph assignments in fonts (mixed up even more by font substitution). Rendering/glyph picking is usually driven by Unicode, but apparently not always for Codepoints in the non-printable region.

I still think that this should really be solved by picking a proper encoding, or defining an own encoding if a mix is wanted (such as a mix of DOS and Windows encodings), but not by assigning glyphs in fonts where there should be none or assigning "wrong" ones. This was probably a backwards compatibility hack (in case of the CJK fonts) that leads to confusion and weird program behavior.

Fonts such as Terminal, do not use Unicode internally to map characters to glyphs, but use a specific encoding which they state explicitly. That's what is specified in charset property of a LOGFONT: https://msdn.microsoft.com/en-us/library/cc194829.aspx
Sourier New, however, is a Unicode font, same as Batang (which is why, strictly speaking, they don't abide by the specification).

I have seen that in the past in Windows, and it changed over time. Now it's clear why it happens sometimes and sometimes not, depending on installed fonts, thanks for the opportunity of clarifying that.

So basically, Unicode mapping of Codepoints to glyphs works as one would expect, but not for non-printable characters.
Fonts could also assign random glyphs for printable Codepoints, in theory, but so far I haven't seen that, fortunately.

Show font's interpretation option
As a result, I will offer an option that shows the non-printable control characters as the font chooses, but include a warning, because they might be confused with glyphs that actually have a printable Unicode Codepoint (double/multi glyph assignment).

"Show font's (or its automatic font substitution) arbitrary interpretation of non-printable characters"
"Warning: Controls may display the same data with different characters, even when the data's encoding is the same in every control."

Mixed encodings (in future)
Additionally, I think an even better approach would be to offer a mixed encoding option (this fixes the issue where font substitution makes Sourier New not print all the wanted glyphs). This would be font or font substitution independent and still offer flexibility.
For example: DOS encoding for 0x00 to 0x1F and Windows encoding for the rest.
Just have to think of a short and clear name.
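
A minimal sketch of such a mixed encoding, assuming the conventional code page 437 graphics for 0x01-0x1F and Windows-1252 for everything else. The name `decode_mixed` is hypothetical, showing 0x00 as a space is an arbitrary choice made here, and this is only an illustration of the idea, not the planned HxD implementation.

```python
# Conventional CP437 display glyphs for bytes 0x01..0x1F (smilies, card suits,
# arrows, ...). Byte 0x00 is shown as a space here (an assumption).
LOW_GLYPHS = " " + (
    "\u263a\u263b\u2665\u2666\u2663\u2660\u2022\u25d8"
    "\u25cb\u25d9\u2642\u2640\u266a\u266b\u263c\u25ba"
    "\u25c4\u2195\u203c\u00b6\u00a7\u25ac\u21a8\u2191"
    "\u2193\u2192\u2190\u221f\u2194\u25b2\u25bc"
)
assert len(LOW_GLYPHS) == 0x20


def decode_mixed(data: bytes, high_encoding: str = "cp1252") -> str:
    """Display mapping: DOS glyphs for 0x00-0x1F, `high_encoding` for the rest.
    Bytes undefined in `high_encoding` become U+FFFD."""
    out = []
    for b in data:
        if b < 0x20:
            out.append(LOW_GLYPHS[b])
        else:
            out.append(bytes([b]).decode(high_encoding, errors="replace"))
    return "".join(out)


print(decode_mixed(b"\x01\x10A\xe9"))  # prints "☺►Aé"
```

Since every byte maps to a real printable Unicode code point, the result no longer depends on the font's handling of control characters. The mapping is one-way for display, though: 0x14 and 0xB6 both end up as "¶" under these assumptions, so clipboard data would still need to come from the original bytes.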

Re: Control character display using font encoding (was: Several necessary features)

Posted: 04 Apr 2017 04:03
by FalseMaster
Maël wrote:In the case of "Sourier New" the glyphs for the control character region are also used for the Unicode "Symbols and dingbats" region.
It is legal practice to minimize font-file size. "Entities must not be multiplied beyond necessity" © William of Ockham.
Maël wrote:…Windows will not render all the glyphs from Sourier New…
Very strange… I made a little test program which converts a string with MultiByteToWideChar and draws it with TextOut. All 32 control symbols were displayed properly.
836ce0578b74.png
836ce0578b74.png (3.86 KiB) Viewed 7835 times
Test_Font.7z
Maël: Added as attachment in case the link breaks in future. To make the program compile with a default DXE3 a few types and units have to be changed.
(2.85 KiB) Downloaded 267 times

Maël wrote:Copy and pasting will give confusing results, because text boxes might use different fonts, and the Unicode advantage is lost.
In my favorite text editor the find operation works properly even if control symbols were copied from the edit window to a standard Windows textbox, then copied from the textbox and pasted back. The availability of glyphs does not matter.
62824ace6ef1.png
62824ace6ef1.png (11.64 KiB) Viewed 7835 times
Maël wrote:I will offer an option that shows the non-printable control characters as the font chooses, but include a warning…
Ok, main thing is that the user has a choice.

Re: Control character display using font encoding (was: Several necessary features)

Posted: 04 Apr 2017 11:06
by Maël
FalseMaster wrote: 04 Apr 2017 04:03
Maël wrote:In the case of "Sourier New" the glyphs for the control character region are also used for the Unicode "Symbols and dingbats" region.
It is legal practice to minimize font-file size. "Entities must not be multiplied beyond necessity" © William of Ockham.
FontForge flagged it, but that wasn't my point. I actually wanted to make use of that information in Sourier New, so it would have been useful if all fonts do that. Unfortunately Batang doesn't do that, for example.
Maël wrote:Copy and pasting will give confusing results, because text boxes might use different fonts, and the Unicode advantage is lost.
In my favorite text editor the find operation works properly even if control symbols were copied from the edit window to a standard Windows textbox, then copied from the textbox and pasted back. The availability of glyphs does not matter.
Searching will work, since the encoding did not change; this is also true for HxD. I meant the display as different characters, due to font choice, which is confusing when the data's encoding does not change.
Maël wrote:…Windows will not render all the glyphs from Sourier New…
Very strange… I made a little test program which converts a string with MultiByteToWideChar and draws it with TextOut. All 32 control symbols were displayed properly.
836ce0578b74.png
836ce0578b74.png (3.86 KiB) Viewed 7833 times
Courier New just uses empty rectangles, that means no substitution font is used on your system. In my case, Courier New does not draw empty rectangles, Batang or another CJK font is used, as mentioned in my lengthy analysis (I have a default German Win 8.1 install).

Re: Control character display using font encoding (was: Several necessary features)

Posted: 04 Apr 2017 11:36
by Maël
I made another test under a default Win 7 German install with your program.
SourierNotInstalled.png
SourierNotInstalled.png (7.7 KiB) Viewed 7868 times
SourierInstalled.png
SourierInstalled.png (7.35 KiB) Viewed 7868 times
P.S.: Can you please attach images and place them inline? This avoids dead links in the future.

Re: Control character display using font encoding (was: Several necessary features)

Posted: 05 Apr 2017 07:50
by FalseMaster
I think I have found a cause of the trouble.
https://superuser.com/questions/396160/ ... t-fallback
https://msdn.microsoft.com/en-US/globalization/mt662331
https://msdn.microsoft.com/en-us/librar ... 85%29.aspx
Apparently the Uniscribe engine in new versions of Windows is bugged (incorrect glyph substitution, and this feature is forced on even without an explicit call to the ScriptShape function).
Maël wrote:Can you please attach images and place them inline?
Sorry, I did not notice the "Attachments" tab.

Re: Control character display using font encoding (was: Several necessary features)

Posted: 14 Apr 2017 21:41
by Maël
In case I review this later, I made some notes about the internal implementation details to look out for in the source directory of HxD:
Control character replacement.txt