Pasting Unicode text as "unmodified" hex bytes

Wishlists for new functionality and features.
mcb
Posts: 7
Joined: 11 Oct 2013 16:02

Pasting Unicode text as "unmodified" hex bytes

Post by mcb »

I can understand that rendering Unicode text in the Text column is complex; however, the fact that pasted Unicode text shows up as all "00"s in the Hex column is counterintuitive. It might help if at least the Hex data reflected the Unicode codes that were in the clipboard.

In my specific case, I was trying to use HxD to check some characters from another source, to see what their codes were. Perhaps there is a way I am missing?
Maël
Site Admin
Posts: 1455
Joined: 12 Mar 2005 14:15

Re: Text column support for UTF-8, UTF-16 and other multi-byte text encodings (variable width encodings)

Post by Maël »

I tested it with ☺ which is U+263A. When pasting U+263A into the text column, indeed it is replaced by a 0x00 byte for the Windows (ANSI) encoding, but not for DOS/IBM-ASCII.
What would you expect instead? An error message telling you this character cannot be represented in the currently selected encoding?
mcb
Posts: 7
Joined: 11 Oct 2013 16:02

Re: Text column support for UTF-8, UTF-16 and other multi-byte text encodings (variable width encodings)

Post by mcb »

Thanks for the fast response. You have me confused now :)

Why does the smiley face even show correctly as "DOS/IBM-ASCII", and why does its code appear as 0x01 in the Hex view?

My layperson view is that "DOS/IBM-ASCII" should be the same as "Windows (ANSI)": if the character does not exist, it should not be displayed? I am not expecting an error message, because already now the Hex view can have characters that cannot be rendered in a given text encoding.

As for the hex view, it should appear as 0xE2 0x98 0xBA, or whatever codes were in the clipboard. Maybe it could be a special paste operation, more similar to Insert bytes? Perhaps it could autodetect the Unicode content, and prompt to do exactly that?

To recap, while I respect that the Text view cannot handle Unicode, I would love to see a way to "paste" Unicode in such a way that at least the clipboard bytes end up correctly in the Hex view. After all, this can be done by opening the same data as a file?
Maël
Site Admin
Posts: 1455
Joined: 12 Mar 2005 14:15

Re: Text column support for UTF-8, UTF-16 and other multi-byte text encodings (variable width encodings)

Post by Maël »

mcb wrote: 09 Mar 2023 15:51 Why does the smiley face even show correctly as "DOS/IBM-ASCII", and why does its code appear as 0x01 in the Hex view?
Because under DOS (code page 437) the white smiley character has the code 0x01: https://de.wikipedia.org/wiki/Codepage_437
So U+263A can be represented there, but not in Windows-1252.
mcb wrote: I am not expecting an error message, because already now the Hex view can have characters that cannot be rendered in a given text encoding.
So the silent replacement of unencodable characters by 0x00 is fine?
mcb wrote: As for the hex view, it should appear as 0xE2 0x98 0xBA, or whatever codes were in the clipboard. Maybe it could be a special paste operation, more similar to Insert bytes? Perhaps it could autodetect the Unicode content, and prompt to do exactly that?
There is something similar to what you want in the "UTF-8 codepoint" entry of the data inspector.
mcb wrote: To recap, while I respect that the Text view cannot handle Unicode, I would love to see a way to "paste" Unicode in such a way that at least the clipboard bytes end up correctly in the Hex view. After all, this can be done by opening the same data as a file?
This would be easy to add, but I think it is of limited utility since the text would not be rendered correctly, so you could just paste the corresponding bytes directly. Otherwise you'll have to pick a Unicode encoding anyways, like UTF-8, UTF-16LE/BE or UTF-32, before pasting.
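
To make the byte values in this exchange concrete, here is a minimal Python sketch (not HxD's code) that prints the bytes U+263A takes under several Unicode encodings and imitates the replace-by-0x00 behaviour described above for Windows-1252. The helper names hex_bytes and encode_or_nul are made up for illustration. Code page 437 is left out on purpose: Python's cp437 codec maps 0x01 to the control character U+0001 rather than to the smiley glyph, so it would not reproduce what HxD shows.

```python
SMILEY = "\u263a"  # WHITE SMILING FACE, the character tested in this thread


def hex_bytes(text: str, encoding: str) -> str:
    """Return the encoded bytes of text as space-separated hex pairs."""
    return " ".join(f"{b:02X}" for b in text.encode(encoding))


def encode_or_nul(text: str, encoding: str) -> bytes:
    """Encode character by character; unencodable characters become 0x00.

    This only imitates the behaviour described in the thread.
    """
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(encoding)
        except UnicodeEncodeError:
            out.append(0x00)
    return bytes(out)


if __name__ == "__main__":
    for enc in ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le"):
        print(f"{enc:10} {hex_bytes(SMILEY, enc)}")
    # Windows-1252 has no smiley, so the imitation yields a single 0x00 byte.
    print(f"{'cp1252':10} {hex_bytes_nul}" if False else
          f"{'cp1252':10} {encode_or_nul(SMILEY, 'cp1252').hex(' ').upper()}")
```

The UTF-8 line matches the 0xE2 0x98 0xBA sequence mentioned earlier; which of these byte sequences a paste should produce depends entirely on the encoding chosen beforehand.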
mcb
Posts: 7
Joined: 11 Oct 2013 16:02

Re: Text column support for UTF-8, UTF-16 and other multi-byte text encodings (variable width encodings)

Post by mcb »

Ah, I did not know about the smiley at 0x01 :)
Maël wrote: So the silent replacement of unencodable characters by 0x00 is fine?
Not really. My layperson expectation is that the hex data is "sacred", i.e. the clipboard bytes should go into the Hex view as they are, without replacement (especially not silent ones, since you ask). In the Text area, I understand that depending on the current character set, some characters may show as garbage or with a replacement character.

I am just an occasional user. You obviously know the code and all the other consistency aspects that might come up.
Maël wrote: Otherwise you'll have to pick a Unicode encoding anyways, like UTF-8, UTF-16LE/BE or UTF-32, before pasting.
I did not know you could do that, as my focus was on the encodings listed in View/Text encoding.

Now that I look at it again, I wonder if maybe a clipboard feature could be under Tools/Open clipboard..., or if all of these should be near Open file... (under File), as they are, after all, all "Open" operations? Concatenate, Split and Wipe could then be at the root of Tools, like Options.

Thanks again for the exchange, and apologies for my novice perspective, which however might perhaps add something "fresh" because of it. For now I solved my need by using View/Character Code Value in EmEditor, but I'd still love to be able to do certain clipboard operations that currently escape me, in HxD.
Maël
Site Admin
Posts: 1455
Joined: 12 Mar 2005 14:15

Re: Pasting Unicode text as "unmodified" hex bytes

Post by Maël »

I split this since it's a different topic/feature request.
mcb wrote: 09 Mar 2023 16:17 My layperson expectation is that the hex data is "sacred", i.e. the clipboard bytes should go into the Hex view as they are
The clipboard does not contain a sequence of bytes, but data in various formats, such as CF_TEXT or CF_UNICODETEXT, so HxD has to treat it like that (what lands in the clipboard is already converted; it is not what the source application put into it). One consequence of this is that you can only determine the text length by the terminating #0 (NUL) character, since the memory allocated in the clipboard can be larger than the text it holds.

Also, some formats are autogenerated from other formats: for example, CF_UNICODETEXT is created automatically from CF_TEXT (using the ANSI encoding of the CF_TEXT data, which is identified by CF_LOCALE) and vice versa.

In other words, conversions happen automatically and there is no raw byte format, which is why you have to define your own "raw" clipboard format, specifically for a raw sequence of bytes/hex editors.
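
As an illustration of the length problem, here is a small Python sketch that recovers the text from a buffer laid out like CF_UNICODETEXT data: UTF-16LE code units, a terminating NUL, and possibly leftover bytes after it. The function name and the sample buffer are made up for this example; it does not touch the real clipboard.

```python
def text_from_unicodetext_buffer(buf: bytes) -> str:
    """Decode UTF-16LE text up to the first 16-bit NUL code unit.

    The scan steps two bytes at a time, because a zero byte pair that
    straddles two code units is not a terminator.
    """
    end = len(buf) - len(buf) % 2  # ignore a trailing odd byte
    for i in range(0, end, 2):
        if buf[i] == 0 and buf[i + 1] == 0:
            end = i
            break
    return buf[:end].decode("utf-16-le", errors="replace")


if __name__ == "__main__":
    # Hypothetical buffer: the text, the terminator, then leftover bytes.
    buf = "\u263aA".encode("utf-16-le") + b"\x00\x00" + b"\xcd\xab\x12\x34"
    print(len(buf), "bytes allocated")
    print(repr(text_from_unicodetext_buffer(buf)))
    # A byte-wise search finds a misaligned pair inside "A" plus terminator:
    print("naive find:", buf.find(b"\x00\x00"))
```

The aligned scan matters here because "A" is stored as 41 00, so a plain byte search for 00 00 stops one byte too early.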

To solve this issue, HxD copies something different to the global Windows clipboard depending on the column you are in: in the text column it copies the text as Unicode (using the encoding you selected in HxD to convert it to Unicode); in the hex column it copies text too, but this time as a sequence of hex pairs, as seen in the hex view.
Some other hex editors support this style (bytes represented as hex pairs) as well, which is how data can be transferred between HxD and other editors, for example the hex editor in Visual Studio.
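
The hex-pair style can be sketched in a few lines of Python. This is only an illustration of the idea of carrying bytes through a text clipboard; the exact separator and letter case HxD or other editors emit are assumptions here, and the function names are made up.

```python
def bytes_to_hex_pairs(data: bytes) -> str:
    """Render bytes as upper-case hex pairs separated by single spaces."""
    return data.hex(" ").upper()


def hex_pairs_to_bytes(text: str) -> bytes:
    """Parse hex pairs back into bytes; whitespace between pairs is ignored."""
    return bytes.fromhex(text)


if __name__ == "__main__":
    raw = "\u263a".encode("utf-8")
    as_text = bytes_to_hex_pairs(raw)
    print(as_text)
    print(hex_pairs_to_bytes(as_text) == raw)
```

Because the clipboard payload is ordinary text, it survives the automatic text conversions while still describing the bytes exactly.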

Additionally, HxD internally copies the data as a raw sequence of bytes, so you can copy larger amounts of data from one file to another within HxD. The Windows clipboard, by contrast, is limited in the amount of data it can hold, and gets very slow well before hitting this limit.

So, there is no raw byte format that is universally accepted between applications (at most some hex editors invent their own), nor is it clear where the data ends in general (and which bytes are just padding/superfluous). How to determine the exact byte length is specific to every format.

But I once made a raw clipboard viewer program that shows the raw data (including potential additional data at the end), if you are curious:
https://mh-nexus.de/downloads/RawClipView.exe
mcb wrote: In the Text area, I understand that depending on the current character set, some characters may show as garbage or with a replacement character.
So in other words, the data you paste in HxD is always textual data in Unicode (since the Windows clipboard automatically converts it to Unicode in Windows NT+). The original raw bytes are not preserved, and no app besides hex editors has a concept of raw bytes regarding the clipboard. So the only sensible thing to do is to convert this Unicode text (which is UTF-16LE in Windows) to the encoding you selected in HxD.
For example, if you opened a text file in HxD that is encoded in Windows-1252, it would make no sense to paste the clipboard Unicode text as the "raw" bytes of UTF-16LE into this text file. It would create a text file with a mixed encoding, partly Windows-1252, partly UTF-16LE, and be essentially invalid.
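
The mixed-encoding problem can be reproduced with a short Python sketch. The sample strings are arbitrary, and the two "paste" variants only model the two options discussed here, not HxD's implementation.

```python
# Existing file content, stored as Windows-1252 (bytes 63 61 66 E9).
existing = "caf\u00e9".encode("cp1252")

# The clipboard text is Unicode; here a single character that
# Windows-1252 can represent.
pasted = "\u00e9"

# Option 1: append the UTF-16LE bytes unchanged (bytes E9 00).
mixed_raw = existing + pasted.encode("utf-16-le")

# Option 2: convert the clipboard text to the file's encoding first (byte E9).
converted = existing + pasted.encode("cp1252")

if __name__ == "__main__":
    # Read back as Windows-1252, the raw paste leaves a stray NUL in the text.
    print(repr(mixed_raw.decode("cp1252")))
    print(repr(converted.decode("cp1252")))
```

Read back as Windows-1252, only the converted variant is clean text; the raw variant carries the extra 0x00 byte of the UTF-16LE code unit.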

To copy and paste raw data, you have to open a file and copy and paste it within HxD, because HxD uses an internal clipboard that can handle raw data. There is no other way to transfer raw data, since no other apps (besides maybe hex editors) even have a concept of raw data in the clipboard, and therefore wouldn't know how to handle it, nor how to put "raw" data into the clipboard. So the "original" is lost anyways, as the data is converted to one of the standard formats/text encodings by Windows itself.
I hope that makes sense.
Maël
Site Admin
Posts: 1455
Joined: 12 Mar 2005 14:15

Re: Pasting Unicode text as "unmodified" hex bytes

Post by Maël »

I could imagine adding a feature inspired by RawClipView.exe:
A menu item "Edit|Paste raw" that opens a submenu.
This submenu would list the names of the clipboard formats that are currently available, and selecting one of them would paste that format's data exactly as it is in the clipboard (including padding/superfluous bytes for CF_UNICODETEXT).
Maël
Site Admin
Posts: 1455
Joined: 12 Mar 2005 14:15

Re: Pasting Unicode text as "unmodified" hex bytes

Post by Maël »

mcb wrote: 09 Mar 2023 15:22 In my specific case, I was trying to use HxD to check some characters from another source, to see what their codes were. Perhaps there is a way I am missing?
One option is the data inspector. You can paste Unicode text there (or type it directly) for single characters: see the "WideChar" column (for "Unicode", really UTF-16LE) and the "UTF-8 codepoint" column for, well, UTF-8.

[Attached screenshot: HxD-Paste-UTF16LE-Char-Directly.png]
mcb
Posts: 7
Joined: 11 Oct 2013 16:02

Re: Pasting Unicode text as "unmodified" hex bytes

Post by mcb »

Thank you for the follow-ups. Much appreciated.

Bearing in mind that the clipboard has different formats, if the goal is to analyze "Unicode text", I would be happy with the definition/choice of "whatever Notepad would paste is what I mean". Your proposed "Paste raw" options sound perfect for this need, as well as for those who need to see formatted data as well.