UTF-8

Wishlists for new functionality and features.
Erquint
Posts: 1
Joined: 07 Apr 2015 07:24

UTF-8

Post by Erquint » 07 Apr 2015 07:29

Seriously, why isn't there UTF-8 character decoding support?
Having to pair HxD with other editors to edit some files I'm working on right now kinda defeats the purpose.

Also, on a side note, it's impossible to search for something like "关闭" other than by its hex value. Seems like an issue.

Maël
Site Admin
Posts: 1139
Joined: 12 Mar 2005 14:15

Re: UTF-8

Post by Maël » 08 Apr 2015 16:17

There are similar requests already. You can see there why it is not trivial to implement (and usually buggy in other implementations).
Erquint wrote:
It's impossible to search for something like "关闭" other than by its hex value.
For UTF-16LE (currently the only Unicode encoding supported in HxD) this works fine here. Make sure you are in the text tab and have checked "Unicode characters".

The hexadecimal representation is not unique; it depends on the encoding/character set. For CJK characters there are many possible encodings, among them Big5, UTF-8, and UTF-16LE. Your characters would correspond to "73 51 ED 95" in UTF-16LE.
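
As a rough illustration of this point (not part of HxD; the helper name `hex_bytes` is made up for this sketch), the following Python snippet prints the bytes of "关闭" under a few encodings. GBK is used here only as an example of a legacy CJK encoding.

```python
# Illustrative sketch: the same text maps to different byte sequences
# depending on the encoding. hex_bytes is a hypothetical helper.
text = "关闭"

def hex_bytes(s: str, encoding: str) -> str:
    """Return the encoded bytes of s as space-separated uppercase hex."""
    return s.encode(encoding).hex(" ").upper()

for enc in ("utf-16-le", "utf-8", "gbk"):
    print(f"{enc:10} {hex_bytes(text, enc)}")

# utf-16-le gives 73 51 ED 95
# utf-8     gives E5 85 B3 E9 97 AD
```

So a hex search only finds the text if the searched bytes were produced with the same encoding the file uses.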

fesal
Posts: 1
Joined: 12 Oct 2019 17:35

Re: UTF-8

Post by fesal » 12 Oct 2019 17:40

Maël wrote:
The hexadecimal representation is not unique; it depends on the encoding/character set. For CJK characters there are many possible encodings, among them Big5, UTF-8, and UTF-16LE. Your characters would correspond to "73 51 ED 95" in UTF-16LE.
This is a moot point, as there is an option to directly specify the encoding you need to search in, as in the UTF-16 case. And in that specified encoding there is only a single representation.

Not having UTF-8 encoding, which is the major encoding on the modern Internet, is a fatal flaw of the application.

Maël
Site Admin
Posts: 1139
Joined: 12 Mar 2005 14:15

Re: UTF-8

Post by Maël » 12 Oct 2019 20:56

fesal wrote:
12 Oct 2019 17:40
Not having UTF-8 encoding, which is the major encoding on the modern Internet, is a fatal flaw of the application.
There is no need to emphasize the importance of UTF-8 (nor will complaining speed up or change anything -- answering some of the raised questions might however be useful).

Looking more carefully through the forum you might have noticed this thread:

Text column support for UTF-8, UTF-16 and other multi-byte text encodings (variable width encodings)

It covers many implementation details and questions that need to be solved (or are solved already) to implement a correct solution.

Maël
Site Admin
Posts: 1139
Joined: 12 Mar 2005 14:15

Re: UTF-8

Post by Maël » 12 Oct 2019 21:03

fesal wrote:
12 Oct 2019 17:40
This is a moot point, as there is an option to directly specify the encoding you need to search in, as in the UTF-16 case. And in that specified encoding there is only a single representation.
UTF-16 is not self-synchronizing (at byte-level), UTF-8 is.

Consider a trivial example of this issue for UTF-16LE (assuming a file made of 4 bytes, given in hexadecimal notation):

10 20 20 30
  • If you start at offset 0, the two bytes (10 20) result in the valid Unicode code point U+2010, which is a hyphen: ‐
    • Then the next valid code unit is at offset 2 (20 30), which is U+3020: 〠
  • If you start at offset 1, the two bytes (20 20) result in the valid Unicode code point U+2020, which is a dagger: †
    • Then the next code unit would start at offset 3 (30), but one byte alone does not make a valid UTF-16 code unit. So we would either drop this byte or represent it with a replacement character to indicate the encoding error.
Both options are entirely valid interpretations regarding the UTF-16LE encoding alone. Which one was intended depends on how your UTF-16LE strings are aligned in memory/file/disk, and this can keep changing throughout the stream.
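
The two readings can be reproduced with a short Python sketch. It relies on Python's built-in `utf-16-le` codec with `errors="replace"`; the helper name `decode_utf16le_from` is hypothetical, and this says nothing about how HxD decodes text internally.

```python
# The 4-byte example file from above.
data = bytes.fromhex("10 20 20 30")

def decode_utf16le_from(buf: bytes, offset: int) -> str:
    """Decode buf as UTF-16LE starting at the given byte offset.

    errors="replace" turns an incomplete trailing code unit into U+FFFD
    instead of raising an exception.
    """
    return buf[offset:].decode("utf-16-le", errors="replace")

for start in (0, 1):
    decoded = decode_utf16le_from(data, start)
    # Show the code points so the result is readable in any terminal.
    print(start, [f"U+{ord(ch):04X}" for ch in decoded])

# offset 0 -> U+2010, U+3020
# offset 1 -> U+2020, U+FFFD (the lone byte 30 becomes a replacement character)
```

Nothing in the bytes themselves says which starting offset is the right one; the decoder accepts both.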

So no, UTF-16 is not trivial in a hex editor (many implementations are simply buggy), because it is not self-synchronizing. But there are some possible solutions discussed in the link above.
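
For contrast, here is a minimal sketch of what self-synchronization means for UTF-8. It assumes the buffer holds well-formed UTF-8 (arbitrary binary data in a hex editor need not), and the function name `next_utf8_boundary` is made up for this example. In UTF-8, continuation bytes always have the bit pattern 10xxxxxx, so a decoder that lands in the middle of a character can skip forward to the next character start.

```python
def next_utf8_boundary(buf: bytes, offset: int) -> int:
    """Return the first offset >= the given one that is not a continuation byte.

    Assumes buf is well-formed UTF-8; continuation bytes match 0b10xxxxxx.
    """
    while offset < len(buf) and (buf[offset] & 0xC0) == 0x80:
        offset += 1
    return offset

data = "关闭".encode("utf-8")  # E5 85 B3 E9 97 AD

for start in range(len(data)):
    sync = next_utf8_boundary(data, start)
    # Decoding from the resynchronized offset always succeeds here.
    print(start, sync, repr(data[sync:].decode("utf-8")))

# Starting at offset 1 or 2 (inside the first character) resynchronizes
# at offset 3, where the second character begins.
```

No comparable byte-level rule exists for UTF-16, which is why the starting alignment has to come from somewhere else.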
