HxD: Custom Character Encoding for Text

Wishlists for new functionality and features.
user_anon

HxD: Custom Character Encoding for Text

Post by user_anon » 02 Jul 2008 20:22

First Maël, you've got a great little freeware hex editor with HxD. It is the best freeware hex editor I've used and I've looked at many. It's fast, good on resources, and has an excellent user interface. I've followed it for a few years now and it just keeps getting better!

I represent a community of people that use hex editors regularly to assist in modifying classic video games as well as translating classic games from Japanese to all languages. (See http://www.romhacking.net for further reference).

HxD is growing into a popular option for the people of my community, but there is one feature we'd love to see, and that's a way to have custom character encoding for the text view (labeled 'Charset' in HxD). The idea would be to be able to view the text in non-standard encoding formats other than the provided ASCII, EBCDIC, etc.

An external custom table file (simple text) could be used for this with the hex values on the left and corresponding text characters on the right. Example:

00=0
01=1
02=2
.....
1F=A
20=B
21=C
......
65=.

This way, non-standard text encodings could be easily viewed and edited in the hex editor, just as ASCII can be now.

1F 20 21 21 21 in the hex view would show 'ABCCC' for the equivalent text.
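To make the idea concrete, here is a minimal sketch in Python of loading such a table file and rendering bytes through it. The file layout and helper names are hypothetical; this only illustrates the example above and is not HxD's format or code.

Code:

# Sketch only: parse a simple one-byte table file and render bytes through it.
def load_table(path):
    """Parse lines like '1F=A' into a dict mapping a byte value to a character."""
    table = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if "=" not in line:
                continue  # skip blank or placeholder lines
            hex_part, char_part = line.split("=", 1)
            table[int(hex_part, 16)] = char_part
    return table

def to_text(data, table):
    """Render bytes in the text column using the table; '.' for unmapped bytes."""
    return "".join(table.get(b, ".") for b in data)

# Example from the post: with 1F=A, 20=B, 21=C loaded,
# the bytes 1F 20 21 21 21 display as 'ABCCC'.
table = {0x1F: "A", 0x20: "B", 0x21: "C"}
print(to_text(bytes([0x1F, 0x20, 0x21, 0x21, 0x21]), table))  # -> ABCCC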


Additional wish list for expansion on this feature:

1. UTF-8 or Unicode should be supported to be able to view text in other languages, say Japanese characters in our particular case. I believe HxD already handles Unicode though.

2. Ability to handle multi-byte to multi-char encoding. Example:

AD34=A
AD35=B
........
1A=textstring1
1B=textstring2

1A 1B 1A 1B in hex would read out as 'textstring1textstring2textstring1textstring2' on the text side for instance.
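One way to decode such mixed entries (just a sketch of the idea, not how HxD would necessarily do it) is to key a dictionary on byte sequences and always try the longest match first, for example in Python:

Code:

# Sketch: greedy longest-match decoding for a table with mixed-length entries.
TABLE = {
    b"\xAD\x34": "A",        # two bytes -> one character
    b"\xAD\x35": "B",
    b"\x1A": "textstring1",  # one byte -> many characters
    b"\x1B": "textstring2",
}

def decode(data, table):
    """Try the longest byte sequence first; unmapped bytes become '.'."""
    max_len = max(len(k) for k in table)
    out, i = [], 0
    while i < len(data):
        for length in range(min(max_len, len(data) - i), 0, -1):
            chunk = bytes(data[i:i + length])
            if chunk in table:
                out.append(table[chunk])
                i += length
                break
        else:
            out.append(".")  # no table entry starts at this byte
            i += 1
    return "".join(out)

# Examples from the post:
print(decode(b"\x1A\x1B\x1A\x1B", TABLE))  # -> textstring1textstring2textstring1textstring2
print(decode(b"\xAD\x34\xAD\x35", TABLE))  # -> AB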



I'd love to discuss this further if you're interested in adding such a feature to your editor! I could recommend several ways to do this. I am also a programmer myself and could assist in the logic if you need it.

Maël
Site Admin
Posts: 1066
Joined: 12 Mar 2005 14:15

Re: Custom Character Encoding for Text

Post by Maël » 05 Jul 2008 03:01

Thanks for the nice comment and sorry for the late reply :-)
user_anon wrote:There is one feature we'd love to see, and that's a way to have custom character encoding for the text view (labeled 'Charset' in HxD). The idea would be to be able to view the text in non-standard encoding formats other than the provided ASCII, EBCDIC, etc.
User-defined charset/character encoding is planned. It would come after support for multi-byte character encodings, such as UTF-8 or the common Asian encodings. These need special handling because not only the number of bytes per character but also the width of the glyph representing a character varies depending on the character code. For a Latin charset, all characters are one byte and have equally sized glyphs.
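As a small illustration of both points (unrelated to HxD's own code), in Python the byte count per character and the Unicode width class both vary with the character:

Code:

# Illustration only (not HxD's code): with UTF-8, both the number of bytes per
# character and the typical glyph width vary with the character.
import unicodedata

for ch in "Aé漢":
    print(ch,
          len(ch.encode("utf-8")),             # 1, 2 and 3 bytes respectively
          unicodedata.east_asian_width(ch))    # width class; '漢' is 'W' (wide)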
user_anon wrote:An external custom table file (simple text) could be used for this with the hex values on the left and corresponding text characters on the right.
I thought of having such a file in a binary format, but I am not sure yet. To generate a character table HxD would provide a window with a grid/table, where you would enter for each byte (from 0..255) the Unicode character it should represent.
Are there any common formats for character tables? Firefox maybe?
user_anon wrote:1. UTF-8 or Unicode should be supported to be able to view text in other languages, say Japanese characters in our particular case. I believe HxD already handles Unicode though.
HxD supports Unicode only partially; currently it cannot display multi-byte charsets (see above). But when it does, it will also support them in character tables.
user_anon wrote:2. Ability to handle multi-byte to multi-char encoding.
AD34=A
AD35=B
Ok
user_anon wrote: 1A=textstring1
1B=textstring2
That is going to be tough because it would completely break the assumption that all the lines in the hex editor window are about the same width, cause alignment problems, and basically not work when representing data in a grid-like fashion like hex editors do.
For multi-byte to 1 char conversions (first example) this is less problematic, because there will be only small variations in the width: usually an Asian glyph has at most two times the width of a Latin glyph.

user_anon

Re: Custom Character Encoding for Text

Post by user_anon » 07 Jul 2008 16:43

Maël wrote: User-defined charset/character encoding is planned. It would come after support for multi-byte character encodings, such as UTF-8 or the common Asian encodings. These need special handling because not only the number of bytes per character but also the width of the glyph representing a character varies depending on the character code. For a Latin charset, all characters are one byte and have equally sized glyphs.
Right. I'm familiar with various Japanese encodings as well as UTF-8. The multi-byte and variable-width groundwork would certainly need to be done first.
Maël wrote: I thought of having such a file in a binary format, but I am not sure yet. To generate a character table HxD would provide a window with a grid/table, where you would enter for each byte (from 0..255) the Unicode character it should represent.
Are there any common formats for character tables? Firefox maybe?
We use tables in text files in our community because it offers the greatest flexibility. You are free to assign and mix multi-byte values with single-byte values, and values that equate to more than one character. This covers all sorts of non-standard encodings found in various sources. That's really the point of custom encoding, isn't it? You want to cover the greatest range of possibilities for non-standard encoding schemes. Doing this through a grid/table defined in HxD would be more restrictive than the text-file idea I presented earlier.
Maël wrote: That is going to be tough because it would completely break the assumption that all the lines in the hex editor window are about the same width, cause alignment problems, and basically not work when representing data in a grid-like fashion like hex editors do.
For multi-byte to 1 char conversions (first example) this is less problematic, because there will be only small variations in the width: usually an Asian glyph has at most two times the width of a Latin glyph.
It wouldn't change the hex view. That would remain exactly as is. It would only change the TEXT view. The text view wouldn't line up exactly the same on each line, you're right. Each line of text could potentially vary in length. I will ask a few people for some insight on the best way to accomplish something like this. I'll get back to you on this one.

Maël
Site Admin
Posts: 1066
Joined: 12 Mar 2005 14:15

Re: Custom Character Encoding for Text

Post by Maël » 11 Jul 2008 02:37

user_anon wrote:We use tables in text files in our community because it offers the greatest flexibility.
1.) I guess the number on the right-hand side would be a Unicode code point.
2.) A text file is okay, but if there is already a standard (I am not aware of any) it might be useful to follow it, so already available encoding files would be compatible.
user_anon wrote:It wouldn't change the hex view. That would remain exactly as is. It would only change the TEXT view. The text view wouldn't line up exactly the same on each line, you're right. Each line of text could potentially vary in length. I will ask a few people for some insight on the best way to accomplish something like this. I'll get back to you on this one.
I meant that one byte representing several characters would be problematic, because it is a fundamentally different concept from what is usual for hex editors. The assumption that one byte corresponds to at most one character is deeply rooted in the design. That's what I meant. Multi-byte character sets, on the other hand, require some work but are less difficult to achieve.

Are there character sets that represent several characters with one byte?

user_anon

Re: Custom Character Encoding for Text

Post by user_anon » 14 Jul 2008 19:16

Maël wrote:
user_anon wrote:We use tables in text files in our community because it offers the greatest flexibility.
1.) I guess the number on the right-hand side would be a Unicode code point.
2.) A text file is okay, but if there is already a standard (I am not aware of any) it might be useful to follow it, so already available encoding files would be compatible.
1. Right. Hex on one side, characters on the other. Unicode (assuming UTF-8) would be fine there.
2. Are there any standards? If there are, I'd be interested in taking a look. I'm not aware of any personally.
Maël wrote: I meant that one byte representing several characters would be problematic, because it is a fundamentally different concept from what is usual for hex editors. The assumption that one byte corresponds to at most one character is deeply rooted in the design. That's what I meant. Multi-byte character sets, on the other hand, require some work but are less difficult to achieve.
I have coded a few utilities dealing with these tables before. Basically, I use a dictionary or hash table mapping hex values to one or more characters. This way it doesn't matter whether it's 1 byte to 1 character, 2 bytes to 1 character, 1 byte to 3 characters, etc. It can handle any number of bytes to any number of characters. That's where the flexibility comes in.
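For the reverse direction (turning edited text back into bytes), such a utility can invert the same table and match the longest character string first. A rough sketch with made-up code, not the actual tools mentioned above:

Code:

# Sketch of the reverse direction (text back to bytes) using the inverted table.
TABLE = {
    b"\xAD\x34": "A",
    b"\x1A": "textstring1",
    b"\x1B": "textstring2",
}

def encode(text, table):
    """Greedy longest-string match of the text against the inverted table."""
    reverse = {chars: raw for raw, chars in table.items()}
    max_len = max(len(s) for s in reverse)
    out, i = bytearray(), 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in reverse:
                out += reverse[piece]
                i += length
                break
        else:
            raise ValueError("no table entry for text at position %d" % i)
    return bytes(out)

print(encode("textstring1A", TABLE).hex(" "))  # -> 1a ad 34

Greedy longest-match is usually enough for tables like these, though a stricter tool might backtrack when entries overlap.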
Maël wrote: Are there character sets that represent several characters with one byte?
Probably not in the traditional sense of the definition. The cases I have seen where one byte could represent more than one character are crude data compression in character encoding schemes that are already custom. For example, a frequently used vowel combination such as 'io' might be represented by one byte to save space, or in some cases a frequently used word such as 'and' might be represented by a byte. So this might fall somewhat outside of character encoding, but it IS still part of the character encoding used in these cases.

Maël
Site Admin
Posts: 1066
Joined: 12 Mar 2005 14:15

Re: Custom Character Encoding for Text

Post by Maël » 14 Jul 2008 19:48

user_anon wrote:I have coded a few utilities dealing with these tables before. Basically, I use a dictionary or hash table mapping hex values to one or more characters. This way it doesn't matter whether it's 1 byte to 1 character, 2 bytes to 1 character, 1 byte to 3 characters, etc. It can handle any number of bytes to any number of characters. That's where the flexibility comes in.
Finding a data structure for the character encoding is not the problem. Without going into too much detail, it complicates the drawing logic (for example, determining how many characters fit into the view, which could become a speed issue for large files), requires quite some changes to caret handling, etc. Clipboard handling also comes to mind. I don't want to make it more complex than necessary; that would make it harder to find bugs and to maintain.
user_anon wrote:Probably not in the traditional sense of the definition. The cases I have seen where one byte could represent more than one character are crude data compression in character encoding schemes that are already custom. For example, a frequently used vowel combination such as 'io' might be represented by one byte to save space, or in some cases a frequently used word such as 'and' might be represented by a byte. So this might fall somewhat outside of character encoding, but it IS still part of the character encoding used in these cases.
In this case I think first "uncompressing" the data might be the better approach. Things like this will be considered in future versions, with richer data interpretation and on-the-fly conversions back and forth.

Maël
Site Admin
Posts: 1066
Joined: 12 Mar 2005 14:15

Re: HxD: Custom Character Encoding for Text

Post by Maël » 10 Feb 2019 08:19

Code:

1A=textstring1
1B=textstring2
Encodings of this format (a few bytes that expand to many characters) can be supported if the character string is seen as an unmodifiable symbol. That is, like a normal character, it will either be deleted in its entirety or inserted in its entirety. The individual characters that make up the string-symbol are never considered when editing, only when searching.
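One way to picture that model (a sketch of the idea only, not HxD's design): the text column holds symbols rather than raw characters, and each symbol remembers the byte range it was decoded from, so deleting it always removes all of its bytes at once.

Code:

# Sketch of the "unmodifiable symbol" idea (illustrative only, not HxD's design):
# the text column is a sequence of symbols, each tied to the byte range it decodes.
from dataclasses import dataclass

@dataclass
class Symbol:
    text: str      # what is shown in the text column, e.g. "textstring1"
    start: int     # offset of the first byte it was decoded from
    length: int    # number of bytes it covers

def delete_symbol(data: bytearray, symbols: list, index: int) -> None:
    """Deleting a symbol removes all of its bytes; its characters are never split."""
    sym = symbols.pop(index)
    del data[sym.start:sym.start + sym.length]
    for later in symbols[index:]:   # re-anchor the symbols that follow
        later.start -= sym.length

data = bytearray(b"\x1A\x1B")
symbols = [Symbol("textstring1", 0, 1), Symbol("textstring2", 1, 1)]
delete_symbol(data, symbols, 0)
print(data.hex(), [s.text for s in symbols])  # -> 1b ['textstring2']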

To enter such a multi-character string-symbol, something similar to an input method editor (IME), as is common for Asian languages, could be used, or perhaps more of an autocomplete-like popup window.
Only once an item from the autocomplete window is selected will it be inserted in the text column, so only existing symbols (= single characters or predefined multi-character strings) in the Thingy table can be selected, and no invalid or undefined symbols (ones that come from Unicode but are not part of the Thingy table) can be inserted.

For single characters that are not also the start of a multi-character string-symbol, no autocomplete window would pop up and they could be inserted directly.

To decode text, the same approach as for MBCS encodings could be used. HxD will, however, require the encoding to be self-synchronizing, like UTF-8 is, so that it is clear which byte is a lead byte and which is a trail byte, allowing unambiguous decoding.
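Whether a given table meets that requirement can be checked up front. A sketch, under the assumption that "self-synchronizing" here means lead bytes and trail bytes never overlap, as in UTF-8:

Code:

# Sketch: check that a table's byte sequences keep lead bytes and trail bytes
# disjoint, so a decoder can always tell where a character starts. Illustrative only.
def is_self_synchronizing(table):
    lead_bytes  = {seq[0] for seq in table}
    trail_bytes = {b for seq in table for b in seq[1:]}
    return lead_bytes.isdisjoint(trail_bytes)

good = {b"\x1A": "x", b"\xAD\x34": "A", b"\xAD\x35": "B"}
bad  = {b"\x1A": "x", b"\x34\x1A": "y"}  # 0x1A is both a lead and a trail byte

print(is_self_synchronizing(good))  # True
print(is_self_synchronizing(bad))   # False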

There might be solutions to this problem, but as with Shift-JIS, there seems to be no satisfactory one that doesn't just give random results. See the discussion here: viewtopic.php?f=4&t=1004

j7n
Posts: 9
Joined: 28 Jan 2019 18:26

Re: HxD: Custom Character Encoding for Text

Post by j7n » 16 Feb 2019 03:22

Hiew has a function like this, though not as sophisticated as what is planned/requested here. I used it to view Windows text in DOS back in the day. In the old version that I have here, it has a simple, as close to 1-to-1 as possible byte mapping between the "current encoding" (assumed to be Russian DOS, but today it should probably be Unicode) and the target, with three tables: target to current, current to target, and target to its uppercase equivalent. The reason they have three tables, as I understand it, is to permit not just viewing the text, but also typing into the text column and having it translated and stored in the foreign encoding, and searching in case-insensitive mode.
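Roughly, the three tables could be used like this (a sketch of the idea with made-up byte values, not Hiew's actual tables):

Code:

# Sketch of the three-table scheme described above (illustrative byte values only):
# 256-entry maps for viewing, for typing/storing, and for case-insensitive search.
target_to_current = {0xC0: "А", 0xE0: "а"}  # view: foreign byte -> displayable char
current_to_target = {"А": 0xC0, "а": 0xE0}  # edit: typed char  -> foreign byte
target_upper      = {0xE0: 0xC0}            # search: byte -> its uppercase byte

def show(data):                   # text column rendering
    return "".join(target_to_current.get(b, ".") for b in data)

def store(text):                  # typing into the text column
    return bytes(current_to_target[ch] for ch in text)

def fold_case(data):              # case-insensitive search compares folded bytes
    return bytes(target_upper.get(b, b) for b in data)

print(show(b"\xC0\xE0"))                          # -> Аа
print(store("Аа").hex(" "))                       # -> c0 e0
print(fold_case(b"\xE0") == fold_case(b"\xC0"))   # -> True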

Being able to edit the text is quite often needed, and seems to be overlooked in some editors. Search isn't essential, but maybe good to have too.

I don't really need this function today, but it might be useful to somehow make language encodings other than "ANSI" and "OEM" selectable.

Maël
Site Admin
Posts: 1066
Joined: 12 Mar 2005 14:15

Re: HxD: Custom Character Encoding for Text

Post by Maël » 17 Feb 2019 14:12

Related post "Adding support for Shift-JIS in hex editor HxD": https://www.romhacking.net/forum/index. ... ic=27943.0
