HxD: Custom Character Encoding for Text

Wishlists for new functionality and features.
user_anon

HxD: Custom Character Encoding for Text

Post by user_anon » 02 Jul 2008 20:22

First Maël, you've got a great little freeware hex editor with HxD. It is the best freeware hex editor I've used and I've looked at many. It's fast, good on resources, and has an excellent user interface. I've followed it for a few years now and it just keeps getting better!

I represent a community of people that use hex editors regularly to assist in modifying classic video games as well as translating classic games from Japanese to all languages. (See http://www.romhacking.net for further reference).

HxD is growing into a popular option for the people of my community, but there is one feature we'd love to see, and that's a way to have custom character encoding for the text view (labeled 'Charset' in HxD). The idea would be to be able to view the text in non-standard encoding formats other than the provided ASCII, EBCDIC, etc.

An external custom table file (simple text) could be used for this with the hex values on the left and corresponding text characters on the right. Example:

00=0
01=1
02=2
.....
1F=A
20=B
21=C
......
65=.

This way, non-standard text encodings could be easily viewed and edited in the hex editor, just as ASCII can be now.

1F 20 21 21 21 in the hex view would show 'ABCCC' for the equivalent text.
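To make the idea concrete, here is a minimal sketch in Python of loading such a table file and rendering bytes through it. The file layout and helper names are hypothetical; this only illustrates the example above and is not HxD's format or code.

Code:

# Sketch only: parse a simple one-byte table file and render bytes through it.
def load_table(path):
    """Parse lines like '1F=A' into a dict mapping a byte value to a character."""
    table = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if "=" not in line:
                continue  # skip blank or placeholder lines
            hex_part, char_part = line.split("=", 1)
            table[int(hex_part, 16)] = char_part
    return table

def to_text(data, table):
    """Render bytes in the text column using the table; '.' for unmapped bytes."""
    return "".join(table.get(b, ".") for b in data)

# Example from the post: with 1F=A, 20=B, 21=C loaded,
# the bytes 1F 20 21 21 21 display as 'ABCCC'.
table = {0x1F: "A", 0x20: "B", 0x21: "C"}
print(to_text(bytes([0x1F, 0x20, 0x21, 0x21, 0x21]), table))  # -> ABCCC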


Additional wish list for expansion on this feature:

1. UTF-8 or Unicode should be supported to be able to view text in other languages, say Japanese characters in our particular case. I believe HxD already handles Unicode though.

2. Ability to handle multi-byte to multi-char encoding. Example:

AD34=A
AD35=B
........
1A=textstring1
1B=textstring2

1A 1B 1A 1B in hex would read out as 'textstring1textstring2textstring1textstring2' on the text side for instance.
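One way to decode such mixed entries (just a sketch of the idea, not how HxD would necessarily do it) is to key a dictionary on byte sequences and always try the longest match first, for example in Python:

Code:

# Sketch: greedy longest-match decoding for a table with mixed-length entries.
TABLE = {
    b"\xAD\x34": "A",        # two bytes -> one character
    b"\xAD\x35": "B",
    b"\x1A": "textstring1",  # one byte -> many characters
    b"\x1B": "textstring2",
}

def decode(data, table):
    """Try the longest byte sequence first; unmapped bytes become '.'."""
    max_len = max(len(k) for k in table)
    out, i = [], 0
    while i < len(data):
        for length in range(min(max_len, len(data) - i), 0, -1):
            chunk = bytes(data[i:i + length])
            if chunk in table:
                out.append(table[chunk])
                i += length
                break
        else:
            out.append(".")  # no table entry starts at this byte
            i += 1
    return "".join(out)

# Examples from the post:
print(decode(b"\x1A\x1B\x1A\x1B", TABLE))  # -> textstring1textstring2textstring1textstring2
print(decode(b"\xAD\x34\xAD\x35", TABLE))  # -> AB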



I'd love to discuss this further if you're interested in adding such a feature to your editor! I could recommend several ways to do this. I am also a programmer myself and could assist in the logic if you need it.

Maël
Site Admin
Posts: 1066
Joined: 12 Mar 2005 14:15

Re: Custom Character Encoding for Text

Post by Maël » 05 Jul 2008 03:01

Thanks for the nice comment and sorry for the late reply :-)
user_anon wrote:There is one feature we'd love to see, and that's a way to have custom character encoding for the text view (labeled 'Charset' in HxD). The idea would be to be able to view the text in non-standard encoding formats other than the provided ASCII, EBCDIC, etc.
User-defined charset/character encoding is planned. It would come after support for multi-byte character encodings, such as UTF-8 or the common Asian encodings. These need special handling because not only the number of bytes per character but also the width of the glyph representing a character varies depending on the character code. For a Latin charset, all characters are one byte and have equally sized glyphs.
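As a small illustration of both points (unrelated to HxD's own code), in Python the byte count per character and the Unicode width class both vary with the character:

Code:

# Illustration only (not HxD's code): with UTF-8, both the number of bytes per
# character and the typical glyph width vary with the character.
import unicodedata

for ch in "Aé漢":
    print(ch,
          len(ch.encode("utf-8")),             # 1, 2 and 3 bytes respectively
          unicodedata.east_asian_width(ch))    # width class; '漢' is 'W' (wide)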
user_anon wrote:An external custom table file (simple text) could be used for this with the hex values on the left and corresponding text characters on the right.
I thought of having such a file in a binary format, but I am not sure yet. To generate a character table HxD would provide a window with a grid/table, where you would enter for each byte (from 0..255) the Unicode character it should represent.
Are there any common formats for character tables? Firefox maybe?
user_anon wrote:1. UTF-8 or Unicode should be supported to be able to view text in other languages, say Japanese characters in our particular case. I believe HxD already handles Unicode though.
HxD supports Unicode only partially; currently it cannot display multi-byte charsets (see above). But when it does, it will also support them in character tables.
user_anon wrote:2. Ability to handle multi-byte to multi-char encoding.
AD34=A
AD35=B
Ok
user_anon wrote: 1A=textstring1
1B=textstring2
That is going to be tough because it would completely break the assumption that all the lines in the hex editor window are about the same width, cause alignment problems, and basically not work when representing data in a grid-like fashion like hex editors do.
For multi-byte to 1 char conversions (first example) this is less problematic, because there will be only small variations in the width: usually an Asian glyph has at most two times the width of a Latin glyph.

user_anon

Re: Custom Character Encoding for Text

Post by user_anon » 07 Jul 2008 16:43

Maël wrote: User-defined charset/character encoding is planned. It would come after support for multi-byte character encodings, such as UTF-8 or the common Asian encodings. These need special handling because not only the number of bytes per character but also the width of the glyph representing a character varies depending on the character code. For a Latin charset, all characters are one byte and have equally sized glyphs.
Right. I'm familiar with various Japanese encodings as well as UTF-8. The multi-byte and variable-width groundwork would certainly need to be done first.
Maël wrote: I thought of having such a file in a binary format, but I am not sure yet. To generate a character table HxD would provide a window with a grid/table, where you would enter for each byte (from 0..255) the Unicode character it should represent.
Are there any common formats for character tables? Firefox maybe?
We use tables in text files in our community because it offers the greatest flexibility. You are free to assign and mix multi-byte values with single-byte values, and values that equate to more than one character. This covers all sorts of non-standard encodings found in various sources. That's really the point of custom encoding, isn't it? You want to cover the greatest range of possibilities for non-standard encoding schemes. Doing this through a grid/table defined in HxD would be more restrictive than the text-file idea I presented earlier.
Maël wrote: That is going to be tough because it would completely break the assumption that all the lines in the hex editor window are about the same width, cause alignment problems, and basically not work when representing data in a grid-like fashion like hex editors do.
For multi-byte to 1 char conversions (first example) this is less problematic, because there will be only small variations in the width: usually an Asian glyph has at most two times the width of a Latin glyph.
It wouldn't change the hex view. That would remain exactly as is. It would only change the TEXT view. The text view wouldn't line up exactly the same on each line, you're right. Each line of text could potentially vary in length. I will ask a few people for some insight on the best way to accomplish something like this. I'll get back to you on this one.

Maël
Site Admin
Posts: 1066
Joined: 12 Mar 2005 14:15

Re: Custom Character Encoding for Text

Post by Maël » 11 Jul 2008 02:37

user_anon wrote:We use tables in text files in our community because it offers the greatest flexibility.
1.) I guess the number on the right-hand side would be a Unicode code point.
2.) A text file is okay, but if there is already a standard (I am not aware of any) it might be useful to follow it, so already available encoding files would be compatible.
user_anon wrote:It wouldn't change the hex view. That would remain exactly as is. It would only change the TEXT view. The text view wouldn't line up exactly the same on each line, you're right. Each line of text could potentially vary in length. I will ask a few people for some insight on the best way to accomplish something like this. I'll get back to you on this one.
I meant that one byte representing several characters would be problematic, because it is a fundamentally different concept from what is usual for hex editors. The assumption that one byte corresponds to at most one character is deeply rooted in the design. That's what I meant. Multi-byte character sets, on the other hand, require some work but are less difficult to achieve.

Are there character sets that represent several characters with one byte?

user_anon

Re: Custom Character Encoding for Text

Post by user_anon » 14 Jul 2008 19:16

Maël wrote:
user_anon wrote:We use tables in text files in our community because it offers the greatest flexibility.
1.) I guess the number on the right-hand side would be a Unicode code point.
2.) A text file is okay, but if there is already a standard (I am not aware of any) it might be useful to follow it, so already available encoding files would be compatible.
1. Right. Hex on one side, characters on the other. Unicode (assuming UTF-8) would be fine there.
2. Are there any standards? If there are, I'd be interested in taking a look. I'm not aware of any personally.
Maël wrote: I meant that one byte representing several characters would be problematic, because it is a fundamentally different concept from what is usual for hex editors. The assumption that one byte corresponds to at most one character is deeply rooted in the design. That's what I meant. Multi-byte character sets, on the other hand, require some work but are less difficult to achieve.
I have coded a few utilities dealing with these tables before. Basically, I use a dictionary or hash table mapping hex values to one or more characters. This way it doesn't matter whether it's 1 byte to 1 character, 2 bytes to 1 character, 1 byte to 3 characters, etc. It can handle any number of bytes to any number of characters. That's where the flexibility comes in.
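For the reverse direction (turning edited text back into bytes), such a utility can invert the same table and match the longest character string first. A rough sketch with made-up code, not the actual tools mentioned above:

Code:

# Sketch of the reverse direction (text back to bytes) using the inverted table.
TABLE = {
    b"\xAD\x34": "A",
    b"\x1A": "textstring1",
    b"\x1B": "textstring2",
}

def encode(text, table):
    """Greedy longest-string match of the text against the inverted table."""
    reverse = {chars: raw for raw, chars in table.items()}
    max_len = max(len(s) for s in reverse)
    out, i = bytearray(), 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in reverse:
                out += reverse[piece]
                i += length
                break
        else:
            raise ValueError("no table entry for text at position %d" % i)
    return bytes(out)

print(encode("textstring1A", TABLE).hex(" "))  # -> 1a ad 34

Greedy longest-match is usually enough for tables like these, though a stricter tool might backtrack when entries overlap.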
Maël wrote: Are there character sets that represent several characters with one byte?
Probably not in the traditional sense of the definition. The cases I have seen where one byte could represent more than one character are crude data compression in character encoding schemes that are already custom. For example, a frequently used vowel combination such as 'io' might be represented by one byte to save space, or in some cases a frequently used word such as 'and' might be represented by a byte. So this might fall somewhat outside of character encoding, but it IS still part of the character encoding used in these cases.

Maël
Site Admin
Posts: 1066
Joined: 12 Mar 2005 14:15

Re: Custom Character Encoding for Text

Post by Maël » 14 Jul 2008 19:48

user_anon wrote:I have coded a few utilities dealing with these tables before. Basically, I use a dictionary or hash table mapping hex values to one or more characters. This way it doesn't matter whether it's 1 byte to 1 character, 2 bytes to 1 character, 1 byte to 3 characters, etc. It can handle any number of bytes to any number of characters. That's where the flexibility comes in.
Finding a data structure for the character encoding is not the problem. Without going into too much detail, it complicates the drawing logic (for example, determining how many characters fit into the view, which could become a speed issue for large files), requires quite some changes to caret handling, etc. Clipboard handling also comes to mind. I don't want to make it more complex than necessary; that would make it harder to find bugs and to maintain.
user_anon wrote:Probably not in the traditional sense of the definition. The cases I have seen where one byte could represent more than one character are crude data compression in character encoding schemes that are already custom. For example, a frequently used vowel combination such as 'io' might be represented by one byte to save space, or in some cases a frequently used word such as 'and' might be represented by a byte. So this might fall somewhat outside of character encoding, but it IS still part of the character encoding used in these cases.
In this case I think first "uncompressing" the data might be the better approach. Things like this will be considered in future versions, with richer data interpretation and on-the-fly conversions back and forth.

Maël
Site Admin
Posts: 1066
Joined: 12 Mar 2005 14:15

Re: HxD: Custom Character Encoding for Text

Post by Maël » 10 Feb 2019 08:19

Code:

1A=textstring1
1B=textstring2
Encodings of this format (a few bytes that expand to many characters) can be supported if the character string is seen as an unmodifiable symbol. That is, like a normal character, it will either be deleted in its entirety or inserted in its entirety. The individual characters that make up the string-symbol are never considered when editing, only when searching.
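One way to picture that model (a sketch of the idea only, not HxD's design): the text column holds symbols rather than raw characters, and each symbol remembers the byte range it was decoded from, so deleting it always removes all of its bytes at once.

Code:

# Sketch of the "unmodifiable symbol" idea (illustrative only, not HxD's design):
# the text column is a sequence of symbols, each tied to the byte range it decodes.
from dataclasses import dataclass

@dataclass
class Symbol:
    text: str      # what is shown in the text column, e.g. "textstring1"
    start: int     # offset of the first byte it was decoded from
    length: int    # number of bytes it covers

def delete_symbol(data: bytearray, symbols: list, index: int) -> None:
    """Deleting a symbol removes all of its bytes; its characters are never split."""
    sym = symbols.pop(index)
    del data[sym.start:sym.start + sym.length]
    for later in symbols[index:]:   # re-anchor the symbols that follow
        later.start -= sym.length

data = bytearray(b"\x1A\x1B")
symbols = [Symbol("textstring1", 0, 1), Symbol("textstring2", 1, 1)]
delete_symbol(data, symbols, 0)
print(data.hex(), [s.text for s in symbols])  # -> 1b ['textstring2']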

To enter such a multi-character string-symbol, something similar to an input method editor (IME), as is common for Asian languages, could be used, or perhaps more of an autocomplete-like popup window.
Only once an item from the autocomplete window is selected will it be inserted in the text column, so only existing symbols (= single characters or predefined multi-character strings) in the Thingy table can be selected, and no invalid or undefined symbols (ones that come from Unicode but are not part of the Thingy table) can be inserted.

For single characters that are not also the start of a multi-character string-symbol, no autocomplete window would pop up and they could be inserted directly.

To decode text, the same approach as for MBCS encodings could be used. HxD will, however, require the encoding to be self-synchronizing, like UTF-8 is, so that it is clear which byte is a lead byte and which is a trail byte, allowing unambiguous decoding.
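Whether a given table meets that requirement can be checked up front. A sketch, under the assumption that "self-synchronizing" here means lead bytes and trail bytes never overlap, as in UTF-8:

Code:

# Sketch: check that a table's byte sequences keep lead bytes and trail bytes
# disjoint, so a decoder can always tell where a character starts. Illustrative only.
def is_self_synchronizing(table):
    lead_bytes  = {seq[0] for seq in table}
    trail_bytes = {b for seq in table for b in seq[1:]}
    return lead_bytes.isdisjoint(trail_bytes)

good = {b"\x1A": "x", b"\xAD\x34": "A", b"\xAD\x35": "B"}
bad  = {b"\x1A": "x", b"\x34\x1A": "y"}  # 0x1A is both a lead and a trail byte

print(is_self_synchronizing(good))  # True
print(is_self_synchronizing(bad))   # False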

There might be solutions to this problem, but as with Shift-JIS, there seems to be no satisfactory one that doesn't just give random results. See the discussion here: viewtopic.php?f=4&t=1004

j7n
Posts: 9
Joined: 28 Jan 2019 18:26

Re: HxD: Custom Character Encoding for Text

Post by j7n » 16 Feb 2019 03:22

Hiew has a function like this, though not as sophisticated as what is planned/requested here. I used it to view Windows text in DOS back in the day. In the old version that I have here, it has a simple, as close to 1-to-1 as possible byte mapping between the "current encoding" (assumed to be Russian DOS, but today it should probably be Unicode) and the target, with three tables: target to current, current to target, and target to its uppercase equivalent. The reason they have three tables, as I understand it, is to permit not just viewing the text, but also typing into the text column and having it translated and stored in the foreign encoding, and searching in case-insensitive mode.
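Roughly, the three tables could be used like this (a sketch of the idea with made-up byte values, not Hiew's actual tables):

Code:

# Sketch of the three-table scheme described above (illustrative byte values only):
# 256-entry maps for viewing, for typing/storing, and for case-insensitive search.
target_to_current = {0xC0: "А", 0xE0: "а"}  # view: foreign byte -> displayable char
current_to_target = {"А": 0xC0, "а": 0xE0}  # edit: typed char  -> foreign byte
target_upper      = {0xE0: 0xC0}            # search: byte -> its uppercase byte

def show(data):                   # text column rendering
    return "".join(target_to_current.get(b, ".") for b in data)

def store(text):                  # typing into the text column
    return bytes(current_to_target[ch] for ch in text)

def fold_case(data):              # case-insensitive search compares folded bytes
    return bytes(target_upper.get(b, b) for b in data)

print(show(b"\xC0\xE0"))                          # -> Аа
print(store("Аа").hex(" "))                       # -> c0 e0
print(fold_case(b"\xE0") == fold_case(b"\xC0"))   # -> True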

Being able to edit the text is quite often needed, and seems to be overlooked in some editors. Search isn't essential, but maybe good to have too.

I don't really need this function today, but it might be useful to somehow make language encodings other than "ANSI" and "OEM" selectable.

Maël
Site Admin
Posts: 1066
Joined: 12 Mar 2005 14:15

Re: HxD: Custom Character Encoding for Text

Post by Maël » 17 Feb 2019 14:12

Related post "Adding support for Shift-JIS in hex editor HxD": https://www.romhacking.net/forum/index. ... ic=27943.0
