
unicode two bytes chars, how can I enter ?

Posted: 17 Feb 2019 18:16
by xenofx
Hello everyone,

I need to edit my language files, which contain UTF-8 encoded Unicode characters, to translate the text of a car player system. Some UTF-8 characters take two bytes, but when I type these characters directly on the keyboard, HxD writes only one byte. After I upload the files to the original system, these characters do not show on its screen. All of these UTF-8 characters must be two bytes.
Check the screenshot.

On the left, the ş character was entered in the right pane with the keyboard (as a string); you can see the hex code is FE.
On the right, the (Ş) ş character was entered in the left pane with the keyboard (as hex numbers); you can see the hex code is c5 9e.
Screenshot_2.jpg
The FE hex code does not show on the original system's screen, but the c5 9e sequence does show there.

What I want is this: when I press the ş key on the keyboard, HxD should write the two UTF-8 bytes (c59f for ş), not FE. Is that possible? Entering the hex codes by hand wastes too much time.

Other characters:

ç : c3a7
ı : c4b1
ö : c3b6
ş : c59f (uppercase Ş : c59e)
ü : c3bc

Thank you, and sorry for my English. (I don't like Google Translate.)
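For illustration, the two byte patterns from the post can be reproduced outside HxD. This is only a minimal Python sketch: it assumes the single byte FE comes from the Windows-1254 (Turkish) code page, which the thread does not confirm, and `byte_forms` is a hypothetical helper name.

```python
# Assumption: the one-byte form (FE for "ş") is Windows-1254, the Turkish ANSI code page.
def byte_forms(ch: str) -> tuple:
    """Return (UTF-8 hex, Windows-1254 hex) for a single character."""
    utf8 = ch.encode("utf-8")      # multi-byte form the target system displays
    ansi = ch.encode("cp1254")     # single-byte form typed into the text column
    return utf8.hex(" "), ansi.hex(" ")

for ch in "çıöşüŞ":
    utf8_hex, ansi_hex = byte_forms(ch)
    print(f"{ch}: UTF-8 {utf8_hex} | Windows-1254 {ansi_hex}")
```

Under that assumption, the lowercase ş comes out as c5 9f in UTF-8 and FE in Windows-1254, while c5 9e is the uppercase Ş.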

Re: unicode two bytes chars, how can I enter ?

Posted: 17 Feb 2019 18:50
by Maël
HxD does not support UTF-8 in the text column yet. See this thread: https://forum.mh-nexus.de/viewtopic.php?f=4&t=1004

What you can do is write your text in a text editor, save it as UTF-8 without BOM, then open it in HxD.
You can copy the relevant byte sequences from this file and paste them into your destination file.

It's not super comfortable, but reasonably efficient.

You can also use the UTF-8 Codepoint row in the data inspector.
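As an alternative to the text-editor step above, the byte sequences can also be produced by a few lines of script. This is a rough sketch in Python, not an HxD feature; `utf8_hex` is a hypothetical helper name, and the sample strings are only examples.

```python
def utf8_hex(text: str) -> str:
    """Encode text as UTF-8 (the "utf-8" codec adds no BOM) and return spaced hex bytes."""
    return text.encode("utf-8").hex(" ")

# Print the byte sequence for each translated string so it can be
# entered or checked in the hex column.
for sample in ["ş", "çıöşü"]:
    print(sample, "->", utf8_hex(sample))
```

Note that Python's `"utf-8-sig"` codec would prepend the BOM bytes ef bb bf, which is what "without BOM" is meant to avoid.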

Re: unicode two bytes chars, how can I enter ?

Posted: 17 Feb 2019 19:04
by xenofx
Thank you for your answer.

Maybe you could add a function for this: when I press the "ş" key, HxD writes c59f instead of FE.

You could add an option window in the settings area:

Convert char with these hex codes:
char: ş
convert to this: c59f

Re: unicode two bytes chars, how can I enter ?

Posted: 17 Feb 2019 19:09
by Maël
It's planned, as mentioned in the forum thread linked above.

Re: unicode two bytes chars, how can I enter ?

Posted: 17 Feb 2019 19:43
by xenofx
I found another program that has this function. If it's no problem, I can name it in this thread.

Re: unicode two bytes chars, how can I enter ?

Posted: 17 Feb 2019 21:46
by Maël
What tool did you find? It might be interesting as inspiration for UTF-8 support (which I am developing right now), though most hex editors I have seen have some bugs.

Re: unicode two bytes chars, how can I enter ?

Posted: 18 Feb 2019 16:57
by xenofx
wxmedit supports the function I want.

Check the screenshot: the "ç" character takes two bytes.

wxmedit:
Screenshot_4.jpg
I need one more option, pattern coloring. Other programs already have all the functions I want :)

https://docs.hhdsoftware.com/hex/defini ... indow.html

Edit: Hex Editor Neo has both UTF-8 support and pattern coloring.

Re: unicode two bytes chars, how can I enter ?

Posted: 18 Feb 2019 17:29
by xenofx
Pattern coloring is useful: it shows me where I have to stop changing a string.

This is Hex Editor Neo; it supports UTF-8 too.
Screenshot_5.jpg

Re: unicode two bytes chars, how can I enter ?

Posted: 27 Feb 2019 10:16
by Maël
Thanks for your feedback.

As can be seen in both neo and wxmedit, UTF-8 support is hacked into the grid-based UI by adding spaces after characters that are encoded as two or more bytes, which makes it hard to distinguish between actual spaces and filler spaces. The filler spaces are especially impractical when a character is made up of 4 bytes but is only as wide as one average Latin character. Such a hack allows for a much quicker implementation, but I prefer a good user experience, even if that means it will take (much) longer to implement.
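The filler-space ambiguity can be shown with a small sketch. This is an assumed model of a one-cell-per-byte text column written in Python, not the actual code of neo or wxmedit; `text_column` and its `filler` parameter are hypothetical.

```python
def text_column(data: bytes, filler: str = " ") -> str:
    """Render valid UTF-8 bytes in a grid with one cell per byte.

    Each character occupies the cell of its first byte; the cells of its
    remaining bytes are padded with the filler.
    """
    cells = []
    for ch in data.decode("utf-8"):
        width_in_bytes = len(ch.encode("utf-8"))
        cells.append(ch + filler * (width_in_bytes - 1))
    return "".join(cells)

data = "ö b".encode("utf-8")            # bytes: c3 b6 20 62
print(repr(text_column(data)))          # filler and real space look identical
print(repr(text_column(data, "·")))     # a visible filler makes the difference obvious
```

With a plain space as filler, the rendered row contains two spaces between ö and b, and nothing tells the reader that only the second one is in the file.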

Neo also corrupts the characters when you edit the text, treating them as individual bytes instead of UTF-8 code units that belong together.
wxmedit is better here, but it is also buggy: when you have a text like "hö" and change the h to an ö as well, it will mess up the second ö into something "random". wxmedit also does not handle combining characters properly: it displays them together sometimes, but not when there is a line wrap, and it corrupts them when you overwrite (by handling them individually). It's inconsistent.
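The "hö" case can be reproduced with a short sketch. It is written in Python and only illustrates the general class of bug under the assumption of a fixed-size byte overwrite; it is not a description of how wxmedit or neo work internally, and both function names are hypothetical.

```python
def overwrite_bytes(data: bytes, offset: int, new: bytes) -> bytes:
    """Naive fixed-size overwrite: new bytes replace whatever bytes were there."""
    return data[:offset] + new + data[offset + len(new):]

def replace_char(data: bytes, char_index: int, new_char: str) -> bytes:
    """Character-aware replace: swap one whole character, letting the size change."""
    chars = list(data.decode("utf-8"))
    chars[char_index] = new_char
    return "".join(chars).encode("utf-8")

original = "hö".encode("utf-8")                          # 68 c3 b6
naive = overwrite_bytes(original, 0, "ö".encode("utf-8"))
aware = replace_char(original, 0, "ö")

print(naive.hex(" "))   # the old ö loses its lead byte
print(aware.hex(" "))   # both characters stay intact, the data grows by one byte
```

The naive result ends in a lone continuation byte, so it is no longer valid UTF-8. The character-aware version stays valid but changes the length, which is exactly the trade-off an overwrite mode has to resolve.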

wxmedit claims to support encodings such as Shift-JIS, yet this encoding is not self-synchronizing, so the decoding can change arbitrarily when jumping to a random part of the file (that is, when the file has not been read in full from the start).
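To make the self-synchronization point concrete, here is a small Python sketch. In UTF-8 every continuation byte has the bit pattern 10xxxxxx, so a decoder can find the next character start from any offset; in Shift-JIS a trail byte can fall into the ASCII range, so a lone byte cannot be classified without context. `next_utf8_boundary` is a hypothetical helper name.

```python
def next_utf8_boundary(data: bytes, offset: int) -> int:
    """Skip UTF-8 continuation bytes (10xxxxxx) to reach the next character start."""
    while offset < len(data) and (data[offset] & 0xC0) == 0x80:
        offset += 1
    return offset

utf8 = "hö€".encode("utf-8")             # 68 | c3 b6 | e2 82 ac
print(next_utf8_boundary(utf8, 2))       # offset 2 is inside ö, next start is 3
print(next_utf8_boundary(utf8, 4))       # offset 4 is inside €, next start is 6 (end)

# Shift-JIS: the trail byte of the katakana "ソ" is 0x5c, which is also a
# valid single-byte value, so the byte alone does not reveal its role.
sjis = "ソ".encode("shift_jis")
print(sjis.hex(" "))                     # 83 5c
```

A viewer that starts decoding Shift-JIS in the middle of a file therefore has to guess where characters begin, whereas for UTF-8 looking at a few bytes is enough.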

In summary, the claimed features work only partially.

I see they frame bytes to show which bytes belong to a character. This is useful and is done in HxD's data inspector as well.
Color highlighting is nice, but will conflict with the structure viewer that is planned.

So instead of these two approaches, maybe making the hex bytes italic or adding spacing will help identify which group of bytes form which character (since framing is already used by the data inspector).
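The spacing idea can be sketched as follows. This is an illustrative Python snippet, not HxD's implementation; it assumes the input is valid UTF-8, and `group_hex_by_char` is a hypothetical name.

```python
def group_hex_by_char(data: bytes) -> str:
    """Format hex bytes with a wider gap between the byte groups of different characters."""
    text = data.decode("utf-8")          # assumes valid UTF-8 input
    groups = [ch.encode("utf-8").hex(" ") for ch in text]
    return "  ".join(groups)             # double space separates characters

print(group_hex_by_char("hö€".encode("utf-8")))   # 68  c3 b6  e2 82 ac
```

A single space separates bytes of the same character and a double space separates characters, so the grouping is visible without frames or colors.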

I am also listing all these issues, so it becomes more obvious why a correct implementation is not trivial.

Though, other examples might be good as inspiration, if you find some.