Thingy table support

Post by **gingsz** » 23 Sep 2013 23:05

Hi !

If HxD supported Thingy tables (.tbl files), it could be used for ROM editing or game translation.
.tbl file example (left is the hex value, right is the game letter):

74=v
75=w
76=x
77=y
78=z
7E=É
82=I
86=Ó
90=s
A0=ó
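
To make the format concrete, here is a minimal Python sketch of how a table in this simple HEX=text form could be read. It is only an illustration, not how Thingy or HxD does it; the name parse_tbl is made up, and the extra line types that some .tbl dialects have are simply not handled.

```python
def parse_tbl(text):
    """Parse 'HEX=text' lines into a dict mapping byte sequences to display strings."""
    table = {}
    for line in text.splitlines():
        if not line.strip() or "=" not in line:
            continue  # skip blank or malformed lines
        hex_part, _, shown = line.partition("=")
        # The value is kept as-is (not stripped), since a space can be a mapped character.
        table[bytes.fromhex(hex_part.strip())] = shown
    return table


SAMPLE = """74=v
75=w
7E=É
A0=ó
"""

table = parse_tbl(SAMPLE)
print(table[b"\x7e"])  # É
```

Using byte strings as keys (instead of single integers) means the same structure also works for entries that are longer than one byte.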

Post by **Maël** » 27 Sep 2013 15:16

You mean a manually defined character set table/encoding. So a table that maps each byte in the 0x00..0xFF range to a character?

Post by **justcool393** » 04 Oct 2013 01:24

Yeah, I'm pretty sure that's what gingsz means.

Post by **Remeer** » 05 Aug 2014 13:41

I'm aware the post is a little old; however, I also wanted to suggest support for .tbl files, so I figured I might as well post here.

I've been into rom hacking for years and I've often had to edit text in games, including a large project I'm taking on now, which is to add a lot of new text.
HxD is by far my favorite 'standard' hex editor; however, the need to see custom character sets forces me to open and use separate hex editors, or to write out the hex values of the text myself and paste them in, which is quite inconvenient compared to simply typing the text.
.tbl files are pretty common in rom translating communities, so quite a number of (older/more 'homebrewish') hex editors support them, and even just being able to use custom character encodings is a great help to those communities.

However, if I may, I'd like to actually expand on the request, instead of simply bumping it;

Thingy (the hex editor OP mentioned) supports tbl files, but it also has the ability to load more than one 'display character' per byte.
For example;

Normally- 52=y
More - CB=ou
(Silly example, but the project I'm working on does this).

This displays like such;

[Attached screenshot: z30sn1J.png]

(Look at that ugly UI! ;_;)
Sure, the output might look a little ugly since the character length isn't perfect along all lines, but it works wonders for those who need it.

Obviously I don't expect (If this even did get implemented) HxD to take inputted text and automatically shorten it to the single-byte equivalent.
For example, taking "ou" (0x484E) and automatically shortening it to 0xCB (which also means 'ou' by the table config). Though it would be nice, I imagine it would be quite a pain to add to HxD.
That said, I 'would' like HxD to be able to take .tbl's into consideration when searching. For example, when searching for "you" it should find both 0x52484E and 0x52CB. A little extra work on HxD's side in terms of processing, but I imagine it wouldn't be too difficult to actually implement.
I would also hope you could type into the 'text view' and it would automatically type in single characters defined in the tbl instead of the defaults.
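
To illustrate the search idea, here is a rough Python sketch that lists every byte sequence that would display as the search text and then looks for each of them. The byte values come from the example above; the function names encodings_of and find_all are made up, and this is just one possible approach, not anything HxD does.

```python
def encodings_of(text, table):
    """Return every byte string that displays as `text` under `table`."""
    if not text:
        return [b""]
    results = []
    for raw, shown in table.items():
        if shown and text.startswith(shown):
            for rest in encodings_of(text[len(shown):], table):
                results.append(raw + rest)
    return results


def find_all(data, text, table):
    """Return sorted (offset, matched bytes) pairs for all encodings of `text`."""
    hits = []
    for needle in encodings_of(text, table):
        start = data.find(needle)
        while start != -1:
            hits.append((start, needle))
            start = data.find(needle, start + 1)
    return sorted(hits)


table = {b"\x52": "y", b"\x48": "o", b"\x4e": "u", b"\xcb": "ou"}
data = bytes.fromhex("00 52 48 4E 00 52 CB")
hits = find_all(data, "you", table)  # one hit for 52 48 4E, one for 52 CB
```

The number of variants can grow quickly for long search strings with many multi-character entries, so a real implementation would probably want something smarter than plain enumeration.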

Aside from the obvious text-changing applications, this could also be used to simply give an easy view of specific things.
For example, again, in my current project, all shop data ends with F0FF. So if I made, say, F0FF=[/Shop] or something, there would be an easy indicator of where that routine ends. So .tbl files can have many uses, not only for text editing.
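
A text view that handles both cases (several display characters per byte, and several bytes per entry like F0FF) could decode roughly like the following Python sketch. It assumes a "longest entry wins" rule, which is my assumption and not necessarily what Thingy does; the name decode and the "." placeholder for unmapped bytes are made up.

```python
def decode(data, table, unknown="."):
    """Decode bytes with a table whose keys may be several bytes long (longest match first)."""
    longest = max(len(key) for key in table)
    out = []
    pos = 0
    while pos < len(data):
        for size in range(min(longest, len(data) - pos), 0, -1):
            chunk = data[pos:pos + size]
            if chunk in table:
                out.append(table[chunk])
                pos += size
                break
        else:
            out.append(unknown)  # no table entry starts at this byte
            pos += 1
    return "".join(out)


table = {
    b"\x52": "y",
    b"\x48": "o",
    b"\x4e": "u",
    b"\xcb": "ou",
    b"\xf0\xff": "[/Shop]",
}
print(decode(bytes.fromhex("52CB0052484EF0FF"), table))  # you.you[/Shop]
```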

I hope I explained things well enough, and I would 'love' to see this added to HxD. I've been using HxD for years now, and it's amazing. It's lightweight, does pretty much everything you need it to, and doesn't hold onto the file when it's not editing (which a lot of modern hex editors seem to do), which is a big thing for me. The only reason I've ever had to switch to another hex editor is .tbl files.

I hope this will be added someday.

Post by **Maël** » 06 Aug 2014 10:13

Your post suggests some interesting features and shows you put some effort into writing it, which cleared up some previous ambiguities. That makes me more motivated to implement this. Thanks!

I'll add here some links I found for future reference:

TBL file format:
http://datacrystal.romhacking.net/wiki/Text_Table
http://www.everything2.com/index.pl?node_id=1430647
http://transcorp.romhacking.net/scratch ... Format.txt

Example TBL file:
http://datacrystal.romhacking.net/wiki/ ... _Zelda:TBL

ROM maps:
http://datacrystal.romhacking.net/wiki/ROM_map
Could be interesting to translate that to structure views or bookmark views in future.

Detailed information about an analyzed and modded game: (the steps outlined there assume you have a ROM obtained from a cartridge you own; I don't want copyrighted ROMs linked here and AFAIK this link doesn't contain any ROM downloads):
http://datacrystal.romhacking.net/wiki/ ... d_of_Zelda

The following are general thoughts. I am not sure if I will implement this, but I am writing it down while I still remember it:

In general, an interesting feature could be to have special sequences, maybe matched by regex (or other grammar rules like EBNF), replaced by symbols. These could be increasingly hierarchical symbols that you can fold and unfold, like a structure view, but inline instead of in a separate view. A bit like how emacs has special custom-made characters (rounded rectangles filled with black and white letters/acronyms) to represent unprintable characters.
Similarly, you could define your own "characters"/symbols for certain sequences, give them colors, and then have this hierarchical grouping, where top-level symbols summarize lower-level symbols plus uninterpreted hex sequences.
This would allow tagging raw hex sequences, then gradually adding higher-level structures and rules to define a file format step by step. This would be done on the hex view side, which becomes a structure view side (as opposed to a raw data stream); the text view side would then be the "interpreted data view side", which decodes into text, numbers or whatever else the data represents.

The difference to a tree is that you can redefine the tree inline and switch locally between raw unstructured view and structured view, and that you can have partially structured data mixed with unstructured raw hex sequences. How exactly this last part is done to be intuitive is future work.

This is interesting in general for file format or packet analyzing or logic analyzer like displays. The point is to make a solution that is useful in many scenarios. TBL-files would then be translated to these structured views.
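
As a very rough sketch of the lowest level of this idea (regex-matched byte sequences replaced by named symbols, with everything else left as raw hex), something like the following Python could serve as a starting point. The rule names ShopEnd and Pad, the function name tag, and the "first rule that matches wins" policy are all made up for illustration; folding and higher levels are not covered.

```python
import re


def tag(data, rules):
    """Split `data` into (symbol, bytes) tokens; bytes matched by no rule get the symbol None."""
    compiled = [(name, re.compile(pattern, re.DOTALL)) for name, pattern in rules]
    tokens = []
    raw_start = 0  # start of the current run of untagged bytes
    pos = 0
    while pos < len(data):
        for name, regex in compiled:
            match = regex.match(data, pos)
            if match and match.end() > pos:
                if raw_start < pos:
                    tokens.append((None, data[raw_start:pos]))
                tokens.append((name, match.group()))
                pos = raw_start = match.end()
                break
        else:
            pos += 1  # no rule matches here, keep the byte as raw data
    if raw_start < len(data):
        tokens.append((None, data[raw_start:]))
    return tokens


rules = [("ShopEnd", rb"\xF0\xFF"), ("Pad", rb"\x00{2,}")]
tokens = tag(bytes.fromhex("0102F0FF00000003"), rules)
```

The token list keeps the matched bytes next to each symbol, so a view could always unfold a symbol back into its raw hex.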

Post by **Maël** » 17 Aug 2014 00:53

I am adding here information about rom hacking as I learn. The techniques used seem to be a nice specific and relatively simple case of what you do in general hex editing. Having this specific use case is good for developing general tools that help to understand unknown formats.

Lists what you need to get started in ROM hacking, especially text editing/translating:
http://www.romhacking.net/start/#text
http://www.zophar.net/fileuploads/2/107 ... stutor.txt

A summary of the steps involved is:

1. Use a program called Search Relative to find a known string with unknown encoding in a file.
   - The string would be one you know appears in the GUI of the game/program you are analyzing.
   - The file you search can be a program's/game's data file or executable. It can also be a ROM file (= a game console cartridge dump).
   - This search works based on the assumption that characters are represented by numbers (probably bytes) in ascending order (i.e., 'a'=n+0, 'b'=n+1, 'c'=n+2, ..., 'z'=n+25, where n is some integer).
   - Relative Searcher 2.5, an alternative tool (with source and better documentation), can be found here: http://www.romhacking.net/utilities/39/
   - SearchR 3.0 is another alternative that comes with a Windows GUI and source code: http://rainemu.swishparty.co.uk/searchr/
2. Potential matches are listed and the user has to pick the most plausible ones.
   - For example, if we search for a word and the matched word appears as part of a larger phrase that makes sense or is known to appear in the program/game, we very likely found what we looked for.
3. The selected matches can be used to derive partial encoding rules. The suggested program here is Table Maker.
   - A match will tell us the corresponding byte sequence for the searched string.
     For example: "CYNDAQUIL" corresponds to "03 19 0E 04 01 11 15 09 0C". So C=03, Y=19, etc.
   - This byte sequence lets us automatically derive a mapping from characters to their encoding.
4. Finally, the created table (= encoding rules) can be used in a hex editor called Thingy, which has a text view column (as any hex editor does) that decodes strings based on this table.
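
The relative search and the table derivation above can be sketched in a few lines of Python. This is my own minimal reading of the technique (comparing differences between neighbouring characters), not the code of Search Relative or Table Maker, and the function names are made up. The byte values are the ones from the CYNDAQUIL example.

```python
def relative_search(data, text):
    """Yield offsets where consecutive byte differences equal those of `text`."""
    diffs = [ord(b) - ord(a) for a, b in zip(text, text[1:])]
    for start in range(len(data) - len(text) + 1):
        window = data[start:start + len(text)]
        if all(window[i + 1] - window[i] == d for i, d in enumerate(diffs)):
            yield start


def derive_table(data, start, text):
    """Map each byte of the match at `start` to the character it stands for."""
    return {data[start + i]: ch for i, ch in enumerate(text)}


data = bytes.fromhex("FF FF 03 19 0E 04 01 11 15 09 0C FF")
matches = list(relative_search(data, "CYNDAQUIL"))  # [2]
table = derive_table(data, matches[0], "CYNDAQUIL")
print(table[0x03], table[0x19])  # C Y
```

Since only differences are compared, the search does not need to know n; the derived table is partial and only covers the characters that occur in the search string.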

Adaptation/improvements of this approach:

We could adapt this approach to HxD, by using regex search for all possible byte sequences of the given example string.
- Those would be found by applying all known encodings (e.g., Windows-1252, Shift-JIS, ...) or a subset of these encodings selected by the user. In addition, all possible encodings where characters get assigned numbers in alphabetical order (see first step) should be applied.
- The regex would be generated automatically.
Other options would be to search for common known encodings, or slight variations of that. We could also use the order of characters (=code points) in Unicode to determine the likely order of the characters' number representations (=code units) in specific encodings.
Apparently, using a tile editor it is possible to see the order of the character tiles and thereby find out if characters are encoded in alphabetical order or another order. We should support searching for characters given the order we found, and not just limit ourselves to alphabetical order.
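
A minimal Python sketch of the "generate the regex automatically from known encodings" part could look like this. It only covers encodings known to Python's codec machinery (the alphabetical-order encodings are left out), and the function name encoding_regex is made up.

```python
import re


def encoding_regex(text, encodings):
    """Build a bytes regex that matches `text` in any of the given encodings."""
    variants = []
    for name in encodings:
        try:
            encoded = text.encode(name)
        except UnicodeEncodeError:
            continue  # the text cannot be represented in this encoding
        if encoded not in variants:
            variants.append(encoded)  # skip duplicates, e.g. ASCII-compatible encodings
    if not variants:
        raise ValueError("text is not representable in any of the given encodings")
    return re.compile(b"|".join(re.escape(v) for v in variants))


pattern = encoding_regex("you", ["cp1252", "shift_jis", "utf-16-le"])
data = b"\x00\x01you\x02" + "you".encode("utf-16-le")
offsets = [m.start() for m in pattern.finditer(data)]  # [2, 6]
```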

Another option is to generate all possible byte sequences where each character is stored in at most two bytes. If this still does not succeed because some parts of the string are stored in a dictionary for compression reasons, we should try replacing parts of the string with .* to skip those dictionary indexes/placeholders and still get a match.
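
The idea of skipping dictionary indexes could be sketched like this in Python. Two assumptions of mine: the user supplies the fragments they expect to be stored literally, and the gap is a bounded, non-greedy .{0,n} instead of a plain .* so that matches stay local. The ASCII encoding and the 0x80 dictionary index in the example are invented for illustration.

```python
import re


def gapped_regex(fragments, encoding="ascii", max_gap=4):
    """Match the fragments in order, allowing up to `max_gap` unknown bytes between them."""
    gap = b".{0,%d}?" % max_gap
    parts = [re.escape(fragment.encode(encoding)) for fragment in fragments]
    return re.compile(gap.join(parts), re.DOTALL)


# Pretend 0x80 is a dictionary index standing in for a whole word.
data = b"\x10The \x80 sword\x00"
match = gapped_regex(["The ", " sword"]).search(data)
```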

It should also be possible to wiggle around a little with the mappings, choosing a threshold with a slider of what ranges to search and how similar a string should be. Soundex is a possible algorithm, but other similarity algorithms and statistical methods could be interesting.

All of these things should be visible together: the temporarily built character/encoding table, the hex view / text view, the potential matches for the searched string, and additionally searches for the start and end offset or the length of found strings. The last item is interesting for finding places in the file that need to be adapted so that the file is still valid after editing the string. This would still not help in the case of checksums.

But again, checksums could be searched for by generating possible and common checksums and looking for their values in the file, while omitting bytes at the beginning or end up to some threshold (in case the checksummed range does not include the checksum itself, or excludes some other header and trailer bytes). Again we get potential checksums that can be narrowed down to likely ones by seeing if they can be found. Several games by the same manufacturer or from the same game series, or save files that change over time and therefore also change the checksum, can be used to compare and narrow down the likely cases.
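
A rough Python sketch of such a checksum search follows. The two candidate algorithms (a 16-bit byte sum and CRC-32 via zlib) are just examples I picked, and the function name and the max_skip default are made up; a real tool would need a much larger catalogue.

```python
import zlib

# name -> (function over bytes, width of the stored value in bytes)
CHECKSUMS = {
    "sum16": (lambda region: sum(region) & 0xFFFF, 2),
    "crc32": (zlib.crc32, 4),
}


def checksum_candidates(data, max_skip=8):
    """Try checksums over data[skip_start:len-skip_end] and look for the value in the file."""
    hits = []
    for skip_start in range(max_skip + 1):
        for skip_end in range(max_skip + 1):
            region = data[skip_start:len(data) - skip_end]
            if not region:
                continue
            for name, (func, width) in CHECKSUMS.items():
                value = func(region)
                for order in ("little", "big"):
                    offset = data.find(value.to_bytes(width, order))
                    if offset != -1:
                        hits.append((name, skip_start, skip_end, order, offset))
    return hits


# Made-up save file: payload followed by its CRC-32 in little-endian order.
payload = b"HELLO WORLD, THIS IS SAVE DATA"
data = payload + zlib.crc32(payload).to_bytes(4, "little")
hits = checksum_candidates(data)
```

Short checksums will produce accidental hits, which is why the candidates still have to be narrowed down by comparing several files.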

After that the user has to manually check if the modified files are accepted and run, and can then make the final decision and choose one, making the file structure annotation final. In other words, until then there can be contradicting annotations, which are excluded as more information is gained, or when the user manually excludes them or selects the right one.

If no example string is known, it would also make sense to search for byte sequences that look like text because of character frequencies typical for language X (for example, e is the most frequent letter in English). We could also search for common words (dictionary) or phrase structures (S->NP VP). We could give known decoded files as training input and use AI/machine learning techniques to train pattern recognizers.
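
One simple statistic that could serve as a first filter here is the index of coincidence: the chance that two randomly picked bytes of a window are equal. It does not check which character is most frequent, only how uneven the frequency distribution is, which stays the same under any one-to-one character encoding. The Python sketch below is an illustration under that assumption; the 0.05 threshold, the function names and the "shift by 0x60" encoding are arbitrary choices of mine.

```python
from collections import Counter


def index_of_coincidence(data):
    """Probability that two bytes picked at random from `data` are equal."""
    n = len(data)
    if n < 2:
        return 0.0
    counts = Counter(data)
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))


def text_like_windows(data, size, threshold=0.05):
    """Yield offsets of windows whose byte distribution is as uneven as typical text."""
    for start in range(len(data) - size + 1):
        if index_of_coincidence(data[start:start + size]) >= threshold:
            yield start


# Hypothetical unknown encoding: every character code shifted by 0x60.
secret = bytes((ord(ch) + 0x60) & 0xFF for ch in "it is dangerous to go alone take this")
data = bytes(range(256)) + secret + bytes(range(256))
offsets = list(text_like_windows(data, len(secret)))
```

Repetitive non-text data (padding, tile maps) also scores high, so this can only propose regions that the other heuristics or the user then confirm.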

Similar partial rules or regularity rules could use statistical distribution or structural patterns to recognize picture/pixel data, program code, pointers/offsets, etc.

All this would result in annotations of possible content. This would allow identifying regions of the file and automatically decoding them into the most likely presentation. The user could then pick alternative interpretations if they make more sense.

This could work a bit like OCR or handwriting recognition. Again with training possibilities, and options for users to add their own rules.

We could also add some rules that define plausibility, or let them train from user decisions when scanning text.

With all this data we could automatically analyze unknown files, and present them in the most likely interpretation, instead of a raw hex dump. The precise raw representation would always be visible too, either using tooltips or by switching views, or simply by having the left column be the hex view and the right one the interpreted view. The hex view would possibly need to be folded, because of all the space it would occupy.

Basically we would have a real table with sizable columns, left side hex, right side interpreted, so we can see how one side affects the other side.

General application, not limited to strings: We should be able to search a sequence of symbols, even if we don't know their encoding. All we need to know is the order of their encoding to allow searching for them. For example: character order: A, B, D, F, C, E, G, H, .... This would also allow a relative search, but with a slightly different order than alphabetical. An additional flexibility could be gained by saying that only this holds: A < B < D < F < C < E < G < H ..., but we do not know the increments between the characters, usually it would be 1, but it could be > 1, too!
We should be able to search for such number sequences, even if one symbol is encoded by more than one byte, and even if code point (= number representing symbol) order is not just alphabetical, and if the increments between symbols are >= 1.
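
For the single-byte case, an order-only search could be sketched in Python as below: a window matches if its bytes compare pairwise (less, equal, greater) exactly like the ranks of the searched symbols in the known order, whatever the increments are. Symbols encoded in more than one byte are not covered, and the function name and the byte values in the example are made up.

```python
def order_search(data, text, order):
    """Yield offsets where bytes compare pairwise like the symbols' positions in `order`."""
    ranks = [order.index(ch) for ch in text]
    n = len(ranks)
    for start in range(len(data) - n + 1):
        window = data[start:start + n]
        if all(
            (ranks[i] < ranks[j]) == (window[i] < window[j])
            and (ranks[i] == ranks[j]) == (window[i] == window[j])
            for i in range(n)
            for j in range(i + 1, n)
        ):
            yield start


# Invented encoding with increments > 1: A=10, B=12, D=15, F=16, C=1A, E=1B (hex).
data = bytes.fromhex("00 00 1A 10 16 1B 00")
offsets = list(order_search(data, "CAFE", "ABDFCEGH"))  # [2]
```

Because only the order is constrained, short search strings will match in many places, so longer strings or additional rules are needed to keep the candidate list small.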

We should also look for other possible and common ways to encode things. Maybe we need more than regex? CFG? Then again, if we add so much flexibility, it will become hard to specify. Should look at parsers again. This searching is more like pattern matching. We have to think about how much expressibility we allow and what we leave out. In general, expressing things relative to others should be possible, besides absolute. And the low level is pattern matching, higher levels work with "sharp" symbols, and then it starts again with strings of symbols that are the low level of higher levels and again need pattern matching to form the yet higher level of "sharp" symbols there.

See how this level by level pattern matching compares to how parsers work.

Post by **Maël** » 27 Nov 2014 14:58

Some other relevant links, regarding dictionary compression or DTE (Dual-Tile Encoding):

http://www.everything2.com/index.pl?node_id=1430647
http://en.wikibooks.org/wiki/Data_Compr ... ompression
http://en.wikipedia.org/wiki/Byte_pair_encoding

Useful links for the general process of ROM editing, including locating various data regions/creating map files:
http://www.zophar.net/fileuploads/2/10690nxpfq/AoRH.htm
http://www.romhacking.net/documents/111/
http://en.wikipedia.org/wiki/ROM_hacking
