I am adding here information about ROM hacking as I learn. The techniques used seem to be a nicely specific and relatively simple case of what you do in general hex editing. Having this specific use case is good for developing general tools that help with understanding unknown formats.
These pages list what you need to get started in ROM hacking, especially text editing/translating:
http://www.romhacking.net/start/#text
http://www.zophar.net/fileuploads/2/107 ... stutor.txt
A summary of the steps involved is:
- Use a program called Search Relative to find a known string with unknown encoding in a file.
- The string would be one you know appears in the GUI of the game/program you are analyzing.
- The file you search can be a program's/game's data file or executable. It can also be a ROM file (= a game console cartridge dump).
- This search works based on the assumption that characters are represented by numbers (probably bytes) in ascending order (i.e., 'a'=n+0, 'b'=n+1, 'c'=n+2, ..., 'z'=n+25, where n is some integer).
- Relative Searcher 2.5, an alternative tool, is available with source code and better documentation: http://www.romhacking.net/utilities/39/
- SearchR 3.0 is another alternative that comes with a Windows GUI and source code: http://rainemu.swishparty.co.uk/searchr/
- Potential matches are listed and the user has to pick the most plausible ones.
- For example, if we search for a word and the match appears as part of a larger phrase that makes sense, or that is known to appear in the program/game, we very likely found what we were looking for.
- The selected matches can be used to derive partial encoding rules. The suggested program here is Table Maker.
- A match tells us the byte sequence corresponding to the searched string.
For example: "CYNDAQUIL" corresponds to "03 19 0E 04 01 11 15 09 0C", so C=03, Y=19, etc.
- This byte sequence lets us automatically derive a mapping from characters to their encoding.
- Finally, the created table (= encoding rules) can be used in a hex editor called Thingy, whose text view column (present in any hex editor) decodes strings based on this table.
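The relative-search step above can be sketched in a few lines, assuming only that characters are assigned consecutive codes in alphabetical order (the function name and the example byte values are made up for illustration, not taken from any of the tools above):

```python
def relative_search(data: bytes, query: str):
    """Find offsets where `query` appears under an unknown but
    alphabetically sequential encoding ('a'=n, 'b'=n+1, ...)."""
    # Differences between consecutive character codes do not depend
    # on the unknown base value n, so they can be matched directly.
    diffs = [ord(b) - ord(a) for a, b in zip(query, query[1:])]
    hits = []
    for i in range(len(data) - len(query) + 1):
        w = data[i:i + len(query)]
        if all(w[j + 1] - w[j] == d for j, d in enumerate(diffs)):
            hits.append(i)
    return hits

# "CAB" encoded with base n = 0x10: C=0x12, A=0x10, B=0x11
rom = bytes([0x00, 0x12, 0x10, 0x11, 0xFF])
print(relative_search(rom, "CAB"))  # [1]
```

Real tools additionally let the user review all candidate matches, since short difference patterns match by accident in binary data.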
Adaptation/improvements of this approach:
- We could adapt this approach to HxD by using a regex search for all possible byte sequences of the given example string.
- Those would be found by applying all known encodings (e.g., Windows-1252, Shift-JIS, ...) or a subset of these encodings selected by the user. In addition, all possible encodings where characters are assigned numbers in alphabetical order (see the first step) should be applied.
- The regex would be generated automatically.
- Other options would be to search only for common known encodings, or slight variations of them. We could also use the order of characters (= code points) in Unicode to predict the likely order of the characters' number representations (= code units) in specific encodings.
- Apparently, using a tile editor it is possible to see the order of the character tiles and thereby find out whether characters are encoded in alphabetical order or some other order. We should support searching with such a discovered character order, and not just limit ourselves to alphabetical order.
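One way to realize the known-encodings part of this idea: encode the example string under each candidate encoding and combine the escaped byte sequences into a single regex alternation over bytes. The function name and the encoding list below are illustrative assumptions:

```python
import re

def encoding_pattern(sample: str,
                     encodings=("ascii", "cp1252", "shift_jis", "utf-16-le")):
    """Build one bytes regex matching `sample` under any candidate encoding."""
    variants = {}
    for enc in encodings:
        try:
            variants[sample.encode(enc)] = None  # dict keeps order, drops dups
        except UnicodeEncodeError:
            continue  # sample not representable in this encoding
    return re.compile(b"|".join(re.escape(v) for v in variants))

pat = encoding_pattern("MENU")
data = b"\x00" + "MENU".encode("utf-16-le") + b"\x00"
print(pat.search(data).start())  # 1
```

The alphabetical-order encodings from the first step cannot be enumerated this way (the base value n is unknown), so they would still need the relative search.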
We should generate all possible byte sequences where each character is stored in at most two bytes. If this still does not succeed because some parts of the string are stored in a dictionary for compression reasons, we should try omitting parts of the string with .* so that those dictionary indexes/placeholders are ignored and we still get a match.
It should also be possible to loosen the mappings a little, choosing with a slider a threshold for which ranges to search and how similar a string must be. Soundex is a possible algorithm, but other similarity measures and statistical methods could be interesting.
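Soundex, named above as one candidate, is easy to sketch. This is plain American Soundex; whether a phonetic code is the right similarity measure for game scripts is an open question:

```python
def soundex(word: str) -> str:
    """Classic American Soundex: first letter plus three digits."""
    codes = {c: str(d) for d, group in enumerate(
        ["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1) for c in group}
    word = word.upper()
    digits, prev = [], codes.get(word[0])
    for c in word[1:]:
        d = codes.get(c)
        if d is not None and d != prev:
            digits.append(d)
        if c not in "HW":        # H and W do not separate equal codes
            prev = d
    return (word[0] + "".join(digits) + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))  # R163 R163
```

Edit-distance measures (e.g., Levenshtein) would likely be more useful for byte-level wiggling, since Soundex only applies to alphabetic strings.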
All of these things should be visible together: the temporarily built character/encoding table, the hex view / text view, the potential matches for the searched string, and additionally searches for the start and end offsets or the lengths of found strings. The last item is interesting for finding places in the file that need to be adapted so that the file is still valid after editing the string. This would still not help in the case of checksums.
But checksums, too, could be searched for by generating possible and common checksums and looking for them in the file, while omitting bytes at the beginning or end up to some threshold (in case the checksum does not cover the checksum field itself, or excludes other header and trailer bytes). Again we get potential checksums that can be narrowed down to likely ones by seeing whether they can be found. If several games by the same manufacturer or in the same game series exist, or if save files change over time (and therefore also change the checksum), comparing them helps to narrow down the likely cases.
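The brute-force part of this checksum idea can be sketched as follows. The checksum set, the trim limit, and the function name are assumptions; real formats use many more variants:

```python
import struct
import zlib

def checksum_candidates(data: bytes, max_trim: int = 4):
    """Compute common checksums over ranges that drop up to `max_trim`
    header/trailer bytes, and report offsets where each value occurs
    in the file; those offsets are candidate checksum fields."""
    hits = []
    for start in range(max_trim + 1):
        for trim in range(max_trim + 1):
            body = data[start:len(data) - trim]
            for name, fmt, value in [
                ("sum16", "H", sum(body) & 0xFFFF),
                ("crc32", "I", zlib.crc32(body)),
            ]:
                for endian in ("<", ">"):  # try both byte orders
                    off = data.find(struct.pack(endian + fmt, value))
                    if off != -1:
                        hits.append((name, endian, start, trim, off))
    return hits

# A toy "ROM" whose last two bytes are a little-endian 16-bit sum:
rom = bytes(range(16)) + struct.pack("<H", sum(range(16)) & 0xFFFF)
print(("sum16", "<", 0, 2, 16) in checksum_candidates(rom))  # True
```

Comparing the candidate lists across several related files, as suggested above, would filter out the accidental hits this naive scan produces.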
After that the user has to manually check whether the modified files are accepted and run, and can then make the final decision and confirm it, making the file structure annotation final. In other words, until then there can be contradicting annotations, which are excluded as more information is gained, or which the user manually rules out or picks the right one from.
If no example string is known, it would also make sense to search for byte sequences that look like text because of character frequencies typical for a language (such as 'e' being the most frequent letter in English). One could also search for common words (using a dictionary) or phrase structures (S -> NP VP). We could give training files with known decodings as input and use AI/machine learning techniques to train pattern recognizers.
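A crude stand-in for the frequency idea: score fixed windows by how many bytes fall into typical ASCII text ranges. Window size, threshold, and the "texty" byte set below are arbitrary choices; a real scorer would compare letter frequencies against a language model:

```python
def text_like_regions(data: bytes, window: int = 16, threshold: float = 0.8):
    """Return (offset, score) for windows whose bytes look like ASCII text."""
    def texty(b):
        return b in b" .,'" or 0x41 <= b <= 0x5A or 0x61 <= b <= 0x7A
    regions = []
    for i in range(0, len(data) - window + 1, window):
        score = sum(map(texty, data[i:i + window])) / window
        if score >= threshold:
            regions.append((i, score))
    return regions

blob = bytes(16) + b"HELLO FROM CYNDA" + bytes(16)
print(text_like_regions(blob))  # [(16, 1.0)]
```

This only works for ASCII-compatible encodings; for unknown sequential encodings, the frequency statistics themselves (one dominant value per window, like 'e' in English) would have to be used instead.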
Similar partial rules or regularity rules could use statistical distributions or structural patterns to recognize picture/pixel data, program code, pointers/offsets, etc.
All this would result in annotations of possible content. They would allow us to identify regions of the file and automatically decode them into the most likely presentation. The user could then pick alternative interpretations that seem to make more sense.
This could work a bit like OCR or handwriting recognition. Again with training possibilities, and options for users to add their own rules.
We could also add some rules that define plausibility, or let such rules be trained from user decisions made while scanning text.
With all this data we could automatically analyze unknown files and present them in the most likely interpretation instead of a raw hex dump. The precise raw representation would always remain visible too, either via tooltips, by switching views, or simply by having the left column be the hex view and the right one the interpreted view. The hex view would possibly need to be foldable because of all the space it would occupy.
Basically we would have a real table with resizable columns, left side hex, right side interpreted, so we can see how one side affects the other.
General application, not limited to strings: we should be able to search for a sequence of symbols even if we don't know their encoding. All we need to know is the order of their encoding to allow searching for them. For example, the character order might be A, B, D, F, C, E, G, H, ... This would still allow a relative search, just with a slightly different order than alphabetical. Additional flexibility could be gained by requiring only that A < B < D < F < C < E < G < H < ... holds, without knowing the increments between the characters; usually an increment would be 1, but it could be greater than 1, too!
We should be able to search for such number sequences even if one symbol is encoded by more than one byte, even if the code point (= number representing a symbol) order is not simply alphabetical, and even if the increments between symbols are >= 1.
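The weakest assumption above (only the order A < B < D < F < ... is known, increments unknown) is still searchable: check that every pair of bytes in a window agrees with the corresponding pair's rank order. A hypothetical sketch, quadratic in the query length:

```python
def order_search(data: bytes, query: str, order: str):
    """Relative search where only the relative order of character
    codes is known (given by `order`), not the increments."""
    rank = {c: i for i, c in enumerate(order)}
    q = [rank[c] for c in query]
    n = len(q)
    hits = []
    for i in range(len(data) - n + 1):
        w = data[i:i + n]
        # Same rank must mean same byte; smaller rank, smaller byte.
        if all((q[a] < q[b]) == (w[a] < w[b]) and
               (q[a] == q[b]) == (w[a] == w[b])
               for a in range(n) for b in range(a + 1, n)):
            hits.append(i)
    return hits

# Order A, B, D, F, ... with non-unit increments: A=0x20, D=0x24, F=0x30
rom = bytes([0x00, 0x30, 0x20, 0x24, 0xFF])
print(order_search(rom, "FAD", "ABDFCEGH"))  # [1]
```

With so little constraint, short queries will match often, so this search is only useful for longer strings or as a pre-filter before user review.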
We should also look at other possible and common ways to encode things. Maybe we need more than regexes? CFGs? Then again, if we add that much flexibility, it will become hard to specify; we should look at parsers again. This searching is more like pattern matching. We have to think about how much expressiveness we allow and what we leave out. In general, expressing things relative to others should be possible, besides absolute positions. The low level is pattern matching; higher levels work with "sharp" symbols; and then it starts again: strings of symbols form the low level of yet higher levels and again need pattern matching to form the still higher level of "sharp" symbols there.
See how this level-by-level pattern matching compares to how parsers work.