Disassembly feature as a 3rd party extendable plug-in?

Maël
Site Admin
Posts: 1354
Joined: 12 Mar 2005 14:15

Re: Disassembly feature as a 3rd party extendable plug-in?

Post by Maël »

It would also make sense to specify how many leading zeros a hexadecimal number should have for Intel assembly. I have yet to figure out what the standards are. Personally, I find leading zeros distracting, even if they indicate the operand size.

Need to review NASM, MASM and Borland assembler (TASM) conventions. Edit: NASM and MASM use the shortest representation; NASM adds an operand size specifier to avoid ambiguity. TASM adds leading zeros when necessary to indicate operand size. See also the post below. I chose NASM's style for its readability.

What about Motorola 6800 (and related) assembly?
Maël

Post by Maël »

NASM prints hexadecimal letters in lower case and always chooses the minimal length (which can be as short as 1 hex digit) for a hexadecimal number. It adds no leading zeros, not even to make the number length a multiple of 2 (i.e., hex pair sequences). As it uses the C-style prefix 0x, there is also no need for a leading zero if the hex number starts with a letter.

MASM acts almost the same, at least under WinDbg. Hexadecimal numbers use the Intel (leading zero) syntax with uppercase letters, but they are likewise not padded to hex pairs, i.e., to a length that is a multiple of 2. The minimal length is two hex digits. However, immediate values <= 9 are always shown in decimal; only immediate values >= 10 are shown in hexadecimal (with at least 2 hex digits).

Delphi's debugger / TASM seem to follow somewhat arbitrary rules. Hexadecimal numbers that are immediate values (i.e., encoded in the assembler instruction itself) are shown in the length of that immediate value, not the shortest possible.
But other instructions such as or rcx,$01, where the immediate value $01 is really a 64 bit integer, are still shown in short form, probably because the encoded instruction uses only one byte for the immediate value: 4883C901

Further tests show:
48 81 c9 01 00 00 00 is or rcx,01h in NASM and or rcx,$0000000000000001 in TASM/Delphi debugger.
48 83 c9 01 is or rcx,byte +01h in NASM and or rcx,$01 in TASM/Delphi debugger.
So TASM/Delphi represents the operand length using the number of digits, whereas NASM uses operand length prefixes if it's not clear from the register operand (or there is no register operand).
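
The two encodings compared above differ only in the opcode byte and the width of the immediate. Here is a minimal Python sketch of that difference, assuming only these two forms need handling (REX.W prefix 48, opcode 81 or 83, register-direct ModRM with /1 for OR). The function names are made up for illustration, and the output strings simply mirror the examples quoted in this post, not the exact output of any tool:

```python
REG64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]

def decode_or_imm(code: bytes):
    """Return (register, immediate, immediate size in bytes).

    Hypothetical helper: handles only 48 81 /1 imm32 and 48 83 /1 imm8
    with a register-direct ModRM byte, nothing else.
    """
    if len(code) < 4 or code[0] != 0x48 or code[1] not in (0x81, 0x83):
        raise ValueError("unsupported encoding")
    modrm = code[2]
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 3 or reg != 1:
        raise ValueError("only register-direct OR (/1) is handled")
    size = 4 if code[1] == 0x81 else 1
    if len(code) != 3 + size:
        raise ValueError("unexpected instruction length")
    # Both immediates are sign-extended to 64 bits by the CPU.
    imm = int.from_bytes(code[3:], "little", signed=True)
    return REG64[rm], imm, size

def render_digit_count_style(reg: str, imm: int, size: int) -> str:
    """Operand length shown by digit count, as in the TASM/Delphi examples:
    2 digits for imm8, 16 digits for the sign-extended imm32."""
    digits = 2 if size == 1 else 16
    value = imm & ((1 << (4 * digits)) - 1)
    return f"or {reg},${value:0{digits}X}"

def render_specifier_style(reg: str, imm: int, size: int) -> str:
    """Operand length shown by a 'byte' specifier, as in the NASM examples."""
    if size == 1:
        sign = "+" if imm >= 0 else "-"
        return f"or {reg},byte {sign}{abs(imm):02x}h"
    value = imm & 0xFFFFFFFFFFFFFFFF
    return f"or {reg},{value:02x}h"

long_form = decode_or_imm(bytes.fromhex("4881c901000000"))
short_form = decode_or_imm(bytes.fromhex("4883c901"))
print(render_specifier_style(*long_form), "|", render_digit_count_style(*long_form))
print(render_specifier_style(*short_form), "|", render_digit_count_style(*short_form))
```

Both byte sequences decode to the same register and value; only the immediate size differs, which is exactly the information the two styles encode in different ways.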

Pointers are always as long as the pointer size of the CPU (8 bytes/16 chars on 64 bit systems, 4 bytes / 8 chars on 32 bit systems).

Summary:
NASM's method seems the clearest and shortest, so I fixed the current implementation to follow the NASM rules (which are close to the MASM rules). The only addition is that I chose to have at least two hex digits, not one like NASM.
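
To make the comparison concrete, here is a small Python sketch of the conventions as described in this post. The function names are invented for illustration, only non-negative values are handled, and keeping the 0x prefix in the two-digit variant is an assumption rather than a statement about the actual implementation:

```python
def nasm_style(value: int) -> str:
    """0x prefix, lower case, shortest representation (1 digit minimum)."""
    return f"0x{value:x}"

def windbg_masm_style(value: int) -> str:
    """Decimal up to 9; otherwise uppercase hex with h suffix, at least
    two digits, plus a leading 0 when the first digit is a letter."""
    if value <= 9:
        return str(value)
    digits = f"{value:02X}"
    if digits[0] in "ABCDEF":
        digits = "0" + digits
    return digits + "h"

def tasm_style(value: int, operand_bytes: int) -> str:
    """$ prefix, padded so the digit count reflects the operand size."""
    return f"${value:0{2 * operand_bytes}X}"

def two_digit_minimum_style(value: int) -> str:
    """NASM-like, but with at least two hex digits (prefix is assumed)."""
    return f"0x{value:02x}"

for v in (1, 10, 0xFF, 0x123):
    print(nasm_style(v), windbg_masm_style(v), tasm_style(v, 2), two_digit_minimum_style(v))
```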
Maël

Post by Maël »

I was notified you viewed the topic. Did you see the questions? Maybe they got lost within the amount of text I wrote.
GregC
Posts: 30
Joined: 08 Oct 2020 04:27

Post by GregC »

"I was notified you viewed the topic. Did you see the questions?"
Hi Maël. Apologies, I’ve just had some personal distractions lately.
So roughly you could say processors from the MOS/WDC 65xx and Motorola 6800 family.
As a description for the plug-in itself, this is not really a description that is functionally all encompassing.
Yes, the 6502 / 65816 / 6809 could all trace their design ancestry back to the original Motorola 6800 design. However, the 6500, 6502, 6809 are all unique CPU architectures.

To put it another way, the disassembly plug-in is not limited to: “MC6800, MC6809, 6502 and related CPUs.” Therefore, this description of the plug-in does not encompass its generalised implementation, which allows support for numerous CPUs simply by adding appropriate CPU definition text files (which can be contributed by anyone, without requiring any plug-in code changes).

As noted in my earlier post, I believe the disassembly plug-in is more accurately described as supporting: “Retro 8 or 16-bit CPU architectures, that utilise no more than 2 Operands in any instruction.”

Although, I’d also note that in some cases more than 2 Operands are actually supported. For example, Operands that reference a CPU register are defined as multiple individual instruction definitions (as with the 6809’s Indexed Addressing mode instruction definitions).
So the current 2 Operand restriction in reality refers to the Operand types that specify data, addresses, or offsets etc.
"Regarding hexadecimal number formatting and naming of the styles"
Yes, there are multiple hex formatting styles. The 4 you listed would appear to cover those that I’m familiar with.
Likewise you will see Assembly (and hex) presented in uppercase and lowercase.

I’m not aware of any case sensitivity in this area (unless a specific application's author has chosen to recognise only upper or lower case).

I think you will find that upper / lower case representation will typically align with the age of the CPU / code being published (or is simply the personal preference of the author).

Older systems, around the time of 6800 etc, were in some cases only interfaced with terminals that supported uppercase only. Or, certainly in the earlier days of bitmapped fonts and non-descender 5x7 (or 7x9) pixel based fonts, lowercase was not very legible, so uppercase Assembly code was the obvious choice!

So perhaps, as a subjective observation, you might find old-school enthusiasts prefer uppercase Assembly, and those that weren't around in the early days might prefer lowercase?

For the disassembly plug-in’s current implementation, the upper / lower case, and the hex representation, is simply defined by the strings in the CPU definition file.
i.e., if lowercase is someone’s preference, they could simply convert the CPU definition file to lowercase! Also, the hex representation style (of the CPU being defined) is represented in the definition string.
"how many leading zeros"
Depending on the CPU, the number of leading zeros represented in the disassembly should be aligned with the instruction variant.
For example, an instruction may have alternate Opcodes for different Operand lengths.
e.g., with the CPUs I’ve produced definition files for so far, there can be different Opcodes (for the same instruction), where the Operand may be 5-bits, 8-bits, 16-bits, or even 24-bits.
In each case it is appropriate to Disassemble the Operand to the specific number of bytes that aligns with the specific Opcode's Operand length (in this example: 1, 2, or 3 bytes).
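
As a rough illustration of the point above, here is a minimal Python sketch in which each Opcode carries its own Operand length and the Operand is printed with exactly that many hex digits. The table layout and the function name are assumptions made up for this example (the actual CPU definition file format is not shown in this thread); the two entries are modelled on the 6809 LDA immediate and extended variants:

```python
# Hypothetical definition table: opcode -> (template, operand size in bytes).
DEFINITIONS = {
    0x86: ("LDA #${}", 1),  # immediate, 8-bit Operand
    0xB6: ("LDA ${}", 2),   # extended, 16-bit address
}

def disassemble_one(code: bytes, offset: int = 0):
    """Return (text, instruction length) for the instruction at offset."""
    template, size = DEFINITIONS[code[offset]]
    operand = code[offset + 1 : offset + 1 + size]
    if len(operand) != size:
        raise ValueError("truncated operand")
    # Motorola CPUs store multi-byte Operands big-endian.
    value = int.from_bytes(operand, "big")
    # Two hex digits per Operand byte, so the width matches the Opcode variant.
    return template.format(f"{value:0{2 * size}X}"), 1 + size

print(disassemble_one(bytes([0x86, 0x01])))
print(disassemble_one(bytes([0xB6, 0x00, 0x01])))
```

The same value 1 is shown as $01 or $0001 depending on which Opcode variant was found in the binary.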

Noting also, that an Assembler would typically be coded to be smart enough to assemble the appropriate instruction Opcode using the smallest required Operand size Opcode.
e.g., if I coded a Branch instruction in Assembly with a specified offset of $0001 (in my source), then, when assembled, the 5-bit offset Opcode should be used if one is available (even though a 2 byte $0001 offset was specified), as using the 16-bit offset Operand instruction variant would be wasteful when a shorter and faster 5-bit offset Opcode is available for the instruction!
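
The selection logic described above could be sketched in Python roughly as follows. This assumes the offsets are signed two's complement values and that 5, 8 and 16-bit variants exist for the instruction; the function name is hypothetical and not taken from any particular Assembler:

```python
def smallest_offset_bits(offset: int, available=(5, 8, 16)) -> int:
    """Pick the narrowest signed offset width that can hold the value."""
    for bits in sorted(available):
        lowest = -(1 << (bits - 1))
        highest = (1 << (bits - 1)) - 1
        if lowest <= offset <= highest:
            return bits
    raise ValueError(f"offset {offset} does not fit any available variant")

# An offset written as $0001 in the source still fits the 5-bit variant.
print(smallest_offset_bits(0x0001))
print(smallest_offset_bits(100))
print(smallest_offset_bits(1000))
```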

I’m not sure if the above observations cover what you intended to raise, but hopefully they are of some assistance?
Maël

Post by Maël »

Thanks, I think that cleared it all up.

(Regarding the naming of the CPUs, apparently people use different schemes, e.g., on Wikipedia. But calling them 8-bit era CPUs seems too generic. Intel also had 8-bit CPUs, and the Z-80 seems to be derived from one of them. But that's not really that important, I guess.)

So it's really like Intel x86 assembly, where the same mnemonic can also come with varying operand sizes. NASM infers the size of an immediate operand from the other operand, which is a memory or register operand (of defined size). When that is not enough, it will use a prefix string like "byte", "word", "dword", etc.

So in conclusion, I will add formatting options such that you can control leading zeros, casing, and prefix/postfix style of hexadecimal numbers, and casing of instructions/operands in assembly. Those options will be passed to the plugin, like the integer options are passed now, so that the plugin can react to this.

I'll also provide those functions over the plugin interface, so it's not necessary to implement them again in each plugin.
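
A rough idea of what such shared formatting helpers could look like, sketched in Python. All names here (AsmFormatOptions, format_hex, format_instruction) and the set of fields are assumptions for illustration only, not the actual plugin interface:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AsmFormatOptions:
    """Hypothetical option set passed from the host to a plugin."""
    hex_prefix: str = "0x"
    hex_suffix: str = ""
    hex_uppercase: bool = False
    min_hex_digits: int = 2
    mnemonic_uppercase: bool = False

def format_hex(value: int, opts: AsmFormatOptions) -> str:
    """Format a non-negative value with the configured padding and casing."""
    digits = f"{value:0{opts.min_hex_digits}x}"
    if opts.hex_uppercase:
        digits = digits.upper()
    # Prefix and suffix are emitted exactly as configured.
    return f"{opts.hex_prefix}{digits}{opts.hex_suffix}"

def format_instruction(mnemonic: str, operands: list, opts: AsmFormatOptions) -> str:
    """Integers are formatted as hex; strings (registers) follow the casing option."""
    change_case = str.upper if opts.mnemonic_uppercase else str.lower
    parts = [
        format_hex(op, opts) if isinstance(op, int) else change_case(op)
        for op in operands
    ]
    return f"{change_case(mnemonic)} " + ",".join(parts)

print(format_instruction("OR", ["RCX", 1], AsmFormatOptions()))
motorola = AsmFormatOptions(hex_prefix="$", hex_uppercase=True,
                            min_hex_digits=4, mnemonic_uppercase=True)
print(format_instruction("lda", [0xAB], motorola))
```

With helpers like these provided by the host, a plugin would only need to hand over the mnemonic, the operands and the options it received.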