The process of disassembly involves reading a binary instruction, looking it up in the instruction set, and generating the corresponding assembly language instruction. The input binary file contains most of the necessary information from the original source file. Nevertheless, disassembly is non-trivial, because a binary file is not designed to be disassembled: assemblers discard much of the information in the original source that is irrelevant to the execution of the program. The greatest problem in disassembly is to distinguish code (instructions) from data, as both are represented as sequences of bytes.
Furthermore, designing a generic disassembler involves extra effort because information about the instruction set of a processor is encoded in the processor specification file. The instruction set must be extracted into a format in which an instruction read from the binary file can be identified easily. In addition, the number of instructions in the instruction set, the length of an instruction, the parameters of an instruction, and so on vary from processor to processor. Different processors also compute the target address of a jump instruction from the bits available in the instruction in different ways, which affects the design of the disassembler.
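The read, look up, and generate steps described above can be illustrated with a minimal table-driven sketch. Everything in it is an assumption made for illustration: the four-instruction `TOY_ISA` table stands in for a real processor specification file, and the names `Spec`, `decode`, and `disassemble` are hypothetical rather than taken from the disassembler described in this chapter. The sketch also shows two of the difficulties mentioned: absolute and relative jumps compute their targets differently, and a byte that matches no instruction can only be emitted as data.

```python
from typing import NamedTuple


class Spec(NamedTuple):
    mask: int      # bits of the first byte that identify the opcode
    match: int     # value of those bits for this instruction
    mnemonic: str
    length: int    # total instruction length in bytes
    operand: str   # "none", "reg_imm", "abs", or "rel"


# Hypothetical toy instruction set (not a real processor).
TOY_ISA = [
    Spec(0xFF, 0x00, "NOP", 1, "none"),
    Spec(0xF0, 0x10, "LOAD", 2, "reg_imm"),  # register number in low nibble
    Spec(0xFF, 0x20, "JMP", 2, "abs"),       # operand byte is the target
    Spec(0xFF, 0x30, "BR", 2, "rel"),        # operand byte is a signed offset
]


def decode(code: bytes, pc: int, isa=TOY_ISA):
    """Return (text, length) for the instruction starting at offset pc."""
    first = code[pc]
    for spec in isa:
        fits = pc + spec.length <= len(code)
        if (first & spec.mask) == spec.match and fits:
            if spec.operand == "none":
                return spec.mnemonic, spec.length
            arg = code[pc + 1]
            if spec.operand == "reg_imm":
                text = f"{spec.mnemonic} r{first & 0x0F}, 0x{arg:02X}"
            elif spec.operand == "abs":
                text = f"{spec.mnemonic} 0x{arg:02X}"
            else:
                # Relative jump: sign-extend the offset and add it to the
                # address of the next instruction (8-bit address space).
                offset = arg - 0x100 if arg >= 0x80 else arg
                target = (pc + spec.length + offset) & 0xFF
                text = f"{spec.mnemonic} 0x{target:02X}"
            return text, spec.length
    # No instruction matched: treat the byte as data.
    return f".byte 0x{first:02X}", 1


def disassemble(code: bytes, isa=TOY_ISA):
    """Linear sweep: decode instructions one after another."""
    lines, pc = [], 0
    while pc < len(code):
        text, length = decode(code, pc, isa)
        lines.append(f"{pc:04X}: {text}")
        pc += length
    return lines
```

For example, `disassemble(bytes([0x00, 0x12, 0x2A, 0x20, 0x07, 0x30, 0xFE, 0xFF]))` yields one line per decoded instruction and falls back to a `.byte` directive for the trailing `0xFF`. Keeping the instruction set in a data table rather than in the decoding logic is one way to support several processors with the same code.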
Lastly, a symbolic disassembler is more complex because it uses symbols to refer to locations. When programming, users normally use symbols (names) to refer to variables and functions. Compilers usually retain the names of functions (and sometimes of global variables) in the compiled binary file. However, symbols corresponding to local variables or locations are not retained. Thus the disassembler has to generate new names for locations whose names are not available in the binary file.
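The naming step described above might look like the following sketch. It assumes that the retained symbols have already been read from the binary into a dictionary mapping addresses to names; the function name `name_locations` and the `loc_XXXX` naming scheme are hypothetical choices for illustration.

```python
def name_locations(targets, known_symbols):
    """Map each referenced address to a retained or generated name."""
    names = {}
    for addr in sorted(set(targets)):
        # Prefer a name retained in the binary; otherwise generate one
        # from the address so that it is unique and reproducible.
        names[addr] = known_symbols.get(addr, f"loc_{addr:04X}")
    return names
```

For instance, `name_locations([0x10, 0x05, 0x10], {0x10: "main"})` keeps `main` for address `0x10` and generates `loc_0005` for address `0x05`.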
In this chapter, we describe the algorithm that the disassembler uses.
The approach is to identify what information is available and to show
how it contributes to the generation of the final output.