Introduction


The goal of this project is to introduce people to reverse engineering by getting to know the basic functionality of Ghidra, decompilers, and disassemblers. Two exercises are also included for those who wish to apply the knowledge gleaned from this website by exploring the limits of decompilers and solving a simple CrackMe.

It is recommended that the reader have a basic knowledge in C and the memory hierarchy before reading this resource. These topics are covered in depth in Olin’s Software Systems class.

What is Reverse Engineering?

Reverse engineering is a technique used to analyze software in order to identify and understand the parts it is composed of [1]. This often means taking an executable file and attempting to recreate its source code [1].

People use tools such as decompilers or disassemblers to recreate an executable file’s source code. In this resource, we will cover how decompilers work, how disassemblers work, how to use Ghidra (a popular reverse engineering tool), and how to solve a simple CrackMe.

What is a Disassembler?

A disassembler is a program that converts machine language instructions into assembly language instructions [2]. It is a commonly used tool in reverse engineering because of its effectiveness. As you can infer from the name, a disassembler performs operations that are inverse to the operations performed by an assembler [2].

Disassemblers should not be confused with decompilers, which attempt to recreate the source code in a higher-level computer language such as C. Disassemblers are kind of like the 'middle-man' in this scenario, as the assembly instructions it generates can be used by the decompiler to remake the source code in C, C++, or another language.

In the next section, we'll walk through how disassemblers work.

How do Disassemblers work?

There are two options that one can use to implement a disassembler: a Linear Sweep algorithm or a Recursive Traversal algorithm [3].

Both of these options require some basic information, such as where the program code starts and what architecture is being used (e.g. x86 Assembly, MIPS, etc—these have different instructions!). We need to know these two things because we need to know where we can start disassembling the program, and we also need to know the instruction sets being used (which vary depending on the architecture) [3].

The Linear Sweep Algorithm

A Linear Sweep algorithm will start by reading the first N-bytes of the machine language (can be adjusted) until the disassembler gets a correct opcode [4]. Then it disassembles the next opcode until it encounters an invalid opcode. When the algorithm gets an invalid opcode, it can either skip it and resume or break gracefully and tell the user where it stopped. This makes the algorithm very simple, but also susceptible to any mistakes intentionally left in the code to derail the linear sweep algorithm from its path [4].

The most common linear sweep disassemblers are objdump, WinDbg and SoftICE.

The Recursive Traversal Algorithm

The recursive traversal algorithm is fairly similar to the linear sweep algorithm, except for a few key differences. The algorithm will start by reading the first N-bytes until the disassembler gets a correct opcode. However, instead of disassembling the next opcode until it encounters an invalid opcode, it will disassemble the next opcode until it encounters any sort of jump statement [4]. From there, it will store its current position and follow the jump statement. Then, once it hits an invalid opcode, it can resume at its previously stored position.

This technique is far less susceptible to mistakes than the linear sweep algorithm because the code isn’t disassembled in a linear way [4]. By following the jump instructions, we are able to avoid some of the most common ways a linear sweep disassembler can break.

What is a Decompiler?

A decompiler is a computer program that takes an executable file as input and attempts to create a high-level source file which can be recompiled [5]. It converts the binary code via reverse engineering into source code. This process is much more complicated than disassembling.

Most decompilers create the high-level source file in C because C is relatively close to the machine language instructions in comparison to even higher-level languages like Python or C++ (Python is written in C!). There are some decompilers that will create a source file in Python, C++, or other languages, but they’re few and far between. Fortunately, this is not an issue if you have prior experience with C.

How do decompilers work?

One might think that since we can easily compile code in C, it must not be that hard to just undo whatever the compiler did… right? Wrong!

Decompilers are a lot more complicated than we think they are. We can break down the process of decompiling code into four main stages: the expression decompiler stage, the local variable conversion stage, the control flow reconstruction stage, and the prettifying (aka making code look nice) stage [6].

The Expression Decompiler Stage

The primary purpose in this stage is to convert all stack-based instructions to abstract expressions operating on constants and variables. By the end of this stage, we will end up with an abstract expression syntax tree [6].

To successfully convert our stack-based instructions, we must perform the following steps:

  • Keep track of the stack state at each instruction

  • Create an abstract expression stack and pop/push abstract expressions to it

Here is an example of what an abstract expression syntax tree looks like at the end of this stage: