Crash Course in x86 Disassembly
Levels of abstraction - create a way of hiding the implementation details
The lower the level = the less portable across computer systems
Hardware
The only physical level
Consists of electrical circuits that use complex combinations of logical operators (XOR, AND, OR and NOT gates)
Microcode
Also known as firmware
Operates only on the exact circuitry for which it was designed
Contains instructions that translate from higher machine-code level to interface with the hardware
Machine code
Consists of opcodes, hexadecimal digits that tell the processor what you want it to do
Implemented with several microcode instructions so that the hardware can execute the code
Created when a program written in a high-level language is compiled
Low-level languages
Human-readable version of a computer architecture's instruction set
Most common low-level language is assembly language
Use a disassembler to generate low-level language text
High-level languages
Provide strong abstraction from the machine level and make it easy to use programming logic and flow-control mechanisms
Includes: C, C++
Languages are typically turned into machine code by a compiler through the process of compilation
Interpreted languages
At the top level
Includes: C#, Perl, .NET and Java
Code at this level is not compiled into machine code but is instead translated into bytecode
Bytecode
is an intermediate representation that is specific to the programming language
Executes within an interpreter
Interpreter
Is a program that translates bytecode into executable machine code on the fly at runtime
Provides an automatic level of abstraction when compared to traditional compiled code
Can handle errors and memory management
Assembly Language
Actually a class of languages
Each dialect is typically used to program a single family of microprocessors:
x86
x64
SPARC
PowerPC
MIPS
ARM
Most malware is compiled for x86
Three hardware components
Central Processing Unit (CPU) executes code
Main Memory (RAM) stores all data and code
Input / Output system (I/O) interfaces with devices such as hard drives, keyboards and monitors
Control Unit gets instructions to execute from RAM using a register (the instruction pointer), which stores the address of the instruction to execute
Registers
CPU's basic data storage units
Used to save time so that the CPU doesn't need to access RAM
Arithmetic logic unit (ALU)
Executes an instruction fetched from RAM
Places the results in registers or memory
Main memory for a single program is divided into the following major sections:
Data
Used to refer to a specific section of memory called the data section
Contains values that are put in place when a program is initially loaded
Static values do not change while the program is running
Global values are available to any part of the program
Code
Includes the instructions fetched by the CPU to execute the program's tasks
Controls what the program does and how the program's tasks will be orchestrated
Heap
Used for dynamic memory during program execution, to create new values and eliminate values that the program no longer needs
Dynamic memory - contents can change frequently while the program is running
Stack
Used for local variables and parameters for functions
Also helps control program flow
Instructions are the building blocks of assembly programs
In x86 assembly, instructions are made of a mnemonic and zero or more operands
mnemonic - a word that identifies the instruction to execute, such as mov (moves data)
operands - used to identify information used by the instruction, such as registers or data
Opcodes
Tell the CPU which operation the program wants to perform
Disassemblers translate opcodes into human-readable instructions
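To make the opcode-to-mnemonic translation concrete, here is a minimal Python sketch of a table-driven decoder. It is an illustration only: the function name disassemble is hypothetical, and the handful of single-byte opcodes it recognises (nop, ret, push/pop register, mov register with a 32-bit immediate) are a tiny slice of the x86 instruction set. A real disassembler also handles prefixes, ModR/M bytes, and far more opcodes.

```python
import struct

# Register order used by the "opcode + register number" encodings.
REGS = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def disassemble(code):
    """Decode a small, hand-picked subset of 32-bit x86 opcodes."""
    out = []
    i = 0
    while i < len(code):
        op = code[i]
        if op == 0x90:
            out.append("nop")
            i += 1
        elif op == 0xC3:
            out.append("ret")
            i += 1
        elif 0x50 <= op <= 0x57:
            out.append("push " + REGS[op - 0x50])
            i += 1
        elif 0x58 <= op <= 0x5F:
            out.append("pop " + REGS[op - 0x58])
            i += 1
        elif 0xB8 <= op <= 0xBF:
            # mov r32, imm32: the 4-byte immediate is stored little-endian.
            (imm,) = struct.unpack_from("<I", code, i + 1)
            out.append("mov %s, %#x" % (REGS[op - 0xB8], imm))
            i += 5
        else:
            # Outside this sketch's subset: emit the raw byte as data.
            out.append("db %#04x" % op)
            i += 1
    return out


print(disassemble(bytes.fromhex("55b9420000005dc3")))
# ['push ebp', 'mov ecx, 0x42', 'pop ebp', 'ret']
```

Note how the immediate 0x42 appears in the byte stream as 42 00 00 00, which is the little-endian layout covered in the next section.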
Endianness
Describes whether the most significant or least significant byte is ordered first within a larger data item
Malware has to convert between byte orders during network communication, because network data is big-endian while an x86 program is little-endian
Be aware of this during analysis so you don't accidentally reverse the byte order of important indicators such as an IP address
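The IP address pitfall can be sketched in a few lines of Python. The helper name ip_from_dword and the sample address 127.0.0.1 are assumptions made for this example; the point is only that the same 32-bit value reads as two different addresses depending on which byte order you assume.

```python
import socket
import struct

# Network (big-endian) byte layout of 127.0.0.1.
network_bytes = socket.inet_aton("127.0.0.1")       # b'\x7f\x00\x00\x01'

as_big = struct.unpack(">I", network_bytes)[0]      # 0x7F000001
as_little = struct.unpack("<I", network_bytes)[0]   # 0x0100007F


def ip_from_dword(value, byteorder):
    """Render a 32-bit value as dotted-quad text, given its byte order."""
    return socket.inet_ntoa(value.to_bytes(4, byteorder))


# An x86 program that loads those four bytes into a register sees 0x0100007F.
print(hex(as_little))                               # 0x100007f
print(ip_from_dword(as_little, "little"))           # 127.0.0.1 (correct)
print(ip_from_dword(as_little, "big"))              # 1.0.0.127 (reversed)
```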
Operands are used to identify the data used by an instruction
Three types:
Immediate - fixed values
Register - operands refer to registers
Memory address - refer to a memory address that contains the value of interest, typically denoted by a value, register, or equation between brackets.
A register is a small amount of data storage available to the CPU
Contents can be accessed more quickly than storage available elsewhere
x86 processors have a collection of registers available for use as temporary storage or workspace
Four categories:
General registers - used by the CPU during execution
Segment registers - used to track sections of memory
Status flags - used to make decisions
Instruction pointers - used to keep track of the next instruction to execute
General registers are 32 bits in size
Can be referenced as either 32 or 16 bits in assembly code
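As a rough illustration of referencing part of a register, the sketch below masks a 32-bit value the way the narrower register names do. The function name register_views is hypothetical. The 8-bit names AL and AH go beyond what these notes state: on x86 they exist for EAX, EBX, ECX and EDX and refer to the low two bytes of the register.

```python
def register_views(eax):
    """Return the sub-register views of a 32-bit EAX value."""
    eax &= 0xFFFFFFFF
    return {
        "eax": eax,                # full 32 bits
        "ax": eax & 0xFFFF,        # low 16 bits
        "al": eax & 0xFF,          # low 8 bits of AX
        "ah": (eax >> 8) & 0xFF,   # high 8 bits of AX
    }


views = register_views(0xA9DC81F5)
print({name: hex(value) for name, value in views.items()})
# {'eax': '0xa9dc81f5', 'ax': '0x81f5', 'al': '0xf5', 'ah': '0x81'}
```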
General Registers
Store data or memory addresses
Mostly used interchangeably to get things accomplished within the program
Some are used in a consistent fashion, by convention, throughout a program
Example - EAX register generally contains the return value for function calls
EFLAGS register
A status register
During execution - each flag is either set (1) or cleared (0) to control CPU operations or indicate the results of a CPU operation
Most important flags for malware analysis:
ZF - set when the result of an operation is equal to zero; otherwise it is cleared
CF - set when the result of an operation is too large or too small for the destination operand; otherwise it is cleared
SF - set when the result of an operation is negative or cleared when the result is positive. Also set when the most significant bit is set after an arithmetic operation
TF - used for debugging. The x86 processor will execute only one instruction at a time if this flag is set
EIP is also known as the instruction pointer or program counter
A register that contains the memory address of the next instruction to be executed for a program
Only purpose is to tell the processor what to do next
Corrupted EIP - leads to a program crash because it points to a memory address that does not contain legitimate program code
Attackers want to control EIP because it lets them control what is executed by the CPU
mov
used to move data from one location to another
reads and writes to memory
format - mov destination, source
lea
"load effective address"
format - lea destination, source
used to put memory address into the destination
not used exclusively to refer to memory addresses
useful when calculating values because it needs fewer instructions
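The difference between mov and lea with a bracketed operand can be modelled as follows. This is a sketch under stated assumptions: memory is a plain Python dict, and the helper names effective_address, lea and mov_load are invented for the example. lea only computes the address expression, while mov with the same operand reads the value stored at that address.

```python
MASK32 = 0xFFFFFFFF


def effective_address(base, index=0, scale=1, disp=0):
    """Compute base + index*scale + disp, wrapped to 32 bits."""
    return (base + index * scale + disp) & MASK32


def lea(base, index=0, scale=1, disp=0):
    """lea dest, [base+index*scale+disp]: dest gets the address itself."""
    return effective_address(base, index, scale, disp)


def mov_load(memory, base, index=0, scale=1, disp=0):
    """mov dest, [base+index*scale+disp]: dest gets the value stored there."""
    return memory[effective_address(base, index, scale, disp)]


memory = {0x1008: 0x20}                      # toy memory: address -> value
ebx = 0x1000

print(hex(lea(ebx, disp=8)))                 # 0x1008 (the address)
print(hex(mov_load(memory, ebx, disp=8)))    # 0x20   (the value at it)

# lea as arithmetic: lea ebx, [eax+eax*4+5] computes eax*5 + 5 in one step.
eax = 7
print(lea(eax, index=eax, scale=4, disp=5))  # 40
```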
add destination, value
sub destination, value
Zero flag (ZF) is set if the result is zero
Carry Flag (CF) is set if the destination is less than the value subtracted
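The flag behaviour of sub can be sketched as below. The function name sub32 is hypothetical, operands are treated as unsigned 32-bit values, and only ZF, CF and SF are modelled; the real instruction updates further flags that this sketch ignores.

```python
MASK32 = 0xFFFFFFFF


def sub32(destination, value):
    """Model sub destination, value and the ZF, CF and SF results."""
    destination &= MASK32
    value &= MASK32
    result = (destination - value) & MASK32
    flags = {
        "ZF": int(result == 0),          # result is zero
        "CF": int(destination < value),  # unsigned borrow occurred
        "SF": result >> 31,              # most significant bit of result
    }
    return result, flags


print(sub32(5, 5))   # (0, {'ZF': 1, 'CF': 0, 'SF': 0})
print(sub32(3, 5))   # (4294967294, {'ZF': 0, 'CF': 1, 'SF': 1})
```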
Multiplication and division
both act on a predefined register
Format - mul value and div value
Result is stored as 64-bit value across two registers
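A minimal sketch of the register pairing follows. The notes do not name the registers involved; on 32-bit x86 the unsigned mul multiplies EAX by the operand and leaves the 64-bit product in EDX:EAX, and div divides EDX:EAX by the operand, leaving the quotient in EAX and the remainder in EDX. The function names mul32 and div32 are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def mul32(eax, value):
    """Model mul value: returns (edx, eax) holding the 64-bit product."""
    product = (eax & MASK32) * (value & MASK32)
    return (product >> 32) & MASK32, product & MASK32


def div32(edx, eax, value):
    """Model div value: returns (eax, edx) as (quotient, remainder)."""
    value &= MASK32
    if value == 0:
        raise ZeroDivisionError("the CPU raises a divide error here")
    dividend = ((edx & MASK32) << 32) | (eax & MASK32)
    quotient, remainder = divmod(dividend, value)
    if quotient > MASK32:
        raise OverflowError("quotient does not fit in EAX")
    return quotient, remainder


edx, eax = mul32(0xFFFFFFFF, 2)
print(hex(edx), hex(eax))   # 0x1 0xfffffffe
print(div32(0, 17, 5))      # (3, 2)
```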
Shift registers
shift the bits in the destination operand to the right and left
shr destination, count (shift right) and shl destination, count (shift left)
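As a rough model, the two shifts on a 32-bit destination can be written like this. The function names shr32 and shl32 are hypothetical. Bits shifted past either end are discarded, and the count is masked to 5 bits as the processor does for 32-bit operands; the effect on the carry flag is left out of this sketch.

```python
MASK32 = 0xFFFFFFFF


def shr32(destination, count):
    """Model shr destination, count on an unsigned 32-bit value."""
    return (destination & MASK32) >> (count & 31)


def shl32(destination, count):
    """Model shl destination, count; bits shifted past bit 31 are lost."""
    return ((destination & MASK32) << (count & 31)) & MASK32


print(shl32(0b0001, 2))        # 4, same as multiplying by 2 twice
print(shr32(0b1000, 3))        # 1, same as dividing by 2 three times
print(shl32(0x80000000, 1))    # 0, the top bit falls off the end
```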
NOP
Does nothing, execution just moves to the next instruction
opcode is 0x90
Commonly used in a NOP sled for buffer overflow attacks
Provides execution padding - reduces the risk that the malicious shellcode will start executing in the middle
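From the analyst's side, a long run of 0x90 bytes in a buffer is a possible hint of a NOP sled. The sketch below, with the hypothetical helper longest_nop_run, simply finds the longest such run; it is a heuristic illustration only, since 0x90 bytes also occur in ordinary data and compiler padding.

```python
def longest_nop_run(buf):
    """Return (offset, length) of the longest run of 0x90 bytes in buf."""
    best_start, best_len = -1, 0
    run_start, run_len = 0, 0
    for i, byte in enumerate(buf):
        if byte == 0x90:
            if run_len == 0:
                run_start = i
            run_len += 1
            if run_len > best_len:
                best_start, best_len = run_start, run_len
        else:
            run_len = 0
    return best_start, best_len


sample = b"\x41\x90\x90\x90\xcc\x90"
print(longest_nop_run(sample))   # (1, 3)
```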
The stack stores memory for functions, local variables and flow control
Is a data structure characterized by pushing and popping
Last in, first out structure
Short term storage only
Primary usage is for the management of data exchanged between function calls
Stack instructions - push, pop, call, leave, enter and ret
ESP
The stack pointer
Contains a memory address that points to the top of the stack
EBP
The base pointer
Stays consistent within a given function
The program can use it as a placeholder to keep track of the location of local variables and parameters
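The push and pop behaviour around ESP can be sketched with a toy model. The class name Stack32 and the starting address are assumptions for illustration. One detail not spelled out in the notes: on x86 the stack grows toward lower addresses, so a 32-bit push subtracts 4 from ESP and a pop adds 4.

```python
MASK32 = 0xFFFFFFFF


class Stack32:
    """Toy model of a 32-bit x86 stack (memory is a dict of dwords)."""

    def __init__(self, top=0x0012FFF0):
        self.esp = top        # ESP points at the current top of the stack
        self.memory = {}

    def push(self, value):
        self.esp -= 4                           # stack grows downward
        self.memory[self.esp] = value & MASK32

    def pop(self):
        value = self.memory.pop(self.esp)
        self.esp += 4
        return value


stack = Stack32()
stack.push(0x11)
stack.push(0x22)
print(hex(stack.esp))      # 0x12ffe8, two dwords below the start
print(hex(stack.pop()))    # 0x22, last in, first out
print(hex(stack.pop()))    # 0x11
```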
Functions
Portions of code within a program that perform a specific task
Relatively independent of the remaining code
Prologue - prepares the stack and registers for use within the function
Epilogue - restores the stack and registers to their state before the function was called
Flow of function call implementation
Arguments are placed on the stack using push instructions
Function is called using call memory_location, which pushes the return address onto the stack (the contents of the EIP register, which at that moment already points to the instruction after the call).
This address is used to return to the main code when the function is finished.
When the function begins, EIP is set to memory_location (the start of the function)
Using the prologue, EBP (base pointer) is pushed onto the stack and space is allocated on the stack for local variables
The function performs its work
Using the function epilogue, the stack is restored. ESP is adjusted to free the local variables and EBP is restored so that the calling function can address its variables properly
The function returns by executing the ret instruction, which pops the return address off the stack into EIP. The program will continue executing from where the original call was made
Stack is adjusted to remove arguments
A branch is a sequence of code that is conditionally executed depending on the flow of the program
jump instructions
Most popular way branching happens
jmp location - causes the next instruction executed to be the one specified by the jmp
Conditional Jumps
Use the flags to determine whether to jump or to proceed to the next instruction
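How the flags drive a conditional jump can be sketched as follows. The helper names cmp32 and next_eip and the addresses are hypothetical. The specific jumps shown (jz, jnz, jb) are a small sample of the conditional jump family and are not listed in these notes; cmp is modelled as a subtraction that only sets ZF and CF.

```python
MASK32 = 0xFFFFFFFF


def cmp32(a, b):
    """Model cmp a, b: set flags as sub would, without storing a result."""
    a &= MASK32
    b &= MASK32
    return {"ZF": int(a == b), "CF": int(a < b)}


def next_eip(mnemonic, flags, target, fallthrough):
    """Return the address executed next for a few jump mnemonics."""
    taken = {
        "jmp": True,               # unconditional
        "jz": flags["ZF"] == 1,    # jump if zero (operands were equal)
        "jnz": flags["ZF"] == 0,   # jump if not zero
        "jb": flags["CF"] == 1,    # jump if below (unsigned less than)
    }
    return target if taken[mnemonic] else fallthrough


flags = cmp32(5, 5)
print(hex(next_eip("jz", flags, 0x401020, 0x401005)))    # 0x401020 (taken)
print(hex(next_eip("jnz", flags, 0x401020, 0x401005)))   # 0x401005 (not taken)
```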
Rep instructions are a set of instructions for manipulating data buffers
Usually in the form of an array of bytes
ESI - Source Index Register
EDI - Destination Index Register
ECX - Counting Variable
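The roles of the three registers can be illustrated with a model of one common buffer instruction, rep movsb, which copies ECX bytes from the address in ESI to the address in EDI. The function name rep_movsb is hypothetical, memory is a bytearray indexed from zero, and the sketch assumes the direction flag is clear so both index registers count upward.

```python
def rep_movsb(memory, esi, edi, ecx):
    """Copy ecx bytes from memory[esi] to memory[edi], one byte per step."""
    while ecx != 0:
        memory[edi] = memory[esi]
        esi += 1      # source index advances
        edi += 1      # destination index advances
        ecx -= 1      # counter runs down to zero
    return esi, edi, ecx


memory = bytearray(b"ABCD\x00\x00\x00\x00")
print(rep_movsb(memory, esi=0, edi=4, ecx=4))   # (4, 8, 0)
print(bytes(memory))                            # b'ABCDABCD'
```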