The Root of the Dilemma in Understanding Assembly Language
Programmers who are “native” in C/C++ often run into the following difficulties when reading assembly code:
- Poor Readability: Assembly instructions have a low level of abstraction and lack the expressiveness of high-level languages.
- Lack of Context: Low-level details such as register operations and memory accesses obscure the high-level intentions of the program.
- Differences in Thinking Style: The structured programming mindset struggles to adapt to the unstructured flow of instructions.
The Mapping Relationship Between Assembly and C
1. Function Calls and Stack Frames
C Language Example:
int add(int a, int b) {
return a + b;
}
Corresponding Assembly (32-bit x86, Intel syntax, cdecl, unoptimized):
_add:
push ebp ; Save old base pointer
mov ebp, esp ; Create new stack frame
mov eax, [ebp+8] ; Get first parameter a
add eax, [ebp+12] ; Add second parameter b
pop ebp ; Restore old base pointer
ret ; Return
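To see why the parameters sit at [ebp+8] and [ebp+12], it can help to model the frame as an array of 4-byte stack words. The following is a minimal sketch, not real machine state: the names add_via_frame, call_add, and the SLOT_ constants are hypothetical, and the saved base pointer and return address are filled with placeholder values.

```c
#include <stdint.h>

/* Toy model of a 32-bit cdecl stack frame. Each array slot stands for one
   4-byte stack word, with slot 0 being the word that ebp points to. */
enum {
    SLOT_SAVED_EBP = 0, /* [ebp+0]  old base pointer (push ebp)        */
    SLOT_RET_ADDR  = 1, /* [ebp+4]  return address (pushed by call)    */
    SLOT_ARG_A     = 2, /* [ebp+8]  first parameter                    */
    SLOT_ARG_B     = 3  /* [ebp+12] second parameter                   */
};

/* Mirrors the body of _add: load a, add b, leave the result in "eax". */
int32_t add_via_frame(const int32_t frame[4]) {
    int32_t eax = frame[SLOT_ARG_A]; /* mov eax, [ebp+8]  */
    eax += frame[SLOT_ARG_B];        /* add eax, [ebp+12] */
    return eax;                      /* result is returned in eax */
}

/* Mirrors the caller: arguments are stored right to left, then the
   return address and the saved ebp end up below them. */
int32_t call_add(int32_t a, int32_t b) {
    int32_t frame[4];
    frame[SLOT_ARG_B]     = b; /* push b */
    frame[SLOT_ARG_A]     = a; /* push a */
    frame[SLOT_RET_ADDR]  = 0; /* placeholder for the return address */
    frame[SLOT_SAVED_EBP] = 0; /* placeholder for the saved ebp */
    return add_via_frame(frame);
}
```

Calling call_add(2, 3) walks the same two loads that the assembly performs; the 8-byte gap before the first parameter is the saved ebp plus the return address.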
2. Control Structures
C Language if Statement:
if (x > 0) {
// Code block 1
} else {
// Code block 2
}
Corresponding Assembly (assuming x has already been loaded into eax):
cmp eax, 0 ; Compare x with 0
jle ELSE_BLOCK ; Jump to else block if x <= 0
; Instructions for code block 1
jmp END_IF ; Skip else block
ELSE_BLOCK:
; Instructions for code block 2
END_IF:
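One way to bridge the two views is to rewrite the C in a "lowered" form that uses goto, so every line matches one assembly instruction. This is an illustrative sketch: the function name classify and the return values 1 and 2 (standing in for code blocks 1 and 2) are assumptions made for the example.

```c
/* if/else written the way the assembly expresses it: the condition is
   inverted and used to jump over the "then" block. */
int classify(int x) {
    int result;
    if (x <= 0) goto else_block; /* cmp eax, 0 ; jle ELSE_BLOCK */
    result = 1;                  /* code block 1 */
    goto end_if;                 /* jmp END_IF */
else_block:
    result = 2;                  /* code block 2 */
end_if:
    return result;
}
```

Note that the jump condition (x <= 0) is the negation of the source condition (x > 0). Spotting this inversion is usually the first step in recovering the original if statement.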
3. Loop Structures
C Language for Loop:
for (int i = 0; i < 10; i++) {
// Loop body
}
Corresponding Assembly:
mov ecx, 0 ; i = 0
FOR_LOOP:
cmp ecx, 10 ; Compare i with 10
jge END_FOR ; If i >= 10, exit loop
; Instructions for loop body
inc ecx ; i++
jmp FOR_LOOP ; Continue loop
END_FOR:
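The same lowering works for loops. The sketch below mirrors the assembly label by label; the function name sum_below, the limit parameter, and the summing loop body are hypothetical choices so that the loop has an observable result.

```c
/* for (int i = 0; i < limit; i++) total += i; written with explicit jumps. */
int sum_below(int limit) {
    int total = 0;
    int i = 0;                      /* mov ecx, 0 */
for_loop:
    if (i >= limit) goto end_for;   /* cmp ecx, limit ; jge END_FOR */
    total += i;                     /* loop body */
    i++;                            /* inc ecx */
    goto for_loop;                  /* jmp FOR_LOOP */
end_for:
    return total;
}
```

With limit set to 10 this follows the listing above exactly: a test at the top, the body, the increment, and a backward jump. A backward jump to a label that begins with a comparison is a strong hint that the original source contained a for or while loop.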
Reverse Engineering Techniques
1. Identifying Function Prototypes
By analyzing parameter passing and return values in assembly, one can infer the C function prototype:
; Calling convention is cdecl (parameters pushed from right to left, caller cleans up the stack)
push 3 ; Third parameter
push 2 ; Second parameter
push 1 ; First parameter
call _func ; Call function
add esp, 12 ; Clean up stack (3 parameters × 4 bytes)
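; Can infer a likely C prototype (three 4-byte parameters;
; the return type is void only if the caller ignores eax afterwards):
; void func(int a, int b, int c);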
2. Struct Access Patterns
Struct access typically involves a base address plus a fixed offset:
mov eax, [ebx+8] ; Equivalent to C's struct_ptr->field2 (the member at byte offset 8)
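The link between a constant offset and a member name can be checked from the C side with offsetof. The sketch below is illustrative only: struct record and its three 4-byte members are an assumed layout, and load_at_offset is a hypothetical helper that performs the same "base plus displacement" load as the instruction above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed layout: three 4-byte members at byte offsets 0, 4, and 8. */
struct record {
    int32_t field0;
    int32_t field1;
    int32_t field2;
};

/* Reads a 4-byte value at base + offset, like mov eax, [ebx+offset]. */
int32_t load_at_offset(const void *base, size_t offset) {
    int32_t value;
    memcpy(&value, (const unsigned char *)base + offset, sizeof value);
    return value;
}
```

For a struct record r, load_at_offset(&r, offsetof(struct record, field2)) reads the same bytes as r.field2. When reversing, the process runs the other way: collect the distinct offsets used with one base register and build a candidate struct whose members land on them.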
3. Array Access Patterns
Array access usually includes base address, index, and element size calculation:
mov eax, [esi+edi*4] ; Equivalent to C's array[index]
; The scale factor 4 implies 4-byte elements (for example, int)
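The address arithmetic hidden inside array[index] can be spelled out in C. This is a minimal sketch under the same assumption of 4-byte elements; load_scaled is a hypothetical helper name.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Computes base + index * 4 by hand, like mov eax, [esi+edi*4]. */
int32_t load_scaled(const int32_t *base, size_t index) {
    const unsigned char *addr =
        (const unsigned char *)base + index * sizeof(int32_t); /* esi + edi*4 */
    int32_t value;
    memcpy(&value, addr, sizeof value);
    return value;
}
```

For any valid index, load_scaled(array, index) yields the same value as array[index]. In a disassembly, the scale factor (1, 2, 4, or 8) is the quickest clue to the element size.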
Practical Recommendations
- Start with Small Functions: Begin by analyzing simple function calls and gradually transition to more complex logic.
- Comment Conversion: Annotate each assembly instruction or block with the C statement it implements.
- Bidirectional Comparison: Compile simple C code and observe the generated assembly to build an intuitive understanding.
- Use a Debugger: Dynamically trace register changes to understand data flow.
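For the bidirectional comparison, a tiny function with a single branch is a reasonable first subject. The sketch below assumes a hypothetical file named sample.c; with GCC on a system that has 32-bit support installed, a command such as gcc -m32 -O0 -S -masm=intel sample.c writes the generated assembly to sample.s. The exact output varies by compiler, version, and platform, so it will not match the hand-written listings in this article line for line.

```c
/* sample.c: one comparison and two return paths, which should show up
   in the generated assembly as a cmp followed by a conditional jump. */
int max_of(int a, int b) {
    if (a > b) {
        return a;
    }
    return b;
}
```

Reading the output next to the source, then recompiling at a higher optimization level, shows how much of the frame setup and branching the compiler removes.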
Complete Example of Conversion
C Code:
int factorial(int n) {
if (n <= 1)
return 1;
else
return n * factorial(n-1);
}
Corresponding Assembly:
_factorial:
push ebp
mov ebp, esp
mov eax, [ebp+8] ; eax = n
cmp eax, 1 ; Compare n with 1
jg RECURSIVE_CASE ; if n > 1
mov eax, 1 ; return 1
jmp END_FACT
RECURSIVE_CASE:
dec eax ; eax = n-1
push eax ; Prepare parameter
call _factorial ; Recursive call
add esp, 4 ; Clean up stack
imul eax, [ebp+8] ; Return value multiplied by n
END_FACT:
pop ebp
ret
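The assembly above can be read back into C almost line by line before any cleanup. The following sketch does exactly that: factorial_lowered is a hypothetical name, and the local variable eax stands in for the register. It behaves like the original factorial for small non-negative inputs whose result fits in an int.

```c
/* A literal, instruction-by-instruction reading of _factorial. */
int factorial_lowered(int n) {
    int eax = n;                      /* mov eax, [ebp+8] */
    if (eax > 1) goto recursive_case; /* cmp eax, 1 ; jg RECURSIVE_CASE */
    eax = 1;                          /* mov eax, 1 */
    goto end_fact;                    /* jmp END_FACT */
recursive_case:
    eax = eax - 1;                    /* dec eax */
    eax = factorial_lowered(eax);     /* push eax ; call ; add esp, 4 */
    eax = eax * n;                    /* imul eax, [ebp+8] */
end_fact:
    return eax;                       /* pop ebp ; ret */
}
```

From this form, folding the two gotos back into an if/else and replacing eax with expressions recovers the original source.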
With this kind of systematic correspondence analysis, assembly code stops being a pile of incomprehensible instructions and can be read and analyzed much like C code.