Dobby Source Code Reading: Instruction-Level Tool

This article is an excellent piece from the KX forum.

KX Forum Author ID: KerryS

Dobby has two main functions: one is inline hooking, and the other is instruction instrumentation. The principles of both are similar, but this article mainly introduces instruction instrumentation.

Instruction instrumentation means planting a probe on an arbitrary instruction (at the function entry or anywhere inside the function). When execution reaches that instruction, a user-defined callback runs first, and then control returns to the original instruction stream. Usage example:

    int res_instrument = DobbyInstrument((void *)addr, offset_name_handler);

    // handler is our custom callback.
    // RegisterContext is the register context; HookEntryInfo carries the
    // information the hook needs, such as the hook address.
    void offset_name_handler(RegisterContext *ctx, const HookEntryInfo *info);

    typedef struct _RegisterContext {
      uint32_t dummy_0;
      uint32_t dummy_1;
      uint32_t dummy_2;
      uint32_t sp;
      union {
        uint32_t r[13];
        struct {
          uint32_t r0, r1, r2, r3, r4, r5, r6, r7, r8, r9, r10, r11, r12;
        } regs;
      } general;
      uint32_t lr;
    } RegisterContext;

    // HookEntryInfo contains the hook address and ID
    typedef struct _HookEntryInfo {
      int hook_id;
      union {
        void *target_address;
        void *function_address;
        void *instruction_address;
      };
    } HookEntryInfo;
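As a rough illustration of what such a callback can do, the sketch below copies the two structs into a stand-alone file and invokes the handler by hand with a fake context. Nothing here calls Dobby itself: `fake_dispatch`, `g_last_r0`, and the register values are made up for the demonstration. In real use Dobby builds the RegisterContext on the stack and calls the handler for you.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// Local copies of the structs shown above, so the sketch compiles without dobby.h.
typedef struct _RegisterContext {
  uint32_t dummy_0;
  uint32_t dummy_1;
  uint32_t dummy_2;
  uint32_t sp;
  union {
    uint32_t r[13];
    struct {
      uint32_t r0, r1, r2, r3, r4, r5, r6, r7, r8, r9, r10, r11, r12;
    } regs;
  } general;
  uint32_t lr;
} RegisterContext;

typedef struct _HookEntryInfo {
  int hook_id;
  union {
    void *target_address;
    void *function_address;
    void *instruction_address;
  };
} HookEntryInfo;

// Hypothetical global, only used so the effect of the handler can be observed.
static uint32_t g_last_r0 = 0;

// Example callback: record r0, log the hook, and overwrite r1.
// Because the bridge later restores registers from this context,
// a write to ctx changes what the instrumented code sees afterwards.
void offset_name_handler(RegisterContext *ctx, const HookEntryInfo *info) {
  g_last_r0 = ctx->general.regs.r0;
  std::printf("hook %d at %p, r0 = 0x%x\n", info->hook_id,
              info->instruction_address,
              static_cast<unsigned>(ctx->general.regs.r0));
  ctx->general.regs.r1 = 0;
}

// Hypothetical stand-in for what Dobby's bridge does: build a context,
// call the handler, and return the (possibly modified) r1.
uint32_t fake_dispatch(uint32_t r0, uint32_t r1) {
  RegisterContext ctx = {};
  ctx.general.regs.r0 = r0;
  ctx.general.regs.r1 = r1;

  HookEntryInfo info;
  info.hook_id = 1;
  info.instruction_address = nullptr;  // no real hook address in this sketch

  offset_name_handler(&ctx, &info);
  return ctx.general.regs.r1;
}
```

Calling `fake_dispatch(0x1234, 7)` returns 0, since the handler cleared r1, and leaves 0x1234 in `g_last_r0`.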

1. How It Works

As the saying goes, a picture is worth a thousand words, so diagrams are the best way to explain this. I organized and drew the diagrams while reading the code.

[Figure: flowchart of the Dobby instruction instrumentation flow]

The instrumented instruction is replaced with:

    ------------------------------------------------------------------------------
    process 6089
    0x9d639d32 nop
    0x9d639d34 ldr.w pc, [pc, #-0x0]
    0x9d639d38 // Address 0xcea0a0ac

    0x9d639d38
               0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F  0123456789ABCDEF
    00000000  ac a0 a0 ce                                      ....
    ------------------------------------------------------------------------------

ARM processors pipeline instruction fetch, decode, and execute, so by the time an instruction executes, the PC has already moved past it. Reading the PC yields the address of the current instruction + 8 in ARM mode and + 4 in THUMB mode. When the ldr.w at 0x9d639d34 executes, the PC therefore reads as 0x9d639d38, and ldr.w pc, [pc, #-0x0] loads the 32-bit word stored right after the instruction (0xcea0a0ac) into the PC, which causes a jump. For a PC-relative load in THUMB mode the PC value is also rounded down to a multiple of 4, which is presumably why a nop sits at 0x9d639d32: it keeps the ldr.w word-aligned.
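The address arithmetic above can be sketched as a small helper. This is an illustration, not code from Dobby: `Mode` and `pc_literal_address` are names invented here, and the function only models where a `ldr rX, [pc, #offset]` reads its literal from.

```cpp
#include <cassert>
#include <cstdint>

enum class Mode { Arm, Thumb };

// Returns the address that `ldr rX, [pc, #offset]` located at insn_addr reads from.
// insn_addr is the plain instruction address (without the THUMB bit).
inline uint32_t pc_literal_address(Mode mode, uint32_t insn_addr, int32_t offset) {
  assert((insn_addr & 1u) == 0);
  // Reading pc yields insn + 8 in ARM state and insn + 4 in THUMB state.
  uint32_t pc = insn_addr + (mode == Mode::Arm ? 8u : 4u);
  // PC-relative loads use the word-aligned pc as base (this only matters in THUMB).
  uint32_t base = pc & ~3u;
  return base + static_cast<uint32_t>(offset);
}
```

For the patch shown above, `pc_literal_address(Mode::Thumb, 0x9d639d34, 0)` gives 0x9d639d38, the slot that holds 0xcea0a0ac.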

This jump can reach any address that fits in a 32-bit register, i.e. the full 4 GB range. The virtual address space of a 32-bit Linux process is also 4 GB, so the jump can reach anywhere in the process. So where does it jump to? It jumps to prologue_dispatch_bridge at 0xcea0a0ac.

    0xcea0a0ac ldr ip, [pc]
    0xcea0a0b0 ldr pc, [pc]
    0xcea0a0b4 // Address 0xa2305b80
    0xcea0a0b8 // Address 0xcea0a000

    0xcea0a0b4
               0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F  0123456789ABCDEF
    00000000  80 5b 30 a2 00 a0 a0 ce

It does two things: it loads 0xa2305b80 into the ip register (this value is later passed to the handler in r1), and then it jumps to 0xcea0a000. Note that these are ARM-mode instructions, so the PC offset is 8.

Among them, 0xcea0a000 is the upper half of the closure bridge.

    0xcea0a000 sub sp, sp, #0x38
    0xcea0a004 str lr, [sp, #0x34]
    0xcea0a008 str ip, [sp, #0x30]
    0xcea0a00c str fp, [sp, #0x2c]
    0xcea0a010 str sl, [sp, #0x28]
    0xcea0a014 str sb, [sp, #0x24]
    0xcea0a018 str r8, [sp, #0x20]
    0xcea0a01c str r7, [sp, #0x1c]
    0xcea0a020 str r6, [sp, #0x18]
    0xcea0a024 str r5, [sp, #0x14]
    0xcea0a028 str r4, [sp, #0x10]
    0xcea0a02c str r3, [sp, #0xc]
    0xcea0a030 str r2, [sp, #8]
    0xcea0a034 str r1, [sp, #4]
    0xcea0a038 str r0, [sp]
    0xcea0a03c add r0, sp, #0x38
    0xcea0a040 sub sp, sp, #8
    0xcea0a044 str r0, [sp, #4]
    0xcea0a048 sub sp, sp, #8
    0xcea0a04c mov r0, sp
    0xcea0a050 mov r1, ip
    0xcea0a054 bl #0xcea0a05c
    0xcea0a058 b #0xcea0a064
    0xcea0a05c ldr pc, [pc, #-4]
    0xcea0a060 // Address 0x9d2b43e1

    0xcea0a060
               0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F  0123456789ABCDEF
    00000000  e1 43 2b 9d

The word stored at 0xcea0a060, 0x9d2b43e1, is the address of the high-level handler, which in turn calls our custom handler.

instrument_call_forward_handler

    void instrument_call_forward_handler(RegisterContext *ctx, HookEntry *entry) {
      DynamicBinaryInstrumentRouting *route = (DynamicBinaryInstrumentRouting *)entry->route;
      if (route->handler) {
        DBICallTy handler;
        HookEntryInfo entry_info;
        entry_info.hook_id = entry->id;
        entry_info.instruction_address = entry->instruction_address;
        handler = (DBICallTy)route->handler;
        (*handler)(ctx, (const HookEntryInfo *)&entry_info);
      }

      // set prologue bridge next hop address with origin instructions that have been relocated(patched)
      set_routing_bridge_next_hop(ctx, entry->relocated_origin_instructions);
    }

This handler not only calls our handler but also performs a crucial task, which will be discussed later.

Let’s summarize the closure bridge: it first saves the register context, then the bl instruction at 0xcea0a054 jumps to 0xcea0a05c, where an ldr loads the address of the high-level handler into the pc and thereby calls it. Note that the bl instruction saves the address of the next instruction, 0xcea0a058, in the lr register. When the handler returns, execution resumes at the address saved in lr, which is 0xcea0a058 b #0xcea0a064. Let’s check the contents of 0xcea0a064.
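Reading the stores in the upper half, the stack frame they build appears to line up with the RegisterContext struct from the usage example: 14 words for r0..r12 and lr, one word for the original sp, and three padding words below it. The compile-time checks below are a sketch of that correspondence. The offsets are derived here from the disassembly, and `reg_slot_offset` is a helper invented for the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Same field layout as the RegisterContext shown in the usage example.
struct RegisterContext {
  uint32_t dummy_0;  // [ctx + 0]  alignment padding (second `sub sp, sp, #8`)
  uint32_t dummy_1;  // [ctx + 4]
  uint32_t dummy_2;  // [ctx + 8]  unused half of the first `sub sp, sp, #8`
  uint32_t sp;       // [ctx + 12] original sp, written by `str r0, [sp, #4]`
  union {
    uint32_t r[13];  // [ctx + 16] r0..r12, written by the `str rN, [sp, #4*N]` run
    struct {
      uint32_t r0, r1, r2, r3, r4, r5, r6, r7, r8, r9, r10, r11, r12;
    } regs;
  } general;
  uint32_t lr;       // [ctx + 68] written by `str lr, [sp, #0x34]`
};

static_assert(offsetof(RegisterContext, sp) == 12, "sp slot sits just below r0");
static_assert(offsetof(RegisterContext, general) == 16, "r0 is stored at the old [sp]");
static_assert(offsetof(RegisterContext, lr) == 16 + 0x34, "lr is stored at [sp, #0x34]");
static_assert(sizeof(RegisterContext) == 0x38 + 8 + 8, "total stack adjustment");

// Byte offset of the saved rN slot inside the context (hypothetical helper).
constexpr std::size_t reg_slot_offset(unsigned n) {
  return offsetof(RegisterContext, general) + 4 * n;
}
```

With this layout, `mov r0, sp` hands the handler a pointer that can be used directly as a `RegisterContext *`, and the saved r12 (ip) lives at offset 64.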

Closure bridge lower half

    0xcea0a064 add sp, sp, #8
    0xcea0a068 add sp, sp, #8
    0xcea0a06c pop {r0}
    0xcea0a070 pop {r1}
    0xcea0a074 pop {r2}
    0xcea0a078 pop {r3}
    0xcea0a07c pop {r4}
    0xcea0a080 pop {r5}
    0xcea0a084 pop {r6}
    0xcea0a088 pop {r7}
    0xcea0a08c pop {r8}
    0xcea0a090 pop {sb}
    0xcea0a094 pop {sl}
    0xcea0a098 pop {fp}
    0xcea0a09c pop {ip}
    0xcea0a0a0 pop {lr}
    0xcea0a0a4 mov pc, ip

It performs the common task of popping the previously saved registers from the stack while restoring stack balance. The only unusual aspect is the last instruction, mov pc, ip, which jumps to the address stored in the ip register. So what is the value stored in the ip register? Remember the crucial task mentioned earlier?

The last line of the instrument_call_forward_handler function

    // set prologue bridge next hop address with origin instructions that have been relocated(patched)
    set_routing_bridge_next_hop(ctx, entry->relocated_origin_instructions);

    void set_routing_bridge_next_hop(RegisterContext *ctx, void *address) {
      *reinterpret_cast<void **>(&ctx->general.regs.r12) = address;
    }

This writes entry->relocated_origin_instructions into the saved r12 (ip) slot of the register context, so the closure bridge pops it into ip and the final mov pc, ip jumps there. entry->relocated_origin_instructions is the address of the relocated copy of the original instructions.
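The interplay between the handler and the lower half of the bridge can be modeled in a few lines. This is a simplified simulation under stated assumptions, not Dobby code: `SavedRegisters` and `run_lower_half` are invented names, addresses are plain 32-bit integers, and the value 0xcea0a0c1 is used only as an example of a relocated THUMB address.

```cpp
#include <cassert>
#include <cstdint>

// Simplified model of the register block the bridge saved on the stack.
struct SavedRegisters {
  uint32_t r[13];  // r0..r12; r[12] is ip
  uint32_t lr;
};

// Models set_routing_bridge_next_hop: overwrite the saved r12 slot.
inline void set_routing_bridge_next_hop(SavedRegisters *ctx, uint32_t address) {
  ctx->r[12] = address;
}

// Models the lower half: pop r0..r12 and lr, then `mov pc, ip`.
// Returns the address execution would continue at.
inline uint32_t run_lower_half(const SavedRegisters &saved) {
  uint32_t regs[13];
  for (int i = 0; i < 13; ++i) {
    regs[i] = saved.r[i];  // pop {rN}
  }
  return regs[12];  // mov pc, ip
}
```

One consequence visible in this model is that whatever the interrupted code had in r12 is replaced by the jump target; r12 is a scratch register in the ARM calling convention, which is presumably why it was chosen.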

Because the original instructions were overwritten by the ldr pc jump and its address literal, they are relocated to entry->relocated_origin_instructions (instruction relocation is discussed later). After the relocated instructions execute, control jumps back to the instruction that follows the patched region and continues from there. This process is roughly as follows:

Original instruction

The patch needs at least 8 bytes (a 4-byte ldr.w plus a 4-byte address), and because the original code here is THUMB, four instructions are overwritten and have to be relocated.

Relocated instructions

    ------------------------------------------------------------------------------
    process 6089
    0xcea0a0c0 nop
    0xcea0a0c2 nop
    0xcea0a0c4 push {r0, r1, r2, lr}
    0xcea0a0c6 nop
    0xcea0a0c8 cbz r0, #0xcea0a0cc
    0xcea0a0ca nop
    0xcea0a0cc b.w #0xcea0a0d0
    0xcea0a0d0 ldr.w pc, [pc, #0x14]   // 0xcea0a0d0 + 0x14 + thumb_pc_offset(4) = 0xcea0a0e8, i.e. 0x9d639d45
    0xcea0a0d4 nop
    0xcea0a0d6 nop
    0xcea0a0d8 add r2, sp, #8
    0xcea0a0da nop
    0xcea0a0dc str r1, [r2, #-0x4]!
    0xcea0a0e0 ldr.w pc, [pc, #-0x0]   // likewise, 0x9d639d3d
    0xcea0a0e4 // Address 0x9d639d3d
    0xcea0a0e8 // Address 0x9d639d45

    0xcea0a0e4
               0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F  0123456789ABCDEF
    00000000  3d 9d 63 9d 45 9d 63 9d                          =.c.E.c.
    ------------------------------------------------------------------------------

The relocation logic is as follows: PC-relative instructions are rewritten so that an ldr jumps to the correct original location, while PC-independent instructions are copied as-is. Here the push, add, and str.w instructions are copied directly, with some nop instructions inserted for 4-byte alignment. I am not sure why the alignment is needed, since THUMB instructions only require 2-byte alignment. One plausible reason is that a THUMB PC-relative ldr computes its base from the PC rounded down to a multiple of 4, so keeping the generated code word-aligned keeps the literal offsets predictable.

In the original code, only cbz is PC-relative. It tests whether r0 is zero and, if so, branches to the given location, here 0x10 bytes past the cbz itself, at module offset 0x25ED44. Dobby’s relocation rewrites the cbz: if r0 is 0, control reaches the ldr at 0xcea0a0d0 (ldr.w pc, [pc, #0x14]), which jumps back to offset 0x25ED44 (0x9d639d45).

If r0 is not zero, the ldr at 0xcea0a0e0 (ldr.w pc, [pc, #-0x0]) jumps to the instruction following the patched region, and execution continues at offset 0x25ED3C (0x9d639d3d). Both addresses have 1 added because the original code is THUMB: when an address is loaded into the PC this way, the processor selects ARM or THUMB mode from the least significant bit of the address, and 1 means THUMB. That completes the logic of Dobby’s instruction instrumentation.
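The two return addresses can be recomputed from the original cbz. The sketch below decodes a 16-bit cbz/cbnz and adds the THUMB bit. It is an illustration only: the function names are invented here, and the halfword 0xB130 is our own encoding of `cbz r0` with a 0xC offset, not a value taken from the dumps above.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t kThumbBit = 1;

// cbz/cbnz share the pattern 1011 x0x1 .... .... (bit 11 selects cbnz, bit 9 is `i`).
inline bool is_cbz_or_cbnz(uint16_t instr) {
  return (instr & 0xf500) == 0xb100;
}

// Branch target of a cbz/cbnz located at insn_addr (address without the THUMB bit).
inline uint32_t cbz_target(uint16_t instr, uint32_t insn_addr) {
  assert(is_cbz_or_cbnz(instr));
  uint32_t imm5 = (instr >> 3) & 0x1f;
  uint32_t i = (instr >> 9) & 0x1;
  uint32_t offset = (i << 6) | (imm5 << 1);
  return insn_addr + 4 + offset;  // THUMB pc reads as insn + 4
}

// Address form that makes `ldr pc, ...` continue in THUMB state.
inline uint32_t as_thumb_pointer(uint32_t addr) {
  return addr | kThumbBit;
}
```

With the cbz at 0x9d639d34, `cbz_target(0xb130, 0x9d639d34)` is 0x9d639d44, and `as_thumb_pointer` turns it into the 0x9d639d45 stored in the literal pool. The fall-through address 0x9d639d3c becomes 0x9d639d3d the same way.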

2. Code Walkthrough

First, traverse a HookEntry linked list, which stores information about each instrumentation. Each time an instruction is instrumented, a HookEntry structure is generated and added to this list. Traversing this list can determine whether the instruction to be instrumented has already been instrumented.

route->DispatchRouting() is the key method, as it performs almost all of the instrumentation work. It calls two methods: BuildDynamicBinaryInstrumentRouting() and GenerateRelocatedCode(trampoline_buffer_->getSize()).

    void DynamicBinaryInstrumentRouting::DispatchRouting() {
      BuildDynamicBinaryInstrumentRouting();

      // generate relocated code which size == trampoline size
      GenerateRelocatedCode(trampoline_buffer_->getSize());
    }

BuildDynamicBinaryInstrumentRouting()

    void DynamicBinaryInstrumentRouting::BuildDynamicBinaryInstrumentRouting() {
      // create closure trampoline jump to prologue_routing_dispath with the `entry_` data
      ClosureTrampolineEntry *closure_trampoline;

      void *handler = (void *)instrument_routing_dispatch;
    #if __APPLE__
    #if __has_feature(ptrauth_calls)
      handler = __builtin_ptrauth_strip(handler, ptrauth_key_asia);
    #endif
    #endif

      closure_trampoline = ClosureTrampoline::CreateClosureTrampoline(entry_, handler);
      this->SetTrampolineTarget(closure_trampoline->address);
      DLOG(0, "[closure bridge] Carry data %p ", entry_);
      DLOG(0, "[closure bridge] Create prologue_dispatch_bridge %p", closure_trampoline->address);

      // generate trampoline buffer, run before `GenerateRelocatedCode`
      GenerateTrampolineBuffer(entry_->target_address, GetTrampolineTarget());
    }

Here, closure_trampoline = ClosureTrampoline::CreateClosureTrampoline(entry_, handler); generates the assembly instructions for the prologue_dispatch_bridge. The call _ EmitAddress((uint32_t)get_closure_bridge()) is the key point, because the closure_bridge instructions are generated there.

    ClosureTrampolineEntry *ClosureTrampoline::CreateClosureTrampoline(void *carry_data, void *carry_handler) {
      ClosureTrampolineEntry *entry = nullptr;
      entry = new ClosureTrampolineEntry;

    #ifdef ENABLE_CLOSURE_TRAMPOLINE_TEMPLATE
    #define CLOSURE_TRAMPOLINE_SIZE (7 * 4)
      // use closure trampoline template code, find the executable memory and patch it.
      Code *code = Code::FinalizeCodeFromAddress(closure_trampoline_template, CLOSURE_TRAMPOLINE_SIZE);
    #else
    // use assembler and codegen modules instead of template_code
    #include "TrampolineBridge/ClosureTrampolineBridge/AssemblyClosureTrampoline.h"
    #define _ turbo_assembler_.
      TurboAssembler turbo_assembler_(0);

      PseudoLabel entry_label;
      PseudoLabel forward_bridge_label;

      _ Ldr(r12, &entry_label);
      _ Ldr(pc, &forward_bridge_label);
      _ PseudoBind(&entry_label);
      _ EmitAddress((uint32_t)entry);
      _ PseudoBind(&forward_bridge_label);
      _ EmitAddress((uint32_t)get_closure_bridge());

      AssemblyCodeChunk *code = nullptr;
      code = AssemblyCodeBuilder::FinalizeFromTurboAssembler(&turbo_assembler_);

      entry->address = (void *)code->raw_instruction_start();
      entry->size = code->raw_instruction_size();
      entry->carry_data = carry_data;
      entry->carry_handler = carry_handler;

      delete code;
      return entry;
    #endif
    }

    void *get_closure_bridge() {
      // if already initialized, just return.
      if (closure_bridge)
        return closure_bridge;

    // check if enable the inline-assembly closure_bridge_template
    #if ENABLE_CLOSURE_BRIDGE_TEMPLATE
      extern void closure_bridge_template();
      closure_bridge = closure_bridge_template;
    // otherwise, use the Assembler build the closure_bridge
    #else
    #define _ turbo_assembler_.
      TurboAssembler turbo_assembler_(0, code_buffer);

      _ sub(sp, sp, Operand(14 * 4));
      _ str(lr, MemOperand(sp, 13 * 4));
      _ str(r12, MemOperand(sp, 12 * 4));
      _ str(r11, MemOperand(sp, 11 * 4));
      _ str(r10, MemOperand(sp, 10 * 4));
      _ str(r9, MemOperand(sp, 9 * 4));
      _ str(r8, MemOperand(sp, 8 * 4));
      _ str(r7, MemOperand(sp, 7 * 4));
      _ str(r6, MemOperand(sp, 6 * 4));
      _ str(r5, MemOperand(sp, 5 * 4));
      _ str(r4, MemOperand(sp, 4 * 4));
      _ str(r3, MemOperand(sp, 3 * 4));
      _ str(r2, MemOperand(sp, 2 * 4));
      _ str(r1, MemOperand(sp, 1 * 4));
      _ str(r0, MemOperand(sp, 0 * 4));

      // store sp
      _ add(r0, sp, Operand(14 * 4));
      _ sub(sp, sp, Operand(8));
      _ str(r0, MemOperand(sp, 4));

      // stack align
      _ sub(sp, sp, Operand(8));

      _ mov(r0, Operand(sp));
      _ mov(r1, Operand(r12));

      _ CallFunction(ExternalReference((void *)intercept_routing_common_bridge_handler));

      // stack align
      _ add(sp, sp, Operand(8));

      // restore sp placeholder stack
      _ add(sp, sp, Operand(8));

      _ ldr(r0, MemOperand(sp, 4, PostIndex));
      _ ldr(r1, MemOperand(sp, 4, PostIndex));
      _ ldr(r2, MemOperand(sp, 4, PostIndex));
      _ ldr(r3, MemOperand(sp, 4, PostIndex));
      _ ldr(r4, MemOperand(sp, 4, PostIndex));
      _ ldr(r5, MemOperand(sp, 4, PostIndex));
      _ ldr(r6, MemOperand(sp, 4, PostIndex));
      _ ldr(r7, MemOperand(sp, 4, PostIndex));
      _ ldr(r8, MemOperand(sp, 4, PostIndex));
      _ ldr(r9, MemOperand(sp, 4, PostIndex));
      _ ldr(r10, MemOperand(sp, 4, PostIndex));
      _ ldr(r11, MemOperand(sp, 4, PostIndex));
      _ ldr(r12, MemOperand(sp, 4, PostIndex));
      _ ldr(lr, MemOperand(sp, 4, PostIndex));

      // auto switch A32 & T32 with `least significant bit`, refer `docs/A32_T32_states_switch.md`
      _ mov(pc, Operand(r12));

      AssemblyCodeChunk *code = AssemblyCodeBuilder::FinalizeFromTurboAssembler(&turbo_assembler_);
      closure_bridge = (void *)code->raw_instruction_start();

      DLOG(0, "[closure bridge] Build the closure bridge at %p", closure_bridge);
    #endif
      return (void *)closure_bridge;
    }

BuildDynamicBinaryInstrumentRouting() also calls GenerateTrampolineBuffer(entry_->target_address, GetTrampolineTarget()). This method generates the trampoline buffer that is used to patch the original instructions, as shown in the second small box of the flowchart.

    bool InterceptRouting::GenerateTrampolineBuffer(void *src, void *dst) {
      CodeBufferBase *trampoline_buffer = NULL;
      // if near branch trampoline plugin enabled
      if (RoutingPluginManager::near_branch_trampoline) {
        RoutingPluginInterface *plugin = NULL;
        plugin = reinterpret_cast<RoutingPluginInterface *>(RoutingPluginManager::near_branch_trampoline);
        if (plugin->GenerateTrampolineBuffer(this, src, dst) == false) {
          DLOG(0, "Failed enable near branch trampoline plugin");
        }
      }

      if (this->GetTrampolineBuffer() == NULL) {
        trampoline_buffer = GenerateNormalTrampolineBuffer((addr_t)src, (addr_t)dst);
        this->SetTrampolineBuffer(trampoline_buffer);

        DLOG(0, "[trampoline] Generate trampoline buffer %p -> %p", src, dst);
      }
      return true;
    }

GenerateRelocatedCode(trampoline_buffer_->getSize())

    bool InterceptRouting::GenerateRelocatedCode(int tramp_size) {
      // generate original code
      AssemblyCodeChunk *origin = NULL;
      origin = AssemblyCodeBuilder::FinalizeFromAddress((addr_t)entry_->target_address, tramp_size);
      origin_ = origin;

      // generate the relocated code
      AssemblyCodeChunk *relocated = NULL;
      relocated = AssemblyCodeBuilder::FinalizeFromAddress(0, 0);
      relocated_ = relocated;

      void *relocate_buffer = NULL;
      relocate_buffer = entry_->target_address;

      GenRelocateCodeAndBranch(relocate_buffer, origin, relocated);
      if (relocated->raw_instruction_start() == 0)
        return false;

      // set the relocated instruction address
      entry_->relocated_origin_instructions = (void *)relocated->raw_instruction_start();
      DLOG(0, "[insn relocate] origin %p - %d", origin->raw_instruction_start(), origin->raw_instruction_size());
      DLOG(0, "[insn relocate] relocated %p - %d", relocated->raw_instruction_start(), relocated->raw_instruction_size());

      // save original prologue
      memcpy((void *)entry_->origin_chunk_.chunk_buffer, (void *)origin->raw_instruction_start(),
             origin->raw_instruction_size());
      entry_->origin_chunk_.chunk.re_init_region_range(origin_);
      return true;
    }

Among them, GenRelocateCodeAndBranch(relocate_buffer, origin, relocated); is the key point, as it generates the relocation code and places it in the address space pointed to by relocated.

    void GenRelocateCodeAndBranch(void *buffer, AssemblyCodeChunk *origin, AssemblyCodeChunk *relocated) {
      CodeBuffer *code_buffer = new CodeBuffer(64);

      ThumbTurboAssembler thumb_turbo_assembler_(0, code_buffer);
    #define thumb_ thumb_turbo_assembler_.
      TurboAssembler arm_turbo_assembler_(0, code_buffer);
    #define arm_ arm_turbo_assembler_.

      Assembler *curr_assembler_ = NULL;

      AssemblyCodeChunk origin_chunk;
      origin_chunk.init_region_range(origin->raw_instruction_start(), origin->raw_instruction_size());

      bool entry_is_thumb = origin->raw_instruction_start() % 2;
      if (entry_is_thumb) {
        origin->re_init_region_range(origin->raw_instruction_start() - THUMB_ADDRESS_FLAG, origin->raw_instruction_size());
      }

      LiteMutableArray relo_map(8);

    relocate_remain:
      addr32_t execute_state_changed_pc = 0;

      bool is_thumb = origin_chunk.raw_instruction_start() % 2;
      if (is_thumb) {
        curr_assembler_ = &thumb_turbo_assembler_;

        buffer = (void *)((addr_t)buffer - THUMB_ADDRESS_FLAG);

        addr32_t origin_code_start_aligned = origin_chunk.raw_instruction_start() - THUMB_ADDRESS_FLAG;
        // remove thumb address flag
        origin_chunk.re_init_region_range(origin_code_start_aligned, origin_chunk.raw_instruction_size());

        gen_thumb_relocate_code(&relo_map, &thumb_turbo_assembler_, buffer, &origin_chunk, relocated,
                                &execute_state_changed_pc);
        if (thumb_turbo_assembler_.GetExecuteState() == ARMExecuteState) {
          // relocate interrupt as execute state changed
          if (execute_state_changed_pc < origin_chunk.raw_instruction_start() + origin_chunk.raw_instruction_size()) {
            // re-init the origin
            int relocate_remain_size =
                origin_chunk.raw_instruction_start() + origin_chunk.raw_instruction_size() - execute_state_changed_pc;
            // current execute state is ARMExecuteState, so not need `+ THUMB_ADDRESS_FLAG`
            origin_chunk.re_init_region_range(execute_state_changed_pc, relocate_remain_size);

            // update buffer
            buffer = (void *)((addr_t)buffer + (execute_state_changed_pc - origin_code_start_aligned));

            // add nop to align ARM
            if (thumb_turbo_assembler_.pc_offset() % 4)
              thumb_turbo_assembler_.t1_nop();
            goto relocate_remain;
          }
        }
      } else {
        curr_assembler_ = &arm_turbo_assembler_;

        gen_arm_relocate_code(&relo_map, &arm_turbo_assembler_, buffer, &origin_chunk, relocated,
                              &execute_state_changed_pc);
        if (arm_turbo_assembler_.GetExecuteState() == ThumbExecuteState) {
          // relocate interrupt as execute state changed
          if (execute_state_changed_pc < origin_chunk.raw_instruction_start() + origin_chunk.raw_instruction_size()) {
            // re-init the origin
            int relocate_remain_size =
                origin_chunk.raw_instruction_start() + origin_chunk.raw_instruction_size() - execute_state_changed_pc;
            // current execute state is ThumbExecuteState, add THUMB_ADDRESS_FLAG
            origin_chunk.re_init_region_range(execute_state_changed_pc + THUMB_ADDRESS_FLAG, relocate_remain_size);

            // update buffer
            buffer = (void *)((addr_t)buffer + (execute_state_changed_pc - origin_chunk.raw_instruction_start()));
            goto relocate_remain;
          }
        }
      }

      // TODO:
      // if last instr is unlink branch, skip
      // dkl: jump back to just after the instrumentation point and continue execution
      addr32_t rest_instr_addr = origin_chunk.raw_instruction_start() + origin_chunk.raw_instruction_size();
      if (curr_assembler_ == &thumb_turbo_assembler_) {
        // Branch to the rest of instructions
        thumb_ AlignThumbNop();
        thumb_ t2_ldr(pc, MemOperand(pc, 0));
        // Get the real branch address
        thumb_ EmitAddress(rest_instr_addr + THUMB_ADDRESS_FLAG);
      } else {
        // Branch to the rest of instructions
        CodeGen codegen(&arm_turbo_assembler_);
        // Get the real branch address
        codegen.LiteralLdrBranch(rest_instr_addr);
      }

      // Realize all the Pseudo-Label-Data
      thumb_turbo_assembler_.RelocBind();

      // Realize all the Pseudo-Label-Data
      // dkl: the ldr instructions that were linked to labels earlier are fixed up here
      arm_turbo_assembler_.RelocBind();

      // Generate executable code
      {
        // assembler without specific memory address
        AssemblyCodeChunk *cchunk;
        cchunk = MemoryArena::AllocateCodeChunk(code_buffer->getSize());
        if (cchunk == nullptr)
          return;

        thumb_turbo_assembler_.SetRealizedAddress(cchunk->address);
        arm_turbo_assembler_.SetRealizedAddress(cchunk->address);

        // fixup the instr branch into trampoline(has been modified)
        reloc_label_fixup(origin, &relo_map, &thumb_turbo_assembler_, &arm_turbo_assembler_);

        AssemblyCodeChunk *code = NULL;
        code = AssemblyCodeBuilder::FinalizeFromTurboAssembler(curr_assembler_);
        relocated->re_init_region_range(code->raw_instruction_start(), code->raw_instruction_size());
        delete code;
      }

      // thumb
      if (entry_is_thumb) {
        // add thumb address flag
        relocated->re_init_region_range(relocated->raw_instruction_start() + THUMB_ADDRESS_FLAG,
                                        relocated->raw_instruction_size());
      }

      // clean
      {
        thumb_turbo_assembler_.ClearCodeBuffer();
        arm_turbo_assembler_.ClearCodeBuffer();

        delete code_buffer;
      }
    }

That is a lot of code, so let’s focus on the instruction relocation part. In our example, the instructions to be relocated are THUMB1 instructions, which end up in the function below. I have omitted the handling of other instructions and only look at cbz, without going into every detail.

The general idea is to jump with ldr pc, [pc, xxx]. When the ldr instruction is first emitted, xxx is only a placeholder. Once all relocated instructions have been generated, these ldr instructions are fixed up, because the addresses they load are stored after all of the instructions.
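The emit-then-fix-up idea can be sketched with a toy code buffer. This is a simplified model, not Dobby's assembler: `TinyThumbBuffer`, `ldr_pc_literal`, `bind_literals`, and `build_example` are hypothetical names. The only encoding it relies on is `ldr.w pc, [pc, #imm12]`, which is the halfword pair 0xF8DF, 0xF000 | imm12.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

class TinyThumbBuffer {
 public:
  void emit16(uint16_t v) {
    code_.push_back(static_cast<uint8_t>(v & 0xff));
    code_.push_back(static_cast<uint8_t>(v >> 8));
  }
  void emit32(uint32_t v) {
    emit16(static_cast<uint16_t>(v & 0xffff));
    emit16(static_cast<uint16_t>(v >> 16));
  }
  void nop() { emit16(0xbf00); }  // 16-bit THUMB nop

  // Emit `ldr.w pc, [pc, #0]` with a placeholder offset and remember where it is.
  void ldr_pc_literal(uint32_t target) {
    if (code_.size() % 4) nop();  // keep the load word-aligned
    pending_.push_back({code_.size(), target});
    emit16(0xf8df);
    emit16(0xf000);
  }

  // Append every pending target after the code and patch the imm12 fields.
  void bind_literals() {
    if (code_.size() % 4) nop();
    for (const Pending &p : pending_) {
      std::size_t literal_pos = code_.size();
      emit32(p.target);
      // Base of a THUMB PC-relative load: (insn + 4) rounded down to 4.
      std::size_t base = (p.insn_pos + 4) & ~static_cast<std::size_t>(3);
      std::size_t imm12 = literal_pos - base;
      assert(imm12 < 4096);
      uint16_t hw2 = static_cast<uint16_t>(0xf000 | imm12);
      code_[p.insn_pos + 2] = static_cast<uint8_t>(hw2 & 0xff);
      code_[p.insn_pos + 3] = static_cast<uint8_t>(hw2 >> 8);
    }
    pending_.clear();
  }

  uint16_t read16(std::size_t pos) const {
    return static_cast<uint16_t>(code_[pos] | (code_[pos + 1] << 8));
  }
  uint32_t read32(std::size_t pos) const {
    return read16(pos) | (static_cast<uint32_t>(read16(pos + 2)) << 16);
  }
  std::size_t size() const { return code_.size(); }

 private:
  struct Pending {
    std::size_t insn_pos;
    uint32_t target;
  };
  std::vector<uint8_t> code_;
  std::vector<Pending> pending_;
};

// Rebuilds the shape of the relocated block shown earlier, using nops as
// stand-ins for the copied instructions.
inline TinyThumbBuffer build_example() {
  TinyThumbBuffer buf;
  while (buf.size() < 0x10) buf.nop();
  buf.ldr_pc_literal(0x9d639d45);  // cbz-taken path, at offset 0x10
  while (buf.size() < 0x20) buf.nop();
  buf.emit16(0xf8df);              // final `ldr.w pc, [pc, #0]` at offset 0x20
  buf.emit16(0xf000);
  buf.emit32(0x9d639d3d);          // its literal directly follows, at 0x24
  buf.bind_literals();             // pending literal lands at 0x28
  return buf;
}
```

In `build_example()`, the pending load at offset 0x10 ends up with imm12 = 0x14, matching the `ldr.w pc, [pc, #0x14]` in the relocated dump, and the literal 0x9d639d45 lands at offset 0x28.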

    static void Thumb1RelocateSingleInstr(ThumbTurboAssembler *turbo_assembler, LiteMutableArray *thumb_labels,
                                          int16_t instr, addr32_t from_pc, addr32_t to_pc,
                                          addr32_t *execute_state_changed_pc_ptr) {
      bool is_instr_relocated = false;

      _ AlignThumbNop();

      uint32_t val = 0, op = 0, rt = 0, rm = 0, rn = 0, rd = 0, shift = 0, cond = 0;
      int32_t offset = 0;

      int32_t op0 = 0, op1 = 0;
      op0 = bits(instr, 10, 15);
      // [F3.2.3 Special data instructions and branch and exchange]
      if (op0 == 0b010001) {
        op0 = bits(instr, 8, 9);
        // [Add, subtract, compare, move (two high registers)]
        if (op0 != 0b11) {
          int rs = bits(instr, 3, 6);
          // rs is PC register
          if (rs == 15) {
            val = from_pc;

            uint16_t rewrite_inst = 0;
            rewrite_inst = (instr & 0xff87) | LeftShift((VOLATILE_REGISTER.code()), 4, 3);

            ThumbRelocLabelEntry *label = new ThumbRelocLabelEntry(val, false);
            _ AppendRelocLabelEntry(label);

            _ T2_Ldr(VOLATILE_REGISTER, label);
            _ EmitInt16(rewrite_inst);

            is_instr_relocated = true;
          }
        }
        // ... (remaining cases omitted in this excerpt)
      }

      // compare branch (cbz, cbnz)
      if ((instr & 0xf500) == 0xb100) {
        uint16_t imm5 = bits(instr, 3, 7);
        uint16_t i = bit(instr, 9);
        uint32_t offset = (i << 6) | (imm5 << 1);
        val = from_pc + offset;
        rn = bits(instr, 0, 2);

        // ThumbTurboAssembler's data_labels_ records every ThumbRelocLabelEntry. Each entry holds the
        // address to jump to and is bound to its jump instruction. Once the target address has been
        // stored in a suitable memory slot later on, the instructions are fixed up together.
        // i.e. before the fix-up: ldr pc, xxx; after the fix-up: ldr pc, [pc, offset],
        // where pc + offset is the memory that holds the jump target.
        ThumbRelocLabelEntry *label = new ThumbRelocLabelEntry(val + 1, true);
        _ AppendRelocLabelEntry(label);

        // imm5 = bits(0x4 >> 1, 1, 5);
        // dkl: fix
        imm5 = bits(0, 1, 5);
        i = bit(0x4 >> 1, 6);

        _ EmitInt16((instr & 0xfd07) | imm5 << 3 | i << 9);
        _ t1_nop(); // manual align
        _ t2_b(0);
        // This label holds the address to jump to, and the jump is done with ldr pc. The label is also
        // bound to the instruction through a PseudoLabelInstruction struct, so all the information
        // needed for the jump is already available. All that remains is to store the target address in
        // a suitable place and fix up the ldr. The fix-up seems to be done in one pass later, in
        // thumb_turbo_assembler_.RelocBind().
        _ T2_Ldr(pc, label);

        is_instr_relocated = true;
      }

      // if the instr do not needed relocate, just rewrite the origin
      if (!is_instr_relocated) {
    #if 0
        if (from_pc % Thumb2_INST_LEN)
          _ t1_nop();
    #endif
        _ EmitInt16(instr);
      }
    }

This concludes the code walkthrough. Instruction relocation is mostly a matter of instruction decoding, which is somewhat tedious.

3. Takeaways

The most significant gain is having a complete experience in reading source code, along with learning some engineering techniques, such as C++ linked list techniques.

First, define a generic linked list head:

Specific data nodes:

The benefit of this approach is that you can traverse the linked list through NodHead pointers alone, and when you need to read the data, you simply convert the NodHead pointer to an EntryNod pointer, since a pointer to a struct is also the address of its first member. This lets you write a generic linked list template that any future list can reuse by adjusting only EntryNod.
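A minimal sketch of this pattern follows. Only the names NodHead and EntryNod come from the text; the fields `address` and `id` and the helpers `list_push_front` and `find_entry` are assumptions made for the illustration, not Dobby's actual definitions.

```cpp
#include <cassert>
#include <cstdint>

// Generic list head: knows nothing about the payload.
struct NodHead {
  NodHead *next;
};

// Concrete node: the head must be the first member so that a NodHead*
// and an EntryNod* refer to the same address.
struct EntryNod {
  NodHead head;
  uint32_t address;  // hypothetical payload: instrumented address
  int id;            // hypothetical payload: hook id
};

// Generic operation: works for any node type that starts with a NodHead.
inline void list_push_front(NodHead *&list, NodHead *node) {
  node->next = list;
  list = node;
}

// Payload-specific lookup: walk via NodHead, convert only to read the data.
inline EntryNod *find_entry(NodHead *list, uint32_t address) {
  for (NodHead *n = list; n != nullptr; n = n->next) {
    EntryNod *entry = reinterpret_cast<EntryNod *>(n);
    if (entry->address == address) {
      return entry;
    }
  }
  return nullptr;
}
```

A lookup like `find_entry` is also the kind of check described earlier for deciding whether an address has already been instrumented.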

The second gain is some classic macro techniques, such as ## for token pasting. Another example is the macro below, which recovers a pointer to the enclosing object (the this pointer of a class) from the class type, the member name, and the member’s address; see the container_of macro (https://blog.csdn.net/lezardfu/article/details/44916167).

    #define offsetof(t, d) __builtin_offsetof(t, d)

    #define container_of(ptr, type, member)                      \
      ({                                                         \
        const __typeof(((type *)0)->member) *__mptr = (ptr);     \
        (type *)((char *)__mptr - offsetof(type, member));       \
      })
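To show what the macro is for, here is a small sketch that uses a portable variant. `CONTAINER_OF`, `ListLink`, `HookRecord`, and `record_from_link` are names invented for this example; the variant drops the GNU statement expression and the type check of the original and relies on the standard `offsetof` from `<cstddef>`.

```cpp
#include <cassert>
#include <cstddef>

// Portable variant of container_of: subtract the member's offset from its address.
#define CONTAINER_OF(ptr, type, member) \
  (reinterpret_cast<type *>(reinterpret_cast<char *>(ptr) - offsetof(type, member)))

struct ListLink {
  ListLink *next;
};

// The link is deliberately NOT the first member, so a plain cast would be wrong.
struct HookRecord {
  int id;
  ListLink link;
};

// Given only a pointer to the embedded link, recover the enclosing record.
inline HookRecord *record_from_link(ListLink *link) {
  return CONTAINER_OF(link, HookRecord, link);
}
```

Unlike the head-first layout of the linked list above, this works wherever the member sits inside the struct.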

Additionally, Dobby has its own memory allocation module, which records each allocation of memory with the same properties. When memory needs to be allocated, it first checks if there is available memory already allocated, thus avoiding frequent memory allocations.

4. Problems Encountered While Using Dobby

I encountered three issues in total. The first was that the target instruction was being executed while it was being patched, which caused errors. There are two fixes: the first is to complete instrumentation as soon as the shared object (so) is loaded; the second is to interrupt the process with an exception and complete instrumentation inside a custom signal handler while the exception is being handled.

I adopted the first method, completing instrumentation as soon as the so is loaded. The loading of the so in Android ultimately occurs through the linker’s do_dlopen function, which calls soinfo* si = find_library(ns, translated_name, flags, extinfo, caller); Here, you can obtain the soinfo pointer, and with the soinfo, you have everything.

So, you only need to hook this function. In fact, in AOSP 10, this function is inline, so I hooked the si->increment_ref_count(); function to get the soinfo pointer.

The second issue is related to mprotect. The instructions to be patched are usually in read-only memory, so mprotect is needed to make them writable. mprotect operates on whole pages, and Dobby changes the permissions of the page that contains the instruction to be instrumented. In most cases, this works without issues.

However, occasionally, the instrumented instruction is located at the bottom of a page, and the patch requires at least 8 bytes, which may span two pages. In such cases, Dobby only modifies one page, so caution is needed.
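One way to guard against this is to compute the page range from both the start and the end of the patch. The sketch below shows the idea; it is not Dobby's code. `PageRange`, `covering_pages`, and `make_patchable` are hypothetical names, and the requested protection (read, write, execute) is an assumption that some systems may refuse.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sys/mman.h>
#include <unistd.h>

struct PageRange {
  uintptr_t start;     // page-aligned start address
  std::size_t length;  // multiple of the page size
};

// Smallest page-aligned range that covers [addr, addr + size).
inline PageRange covering_pages(uintptr_t addr, std::size_t size, std::size_t page_size) {
  assert(size > 0);
  assert(page_size > 0 && (page_size & (page_size - 1)) == 0);  // power of two
  uintptr_t mask = static_cast<uintptr_t>(page_size) - 1;
  uintptr_t start = addr & ~mask;
  uintptr_t end = (addr + size + mask) & ~mask;
  return {start, static_cast<std::size_t>(end - start)};
}

// Make every page touched by the patch writable before copying it in.
inline bool make_patchable(void *addr, std::size_t size) {
  std::size_t page_size = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
  PageRange range = covering_pages(reinterpret_cast<uintptr_t>(addr), size, page_size);
  return mprotect(reinterpret_cast<void *>(range.start), range.length,
                  PROT_READ | PROT_WRITE | PROT_EXEC) == 0;
}
```

With 4 KB pages, a 10-byte patch at 0x9d639d32 stays within one page, while the same patch at 0x9d639ffc yields a two-page range starting at 0x9d639000.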

The third issue is SIGILL, which mainly occurs when instruction relocation generates incorrect assembly and execution jumps to the wrong location. This requires targeted fixes based on the source code, which is also why I looked into Dobby’s source code.

5. Summary

Currently, among reverse engineering tools, IDA is the king of static analysis, while Frida (presumably) is the king of dynamic analysis. However, Frida operates at the function level, which is not granular enough, necessitating the use of Dobby in combination for instruction-level dynamic analysis.

While debuggers can also achieve the goal, they often introduce many other issues. Initially, I used gdb, but encountered numerous problems, such as gdb pausing the process, leading to Android’s broadcast timeout, which killed my process, or accidentally touching the screen, causing a response timeout that also killed my process. Sometimes gdb fails to recognize THUMB instructions, requiring manual mode setting, resulting in a poor experience. However, gdb has a memory breakpoint, which may occasionally be necessary.


KX ID: KerryS

https://bbs.pediy.com/user-home-844633.htm

* This article is an original work by KerryS from the KX forum. Please indicate the source when reprinting from the KX community.
