Research Article Determining the Image Base of ARM Firmware by Matching Function Addresses



Introduction
From mobile phones, smart bands, and smart watches to routers, switches, solid-state disks, and wireless sensors, embedded systems have spread throughout society [1,2]. Recently, there have been many security incidents involving the firmware of embedded systems. For instance, the NSA developed malware that infects hard disk firmware; Stuxnet targets SCADA systems and is believed to be responsible for causing substantial damage to Iran's nuclear program [3,4]; the Heartbleed vulnerability exists in the firmware of embedded systems, and many vendors have released new firmware versions to mitigate it [5]; at Recon BRX 2018, two researchers from Northeastern University reverse engineered the firmware of Xiaomi's IoT devices and found vulnerabilities in the Xiaomi ecosystem [6]. The security of firmware in embedded systems is attracting increasing attention [7][8][9], including firmware emulation [10][11][12], firmware testing [13], and vulnerability discovery [14]. To improve the security of embedded systems, it is necessary to reverse engineer the firmware [15].
When disassembling firmware, disassembly tools such as IDA Pro need to know the processor type and the image base. A correct image base allows the disassembler to establish accurate cross-references, which are important for firmware analysts to understand the firmware. A correct image base also helps to understand the memory layout of the firmware as a whole, whereas a wrong image base causes instructions that reference immediate values to resolve to incorrect addresses [16].
To determine the image base of firmware, researchers have devoted considerable effort, and several manual solutions have been proposed.
Skochinsky [17] proposed a general principle for determining the image base of a file with an unknown format, suggesting that hints such as self-relocating code, initialization code, and string tables can be used.
Basnight et al. [16,18] presented two methods for inferring an image base. The first method uses immediate values in instructions to infer a plausible image base. The second method uses a hardware debugger to halt a programmable logic controller and obtain a memory dump; the image base can then be found by manually analyzing common instruction patterns in the memory dump.
Dacosta et al. [19] noted that, when the case values in a switch-case statement of a C program are sequential and dense, the memory addresses of case are usually stored in a jump table; this fact can be used to infer the memory address of nearby code and eventually obtain the image base.
All the above methods require the intuition and experience of reverse engineers, so their success and effectiveness depend heavily on the human factor. To address this problem, Zhu et al. proposed several methods to automatically determine the image base of ARM-based firmware [20][21][22][23]. However, these methods cannot determine the image base of all types of firmware, and some of them are time-consuming.

Contributions.
According to statistics, about 63% of embedded devices are based on the ARM architecture [24]. Hence, we focus on ARM firmware in this paper. By studying the binary functions in firmware and the way their addresses are loaded, we propose a method for determining the image base of firmware that loads function addresses using the LDR instruction. The method is divided into three steps. The first step is to identify all the binary functions and output their offsets. The second step is to identify all the function addresses in the firmware that might be loaded by LDR instructions. The third step is to determine the image base by matching the binary function offsets against the addresses loaded by LDR instructions. The experimental results indicate that the proposed method is effective for firmware that uses the LDR instruction to load function addresses. The proposed method can improve the efficiency of reverse engineering.

1.2. Roadmap. The rest of this paper is organized as follows. Section 2 introduces the binary function and the method of loading the function address and introduces the FIND-LDR algorithm to identify LDR instructions in the firmware and calculate their loaded addresses. Section 3 introduces the principle of determining the image base and gives the DBMFA (Determining image Base by Matching Function Addresses) algorithm for determining the image base. Section 4 analyzes the experimental results with real firmware as the test set. Finally, this paper is concluded in Section 5.

Binary Functions in Firmware and Their Addresses
2.1. Binary Functions in Firmware. By disassembling a binary file, we obtain binary functions that roughly correspond to functions in a high-level language [25]. When compiling a function, the compiler usually adds instructions at the beginning to create and initialize the stack frame and save registers. These instructions are called the prologue. Similarly, it adds instructions at the end of the function to clear the stack frame and restore registers. These instructions are called the epilogue. A binary function compiled from a function in a high-level language thus consists of a prologue, a body, and an epilogue [26][27][28].
We compile the C source code shown in Figure 1 into a binary file and then disassemble the binary file using IDA Pro; the results are shown in Figure 2. The binary function corresponding to the add function is divided into three parts: its prologue is the STMFD instruction, and its epilogue is the LDMFD instruction.
Based on the above analysis, we wrote an IDA Pro script to obtain all binary functions in a firmware image and output the offsets of their prologues [26]. The script is shown in the appendix.
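The appendix script depends on IDA Pro. As a rough standalone illustration of the prologue idea, the sketch below scans a raw little-endian ARM image for words that look like an `STMFD SP!, {..., LR}` prologue and reports their file offsets. The function name `find_prologue_offsets` and the single-pattern heuristic are assumptions made for illustration; this is not the authors' script, and it will miss functions that use a different prologue.

```python
import struct

def find_prologue_offsets(image: bytes) -> list[int]:
    """Return file offsets of words that look like STMFD SP!, {..., LR}."""
    offsets = []
    # ARM-state instructions are 4 bytes wide and 4-byte aligned.
    for off in range(0, len(image) - 3, 4):
        (word,) = struct.unpack_from("<I", image, off)
        # 0xE92D____ encodes STMFD SP!, {reglist} with condition AL;
        # bit 14 of the register list means LR is saved (assumed heuristic).
        if (word & 0xFFFF4000) == 0xE92D4000:
            offsets.append(off)
    return offsets
```

A dedicated disassembler recovers far more functions than this one pattern, so the sketch is only meant to show what "the offset of a prologue" refers to.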

Binary Function Address Loaded by LDR Instruction.
In ARM-based firmware, the compiler typically uses the LDR instruction to load the function address into the register. The C code snippet that defines a function pointer is shown in Figure 3. And the corresponding disassembly code is shown in Figure 4, where the assignment operation uses LDR instruction to load the function pointer, as shown in the red box. The compiler and disassembler in this article are arm-linux-gcc 4.3.2 and IDA Pro 6.8, respectively.
Wireless Communications and Mobile Computing
Next is an example of LDR instruction loading function address in real firmware. The disassembly code of ABB NETA-21 firmware uImage is shown in Figure 5.
Figure 6: Encoding format of LDR instruction. (a) Thumb state. (b) ARM state.

Algorithm FIND-ARM-LDR. Input: binaryFile. Output: the addresses loaded by LDR instructions in ARM state.
Algorithm FIND-Thumb-LDR. Input: binaryFile. Output: the addresses loaded by LDR instructions in Thumb state.

The loading process of the function address is detailed as follows. Take the LDR instruction at the memory address 0xC0082EEC in Figure 5 as an example; the machine code for this LDR instruction is D0 22 9F E5. Since this firmware is stored in the little-endian format, the actual machine code is E5 9F 22 D0. The format of the LDR instruction used to load an immediate value into a register in ARM state is LDR <Rd>, [PC, #immed_12], and the corresponding instruction encoding format is shown in Figure 6(b) [29]. Analysis of the encoding format of the LDR instruction and the machine code yields $Rd = (0010)_2 = \mathrm{R2}$ and $immed\_12 = (0010\,1101\,0000)_2 = \mathrm{0x2D0}$. The access address of the LDR instruction in the ARM state is $(PC \,\&\, \mathrm{0xFFFFFFFC}) + immed\_12$. Because an ARM processor uses three-stage pipeline technology, the value of PC equals the address of the current instruction plus 8 in the ARM state (i.e., PC = Current + 8). Then, the access address of the LDR instruction is given by

$$((\mathrm{0xC0082EEC} + 8) \,\&\, \mathrm{0xFFFFFFFC}) + \mathrm{0x2D0} = \mathrm{0xC00831C4}.$$

Thus, the address accessed by the LDR instruction is 0xC00831C4, as shown in Figure 5. The four bytes starting at memory address 0xC00831C4 are 00 D0 05 C0. Since the firmware is stored in the little-endian format, the actual value is 0xC005D000, which is, in fact, the address loaded by the LDR instruction. As shown in Figure 5, 0xC005D000 is the entry address of the binary function sub_C005D000.
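The decoding walked through above can be sketched in Python. The function below is a minimal illustration in the spirit of FIND-ARM-LDR, not the published algorithm: the name `find_arm_ldr_addresses` is hypothetical, only instructions with condition code AL and a positive offset (0xE59F____) are matched, and the image base is assumed to be 4-byte aligned so that the PC-relative literal can be located by file offset alone.

```python
import struct

def find_arm_ldr_addresses(image: bytes) -> set[int]:
    """Collect the 32-bit values loaded by LDR Rd, [PC, #immed_12]."""
    loaded = set()
    for off in range(0, len(image) - 3, 4):
        (word,) = struct.unpack_from("<I", image, off)
        # Keep only LDR Rd, [PC, #+immed_12] with condition AL (assumption).
        if (word & 0xFFFF0000) != 0xE59F0000:
            continue
        immed_12 = word & 0xFFF
        # PC reads as current + 8 in ARM state. The literal is PC-relative,
        # so its file offset does not depend on the unknown image base.
        literal_off = ((off + 8) & 0xFFFFFFFC) + immed_12
        if literal_off + 4 <= len(image):
            (value,) = struct.unpack_from("<I", image, literal_off)
            loaded.add(value)
    return loaded
```

Applied to a buffer that contains the bytes D0 22 9F E5 followed, 0x2D0 + 8 bytes later, by 00 D0 05 C0, the function returns a set containing 0xC005D000, matching the worked example.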
The syntax and loading process of the LDR instruction in the Thumb state are similar to those in the ARM state; Figure 6(a) shows the corresponding encoding format.
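A Thumb-state counterpart can be sketched in the same way. The details used here are assumptions taken from the general 16-bit Thumb encoding of `LDR Rd, [PC, #imm8*4]` (top five bits 01001, an 8-bit word offset, and PC read as current + 4), not from the paper's own listing, and the name `find_thumb_ldr_addresses` is hypothetical.

```python
import struct

def find_thumb_ldr_addresses(image: bytes) -> set[int]:
    """Collect the 32-bit values loaded by 16-bit Thumb LDR Rd, [PC, #imm]."""
    loaded = set()
    # Thumb instructions are 2 bytes wide and 2-byte aligned.
    for off in range(0, len(image) - 1, 2):
        (half,) = struct.unpack_from("<H", image, off)
        # Bits 15..11 == 01001 select the PC-relative load (assumed encoding).
        if (half & 0xF800) != 0x4800:
            continue
        imm8 = half & 0xFF
        # PC reads as current + 4 in Thumb state and is word-aligned
        # before the offset (imm8 words) is added.
        literal_off = ((off + 4) & 0xFFFFFFFC) + imm8 * 4
        if literal_off + 4 <= len(image):
            (value,) = struct.unpack_from("<I", image, literal_off)
            loaded.add(value)
    return loaded
```

Because any halfword with the right top bits matches, a scan like this also picks up data that merely resembles an instruction; such spurious values behave like the other non-function addresses in the later matching step.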
Based on the above analysis, we introduce the FIND-Thumb-LDR algorithm and the FIND-ARM-LDR algorithm [23] to scan all LDR instructions in the firmware and output their loaded addresses.
Note that the addresses loaded by LDR instructions are not all function addresses; they may correspond to string addresses, structure addresses, etc. However, the nonfunction entry addresses have no effect on the final determination of image base.

Determination of Image Base
The script described in Section 2.1 can obtain a set of binary function offsets in firmware, and the FIND-LDR algorithm described in Section 2.2 can obtain a set of memory addresses loaded by LDR instructions in firmware. The addresses of some binary functions in the firmware are loaded into registers, usually to assign function pointer variables. If the function addresses are loaded by LDR instructions, some elements in the offset set of the binary functions correspond to some elements in the set of memory addresses loaded by LDR instructions, and this correspondence between the two sets can be used to determine the image base.
The set of binary functions obtained by the script in Section 2.1 is recorded as F, and the corresponding function offsets are recorded as $O = (o_1, o_2, \ldots, o_n)$, where n is the number of binary functions in the firmware. The memory addresses obtained by the FIND-LDR algorithm, after duplicate elements are removed, are recorded as $A = (a_1, a_2, \ldots, a_m)$, where m is the number of addresses in set A.
As shown in Figure 7, if a binary function with offset $o_i$ is loaded at memory location $a_j$ and the image base is base, then

$$o_i + base = a_j. \qquad (1)$$

Therefore, the image base of the firmware is $base = a_j - o_i$.
The addresses of some binary functions in set F are loaded by LDR instructions, and some of the addresses in set A are binary function addresses, so some elements of set O and some elements of set A satisfy formula (1). The memory address base for which the number of pairs from set O and set A satisfying formula (1) is the largest is considered to be the image base of the firmware.
In a 32-bit system, the memory range is large ($0$ to $2^{32}-1$). We could obtain the image base by enumerating every memory address in this range, but that is inefficient. Therefore, we designed an algorithm to calculate the image base efficiently. The main idea of the algorithm is as follows. First, the minimum value of the image base is 0, and the maximum value of the image base in a 32-bit system is 0xFFFFFFFF - fileSize, where fileSize is the size of the binary file. Second, subtract each element $o_i$ in set O from each element $a_j$ in set A. If the difference lies in the range of the image base, i.e., from 0 to 0xFFFFFFFF - fileSize, it is saved; otherwise, it is discarded. Then, count the number of occurrences of each difference, sort the differences in descending order of occurrence count, and output the results.
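Assume base is a particular memory address, and set O and set A contain k pairs of elements that satisfy $o_i + base = a_j$. Subtracting each element of set O from the jth element of set A yields the set

$$(a_j - o_1,\ a_j - o_2,\ \ldots,\ a_j - o_n).$$

Repeating this for every element $a_j$ of set A yields the matrix

$$M = \begin{pmatrix} a_1 - o_1 & a_1 - o_2 & \cdots & a_1 - o_n \\ a_2 - o_1 & a_2 - o_2 & \cdots & a_2 - o_n \\ \vdots & \vdots & \ddots & \vdots \\ a_m - o_1 & a_m - o_2 & \cdots & a_m - o_n \end{pmatrix}.$$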
We have assumed that there are k pairs of elements that satisfy formula (1), but the specific indices of these k pairs are unknown. Suppose $(a_x, o_y)$ is one of the k pairs. The element $M[x, y]$ in the xth row and yth column of matrix M stores the value of $a_x - o_y$, which equals base. We then count the number of occurrences of each element in matrix M, sort the elements in descending order of occurrence count, and output the result. The most frequent element is the candidate image base. The practical significance of the candidate image base is that the largest number of elements in set A and set O correspond to each other when the image base is this memory address; that is, it is the base for which the most binary function addresses in set F are loaded into registers. Based on the above analysis, we propose the Determining image Base by Matching Function Addresses (DBMFA) algorithm. For statistical purposes, the algorithm first initializes all elements of matrix M to -1.
The time complexity of the DBMFA algorithm is $O(n \cdot m)$, where n is the number of binary function offsets and m is the number of addresses in set A.
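The counting step can be sketched compactly in Python. This is an illustrative sketch rather than the authors' implementation: the name `dbmfa` and its parameters are assumptions, and a `collections.Counter` is used in place of the explicit matrix M initialized to -1, since only the occurrence counts of in-range differences are needed.

```python
from collections import Counter

def dbmfa(offsets, addresses, file_size):
    """Rank candidate image bases by how many (offset, address) pairs agree."""
    max_base = 0xFFFFFFFF - file_size  # upper bound on a 32-bit image base
    counts = Counter()
    for a in set(addresses):       # set A, duplicates removed
        for o in set(offsets):     # set O
            base = a - o
            # Differences outside the valid range are discarded.
            if 0 <= base <= max_base:
                counts[base] += 1
    # List of (base, occurrences) pairs, most frequent difference first.
    return counts.most_common()
```

The double loop visits every (address, offset) pair once, which corresponds to the n by m entries of matrix M.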
The first memory address in the algorithm output (that is, the address with the most occurrences) is considered a candidate image base. If there is one and only one candidate image base whose number of occurrences is much greater than those of the other candidates, that candidate is considered to be the correct image base. Otherwise, the output is taken not to contain the correct image base, because the DBMFA algorithm cannot be applied successfully to the binary file.
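The acceptance rule above does not fix a numeric threshold for "much greater". The sketch below shows one possible way to automate it, assuming the ranked output is a list of (base, count) pairs sorted by descending count; the function name `pick_image_base` and the default `ratio` of 3.0 are hypothetical choices, not values from the paper.

```python
def pick_image_base(ranked, ratio=3.0):
    """Return the top candidate if it clearly dominates, else None."""
    if not ranked:
        return None
    best_base, best_count = ranked[0]
    # Occurrence count of the second-best candidate (0 if there is none).
    runner_up = ranked[1][1] if len(ranked) > 1 else 0
    # Require the winner to exceed the runner-up by the chosen factor;
    # max(..., 1) keeps a lone single match from being accepted.
    if best_count >= ratio * max(runner_up, 1):
        return best_base
    return None
```

A flat ranking, where the first two candidates have similar counts, yields `None`, which corresponds to the case in which the method is not applicable to the file.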

Experimental Results and Analysis
Since there is no common test set that can be used in our experiments, we collected firmware images from the Internet for a range of embedded devices, such as digital video cameras, smart watches, MP3 players, solid-state drives, and satellite phones, and created a test set to evaluate the validity of our algorithms. The DBMFA algorithm described above was written in Python. The experiments were performed on a personal computer with an Intel i7-2600 3.40 GHz processor and 18 GB of memory running Microsoft Windows 7 SP1.

Experimental Results.
In the experiment, we applied the method described in Section 2.1 to each firmware in the test set to obtain the binary function offsets and ran the FIND-LDR algorithm to identify the addresses loaded by LDR instructions. Table 1 shows the experimental results. The column "Function" lists the number of binary functions, and the columns "ARM_LDR" and "Thumb_LDR" list the numbers of addresses loaded by LDR instructions that the FIND-LDR algorithm identified in ARM and Thumb state, respectively. The column "ALL_LDR" lists the number of addresses identified by the FIND-LDR algorithm after duplicate elements are removed. The column "Match" lists the number of occurrences of the most frequent element in matrix M, i.e., the number of times the image base appears in matrix M. The column "Base" lists the image base determined by the DBMFA algorithm. The symbol N/A means that the proposed algorithm is not applicable to this firmware; the reasons are discussed in Section 4.3. The column "Time" lists the execution time of the DBMFA algorithm. The results of the DBMFA algorithm are shown in Figure 8(a). As the figure shows, the peak point is X = 0xC0008000, Y = 226. That is, the memory address 0xC0008000 appears 226 times in matrix M, far more often than any other candidate image base. Hence, 0xC0008000 is the image base. In practical terms, 226 pairs of elements satisfy $o_i + base = a_j$ when base is 0xC0008000.
To verify the experimental results, we load the uImage file using IDA Pro and set the processor type to "ARM little-endian" and the image base to 0xC0008000.
Then, IDA Pro can identify most binary functions, and some of the addresses loaded by the LDR instructions point to binary functions and are displayed as function names. This means that the memory address 0xC0008000 is the correct image base, and it also verifies that, of the 7549 binary functions in the firmware, 226 have their addresses loaded into registers via the LDR instruction. Figure 8(d) shows the experimental results obtained for the firmware sample tintin_fw.bin from the Pebble smart watch. There is no sharp peak in the curve, which indicates that the algorithm proposed in this paper does not apply to this file. Figure 8(c) shows the experimental results for SBH52_firmware.bin from the Sony SBH52, which give an image base of 0x8040001. The image base of ARM firmware is normally 4-byte aligned, so why is this image base 0x8040001? The reason is that 1564 LDR addresses in Thumb state are identified in SBH52_firmware.bin, while only 4 LDR addresses in ARM state are identified. When the BX Rm instruction is executed, if the least significant bit (LSB) of the target address is 1, the processor switches to Thumb state; otherwise, it switches to ARM state. In the Thumb case, the instruction is equivalent to the assignment PC = Rm & 0xFFFFFFFE. That is, the actual entry address of the Thumb function in memory is Rm & 0xFFFFFFFE, while the value of the target address is (Rm & 0xFFFFFFFE) + 1. Therefore, all loaded entry addresses of Thumb functions are odd.
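The effect of the Thumb bit on the matching can be illustrated with a small sketch. The helper names `bx_target` and `loaded_thumb_address`, as well as the base 0x10000 and offset 0x100 mentioned afterwards, are hypothetical and chosen only for illustration.

```python
def bx_target(rm: int) -> tuple[str, int]:
    """Model BX Rm: return the resulting state and the address placed in PC."""
    state = "Thumb" if rm & 1 else "ARM"
    # Bit 0 only selects the instruction set; it is not part of the address.
    return state, rm & 0xFFFFFFFE

def loaded_thumb_address(base: int, offset: int) -> int:
    """Value a literal pool would hold for a Thumb function at this offset."""
    # The true entry address is base + offset; the stored pointer has bit 0 set.
    return (base + offset) | 1
```

For a Thumb function at offset 0x100 in an image based at 0x10000, the stored pointer is 0x10101, so subtracting the offset gives 0x10001 rather than 0x10000. When most matched pointers are Thumb pointers, the most frequent difference is therefore the true base plus 1.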
Some of the 1564 LDR addresses mentioned above are Thumb function addresses, each of which is the true entry address plus 1. This results in a difference of 1 between the result and the true image base. The correct image base for the firmware SBH52_firmware.bin should be 0x8040000.

Reasons for Image Base Determination Failures.
From Table 1, we can see that for some firmware, the image base cannot be determined by the proposed method. The reasons are as follows.
(1) Some firmware files are encrypted or compressed. These files must be decrypted or decompressed before applying the proposed methodology.
(2) Due to different coding practices or compilation modes of a compiler, some firmware uses other instructions, such as the ADR instruction, to load binary function addresses. In this case, the addresses identified by the FIND-LDR algorithm contain no binary function addresses. The algorithm proposed in this paper relies on binary function addresses loaded by the LDR instruction, so it is not valid for such firmware, for example, the tintin_fw.bin file of the Pebble smart watch and the wingtip_in.bin file of the Samsung Gear Fit.

Conclusions
Disassembling firmware is a necessary step in analyzing the security of embedded systems. For most firmware of unknown format, the image base cannot be obtained directly, which blocks the disassembly work. This paper studies the prologue of binary functions and the way binary function addresses are loaded in ARM firmware and proposes a method for determining the image base from the binary function offsets in the firmware and the function addresses loaded by the LDR instruction. The experimental results indicate that the proposed method is effective for firmware that loads function addresses using the LDR instruction. For other types of firmware, other automatic methods or manual analysis should be used to determine the image base. In future work, it would be interesting to explore the encoding of ARM instructions and propose image base determination methods for other types of ARM firmware. We will also focus on automatically determining the image base of firmware for other architectures, such as MIPS and PowerPC.