Malware analysis methods

Overview

This section describes the methodology we use to analyze untrusted binaries, taking a binary found on a Linux system as a practical example.

 

Binary analysis has two phases, which can be run independently or together; running them in parallel works best:

- A static analysis, where the binary is disassembled and analyzed “off-line”, without executing it. This phase amounts to “brute-forcing” one's way through the binary; it can be very time-consuming but usually gives the best results. It is most efficient on small binaries, or on small parts of large ones. It is also very useful for finding and removing trapdoors in a binary before executing it.

- A dynamic analysis, where the binary is analyzed on live lab systems. The binary is traced with debuggers and stressed by network traffic generators. Network sniffers running on other machines record the traffic for later analysis.

 

Detailed methodology

· Static analysis

The static analysis is performed with an enhanced, objdump-based tool (which we did not get permission from the company to publish here). The resulting assembly dump is then analyzed manually.

 

Preliminary tests with basic Unix utilities such as file showed us that the binary is statically linked and stripped, which makes the job a bit harder.
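
The two properties reported by file can also be checked by hand. The following is a minimal sketch, assuming a little-endian ELF32 image; the function name describe_elf32 is ours, and the checks are heuristics (no PT_INTERP program header suggests static linking, no SHT_SYMTAB section suggests a stripped binary), not a reimplementation of file.

```python
import struct

PT_INTERP = 3   # program header type: path of the dynamic loader
SHT_SYMTAB = 2  # section header type: full symbol table


def describe_elf32(data: bytes) -> dict:
    """Report whether a little-endian ELF32 image looks static and stripped."""
    if data[:4] != b"\x7fELF" or data[4] != 1 or data[5] != 1:
        raise ValueError("not a little-endian ELF32 image")
    # Fields that follow the 16-byte e_ident block of the ELF32 header.
    (_type, _machine, _version, _entry, phoff, shoff, _flags, _ehsize,
     phentsize, phnum, shentsize, shnum, _shstrndx) = struct.unpack_from(
        "<HHIIIIIHHHHHH", data, 16)
    # p_type is the first 32-bit field of each program header.
    ph_types = [struct.unpack_from("<I", data, phoff + i * phentsize)[0]
                for i in range(phnum)]
    # sh_type is the second 32-bit field of each section header.
    sh_types = [struct.unpack_from("<I", data, shoff + i * shentsize + 4)[0]
                for i in range(shnum)]
    return {
        "static": PT_INTERP not in ph_types,
        "stripped": SHT_SYMTAB not in sh_types,
    }
```

It would be called on the raw file contents, for example describe_elf32(open("the-binary", "rb").read()).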

 

After getting the assembly dump, we first need to simplify the code. Because the binary is statically linked, most of its functions are library routines, which we need to identify and set aside. Since we don’t have symbols, our approach is to determine which libraries were linked into the binary and then, using their source code, put a name on each library function call.

 

We determine that the binary is linked with Linux libc 5.3.12, as shown below (only the first output line matters; the others are libdl diagnostic messages):

[root@redhat nico]# strings the-binary |grep library

@(#) The Linux C library 5.3.12

Cannot exec a shared library directly

Accessing a corrupted shared library

Can not access a needed shared library
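
For readers who prefer to script this step, the same search can be approximated in a few lines. This is a rough sketch of what strings piped into grep does, assuming plain ASCII strings of at least four characters; the helper names printable_strings and grep are ours.

```python
import re


def printable_strings(data: bytes, min_len: int = 4) -> list:
    """Return runs of at least min_len printable ASCII bytes, roughly like strings(1)."""
    pattern = re.compile(rb"[\x20-\x7e\t]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]


def grep(strings: list, needle: str) -> list:
    """Keep only the strings that contain needle."""
    return [s for s in strings if needle in s]
```

Applied to the contents of the-binary with the needle "library", it should list the libc version banner among the matches.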

 

We then grab the source code for Linux libc 5.3.12 and start the name assignment job as follows:

- For each diagnostic message in the binary (found in the strings output), we find the function that uses it in the libc source code. We then rename the function that references the message in the assembly dump accordingly.

- Knowing the C code of a function and its real name, we compare its assembly with its C code and rename each function it calls accordingly.

- We iterate the previous steps as many times as necessary. Each fully cleaned function (that is, one that no longer references any unknown function) is removed, and cross-references to it are renamed accordingly.
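
The iteration above was done by hand, but its logic can be sketched as a fixed-point computation over the call graph. The sketch below is a deliberate simplification and not the tool we used: it assumes that the calls of a function appear in the same order in the assembly and in the C source, and all names (propagate_names, fully_cleaned, the label and call tables) are hypothetical.

```python
def propagate_names(asm_calls, source_calls, seeds):
    """Spread known function names through a call graph until nothing changes.

    asm_calls:    label -> ordered list of callee labels seen in the dump
    source_calls: function name -> ordered list of callee names in its C source
    seeds:        label -> name, e.g. found through a diagnostic string
    """
    names = dict(seeds)
    worklist = list(seeds)
    while worklist:
        label = worklist.pop()
        callees = asm_calls.get(label, [])
        expected = source_calls.get(names[label], [])
        if len(callees) != len(expected):
            continue  # call sequences disagree: leave it for manual review
        for callee_label, callee_name in zip(callees, expected):
            if callee_label not in names:
                names[callee_label] = callee_name
                worklist.append(callee_label)
    return names


def fully_cleaned(asm_calls, names):
    """Labels that are named and call only named functions."""
    return {label for label in names
            if all(c in names for c in asm_calls.get(label, []))}
```

In practice inlining, macros and compiler reordering break the positional assumption, which is why the comparison was done manually.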

 

We also quickly identify system call wrapper functions by looking for int $0x80 instructions in the assembly dump and matching them against the Linux syscall number list found in the usual C include files (/usr/include/asm/unistd.h).
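
This lookup is easy to automate on an objdump-style listing in AT&T syntax. The sketch below assumes that the syscall number is loaded into %eax with an immediate mov shortly before the int $0x80; wrappers that set %eax some other way are reported as unknown. The table holds only a handful of i386 numbers, and the function name find_syscalls is ours.

```python
import re

# A few i386 Linux syscall numbers, as listed in asm/unistd.h.
SYSCALLS = {1: "exit", 2: "fork", 3: "read", 4: "write", 5: "open",
            6: "close", 11: "execve", 102: "socketcall"}

ADDR = re.compile(r"^\s*([0-9a-f]+):")
MOV_EAX = re.compile(r"\bmovl?\s+\$0x([0-9a-f]+),%eax\b")
INT80 = re.compile(r"\bint\s+\$0x80\b")


def find_syscalls(disassembly: str) -> list:
    """Return (address, number, name) for each int $0x80 in the listing."""
    found = []
    eax = None  # last immediate value loaded into %eax, if any
    for line in disassembly.splitlines():
        m = MOV_EAX.search(line)
        if m:
            eax = int(m.group(1), 16)
            continue
        if INT80.search(line):
            addr = ADDR.match(line)
            found.append((addr.group(1) if addr else None,
                          eax, SYSCALLS.get(eax, "unknown")))
            eax = None  # do not reuse the value for a later int $0x80
    return found
```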

 

The types of stack and global variables are then determined from the libc calls that reference them. Since the prototypes of libc functions are known, we are able to type most variables.
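
The idea can be illustrated with a small lookup table. This is only a sketch of the reasoning, assuming that each use of a variable as a call argument has already been noted by hand; the variable labels and the function name infer_types are hypothetical, while the parameter types come from the standard C prototypes.

```python
# Parameter types taken from a few standard C prototypes.
PROTOTYPES = {
    "strcpy": ["char *", "const char *"],
    "fopen": ["const char *", "const char *"],
    "fclose": ["FILE *"],
    "time": ["time_t *"],
}


def infer_types(uses):
    """Map each variable to the parameter types it was passed as.

    uses: iterable of (variable, libc function, zero-based argument index)
    """
    types = {}
    for var, func, idx in uses:
        params = PROTOTYPES.get(func)
        if params is None or idx >= len(params):
            continue  # unknown function or variadic argument: no information
        types.setdefault(var, set()).add(params[idx])
    return types
```

A variable that collects several compatible types, such as char * and const char *, is simply a character buffer; conflicting types point to a mislabeled function or a reused stack slot.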

 

Finally, we look for constructs such as switch(), if(), for() and while() statements, since each usually compiles to the same recognizable assembly pattern. For example, a jump table is a good hint of a switch() statement.
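
The jump table hint can be searched for mechanically. The sketch below assumes an objdump-style listing in AT&T syntax and looks only for the common 32-bit form, an indirect jmp through a table of 4-byte entries indexed by a register; the function name find_jump_tables is ours.

```python
import re

# Matches e.g. "jmp    *0x8049f00(,%eax,4)": table base, index register, scale 4.
JUMP_TABLE = re.compile(r"\bjmp\s+\*0x([0-9a-f]+)\(,%(e[a-z]{2}),4\)")


def find_jump_tables(disassembly: str) -> list:
    """Return (table address, index register) for each indexed indirect jmp."""
    hits = []
    for line in disassembly.splitlines():
        m = JUMP_TABLE.search(line)
        if m:
            hits.append((int(m.group(1), 16), m.group(2)))
    return hits
```

Each hit is only a candidate: the table address still has to be checked in the data section to confirm that it holds code addresses.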

 

This method allowed us to reverse most of the binary and to find out that most of its hard-to-understand parts are actually IP packet assembly routines. We found the encryption/decryption functions used to obfuscate network traffic and produced a decoder. We also determined that the code contains no “trapdoor” that would destroy the binary itself or the host machine if it is not executed as expected. Finally, we determined that there are no anti-debugging tricks, so a dynamic study of this binary is safe and easy.

· Dynamic analysis

We set up a test network composed of two machines connected by a crossover cable. One is the testbox, running a standard RedHat Linux 6.2 installation and hosting the backdoor; the other is a BSD machine running a network sniffer that dumps everything to disk and console. This second machine is also used as a packet generator.

 

GDB is used to instrument the backdoor code. For example, we can see how the backdoor reacts without even generating the proper packets, by modifying the EIP register so that it points wherever we want.
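
One way to make such a redirection repeatable is to generate a small GDB command file. The sketch below only builds the text of that file from two addresses, which are placeholders to be taken from the assembly dump; the function name gdb_redirect_script is ours, and the commands it emits (break, run, set, continue) are standard GDB commands.

```python
def gdb_redirect_script(stop_at: int, resume_at: int) -> str:
    """Build GDB commands that stop at one address and resume at another."""
    return "\n".join([
        f"break *{stop_at:#x}",       # stop before the code we want to skip
        "run",
        f"set $eip = {resume_at:#x}",  # point EIP at the routine to observe
        "continue",
    ]) + "\n"
```

The resulting text can be saved to a file and passed to GDB with its -x option.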

 

In practice, the dynamic analysis was used little, mainly to understand obscure parts of the assembly.