Initial testing
As a first step in analyzing the binary, we set aside an unused lab
machine and installed VMware on it. In the virtual machine we then installed
tripwire to be able to monitor any changes to system or log files. After
the virtual machine was all configured and "the-binary" had been transferred,
the hard disk mode was changed from persistent to non-persistent, which
would enable us to always start up the system with the initial configuration.
We then disconnected the lab machine from our network and started the binary.
After typing "the-binary" at the command prompt as user root, the prompt
returned immediately. A "ps -aux" revealed a new process,
"[mingetty]" that was running on the system. A "ps -auxc" actually showed
that process running as "the-binary". A "netstat -a" showed a new open raw
IP socket listening, using the Network Voice Protocol (NVP), a transport
layer IP protocol. An analysis of the tripwire logs showed that no system
files or logs had been modified.
We repeated the startup of the binary, this time also running tcpdump
on the host machine. As no network traffic could be observed immediately
after the binary was started, we let the virtual machine run for 24
hours and recorded all network traffic. As there wasn't any, we decided to
move on to static code analysis of the binary.
Static Analysis
Jim did the first testing of the binary:
We used the Linux 'file' command to determine characteristics for the-binary.
Here is the output of this command:
$ file the-binary
the-binary: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV),
statically linked, stripped
Given that the-binary was in ELF format, we next set out to determine
which system calls were being made. We first used 'objdump' to generate
an assembly code listing, then wrote a Perl script to find the type and location
of the system calls. The script operated by maintaining the current state
of the eax register and looking for 'int $0x80' (Linux system call trap)
instructions. The value present in the eax register at the time of the instruction
is the index into the system call table as defined in /usr/include/asm/unistd.h.
The script showed the following system calls were being made:
80480b4: 0x88 personality
8048105: 0x1 exit
8056a11: 0x72 wait4
8056a54: 0x66 socketcall
8056a9c: 0x66 socketcall
8056ae4: 0x66 socketcall
8056b26: 0x66 socketcall
8056b72: 0x66 socketcall
8056bcc: 0x66 socketcall
8056c1e: 0x66 socketcall
8056c78: 0x66 socketcall
8056cd1: 0x66 socketcall
8056d1c: 0x66 socketcall
8057140: 0xc chdir
805716c: 0x6 close
805716c: 0x6 close
805719b: 0x3f dup2
80571ca: 0xb execve
80571f0: 0x2 fork
8057214: 0x31 geteuid
8057238: 0x14 getpid
8057263: 0x4e gettimeofday
8057292: 0x36 ioctl
80572bf: 0x25 kill
80572ee: 0x5 open
805731e: 0x3 read
8057344: 0x42 setsid
8057372: 0x7e sigprocmask
805739c: 0x7a uname
80573c8: 0xa unlink
80573fa: 0x4 write
8057424: 0x1b alarm
8057450: 0xd time
8057482: 0x92 writev
80574ac: 0x52 select
80574f7: 0x43 sigaction
8057530: 0x48 sigsuspend
8057560: 0x1 exit
8065d23: 0x5a mmap
8065d65: 0x6a stat
8065da1: 0x6c fstat
8066106: 0x37 fcntl
8066136: 0x13 lseek
8066163: 0x5b munmap
8066192: 0x91 readv
80661c6: 0xa3 mremap
8066206: 0x2d brk
8066244: 0x2d brk
With the information that the binary was statically linked and the location
of the system calls, we next began to look for the libraries used in creating
the-binary.
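The register-tracking approach described above can be approximated in a few lines. The following Python sketch is not the Perl script we used. It assumes the tab-separated layout printed by 'objdump -d' (address, opcode bytes, instruction), tracks only immediate loads into eax, and carries a deliberately tiny excerpt of the system call table. The sample listing lines and their addresses are made up for illustration.

```python
import re

# Tiny excerpt of the i386 Linux system call table (see asm/unistd.h).
SYSCALL_NAMES = {0x01: "exit", 0x02: "fork", 0x03: "read",
                 0x04: "write", 0x66: "socketcall"}

MOV_IMM_EAX = re.compile(r"^movl?\s+\$0x([0-9a-f]+),%eax$")
XOR_EAX = re.compile(r"^xorl?\s+%eax,%eax$")
INT80 = re.compile(r"^int\s+\$0x80$")
WRITES_EAX = re.compile(r",%eax$")


def find_syscalls(listing_lines):
    """Return (address, number, name) for each 'int $0x80' whose eax is known."""
    eax = None          # None means "value of eax is not known here"
    found = []
    for line in listing_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue    # labels, blank lines, opcode continuation lines
        addr = fields[0].strip().rstrip(":")
        instr = fields[2].strip()
        m = MOV_IMM_EAX.match(instr)
        if m:
            eax = int(m.group(1), 16)
        elif XOR_EAX.match(instr):
            eax = 0
        elif INT80.match(instr):
            if eax is not None:
                found.append((addr, eax, SYSCALL_NAMES.get(eax, "unknown")))
            eax = None  # the kernel returns its result in eax
        elif WRITES_EAX.search(instr):
            eax = None  # eax overwritten with something we do not track
    return found


# Hypothetical listing excerpt, for illustration only.
SAMPLE = [
    " 1000:\tb8 66 00 00 00       \tmov    $0x66,%eax",
    " 1005:\tbb 05 00 00 00       \tmov    $0x5,%ebx",
    " 100a:\tcd 80                \tint    $0x80",
    " 100c:\t89 d0                \tmov    %edx,%eax",
    " 100e:\tcd 80                \tint    $0x80",
]

for addr, number, name in find_syscalls(SAMPLE):
    print(f"{addr}: {number:#x} {name}")
```

The second trap in the sample is skipped because eax was loaded from another register. A script like this only finds calls whose number is loaded as a constant shortly before the trap, which is the pattern the libc wrappers follow.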
Florian then continued the reverse engineering process:
Rajeev had been looking for a free decompiler we could use and suggested
the Reverse Engineering Compiler (REC), which was available
for Linux. I downloaded the decompiler
and ran it on "the-binary" with the default settings. The result of
the decompilation was the file "the-binary.rec".
A look at the decompilation quickly showed that this was only a small improvement
over the assembly code. As the binary had been stripped of all its symbols,
all variable and function names still looked very assembly-like. However,
it was now much easier to follow the control flow of the code. A brief examination
of the code revealed that functions as well as global variables were named
after their absolute address in the assembly code, prefixed with "L0"
(functions) or "*L0" (variables). Examples:
L08048088()   a function, such as main()
*L0806D228    a global variable, such as environ
0x0806D228    address of a variable (&environ)
Local variables are allocated from the stack and therefore are denoted
as an offset from the base pointer (ebp), so the assignment
*(ebp + -17616) = ebp + -2048;
makes ebp + -17616 a pointer that holds the address of some variable that
starts at ebp + -2048.
Function parameter names start with A8, and the hexadecimal number in the name
increases by 4 for each further parameter (Ac, A10, A14, A18, A1c, ...). Furthermore,
there are also local variable names such as Vffffffbc, which also seems to
be some sort of offset from the stack pointer, written as a 32-bit two's
complement number (0xffffffbc is -68). The size of variables can
only be determined by looking at context and neighboring values. For example,
there are variables
ebp + -2048
ebp + -4096
ebp + -4536
without any values in between those. Thus ebp + -4096 is 2048 bytes and
ebp + -4536 is 440 bytes. However, even though we have
*(ebp + -17616) = ebp + -2048;
*(ebp + -17620) = ebp + -2028;
*(ebp + -17624) = ebp + -2026;
I concluded later on that ebp + -2048 is a buffer of 2048 bytes and that
ebp + -17620 and ebp + -17624 are merely pointers into that buffer.
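The size estimate used above is simply the gap between neighboring offsets. As a minimal Python sketch (the helper names are my own, and treating offset 0 as the top of the local area is an assumption), the arithmetic looks like this:

```python
def signed32(value):
    """Interpret a 32-bit unsigned value as a two's complement signed integer."""
    return value - (1 << 32) if value & 0x80000000 else value


def infer_sizes(offsets, frame_top=0):
    """Map each ebp-relative offset to the gap up to the next higher offset.

    The result is only an upper bound on the size of each variable: an
    offset that is really a pointer into a larger buffer would split that
    buffer into two apparent variables.
    """
    ordered = sorted(offsets)                 # lowest address first
    uppers = ordered[1:] + [frame_top]        # next higher offset for each entry
    return {low: high - low for low, high in zip(ordered, uppers)}


print(signed32(0xFFFFFFBC))                   # offset encoded in the name Vffffffbc
print(infer_sizes([-2048, -4096, -4536]))
```

Feeding in -2028 and -2026 as well would report a 20-byte and a 2-byte variable, which is exactly the trap described above: those two offsets are pointers into the 2048-byte buffer, not separate variables.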
As Jim had prepared a list of system call addresses from the assembly file,
I decided to start putting those into the code. Basically, wherever
a function contained the line
asm("int 0x80");
it was likely to be a system call wrapper. For example, the function
L08056A2C(A8, Ac, A10)
/* unknown */ void A8;
/* unknown */ void Ac;
/* unknown */ void A10;
{
    /* unknown */ void ebx;
    /* unknown */ void Vfffffff4;
    /* unknown */ void Vfffffff8;
    /* unknown */ void Vfffffffc;

    Vfffffff4 = A8;
    Vfffffff8 = Ac;
    Vfffffffc = A10;
    ecx = & Vfffffff4;
    eax = 102;
    ebx = 5;
    asm("int 0x80");
    edx = eax;
    if(edx < 0) {
        *L08078B14 = ~edx;
        edx = -1;
    }
    return(edx);
}
invokes system call 102 (0x66) with the value 5 in ebx, its first argument. System
call 0x66 is the "socketcall" system call, and a "man 2 socketcall" revealed
that the first argument is the call number. A look at <linux/net.h>
(I actually did a "find . | xargs grep -d skip LISTEN" in the "/usr/include"
directory to find the correct header file) revealed that SYS_ACCEPT was
equal to 5. Thus I could conclude that the above function was the "accept"
wrapper. I then replaced all invocations of "L08056A2C" with accept.
I proceeded like that with all the system calls from Jim's list. If applicable,
I also replaced integer numbers with the constant names:
(save)11;
(save)3;
(save)2;
L08056CF4();
became
socket(AF_INET, SOCK_RAW, NVP);
where NVP is not really a constant, but I put it in for better readability.
A complete mapping for the system call functions can be found in the file
system_functions.txt.
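The two lookups described above (socketcall number to function name, and integer arguments to constant names) are mechanical. The Python sketch below illustrates them. The call numbers are those from <linux/net.h>, the other tables are small excerpts, and the function names are my own. As in the text, "NVP" stands in for IP protocol number 11 and is not a real constant.

```python
# Call numbers passed in ebx to the socketcall system call (linux/net.h).
SOCKETCALL = {
    1: "socket", 2: "bind", 3: "connect", 4: "listen", 5: "accept",
    6: "getsockname", 7: "getpeername", 8: "socketpair", 9: "send",
    10: "recv", 11: "sendto", 12: "recvfrom", 13: "shutdown",
    14: "setsockopt", 15: "getsockopt", 16: "sendmsg", 17: "recvmsg",
}

# Excerpts only; values as defined on Linux/i386.
ADDRESS_FAMILIES = {2: "AF_INET"}
SOCKET_TYPES = {1: "SOCK_STREAM", 2: "SOCK_DGRAM", 3: "SOCK_RAW"}
IP_PROTOCOLS = {1: "ICMP", 6: "TCP", 11: "NVP", 17: "UDP"}


def name_wrapper(eax, ebx):
    """Name a wrapper function from the register values set before int 0x80."""
    if eax != 0x66:
        raise ValueError("not a socketcall wrapper")
    return SOCKETCALL.get(ebx, f"socketcall_{ebx}")


def render_socket_call(saved):
    """Render socket() from values in the order of the (save) statements.

    Arguments are pushed right to left, so the first saved value is the
    last parameter.
    """
    protocol, sock_type, domain = saved
    return "socket({}, {}, {});".format(
        ADDRESS_FAMILIES.get(domain, domain),
        SOCKET_TYPES.get(sock_type, sock_type),
        IP_PROTOCOLS.get(protocol, protocol),
    )


print(name_wrapper(102, 5))
print(render_socket_call([11, 3, 2]))
```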
After putting every system call in, the code looked a little better (see
file decompile_with_syscalls.c), but it was still too early to really learn
anything from it.
Since the binary had been statically linked, large parts of the standard
C library were likely to be included in it. Jim and I agreed that the next
step would be to identify the functions from the standard library. Once identified,
they could be properly named and their code be removed from the code file.
As I had noticed earlier, some of the functions never got called, so I wrote
a simple Perl script that counted the occurrences of each function
in the code (proc_check). Another script removed the functions that I listed
in a file from the code (pruneit). As the removal of a function could leave
other functions without any callers, this process had to be repeated
until no more "dead" functions could be pruned. This technique
reduced the number of lines from 37,228 to 22,894.
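The iterate-until-stable pruning can be sketched as follows. This is a Python illustration rather than the proc_check and pruneit scripts themselves. It assumes the call relationships have already been extracted into a dictionary, and the function names in the example are hypothetical.

```python
def prune_dead(calls, roots):
    """Repeatedly drop functions that nothing else calls.

    calls: dict mapping each function name to the set of functions it calls.
    roots: names that must be kept even without callers (e.g. the entry point).
    Returns a new dict containing only the surviving functions.
    """
    live = {name: set(callees) for name, callees in calls.items()}
    while True:
        called = set(roots)
        for caller, callees in live.items():
            # A function calling itself does not keep itself alive.
            called.update(c for c in callees if c != caller)
        dead = [name for name in live if name not in called]
        if not dead:
            return live
        for name in dead:
            del live[name]     # may orphan further functions on the next pass


# Hypothetical call graph: c is never called, and d is only called by c.
graph = {
    "main": {"a"},
    "a": {"b"},
    "b": set(),
    "c": {"d"},
    "d": set(),
    "e": {"e"},
}
print(sorted(prune_dead(graph, roots={"main"})))
```

Like simple occurrence counting, this does not remove groups of dead functions that only call each other. A reachability search from the entry point would catch those as well.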
There are several ways to identify functions from the standard library.
Given the library's source code, one can try to identify functions based on
other functions they call, strings they contain, or constants they contain.
Another method is to look at the context where a given function is called
and then make an educated guess as to what function it might be and then compare
the function's code with the library's source code. The easiest way is definitely
to use plaintext strings that are contained in the functions. However, I had
started out using the glibc-2.2.5 library source code as the base for my comparison,
and many of the strings I found in the decompiled code were nowhere to be found in it.
My first suspicion was that some library other than the standard C
library had been compiled in as well. I also had a very hard time matching up
the code, as the glibc source code contains plenty of macros and #ifdefs.
Fortunately, Jim pointed out that the binary was compiled using the libc-5.3.12
library. I downloaded the source code (from ftp.linux.org.uk/pub/linux/libc/).
Suddenly, my work got much easier. Once a function had been identified,
I first checked if it was calling other, unidentified functions, noted what
they were and then replaced the "L080..." name with the proper one for all
newly discovered functions. Example:
L0804F620 is the fopen function. I found this out by searching for the
string "/etc/resolv.conf" in the library source code. The string itself
appeared in function L0804D744 as
eax = L0804F620("/etc/resolv.conf", "r");
A 'find . | xargs grep -d skip "/etc/resolv.conf"' in the root directory
of the library source code didn't give any exact matches, but the macro _PATH_RESCONF
was defined as that string. The same kind of search for that macro name
then revealed the following line, the only one that matches the above:
./inet/res_init.c: if ((fp = fopen(_PATH_RESCONF, "r")) != NULL) {
Thus L0804D744 had to be res_init and L0804F620 fopen. To show the degree
of code similarity, here is the code for L0804F620 and for fopen to compare.
The res_init function looks equally similar to its L0804D744 counterpart
and from there more functions, such as fgets and strncpy can be derived.
From the fopen function we can then derive the malloc (L0805BD74) and free
(L0805C290) calls, and so forth.
L0804F620 from the decompile:

L0804F620(A8, Ac)
/* unknown */ void A8;
/* unknown */ void Ac;
{
    /* unknown */ void ebx;

    ebx = L0805BD74(84);
    if(ebx == 0) {
        eax = 0;
    } else {
        (save)0;
        (save)ebx;
        L08061F34();
        *(ebx + 80) = 0x807902c;
        (save)ebx;
        L08060D24();
        (save)Ac;
        (save)A8;
        (save)ebx;
        esp = esp + 24;
        if(L08060E20() == 0) {
            L08061788();
            L0805C290(ebx, ebx);
            eax = 0;
        } else {
            eax = ebx;
        }
    }
}

fopen from libc-5.3.12:

_IO_FILE *
DEFUN(_IO_fopen, (filename, mode), const char *filename AND const char *mode)
{
    struct _IO_FILE_plus *fp = (struct _IO_FILE_plus*)malloc(sizeof(struct _IO_FILE_plus));
    if (fp == NULL)
        return NULL;
    _IO_init(&fp->file, 0);
    _IO_JUMPS(&fp->file) = &_IO_file_jumps;
    _IO_file_init(&fp->file);
#if !_IO_UNIFIED_JUMPTABLES
    fp->vtable = NULL;
#endif
    if (_IO_file_fopen(&fp->file, filename, mode) != NULL)
        return (_IO_FILE*)fp;
    _IO_un_link(&fp->file);
    free (fp);
    return NULL;
}
weak_alias (_IO_fopen, fopen);

Comparison of the L0804F620 function
from the decompile with the fopen function from libc-5.3.12
Comparing the two code snippets, you might notice that there is a discrepancy
with the parameters of the functions that are called. We have:
L08061788();
L0805C290(ebx, ebx);
but
_IO_un_link(&fp->file);
free (fp);
This is one of a few decompiler glitches. The assembly code for this looks
like this:
804f664: 53 push %ebx
804f665: e8 1e 21 01 00 call 0x8061788
804f66a: 53 push %ebx
804f66b: e8 20 cc 00 00 call 0x805c290
but for some reason, the decompiler associates the first ebx with the second
function call. Once aware of this, I could quickly identify those glitches
and rectify them. Sometimes, parameters were missing as well, but a look
at the assembly code always cleared up the confusion.
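When checking such a spot against the assembly, it helps to confirm which function each call instruction reaches. A near call (opcode e8) stores a signed 32-bit displacement relative to the address of the next instruction. The short Python sketch below (the helper name is my own) decodes the two calls from the listing above.

```python
import struct


def call_target(address, raw):
    """Return the absolute target of a 5-byte 'call rel32' instruction.

    address: address of the call instruction itself.
    raw:     the instruction bytes, starting with opcode 0xe8.
    """
    if len(raw) < 5 or raw[0] != 0xE8:
        raise ValueError("not a call rel32 instruction")
    (displacement,) = struct.unpack("<i", raw[1:5])   # little-endian, signed
    # The displacement is relative to the end of the 5-byte instruction.
    return (address + 5 + displacement) & 0xFFFFFFFF


print(hex(call_target(0x804F665, bytes.fromhex("e81e210100"))))
print(hex(call_target(0x804F66B, bytes.fromhex("e820cc0000"))))
```

Each call is preceded by exactly one push, so each function receives a single argument, which matches the library source.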
For some reason, the decompiler also can't handle the modulus operator
if one of the operands is a function result. This results in code like:
rand();
ecx = 10;
asm("cdq");
edi = ecx / ecx % ecx / ecx;
The assembly code looks like this:
8048440: e8 13 dc 00 00 call 0x8056058
8048445: b9 0a 00 00 00 mov $0xa,%ecx
804844a: 99 cltd
804844b: f7 f9 idiv %ecx,%eax
804844d: 89 d7 mov %edx,%edi
so the code should read:
edi = rand() % 10;
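The reading of the instruction sequence can be double-checked with a small model of cltd (cdq) followed by idiv: the first sign-extends eax into edx:eax, the second leaves the quotient in eax and the remainder in edx, truncating toward zero as C does. The Python sketch below is such a model, not an emulator, and the function names are my own.

```python
def to_signed32(value):
    """Interpret the low 32 bits of value as a signed integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def cdq_idiv(eax, divisor):
    """Model 'cltd; idiv divisor' and return (eax, edx) as signed values."""
    dividend = to_signed32(eax)        # cltd: edx:eax = sign-extended eax
    divisor = to_signed32(divisor)
    quotient = abs(dividend) // abs(divisor)
    if (dividend < 0) != (divisor < 0):
        quotient = -quotient           # truncate toward zero, unlike Python's //
    remainder = dividend - quotient * divisor
    return quotient, remainder


# With ecx = 10, edx (copied to edi) holds the value of "eax % 10" in C.
print(cdq_idiv(1234567, 10))
print(cdq_idiv(-17, 10))
```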
The identification of the standard C library calls and the removal of their
code was a long and tedious task. After I had identified all that I could
(basically, there were no more distinguishing strings, constants, function
calls or context left), I pruned the code of "dead" functions once again,
and the resulting file (decompile_with_syscalls.c) was down to 4217 lines
of code.
My next task was to translate the C code that was left into a more readable
form. Hence I went through the code, starting at the entry point, and rewrote
most of it. For the most part, the "original" code was left as a comment below
the rewritten version.
Ben did an analysis of what was going on at startup and he concluded that
this was probably standard system initialization and that function L08048134
was "main", so I started my analysis there. The biggest challenge was understanding
how the variables are used and giving them proper names. Here is a mapping
of the most important variables:
char buffer[2048]            : *(ebp + -17616) = ebp + -2048;
char buffer2[2048]           : *(ebp + -17632) = ebp + -4096;
char buffer3[440]            : *(ebp + -17636) = ebp + -4536;
unsigned char r              : *(ebp + -17648);
int offset                   : *(ebp + -17644);
FILE fstream                 : *(ebp + -17628);
char *buffer4                : *(ebp + -17640);   // turned out to be a pointer
char buffer5[504]            : ebp + -17596;
struct sockaddr_in cli_sock  : ebp + -4568;
char buffer6[19]             : ebp + -17340;
The final version of the interpretation can be found in the file decompile_final.c.
It is a result of reading C code, reading up on network programming literature,
and plenty of assistance from Ben and Jim. This is not working C code, but
a person familiar with C and UNIX network programming shouldn't have any
trouble following it. While interpreting, I found a few other functions from
the standard C library (such as inet_addr) and removed their code. I wasn't
able to identify two functions that actually get called in the code.
I named them precise_sleep and signal_action, but their purpose should be
clear.
I did not interpret the function I named more_udp_stuff (an initial name
I gave it that I never changed), as its functionality is the same as that of dos_dns_udp
with the additional option of specifying a destination address.
The function dos_dns_udp sends DNS requests with a spoofed
source IP address to a destination address and can therefore be used as a reflector
DoS client. During its analysis, I discovered that the function reads data
from the read-only data section of the binary starting at address 0x8067698.
These turned out to be buffer lengths followed by DNS query packets. I wrote
a Perl script to extract the data (dns_extract); the data for each packet is given
in commented pseudo C code in the file dns_data.c.
Furthermore, for the destination of the DNS packets, a list of IP addresses
is used that resides in the .data portion of the binary starting at address
0x806d22c (in the .asm file). It seems a random address is picked from the
first 8000 entries of that list. The list itself, however, is larger than
that. Again, I wrote a Perl script that extracted those addresses (ip_extract).
The first 8000 are contained in the file ip_addresses.txt.
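An extraction of this kind might look like the following Python sketch. It is not the ip_extract script, and it rests on assumptions that are not stated above: that each list entry is a raw 4-byte IPv4 address in network byte order, and that the bytes starting at the list's position in the file have already been read into a buffer. The sample bytes are documentation addresses, not data from the binary.

```python
import random
import socket


def extract_ipv4(blob, count):
    """Decode `count` consecutive 4-byte IPv4 addresses from a byte buffer.

    Assumes raw addresses in network byte order (an assumption about the
    table layout, see the note above).
    """
    if len(blob) < 4 * count:
        raise ValueError("buffer too short for the requested number of entries")
    return [socket.inet_ntoa(blob[i * 4:i * 4 + 4]) for i in range(count)]


def pick_target(addresses, limit=8000, rng=random):
    """Pick a random entry from the first `limit` addresses of the list."""
    return addresses[rng.randrange(min(limit, len(addresses)))]


# Illustrative data only (RFC 5737 documentation addresses).
sample = bytes([192, 0, 2, 1, 198, 51, 100, 7, 203, 0, 113, 9])
addresses = extract_ipv4(sample, 3)
print(addresses)
print(pick_target(addresses))
```

Mapping the virtual address 0x806d22c to a file offset requires the section headers of the binary and is left out here.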
This concludes the analysis portion. The answers to the questions were
derived from looking at the C code that we reverse-engineered.
Tools used for reverse engineering
gdb
Reverse Engineering Compiler (REC)
less
find
grep
man
Other resources
"Unix Network Programming Vol. 1", W. Richard Stevens, Prentice Hall, 1998
"TCP/IP Illustrated Vol. 1", W. Richard Stevens, Addison Wesley, 1994
"Advanced Programming in the UNIX Environment", W. Richard Stevens,
Addison Wesley, 1993
Intel i386 instruction manual
libc-5.3.12 source code