Monday, April 30, 2012

IOCCC: C code obfuscation at its best!


Finally the International Obfuscated C Code Contest is back again and this month, the 14 winning entries were announced! You can check their website and give a look to the code: www.ioccc.org.

In particular, there was one winning entry that made me really curious: the one liner by Taketo Konno. Reading the code I noticed some interesting tricks and I began to think two things: that this was a very cool entry and that it deserves to be analyzed as well :)

First of all, you have to download the C source code and the respective make file, put them in the same directory and type "make konno" to build the executable. Then, type the following: ./konno qwerty

And you will get something like this:

o o o o o o . . . .
 . . . . . . . . .
  . . . . . . .

What is it? What does this remind you of?
Try with different inputs if you have no ideas, but remember: all the inputs must be in same form "./konno just_a_single_argument_of_lower_case_letters".

Ok, here we are, after some attempt you should have realized what the output is… that is… the visual representation of the keystrokes!

Well, with this little hint we are ready to begin our analysis. The first step consists of re-shaping the code, in order to make it more clear and easy to understand:

main(_,l)
char **l;
{
  6 * putchar
  (
    (--_ % 20) ?
      (_ + _ / 21 & 56 > _) ?
        strchr(1[l], _ ^ "pt`u}rxf~c{wk~zyHHOJ]QULGQ[Z" [_/2])?  
        111 : 46
      :32
    :10
  )
  ^ _ && main(2 + _, l);
}


Let's start by observing the structure of the code: it is a single line of code, where the function "main" performs a recursion

6*putchar(…) ^ _ && main(2 + _, l);

The function takes the parameters "_" and "l" respectively as the standard "argc" and "argv" and the recursion continues until the expression "6*putchar(…) ^ _ && main()" holds true.

Pay attention in identifying "_" with "argc", as this allow us to assume that its initialization value is 2.

Now, if you give a deeper look at the source-code, you will recognize the following construct:


(exp) ? true : false;

 
appearing three time (nested). So, this whole code is one big loop, and the nested operators, that are all inside the "putchar", must control the output of chars in every iteration. 


Let's start from the first ternary operator:

(--_ % 20) ? ... :10

this one, when false, will return the value 10 (0x0A in hex) which is the ascii value for "\n", (it will cause the cursor to go return on the following line). How does it work exactly?

The variable "_" is actually used as a counter for the main big loop, and it is incremented at every iteration because of the recursion ( main(2 + _, l) ) and because the "--" in the above expression. Every time the iterator is a multiple of 20, the condition of the ternary operator will evaulate to false, causing the code to output a "\n" char.

This handles the division of the output in lines, but what about the rest? Well, we can now look at what happens when the above ternary operator evaluates to true (that is, in every iteration where the counter is not a multiple of 20):
   
(_ + _ / 21 & 56 > _) ? ... : 32
    
mmm, not easy to follow all these arithmetic passages. Let's rewrite this line including parenthesis to highligh the operators precedence:
  
( (_ + (_ / 21)) & (56 > _) ) ? ... : 32

ha! Much better! When this evaluates to false, the space character is printed on screen (32 decimal = 0x20 hex, the ascii code for space).

We know from the output that the code will alternate spaces to other characters ("." or "o"), so the above condition must evaluate to false at alternate values of the counter.
First of all, 56 is the upper limit for the characters to be printed. The expression "56 > _" is always true for all iterations (can you guess what happens when the counter reaches value 56? :P).

The other part of the formula, instead, ("(_ + (_ / 21))") is more interesting: the code adds to the counter the value of the division between the counter and the number 21.

Let's explain it better: counter / 21 will result in 0 for the first 20 iterations, in 1 for the second 21, and in 2 for the last ones. The result of this sum is then "and-ed" with the value 1 (that is, the "true" resulting from "56 > _ "). 

So, in the first line, the counter is left unchanged, and all the even values of the counter and-ed with 1 will return false. Why? Because the number 1 in binary is... well... 1 :).

Every multiple of two, instead (that is, every even value of the counter), once translated in binary will correspond to a number that has the least significant bit set to 0. This means that in every iteration where the counter is even the "and" operation will be something like this:
   
bbbbb0  and
000001

where "b" is a bit that can be either 0 or 1. Of course the result of this and will always be false. On the other hand, it will always be true for odd values of the counter (because all odd values have the least significant bit set to 1).

BUT! This is only valid for the first line of the output! In the second line the expression "_ / 21" results in 1, which is then added to the counter.
So, the whole thing now works the opposite way: when the counter is even, it is incremented by one and it becomes odd, causing the whole ternary operator to evaluate to true. When it's odd instead, the increment makes it even, causing the ternary operator to evaluate to false, and therefore a space is printed on the screen.

Finally, the last line of the output begins with TWO spaces!! How is that possible? Well, the reason is in the chosen number "21"!
The output lines infact are aligned on multiples of 20, while the check on alternate spaces is aligned on multiples of 21. This means that the check of alternate spaces is off by one in respect of the alignment of lines.

In the third line you start with iteration 41 (well, actually iteration 42 which gets decremented to 41). This value of the counter makes it so that for the line alignment it represents the first element of the third line (the third set of 20), while for the space alignment it is the last element of the second line (the second set of 21).
    
Now the third and most nested operator is left:
   
strchr(1[l], _ ^ "pt`u}rxf~c{wk~zyHHOJ]QULGQ[Z" [_/2])? 111 : 46
   
this is quite simple, but also quite interesting in the notation being used. 

First we see it returns the two values 111 or 46 (ascii values for "o" and "."), and correspond respectively to a char that is present and to a char that is not present in the input parameter string.
How is the check performed?? Simply with a strchr() function.

This function searches for a char in a string, and returns a pointer to that char if it is found, or NULL otherwise (which logically would evaluate respectively to true or false). The first parameter is weird! 
We have:
                        1[l]
   
no, it's not a typo! 1 is the array, and the char **l is the index. How is that possible?? Well, the semantics of the C language allows this weird notation; normally, when using an array, the code would resolve it like this:
   
type array[n] --> array + n*sizeof(type)

in order to calculate the pointer to reach the n-th element in the array. In the case of this IOCCC code, instead, the compiler does this:
   
char *test = 1[l];
----------------------------------------------
    mov         eax,dword ptr [ebp+0Ch]  ; eax = array address
    mov         ecx,dword ptr [eax+4]      ; ecx = element at address + 4
    mov         dword ptr [ebp-4],ecx       ; element is stored in test



remember that "l" is a char**, meaning that it is an array of pointers, and each element is of size 4. We access element 1, so 1*4 = 4. Therefore even when the index and the array are swapped, we still have
   
type n[array] -> array + n*sizeof(type)
   
that is, the pointer is still calculated and resolved correctly.
So this whole 1[l] thing is simply accessing argv[1], that is, the first parameter passed in the command line.

Going back to the strchr function, we see another similar trick in the second parameter:
   
"pt`u}rxf~c{wk~zyHHOJ]QULGQ[Z" [_/2]

this time the trick is more intuitive, the string is treated as an array of chars, and C allows you to use the [] notation to access the elements. This is completely equivalent to the most "normal" form:
   
char Array[] = "pt`u}rxf~c{wk~zyHHOJ]QULGQ[Z";
...
char test = Array[_/2];
   
Ok, we know that the first parameter of the strchr is the string passed in the command line, but what about this second one? We see the counter makes an appearance. It is divided by two because we have seen it prints alternatively characters and spaces. Also, we see the character extracted from the string array is xored with the counter itself! This looks like some sort of partial encryption :P

Anyway, at every odd iteration a character is extracted from the above string, and after the xor with the counter it results in the n-th character of the keyboard (qwertyuio... etc). This character is then searched in the input parameter string, and if it is found the character "o" is returned by the ternary operator (and printed on the screen), or else the character "." is returned.

Riddle: because of the way the loop is calculated, some characters from the above string are skipped and never used, can you figure out which ones?
   
The last, final, question is: which is the condition that stops the loop?
The loop is performed through the recursive formula we saw in the beginning:
    
6*putchar(…) ^ _ && main(2 + _, l);
   
This is a logical "and": the first part stands for the termination condition, while the second one is the core of the recursion. Both the sides of the formula must evaluate to true in order for the recursion to work; the second part, though, only contains the "main" which doesn't return any value.
   
So this leaves us with only the first part that controls the iteration: the code will evaluate the first part of the formula, and if it is true it will try and evalute the second part (the "main"), but this part is a function, therefore the code will have to call the function in order to determine whether this function will return true or false. But this function is recursive! Therefore it doesn't matter what the "main" evaluates to, the recursion will cause the code to always follow it in any case.
   
The only way to terminate this loop is when the first part of the formula evaluates to false. This way, the code knows a priori that the second part of the formula can't make a difference, so it will not evaluate it, causing the recursion not to happen.
   
The first part of the formula is simply:
   
6*putchar(…) ^ _
   
when is this false? Well, remember that every line of output is of 20 chars? Every time the counter is a multiple of 20 a "\n" is printed (ascii code is 10 decimal).
   
In the last iteration, after the third line, the counter reaches the value 60. This is a multiple of 20, therefore the "\n" is printed by the expression within the "putchar" argument. Also, the printed charater is returned by "putchar" (that is, it returns 10). We see the return value (10) is multiplied by 6, then xored with the counter itself. Let's do the simple math:
   
(6*10) ^ 60
   
the result of this is 0, which logically evaluates to false, so we can see that only when the execution counter reaches the value of 60 the recursion will be interrupted, causing the code to get out of the nested recursions it has done so far, and ultimately terminate the program execution.
   
And that's all for this little one liner: very small code, but dense contents! It is very interesting to analyze tricky codes like this because you find a lot of quirks that you never see in "normal" C source codes, and you also learn a lot about the internal working of the C itself.
  
My congratulations to the author Taketo Konno for designing this very clever piece of code, which well deserves the "best one liner award" :)


No comments:

Post a Comment