VitoPlantamura's Website

Click here to go to the archives page...

"extremely impressive work"
- Mark Russinovich, www.sysinternals.com
(referring to my BugChecker)

Homepage of Vito Plantamura (@, LinkedIn), Windows Researcher and Developer.

[user=Guest] - updated: August 08, 2007

MINI C COMPILER

MINI C COMPILER

INTRODUCTION

At the age of 16/17 I needed to learn the C language in order to cut down my overall development times: I knew perfectly the 32-bit integer, FPU and system assembler for the Intel x86 processors and I was able to program fluently entire games and applications exclusively in assembly language. However, when the need of writing bigger games and programs began to grow, by using a low-level language I was unable to be as productive in development as I needed to be: higher level tools and compilers were required. So, I learned myself the C language from an old book of mine in a couple of days. The C language was the perfect choice because it allowed more productivity, code readability and robustness while being the most near to the always beloved raw machine language.

One of the very first projects I wanted to develop in my just learned language was, not surprisingly, a compiler for the C language itself: I found the idea of writing a compiler for a specific language in that same language extremely intriguing. However, one of the main reasons that pushed me to pursue this project (besides its inherent complexity) was that it could allow me (due to the greater amount of research and specific investigations involved) to improve significantly my knowledge of the new language: I knew perfectly the assembly language and, in order to write good programs, I needed to know perfectly also my new high-productivity language as well. The fact that I had no documentation or books on how to write compilers and linkers or even an internet connection for trying to gain related knowledge on the web, obviously, as always, didn?t stop me from pursuing my project. So, I took a pen and some paper and started to think about how a compiler could work, from its early stage of source code preprocessing to the final architecture-specific object code emission.

After a couple of days of thinking, I started to write the actual code: the entire project took 3 months of works and the result was a single, huge .C source file that actually comprises the "MiniC" language preprocessor and the compiler itself, and that is able to generate proprietary-format .OBJ files from .MC source files. The compiler itself does not implement all the features of standard (ANSI/Microsoft) C language specifications, but however it is just perfect for scripting, for example. Specifically I have used it successfully for this purpose inside my kernel mode debugger (bugchecker) and in my three dimensional CSG editor (MAPGEN). The source file that you can download from this page is the most recent version and is the one that is linked against the BugChecker object files. One of the main advantages of using MiniC is its extremely fast compilation times: for example, in BugChecker, EVERY command typed in the kernel debugger console is instantly compiled and linked against the pre-compiled object modules that are stored in the NonPaged pool memory of the program (the precompiled modules are actually the ones that provide the basic API of BugChecker and the object files representing the macro files defined by the user). Furthermore in BugChecker it is possible to define macros and commands in the MiniC language that consume directly or call functions that utilize FPU instructions and operations: this is due to an unique feature of BugChecker that saves the floating point context of the currently running process prior to entering in the debugger environment (only in the case that the application has carried out non-integer operations previously and so actually does have an FPU state when the debugger is popped up).

You can see that this is one of my first works in C language by the fact that, for example, the entire implementation is grouped into a single source file. If you scroll quickly the source code you can see various patches and corrections (marked with my name and the corresponding date) that span the time of 7 years...

TWO WORDS ON HOW IT WORKS

The preprocessor and compilation stages are pretty standard in MiniC. However the intermediary byte code used internally by the compiler (the so-named PSI code) is (for performance reasons) less abstract and more architecture-specific than the one found in traditional compilers (more on this later). The compilation stages are as follows:

	The source file is taken and each expression, prototype and declaration in it are separated from the rest of the code.

	Each C expression is parsed by the "ParseExpressionTokens" function: here a string representation of the expression is converted in a series of structures that just reflect the occurrence of identifiers and operators inside the expression itself.

	In the "MakeExprOrderedOpList" the language expression (in its non-string, structured format, as the output of the previous function) is ordered so the operators and symbols with an higher priority are relocated first in this series of structures.

	The "ResolveNonConstantOperation" and "ResolveNonConstantExpression" functions take the output of the previous functions and generate the actual target code expressed in a series of PSI instructions. An example of several PSI instructions is shown later in this article. PSI stands for "pseudo-instruction" and it is actually a really low-level byte code used for optimization and code generation later in the compilation process ("LOAD_SIGNED_CHAR_IN_EAX" is an example of a PSI instruction).

	Various functions such as "ParseVariableDeclaration", "ParseStructDeclaration" and "ParseFunctionDeclaration" take care of parsing and transforming in actual target code specific elements of the original source file.

	The "ReducePSICode" function parses the PSI output just generated by the previous compilation stages and searches for groups of instructions that can be simplified. For example, if two compatible "STORE_EAX_IN_INT" and "LOAD_INT_IN_EAX" instructions are found, they are simply removed from the final code emission.

	The "WriteOutputFile" function translates the PSI instructions in actual x86 assembler code and packages the final object module.

PSI INSTRUCTIONS

As explained before, the PSI instructions are used internally as an intermediary low-level proprietary byte code throughout the entire compilation process. For simplifying and accelerating the whole compilation work, the PSI instructions are less abstract than traditional byte-codes that you may find in other compilers.

For example, they are just architecture registers-aware (you may find PSI instructions that target specific x86 and FPU registers). They have the following format in the compiler source file:

/*------------------------------------------------------------------------*/
#define LOAD_SIGNED_CHAR_IN_EAX                                                       0x0000
      // movsx      eax,byte ptr [signed char]
byte psi_0000[]={0x0F,0xBE,0x05,0x00,0x00,0x00,0x00}; // address
psi_t psi0000={7,psi_0000,0,{3,-1},{-1},{-1}};
/*------------------------------------------------------------------------*/
#define LOAD_SIGNED_CHAR_IN_ECX                                                       0x0001
      // movsx      ecx,byte ptr [signed char]
byte psi_0001[]={0x0F,0xBE,0x0D,0x00,0x00,0x00,0x00}; // address
psi_t psi0001={7,psi_0001,0,{3,-1},{-1},{-1}};
/*------------------------------------------------------------------------*/
#define LOAD_UNSIGNED_CHAR_IN_EAX                                                     0x0002
      // mov            eax,dword ptr [unsigned char]
      // and            eax,0FFh
byte psi_0002[]={0xA1,0x00,0x00,0x00,0x00, // address
      0x25,0xFF,0x00,0x00,0x00};
psi_t psi0002={10,psi_0002,0,{1,-1},{-1},{-1}};
/*------------------------------------------------------------------------*/
...
/*------------------------------------------------------------------------*/
#define LOAD_INT_IN_EAX                                                               0x0008
      // mov            eax,dword ptr [int]
byte psi_0008[]={0xA1,0x00,0x00,0x00,0x00}; // address
psi_t psi0008={5,psi_0008,0,{1,-1},{-1},{-1}};
/*------------------------------------------------------------------------*/
#define LOAD_INT_IN_ECX                                                               0x0009
      // mov            ecx,dword ptr [int]
byte psi_0009[]={0x8B,0x0D,0x00,0x00,0x00,0x00}; // address
psi_t psi0009={6,psi_0009,0,{2,-1},{-1},{-1}};
/*------------------------------------------------------------------------*/
#define LOAD_FLOAT_IN_ST0                                                             0x000a
      // fld            dword ptr [float]
byte psi_000a[]={0xD9,0x05,0x00,0x00,0x00,0x00}; // address
psi_t psi000a={6,psi_000a,0,{2,-1},{-1},{-1}};
/*------------------------------------------------------------------------*/
#define LOAD_DOUBLE_IN_ST0                                                            0x000b
      // fld      qword ptr [double]
byte psi_000b[]={0xDD,0x05,0x00,0x00,0x00,0x00}; // address
psi_t psi000b={6,psi_000b,0,{2,-1},{-1},{-1}};
/*------------------------------------------------------------------------*/
...
/*------------------------------------------------------------------------*/
#define ARRAY_FIRST_DIMENSION                                                         0x0018
      // imul        eax,eax,(max2*max3*...*maxn*sizeof_type)
      // lea            ecx,dword ptr (address)[eax]
byte psi_0018[]={0x69,0xC0,0x00,0x00,0x00,0x00, // (max2*max3*...*maxn*sizeof_type)
      0x8D,0x88,0x00,0x00,0x00,0x00}; // address
psi_t psi0018={12,psi_0018,4,{8,-1},{-1},{2,-1}};
/*------------------------------------------------------------------------*/
#define ARRAY_X_DIMENSION                                                             0x0019
      // imul        eax,eax,(max(x+1)*max(x+2)*...*maxn*sizeof_type)
      // add            ecx,eax
byte psi_0019[]={0x69,0xC0,0x00,0x00,0x00,0x00, // (max(x+1)*max(x+2)*...*maxn*sizeof_type)
      0x03,0xC8};
psi_t psi0019={8,psi_0019,4,{-1},{-1},{2,-1}};
/*------------------------------------------------------------------------*/
#define CALL_FUNCTION                                                                 0x001a
      // call        (function)
      // add            esp,constant
byte psi_001a[]={0xE8,0x00,0x00,0x00,0x00, // (dest-instr-5)
      0x81,0xC4,0x00,0x00,0x00,0x00}; // constant
psi_t psi001a={11,psi_001a,4,{1,-1},{-1},{7,-1}};
/*------------------------------------------------------------------------*/
...
/*------------------------------------------------------------------------*/
#define ST0_LOGICAL_NOT                                                               0x001f
      // mov            dword ptr [tmp_dword_0],0
      // fcomp dword ptr [tmp_dword_0]
      // fnstsw      ax
      // test        ah,40h
      // je       jump_0
      // mov            eax,1
      // jmp            jump_1
      // jump_0:
      // sub            eax,eax
      // jump_1:
byte psi_001f[]={0xC7,0x05,0x00,0x00,0x00,0x00, // address
      0x00,0x00,0x00,0x00,0xD8,0x1D,0x00,0x00,0x00,0x00, // address
      0xDF,0xE0,0xF6,0xC4,0x40,0x74,0x07,0xB8,0x01,0x00,0x00,0x00,0xEB,0x02,0x2B,0xC0};
psi_t psi001f={32,psi_001f,0,{-1},{2,12,-1},{-1}};
/*------------------------------------------------------------------------*/
...
/*------------------------------------------------------------------------*/
#define ST0_FLOAT_DIVISION                                                            0x0025
      // fdiv        dword ptr [float]
byte psi_0025[]={0xD8,0x35,0x00,0x00,0x00,0x00}; // address
psi_t psi0025={6,psi_0025,0,{2,-1},{-1},{-1}};
/*------------------------------------------------------------------------*/
...
/*------------------------------------------------------------------------*/
#define EAX_ECX_UNSIGNED_GREATER_THAN                                                 0x003c
// h sufficiente che o eax o ecx siano interi a 32 bit senza segno
      // cmp            ecx,eax
      // sbb            edx,edx
      // neg            edx
      // mov            eax,edx
byte psi_003c[]={0x3B,0xC8,0x1B,0xD2,0xF7,0xDA,0x8B,0xC2};
psi_t psi003c={8,psi_003c,0,{-1},{-1},{-1}};
/*------------------------------------------------------------------------*/
#define EAX_ECX_LESS_THAN_OR_EQUAL_TO                                                 0x003d
      // xor            edx,edx
      // cmp            eax,ecx
      // setle dl
      // mov            eax,edx
byte psi_003d[]={0x33,0xD2,0x3B,0xC1,0x0F,0x9E,0xC2,0x8B,0xC2};
psi_t psi003d={9,psi_003d,0,{-1},{-1},{-1}};
/*------------------------------------------------------------------------*/
#define EAX_ECX_UNSIGNED_LESS_THAN_OR_EQUAL_TO                                        0x003e
// h sufficiente che o eax o ecx siano interi a 32 bit senza segno
      // cmp            ecx,eax
      // sbb            edx,edx
      // inc            edx
      // mov            eax,edx
byte psi_003e[]={0x3B,0xC8,0x1B,0xD2,0x42,0x8B,0xC2};
psi_t psi003e={7,psi_003e,0,{-1},{-1},{-1}};
/*------------------------------------------------------------------------*/
...
/*------------------------------------------------------------------------*/
#define ST0_EDXPTR_FLOAT_LESS_THAN                                                    0x0045
      // fcomp dword ptr [edx=float]
      // fnstsw      ax
      // test        ah,41h
      // jne            jump_0
      // mov            eax,1
      // jmp            jump_1
      // jump_0:
      // sub            eax,eax
      // jump_1:
byte psi_0045[]={0xD8,0x1A,0xDF,0xE0,0xF6,0xC4,0x41,0x75,0x07,0xB8,0x01,0x00,0x00,0x00,0xEB,0x02,0x2B,0xC0};
psi_t psi0045={18,psi_0045,0,{-1},{-1},{-1}};
/*------------------------------------------------------------------------*/
...
/*------------------------------------------------------------------------*/
#define JUMP_IF_EAX_ZERO                                                              0x0062
      // test        eax,eax
      // je       (address)
byte psi_0062[]={0x85,0xC0,0x0F,0x84,0x00,0x00,0x00,0x00}; // address
psi_t psi0062={8,psi_0062,0,{4,-1},{-1},{-1}};
/*------------------------------------------------------------------------*/
#define JUMP_IF_ST0_ZERO                                                              0x0063
      // mov            dword ptr [tmp_dword_0],0
      // fcomp dword ptr [tmp_dword_0]
      // fnstsw      ax
      // test        ah,40h
      // jne            (address)
byte psi_0063[]={0xC7,0x05,0x00,0x00,0x00,0x00, // address 0
      0x00,0x00,0x00,0x00,0xD8,0x1D,0x00,0x00,0x00,0x00, // address 0
      0xDF,0xE0,0xF6,0xC4,0x40,0x0F,0x85,0x00,0x00,0x00,0x00}; // address 1
psi_t psi0063={27,psi_0063,0,{23,-1},{2,12,-1},{-1}};
/*------------------------------------------------------------------------*/
...
/*------------------------------------------------------------------------*/
#define ARRAY_X_DIMENSION_WITH_LSHIFT                                                 0x00ec
      // shl            eax,constant      // 2^constant=(max(x+1)*max(x+2)*...*maxn*sizeof_type)
      // add            ecx,eax
byte psi_00ec[]={0xC1,0xE0,0x00,// constant
      0x03,0xC8};
psi_t psi00ec={5,psi_00ec,1,{-1},{-1},{2,-1}};
/*------------------------------------------------------------------------*/
#define ARRAY_FIRST_DIMENSION_WITH_LSHIFT                                             0x00ed
      // shl            eax,constant      // 2^constant=(max2*max3*...*maxn*sizeof_type)
      // lea            ecx,dword ptr (address)[eax]
byte psi_00ed[]={0xC1,0xE0,0x00, // constant
      0x8D,0x88,0x00,0x00,0x00,0x00}; // address
psi_t psi00ed={9,psi_00ed,1,{5,-1},{-1},{2,-1}};
/*------------------------------------------------------------------------*/
#define STACK_ARRAY_FIRST_DIMENSION_WITH_LSHIFT                                       0x00ee
      // shl            eax,constant      // 2^constant=(max2*max3*...*maxn*sizeof_type)
      // lea            ecx,dword ptr (ebp-x=address)[eax]
byte psi_00ee[]={0xC1,0xE0,0x00, // constant
      0x8D,0x8C,0x05,0x00,0x00,0x00,0x00}; // offset
psi_t psi00ee={10,psi_00ee,1,{6,-1},{-1},{2,-1}};
/*------------------------------------------------------------------------*/
...
/*------------------------------------------------------------------------*/
#define RETURN                                                                                    0x00fb
      // pop            edi
      // pop            esi
      // pop            ebx
      // mov            esp,ebp
      // pop            ebp
      // ret
byte psi_00fb[]={0x5F,0x5E,0x5B,0x8B,0xE5,0x5D,0xC3};
psi_t psi00fb={7,psi_00fb,0,{-1},{-1},{-1}};
/*------------------------------------------------------------------------*/
#define ENTER                                                                               0x00fc
      // push        ebp
      // mov            ebp,esp
      // sub            esp,constant
      // push        ebx
      // push        esi
      // push        edi
byte psi_00fc[]={0x55,0x8B,0xEC,0x81,0xEC,0x00,0x00,0x00,0x00,0x53,0x56,0x57}; // constant
psi_t psi00fc={12,psi_00fc,4,{-1},{-1},{5,-1}};
/*------------------------------------------------------------------------*/
...

As you may see yourself, each PSI instruction is precompiled in native x86 assembler: each instruction is then represented by an instance of the "psi_t" type. A "psi_t" instance has a reference to the length of the final compiled chunk of code, a byte pointer to it and various specific relative offsets inside the byte chunk that refer to addresses, constants and offsets that need to be filled in order to compile successfully that PSI instruction in the final object module.

DOWNLOAD

Download the source file of the MiniC compiler from here (92KB).

Quotes

"Among the Windows experts I know personally, no one can beat Vito Plantamura."

- Francesco Balena, Code Architects SRL

"Your NDIS Monitor application, is amongst the most impressive networking code I have seen on the .Net framework."

- Ben Hakim.

Photos

Various images from italian conferences and events (keep the mouse on a thumbnail for a short description):

Me at the Microsoft/HP/Intel organized Route64 event in Milan in May 2005, explaining how COM+ behaves on 64-bit Microsoft operating systems. I was there with the friends of Code Architects.

Me at the Microsoft Security Roadshow event in Bari in April 2006, explaining how the logon process works in Windows NT. There were 250 attendees.

Microsoft Security Roadshow 2006 in Treviso. This is an image of the huge 700-seats conference room.

Me at the Microsoft Security Roadshow 2006 in Treviso. This is a moment of the 3-hours session.

Everything here (code, binaries, text, graphics, design, html) is © 2010 Vito Plantamura and VPC Technologies SRL (VATID: IT06203700965).
If you download something (compilable or not) from the site, you should read the license policy file.
If you want to contact me via email, write at this address.