MINI C COMPILER
INTRODUCTION
At the age of 16 or 17 I needed to learn the C language in order to cut down my overall development times. I knew the 32-bit integer, FPU and system instruction sets of the Intel x86 processors perfectly, and I could fluently write entire games and applications exclusively in assembly language. However, as the need to write bigger games and programs grew, a low-level language no longer let me be as productive as I needed to be: higher-level tools and compilers were required. So I taught myself C from an old book of mine in a couple of days. C was the perfect choice because it offered more productivity, code readability and robustness while remaining the closest thing to my always beloved raw machine language.
One of the very first projects I wanted to develop in my newly learned language was, not surprisingly, a compiler for the C language itself: I found the idea of writing a compiler for a language in that same language extremely intriguing. However, one of the main reasons that pushed me to pursue this project (besides its inherent complexity) was that the research and investigation it required would significantly improve my knowledge of the new language: I knew assembly language perfectly and, in order to write good programs, I needed to know my new high-productivity language just as well. I had no documentation or books on how to write compilers and linkers, nor even an internet connection for looking up related knowledge on the web, but, as always, that didn't stop me from pursuing my project. So I took a pen and some paper and started to think about how a compiler could work, from the early stage of source code preprocessing to the final emission of architecture-specific object code.
After a couple of days of thinking, I started to write the actual code. The entire project took three months of work, and the result was a single, huge .C source file that comprises both the "MiniC" language preprocessor and the compiler itself, and that generates proprietary-format .OBJ files from .MC source files. The compiler does not implement all the features of the standard (ANSI/Microsoft) C language specifications, but it is just right for scripting, for example. Specifically, I have used it successfully for this purpose inside my kernel mode debugger (BugChecker) and in my three-dimensional CSG editor (MAPGEN). The source file that you can download from this page is the most recent version and is the one that is linked against the BugChecker object files.
One of the main advantages of MiniC is its extremely fast compilation: in BugChecker, for example, EVERY command typed in the kernel debugger console is instantly compiled and linked against the precompiled object modules stored in the NonPaged pool memory of the program (the precompiled modules are the ones that provide the basic API of BugChecker, plus the object files representing the macro files defined by the user). Furthermore, in BugChecker it is possible to define macros and commands in the MiniC language that use FPU instructions directly or call functions that use them. This is due to a unique feature of BugChecker, which saves the floating point context of the currently running process before entering the debugger environment (only when the application has previously carried out non-integer operations and so actually has an FPU state when the debugger pops up).
You can tell that this is one of my first works in C from the fact that, for example, the entire implementation is grouped into a single source file. If you scroll quickly through the source code you can see various patches and corrections (marked with my name and the corresponding date) that span seven years...
TWO WORDS ON HOW IT WORKS
The preprocessing and compilation stages are pretty standard in MiniC. However, the intermediate byte code used internally by the compiler (the so-called PSI code) is, for performance reasons, less abstract and more architecture-specific than the one found in traditional compilers (more on this later). The compilation stages are as follows:
1. The source file is read and each expression, prototype and declaration in it is separated from the rest of the code.
2. Each C expression is parsed by the "ParseExpressionTokens" function: here the string representation of the expression is converted into a series of structures that simply reflect the occurrence of identifiers and operators inside the expression itself.
3. In the "MakeExprOrderedOpList" function, the expression (in its non-string, structured format, as output by the previous function) is reordered so that the operators and symbols with a higher priority come first in this series of structures.
4. The "ResolveNonConstantOperation" and "ResolveNonConstantExpression" functions take the output of the previous functions and generate the actual target code, expressed as a series of PSI instructions. Several PSI instructions are shown later in this article. PSI stands for "pseudo-instruction": it is a really low-level byte code used for optimization and code generation later in the compilation process ("LOAD_SIGNED_CHAR_IN_EAX" is an example of a PSI instruction).
5. Various functions such as "ParseVariableDeclaration", "ParseStructDeclaration" and "ParseFunctionDeclaration" take care of parsing specific elements of the original source file and transforming them into actual target code.
6. The "ReducePSICode" function parses the PSI output generated by the previous compilation stages and searches for groups of instructions that can be simplified. For example, if two compatible "STORE_EAX_IN_INT" and "LOAD_INT_IN_EAX" instructions are found, they are simply removed from the final code emission.
7. The "WriteOutputFile" function translates the PSI instructions into actual x86 machine code and packages the final object module.
PSI INSTRUCTIONS
As explained before, the PSI instructions are used internally as an intermediate, low-level, proprietary byte code throughout the entire compilation process. To simplify and speed up the whole compilation work, the PSI instructions are less abstract than the traditional byte codes that you may find in other compilers.
For example, they are aware of the architecture's registers (you may find PSI instructions that target specific x86 and FPU registers). They have the following format in the compiler source file:
/*------------------------------------------------------------------------*/
#define LOAD_SIGNED_CHAR_IN_EAX 0x0000
// movsx eax,byte ptr [signed char]
byte psi_0000[]={0x0F,0xBE,0x05,0x00,0x00,0x00,0x00}; // address
psi_t psi0000={7,psi_0000,0,{3,-1},{-1},{-1}};
/*------------------------------------------------------------------------*/
#define LOAD_SIGNED_CHAR_IN_ECX 0x0001
// movsx ecx,byte ptr [signed char]
byte psi_0001[]={0x0F,0xBE,0x0D,0x00,0x00,0x00,0x00}; // address
psi_t psi0001={7,psi_0001,0,{3,-1},{-1},{-1}};
/*------------------------------------------------------------------------*/
#define LOAD_UNSIGNED_CHAR_IN_EAX 0x0002
// mov eax,dword ptr [unsigned char]
// and eax,0FFh
byte psi_0002[]={0xA1,0x00,0x00,0x00,0x00, // address
                 0x25,0xFF,0x00,0x00,0x00}; // and eax,0FFh
psi_t psi0002={10,psi_0002,0,{1,-1},{-1},{-1}};
/*------------------------------------------------------------------------*/
...
/*------------------------------------------------------------------------*/
#define LOAD_INT_IN_EAX 0x0008
// mov eax,dword ptr [int]
byte psi_0008[]={0xA1,0x00,0x00,0x00,0x00}; // address
psi_t psi0008={5,psi_0008,0,{1,-1},{-1},{-1}};
/*------------------------------------------------------------------------*/
#define LOAD_INT_IN_ECX 0x0009
// mov ecx,dword ptr [int]
byte psi_0009[]={0x8B,0x0D,0x00,0x00,0x00,0x00}; // address
psi_t psi0009={6,psi_0009,0,{2,-1},{-1},{-1}};
/*------------------------------------------------------------------------*/
#define LOAD_FLOAT_IN_ST0 0x000a
// fld dword ptr [float]
byte psi_000a[]={0xD9,0x05,0x00,0x00,0x00,0x00}; // address
psi_t psi000a={6,psi_000a,0,{2,-1},{-1},{-1}};
/*------------------------------------------------------------------------*/
#define LOAD_DOUBLE_IN_ST0 |