1.3 – What is Compilation and How to Compile and Run a C Program

I briefly covered compiling C code and running the executable in the hello world C program post, and again while explaining the basics of the C programming language.

In this post, let’s learn a little more about what compilation means, what happens while a C program is compiled (including the warnings and errors reported along the way), and what the output is once compilation is done.

What is Compilation

When you write any code or program and save the file, initially it is just a file with some text in it, that’s all.

As I have been saying, C is a high-level language that is easy for humans to understand, so a C source file just contains words and lines in a human-readable format.

Now, this C file cannot by itself be run by your computer as a program or application.

It must be converted into a language the machine or processor understands, which is made of 1s and 0s and is called machine code.

One level above machine code, every processor has an instruction set: a collection of instructions, each with a human-readable name (called a mnemonic) such as MOV, ADD, or JMP, and a binary encoding (its opcode).

Just as there is the C language, there is assembly language, which is written using these instruction mnemonics.

Think of each mnemonic as a symbolic name for a particular combination of binary 1s and 0s.

For example:

  • 100010 is the leading six bits of the opcode for one form of the MOV instruction (register/memory to or from register) on the Intel 8086 processor

An important thing to note is that different processor families, such as x86 (used by Intel and AMD), ARM, and MIPS, have different instruction sets. The instruction set a family defines is usually called its Instruction Set Architecture (ISA).

The compiler converts your C code into assembly code for the target processor. Then a tool called the assembler takes this assembly file and converts all the instructions into binary 1s and 0s that the processor can run.

When I say compilation, there are several tools involved, such as the preprocessor, assembler, and linker, which I am going to cover in the next post.

For now, to keep it simple, remember that each processor family understands its own machine language. It is the compiler’s job to convert human-readable C code into binary code that the processor understands.

Different Stages of Compilation in C Programming

While explaining the structure of a C program, I showed #include and #define in the example code.

These statements start with a “#” symbol, and during compilation they are the first ones to be processed.

Statements starting with “#” are processed by a tool called the preprocessor. This is the very first operation that happens when any C program is compiled.

In the previous section, I gave you a very brief introduction to the assembler and what it does.

In this section, I started with preprocessing.

I did this intentionally, so that you get the idea that there are several different stages involved in the whole compilation process.

Let’s break them down in the exact order in which they happen when C code is compiled:

  1. Pre-processing
  2. Compilation
  3. Assembly
  4. Linking

Now, let’s look at what’s happening in each stage of operation.

Preprocessing

As the name suggests, this is the very first stage of the whole process, and it happens before the actual compilation. That’s why it is called preprocessing.

Any statement that starts with a “#” symbol is called a preprocessor directive and is processed by the preprocessor.

Below are some of the preprocessor directives:

  • #include – includes a file
  • #define – defines a macro, such as a name for a fixed value
  • #if, #elif, #else, #endif – preprocessor level conditional check
  • #ifdef, #ifndef – checks if a macro is defined or not
  • #error – generates an error during preprocessing
  • #pragma – compiler specific settings

It is also the job of the preprocessor to remove comments from the code, whether they start with // or are wrapped within /* and */.

Summary

  • The preprocessor processes any statement that starts with the # symbol (a preprocessor directive).
    • It processes all these # statements in order, from top to bottom.
  • It also removes the comments.
  • Use the -E option in gcc to see the preprocessed output for your C source code.
    • By default gcc -E prints the result to the terminal; add -o with a filename to save it. The conventional extension for a preprocessed file is .i

Compilation

In the compilation phase, the compiler checks the syntax and semantics of the code. If everything is correct, it converts the human-readable C code into assembly code (instructions for the target architecture / processor), which is then processed by an assembler.

Let me explain in simple words.

Let’s use the Hello world c program for reference:

#include <stdio.h>

int main ()
{
    printf ( "Hello, World!" );

    return 0;
}

What is Syntax in C

Syntax is the set of basic rules that must be followed while writing any statement in C.

For example:

  • A statement in C must end with a semicolon “;”
    • The printf statement ends with a semicolon. If the semicolon is missing, the compiler reports an error.
  • A function definition must have a pair of parentheses “( )” after the name of the function
    • The name main is followed by a pair of parentheses. If I leave them out or write only one of them, the compiler reports an error.
  • The statements inside a function definition must be wrapped within a pair of curly braces “{ }”
    • The body of main() starts with an opening curly brace and ends with a closing curly brace. If either one is missing, the compiler reports an error.
  • etc.

Look at the example program below, which is full of syntax errors:

#include <stdio.h>

int main ( // Missing closing parenthesis
{
    printf ( "Hello, World!" ) // missing semicolon

    return 0;
}
} // extra closing bracket

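Semantics in C

Semantics is about the meaning of the code: a statement can follow every syntax rule and still not make sense to the compiler.

A large part of that meaning comes from keywords, the very specific reserved words used to declare a variable, write a loop, and so on.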

For example:

  • int – used to declare an integer type variable
  • for – used to write the for loop
  • return – used to return from a function
  • etc.

Keywords are predefined words that have a specific meaning in C programming. Keep in mind that none of these words can be used for any other purpose, such as naming a variable.

From the hello world example, let’s list the names that carry meaning:

  • int – a keyword; here it states what kind of data the main function returns when it ends.
  • main – not a keyword, but a function name with a special meaning: it is the entry point of every C program. Wherever the main function is, execution starts from there.
  • number – an identifier created by the user (there is no number variable in the hello world program; I use it in the explanation that follows)
  • return – a keyword used to return from a function when it finishes executing.

If a variable is declared, for example int number, then the name “number” is called an identifier. Let’s say you have used a variable name somewhere in your code but never declared it; this kind of error is caught by the semantic check.

Remember, #include is NOT a keyword and is not part of the semantic check; it is a preprocessor directive, which we covered in the preprocessing section.

Code Optimization

Code optimizations happen in the compilation phase.

If the compiler is passed options like -O1, -O2 etc., it optimizes the code while generating the assembly code.

Code optimization is an advanced topic that deserves a detailed explanation, so I will cover it in a separate post. It is used heavily in embedded systems, either to increase performance by using fewer CPU cycles or to reduce code size on minimal systems.

C code conversion to Assembly code

If the syntax and semantics are correct, the compiler generates the assembly code, which is written in terms of the processor’s instructions.

I gave an example of how assembly code looks after it is converted from C code in the C programming introduction post. I am placing the same example here to give you an idea:

main:       push %rbp            # establish...
            mov %rsp, %rbp       # ...a stack frame (and align rsp to 16 bytes)

            lea str(%rip), %rdi  # load the effective address of str to rdi
            xor %al, %al         # tell printf we have no floating point args
            call printf          # call printf(str)

            leave                # tear down the stack frame
            ret                  # return

push, mov, lea, xor, call, leave, and ret are all mnemonics for machine instructions.

Summary

  • The compiler checks for syntax errors
  • It also checks that the code is semantically correct
  • When the syntax and semantics are correct, it converts the code from human-readable format into assembly instructions.
  • It optimizes the code if passed compiler options such as -O1 or -Os
  • Use the -S option in gcc to generate the assembly file for your C source code.
    • The assembly file will have a .s extension

Assembly Stage

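During the assembly stage, the assembler converts the assembly file generated in the previous stage into machine code.

The file holding this machine code is usually called an object file. With gcc, the -c option stops after this stage and produces a file with a .o extension.

The object file contains machine code the processor can understand, but it is not yet a complete program. The machine code and data are organized into sections (such as code, initialized variables, and uninitialized variables), and the file carries a symbol table listing the functions and variables it defines and the external ones, such as printf, that it still needs from libraries, plus a few more things. The stack and the heap are not part of the object file; they are set up when the program runs.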

Linking

The object file generated so far contains the C code written by the user (you), along with references to the library functions it depends on.

In the linking phase, the linker combines this user code with any other object files and with the static or dynamic libraries it needs, and finally creates the executable file, which can be run directly on the system.

In the previous hello world program, after the final linking stage, we got the a.out or helloworld.out file.

If you are using Linux, macOS, or FreeBSD, run one of the commands below, depending on your output filename:

$ ./a.out

or

$ ./helloworld.out

Final Takeaway

It is essential to understand that compilation is a multi-stage operation, as explained above.

Once you understand the different stages and what happens in each of them, you will be able to figure out which stage an error comes from while you are writing C code.

In addition, this gives you an insight into how a system works.

This knowledge will stay useful for as long as you work with C.

Got a question or have a doubt?

Compilation is a fundamental building-block topic to understand when you begin your journey of learning C programming.

I have covered as much information as possible, but I may have missed something, or you may have a doubt or question about something.

I invite you to sign up, use the forum for compilation discussions, and create a topic with the details of the question you want answered.

Below are some useful links for when you need help with C: