TechECE

Ramblings of a directionless group...

Sunday, May 21, 2006
Compiler, Linker, Loader - An Intro To Relocatable Modules
Machis, I took pains to format this da, 'cos a simple copy-paste from MS Word creates problems in the blog. So, kindly give it at least a glance.

First, you must understand exactly why this idea of relocatability is necessary at all. The simplest explanation is that it is necessary. The technical explanation is as follows: a computer has a bank of memory in which to store large numbers of programs and data. Traditionally, PC memory was indexed starting at 0 and running up to 655,360 (640 kilobytes). This of course wasn't the first memory arrangement used in a computer, but it is an early example that makes the idea easy to understand. Modern computers have much larger blocks of memory, running from 0 to somewhere in the vicinity of a gigabyte.

When a program is loaded into this block of memory, there is no telling exactly where the code will go. It's unlikely that it will be placed at location 0; more likely it will end up somewhere near the top or middle, depending on the system being used. Because of this uncertainty, there is no way to know in advance where our storage space, the functions within our code, or any other data we might wish to reference will end up. How do we solve this problem, you might ask? The answer is relocatable modules.

A relocatable module is one in which every reference to a location within the module is tagged so that it can be updated with a new address once the module is loaded into memory. Before we get into the actual compiler, linker, and loader examples, let's take a quick peek at something easier.

Imagine a simple computer with some RAM (it doesn't matter how much). This computer will only run ONE program. It starts executing at address 0 and stops when it runs out of code. Here's a quick example in pseudo assembly (based loosely on 8086 code using MASM) with roughly corresponding C functions listed next to it.

.code              // No C code - code starts here
MAIN:              // void main() {
    MOV [X], 1     //     x = 1;
    ADD [X], 20    //     x = x + 20;
                   // }
END MAIN           // No C counterpart - marks the end of the code
                   // and says execution starts at MAIN
.data              // No C code - variable declarations
X DB ?             // int x;  (a 1-byte integer here)

Hopefully this makes some sense. Really, what you have is a variable X created in memory and then a function that uses this space to do some elementary calculations. Now, in order to see how this code is run, we'll need to compile it, which means compiling each line. Rather than use opcodes, we'll just leave the words the way they are for readability. Before I list the compiled code, here's a short rundown on the notation: some operand values are constants, while others are memory addresses that refer to some position in memory; the table below marks each operand accordingly, and it is the address operands that will need relocation later. You must also remember that this code will start running at address 0.

ADDRESS VALUE

0 MOV
1 6 (address of X - relocatable)
2 1 (constant)
3 ADD
4 6 (address of X - relocatable)
5 20 (constant)
6 STORAGE FOR X

Notice the way the destination operands of the MOV and ADD operations have both been changed to the memory address 6. The reason for this is that the data segment for this particular program is appended to the end of the code (as shown in the example). Compilers are nice to the human world by making all this machine code a layer of the computer that we shouldn't have to deal with. Because computers ultimately operate on locations in memory, the program must eventually be broken down into something nearly unreadable, in which every operation is performed on a particular piece of memory.
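If it helps to see this in a form you can poke at, here is a minimal C sketch of that compiled image: just an array of words plus a note of which positions hold relocatable addresses. The numeric stand-ins for MOV and ADD are my own invention, since the course pack never assigns actual opcode values.

#include <stdio.h>

enum { MOV = 1, ADD = 2 };              /* hypothetical numeric opcodes        */

/* The compiled image from the table above, addresses 0 through 6.             */
/* Position 6 is the storage for X, which starts out as 0.                     */
int image[7] = { MOV, 6, 1, ADD, 6, 20, 0 };

/* Positions whose values are memory addresses, i.e. the relocatable ones.     */
int reloc[2] = { 1, 4 };

int main(void)
{
    for (int addr = 0; addr < 7; addr++)
        printf("%d\t%d\n", addr, image[addr]);
    return 0;
}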

Next, we'll look at what happens when we take our same program and move it to a computer that can have more than one process in memory at a time. Once we allow this, we must deal with the fact that our code may no longer sit at location zero. To remove this burden from the programmer, the loader was invented. A loader is nothing more than a program that takes a compiled and linked (more on that in a minute) program and replaces all the references to memory with their new addresses. Because each executable is written as if it ran at position 0 of memory, it uses that number as its base, which has a convenient benefit for the loader: it can just find a free block of memory and add that block's starting address to each of the relocatable addresses inside the code. Basically, all the individual memory addresses we used before become offsets from the new base address at which the program is placed. Here's a new table that starts memory at position 810 instead of 0:

OLD ADDRESS (Offset) NEW ADDRESS VALUE

0 810 MOV
1 811 816
2 812 1
3 813 ADD
4 814 816
5 815 20
6 816 STORAGE FOR X

As you can tell by looking at the table, the following formula holds true: New Address = Base Address + Offset. And of course the offset is simply the original address in our source file. Notice that the loader applied the same formula to the address operands inside the code as well: the 6 that pointed at X has become 816. Next, things are going to get a little more complicated: we're going to introduce the idea of modularization, or breaking your code into smaller pieces, such as functions or files. But first, here's that relocation step sketched out in code.
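This is a minimal sketch under the same assumptions as the earlier one; the numeric opcodes and the list of relocatable positions are my own illustration, not anything from the course pack.

#include <stdio.h>

enum { MOV = 1, ADD = 2 };                   /* hypothetical numeric opcodes  */

int image[7] = { MOV, 6, 1, ADD, 6, 20, 0 }; /* compiled image, based at 0    */
int reloc[2] = { 1, 4 };                     /* positions that hold addresses */

int main(void)
{
    int base = 810;                          /* where the loader placed us    */

    /* New Address = Base Address + Offset, applied to every tagged word.     */
    for (int i = 0; i < 2; i++)
        image[reloc[i]] += base;

    /* Print offset, new address and value, just like the table above.        */
    for (int offset = 0; offset < 7; offset++)
        printf("%d\t%d\t%d\n", offset, base + offset, image[offset]);
    return 0;
}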

--------------------------------------------------------------------------------

In this part we're going to use the examples given in the course pack and add a few tables that were left out. For simplicity, every operation takes exactly one operand: LOAD moves data from a memory address into a register (the AX, or accumulator, register), and STORE moves data from the AX register into the given memory location. 'Real' assembly language works the same way, except that instructions and operands come in different sizes rather than being fixed at one word each.

MODULE 1:

INTERNAL GLOBAL MAIN(), Y
EXTERNAL REFERENCE S1()
INTERNAL REFERENCE X

MAIN: LOAD X 0 1
STORE Y 2 3
JUMP S1 4 5

X: CONSTANT 1 6
Y: STORAGE 1 7

In the above example, the first three lines define what kind of data is present in the module. In module 1, the symbol MAIN (which represents a function) and the variable Y are both stored INTERNALLY and are marked GLOBAL so that other modules may use them. The second line indicates that the function S1 is an external function and should be present in another module (most likely module 2). And the final line states that the variable X is an internal variable that is not to be used by other modules (hence the lack of the GLOBAL keyword). The numbers to the right of each instruction indicate the offsets at which the instruction and its operand sit. Using this information, we can build the following symbol table:

SYMBOL INTERNAL ADDRESS EXTERNAL REFERENCE IDENTITY

MAIN() 0 - INTERNAL GLOBAL
S1() 5 ? EXTERNAL
X 6 - INTERNAL
Y 7 - INTERNAL GLOBAL

This table represents all the symbols (the non-keywords) inside the file: the address at which each one can be found inside the file, the external address it refers to if it is defined elsewhere, and whether the variable or function was defined internally or externally. The function S1 is defined outside of the current file, and because the compiler doesn't actually know where that external function is, we leave this part of the symbol table blank (a ?) for now.
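As a rough illustration of how such a table might be held in memory, here is a C sketch; the struct layout and field names are my own, not anything prescribed by the course pack.

#include <stdio.h>

#define UNRESOLVED -1   /* stands in for the ? in the tables above */

/* One row of the symbol table, as sketched here.                            */
struct symbol {
    const char *name;
    int         internal_address;  /* offset within this module             */
    int         external_address;  /* meaningful only for external symbols; */
                                   /* the linker fills it in later          */
    int         is_global;         /* marked GLOBAL for other modules?      */
    int         is_external;       /* defined in some other module?         */
};

/* The symbol table for module 1, transcribed from the table above. */
struct symbol module1[] = {
    { "MAIN()", 0, UNRESOLVED, 1, 0 },
    { "S1()",   5, UNRESOLVED, 0, 1 },
    { "X",      6, UNRESOLVED, 0, 0 },
    { "Y",      7, UNRESOLVED, 1, 0 },
};

int main(void)
{
    for (int i = 0; i < 4; i++)
        printf("%-7s %d %s%s\n", module1[i].name, module1[i].internal_address,
               module1[i].is_external ? "EXTERNAL" : "INTERNAL",
               module1[i].is_global && !module1[i].is_external ? " GLOBAL" : "");
    return 0;
}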

If we build an address table like we did in the first section, we can see where all the data is in relation to the current file (remember that files start at 0 and continue on till they run out of code). For module 1, here's what a sample object module would look like:

ADDRESS VALUE
0 LOAD
1 6
2 STORE
3 7
4 JUMP
5 ??
6 STORAGE FOR X
7 STORAGE FOR Y

--------------------------------------------------------------------------------

Now we're going to take module two and do the exact same series of steps on it. First resolve it into a symbol table and then write out the object module address table.

MODULE 2:

INTERNAL GLOBAL S1()
EXTERNAL REFERENCE Y
INTERNAL REFERENCE Z

S1: LOAD Y 0 1
ADD 2 2 3
STORE Z 4 5
HALT 6

Z: STORAGE 1 7

And here's the symbol table for module #2:

SYMBOL INTERNAL ADDRESS EXTERNAL REFERENCE IDENTITY

S1() 0 - INTERNAL GLOBAL
Y 1 ? EXTERNAL
Z 7 - INTERNAL

Again, notice that the external reference for the variable Y is a ?. This is because, while compiling module 2, we still don't know anything about module 1. When we introduce the linker, we'll resolve the question marks.

Lastly, the object module (address) table for module number 2.

ADDRESS VALUE
0 LOAD
1 ??
2 ADD
3 2
4 STORE
5 7
6 HALT
7 STORAGE FOR Z
--------------------------------------------------------------------------------

Now we need to look at how the two object modules are put together to make one relocatable executable file. This extremely important task is accomplished by the linker. Remember the ?? entries we had in our tables? These unknown items are what the linker is supposed to resolve. It takes 2 (or more) object modules, places the MAIN function at the beginning of the program (at position 0), and then makes sure all references are resolved and the relocatable addresses (the operands we marked as addresses earlier) are updated to their new positions. When we perform the link operation, we essentially take both object modules and link them together into one file. This is done almost as a copy-and-paste operation, with a few touch-ups to make sure the addressing is correct. To make it easier to read, we will list the new module table with addresses stepping by 2 (remember that our simple machine takes a 1-byte instruction followed by a 1-byte operand).

Address Operator/Data Operand/Data
Module 1 Starts Here
0 LOAD 6
2 STORE 7
4 JUMP 8 *
6 STORAGE FOR X STORAGE FOR Y
Module 2 Starts Here
8 LOAD 7 *
10 ADD 2
12 STORE 15 *
14 HALT STORAGE FOR Z

The first thing you might notice upon looking at the table are the values marked with a *: the 8 for the JUMP, the 7 for module 2's LOAD, and the 15 for its STORE. These are the numbers the linker had to determine or change in order to link the file into an output module. The first two operands are left alone because the addresses 6 and 7 are still in the same place as when the file was originally created. However, the 8 listed for the JUMP statement was previously unknown; the linker determined its value by scanning through the symbol tables and the combined object module to find S1's new address. The LOAD instruction's operand in module 2 was previously unknown as well, and it was determined in the same manner as the destination of the jump. And lastly, the STORE command was previously using the value 7. If this value had been left alone, it is clear from the table that it would be writing into the storage for variable Y rather than for Z, which is what the program called for, so it was relocated to 15.
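To make that process concrete, here is a toy linker sketch in C: the two object modules are laid out end to end, then the internal relocatable addresses get their module's base added, and the external references are resolved through the global symbols. The struct layout, field names, and numeric opcodes are my own illustration; the course pack describes the process only in prose.

#include <stdio.h>
#include <string.h>

enum { LOAD = 1, STORE, ADD, JUMP, HALT };        /* hypothetical opcodes     */

struct global { const char *name; int offset; };  /* a symbol this module exports */
struct fixup  { int position; const char *name; };/* a reference to another module */

struct module {
    int code[8];                 /* the object module's words                 */
    int length;
    int internal_reloc[4];       /* positions holding internal addresses      */
    int n_internal;
    struct fixup external[4];    /* positions waiting on other modules        */
    int n_external;
    struct global globals[4];    /* symbols this module makes GLOBAL          */
    int n_globals;
    int base;                    /* filled in by the toy linker below         */
};

static struct module mod1 = {
    { LOAD, 6, STORE, 7, JUMP, 0, 1, 0 }, 8,      /* X holds the constant 1   */
    { 1, 3 }, 2,                                  /* operands pointing at X, Y */
    { { 5, "S1" } }, 1,                           /* JUMP destination unknown  */
    { { "MAIN", 0 }, { "Y", 7 } }, 2,
    0
};

static struct module mod2 = {
    { LOAD, 0, ADD, 2, STORE, 7, HALT, 0 }, 8,
    { 5 }, 1,                                     /* operand pointing at Z     */
    { { 1, "Y" } }, 1,                            /* LOAD source unknown       */
    { { "S1", 0 } }, 1,
    0
};

static struct module *mods[] = { &mod1, &mod2 };

/* Find a GLOBAL symbol across every module and return its final address. */
static int resolve(const char *name)
{
    for (int m = 0; m < 2; m++)
        for (int g = 0; g < mods[m]->n_globals; g++)
            if (strcmp(mods[m]->globals[g].name, name) == 0)
                return mods[m]->base + mods[m]->globals[g].offset;
    return -1;                                    /* unresolved symbol         */
}

int main(void)
{
    int output[16], next = 0;

    /* Pass 1: lay the modules out one after another and record their bases.  */
    for (int m = 0; m < 2; m++) {
        mods[m]->base = next;
        next += mods[m]->length;
    }

    /* Pass 2: copy the code, relocate internal addresses, fix externals.     */
    for (int m = 0; m < 2; m++) {
        struct module *mod = mods[m];
        for (int i = 0; i < mod->length; i++)
            output[mod->base + i] = mod->code[i];
        for (int i = 0; i < mod->n_internal; i++)
            output[mod->base + mod->internal_reloc[i]] += mod->base;
        for (int i = 0; i < mod->n_external; i++)
            output[mod->base + mod->external[i].position] =
                resolve(mod->external[i].name);
    }

    for (int addr = 0; addr < next; addr++)
        printf("%d\t%d\n", addr, output[addr]);
    return 0;
}

Running it prints the same sixteen words as the linked table above, with the mnemonics shown as their numeric stand-ins.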

--------------------------------------------------------------------------------

The last thing to cover is the loader. Its function is exactly the same as in our first example, so I'll let you refer back there for the technical workings of it. For the course pack examples, we'll use a starting address of 140 and load our compiled and linked program into memory to be executed. Remember that the addresses in our last object table are nothing more than offsets to which the base address gets added. Here's what the resulting table will look like:

Address Operator/Data Operand/Data
140 LOAD 146
142 STORE 147
144 JUMP 148
146 STORAGE FOR X STORAGE FOR Y
148 LOAD 147
150 ADD 2
152 STORE 155
154 HALT STORAGE FOR Z

As you can tell, all the loader did was take the original values and offset them so that they were positioned correctly for use in memory.
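If you'd like to convince yourself that the loaded image really runs, here's a small C interpreter sketch for the one-operand machine described earlier. The numeric opcodes are again my own stand-ins; since X is the constant 1 and the code computes Z = Y + 2, the expected output is Y = 1 and Z = 3.

#include <stdio.h>

enum { LOAD = 1, STORE, ADD, JUMP, HALT };   /* hypothetical numeric opcodes  */

int main(void)
{
    int mem[256] = { 0 };
    int base = 140;                          /* where the loader put us       */

    /* The loaded image from the table above: opcode/operand pairs, with the  */
    /* storage words for X (the constant 1), Y and Z in their proper places.  */
    int image[16] = { LOAD, 146, STORE, 147, JUMP, 148,
                      1, 0,                  /* X = 1, Y (not yet set)        */
                      LOAD, 147, ADD, 2, STORE, 155, HALT,
                      0 };                   /* Z (not yet set)               */
    for (int i = 0; i < 16; i++)
        mem[base + i] = image[i];

    int pc = base, ax = 0, running = 1;
    while (running) {
        int op = mem[pc], arg = mem[pc + 1]; /* HALT has no real operand; the */
        pc += 2;                             /* word after it is Z's storage, */
        switch (op) {                        /* read here but never used      */
        case LOAD:  ax = mem[arg];   break;  /* AX <- memory[arg]             */
        case STORE: mem[arg] = ax;   break;  /* memory[arg] <- AX             */
        case ADD:   ax += arg;       break;  /* AX <- AX + constant           */
        case JUMP:  pc = arg;        break;  /* continue execution at arg     */
        case HALT:  running = 0;     break;
        }
    }
    printf("Y = %d, Z = %d\n", mem[147], mem[155]);  /* expect Y = 1, Z = 3   */
    return 0;
}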

If you've followed this far and don't have any questions, then you should understand how the compiling, linking, and loading of relocatable modules works!
posted by Ramkumar @ 9:08 PM  
1 Comments:
  • At 12:59 AM, Blogger Iday said…

    Good stuff da Ram. Especially the effort put into the post.

    A small thing to note.
    We follow a diff approach in our team to teach the relocation stuff to the fresh recruits.
    Start with the ELF-32 spec and then make them go from the C code, dump the elf object file and then dump the elf executable.
    This, coupled with the ELF spec, makes it easier to understand all that u have mentioned.

     