The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

Implementation of char in GCC

For my own purposes, I am very curious about how the compiler gcc works internally. I have found some information on gcc, but it wasn't quite what I was looking for. I'm not interested in RTL or SSA or things of that nature but rather I really want to know how gcc handles code statements such as,

char str[] = "This is a string";

I am sorry if this question seems trivial, but I don't know a whole lot about how compilers work and I was just wondering how gcc computes the length of "This is a string" and then creates an array str[] of that size.

Thank you in advance to anyone who can point me in the right direction for this type of information.
Frank Jaeger
Monday, November 06, 2006
I share this interest and I don't know the answer. "Programming Language Pragmatics" and "Crafting a Compiler" are both good compiler books that I own. (I am also trying to get hold of a used copy of the Dragon Book).
Greg
Monday, November 06, 2006
GCC computes the length of the string just like you do: it counts the characters one by one until it hits the null terminator.

At that point the required size of the array is known.
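As a rough illustration of that counting (a sketch only, not GCC's actual code), a scanner can walk the source text from the opening quote to the closing quote and add one for the terminator. The function name literal_array_size is made up for this example, and escape handling is simplified so that a backslash plus the following character counts as one byte.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: src points at the opening quote of a string literal
   in the source text. Returns the number of bytes the array needs
   (characters plus the terminating '\0'), or 0 if the literal is never closed.
   Simplification: a backslash and the character after it count as one byte. */
size_t literal_array_size(const char *src)
{
    assert(src != NULL && *src == '"');

    size_t count = 0;
    const char *p = src + 1;              /* skip the opening quote */

    while (*p != '\0' && *p != '"') {
        if (*p == '\\' && p[1] != '\0')
            p++;                          /* treat \n, \", \\ and so on as one byte */
        count++;
        p++;
    }
    if (*p != '"')
        return 0;                         /* ran off the end: unterminated literal */

    return count + 1;                     /* +1 for the implicit '\0' */
}
```

Called on the source text "This is a string" (quotes included), this returns 17: sixteen characters plus the terminator.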

Or were you intending to ask something different?
David Jones
Monday, November 06, 2006
BillT
Monday, November 06, 2006
The easiest way to get to grips with this sort of thing is to write your own compiler; if you restrict yourself to very basic code generation, using a VM language you invent yourself for the purpose, it's not actually very hard. The most difficult bit (I found) was learning flex and bison, the documentation for which appears clear only once it's clicked. Once you're past that stage, though, the rest is surprisingly straightforward, if a little dull.

Regarding your question specifically, once the program has been chopped up into syntactic elements according to the rules you provide (which flex does for you), and those elements have been grouped according to the grammar you specify (which bison does for you), the usual approach is to construct a tree-like structure corresponding to the program. In this particular case, you would probably have a "variable" node in the tree, tagged with the type (array of char), the name ("str"), and the optional initial value (the string "This is a string"). Compilation then proceeds by navigating the tree and running some appropriate code for each node encountered.

Continuing our example, when a "variable" node is met, space must be allocated somehow to store the data, and the name added somehow to the symbol table, which maps names to locations of values. (The symbol table is used when encountering names in the text, so that the compiler knows which action to take and how to verify that it makes sense in that context.)

Assuming this is a global, the result might be to examine the type (array of char), check the optional initial value, determine the amount of space required for the value (in this case with strlen plus one for the terminating null, perhaps), make space for it, fill the space with the initial value, and store off the fact that the name "str" refers to the char array at that location.
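Those steps for the global case can be sketched in a few lines of C. Everything named here (struct symbol, declare_char_array) is hypothetical and invented for illustration. A real compiler would emit the bytes into the object file's data section; malloc stands in for "make space for it" so the sketch stays small.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical symbol-table entry: maps a name to the location of its value. */
struct symbol {
    const char *name;     /* e.g. "str" */
    size_t      size;     /* bytes reserved for the value */
    char       *storage;  /* where the value lives */
};

/* Handle a global "char name[] = init;" node: size the array from the
   initial value, reserve the space, copy the value in, record the name.
   Returns 0 on success, -1 if no space could be reserved. */
int declare_char_array(struct symbol *sym, const char *name, const char *init)
{
    assert(sym != NULL && name != NULL && init != NULL);

    size_t size = strlen(init) + 1;   /* characters plus the '\0' */
    char *storage = malloc(size);     /* stand-in for reserving data-segment space */
    if (storage == NULL)
        return -1;

    memcpy(storage, init, size);      /* fill the space with the initial value */
    sym->name = name;
    sym->size = size;
    sym->storage = storage;
    return 0;
}
```

For the declaration in the question, calling declare_char_array(&sym, "str", "This is a string") would leave sym.size at 17.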

If it's a local variable, the behaviour would be broadly similar, but complicated slightly by the fact that the exact location isn't known ahead of time; you'll need to generate some code to allocate the space on the stack or the heap (according to language) and perhaps some more code to set the initial value, and this time the name refers not to a specific location but a location on the stack or on the heap... this is just a question of bookkeeping, and a bit of forward planning. (Like I said... it does get a little dull :)

Writing your own compiler is an interesting exercise, and I recommend it. I don't have any books to particularly recommend, though I probably absorbed something from my attempts to read the Dragon Book (which I found really dull). I did find the source code to Perforce Jam quite enlightening, however, as it is nicely put together and, whilst not a compiler, processes its scripting language in the manner I describe.
Monday, November 06, 2006
> I was just wondering how gcc computes the length
> of "This is a string" and then creates an array
> str[] of that size

The str[] array is nothing but a pointer. It has no concept of size.

In a nutshell all it does is fill some contiguous memory with 'This is a string' + \0 and points str[] to the start of that memory block.
Jussi Jumppanen
Monday, November 06, 2006
Jussi, it is implemented as an array, it is not just a pointer.

Arrays do, however, *decay* to a pointer when passed by value.

Arrays in C, even dynamic (variable-length) arrays in C99, have a size associated with them.  Just check the result of sizeof(str) if you don't believe me :)
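A minimal check along those lines, with made-up function names: sizeof applied to the array gives the whole object, while the same array passed to a function arrives as a plain char *.

```c
#include <assert.h>
#include <stddef.h>

char str[] = "This is a string";

/* By the time the array reaches a function it has decayed to char *,
   so sizeof here measures a pointer, not the array. */
size_t size_seen_by_callee(char *p)
{
    return sizeof p;
}

void check_sizes(void)
{
    /* The array itself: sixteen characters plus the '\0'. */
    assert(sizeof str == 17);

    /* After decay only the pointer's size is visible. */
    assert(size_seen_by_callee(str) == sizeof(char *));
}
```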
Arafangion
Tuesday, November 07, 2006
sizeof is calculated at compile-time.  str is effectively a pointer: it represents neither more nor less than the address of a block of memory, either on the stack or in a data segment of the program. The "array" exists only in the compiler's imagination.
Tuesday, November 07, 2006
This reminds me of the classic:

char *A  = "Hello";      /* vs. */
char B[] = "world";
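
For anyone who hasn't met it, here is a small sketch of the difference. A is declared const char * here, unlike the line above, because writing through a pointer to a string literal is undefined behaviour; the function name is made up.

```c
#include <assert.h>
#include <stddef.h>

const char *A = "Hello";  /* a pointer to a string literal stored elsewhere */
char B[] = "world";       /* a 6-byte array initialised with a copy of the literal */

void compare_A_and_B(void)
{
    assert(sizeof B == 6);               /* five letters plus '\0' */
    assert(sizeof A == sizeof(char *));  /* just a pointer */

    B[0] = 'W';                          /* fine: B is its own writable storage */
    assert(B[0] == 'W');

    /* A can be pointed at something else, but the literal it points to
       must not be modified. */
}
```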
Wednesday, November 08, 2006
str is a symbol in the current scope.
When the compiler sees this, it sees a new symbol str and a new static string "whatever", and that the type of str is char[_len of whatever_]. It adds _whatever_ to the current text area, adds str to the current static symbol table, and assigns it the to-be-determined address in the text area. str doesn't mean anything outside the current scope. In the current scope, the compiler replaces references to "str" with "address of str", which is filled in when the program is linked, because the address of str is not determined until the app is linked. Type checking is done at compile time. Literal strings are assigned addresses at link time.
Monday, November 13, 2006

This topic is archived. No further replies will be accepted.
