The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

Bootstrapping a compiler from nothing.

How do you bootstrap a compiler from nothing?  How are new compilers made?  How was the first compiler made?  How are assemblers made?

I have found these websites -

And they look like they are very interesting and just what I am looking for, but they are all down.  I have not been able to find any others.  So, my main question is, how do you bootstrap a simple compiler from nothing?
Steven Gary Noyce Send private email
Tuesday, June 27, 2006
All compilers start with Lex and Yacc. If you are really intersted with this topic, read a book called "Introduction to Compiler Construction With Unix". It was written in 80s but the techniques did not change since then.
Tuesday, June 27, 2006
flex & bison are the new lex & yacc
outcast (ex-developer) Send private email
Tuesday, June 27, 2006
Really? How were Lex and Yacc written?

I kinda doubt that anything called Yet Another Compiler Compiler was the progenitor of all compilers. In fact, I doubt it was even the first compiler compiler ;)

I'm sure the history went something like

machine code
macro processers of machine code

but, why are you writing a compiler?
PHDude Send private email
Tuesday, June 27, 2006
> Introduction to Compiler Construction With Unix

Excellent book. I've used it a lot over the years.
son of parnas
Tuesday, June 27, 2006
You don't ;-)

I'm assuming the question is how did (for example) the first C compiler get compiled?

It was written in (I beleive) BCPL, a predecessor to the C language.  BCPL's compiler was probably written in assembly.  At some point when C was popular enough, someone wrote a C compiler in C, and that could be used to subsequently compile itself.

For C++, Stroustrup wrote a preprocessor that translated the code to C, which could be compiled, and then wrote a genuine compiler in C++:
Tuesday, June 27, 2006
Type in bootstrapping a compiler from nothing into google and the first link is this one.

But it is down.  Has anyone read that article?  Does anyone know what it is talking about?  I found it though wikipedia's "Bootstrapping (compiler)" page as an external link, and I have found refrences to it on many sites.  Mainly, I am just very frustrated with the fact that it is not working and want to know what it says.  (sorry for my ranting, I am frustrated)
Steven Gary Noyce Send private email
Tuesday, June 27, 2006
Tuesday, June 27, 2006
Steve, you keep asking the same question, but it's not clear why you want to know.  What do you want to do with this information?  Intellectual curiosity only?

The easiest answer to this is to take $100 and go to MicroCenter and buy a copy of Visual C++ Learning Edition.  That "bootstaps a compiler from nothing", after all.  You had $100, now you have a compiler.  yay.
Tuesday, June 27, 2006
You should ask Scott.
Vince Send private email
Tuesday, June 27, 2006
The first assemblers were poked in the machine by toggling physical switches on the front the machines. The trick is to put programs on the machine that can help you in the process of loading programs as soon as possible. Think of loaders, assemblers, punch tape readers, etc. Once the minimal infrastructure is in place, you can start a series of ever more complex tools, helped by the existence of the simpler tools already on the machine.

One of the articles you referred to a language resembling Forth. Studying Forth is an excellent source for understanding compilers, in particular for the compiler back-end. Forth itself uses a very simple compiler that can be written in a few hundred bytes, placed on a kernel of a dew dozen machine language instructions. Once this infrastructure is in place the bootstrapping can really take of.

Lex and Yacc are mentioned. Useful tools, but these are not the pinacle of compiler compilation. There are better tools on the market depending on your needs. In our company we have written our own generator for the compiler of the Carmen and Carmencita programming languages.

If you are going to write a compiler yourself, it is highly unlikely that you will start from absolutely nothing and without the support of existing tools. In that case picking the right source, implementation and target language is very important. From my blog:
Karel Thönissen Send private email
Tuesday, June 27, 2006
Anyone who doesn't know the answer to this inside and out and who doesn't write a compiler at least once per day, doesn't deserve to live.
Scott Send private email
Tuesday, June 27, 2006
I once wrote a compiler that converted SAS Version 5 transport dataset metadata into LaTeX. But I'm guessing that's not what you're talking about (:=)
Steve Hirsch Send private email
Tuesday, June 27, 2006
Fortran was AFAIK the first proper high level language

I remember reading an account of it being developed (in assembler) and how many people at the time doubted whether a high level language was possible or practically possible.

The account that I read was excellent, but sadly I can't find it again.  But here is another page with tons of links:

Subsequent languages basically followed the proven track of Fortran (program from the ground-up in assembler), until at some point, somebody thought of making high level languages, at least partially, in high level languages, perhaps simpler high level languages used as a foundation for more complex ones.
S. Tanna
Tuesday, June 27, 2006
S. Tanna
Tuesday, June 27, 2006

It helps to know exactly what you've got to work with.  There are only a few instructions necessary to implement a general-purpose CPU.  So first you've got to understand those ... basic arithmetic, logic, and conditional jumps.

After that, it helps to know what you *want* to work with.  You'd do well to write down the category of expressions your language will support.  Your goal is then to translate these expressions into the basic CPU instructions you've become familiar with.

For example, the abstract expression defining a procedure:

(procedure (-> (int int) int) (vars V) (code C))

Can be compiled to:

// allocate stack space for local variables
'(sub (reg esp) (dword `(size-of V)))
// compile all sub-expressions
(for-each C compile)
// return to the caller, cleaning up local variables
'(ret `(size-of V))

When you've fully defined this mapping, then you can worry about how to bootstrap it.  In principle, you could achieve this by writing a compiler in your target language, then performing the bootstrap translation by hand.  Many people instead choose to write two sets of translation rules: one to throw away in a common language, and the other in the target language.

The real trick is defining the translation from your abstract program representation to the representation supported by your target CPU.  The rest of it is pretty simple by comparison (and those tools mentioned [yacc, antlr, and such] will only help you with the phase that converts text to abstract syntax trees -- this is not a difficult problem compared to what you'll face once you've got the abstract trees).
Tuesday, June 27, 2006
Not answering the question, but this is very related to the bootstrapping of compilers, and if you haven't seen it before it's a real eye-opener once the lightbulb "ah-ha" moment kicks in:

Ponder that one for a while, while you consider what goes on in developing a compiler. It's actually PrettyDamnedScaryStuff if you ask me.
Wednesday, June 28, 2006
It depends on what your definition of "nothing" is. Nowadays, "nothing" is likely to include at least a processor+bus architecture and an operating system. As a lot of people have suggested, "nothing" can also include tools such as an assembler or a pre-built compiler, and scanner and parser generators like lex and yacc.

The article that you point to (Bootstrapping a Simple Compiler from Nothing by Edmund GRIMLEY EVANS) starts with my basic definition of "nothing": a processor architecture (i386) and an operating system. This takes care of some issues that need to be abstracted away from a compiler writer; for example, getting input and writing output and the final format in which to store the executable code that gets compiled.

Operating systems usually mandate the storage format. Whereas ultimately executable code is numbers that correspond to instructions that are understood by the target microprocessor, it is packaged in a file format which contains additional metadata, which allows the operating system to load the code into memory, load any other code that it is dependant upon, and execute the lot correctly.

In the bootstrapping article, Mr. GRIMLEY EVANS (his name as it appears in the article) began by writing an application in the machine language of the i386 processor. He manually converted the instructions into the numbers that they correspond to, and wrote these numbers as hexadecimal digits into a text file (hex1.he). He also manually converted the metadata required by the ELF file format (the format that is used to package executable code in GNU/Linux) and wrote that into this text file as well. If somehow, the hex digits in the text file could be converted one-to-one into binary and saved, the result would be an executable file. Mr. GE used a program to do that the first time. He does not mention what program he used.

The interesting part comes after that: the instructions (that he manually converted to hex) were for reading hex digits and converting them to binary, and finally writing an ELF header. Thus, beyond the first time, the "compiled" hex1 could be used to compile itself. Mr. GE then proceeded to add more features incrementally, and improving the language beyond simple hex digits.

So that's how you bootstrap a compiler from "nothing".

To the gentleman who said "All compilers start with lex and yacc", that depends on what your definition of "all" is.

I think I am due some attention from Ms. Lewinsky now :-)
Raj Chaudhuri Send private email
Wednesday, June 28, 2006
Wirth wrote the first Pascal compiler in Pascal, on paper, and then ran it *by hand*.  The result was a bunch of machine code (or some other low-level language) that was typed into the computer, and there was the executable compiler.
Richie Hindle Send private email
Wednesday, June 28, 2006
>How do you bootstrap a compiler from nothing? 

I believe Carl Sagan said "...first you need to create the universe"

Honu Send private email
Wednesday, June 28, 2006
When you have a new CPU you just use a cross-compiler.
Wednesday, June 28, 2006
Thanks for all of the info guys!  I will go read some more.  That website I wanted to look at is now up, so I will look at that too.  This is a really good forum, I am finding it hard to keep up with all the replies (but too many is better than too few).  Thanks again, good bye for now.
Steven Gary Noyce Send private email
Thursday, June 29, 2006
These days there are two needs for new compilers.

They typically write it in highlevel language that has existing implementation.
For instance if your target platform A DOESN'T have a compiler for language C you write it in platform B where there is implementation for language C. Then test binaries that you have created on platform B to move them to platform  A for testing, if they work. You can try compiling the compiler with itself and if  works you have moved it to a platform A.
You can do the same thing for EVERY peace of software that is needed on platform A. You write a compiler that RUNS on platform B while targets platform A and after it is finished it should be able to compile its self on platform A.
If platform A doesn't have operating system or anything like that, you need to compile that with the compiler you have written on platform B targetting platform A. Until you have infrastructure ready in platform A to run the compiler, and then compile itself to platform A.

Another need of writing compiler from cratch is that the LANGUAGE doesn't have any implementations.
Then the best choice is writing an implementation of language in existing language with compiler.
And there are lots of tools that are designed to make THAT job easier. The best solution is that there is BACKEND compiler ready and you just write ONE stage between the YACC output and input for the backend.
Jouni Osmala Send private email
Thursday, June 29, 2006
Here's a site with a good graphic of the history of programming languages and their connections to one another. Yes, it begins with Fortran in 1954.

There are also an OS history charts on his site.
Friday, June 30, 2006
Read the following paper:

Schorre, D.V. "META II: A Syntax-Oriented Compiler Writing Language". Proc. 19'th Nat'l. Conf. of the ACM (Aug. 1964),D1.3-1-D1.3-11.

Available via ACM.
Babar K. Zafar
Sunday, July 02, 2006
Peter Vanderwaart Send private email
Thursday, July 13, 2006
There is a principle (the benefits of which I do not know) that a compiler should be written in the language that it compiles. This permitted one of the most outrageous of hacks:

  Historically, back doors have often lurked in systems longer than

  anyone expected or planned, and a few have become widely known.

  Ken Thompson's 1983 Turing Award lecture to the ACM suggested the

  possibility of a back door in early Unix versions that may have

  qualified as the most fiendishly clever security hack of all time

  In this scheme, the C compiler contained code that would recognize

  when the `login' command was being recompiled and insert some

  code recognizing a password chosen by Thompson, giving him entry to

  the system whether or not an account had been created for him.

  Normally such a back door could be removed by removing it from the

  source code for the compiler and recompiling the compiler.  But to

  recompile the compiler, you have to *use* the compiler -- so

  Thompson also arranged that the compiler would *recognize when

  it was compiling a version of itself*, and insert into the

  recompiled compiler the code to insert into the recompiled

  `login' the code to allow Thompson entry -- and, of course, the

  code to recognize itself and do the whole thing again the next time

  around!  And having done this once, he was then able to recompile

  the compiler from the original sources; the hack perpetuated itself

  invisibly, leaving the back door in place and active but with no

  trace in the sources.

  The talk that suggested this truly moby hack was published as

  "Reflections on Trusting Trust", "Communications of the ACM

  27", 8 (August 1984), pp. 761--763.  Ken Thompson has since

  confirmed that this hack was implemented and that the Trojan Horse

  code did appear in the login binary of a Unix Support group

  machine.  Ken says the crocked compiler was never distributed.

  Your editor has heard two separate reports that suggest that the

  crocked login did make it out of Bell Labs, notably to BBN, and

  that it enabled at least one late-night login across the network by

  someone using the login name `kt'.
Peter Vanderwaart Send private email
Thursday, July 13, 2006
I should note that the quote in the previous past is from the Jargon File, specifically this copy:
Peter Vanderwaart Send private email
Thursday, July 13, 2006

This topic is archived. No further replies will be accepted.

Other recent topics Other recent topics
Powered by FogBugz