Chapter 1 -- Preliminaries

1.1 Reasons for Studying Concepts of Programming Languages (Page 2/33)

Note on the "2/33" page numbering: if you have the ninth edition, this section begins on Page 2. If you have the pdf version of the twelfth edition, this section begins on Page 33. Hence the "2/33" designation in the title above. CAUTION: there is also a paperback version of the twelfth edition, with page numbers wildly different from those of the pdf version of the twelfth edition. So the paperback version will NOT match up nicely with the page numbers listed in these notes. The only two options you should consider are the hardcover ninth edition and the pdf twelfth edition (eText).

Why is CS320 a required course? This course will likely be your first exposure to the C programming language, which you will need in later courses (and indeed it is essential in a significant fraction of real-world programming jobs). CS320 will also be our vehicle for introducing several side issues (such as how a compiler actually performs its magic, how a frame stack works, etc.), and we will look at various features of multiple programming languages and get hands-on experience in programming several of them. There will be mini-tutorials on features of C throughout the course, so that toward the end of the semester, you should be capable of writing significant C code. Earlier in the semester, we will cover the basics of some specific languages, and then do a hands-on assignment in each; APL will be one of our early programming experiences.

The Sebesta text is the only required book for this course; I will point you to web resources for language specifics [but you may find it more convenient to have actual texts to peruse]. At some point in your career, you will probably want to have a C language reference: C by Discovery by Foster & Foster is a good choice for learning C, because it explains things thoroughly and contains tons of sample programs to illustrate the concepts.
edoras:~masc0000/foster.tar.z is a tarball containing all the sample programs [from the second edition, but most of these are in the other editions, too]. If you happen to have access to this text, you will probably want my corresponding notes for it, which are at edoras:~cs320/Foster4th . [The page numbers in my Foster notes are keyed to the Fourth Edition; but since you can get older editions dirt cheap (AbeBooks.com had a second edition for $1.05 plus $2.95 shipping!), it's probably not worth an additional $80 just to have the page numbers match my notes.] A good place to do comparison pricing is addall.com; the ISBN for the second edition is 1881991296 (also 1881991298).

Our 320 text lists several reasons for studying programming language concepts. An obvious one is the fact that the more tricks you learn, the more capable you become. Learning even more languages then becomes easier, since there won't be as many new syntax patterns to stub your toe upon.

Also, you should hope that your career will advance to the point where you will have the authority to decide which language will be used for a new project. A language well-suited to one project may be a disastrous choice for a different project. Sebesta points out that it is a natural human tendency to always choose the language you know best. [This is the 'if you give a kid a hammer, everything looks like a nail' syndrome.] It is important to have a much broader horizon if you are to be an effective project manager. While almost any language can perform almost any task given enough effort [though it may perform it exceptionally slowly], it is better to match the programming tool [language] to the task. The fact that you can put a tutu on a horse does not mean that it's a good idea to put the horse on the ballet stage.

As we delve into the implementation details, you should become more comfortable with 'the way things work', which will likely make you a more effective programmer.
You have probably encountered various mysteries with the compilation or execution of some of the programs you have written. An understanding of the inner workings and concepts will likely reduce the number of unsolved mysteries. And with luck, you'll learn some new tricks for languages that you already thought you knew.

1.2 Programming Domains (Page 5/38)

Scientific Applications: Many of the programs you have written up to this point probably fall into the category of Scientific Applications. FORTRAN, the dominant language at the dawn of computing, was designed for this task. Scientific applications typically employ very simple data structures but lots and lots of numeric calculations, and FORTRAN did that quite capably.

Business Applications: The Business Application category is *still* dominated by COBOL, a dinosaur from half a century ago. It was designed to make it easy to print reports (paychecks, profit-and-loss statements, etc.). The syntax does a remarkable job of simulating natural English, e.g.,

    ADD ITEM-PRICE TO TOTAL-ORDER

...giving naive business managers the impression that they could read the code. (Note that '-' is part of the variable names, and has nothing to do with subtraction; you need the keyword SUBTRACT to take the difference of two numbers.)

Artificial Intelligence Applications: The language of choice for the Artificial Intelligence category was Lisp, a 'List Processor': it manipulates linked lists of symbols. Statements in Lisp are likewise linked lists of symbols, so a running Lisp program can make up new lines of code on the fly and then execute them. There are many languages based on the fundamental 'list processing' idea of Lisp; the one we will experiment with this semester is Scheme (described in Chapter 15 of Sebesta).
Systems Programming Applications: In the early days, an operating system had to be written in assembly language (or worse, in machine language), but then large companies developed higher-level languages that were suited to the task. Soon after Bell Labs developed the C language, it was almost universally adopted for systems work. It's not much of an exaggeration to say that all Systems Programming is done in C. [This is one of the reasons why you definitely want to become fluent in C.] C maps very directly onto assembly language in much the same way that assembly language maps directly onto machine code, so C programs run quite efficiently, which is an essential quality for an Operating System.

Whenever a new CPU is introduced with a new instruction set, that CPU will likely be capable of running UNIX within an incredibly short period of time. This is because, with the exception of a few tricky bits, the entire operating system is written in C. So, once you have a C compiler for this new system, you've almost got everything you need to construct a UNIX system on the new hardware. An example of one of the 'tricky bits': when the operating system suspends the process currently using the CPU and restarts another process on the CPU, this involves specialized machine instructions that can't be expressed in C, and must instead be written in assembly language. There may need to be some adjustments for new hardware features as well, but overall, the port is largely accomplished before it begins.

Since C compilers are written in C, porting the C compiler to a new system is even easier than porting UNIX. At first glance, that sounds insane: the 'new' C compiler is supposed to create machine-code instructions for a machine whose instruction set is [presumably] different from that of any previously-existing machine. How can you create the first C compiler for a new architecture that does not yet have any usable software at all?
The trick is to use a cross-compiler [or more likely, a cross-assembler]. A typical compiler on edoras is set up to create executable code that can run on edoras [or other Intel hosts using the same architecture], but a cross-compiler creates machine code targeted for a different type of host. So, to 'get things going' on a brand-new architecture design, an existing C compiler on older hardware is tweaked to produce machine-code instructions for the new hardware -- the results can then be moved to the new hardware and run. [You perhaps already had a gut feeling about this process: if you were creating a new iPhone application, you wouldn't expect the iPhone to compile it for you, would you?]

Another reason C runs efficiently is that it was not designed for Children Being Naughty. Many languages will, while your program is running, do many checks for validity (like whether you are trying to access beyond the end of an array), and this slows things down significantly. C is for grownups. If you tell C to write beyond the end of an array (or even into the memory locations before the beginning of an array), it will merrily do so [and the results will most likely be to destroy data for some of your other variables]. The C programmer is thus responsible for inserting range-checking code in those circumstances where this might be an issue -- the compiler does not do it for you [to you].

Likewise, if your main() program has a call to, say, sin(a), where a is declared to be an integer, a *copy* of the bit patterns in the four bytes where 'a' lives will be put on the frame stack for sin() to use. However, sin() expects the argument to be a double [a high-precision floating point number], not an integer, so it will read eight bytes from the frame stack [not just four], and interpret those eight bytes as an exponent and mantissa, not as a simple integer. What sin(1) 'sees' will definitely not be a '1'.
This sin() mismatch can be avoided by making sure the file containing main() has:

    #include <math.h>

...which will ensure that main() knows about the functions it may be calling. In particular, /usr/include/math.h contains the line

    double sin(double x);

which is a prototype declaring that sin() is a function that wants a double as an argument, and will return a double as an answer. Armed with this information, the C compiler will reserve 8 bytes on the frame stack for the argument, and will convert the integer (1 in the above example) to the correct format (giving it the correct exponent and mantissa to be interpreted as 1.0). In the same manner, it will reserve 8 bytes in which to place the answer that sin() delivers back to main() [by default, the compiler will assume that any unknown function returns an int, and will reserve only four bytes].

Web Software: A markup language (like HyperText Markup Language, HTML) is not a programming language; it's basically just a way to describe how text should be displayed on a page, and a way to accommodate web page links. However, it is often necessary to generate HTML directives on the fly for customized content. (Think about what you see when you log on to an on-line bank account; you are not getting a page that was already stored on disk, but a page that was constructed on demand, just a moment ago.) Languages that BUILD these 'dynamic' web pages are indeed programming languages.

1.3 Language Evaluation Criteria (Page 7/41)

I don't have much to add to this section (indeed, it goes on a bit too long, in my estimation). Several general principles are defined, and then specific instances are discussed. A desirable language feature is 'readability'. Perl, for example, has sometimes been derided as a 'write-only' language; once a Perl program has been written, it can be very hard for others to decipher (and sometimes hard even for the author of the program, after the immediate details have been forgotten).
If you are employed as a software developer, there is a very good chance that most of your time will be spent modifying existing code, rather than writing shiny new code. Being able to figure out what the previous programmer did is essential to successfully modifying the code, so readability is extremely important.

The next subsection (Section 1.3.1.1, Page 9/43) basically elaborates on the idea that things should be simple, but not too simple. If you read through their examples, you'll see what they mean. (So READ IT!)

Choosing appropriate syntax can drastically affect the usability of a language. Bad syntax constructs can make programs hard for users to decipher and/or hard for the compiler to process. FORTRAN, designed back when we had little practical experience regarding what works and what doesn't, is pretty bad. Control structures like 'while' and 'until' were not in the language, in part because the designers wanted to keep the compiler simpler -- the idea of a compiler was a relatively new thing, and writing a working one was a huge undertaking. (Nowadays, we have compiler-compilers such as yacc and bison, and it's much easier!)

FORTRAN has a DO loop construct. An example from FORTRAN66 would be:

    DO 30 I = 1,5

...which will iterate the body of the loop [up through the statement labelled 30] as the variable I takes on values 1 through 5.

    DO 30 I = 1.5

...looks like an error (there's a period where you would expect a comma), but it is legal. This is an assignment statement: it causes the floating point value 1.5 to get assigned to the variable named "DO 30 I" -- yes, spaces are allowed in variable names... and it gets worse. The compiler is required to ensure that "DO 30 I", "DO30I", and "D O 3 0 I" all refer to the same memory location! In order for the compiler to figure out what a line of code means, it first has to group the characters into separate words or 'tokens'.
It is really useful for the compiler to know that a space means "we've come to the end of a token", but the FORTRAN compiler can't make that assumption. It has to read all the way to the comma (or period) before it can tell whether this is a DO loop or not. It gets worse:

    DO = 1.5

is also legal. "DO" can be a variable name as well as a control keyword. In most languages that have a DO statement, the moment the compiler identifies "DO" as a token, it knows for sure that it is looking at a control statement (or, if things are mangled, a syntax error) -- it is not burdened with backtracking and trying to match the statement to some other sort of construct. [As we will see, "DO" would then be described as a 'reserved word' rather than a 'keyword'.]

So, it turns out that writing that compiler was a lot more work than it needed to be; if only they had made some better decisions when designing the syntax! And the issues described above make it harder for humans to discover the meaning of FORTRAN code -- it's not just a compiler issue. Of course, in days of yore, when multi-million-dollar computers roamed the earth and we had a $1.25 minimum wage, it was the compiler's time we cared about, not the human's time. Indeed, there was a time when using astronomically expensive CPU time to compile code (which could 'just as easily' be accomplished by a roomful of assembly language programmers) seemed absolutely frivolous to some. After we later delve more deeply into how a compiler works, you probably should come back and read this analysis of FORTRAN again; once you have more background, all this will be more meaningful.

Another desirable feature is 'reliability'. The earlier sin() discussion about parameter passing in C illustrates this. Early versions of the C compiler [traditional C] did not do type checking, allowing an integer bit pattern to be passed to sin() and treated as a double, with disastrous results.
Instead, programmers could feed all their .c source files to a program called 'lint' [and you still can; the program is called 'splint' on edoras], which would do a bunch of sanity checks, including comparing the actual parameters of each function call to the expected type of the parameters. [Recall that a C program can be split up into many different .c files, and each .c file can be compiled separately, in which case the C compiler did not know what was declared in those other .c files.] The [sp]lint program looked at 'the big picture', whereas the C compiler dealt with the files one at a time.

ANSI C introduced function prototypes [for example, "double sin(double x);"] so that, as long as the programmer included the appropriate declarations for all the function calls, the C compiler could ensure that no type mismatches occurred [by arranging for the needed conversions, or raising warnings/errors when the programmer asked for something dubious]. Several other features that contribute to reliability, such as the ability to handle run-time exceptions without crashing, are also discussed in the text.

1.3.4 Cost (Page 17/53)

The author has some good things to say regarding cost, so read over this subsection.

We will skip most of the rest of Chapter 1 [I expanded on the most important background in my Chapter 0 notes]. Section 1.7.1 (Page 26/68) gives a general description of the compilation process [which I already covered in gory detail for the C compiler]. Section 1.7.4 (Page 32/76) gives more sophisticated examples of how the preprocessor works.

A note about the exams: rote memorization of a bunch of facts is a disastrous way to prepare for my exams, since I will be trying to test your understanding of important concepts. So, you will definitely NOT see questions of the form: 'What are the differences between Fortran66 and Fortran77?'
...because no one needs to carry this sort of information around in their heads; if you really needed the answer, Google could tell you very quickly. Instead, you should expect questions like: 'Why can't we store global variables on the stack?' ...which you probably can't answer right now, but as you get more comfortable with the concepts we cover in this course, the reason should become completely obvious. By contrast, if you were compiling a list of facts, the answer would probably not be on your list. So, it is very important to concentrate on understanding, not memorizing! (Plus, you'll be able to bring a page of notes to the exams for things that are inconvenient to memorize.)

The review questions at the end of each chapter can be a useful study guide. Some especially useful ones are:

    Sebesta questions for Chapter 1 (Page 34): 1,3,4,5,6,12,14,19,20,23,26,27,28

These are for the NINTH edition. What was question 12 in the ninth edition has been removed, so what was question 14 became question 12 in the twelfth edition, and so on. Here's what the (shortened) list looks like in the twelfth:

    Sebesta questions for Chapter 1 (Page 81 TWELFTH ed): 1,3,4,5,6,12,18,19,22,25,26,27

Many of the Sebesta review questions are of the form: "Give a definition for [something]". I rarely phrase exam questions that way; I would be more likely to ask: "Does [some-concrete-example] fit the definition for [something]?" or: "Give an example that fits the definition of [something]." Therefore, you must know what the definition *means*; this is far different from memorizing a sequence of words that you can parrot back.

Some additional sample questions that could appear on our exams:

A. Why would someone want to use a cross-compiler?
B. What are the benefits of #include'ing one of the system .h files?
C. List one or two instances in which FORTRAN language design decisions made the parsing of programs harder than necessary.
D. Why is it useful to have function prototypes (as in ANSI C)?