Chapter 1 -- Preliminaries

1.1 Reasons for Studying Concepts of Programming Languages (Page 2/33)

Note on the "2/33" page numbering: if you have the ninth edition, this section begins on Page 2. If you have the pdf version of the twelfth edition, this section begins on Page 33. Hence the "2/33" designation in the title above. CAUTION: there is also a paperback version of the twelfth edition, with page numbers wildly different from those of the pdf version of the twelfth edition. So the paperback version will NOT match up nicely with the page numbers listed in these notes. The only two options you should consider are the hardcover ninth edition and the pdf twelfth edition (eText).

Why is CS320 a required course? This course will likely be your first exposure to the C programming language, which you will need in later courses (and indeed it is essential in a significant fraction of real-world programming jobs). CS320 will also be our vehicle for introducing several side issues (such as how a compiler actually performs its magic, how a frame stack works, etc.), and we will look at various features of multiple programming languages and get hands-on experience in programming several of them. There will be mini-tutorials on features of C throughout the course, so that toward the end of the semester, you should be capable of writing significant C code. Earlier in the semester, we will cover the basics of some specific languages, and then do a hands-on assignment in each; APL will be one of our early programming experiences.

The Sebesta text is the only required book for this course; I will point you to web resources for language specifics [but you may find it more convenient to have actual texts to peruse]. At some point in your career, you will probably want to have a C language reference: C by Discovery by Foster & Foster is a good choice for learning C, because it explains things thoroughly and contains tons of sample programs to illustrate the concepts.
edoras:~masc0000/foster.tar.z is a tarball containing all the sample programs [from the second edition, but most of these are in the other editions, too]. If you happen to have access to this text, you will probably want my corresponding notes for it, which are at edoras:~cs320/Foster4th . [The page numbers in my Foster notes are keyed to the Fourth Edition; but since you can get older editions dirt cheap (AbeBooks.com had a second edition for $1.05 plus $2.95 shipping!), it's probably not worth an additional $80 just to have the page numbers match my notes.] A good place to do comparison pricing is addall.com; the ISBN for the second edition is 1881991296 (also 1881991298).

Our 320 text lists several reasons for studying programming language concepts. An obvious one is the fact that the more tricks you learn, the more capable you become. Learning even more languages then becomes easier, since there won't be as many new syntax patterns to stub your toe upon.

Also, you should hope that your career will advance to the point where you will have the authority to decide which language will be used for a new project. A language well-suited to one project may be a disastrous choice for a different project. Sebesta points out that it is a natural human tendency to always choose the language you know best. [This is the 'if you give a kid a hammer, everything looks like a nail' syndrome.] It is important to have a much broader horizon if you are to be an effective project manager. While almost any language can perform almost any task given enough effort [though it may perform it exceptionally slowly], it is better to match the programming tool [language] to the task. The fact that you can put a tutu on a horse does not mean that it's a good idea to put the horse on the ballet stage.

As we delve into the implementation details, you should become more comfortable with 'the way things work', which will likely make you a more effective programmer.
You have probably encountered various mysteries with the compilation or execution of some of the programs you have written. An understanding of the inner workings and concepts will likely reduce the number of unsolved mysteries. And with luck, you'll learn some new tricks for languages that you already thought you knew.

1.2 Programming Domains (Page 5/38)

Scientific Applications: Many of the programs you have written up to this point probably fall into the category of Scientific Applications. FORTRAN, the dominant language at the dawn of computing, was designed for this task. Scientific applications typically employ very simple data structures but lots and lots of numeric calculations, and FORTRAN did that quite capably.

Business Applications: The Business Application category is *still* dominated by COBOL, a dinosaur from half a century ago. It was designed to make it easy to print reports (paychecks, profit-and-loss statements, etc.). The syntax does a remarkable job of simulating natural English, e.g.,

    ADD ITEM-PRICE TO TOTAL-ORDER

...giving naive business managers the impression that they could read the code. (Note that '-' is part of the variable names, and has nothing to do with subtraction; you need the keyword SUBTRACT to take the difference of two numbers.)

Artificial Intelligence Applications: The language of choice for the Artificial Intelligence category was Lisp, a 'List Processor': it manipulates linked lists of symbols. Statements in Lisp are likewise linked lists of symbols, so a running Lisp program can make up new lines of code on the fly and then execute them. There are many languages based on the fundamental 'list processing' idea of Lisp; the one we will experiment with this semester is Scheme (described in Chapter 15 of Sebesta).
Systems Programming Applications: In the early days, an operating system had to be written in assembly language (or worse, in machine language), but then large companies developed higher-level languages that were suited to the task. Soon after Bell Labs developed the C language, it was almost universally adopted for systems work. It's not much of an exaggeration to say that all Systems Programming is done in C. [This is one of the reasons why you definitely want to become fluent in C.] C maps very directly onto assembly language in much the same way that assembly language maps directly onto machine code, so C programs run quite efficiently, which is an essential quality for an Operating System.

Whenever a new CPU is introduced with a new instruction set, that CPU will likely be capable of running UNIX within an incredibly short period of time. This is because, with the exception of a few tricky bits, the entire operating system is written in C. So, once you have a C compiler for this new system, you've almost got everything you need to construct a UNIX system on the new hardware. An example of one of the 'tricky bits': when the operating system suspends the process currently using the CPU and restarts another process on the CPU, this involves specialized machine instructions that can't be expressed in C, and must instead be written in assembly language. There may need to be some adjustments for new hardware features as well, but overall, the port is largely accomplished before it begins.

Since C compilers are written in C, porting the C compiler to a new system is even easier than porting UNIX. At first glance, that sounds insane: the 'new' C compiler is supposed to create machine-code instructions for a machine whose instruction set is [presumably] different from that of any previously-existing machine. How can you create the first C compiler for a new architecture that does not yet have any usable software at all?
The trick is to use a cross-compiler [or more likely, a cross-assembler]. A typical compiler on edoras is set up to create executable code that can run on edoras [or other Intel hosts using the same architecture], but a cross-compiler creates machine code targeted for a different type of host. So, to 'get things going' on a brand-new architecture design, an existing C compiler on older hardware is tweaked to produce machine-code instructions for the new hardware -- the results can then be moved to the new hardware and run. [You perhaps already had a gut feeling about this process: if you were creating a new iPhone application, you wouldn't expect the iPhone to compile it for you, would you?]

Another reason C runs efficiently is that it was not designed for Children Being Naughty. Many languages will, while your program is running, do many checks for validity (like whether you are trying to access beyond the end of an array), and this slows things down significantly. C is for grownups. If you tell C to write beyond the end of an array (or even into the memory locations before the beginning of an array), it will merrily do so [and the results will most likely be to destroy data for some of your other variables]. The C programmer is thus responsible for inserting range-checking code in those circumstances where this might be an issue -- the compiler does not do it for you [to you].

Likewise, if your main() program has a call to, say, sin(a), where a is declared to be an integer, a *copy* of the bit patterns in the four bytes where 'a' lives will be put on the frame stack for sin() to use. However, sin() expects the argument to be a double [a high-precision floating point number], not an integer, so it will read eight bytes from the frame stack [not just four], and interpret those eight bytes as an exponent and mantissa, not as a simple integer. What sin(1) 'sees' will definitely not be a '1'.
This sin() mismatch can be avoided by making sure the file containing main() has:

    #include <math.h>

...which will ensure that main() knows about the functions it may be calling. In particular, /usr/include/math.h contains the line

    double sin(double x);

which is a prototype declaring that sin() is a function that wants a double as an argument, and will return a double as an answer. Armed with this information, the C compiler will reserve 8 bytes on the frame stack for the argument, and will convert the integer (1 in the above example) to the correct format (giving it the correct exponent and mantissa to be interpreted as 1.0). In the same manner, it will reserve 8 bytes in which to place the answer that sin() delivers back to main() [by default, the compiler will assume that any unknown function returns an int, and will reserve only four bytes].

Web Software: A markup language (like HyperText Markup Language, HTML) is not a programming language; it's basically just a way to describe how text should be displayed on a page, and a way to accommodate web page links. However, it is often necessary to generate HTML directives on the fly for customized content. (Think about what you see when you log on to an on-line bank account; you are not getting a page that was already stored on disk, but a page that was constructed on demand, just a moment ago.) Languages that BUILD these 'dynamic' web pages are indeed programming languages.

1.3 Language Evaluation Criteria (Page 7/41)

I don't have much to add to this section (indeed, it goes on a bit too long, in my estimation). Several general principles are defined, and then specific instances are discussed. A desirable language feature is 'readability'. Perl, for example, has sometimes been derided as a 'write-only' language; once a Perl program has been written, it can be very hard for others to decipher (and sometimes hard even for the author of the program, after the immediate details have been forgotten).
If you are employed as a software developer, there is a very good chance that most of your time will be spent modifying existing code, rather than writing shiny new code. Being able to figure out what the previous programmer did is essential to successfully modifying the code, so readability is extremely important.

The next subsection (Section 1.3.1.1, Page 9/43) basically elaborates on the idea that things should be simple, but not too simple. If you read through their examples, you'll see what they mean. (So READ IT!)

Choosing appropriate syntax can drastically affect the usability of a language. Bad syntax constructs can make programs hard for users to decipher and/or hard for the compiler to process. FORTRAN, designed back when we had little practical experience regarding what works and what doesn't, is pretty bad. Control structures like 'while' and 'until' were not in the language, in part because the designers wanted to keep the compiler simpler -- the idea of a compiler was a relatively new thing, and writing a working one was a huge undertaking. (Nowadays, we have compiler-compilers such as yacc and bison, and it's much easier!)

FORTRAN has a DO loop construct. An example from FORTRAN66 would be:

    DO 30 I = 1,5

...which will iterate the body of the loop [up through the statement labelled 30] as the variable I takes on values 1 through 5.

    DO 30 I = 1.5

...looks like an error (there's a period where you would expect a comma), but it is legal. This is an assignment statement: it causes the floating point value 1.5 to get assigned to the variable named "DO 30 I" -- yes, spaces are allowed in variable names... and it gets worse. The compiler is required to ensure that "DO 30 I", "DO30I", and "D O 3 0 I" all refer to the same memory location! In order for the compiler to figure out what a line of code means, it first has to group the characters into separate words or 'tokens'.
It is really useful for the compiler to know that a space means "we've come to the end of a token", but the FORTRAN compiler can't make that assumption. It has to read all the way to the comma (or period) before it can tell whether this is a DO loop or not. It gets worse:

    DO = 1.5

is also legal. "DO" can be a variable name as well as a control keyword. In most languages that have a DO statement, the moment the compiler identifies "DO" as a token, it knows for sure that it is looking at a control statement (or, if things are mangled, a syntax error) -- it is not burdened with backtracking and trying to match the statement to some other sort of construct. [As we will see, "DO" would then be described as a 'reserved word' rather than a 'keyword'.]

So, it turns out that writing that compiler was a lot more work than it needed to be; if only they had made some better decisions when designing the syntax! And the issues described above make it harder for humans to discover the meaning of FORTRAN code -- it's not just a compiler issue. Of course, in days of yore, when multi-million-dollar computers roamed the earth and we had a $1.25 minimum wage, it was the compiler's time we cared about, not the human's time. Indeed, there was a time when using astronomically expensive CPU time to compile code (which could 'just as easily' be accomplished by a roomful of assembly language programmers) seemed absolutely frivolous to some. After we later delve more deeply into how a compiler works, you probably should come back and read this analysis of FORTRAN again; once you have more background, all this will be more meaningful.

Another desirable feature is 'reliability'. The earlier sin() discussion about parameter passing in C illustrates this. Early versions of the C compiler [traditional C] did not do type checking, allowing an integer bit pattern to be passed to sin() and treated as a double, with disastrous results.
Instead, programmers could feed all their .c source files to a program called 'lint' [and you still can; the program is called 'splint' on edoras], which would do a bunch of sanity checks, including comparing the actual parameters of each function call to the expected type of the parameters. [Recall that a C program can be split up into many different .c files, and each .c file can be compiled separately, in which case the C compiler did not know what was declared in those other .c files.] The [sp]lint program looked at 'the big picture', whereas the C compiler dealt with the files one at a time.

ANSI C introduced function prototypes [for example, "double sin(double x);"] so that, as long as the programmer included the appropriate declarations for all the function calls, the C compiler could ensure that no type mismatches occurred [by arranging for the needed conversions, or raising warnings/errors when the programmer asked for something dubious]. Several other features that contribute to reliability, such as the ability to handle run-time exceptions without crashing, are also discussed in the text.

1.3.4 Cost (Page 17/53)

The author has some good things to say regarding cost, so read over this subsection.

We will skip most of the rest of Chapter 1 [I expanded on the most important background in my Chapter 0 notes]. Section 1.7.1 (Page 26/68) gives a general description of the compilation process [which I already covered in gory detail for the C compiler]. Section 1.7.4 (Page 32/76) gives more sophisticated examples of how the preprocessor works.

A note about the exams: rote memorization of a bunch of facts is a disastrous way to prepare for my exams, since I will be trying to test your understanding of important concepts. So, you will definitely NOT see questions of the form: 'What are the differences between Fortran66 and Fortran77?'
...because no one needs to carry this sort of information around in their heads; if you really needed the answer, Google could tell you very quickly. Instead, you should expect questions like: 'Why can't we store global variables on the stack?' ...which you probably can't answer right now, but as you get more comfortable with the concepts we cover in this course, the reason should become completely obvious. By contrast, if you were compiling a list of facts, the answer would probably not be on your list. So, it is very important to concentrate on understanding, not memorizing! (Plus, you'll be able to bring a page of notes to the exams for things that are inconvenient to memorize.)

The review questions at the end of each chapter can be a useful study guide. Some especially useful ones are:

    Sebesta questions for Chapter 1 (Page 34): 1,3,4,5,6,12,14,19,20,23,26,27,28

These are for the NINTH edition. What was question 12 in the ninth edition has been removed, so what was question 14 became question 12 in the twelfth edition, and so on. Here's what the (shortened) list looks like in the twelfth:

    Sebesta questions for Chapter 1 (Page 81 TWELFTH ed): 1,3,4,5,6,12,18,19,22,25,26,27

Many of the Sebesta review questions are of the form: "Give a definition for [something]". I rarely phrase exam questions that way; I would be more likely to ask: "Does [some-concrete-example] fit the definition for [something]?" or: "Give an example that fits the definition of [something]." Therefore, you must know what the definition *means*; this is far different from memorizing a sequence of words that you can parrot back.

Some additional sample questions that could appear on our exams:

A. Why would someone want to use a cross-compiler?
B. What are the benefits of #include'ing one of the system .h files?
C. List one or two instances in which FORTRAN language design decisions made the parsing of programs harder than necessary.
D. Why is it useful to have function prototypes (as in ANSI C)?