Chapter 0: Overview/Background (NOT a Chapter in the Sebesta text!) In this course we will discuss languages in general, and carry out programming assignments in several different languages, but when I need to give concrete specifics, I will usually cite the behavior of a C compiler on a UNIX host (such as gcc on edoras or a linux PC). The C compiler is traditionally called "cc" [cc stands for _C_ _C_ompiler]; on linux systems, it's gcc [GNU C compiler]. Our old Solaris host (rohan) had a cc supplied by Sun Microsystems, as well as gcc from the GNU open software project [GNU = "_G_NU's _N_ot _U_NIX"]. On edoras, the only compiler is gcc:

    edoras% ls -l /usr/bin/cc
    lrwxrwxrwx 1 root root 3 Nov 30 2020 /usr/bin/cc -> gcc

That is, if you invoke 'cc', that name is found under /usr/bin, but it is just a soft link [like a Microsoft Windows 'shortcut'] to /usr/bin/gcc. Our primary textbook is "Concepts of Programming Languages" NINTH Edition, by Robert W. Sebesta. (There's a twelfth edition out now, but it's almost identical to the ninth edition, so there's no point in paying the huge premium for the 'newest' version. You can use another edition if you happen to already own it -- the chapter titles are all the same, as are [almost] all the section titles. The pairs of page numbers I cite in these notes will be for the ninth and twelfth editions. CAUTION: The suggested problem numbers I list for Sebesta are for the ninth (and twelfth) editions, and they are NOT always the correct numbers for other editions.) Later on, we'll discuss some (cheap!) supplemental texts you should get. For our main text, I recommend getting a used copy of the ninth edition, and then keeping it indefinitely as part of your CS 'reference library'. A good place to do comparison pricing is addall.com; the ISBN for the ninth edition is 0136073476. [There were some used ones for under $10 available in July 2020.] 
You could also 'rent' an electronic version of the twelfth edition (which of course evaporates at the end of the semester), but it will be much more expensive. And of course, almost every book for every class you are taking can be found in pdf form for free on the internet. Still, I recommend spending $10 or so, and getting something you can hold in your hands. CAUTION: My notes include page numbers for the two choices outlined above ONLY. There is also a paperback version of the twelfth edition, with page numbers wildly different from those in the pdf version of the twelfth edition. So the paperback version will NOT match up nicely with the page numbers listed in these notes. The only two options you should consider are the hardcover ninth edition and the pdf twelfth edition (eText). In order to fully understand the material in our textbook, it is important to have a good grasp of some fundamental issues, so we will delve into some preliminaries before we get to Chapter 1. In particular, we need to know a bit about the architecture of modern hardware, the components of a running program (a 'process'), and how a compiler actually functions. The last two of these three areas are covered in detail in the Sebesta text. All but the tiniest modern computers utilize an addressing scheme that employs virtual memory locations. A computer with a 64-bit address bus can distinguish 18,446,744,073,709,551,616 distinct memory locations (from 0 up through 18,446,744,073,709,551,615). Needless to say, the computer you are using does not have anywhere near this amount of physical memory. Still, your compiler will produce code that references very high addresses within this range as well as very low addresses, but typical programs will only use a very small fraction of the available address space. The blocks [called 'pages'] of these 'fake' [virtual] addresses that actually get used are mapped to various blocks of 'real' [physical] addresses [called 'page frames']. 
When it is time to execute a machine instruction which contains one of these impossibly huge addresses [a 'virtual' address], the hardware (aided occasionally by software) will translate the virtual address into an actual physical location [the true 'physical' address]. The programmer [or the compiler, for that matter] doesn't deal with these issues at all; the necessary translations needed to 'shrink' the address space to a realistic size are handled transparently by the hardware and the operating system. [The above summary is a drastic simplification of what actually goes on, but it is sufficient for our purposes. Never fear, you'll be subjected to more of the gory details in the Operating Systems (CS480) class.] On a very simple system (e.g., a PC running MS-DOS), protecting programs from one another is a non-issue, since only one process could run at a time. On edoras, a single user may be running dozens of processes at once, and so it's important that a process not be allowed to overwrite memory locations that have been assigned to another process. Indeed, edoras may have 100 users each running a dozen processes, each of which has to be 'compartmentalized'; it would not be good if one user could erase files belonging to another user, etc. Modern systems accomplish this 'compartmentalization' with hardware enhancements that distinguish between 'user mode' and 'kernel mode' (sometimes called 'supervisor mode'). The idea is to distinguish between 'harmless' [user-mode] instructions (e.g., adding 1 to a register you control) and 'dangerous' [kernel-mode] instructions (writing to a disk drive, killing an arbitrary process, rebooting the system). The programs you write run in user mode, to ensure you can't do things that might be destructive to other users or processes. So, how do we manage to ever do anything that falls into the 'dangerous' category? 
If you wish to (say) write to a file, you issue a system call; when a system call is requested, the system switches from user mode to kernel mode, and then the operating system checks whether you have the authority to do what you have requested (in this case, do you have permission to write on this particular file). On edoras, you'd expect the system call to succeed if you write on a file that you own, and that you would instead get an error message if you tried to write over someone else's file (or if you tried to write too much and exceeded the disk space quota allotted to your account). Page 18 of the xeroxed notes shows a diagram of a process' layout. The machine-code instructions ('Text') are at the bottom ('bottom' meaning at the lowest [virtual] addresses), with the statically-allocated data on top of them (which is further divided into 'Data' and 'BSS' areas, discussed later). Even though the kernel is loaded into the lowest part of physical memory, it has virtual addresses corresponding to the upper range of virtual space. Whenever a process makes a system call, the picture of what memory looks like expands to what I've shown on Page 18 of the class notes. During a system call, the virtual addresses for the process remain unchanged, but now the addresses above the 'top' of the user process refer to data structures in the kernel -- which undoubtedly will be accessed during the system call. [If a process were to try to refer to these high addresses while in user mode, it would result in not just a page fault; your process would be terminated and you'd be told there was a 'segmentation violation'. These addresses are only legal in kernel mode, not in user mode.] In C, the Data segment contains things like global variables and constant strings (e.g., the messages you might put within double quotes in a printf() statement in your source code). 
Local variables, of course, are kept in the relevant frames of the stack, e.g., the stack frame for main() is where the address of your command-line parameter array (argv[]) is stored -- assuming you have declared main(argc,argv) rather than just main(). The simple program ~cs320/Source/arg.c illustrates this, corresponding to the argc_and_argv example in these notes. There is a third parameter to main(), if you want it: check out ~cs320/Source/environ.c Note that what main() is given is this argc integer and an ADDRESS of where the array we are calling "argv" lives. So, where does that argv[] actually live? There is a crt (stands for 'C RunTime') function that is responsible for calling main() -- this is something the C compiler is responsible for creating and initializing, and the code in this 'prefunction' (which runs once the kernel's loader, load_elf_binary(), has placed the program in memory) collects the command-line parameters, attaches the standard file streams for input and output, and, once everything is set up, calls main(). The argv[] *data* that is actually made available to main() lives in the stack frame for this crt function, just 'above' the stack frame for main(). When main() returns [exits], the runtime function flushes all the opened streams, closes those streams, deallocates system resources, and passes the exit code that main() returned to the calling environment. Now we will examine what happens when we compile something. When we invoke the C compiler (cc or gcc), we actually do several things in sequence: cpp, cc, as, and ld (the C PreProcessor, the C Compiler itself, the ASsembler, and the linker/LoaDer). cpp strips out comments and takes the statements like #include and #define and creates a 'pure' .c file that no longer depends on those '#' statements (by placing entire .h files into the source code, and [intelligently] making substitutions for what is #define'd). 
For example, cpp takes the two lines:

    #define STORAGE 255
    printf("STORAGE size is %d\n", STORAGE);

...and turns the printf() line into:

    printf("STORAGE size is %d\n", 255);

You can use 'cc -E' on a .c file to see just what source code cpp will produce. Note that the #define/printf() pair would have resulted in code that the C compiler would have flagged as an error if the preprocessor had not massaged it first, because 'STORAGE' would appear to be an undeclared variable. Syntax warning: Most C statements end with ';', but semicolons are never appropriate in #define statements -- we want STORAGE replaced by "255", not "255;", in the printf() statement. Once the preprocessor has massaged the code, the actual C compiler runs on the result [intermediate files such as the preprocessor output are stored in /tmp, which on many systems is a RAM-disk, which makes the compilation go faster, since we write to memory, not to disk!], and produces object code (the modules ending in .o). You can create these yourself with 'cc -c'. The C compiler 'cheats' a little bit here -- it briefly creates assembly code [temporary files that end in .s], and then calls the assembler (as) to turn them into proper .o files. You can tell cc to stop after it has created the assembler text with 'cc -S'. These .o files are 'not ready for prime time' -- they can't be run, and may not even contain a main() function. They do consist of machine-code instructions from the assembler, but they have lots of unresolved references. For example, if your .c file had a printf() statement, the .o file simply has a place to jump to the correct entry point of printf.o, but it has no clue where the code it is supposed to jump to lives [because we almost certainly don't have a printf() function defined in our .c file]. 
We do have printf() *declared* in our .c file (courtesy of some .h file), so the compiler knows that it's OK for printf() to have an arbitrary number of arguments, and it knows to set aside four bytes of information for the value that printf() returns, and that we should interpret those 32 bits as an integer value... but that's about all it knows. If you have defined several functions within a .c file (let's call it first.c), then any call to any of those functions anywhere within that file can be 'sanity-checked' by the compiler; the compiler has enough information to make sure that the types of the parameters in the call match the types that the function is expecting, and that the type that the function returns is consistent, too. If the functions you've defined here are used in some other .c file (let's call it second.c), then the code in second.c has no clue about how the functions in first.c are defined. So, if you make a call from inside second.c to a function defined in first.c and happen to get the number of parameters wrong, or make them the wrong type, the compiler will just assume you know what you're doing, and set up the function call as you have specified. This is an accident waiting to happen; we avoid the problem by declaring a function prototype for every function referenced [in each .c file]. If we wish to use a predefined math function like sin(), we would add a line in the .c file that calls sin() to specify to the compiler what to expect:

    double sin(double x);

(This is only one line -- it's a declaration, not a definition. The actual definition of what sin() is supposed to do when it gets called was enshrined in a .c file a long, long time ago, compiled into an object module, and then sin.o was dumped into a library archive [see below for more details on libraries]. The declaration just says how many parameters sin() is willing to use (one), what type that parameter is (a double), and what kind of result will be returned (a double).) 
Armed with this information, the compiler can then look at your actual calls to sin(), and warn you if you (for example) are passing an int to sin() rather than a double. With similar declarations, the compiler can likewise do checks for every function (whether it is a system function or a function you have defined in some other .c file you have written) that you are calling. How do you find out what kind of a thing sin() is? By looking in the C man pages for math functions, e.g.,

    edoras% man 3 sin

...which will tell you:

    SIN(3)             Linux Programmer's Manual             SIN(3)

    NAME
           sin, sinf, sinl - sine function

    SYNOPSIS
           #include <math.h>

           double sin(double x);
           ...
           Link with -lm.

    DESCRIPTION
           The sin() function returns the sine of x, where x is
           given in radians.
    ...

The parameter specification "double sin(double x);" is listed right there in the SYNOPSIS. However, you generally would not type the declaration line, but instead simply use the recommendation just above it:

    #include <math.h>

...which contains declarations for ALL the math functions (including sin()). The last line of the synopsis mentions the library directive you will need to get this to link (-lm); we'll talk about what -l does in a little bit. To get a true a.out file, we still need to link a bunch of .o files together. Some of these are probably in .a files or .so files [libc.so is where printf.o lives], and exactly one of the .o files that you are linking must have been created from a .c file which has defined a main() function within it. Some of the options that you can give to cc [or gcc] provide hints as to where to find some of the 'missing' .o files, and all of this highly-tailored information can be placed in a 'makefile', so that you don't have to type those complicated switches over and over again as you develop your program. For example, one such switch might be -lm. Actually, the C compiler ignores this switch; it is intended to be acted upon by the linker/loader (ld). 
ld knows some standard places to look for .o files (so we don't have to give it any hints in order to find printf.o, for example), but if you have unresolved references to little-used utilities such as the trig function sine, we have to tell ld what library we need. With -lm, it tries to find libm.a [.a indicates 'archive'] or libm.so [.so indicates 'shared object'] in one of the 'standard' places (/lib or /usr/lib), and discovers:

    edoras% ls -l /usr/lib/libm.so
    lrwxrwxrwx 1 root root 19 Dec 12 07:41 /usr/lib/libm.so -> ../../lib/libm.so.6

So, after following a complicated chain of soft links:

    lrwxrwxrwx 1 root 19     Dec 12 07:41 /lib/libm.so -> ../../lib/libm.so.6
    lrwxrwxrwx 1 root 12     Dec 12 07:41 /lib/libm.so.6 -> libm-2.17.so
    -rwxr-xr-x 1 root 311736 Dec  6 13:29 /lib/libm-2.17.so

...we arrive at:

    edoras% file /lib/libm-2.17.so
    /lib/libm-2.17.so: ELF 32-bit LSB shared object, Intel 80386, version 1 (GNU/Linux), dynamically linked, BuildID[sha1]=b8ec333f31ca74841dd9709500bc6b526ab15cf8, for GNU/Linux 2.6.32, not stripped

...so the linker digs sin.o out of this shared-object archive. (You can check this out for yourself if you like; the 'nm' command can be used to see the names of everything in an archive:

    nm /lib/libm-2.17.so | grep sin

...will list a whole bunch of sin()-related objects.) ...And 'whole bunch' will likely exceed your wildest dreams:

    edoras ~[742]% nm /lib/libm-2.17.so | grep sin | wc -w
    270

That is, there are 270 functions with 'sin' in their name in this library (and, if you check, over 4000 math functions in just this one library). If the library had been in some non-standard place (not /lib nor /usr/lib), then we would also have had to tell ld what directory to look in, e.g., -L /usr/local/lib, or maybe even -L . if we had created our own archive in the current directory. In that case, we'd be specifying both a -l argument and a -L argument. 
(-L indicates where to look, and -l gives a clue about what name to look for; -lm indicates libm.so, -lAardvark would indicate libAardvark.so, etc.) Once ld has located all the .o files that it needs, what does it do with them? The answer depends on whether we ask it to statically link the files or dynamically link them [the latter is the default -- in fact, many systems refuse to let you link statically any more]. When we statically link a program, the a.out file is huge, because we copy all the needed .o files [from your compiled .c files and from the precompiled archives such as /usr/lib/libc.a] into one big a.out file, and then the linker goes through and resolves all the external references. For example, now that it has created this big pile of .o files, it knows where the entry point address for printf.o lives, and can now fill in the proper address to jump to whenever some other .o file references printf(). This goes on recursively [since printf() most likely calls some other precompiled routines], and, depending on where ld put these other .o modules, further addresses within printf.o will have to be adjusted, and so on. Besides just machine-code instructions, ld has to put the static data (that is, stuff not on the stack) in the a.out file, too. For efficiency, it is separated into two parts: initialized data and uninitialized data.

    printf("STORAGE size is %d\n", 255);

is going to be turned into the machine code equivalent of

    printf(98765, 255);

...assuming that virtual address 98765 is where the 'S' in the character string lives. (The linker doesn't have a clue about whether the number should be 98765 or something else until it has piled all the .o files one on top of the other and then piled all the initialized variables on top of that.) So, all this data means that there are more references that have to be finalized by the linker before we have a proper a.out file. The uninitialized data is another story, and is somewhat easier to deal with. 
This data by definition consists only of zero bits, and it would be wasteful to store a whole bunch of nulls on the disk. Instead, the linker just figures out how much room these variables will ultimately need, and stores that size [representing how many bytes are needed] instead. When we actually exec() the a.out file, we dutifully copy [a page at a time, as needed] the .o code and the initialized data, but we just allocate an appropriate number of pages for the uninitialized data and set it all to zero -- this part is not copied from disk [since we didn't even bother storing a whole bunch of zeroes on disk]. Of course, before the linker finalized the a.out file, it had to plan out what virtual addresses each uninitialized variable was going to use, and then put each such address in the appropriate places in the machine-code instructions. What is different if we do dynamic [rather than static] linking? Shared library objects [like that printf.o module from /lib/libc.so] are NOT dumped into the a.out file. We leave them [sort of] unresolved, making only a reference to the appropriate shared library. Then, when we exec() the a.out file, a little more work is done before we are ready to jump to the entry point for main(). First, the dynamic loader (/lib/ld-2.17.so) takes control, and reads all those hints the linker put into the a.out file about the shared library references. The needed library .o files are then dragged into memory [if they are not already there -- it's highly unlikely that your a.out file is the first one to want to use printf(), so printf.o probably already has been loaded into a physical page frame, and will be efficiently shared by every process that needs printf()]. Much like the static linker did when it was putting together the a.out file, the dynamic linker has to go through all the references to printf() in all the other .o modules, and put in the correct addresses for printf() [now that it knows where it is]. 
Finally, we can jump to the entry point in main(). Note that the dynamic linker really does have to do this all over again every time we exec the a.out file -- the next time, printf() and other library routines might well show up in a different place. Despite this repeated work [as opposed to the static linker, which resolved library references exactly once, after which the a.out file was good to go forever], dynamic linking is a big win. Even with the extra overhead, dynamically-linked files probably load faster, since much of what you need is already in memory (such as the printf.o illustration above). We make far better use of RAM, as well. If all programs were statically linked, every single process would need its own copy of printf.o in memory, which is a huge waste of RAM; with dynamic linking, we only need one [set of] page frames to hold printf.o . Dynamically-linked files are kinder on disk space, as well -- a statically-linked executable could be 10 to 100 times larger than the dynamically-linked version! The size of the data segment is initially the Data size plus the BSS size, but this can be changed if more storage needs to be allocated dynamically. If you ask malloc() [C's memory allocator] to give you more space for some new data structure, the data size is extended by the requested amount, and malloc() will return an address which points to the beginning of this new chunk of memory. The heap (which refers to the jumble of storage items that have been dynamically allocated) is located just above the end of the [zero-initialized] BSS region (where the known-ahead-of-time variables have been placed, in a very orderly, non-jumbled fashion). The third major segment of your process [besides the text and data segments] is the stack segment. This includes your environment variable strings, the stuff for that crt function, and the frame for main(). As one function calls another, the stack grows downward. 
At some point, especially if you have recursive calls in your program, a new frame may extend beyond the bottom boundary of the stack segment, at which point your program is going to have to ask for more memory space to store all this additional information. The different segments are treated differently in memory. The stack segment must of course be writable, but on modern systems it is not executable: an attempt to jump to anywhere in this region and execute an instruction will terminate your process. This is to guard against processes being subverted; the C function gets() accepts an address of where to store a string but does absolutely no range checking. A common hacker trick was to find some system function that [very foolishly] used gets() to store user input, and then supply a very long input string that just happened to contain machine-code instructions -- a string long enough to overwrite the return address in one of your stack frames, so that control 'returned' to those planted instructions and did whatever the hacker wanted. Such a trick does not work if the OS does not allow instructions to be executed from the stack area of [virtual] memory. By contrast, the text segment is executable (of course), but not writable: any attempt to change the machine-code instructions of a running process causes immediate termination. The assignment statement x=7; is translated into a machine-code instruction which involves the constant 7, and if you were to attempt to overwrite this constant at runtime, your process would die. Succinctly: "constants definitely are!" [Note that we're talking about the place where 0000...0111 (binary 7) lives, NOT the place (on the stack) where x lives -- x can be changed, 7 cannot.] This also means that the string in the earlier printf() example cannot be changed, even if you know the address of where it is. 
In the following example, x points to the beginning of a (constant) string of characters, whereas y points to the beginning of a (variable) array of characters. Here, x and y each point to strings of the same size and content, but they point to different locations, and only one of them is located in an area where writes are allowed.

    /* globals x and y are declared outside all functions, hence NOT on the stack */
    char *x = "rat";
    char y[] = "rat";  /* shorthand for: char y[4]; y[0]='r'; y[1]='a'; y[2]='t'; y[3]=0; */

    int main(void)
    {
        *x = 'c';   /* This may terminate your program. */
        *y = 'b';   /* This is allowed. */
    }

On some older systems, compiling with 'gcc -fwritable-strings' might allow *x = 'c'; to be executed without complaint. [This feature is now deprecated or non-existent.] A caution about man pages on edoras: Typing:

    whatis printf

...will show that there are many different utilities called printf, each of which has its own man page. Typing

    man printf

...will bring up the man page for the command-line printf utility from section 1 of the man pages, which has absolutely nothing to do with C programming. Programming commands are in section 3 (on some systems, in subsection 'c', so we'd type 3c instead of just the 3 shown below). To look at the appropriate page, you will need to type:

    man 3 printf

If you type this, you will see that it is a man page that covers four closely-related printing functions, so make sure you're looking at the parts concerning printf(), not fprintf(), sprintf(), or some of the others. In the SYNOPSIS near the top of the man page, you will find the file[s] that contain the function declarations and other relevant definitions. In this man page, it states:

    #include <stdio.h>

...so this #include line should be near the top of every program you write that uses printf(). Page 15 of these lecture notes has a brief example about using man pages. 
Indeed, the first 60 pages illustrate lots of topics related to C [many of which are mentioned at the appropriate points in the Chapter lectures]. There are many sample programs discussed, and these are available online if you want to try to compile and play with them yourself. On Page 15, the prompt is 'Week3', which indicates that the relevant source files can be found under ~masc0000/Week3 . Page 13 of these notes shows the basics of navigating a UNIX system. In particular, it shows how to copy a sample program to your home directory, compile it, and run it. ~cs320/Source contains most of the actual .c files, but the 'Week' directories often contain more of the supporting information. The following is repeated in Chapter 1, to ensure no one misses it: At some point in your career, you will probably want to have a C language reference: C by Discovery by Foster & Foster is a good choice for learning C, because it explains things thoroughly and contains tons of sample programs to illustrate the concepts. edoras:~masc0000/foster.tar.z is a tarball containing all the sample programs [from the second edition, but most of these are in the other editions, too]. If you happen to have access to this text, you will probably want my corresponding notes for it, which are at edoras:~cs320/Foster4th . [The page numbers in those notes are keyed to the Fourth Edition; but since you can get older editions dirt cheap (AbeBooks.com had a second edition for $1.05 plus $2.95 shipping!), it's probably not worth an additional $80 just to have the page numbers match my notes.] The Foster text will show you a whole bunch of perfectly-working programs, which is good. My sample programs are designed to show you lots of things that can go wrong, and what it looks like when they do. [That is, these samples will have a lot in common with your first attempts at writing C code.] We'll supplement "C by Discovery" with my samples, which I suppose should be called "C by Mistake"... 
You can probably find a cheap textbook for the APL language as well, but I think the manual that comes with the free software will suffice [a version of Dyalog APL is available for educational use; there are Windows, Mac, and Linux versions]. You may want to review some of the APL operators and examples discussed in the manual after we go over the sample APL session that is included in these lecture notes [starting on Page 75]. You will be assigned edoras (edoras.sdsu.edu) accounts for use in this course; assignment and exam announcements and spreadsheet entries of your grade scores will be sent to this account (so you may wish to add a .forward file if you don't plan to check it for mail regularly). This account is assigned to you, but owned by me, which means I can sneak into it to examine, test, and collect your programming assignments. The account disappears at the end of the semester. Note: if you are new to the system, edoras has many help files. A good place to start is: http://edoras.sdsu.edu/doc/unixtut/ Lots of free software is available to you; peruse http://edoras.sdsu.edu/software/ You'll need ssh to communicate with edoras; if you are running WinBlows, you can get ssh from: http://edoras.sdsu.edu/files/SSHSecureShellClient-3.2.9.exe [If you're running a real operating system at home (MacOS or Linux, which are both versions of UNIX), then don't bother -- you already have ssh.] Alternatively, a free package with ssh and lots of features can be downloaded at: https://download.mobatek.net/10520180106182002/MobaXterm_Installer_v10.5.zip (Again, this is WinBlows only -- MacOS and Linux already have this stuff.) On-line Source: http://edoras.sdsu.edu/~carroll/cs320home.html (This has general course information; but to see our specific example programs, you'll need to log in to edoras -- these are not accessible from your browser.)