Unix shell scripting: C shell/Awk

Tutorial 2, © Ron Hills

Purpose: Scripting is the practice of combining Unix commands and file operations to automate tasks that would otherwise be tedious, especially when dealing with large amounts of data or managing many files at once. Scripting is most useful for files containing textual data, in contrast to programming, which is more useful for performing numerical computations.


Key Terms: character string vs. integer variable, overflow error; machine code, source code, low/high level languages; shell script; while loop, if-then-else statements; argument vector (passing variables); C shell, awk.


Homework: Discuss the 9 Questions in complete paragraphs. Modify and test the class.csh program; write a 3-paragraph executive summary of your results from this tutorial.


Focus Question: How do programs use logical constructions to interface with a serial processor that only performs simple binary arithmetic?

Review from Tutorial 1:

Though Bash is the default login shell on most systems, we will be learning C shell because it is easier to use. The two shells have completely different syntax, so you will need to enter csh each time you open a terminal window. If you don't see 'csh' when you type ps, try 'tcsh' (Mac users will need to type tcsh instead of csh).

PC: csh change from Bash to C shell each time you open terminal

Mac: tcsh


PC/Mac:

set history=9999

set savehist=(9999 merge)

set filec

set autoexpand

set autolist

ps

echo $$      prints the process ID of your current shell; find that ID in the ps listing to confirm the shell is csh

alias l 'ls -lFo' capital G may be needed to color your folders

mkdir ~/day2

cd ~/day2

echo "Day 1 files created:" > review

ls -lrt ../day1 >> review

echo "Directory created:" > tues

pwd >> tues

echo on `date` >> tues


mv tues today

cp today junk

cat review junk > revandjunk

rm junk revandjunk

Part 1: Numerical Data


How do computers represent numerical data? In our ASCII data files, we store integers and decimals as base-ten character strings such as "10.10908". But this takes up 8 bytes (2^64 possible bit strings) and only gives us a few significant figures. This wasted space is why large data files compress easily with zip compression programs. Computers can store larger numbers, decimals and exponents with greater precision and in less space by using binary (the base-2 numeral system). This also makes programs faster because the CPU performs all calculations in binary code. Until recently most personal computers had a 32-bit CPU architecture, and would represent numbers as 32-bit integers or 32-bit single precision floating point decimals (reals). On today's 64-bit CPUs, 64-bit integers (longs) and 64-bit double precision floats are becoming the default in many programming languages.


The correspondence between decimal numbers (base-10) and their binary encoding can be seen by counting from zero:

decimal | binary (3-bit unsigned integers, 2^3 = 8 values)

0       | 000
1       | 001
2       | 010   (the first bit rolls over, carry the one)
3       | 011
4       | 100   (first 2 bits roll over)
5       | 101
6       | 110
7       | 111
8=0!    | 000   (all the bits roll over like on an odometer)
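The counting pattern in the table above can be reproduced with a short sketch. It is written in Python rather than csh, purely as an illustration; the names BITS, MODULUS and to_binary are made up for this example and are not part of the tutorial's scripts.

```python
# Count upward in a 3-bit unsigned register (illustrative sketch).
BITS = 3                  # width of the register, as in the table above
MODULUS = 2 ** BITS       # 2^3 = 8 distinct values: 0 through 7


def to_binary(value, bits=BITS):
    """Return value modulo 2**bits as a zero-padded binary string."""
    return format(value % (2 ** bits), "0{}b".format(bits))


for n in range(MODULUS + 1):      # 0..8, one step past the largest value
    print(n, to_binary(n))        # the last line shows 8 stored as 000
```

Changing BITS to 8 or 64 shows how quickly the range grows with each added bit.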


The first bit is used to store the sign (+/-) of an integer variable. For a 64-bit integer this leaves 63 bits for the magnitude, giving a range of about 2^63 ≈ 9.22 × 10^18 values on each side of zero.


A fatal error that can occur in any finite precision calculation is overflow error. Overflow happens when trying to set a variable to a numerical value exceeding the range of possible values it can represent. The binary encoding becomes garbled.


Say we try to store the number of atoms in the universe (10^80). Well before a variable can reach 10^80, all of the 64 binary digits (bits) will max out, roll over to 0 and start counting over, just like a car odometer rolling over at 100,000 miles. Overflow in even a single one of a program's many internal variables that go into a calculation will give you a nonsensical result without you suspecting anything. Good programming practice is to put a built-in check in your code. The program can be set up to detect if any input data exceeds the maximum allowable value and terminate itself after printing a warning message to the user.
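A built-in check of the kind described above might look like the following minimal sketch. It is in Python, whose own integers never overflow, so the 64-bit limits are written out by hand; the function name checked_int64 is hypothetical.

```python
INT64_MAX = 2 ** 63 - 1     # largest signed 64-bit integer
INT64_MIN = -(2 ** 63)      # smallest signed 64-bit integer


def checked_int64(value):
    """Return value unchanged if it fits in a signed 64-bit integer, else raise OverflowError."""
    if value < INT64_MIN or value > INT64_MAX:
        raise OverflowError("value {} does not fit in 64 bits".format(value))
    return value


print(checked_int64(12345))          # fits, so it is returned unchanged
try:
    checked_int64(10 ** 80)          # far too large for 64 bits
except OverflowError as err:
    print("warning:", err)           # warn the user instead of computing garbage
```

A real program would typically stop after printing the warning, as the paragraph above suggests.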


Let's try to generate an overflow error in your Unix terminal. C shell, a simple scripting language, can only store integers and not floating point numbers. Integer variables (denoted int) are assigned using @ followed by a space, the variable name and the value. Not too long ago C shell integers were 32-bit, having a range of +/- 2^31 (only about 2 billion).


@ max = 9223372036854775807      in C shell, @ stores a number in an integer variable (in this case max). A space is required after @. Copy/paste this number using command-c/command-v (Mac) or ctrl-c/shift-insert (Cygwin; right-click on the terminal to see options).

echo $max      a long (64-bit) integer can store about 2^63 ≈ 9.22 × 10^18 values on each side of zero.

9223372036854775807

@ i = $max - 1 because max is an integer variable (not a character string), we can do integer arithmetic

echo $i

@ over=9223372036854775808 a larger integer overflows in C shell on a 64-bit CPU:

echo $over

-(      the variable was stored, but the value wrapped around into the sign bit, giving the most negative 64-bit integer (-9223372036854775808), which the shell displays incorrectly as -(


Larger stored numbers become nonsensical:

@ over=9223372036854775809 reassigns the value of the over variable. No error is given!

echo $over

-9223372036854775807
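The wraparound seen above can be imitated outside the shell. The sketch below is in Python and assumes the usual two's complement encoding of signed 64-bit integers; wrap_int64 is a made-up helper name, not a csh or tcsh feature.

```python
def wrap_int64(value):
    """Reduce an arbitrary Python integer to the value a signed 64-bit register would hold."""
    value &= (1 << 64) - 1            # keep only the lowest 64 bits
    if value >= 1 << 63:              # sign bit set, so the value reads as negative
        value -= 1 << 64
    return value


print(wrap_int64(9223372036854775807))   # 9223372036854775807 (still fits)
print(wrap_int64(9223372036854775809))   # -9223372036854775807
```

The second result matches what echo $over printed: no error, just a silently wrong number.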


Let's find the lower limit..

@ min=-$max

echo $min

-9223372036854775807


@ under=-9223372036854775808

echo $under

-(


Record the lower and upper limits for your OS:

echo min $min > limit.out

echo max $max >> limit.out

cat limit.out


QUESTION 0: How large a number results in overflow error on your computer? What are the maximum positive and negative integers your shell accepts? Do the possible values including zero add up to exactly 2^64 = 2 × 9.22 × 10^18?

Why do computers have fixed precision numbers? Their architecture can only perform arithmetic in binary, and the central processing unit can only operate on a fixed "bandwidth" of bits at a time. CPUs are either 32-bit or 64-bit precision, usually the latter even for current iPhones. Consider adding two binary bits: 0 plus 1 yields a single sum bit, 1. But to add 1 and 1 you need two bits to store the result: 1 0. When you write a program to perform operations, you need to specify whether a given variable is storing an integer number, a floating point decimal number, or a text string (alphanumeric ASCII characters), each of which has different range limitations and possible operations.
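The single-bit addition described above corresponds to what hardware designers call a half adder. Here is a minimal sketch in Python using its bitwise operators; the function name half_adder is chosen for this illustration only.

```python
def half_adder(a, b):
    """Add two single bits; return (carry_bit, sum_bit)."""
    sum_bit = a ^ b      # XOR: 1 when exactly one input is 1
    carry_bit = a & b    # AND: 1 only when both inputs are 1
    return carry_bit, sum_bit


for a in (0, 1):
    for b in (0, 1):
        carry, total = half_adder(a, b)
        print(a, "+", b, "=", carry, total)
```

The carry bit is the second bit needed to hold 1 + 1. A register with no room for that carry is exactly where overflow comes from.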


A notorious example of integer overflow is in the arcade game Pac-Man, whose level counter was stored as a single 8-bit byte. Reaching the 2^8 = 256th level causes a "kill" screen that terminates play.

Part 2: Shell Scripts


#!/bin/csh -f

# This is a Unix shell script. Each line is 'executed', or performed by the Unix

# 'shell' one after another just as if you had typed them at the terminal prompt.

# Comment lines are preceded by a '#' sign, telling the shell not to interpret

# those characters. The words on the page do not wrap to the next line

# like in a word processor, since commands are separated and interpreted

# line by line. It is common practice to keep each command line under 80

# characters in length for ease of reading.


# Despite the '#' in the first line at the top of this shell script #!/bin/csh is

# not a comment. This line is required to specify which shell, or programming

# language we are using. The exclamation after the pound sign on line 1

# tells the operating system which shell to use. The full file path is used. Inside

# the root (/) directory there is usually a system folder named 'bin' containing

# the program called 'csh', which stands for C shell. It is the simplest of

# the Unix shells, which we'll focus on for this course. Another common

# shell in Linux machines is Bash, typically also packaged in /bin or /usr/bin.


# There are some differences in the 'syntax' between Csh and Bash, which refers

# to the special characters that go into creating the structure of the

# program. You may get some errors if you try to run a csh script under bash.


# You can easily switch between shells by simply typing the program name

# at the prompt. Typing 'bash' runs the program /bin/bash; this is because

# your environment settings know to look in the system directory /bin.

# We say that /bin is in your path. To see your default path type:

echo $PATH

# If you have a program of the same name in two directories in your path

# (for example, if you installed two different versions),

# the first one listed gets executed. We can say that the path variable

# is an ordered array. Sometimes the current directory (.) is searched

# first in your path.

# Depending on your shell, your default path can be modified in a settings

# file named ~/.cshrc or ~/.bashrc, which is run each time you log in.

# Be careful if you ever modify this file, because errors could

# prevent you from logging back in!

# You may also have a .vimrc file for storing Vim settings.

# Some useful settings in .vimrc are (in cygwin you will need to create a file named .vimrc):

# set nocompatible

# set nowrap

# syntax enable

# set ruler


cat ~/.vimrc # show your settings


# Without .vimrc, each time you open Vim you'd have to enter the commands:

# :syntax on <Enter>       shows commands in colors

# :set ruler <Enter>       shows line/character number of cursor

# :set showmode <Enter>    displays insert/command mode


# Today's entire script could be run with one program call if you were to put all the commands into a single text file:

# ./tutorial2

# The single dot reminds the shell to look in the current directory.

# To be able to run, or execute, a program, the user (u) first needs to be given execute (x) permissions:

# chmod u+x tutorial2

# ls -l

# -rwxr--r-- 1 rhills 438738691 18363 Jan 12 2016 tutorial2


# As we previewed in Tutorial 1, C shell is the simplest of programming languages: it can perform integer

# arithmetic (addition, subtraction, multiplication, division) but not floating point arithmetic.

Let's illustrate with a simple program (you will need to be in csh or tcsh):

@ i = 0                 # a space is required after @
touch count             # create empty file for appending output
while ($i < 10)         # control flow/decision statement
   echo $i >> count     # body of iterated statements
   @ i = $i + 1         # enter at least 2 spaces at while? prompt
   echo $i              # if you make a mistake, cancel with ctrl-c
end

The while statement is a loop that controls the flow of the program. Each statement in the while loop (here indented by three spaces) is performed in succession and repeated so long as the conditional statement in parentheses is TRUE at the start of each repeat (that is, until i equals 10 or more). The integer variable i is our 'counter'; we increment it by 1 during each iteration of the loop. In C shell, '@ ' sets an integer variable and $var returns the value stored in that variable.
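For comparison, the same control flow can be sketched in Python. This is an illustration only: two lists stand in for the count file and the screen, and the names count_to, recorded and echoed are invented for the example.

```python
def count_to(limit):
    """Mirror the csh loop: record i before incrementing, echo it after."""
    recorded = []     # stands in for the lines appended to the 'count' file
    echoed = []       # stands in for the numbers echoed to the screen
    i = 0
    while i < limit:          # condition is re-tested before every pass
        recorded.append(i)    # like: echo $i >> count
        i = i + 1             # like: @ i = $i + 1
        echoed.append(i)      # like: echo $i
    return recorded, echoed, i


recorded, echoed, final_i = count_to(10)
print("file:  ", recorded)
print("screen:", echoed)
print("i after the loop:", final_i)
```

Running it with a smaller limit, such as count_to(3), makes it easy to trace each pass of the loop by hand.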

Q1: Why were the numbers printed to your screen different from those in the count file? Why couldn't we use echo $i > count inside the while loop? Provide a real-life example of where a while loop would be useful in programming.


Programmers often use shorthand when incrementing a variable:

echo $i #note the variable i is still stored outside the while loop: it is a global variable

@ i+=1 # shortcut for incrementing a variable by a number

We generally avoid global variables unless they are absolutely needed (e.g. to supply an input parameter), because any code that references i will fail if i was never initialized.

echo $i

@ i++ # shortcut for incrementing by 1, specifically

echo $i

@ i*=4 # shortcut for multiplying variable by a number

echo $i > finalcount

Not only is the shorthand simpler to write, in some languages it can also be faster in execution. The reason has to do with the fact that the computer processor performs arithmetic in binary. Now, the plain text program you wrote is not in binary; we refer to it as the source code of the program. To run the program, the computer must turn that list of instructions into binary machine code operations: a compiler translates the whole program ahead of time (as for C), while an interpreter such as the shell reads and carries out the source one line at a time. In the early days, computers were simple enough to be programmed at or near this binary "lowest level". The precise binary code that results will vary depending on your OS software and hardware vendor. Nowadays, programs are almost always written in a high level language, a standard of syntax that can be transferred from one computer architecture, or platform, to another. The binary executable program files, however, cannot usually be transferred easily between architectures.
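The gap between source code and the lower-level instructions a machine carries out can be glimpsed with Python's standard dis module. This is only an analogy: dis shows bytecode for the Python virtual machine, not the CPU's binary machine code, and the exact instruction names vary between Python versions. The function name increment is made up for the example.

```python
import dis


def increment(i):
    """One line of source code: add 1 to i and return it."""
    i += 1
    return i


# Each printed line is one low-level instruction of the Python virtual machine.
dis.dis(increment)

# Collect just the instruction names for inspection.
instruction_names = [ins.opname for ins in dis.get_instructions(increment)]
print(len(instruction_names), "instructions for two lines of source")
```

Even this two-line function expands into several lower-level steps (load, add, store, return), which is the kind of translation a compiler performs on a much larger scale.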


Think of the trade secret chemical formula for Coca-Cola. To date no one has been able to duplicate it. A binary program is similar: given only the binary data, you cannot easily reverse engineer the original source code, which you would need in order to modify or adapt the working program or even compile it for a new architecture.


Releasing the source code is what separates free Unix software from proprietary commercial software (e.g. Microsoft). There are thousands of programs whose source code is released under the GNU General Public License, a free software license. The GNU license is copyleft, meaning you can derive work from it (modify/adapt to your needs), so long as you redistribute the source code under the same (GNU) license. You can even charge a fee for the service (as with Red Hat Linux distributions). GNU is a paradigm shift in licensing and encourages software collaboration and improvement.


For a list of common free Unix applications see:

http://cygwin.com/packages/

http://www.macports.org/ports.php (requires Xcode)


One of the powerful applications you'll find available is gcc, the GNU Compiler Collection. GCC contains compilers for programming languages such as C, C++, and Fortran, enabling you to use your own computer to develop and run Unix programs. Now, C is a much lower level language than C shell. Low level languages are more efficient for crunching numbers because you must control exactly how the data is stored and processed. This also means that low level programs are harder to write and less flexible, especially for processing textual data. One key difference is that in low level languages you must first declare a variable's data type before using the variable:

int i;       // declare an integer variable in C

i = 0;       // every statement must end with a semicolon

i = i + 1;   // C has single-line comments like this one, or

/* comments spanning

   multiple lines! */

Higher level languages are the simplest to write, are more forgiving in syntax, and are generally preferred for processing text or complex/varying data. Python is a higher level language than C, but lower than Unix shells such as csh, which are extremely limited in their capabilities. HTML for website coding, on the other hand, is a markup language rather than a programming language: it uses tags to define elements within a document.


Q2: What is the difference between source code and machine code? What makes a programming language lower or higher level? Give examples of low and high level languages commonly used today and order them from low to high level.

Proceed to: Part 3