Programming with Python and C

Tutorial 3, © Ron Hills

Purpose: This module introduces more complete programming than the scripting constructs we previously developed in C shell. One important aspect of coding is modifying an existing program, e.g. to expand what it can be used for. It is also necessary to conduct thorough testing to determine the range of inputs a program will work on, which should be discussed in the software documentation.

Key Terms: floating point precision, round-off error; compiler / interpeter; memory allocation, arrays; for loop; subprogram / library; IP address; Python, C++; histogram, probability function; statistical thermodynamics.

Homework: Explain Discussion Questions 1-10 in complete paragraphs. Modify sts.py to compute the variance and sum of a data set. Summarize the modifications you made and the results you obtain from this tutorial.

Focus Question: Explain the trade-offs associated with programming in languages of different levels by comparing the logical flow and syntax of two programs: sts.py and mkhist.cpp.

Part 1: Numerical Precision, Storing Variables in Python

We will be using Python, which is a much more powerful language than C shell and allows us to perform fast numerical computation. Later, we will briefly introduce the C standard programming language, which is more difficult to write (lower level than python), but can be very numerically efficient. C used to be preferred for computationally intensive tasks, though python 3 is now highly efficient and easier to write.

First, let's fire up the Python command line from a Unix prompt (%, $, etc.):

python

Note: cygwin users can use interactive python: ipython. iPython has: color syntax highlighting, help(command) pages, and will auto-list and auto-complete commands with the tab key.

You will see a new prompt >>> to let you know you're inside the python interpreter and have left the Unix shell. The header output should show your exact version of the older Python 2 or current Python 3. There are subtle syntax differences between the two languages -- we will strive to use code that works for both.

Note: if you ultimately can't python terminal to load, you may try Google Colab. It contains an interactive format known as Jupyter Notebook, which has become a standard for collaborative coding in industry. Colab itself runs python3, but it is difficult to read in and write files through its web browser.

>>> a = 3 integer assignment

>>> a return variable

>>> b=3.0 floating point assignment. Note how we don't have to declare the variable type in python. This is called dynamic typing.

>>> b

3.0

>>> c="te xt here" character string

>>> c

'te xt here'

>>> c=c+"abc" string concatenation

>>> c

'te xt hereabc'

Before looking at a python program script, let's explore using the python prompt as a calculator:

>>> 2*3.14

6.2800000000000002

On older computers you would see all the decimals places shown above, but nowadays python tries to be smart and just show you 6.28. In the computer's memory, however, the number actually stored really corresponds to 6.2800000000000002, because computers can not be infinite precision.

Why are there 16 decimals? This shows the precision used for the binary encoded (64-bit) floating point number. Notice that there is some round-off error. In practical calculations we usually only need to worry about ~3-5 significant digits, so these roundoff errors usually average out close enough to zero and don't hurt us, even when performing a high number of repetitive calculations to obtain a result.

Because python is a high level language, it will automatically interpret whether we are dealing with floats or integers:

>>> 2*3

This calculations is exact because we have multiplied two integers. The 2*3.14 calculation was inexact because in multiplying by a float python had to convert the 2 integer to another float.

We should be able to demonstrate this on your system using float(), the function for converting a string to a 64-bit floating point:

>>> float("0.9999999999999999")

0.9999999999999999

>>> float("0.99999999999999999")

1.0

Decimal numbers are stored with a limited precision. You should find that your 64-bit machine encodes floats in double precision. If we compare this to older 32-bit programs, you see that most of the bits are reserved to represent the decimal places (the significand), while a few are reserved for the scientific exponent:

precision bits | decimal digits | max exponent

single 32 7 38

double 64 15 308

Computers actually store floating points using powers of two. Ex: 101.001 (binary) = 1*2²+0*2¹+1*2⁰+0*2^-1+0*2^-2+1*2^-3 = 5.125 (base-10 floating point).

Let's test the exponent limits of the float data type in Python..

>>> 1.7e308 e or E denotes scientific notation (10 raised to some power)

1.7e+308

>>> 1.8e308

inf

Every time a computer performs a floating point calculation, the results are truncated around the 15th decimal place, resulting in roundoff error. Roundoff errors are small and are as likely to be negative as positive, so they partially cancel but can grow for long calculations, or "simulations".

Lastly, dividing binary integer numbers is a truncation issue in computing:

>>> 7/3

2 # you will see 2.3333.. in python 3

In computing, division with integers lops off the remainder. This is called floor division:

>>> 7 // 3

2 # floor division operator // returns largest possible integer in python 2 or 3

To compute floating point division in python 2, we need to type the decimal to specify a float rather than integer (other programming languages require a separate declaration statement):

>>> 7/3.0

2.3333333333333335

Again, note the round-off error in the last digit inherent in the underlying binary arithmetic that is being performed on a CPU with finite precision. CPU architectures moved from 32 to 64 bits relatively recently, and some software vendors still use 32 bits in their code, essentially wasting computational power.

For more complex calculations we can use external functions that are packaged with the program called libraries. In python a library is called a module, which we need to import at runtime:

>>> import math

We call functions, know as subroutines of a program, in the math library using:

>>> math.log(2.718)

0.99989631572895199

>>> math.log10(10)

1.0

>>> exit() # exit the python interpreter and return to your Unix shell. Alternately, just hit control-d at an empty >>> prompt. You should see the prompt change back to your original Unix shell: $, %, etc.

In a C program, instead of import you'd use:

#include <math.h>

Another standard library for file input/output is:

#include <stdio.h>

import sys

Next, we'll look at a relatively small but useful python program. Download: Files: sts. STS computes the average and standard deviation of a column in a data file.

tcsh

PC/Mac:

set history=9999

set savehist=(9999 merge)

set filec

set autoexpand

set autolist

echo $$

alias l 'ls -lFo'

mkdir ~/day3

PC: cd /cygdrive/c/Users/username/Downloads

Mac: cd ~/Downloads

PC/Mac: ls -lrt confirm file was downloaded

cp sts ~/day3 in Safari you may need to type sts.dms

cd ~/day3

ls -lrt confirm the file was copied

cp ~/day2/pdb.txt ~/day3 copy pdb.txt into day3 folder

ls -lrt

Check that the 6 kB sts program was downloaded and correctly moved to your day3 directory.

Just like our C shell scripts, this python program can be run at the Unix prompt and does not need to be previously compiled (technically it compiles at runtime):

chmod u+x sts gives user execute permissions

./sts

Because you have not supplied any arguments on the command line passing data to it, the program will terminate. It gives you a message explaining that it requires two arguments to be specified: a file name and column number...

head pdb.txt prints top of file

./sts pdb.txt 7

# Mean Std dev Min Max n # file=pdb.txt, column 7

10.11447 1.268593 8.299 12.363 15

This output tells us that there are 15 values in column 7 ranging from 8.3 to 12.4 with a mean of 10.1 and standard deviation 1.2.

Q1: a) What error occurs if you supply the third column as input from pdb.txt? Explain why this error occurs. What is the difference between string and float? b) Is the error coming from the Unix shell, a syntax error in your sts code, or the way python is trying to execute on the input?. To really answer this question, let's start dissecting the code as explained below...

./sts pdb.txt 3

We'll dissect this program by line number. (Vi will show (shown at bottom-right of VI) to see how it operates..

vim sts

:set ruler make sure Vi shows the line and column number in bottom-right

At the same time, open another terminal by clicking on the File menu > New (click on the Cygwin icon in the upper left corner of your terminal window) so that we can test command syntax at the prompt while viewing the source code in the original Vi terminal.

Line 1: #!/usr/bin/python Remember the first line after #! gives the full pathname for your shell. Find your python installation by typing which python at the Unix prompt.

Line 34: checks whether the user specified two arguments on the command line (stored to sys.arg variable). In Vi command mode, hit 34G (capital G) in quick order to make the cursor jump to that line.

Line 42: opens the input file named in sys.argv[1] for reading data

Line 47: stores the user-specified column number to integer variable colSel

In programming, recall that an array is a data structure storing a collection of elements of the same data type. In python we can store a pseudo array as the versatile List data type. A List in python is an ordered collection of any type of object (integer/string/float/other list). Experiment with the following at the python prompt (>>>) in your new terminal:

python

>>> L = [2, 4.0, 'six'] the elements in a list can be of different type

>>> L

>>> L[0] return value of the first list element from the left (0th index), 2nd element (L[1]), etc.

The indices for list elements start with 0, 1, 2,.. etc. Calling an element which does not exist results in an error that will terminate your program abnormally:

>>> L[3]

Traceback (most recent call last):

File "<stdin>", line 1, in <module>

IndexError: list index out of range

In other programming languages, these indices are called pointers, which can be thought of as the address locating where a variable is stored in memory. In older programming languages, arrays were static, meaning their length and size had to be predetermined (fixed). In python, lists are mutable, you can grow and remove elements or even change them:

>>> L.append(9)

>>> L

>>> L[1] = 3.0 index assignment (changing in place)

>>> L

There are many other useful built-in functions, or methods, for the List object type that make python a versatile and user-friendly language. For a smaller static collection of items, it can be handy to use the Tuple in python, which is unchangeable (immutable):

>>> T = (3, 4, 5)

>>> T[0:1] lists and tuples support slicing: accessing part of a sequence.

>>> L[1:2] = [3,6] Lists support index/slice assignment but not tuples

>>> L

Tuple assignment (used on line 133) allows you to put multiple variable assignments together on one line:

>>> avg, mean = 'average', 3.4

>>> avg

>>> mean

>>> x = (2, 4) pack the values on the right into a new tuple x

>>> (b, c) = x unpack the tuple on the right into new variables on the left

>>> b

While tuples can not grow, you can always reassign a tuple to something new. A tuple can also reference a mutable object (e.g. list), which can change.

Line 47: Note the use of the function int(). The two command line arguments specified are actually stored as text strings in sys.argv: "pdb.txt" and "7". int() converts "7" to an integer which we can do math with. Also useful is float(), which was needed in Lines 75, 110 and 118.

Lines 42-43 File Read: The readlines() function dumps the entire text file into a list. Each entry in the list is one long character string containing all the text on a given line.

Lines 55-58 For Loop: This block parses through the text data from our input file one line at a time. The function .split() is a string method that splits a string containing whitespace or another separator into a list of the separate items. We use .split() first to remove lines that are commented out with hashtag and then to create a list of items for each line of data. We end up with a list of lists, which can be thought of as a 2-D data matrix. In scientific computing such data structures are known as an array.

Let's make a simple data matrix in python...

>>> L = [[1, 2],[3,4]]

>>> L[0] references first row of data

[1, 2]

>>> L[0][1] references 2nd element of first row

Here are the rows and columns of our matrix in visual form:

(1 2

3 4)

So our for loop has read the input file and essentially stored the data as a 2D array of character/field strings.

Lines 73-75: Now, we need to do math on a single column of pdb.txt which is numerical data, so we extract the column from the data matrix and convert it to a 1-D list of floating point variables (numList). We iterate over each line in the file using:

for item in listDMAT:

body statement1

statement 2..

In list iteration, the variable item takes on the value of each item in the sequence of elements inside our iterable object (list). The block of indented coding statements is executed once for each item. Iterating over all items in the sequence is called traversal.

The for loop is python's foreach routine. Notice that in STS the statement blocks in the for or if-then constructs are each indented 3 spaces. This is good programming practice as it reveals the structure of the program and shows nested indentation levels. In the Python language consistent spacing is actually required because parenthesis are not used to define the coding block. Loops are not ended with an 'end' statement or enclosed in brackets like in the C language. Python interprets the end of loops and if-then(-else) structured programming based on their being fewer indented spaces on the next line. You don't have to use 4 spaces (it can be 1, 2, 3 etc.), but you must be consistent. Also, it is recommended to use character spaces rather than the 'tab' key.

Q2a: What do the range() and len() functions do? Give examples of when they might be useful.

>>> list(range(6) ) sequence stops before specified number

>>> L1 = range(10,0,-2) the start and step arguments are optional

>>> L1

>>> len(L1)

>>> for i in range(len(L1)): # this is the literal code for a python for loop

... print("index is",i, end = " ") # print each on same line

... print("value is",L1[i]) # be sure to indent 4 spaces

... <return> # enter on empty line to execute, ctrl-c if you make a mistake.

Note: it is generally recommended to just use the simpler for loop construction:

>>> for x in L1:

... print(x)

Q2b: Write the general python code for repeating a command n times.

Lines 84-120 User-Defined Function: We could just use our floating point data to calculate the average and print it to the screen. The guts of a program are often buried within another construct, however. In this example, the math is actually done within a construct defined as its own function, or subroutine. This is useful because the defined function can be called repeatedly or used in another program, because the function does not depend on the other lines of code in the program. Functions and module files are a way of logically organizing code. The program calls our statfun function after it is defined in sts, on line 133. Note the data variables (results) returned by the function are assigned at the bottom of the function definition in the return statement on line 120. Reaching a return statement ends execution of the function call. All other variables used within the function are local variables (not accessible outside the function).

def statfun(L): define function statfun, operating on input L

statements operating on input (list L)..

return variable1, variable2 (avg, std..)

The function statfun uses a common technique known as the running sum. The logic of the program is summarized briefly below as pseudocode (not actual code):

sum = 0.0 #variable initialization

loop_begin:

read value in record

sum = sum + value #iterate over each record

loop counter

goto loop_begin

loop_end

avg = sum/n #divide by number of records (remember this has to be a float not an integer)

Lines 139-140 Print: Finally, the program prints the desired values to standard out. You can specify the number of decimals printed using a standard notation from C known as printf formatting.

Lines 85-103 docstring: Python functions can begin with a block (multi-line) comment given in triple quotes. The "docstring" describes describe what a function does, how it works, number of arguments it takes, their type, and the object returned by the function. Contained in the docstring may be doctests, which let you test your code by running and verifying examples with sample input (following >>>) and expected output on the next line.

Q3: Enable the doctest by uncommenting lines 125-126 and running your code on a file like pdb.txt. a) How many of the four tests pass? b) Explain what the results mean in terms of the limitations of the input data the program will work on.

Modify the sts.py program

Finally, let's create your own copy of the sts program that you will modify:

cp sts my.sts

vim my.sts modify my.sts per Question 4..

Q4: a) Modify STS to instead print out the sum of a data column and its sample variance (standard deviation squared). Compute the sum and variance for column 2 of rmsd.dat and explain the lines of code you changed. b) Check your answer: multiplying the mean by the number of data points to obtain the sum and square the standard deviation to obtain the variance. To what decimal place do the two results agree?

This concludes are introduction to python. Next, we will turn to a program written in C, which we will use to solve our problem from Tutorial 2 regarding histogram generation.

Programming with Python and C

Part 1: Numerical Precision, Storing Variables in Python

Modify the sts.py program

Proceed to: Part 2