Tutorial 3) Part 2: Histograms and Probability Functions

C is a low-level language, so we will first need to compile our C source code using the GCC compiler. Compiling creates a binary executable file that we can run on our specific machine/computer architecture.


Instructions for Mac:


Since your Mac doesn't ship with the C compiler (gcc requires Xcode), download the binary executable I compiled on my Mac OS 10.13: Files: mkhist.

Mac: mv ~/Downloads/mkhist ~/day3 move mkhist from Downloads to your working directory (~/day3)

Safari: mv ~/Downloads/mkhist.dms ~/day3/mkhist (Safari adds .dms to the file name.)

chmod u+x mkhist add execute permission to the program for the user (owner)

./mkhist if the executable works on your OS, you'll get my "usage: " syntax warning message.


Computing histograms and probabilities (C++) PC and Mac


Download and move the source code file: Files: mkhist.cpp

cd ~/day3

PC: mv /cygdrive/c/Users/username/Downloads/mkhist.cpp ~/day3

Mac: mv ~/Downloads/mkhist.cpp ~/day3


ls -lrt

PC: g++ mkhist.cpp compile using GCC

ls -lrt

By default GCC should create a binary executable file named a.exe (PC) or a.out (Mac). Rename it:

PC: mv a.exe mkhist

Mac: mv a.out mkhist

./mkhist run the program and it should tell you the proper syntax


Sometimes programmers will compile using a flag:

PC: g++ -O2 mkhist.cpp that's a capital 'O', not a zero

ls -lrt You can tell the executables are different by their file sizes (column 5).

The capital 'O' stands for optimization level. A value of 2 or 3 is the highest recommended--it will result in faster code (usually 5% or so faster), but when the program crashes you are less likely to get a meaningful error message.


C is a bit dated, so we're using a variant known as C++ ("C plus plus"). The main advantage of C++ here is a feature known as dynamic memory allocation. In the early days of programming, memory was allocated statically: an array had to be declared in the program with a fixed size, even though you may not be able to predict the size of the data. Every time you ran the program, the same amount of memory would be used, no matter how small the data file, and if a data set was larger than the declared array, the program would crash. With C++, or languages like Python, you can grow an array as needed at runtime to accommodate big data sets.
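As a rough sketch of the difference (the fixed array size and the file name here are just placeholders), compare a statically sized C array with a std::vector that grows at runtime:

#include <cstdio>
#include <vector>

int main() {
    // Static allocation: the array size is fixed when the program is written.
    double fixed[1000];               // too small? the program breaks; too big? memory is wasted
    (void)fixed;                      // not used further; shown only for contrast

    // Dynamic allocation: the vector grows as values are read at runtime.
    std::vector<double> dvals;
    double x;
    FILE *fp = fopen("integers", "r");   // any one-column data file
    if (fp == NULL) return 1;
    while (fscanf(fp, "%lf", &x) == 1)
        dvals.push_back(x);              // memory is requested as needed
    fclose(fp);
    printf("%zu values read\n", dvals.size());
    return 0;
}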


With dynamic memory allocation, the size of an array is limited only by the amount of RAM in your computer, which nowadays is several gigabytes (GB). The Unix top command or your OS's Activity Monitor will tell you how much memory and CPU each process is using.


Memory usage is not an issue with simple programs such as those discussed in this course. In a running sum calculation, for example, we read one line at a time and add it to a single variable: sum. What was in the previous records is no longer needed--in other words, we don't need to store the entire data file in an array all at once, because each step of the calculation only needs the current value and the running total. In the field of big data, a running sum would be limited only by how fast the computer can read the data from your hard drive.
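For instance, a running sum over a one-column file keeps only the current value and the total in memory (a minimal sketch; the file name is a placeholder):

#include <cstdio>

int main() {
    FILE *fp = fopen("data.txt", "r");   // placeholder name: any one-column file
    if (fp == NULL) return 1;
    double x, sum = 0.0;
    long n = 0;
    while (fscanf(fp, "%lf", &x) == 1) { // read one value per iteration
        sum += x;                        // only the running total is kept
        n++;                             // the value itself is then discarded
    }
    fclose(fp);
    printf("sum of %ld values = %g\n", n, sum);
    return 0;
}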

mkhist.cpp potentially uses more memory because it must store one array for your data values (double *dvals on line 40) and another for the histogram (int *hist on line 50). A histogram is a count, or probability distribution, that groups numerical values into ranges. The count, or bar height, is how many data points fall within a given range; each interval is known as a bin, or bucket. While a bar graph plots categorical variables, histograms are used for continuous data. The first step in creating a histogram is determining the intervals and spacing for your range of values: bins must be adjacent and are usually of equal size.
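The bookkeeping for equal-size bins usually looks something like the sketch below (toy data, and not necessarily the exact code in mkhist.cpp):

#include <cstdio>

int main() {
    // Toy data; mkhist.cpp reads the values, min, max, and bin count from your file and arguments.
    double data[4] = {1.0, 2.5, 2.6, 4.0};
    double min = 1.0, max = 4.0;
    int nbins = 3;
    int hist[3] = {0, 0, 0};

    double width = (max - min) / nbins;            // equal-size bins spanning [min, max]
    for (int i = 0; i < 4; i++) {
        int bin = (int)((data[i] - min) / width);  // which bin does this value fall in?
        if (bin == nbins) bin = nbins - 1;         // count a value equal to max in the last bin
        hist[bin]++;
    }
    for (int b = 0; b < nbins; b++)
        printf("%g %d\n", min + (b + 0.5) * width, hist[b]);  // bin midpoint and count
    return 0;
}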

Because C does not have a built-in min/max function like Python, we must write an algorithm that first scans the data file for the minimum and maximum. Look at line 26 of the source code in mkhist.cpp and see if you can identify the lines of code that correspond with the pseudocode below (a C version is sketched after the pseudocode)...

vim mkhist.cpp

26G


// Notice in C that statements must end in semicolon: ;

// Comments begin with two slashes: //

loop_begin (while end-of-file is not reached): (pseudocode)

read value on next line of file (fscanf)

if value < current min,

then update min to current value // set variable min to new minimum found

if not EOF (end of file), // we need a check to escape the infinite loop (EOF)

then goto loop_begin

loop_end
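In C, that loop might look roughly like the sketch below (using fscanf on a one-column file; the actual code near line 26 of mkhist.cpp may differ in detail):

#include <cstdio>
#include <cfloat>

int main() {
    FILE *fp = fopen("integers", "r");    // one value per line (file name is a placeholder)
    if (fp == NULL) return 1;
    double x, min = DBL_MAX, max = -DBL_MAX;
    while (fscanf(fp, "%lf", &x) == 1) {  // fscanf stops returning 1 at end of file
        if (x < min) min = x;             // update min when a smaller value is found
        if (x > max) max = x;             // update max when a larger value is found
    }
    fclose(fp);
    printf("Minimum value is: %g\n", min);
    printf("Maximum value is: %g\n", max);
    return 0;
}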


Close the source code file (:q) and let's try the program on some data in Unix:

echo 1 > integers

echo 2 >> integers

echo 2 >> integers

echo 4 >> integers


./mkhist

Mac: you may get an OS message that this is from an unidentified developer. Open System Preferences, Security & Privacy, General and click the button below to 'Allow' the mkhist app.


PC: if you get an error running mkhist, check that gcc and g++ are the same version:

gcc --version

g++ --version

If the versions are different, you will need to run the Cygwin setup.exe file again. Don't re-install the other programs (leave the python/tcsh/etc. entries set to Keep), then change the package view from Full/Pending to Category and expand All > Devel. Change the version for gcc-core and gcc-g++ so they match (the older version), and install them (close your terminal first so setup can uninstall the wrong version). Version 9.3 is recommended, but 10.2 may work on newer computers. In a new terminal, now try compiling again:

g++ mkhist.cpp

Executing ./mkhist with no arguments should give you a usage message telling you what your input should be. mkhist expects a file name followed by the number of bins to divide your data range into. The bins are equally spaced intervals, or buckets, in which your data values will be counted to make the histogram. If you have discrete integer data, you may wish to simply count the number of times each integer occurs between the min and max. Supplying -1 (minus one) as the second argument makes the program do this using a separate subroutine, make_histd (line 80), which simply counts the frequency of each whole number observed between the min and max.
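The idea behind make_histd is roughly the following (a sketch only; the real subroutine on line 80 may be organized differently):

#include <cstdio>
#include <vector>

int main() {
    FILE *fp = fopen("integers", "r");          // one integer per line
    if (fp == NULL) return 1;
    std::vector<int> vals;
    int v, min = 0, max = 0;
    while (fscanf(fp, "%d", &v) == 1) {
        if (vals.empty() || v < min) min = v;   // track the smallest value seen
        if (vals.empty() || v > max) max = v;   // track the largest value seen
        vals.push_back(v);
    }
    fclose(fp);
    if (vals.empty()) return 1;
    std::vector<int> count(max - min + 1, 0);   // one counter per whole number in [min, max]
    for (int i = 0; i < (int)vals.size(); i++)
        count[vals[i] - min]++;                 // tally how often each integer occurs
    for (int k = min; k <= max; k++)
        printf("%d %d\n", k, count[k - min]);   // integer value and its frequency
    return 0;
}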

./mkhist integers -1


The file integers.hist has been created telling us how many of each integer we have:

cat integers.hist


Q5: Create a few different text files of sample data. Explain the histogram that mkhist creates based on the input. Compare what happens if you have closely spaced numbers versus numbers that are different in magnitude. E.g.:

echo 30 > initial

echo 52 >> initial

echo 52 >> initial

echo 54 >> initial

./mkhist initial -1

cat initial.hist


Run mkhist on continuous decimal numbers (double floating point precision). This calls the make_hist subroutine, whose algorithm starts on line 19:

./mkhist initial 3 #divides the data into only 3 bins

cat initial.hist


Let's look more at the .cpp source code using Vi. C programs in this style begin with their subroutines [void subroutine(input)] and end with the main() function below them (line 138); the subroutines get called inside main(). mkhist consists of two subprograms: make_hist and make_histd (defined on lines 19 and 80). To complicate things, variables can be local to a specific function, passed to a function as arguments inside the parentheses, or global to all functions (usable throughout the entire program).
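A bare-bones skeleton of that layout, with hypothetical names, just to show where each kind of variable lives:

#include <cstdio>

int nvals = 0;                          // global variable: visible in every function

void print_count(const char *label) {   // subroutine defined above main()
    printf("%s: %d values\n", label, nvals);  // 'label' arrives as an argument
}

int main() {                            // main() comes last and calls the subroutines
    int local_total = 0;                // local variable: exists only inside main()
    nvals = 5;
    local_total += nvals;
    print_count("example");
    return 0;
}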


On line 145, main() checks that two arguments are passed to the program: a file name and the integer number of bins to divide the data into. mkhist counts the number of times a value in the supplied data file falls in each bin between the min and max. That count is the frequency distribution, or histogram (normalized, it becomes a probability distribution).
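In C++ that argument check typically looks like the sketch below (the exact wording in mkhist.cpp will differ):

#include <cstdio>
#include <cstdlib>

int main(int argc, char *argv[]) {
    if (argc != 3) {                    // program name + data file + number of bins
        printf("usage: %s datafile nbins (use -1 to count integers)\n", argv[0]);
        return 1;
    }
    const char *fname = argv[1];
    int nbins = atoi(argv[2]);          // convert the text argument to an integer
    printf("reading %s into %d bins\n", fname, nbins);
    return 0;
}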


We could label the bins however we want; mkhist prints their midpoints in the first output column. Create the file:

echo 10.01 > 10.dat

echo 10.5 >> 10.dat

echo 10.8 >> 10.dat

echo 11.2 >> 10.dat

echo 11.99 >> 10.dat


./mkhist 10.dat 2

mkhist generates the 10.dat.hist output file (overwrites any previous file):

cat 10.dat.hist

10.505 3

11.495 2

Two equal-width bins were created, with the dividing line at

(10.01+11.99)/2.0 = 11.0

The first column in 10.dat.hist lists the center of each bin; the second column is how many data points were counted in that bin.
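To see where those numbers come from:

bin width = (11.99 - 10.01) / 2 = 0.99

bin 1 spans 10.01 to 11.00; midpoint = 10.01 + 0.99/2 = 10.505 (contains 10.01, 10.5, 10.8: count 3)

bin 2 spans 11.00 to 11.99; midpoint = 11.00 + 0.99/2 = 11.495 (contains 11.2, 11.99: count 2)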


Recall from yesterday that overflow is a persistent source of error in computer calculations. C is a fast language that employs finite variable precision. make_histd uses 32-bit int integers, which have a range of approximately +/- 2^31. make_hist, on the other hand, uses 64-bit double floats for continuous data, which allow a much larger range of exponents. Create another data file:


echo 2147483647 > overflow

echo 2147483648 >> overflow 2^31 = 2,147,483,648 ≈ 2.147 × 10^9

./mkhist overflow -1


echo -2147483649 > underflow

echo -2147483648 >> underflow

./mkhist underflow -1

cat underflow.hist
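To see the wraparound directly, here is a minimal C++ sketch (it assumes a 32-bit int, which Q6 below asks you to verify; note that signed overflow is technically undefined behavior in C, but on typical machines it wraps around):

#include <cstdio>
#include <climits>

int main() {
    printf("int range: %d to %d\n", INT_MIN, INT_MAX);  // typically -2147483648 to 2147483647
    int big = INT_MAX;        // 2147483647 = 2^31 - 1
    big = big + 1;            // overflow: wraps to a large negative number on most machines
    printf("2147483647 + 1 = %d\n", big);
    return 0;
}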

Q6: Compare with a classmate to check that you both have 32-bit integers in C and explain why you get an error. What is the range of integers make_histd can accept (for which it produces the correct result)?


Conclusion- mkhist, and especially the make_histd subroutine, has some limitations. First, it makes assumptions about the input data that limit the cases for which it produces a correct result. Second, it does not check that the data fall within the assumed range before proceeding with the calculation. Lastly, it does not give the user a message, or provide comments in the source code, about these limitations, although it is generally assumed that programmers are familiar with the concept of overflow error.


C is difficult to write in for a variety of reasons. Because you have to declare each variable as an integer or a float, we had to write two subroutines for different types of data, which ended up having different permissible input ranges. Compare this to the built-in max function in Python, which is smart enough to operate on integers or floats:

>>> max(3.674,4000,-4000000000.4,1.9E200)

1.9e+200


Before 64-bit CPUs became the norm, double-precision calculations would take about twice as long on older 32-bit processors, so most programs traditionally employed single-precision float variables.
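A quick sketch of the precision difference (single-precision float keeps roughly 7 significant digits, double roughly 15-16):

#include <cstdio>

int main() {
    float  f = 1.0f / 3.0f;            // 32-bit single precision
    double d = 1.0  / 3.0;             // 64-bit double precision
    printf("float : %.17f\n", f);      // only the first ~7 digits are meaningful
    printf("double: %.17f\n", d);      // the first ~16 digits are meaningful
    return 0;
}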


Let's return to our problem from Tutorial 2: analyzing the data in rmsd.dat. The mkhist program expects only a single column of data, and since we want a histogram of only the y-axis data, we'll extract that column using awk:

awk '{print $2}' ../day2/rmsd.dat > rms

head rms print first 10 lines of file

Notice that the first data point (0.07 nm) is an outlier from the two predominant states. The zero time point corresponds to the starting state for the simulation. It is common practice to throw out an early portion (up to 10%) of the trajectory, since the simulation has not yet equilibrated. The data actually used in analysis is then called the 'production' data. We'll throw out the t = 0 data point to prevent it from affecting our histogram generation:

wc -l rms prints no. lines in file

tail -n 2000 rms > rms2

The tail command leaves us with the last 2,000 data points, a sufficient sample for estimating the probability density function.

./mkhist rms2 30 let's see if 30 is truly the golden number in statistics

2000 data values will be processed

Minimum value is: 0.168221

Maximum value is: 0.450382

cat rms2.hist

cp rms2.hist rms2.hist30

./mkhist rms2 300 now let's see what happens when you have too many bins

cp rms2.hist rms2.hist300

./mkhist rms2 5 or in this case too few bins.

cp rms2.hist rms2.hist5

Next, we will visualize our histograms to determine the optimal number of bins.

Proceed to: Part 3