Because C does not have a built-in min/max function like python, we must create an algorithm to first scan the data file for the minimum/maximum. Look at line 26 of the source code in mkhist.cpp and see if you can identify the lines of code that correspond with the pseudocode below...
vim mkhist.cpp
26G
// Notice in C that statements must end in semicolon: ;
// Comments begin with two slashes: //
loop_begin (while end-of-file is not reached): (pseudocode)
read value on next line of file (fscanf)
if value < current min,
then update min to current value // set variable min to new minimum found
if not EOF (end of file), // we need a check to escape the infinite loop (EOF)
then goto loop_begin
loop_end
Close the source code file (:q) and let's try the program on some data in Unix:
echo 1 > integers
echo 2 >> integers
echo 2 >> integers
echo 4 >> integers
./mkhist
Mac: you may get an OS message that this is from an unidentified developer. Open System Preferences, Security & Privacy, General and click the button below to 'Allow' the mkhist app.
PC: if you get an error running mkhist, check that gcc and g++ are the same version:
gcc --version
g++ --version
If the versions are different, you will need to run the Cygwin setup.exe file again. Don't re-install the other programs (change python/tcsh/etc. tabs to keep), then change show Full/Pending package to list by Category>All>Devel. Change the version for gcc-core and gcc-g++ to the older version and install them (close your terminal so it can uninstall the wrong version). Version 9.3 is recommended, but 10.2 may worker on newer computers. In a new terminal, now try compiling again:
g++ mkhist.cpp
Executing ./mkhist with no arguments should give you a usage message telling you what your input should be. mkhist expects a file name followed by the number of bins to divide your data range into. The bins are equally spaced intervals, or buckets, which your data values will be counted in to make the histogram. If you have discrete integer data you may wish to simply count the number of times each integer occurs for all numbers between the min and max. Supplying -1 (minus one) as the second argument causes the program to do this as a separate subroutine, called make_histd (line 80). Make_histd simply counts the frequency of whole numbers observed between the min and max.
./mkhist integers -1
The file integers.hist has been created telling us how many of each integer we have:
cat integers
cat integers.hist
Q5: Create a few different text files of sample data. Explain the histogram that mkhist creates based on the input and how it determines the bin width. Lastly, compare what happens if you have closely spaced numbers versus numbers that are different in magnitude.
E.g. copy and paste this block of code:
echo 30 > initial
echo 42 >> initial
echo 42 >> initial
echo 44 >> initial
./mkhist initial -1
cat initial.hist
Then run mkhist using continuous decimal numbers (double floating point precision). This calls the make_hist subroutine algorithm starting on line 19:
echo 0.00000013 > decimal
echo 2.0 >> decimal
echo 2.0000001 >> decimal
echo 3.99999992 >> decimal
./mkhist decimal 2 # divides the data into only 2 bins
cat decimal.hist
Let's look more at the .cpp source code using Vi. C programs begin with their subroutines [void subroutine(input)] and then must end with the main() function below (line 138). Our subroutines get called inside the main() function. mkhist consists of two subprograms: make_hist and make_histd (defined on lines 19 and line 80). To complicate things, variables can either be used within specific functions, passed to the functions as arguments inside the parentheses, or can be global to all functions (used throughout the entire program).
On line 145, main() checks that two arguments are passed to the program: a file name and the integer number of bins to divide the data into. mkhist counts the number of times a value in the supplied data file falls in each bin between min and max. The count is the probability frequency distribution, or histogram.
We could label the bins however we want. mkhist prints their midpoint in the first ouput column.. Create the file:
echo 10.01 > 10.dat
echo 10.5 >> 10.dat
echo 10.8 >> 10.dat
echo 11.2 >> 10.dat
echo 11.99 >> 10.dat
./mkhist 10.dat 2
mkhist generates the 10.dat.hist output file (overwrites any previous file):
cat 10.dat.hist
10.505 3
11.495 2
Two equal width bins were created with the dividing line at
(10.01+11.99)/2.0 = 11.0
The first column in 10.dat.hist lists the center of each bin; the second column is how many data points were counted in that bin.
Recall from yesterday that overflow is a persistent source of error in computer calculations. C is a fast language that employs finite variable precision. make_histd employs 32-bit int integers, which have a range of +/- 231. make_hist on the other hand, for continuous data employs 64-bit double floats, and has a much larger range of allowed exponents. Create another data file:
echo 2147483647 > overflow
echo 2147483648 >> overflow 231 = 2.14 648 × 109
./mkhist overflow -1
echo -2147483649 > underflow
echo -2147483648 >> underflow
./mkhist underflow -1
cat underflow.hist
Q6: Compare with a classmate to check that you both have 32-bit integers in C and explain why you get an error or nonsensical result. What is the range of integers make_histd can accept (for which it produces the correct result)?
Conclusion- mkhist, and especially the make_histd subroutine, has some limitations. It makes assumptions about the input data that limit the cases to which it produces a correct result. Second, it does not run a check to make sure the data fall within the assumed range before proceeding with the calculation. Lastly, it does not give a message to the user or provide comments in the source code about these limitations, although it is generally assumed that the programmers are familiar with the concept of overflow error.
C is difficulty to write for a variety of reasons. Because you have to declare a variable as an integer or float, we had to write two subroutines for different types of data, which ended up having different permissible input ranges. Compare this to the built-in function in python, which is smart enough to operate on integers or floats:
>>> max(3.674,4000,-4000000000.4,1.9E200)
1.9e+200
Before 64-bit CPUs became the norm, double precision calculations would take twice as long on old 32-bit processors, and so most programs traditionally employed single precision float variables.
Let's return to our problem from Tutorial 2: analyzing the data in rmsd.dat. The mkhist program expects only a single column of data, and since we desire a histogram for only the y-axis data we'll extract it using awk:
awk '{print $2}' ../day2/rmsd.dat > rms
head rms print first 10 lines of file
Notice that the first data point (0.07 nm) is an outlier from the two predominant states. The zero time point corresponds to the starting state for the simulation. It is common practice to throw out an early portion (up to 10%) of the trajectory, since the simulation has not yet equilibrated. The data actually used in analysis is then called the 'production' data. We'll throw out the t = 0 data point to prevent it from affecting our histogram generation:
wc -l rms prints no. lines in file
tail -n 2000 rms > rms2
The tail command leaves us with the last 2,000 data points, a sufficient sample for estimating the probability density function.
./mkhist rms2 30 let's see if 30 is truly the golden number in statistics
2000 data values will be processed
Minimum value is: 0.168221
Maximum value is: 0.450382
cat rms2.hist
cp rms2.hist rms2.hist30
./mkhist rms2 300 now let's see what happens when you have too many bins
cp rms2.hist rms2.hist300
./mkhist rms2 5 or in this case too few bins.
cp rms2.hist rms2.hist5
Next, we will visualize our histograms to determine the optimal number of bins.