Tutorial 3) Part 2:  Histograms and Probability Functions

C is a low-level language, so we will first need to compile our C source code using the GCC compiler. Compiling creates a binary executable file that we can run on our specific machine/computer architecture.

Download and move the source code file:  Files: mkhist.cpp

cd ~/day3

 PC:   mv /cygdrive/c/Users/username/Downloads/mkhist.cpp ~/day3 

 Mac:     mv ~/Downloads/mkhist.cpp ~/day3 


 ls -lrt

 g++ mkhist.cpp  compile with GCC

 ls -lrt  an executable file should be created: a.out (Mac) or a.exe (PC)

Special Instructions for Mac:


If your Mac can't install the C compiler and doesn't create the a.out file, download the binary executable I compiled on my Mac OS 10.13:  Files: mkhist.

Note:  for Macs from 2022 or later, try this file, compiled in 2024 on an M3 running OS 14.7:  Files: mkhist2.

 Mac:     mv ~/Downloads/mkhist ~/day3   move mkhist to current directory (.)

 Safari:   mv ~/Downloads/mkhist.dms ~/day3/mkhist   (Safari adds .dms to the file name.)

    chmod u+x mkhist  add execute permission to the program for user

    ./mkhist  if the executable works on your OS, you'll get my "usage: " syntax warning message.


Computing histograms and probabilities (C++)


By default GCC should create a binary executable file named a.out (Mac) or a.exe (PC). Rename it:

 Mac:   mv a.out mkhist 

 PC:    mv a.exe mkhist 

 ./mkhist  run the program and it should abort and tell you the proper syntax
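
To see where that "usage: " message comes from, and what a histogram program has to do, here is a minimal sketch of an mkhist-like program. This is NOT the actual mkhist source; the input format (one number per line) and the binwidth argument are assumptions made for illustration only.

 // histsketch.cpp -- illustrative sketch only, not the real mkhist.
 // Assumed syntax:  ./a.out datafile binwidth
 #include <cstdio>
 #include <cstdlib>
 #include <cmath>
 #include <map>

 int main(int argc, char* argv[]) {
     if (argc != 3) {                        // wrong number of arguments:
         fprintf(stderr, "usage: %s datafile binwidth\n", argv[0]);
         return 1;                           // abort with the syntax message
     }
     double width = atof(argv[2]);
     if (width <= 0) { fprintf(stderr, "binwidth must be > 0\n"); return 1; }
     FILE* fp = fopen(argv[1], "r");
     if (!fp) { perror(argv[1]); return 1; }
     std::map<long, long> counts;            // bin index -> count
     long n = 0;
     double x;
     while (fscanf(fp, "%lf", &x) == 1) {    // read one value at a time
         counts[(long)floor(x / width)]++;   // drop it into its bin
         n++;
     }
     fclose(fp);
     for (auto& b : counts)                  // bin center, count, probability
         printf("%g %ld %g\n", (b.first + 0.5) * width, b.second,
                (double)b.second / n);
     return 0;
 }

Compile and run it the same way (g++ histsketch.cpp, then ./a.out). Each output line gives a bin center, the count of values that fell into that bin, and the count divided by the total -- the estimated probability of landing in that bin.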


Sometimes programmers will compile using a flag:

 g++ -O2 mkhist.cpp    capital 'O' (the letter, not zero)

 ls -lrt   You can tell the two executables apart by their file sizes (column 5).

The capital 'O' stands for optimization level. A value of 2 or 3 is the highest recommended: it will result in faster code (usually 5% or so), but when it crashes you are less likely to get a meaningful error message.


C is a bit outdated, so we're using a variant known as C++ ("C plus plus"). The main advantage of C++ is a feature known as dynamic memory allocation. In the early days of programming, memory was allocated statically: every time you ran the program, the same amount of memory would be used, no matter how small the data file, because arrays had to be declared at a fixed size in the program and you could not always predict the size of the data. The main problem with this was that if a data set was larger than the declared array, the program would crash. With C++, or with languages like Python, you can grow an array as needed at runtime to accommodate big data sets.
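
As a short illustration of the difference, here is a sketch contrasting a C-style static array, whose size is fixed when the program is written, with a C++ std::vector, which grows at runtime to fit whatever the data holds. The array size of 1000 is arbitrary, chosen just for the example.

 // dynsketch.cpp -- static vs. dynamic allocation (illustrative sketch).
 #include <cstdio>
 #include <vector>

 int main() {
     double fixed[1000];             // static: exactly 1000 slots, every run,
     (void)fixed;                    // no matter the data; value 1001 of a
                                     // bigger file would overflow this array

     std::vector<double> data;       // dynamic: starts empty
     double x;
     while (scanf("%lf", &x) == 1)   // read values from standard input
         data.push_back(x);          // the vector grows as needed at runtime
     printf("stored %zu values\n", data.size());
     return 0;
 }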


With dynamic memory allocation, the size of an array is limited only by the amount of RAM in your computer, which nowadays is gigabytes (GB). The Unix command top or the Activity Monitor in your OS will tell you how much memory and CPU each process is using.


Memory usage is not an issue with simple programs such as those discussed in this course. In a running-sum calculation, for example, we read one line at a time and add it to a single variable: sum. What was in the previous records is no longer needed; in other words, we don't need to store the entire data file in an array all at once, because earlier records don't affect each later step of the calculation. In the field of big data, such a running sum is limited only by how fast the computer can read the data from your hard drive.
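
Here is a sketch of that running sum: each value is added to sum as it is read and then forgotten, so memory use stays constant no matter how large the input is.

 // sumsketch.cpp -- running sum without storing the data (sketch).
 #include <cstdio>

 int main() {
     double sum = 0.0, x;
     long n = 0;
     while (scanf("%lf", &x) == 1) {  // one record at a time from stdin
         sum += x;                    // previous records are never stored
         n++;
     }
     printf("n = %ld  sum = %g  mean = %g\n", n, sum, n ? sum / n : 0.0);
     return 0;
 }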