HONR269L
January 31, 2019
Today I confirmed that I was working with Rebecca Taylor on the IceCube project mentored by Dr. Erik Blaufuss.
February 3, 2019
Rebecca and I confirmed with Dr. Erik Blaufuss that we will meet weekly at 2:30 PM on Fridays.
February 5, 2019
Today I took a deep look into the logbooks of Jake Cassell, Kun Do, and Rachel Broemmelsiek from last year. They all worked on an IceCube project of their own. From these logbooks I gathered the following resources:
https://en.wikipedia.org/wiki/Neutrino_detector
http://icecube.umd.edu/Home.html
http://icecube.wisc.edu/
https://sites.google.com/a/physics.umd.edu/honrxxx/logbook/268n-2016/juan-dupuy
https://sites.google.com/a/physics.umd.edu/honrxxx/logbook/268n-2016/matthew-kirby
https://sites.google.com/a/physics.umd.edu/honrxxx/logbook/268n-2015/anat-berday-sacks
http://iopscience.iop.org/article/10.1088/1475-7516/2018/01/025?pageTitle=IOPscience
http://iopscience.iop.org/article/10.3847/1538-4357/aa9d94/meta
Some of these resources are logbooks from years prior to last year; until now I was not sure that those years included IceCube projects.
February 8, 2019
Today was Rebecca's and my first meeting with Dr. Erik Blaufuss. We discussed some ideas for what our project could focus on. We are still open to new ideas at the moment, but for the time being, the two areas we are considering are:
1) Using Boosted Decision Trees with public IceCube data
2) A point-source search, looking for statistically significant correlations between a catalog of candidate sources and detected neutrinos
The first idea stems from Rebecca's and my interest in machine learning, while the second idea stems from the IceCube projects of prior years. At the moment these ideas are very broad, as they should be, since we have not settled on our focus yet and are still open to ideas beyond these two. Dr. Blaufuss will soon send us literature relevant to the first idea, so after that I will include further details about it in this logbook. As for the second idea, the basic premise is to look for significance in certain parts of the sky by comparing data from those areas with pseudo-random data. Within the next two weeks we should have finalized our focus for this project.
In the coming week, Rebecca and I will develop new ideas, receive more literature from Dr. Blaufuss, and return to Dr. Blaufuss with further questions.
February 11, 2019
Today Dr. Erik Blaufuss sent us some literature which I will begin reading shortly. One of the papers is an overview of IceCube that I cannot share here. Here are the rest:
Boosted Decision Trees:
https://arxiv.org/pdf/physics/0408124.pdf
IceCube and Blazars:
https://arxiv.org/abs/1807.08816
https://arxiv.org/abs/1807.08794
IceCube Point Source Searches:
https://arxiv.org/abs/1811.07979
February 12, 2019
I have begun reading the literature from yesterday. Here are my findings so far for the IceCube overview and the paper on boosted decision trees:
IceCube Overview:
The IceCube Neutrino Observatory is able to perform neutrino astronomy by detecting Cherenkov light (photons emitted by charged particles moving through a medium faster than light moves in that medium) from particles produced when neutrinos interact with nuclei in the ice near the observatory. IceCube does this through the use of 5160 sensors embedded in the very clear Antarctic ice. These sensors contain light detectors able to pick up Cherenkov light from nearby particles. In summary, the chronological order of a neutrino event being observed by IceCube is as follows:
A neutrino strikes a nucleus within the Antarctic ice near IceCube, producing secondary particles.
These secondary particles are at such high energies that they move through the ice faster than light does in the ice. This causes photons to be emitted, called Cherenkov light.
Due to the clarity of the ice and the strength of the light detectors in IceCube's sensors, this Cherenkov light is observable to IceCube.
Through the collaboration of the thousands of sensors, the neutrino event can be reconstructed.
Questions:
What is the significance of the "knee" and "ankle" seen in the energy spectrum of the observed flux of cosmic rays (figure 3)?
If high-energy neutrinos may reach Earth "unscathed" from cosmic distances, how are neutrino sources differentiated in terms of their distance from Earth?
Boosted Decision Trees:
Before discussing boosted decision trees, plain decision trees must be discussed. In this example, the decision tree is a method for sorting elements into two groups based on their features. In order to create a decision tree, you need a data set of elements whose membership in the two groups is already known. You determine variables of the elements that may be used to separate the elements into the two groups. Then you look at which value of each variable splits the elements into the two groups best, identifying which combination of variable and value splits the set best. You use this to separate the two groups, leaving the more homogeneous group untouched and continuing the process with the other, repeatedly, until a set of final semi-homogeneous groups, called leaves, remains.
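To make the splitting step concrete, here is a small illustrative sketch (not from the paper) that scores candidate cuts on a single toy feature using the Gini index P(1-P) discussed in the paper, and picks the cut that leaves the two groups most homogeneous:
def gini(labels):
    # Gini index of a group of 0/1 labels: P*(1-P) summed over both classes, i.e. 2*P*(1-P).
    if not labels:
        return 0.0
    p = float(sum(labels)) / len(labels)
    return 2.0 * p * (1.0 - p)

# Toy training set: (feature value, known group) pairs.
data = [(1.0, 0), (2.0, 0), (2.5, 0), (2.7, 1), (3.0, 1), (4.0, 1), (5.0, 1)]

best = None
for cut in (1.5, 2.2, 2.6, 2.8, 3.5, 4.5):      # candidate splitting values
    left = [label for value, label in data if value <= cut]
    right = [label for value, label in data if value > cut]
    # Weighted impurity of the two resulting groups: lower means a cleaner split.
    score = (len(left) * gini(left) + len(right) * gini(right)) / float(len(data))
    if best is None or score < best[0]:
        best = (score, cut)

print(best)   # the best cut here is 2.6, which separates the two toy groups perfectly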
Questions:
How is the order of distinguishing features determined? Is this what the weighting process determines? (for decision trees, not boosted decision trees)
It seems that the order of features is simply determined by trying every splitting value for each variable and determining which combination best separates the two groups. If so, this answers the above question. But also, if so, what does the weighting process determine?
Perhaps the weighting is just the way of judging which combination is best when trying different splitting values and different variables. If so, that answers this question.
In the most simple examples, does splitting a branch always result in at least one leaf?
When determining the purity of a branch P, what does P(X) mean? Specifically, how does one calculate P(1-P)?
February 15, 2019
Today was our second meeting with Dr. Blaufuss. Our discussion primarily focused on furthering our understanding of IceCube overall and deciding what approach we would take with our project moving forward. As we discussed IceCube, many of my questions from last week's literature review were answered. Moving forward, we have decided to go with the machine learning approach. In discussing this approach, Dr. Blaufuss told Rebecca and me that it does not limit us to boosted decision trees; neural networks and other machine learning techniques are options as well. This week we will be learning more about these techniques and getting our feet wet coding with Python in Enthought Canopy, though at this moment we have only researched boosted decision trees as far as computing is concerned. During our meeting many of my questions regarding boosted decision trees were answered as well. Below are notes of what we discussed, separated by whether they were more relevant to IceCube or to boosted decision trees.
Topics of our Meeting
IceCube
The graph below features the energy spectrum of the observed flux of cosmic rays, as mentioned last week. I asked about the knee and ankle labeled in the graph. The knee is a break in the spectrum, theorized to mark the difference between particles coming from our galaxy and particles from outside of it (extragalactic). The ankle is less understood, but one theory is that it separates particles from within our galaxy from higher-energy particles that have an extragalactic component.
Also about the above graph: the scale is logarithmic, the x-axis refers to the energy of the observed particles, and the y-axis refers to the flux of such events.
Since space is so empty, neutrinos can reach Earth completely unscathed from even the edges of the universe.
There are three types of neutrino events observable by IceCube, following the three flavors of neutrinos in the Standard Model.
Electron neutrinos release electrons when they interact with ice nuclei. These electrons last only tens of centimeters before releasing most of their energy and producing lower-energy electrons. This produces a shower of electrons that looks nearly spherical when observed by IceCube. Because the electrons travel such short distances, an electron neutrino has to interact with ice inside IceCube in order to be observed.
Muon neutrinos likewise release muons when they interact with ice nuclei. However, muons travel much farther than electrons do before decaying and show up in IceCube as long tracks. This also means that muon neutrinos that interact only near IceCube can still be observed.
Tau neutrinos are the strangest of the three flavors. They have not actually been observed by IceCube, but in simulations a tau event appears as two spherical showers, because the tau produced in the first interaction later decays (for example into an electron). Such events may possibly be hiding among events classified as electron events, but nothing is concrete in that line of thinking yet.
A neutrino observed as any of the three types may actually have originated as any type of neutrino. This is due to neutrino oscillations. While neutrinos are traveling through space, they exist in a superposition, a mix of the three types. When a neutrino finally interacts with something, it collapses into one of the three types.
Boosted Decision Trees
The weighting process is a technique that essentially gives splits that are more important more power in sorting the elements.
In the case of boosted decision trees, there may actually be forests: large sets of boosted decision trees that can be compared and averaged in order to produce a final, optimized classifier.
Some useful tools for boosted decision trees and IceCube computing are scikit-learn for Python and the Amazon Web Services cloud computing system, which allows hundreds of hours of computing power to be accessed by students like Rebecca and me for the purpose of optimizing boosted decision trees and other sorting methods.
February 17, 2019
Today I started coding in Python 2 in Enthought Canopy based on recommendations by Dr. Blaufuss.
This is what Canopy looks like upon startup. Installation was straightforward and worked like installing any other software on Windows. Below is the screen that appears after clicking "Editor".
The layout has four main parts. The left vertical box is the file browser, which shows what folder you are working in. The horizontal box at the very bottom (clear in this image) is the editor status bar, which shows information like the line number and character number. The white horizontal box above it is the Python pane, which allows for quick experimentation and shows the output of scripts. Scripts are written in the box above that, called the code editor. This is more visible below, which shows the layout after clicking "Create a new file".
Above, in the code editor, is a simple script that prints "hello world", which is visible in the Python pane.
Python can perform operations on Numbers, the kinds of data within Python that refer to numerical values. Values such as 1, 2, 3, 10, 100, etc. are all integers and thus have the type int. Decimal numbers such as 1.0, 1.1, 2.1, 2.2, 100.1, etc. have the type float. There are more numeric types than these, but for now int and float are the most useful ones to be aware of. It is important to distinguish between these data types despite their similarities, because certain operations behave differently depending on which types they are given.
Some basic numerical operations are shown below (Addition: +, Subtraction: -, Multiplication: *, Division: /).
It is important to note that in Python 2 the / operator (division) always returns an integer when both operands are integers. If given a float (e.g. 5.0), the / operator returns a float.
Division that ignores any remainder is called floor division (or integer division), and it is what Python 2 does when both operands are integers. Finding the remainder is done with the % operator.
Before moving ahead to other types of data within Python, it is important to note that Python operations are performed in the standard order of operations. This means that the * and / operations will be done before the + and - operations when they are within a single line. This is clear when examining certain examples of lines of codes like those below.
In the first example, it is clear that the * operation is performed before the - operation because the result is 0. Rather than doing 2 - 1 = 1 and then 1 * 2 = 2, Python performs 1 * 2 = 2 and then 2 - 2 = 0. In the second example, the result of 1.5 shows that / is performed before -; a result of 0.5 would be expected if the operations were simply performed left to right.
Despite the order of operations being explicitly defined, it can still be confusing when dealing with a large number of operations or with more complicated operations. Fortunately, Python allows parentheses, (), to be used within the code to force Python to perform the operations inside them before any other operations. The code below should show how parentheses can significantly change what code does, especially when compared to the previous code example.
Another useful mathematical operation within Python is ** for raising a number to a power.
That is enough information to get started with working with numbers, but there is much more to learn too of course. For now it is probably more useful to get used to the basics of working with numbers and to move on to working with other types of data.
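Since the Canopy screenshots are not reproduced here, the following short sketch collects the operations discussed above; the results in the comments follow Python 2 behavior, as used in Canopy:
print(1 + 2)        # 3, an int
print(7 - 2.0)      # 5.0, a float because one operand is a float
print(3 * 4)        # 12
print(5 / 2)        # 2, floor division because both operands are ints (Python 2)
print(5.0 / 2)      # 2.5, ordinary division because one operand is a float
print(5 % 2)        # 1, the remainder
print(2 - 1 * 2)    # 0, because * is performed before -
print(2 - 1 / 2.0)  # 1.5, because / is performed before -
print((2 - 1) * 2)  # 2, parentheses force the subtraction to happen first
print(2 ** 3)       # 8, exponentiation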
Now that the basic layout of Canopy has been shown several times, I will crop future screenshots to save space rather than showing the entire Canopy layout.
February 18, 2019
Today I was looking for a paper on neural networks since Dr. Blaufuss recommended finding one. I could not find a paper explaining neural networks in the context of IceCube, but I did find one explaining neural networks in the context of classifying neutrino events (so still significantly relevant to IceCube).
Here is that paper:
Neural Network Neutrino Event Classifier:
https://arxiv.org/pdf/1604.01444.pdf
February 20, 2019
Today Dr. Blaufuss contacted AWS, Amazon Web Services, about granting Rebecca and me access to some of their student resources for cloud computing. He sent a link for us to apply to the services, which I have completed. Hopefully the application will be accepted soon.
February 22, 2019
Today Dr. Blaufuss is at a conference for IceCube, so we did not have our weekly meeting. I have received an email from AWS saying my application was accepted so I can use their computing resources. We are not yet at a point in our project to start using these services, but I may start looking into how to use the services soon.
February 24, 2019
Today I worked through some simple exercises covering basic Python concepts such as strings, numbers, variables, and functions. Below is a script that uses all of those concepts.
The output of the script is shown below it. The print statements in the code are what produce that output, and the comments in the script explain what is being printed.
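The script and its output were included as screenshots; a comparable script touching the same concepts, for illustration, might look like this:
# Strings, numbers, variables, and a function in one short script.
greeting = "hello"          # a string stored in a variable
count = 3                   # an integer
scale = 2.5                 # a float

def repeat(text, times):
    # Return the text repeated the given number of times, separated by spaces.
    return " ".join([text] * times)

print(repeat(greeting, count))   # hello hello hello
print(count * scale)             # 7.5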
The basics of Python were actually explored last semester on top of what I have done this semester, so moving forward I will begin learning scikit-learn, a library for machine learning in Python.
In order to learn scikit-learn, I am using the tutorial found on their site here: https://scikit-learn.org/stable/tutorial/basic/tutorial.html
The code for today is recreated from the code in that tutorial.
The following script is me learning the basics of exploring data sets within scikit-learn and actually training a classifier model with the sklearn.svm.SVC class.
The data set I focus on in the script below is the "Pen-Based Recognition of Handwritten Digits Data Set". This is a large set of handwritten numerical digits that are labeled with what the digits are meant to be (e.g. 1, 2, 3, etc.). The code creates a classifier that learns from the digits data set and predicts the label of a new digit.
The code is commented to explain precisely what I am doing.
The output of the print statements in the script above is visible below. In these print statements you can see the data in the digits data set, the labels of the set, and the predicted label for the last drawn digit.
As can be seen in the output above, the classifier predicted the label for the last digit to be 8. That last digit can be seen below.
It is clear that this digit is hard to decipher, but it is also impressive that the classifier identified it as an 8 considering it does look like an 8 more than any other digit.
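For reference, a minimal version of that classifier, in the spirit of the scikit-learn tutorial being followed (the parameter choices below mirror the tutorial's), looks roughly like this:
from sklearn import datasets, svm

digits = datasets.load_digits()

# Each entry of digits.data is a flattened 8x8 image; digits.target holds the true labels.
print(digits.data)
print(digits.target)

# Train on every digit except the last one...
classifier = svm.SVC(gamma=0.001, C=100.)
classifier.fit(digits.data[:-1], digits.target[:-1])

# ...then predict the label of the held-out final digit.
print(classifier.predict(digits.data[-1:]))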
February 25, 2019
Today Dr. Blaufuss met with Rebecca and me (on a Monday) because he is unexpectedly busy this Friday. We discussed our progress this week and what we would like to do next week. For the next week Dr. Blaufuss wants us to try developing a neural network, boosted decision tree, or other machine learning method with a data set. He recommended using the census dataset from a tutorial I have linked below:
https://medium.com/district-data-labs/building-a-classifier-from-census-data-18f996c4d7cf
February 27, 2019
Today I did a little more research into neutrino oscillations. Specifically I wanted to understand the graph below:
I think I have come to understand it now.
The graph shows the probability of an electron neutrino collapsing to each type of neutrino (electron in black, muon in blue, tau in red) over varying distances as it travels. It is visible that these probabilities sum to 1 throughout the graph. If the neutrino collides with a particle at a certain distance, the graph shows the chance of the neutrino being each of the three flavors at that point. The graph is not exactly quantitative, but it is indicative of the overall behavior of neutrino oscillations.
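To get a feel for where curves like this come from, here is a small sketch using the standard simplified two-flavor oscillation formula, P = sin^2(2*theta) * sin^2(1.27 * dm^2 * L / E), with L in km, E in GeV, and dm^2 in eV^2; the parameter values below are round illustrative numbers, not IceCube's:
import numpy as np

theta = 0.6      # mixing angle (illustrative)
dm2 = 2.5e-3     # mass-squared difference in eV^2 (illustrative)
energy = 1.0     # neutrino energy in GeV

for distance in np.linspace(0, 2000, 9):   # travel distance in km
    p_change = np.sin(2 * theta) ** 2 * np.sin(1.27 * dm2 * distance / energy) ** 2
    # In the two-flavor picture the two probabilities always sum to 1, just like in the graph.
    print("%6.0f km   P(oscillated) = %.2f   P(unchanged) = %.2f" % (distance, p_change, 1 - p_change))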
March 2, 2019
Today I started working through the tutorial discussed in the following article, which Dr. Blaufuss introduced to Rebecca and me on February 25 (as I mentioned on that day in the logbook).
https://medium.com/district-data-labs/building-a-classifier-from-census-data-18f996c4d7cf
The tutorial walks through creating a machine learning classifier on census data. The code below is primarily from the tutorial, but has some changes by me along with additional comments.
The process of applying a classifier to a dataset consists of the following steps:
Data Ingestion
Data Exploration
Data Management
Feature Extraction
Label Encoding
Imputation
Model Build
Model Operation
These steps are documented below:
Data Ingestion
The code below is about retrieving the census dataset from the internet.
The code above does not print anything, but it does successfully download the census dataset.
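The tutorial's ingestion code is not reproduced here; a minimal sketch of the same step, assuming the usual UCI repository URL and the commonly used column names for this dataset, could be:
import pandas as pd

URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
NAMES = ["age", "workclass", "fnlwgt", "education", "education-num", "marital-status",
         "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss",
         "hours-per-week", "native-country", "income"]

# Download the data and keep a local copy so the download only happens once.
census = pd.read_csv(URL, names=NAMES, skipinitialspace=True)
census.to_csv("adult.data.csv", index=False)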
Data Exploration
The code below is all about utilizing python to examine the data and learn about it before focusing in on machine learning.
The code above prints the first five entries of the dataset as seen below as well as two graphs below that.
The code produces the graphs below. These graphs compare occupation category with annual income, and education level with annual income respectively.
From the information above, one can start to consider attributes one might want to use to train a model and what attributes may be good for prediction.
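A sketch of the kind of exploration described above (first few rows plus the two comparison plots), using the local copy saved during ingestion and the seaborn plotting library; the tutorial's exact plotting calls may differ:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

census = pd.read_csv("adult.data.csv")

print(census.head())     # the first five entries
print(census.shape)      # how many rows and columns we have to work with

# How occupation and education break down by income bracket.
sns.countplot(y="occupation", hue="income", data=census)
plt.show()
sns.countplot(y="education", hue="income", data=census)
plt.show()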
Data Management
The code below includes the steps needed to make the data usable by scikit-learn for machine learning purposes.
The code above does not print anything, but does much of the work needed in order to prepare the census dataset for machine learning.
Feature Extraction
Feature extraction is a large step that prepares the sections of the data that will be used as features for machine learning. This section is made up of label encoding and imputation.
Label Encoding
Label Encoding is all about taking categorical data (like family status) and mapping it to numerical data so it can be used for machine learning purposes.
As an example, encoding the gender labels ['female', 'female', 'male', 'female', 'male'] results in an array of 0s and 1s, as seen below.
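This is what scikit-learn's LabelEncoder does; a minimal sketch of the example above:
from sklearn.preprocessing import LabelEncoder

genders = ["female", "female", "male", "female", "male"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(genders)

print(encoded)            # [0 0 1 0 1]
print(encoder.classes_)   # ['female' 'male'], i.e. female -> 0 and male -> 1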
Imputation
As the second half of feature extraction, imputation is about making incomplete data usable, usually by removing missing data or replacing it somehow.
The code above does not print anything but makes the 'workclass', 'native-country', and 'occupation' features usable despite missing values.
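A small sketch of the idea on a toy column, using scikit-learn's most-frequent imputation strategy (the class lives in sklearn.impute in recent versions; older versions used sklearn.preprocessing.Imputer, and the tutorial's own wrapper may differ):
import numpy as np
from sklearn.impute import SimpleImputer

# Toy encoded column with one missing value; the real code applies the same idea
# to the 'workclass', 'native-country', and 'occupation' features.
column = np.array([[0.0], [2.0], [np.nan], [2.0], [1.0]])

imputer = SimpleImputer(strategy="most_frequent")
print(imputer.fit_transform(column))   # the missing value is replaced by the most frequent one, 2.0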
Model Build
Model building is the actual code that makes up the classifier. In this case that classifier is a logistic regression classifier.
The code above prints the report below, which reflects the quality of the model. Precision is the fraction of the predictions for a given category that are correct, while recall is the fraction of the actual members of a category that are predicted correctly.
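A stripped-down stand-in for the model build step, with synthetic data in place of the fully encoded census features (the tutorial's real pipeline also contains the encoding and imputation stages):
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Synthetic two-class data standing in for the prepared census features.
X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = Pipeline([
    ("scale", StandardScaler()),
    ("classify", LogisticRegression()),
])
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))   # precision and recall per class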
Model Operation
Finally, model operation is putting the model to use with user inputs in order to make new predictions on never before seen data.
The code above creates prompts for the user as seen below. At the end of these prompts the program prints out a prediction of '<=50k' or '>50k' regarding the user's income. This process is seen below, including the responses the program produces when a prompt is given an incorrect value.
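The prompting loop itself is simple; here is a rough sketch of the idea (the prompts, valid options, and the final prediction call are illustrative, not the tutorial's exact code):
def ask(prompt, valid=None):
    # Keep asking until the answer is one of the allowed values (if a list is given).
    while True:
        answer = raw_input(prompt + ": ")   # input() on Python 3
        if valid is None or answer in valid:
            return answer
        print("That is not a valid value. Options are: " + ", ".join(valid))

education = ask("education", valid=["Bachelors", "HS-grad", "Masters"])
hours = ask("hours-per-week")
# The real code encodes these answers the same way the training data was encoded
# and then prints the model's prediction, '<=50k' or '>50k'.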
March 4, 2019
Today Rebecca and I met again with Dr. Blaufuss. Dr. Blaufuss told us about a possible approach we may take moving forward. The idea is that currently IceCube notifies the world of notable neutrino events, but sometimes the algorithm identifying these events incorrectly identifies a less notable event as significant. Dr. Blaufuss would have to provide us with datasets with many examples of misidentified events, so that may be difficult. Machine learning would be a good approach in improving this classifying model. For the next week Rebecca and I will explore using different classifying methods within scikit-learn with the census dataset.
March 9, 2019
Today I began preparing to explore scikit-learn's many machine learning methods. Scikit-learn's beginner tutorials utilize certain machine learning methods, but a multitude of other machine learning approaches are available in scikit-learn. Below is a link to scikit-learn's API reference, which has all of these approaches and more.
https://scikit-learn.org/stable/modules/classes.html
The first machine learning method I will try out is neural networks; more specifically, I will utilize a multi-layer perceptron classifier. I learned about the multi-layer perceptron classifier from scikit-learn's website at the page linked below.
https://scikit-learn.org/stable/modules/neural_networks_supervised.html#neural-networks-supervised
Multi-layer perceptron classifiers are examples of supervised learning algorithms, meaning the model is trained using labeled data in order to predict the labels of unlabeled data. These classifiers are made of an input layer, an output layer, and any number of hidden layers in between. The input layer is made up of the features of the data set, while the output layer is the predicted label given to the entry corresponding to the input feature values. The central idea behind multi-layer perceptrons is that the hidden layers transform the input values in order to produce the output layer. Each hidden layer is made up of some number of nodes. Each node in a layer calculates a weighted sum of the values produced by every node in the previous layer, adds a bias (some constant), and then transforms the result with an activation function; one such function is the unit step function, which either passes the value along or zeroes it out, depending on whether the weighted sum exceeds a certain threshold. The node then provides its output to all of the nodes in the next layer. The image below is from the last link and features a multi-layer perceptron with a single hidden layer.
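As a toy illustration of that forward pass (not scikit-learn's internals), here is one hidden layer of four such nodes computed with random weights; the step activation plays the role described above:
import numpy as np

def step(z, threshold=0.0):
    # Unit-step activation: 1 if the value clears the threshold, 0 otherwise.
    return (z > threshold).astype(float)

x = np.array([0.5, -1.0, 2.0])       # input layer: one value per feature

# Hidden layer with 4 nodes: each node takes a weighted sum of every input,
# adds its bias, and passes the result through the activation function.
W_hidden = np.random.randn(4, 3)
b_hidden = np.random.randn(4)
hidden = step(W_hidden.dot(x) + b_hidden)

# Output layer: a single node combining the hidden-layer outputs in the same way.
W_out = np.random.randn(1, 4)
b_out = np.random.randn(1)
output = step(W_out.dot(hidden) + b_out)

print(hidden, output)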
I will soon delve deeper into how to implement this type of classifier into scikit-learn. After that I will most likely work more with decision tree classifiers as described on scikit-learn's site as linked below.
https://scikit-learn.org/stable/modules/tree.html#tree
March 11, 2019
I have decided to take a step back after finding out about dummy classifiers in scikit-learn. Dummy classifiers are very simple classifiers in scikit-learn that I will explore first since they are simpler than the decision tree and neural network classifiers. The page below describes these classifiers.
https://scikit-learn.org/stable/modules/model_evaluation.html#dummy-estimators
I was able to get the dummy classifier working with the census data set. I simply changed the code within the "Model Build" as shown below.
The code is not significantly changed from the previous version. The main difference is that I imported the proper dummy module and placed a DummyClassifier() inside the existing pipeline. I did this for three types of dummy classifier: the dummy classifier has a strategy argument that significantly changes its behavior, and the three strategies I tried are 'stratified', 'most_frequent', and 'uniform'.
Now the dummy classifier must be explained. The name speaks for itself: a dummy classifier is simply a classifier that uses very basic rules to classify elements. The details depend on the strategy. The 'stratified' dummy classifier makes random predictions that follow the label distribution of the training set. The 'most_frequent' classifier simply guesses the most frequent label for every single element. The 'uniform' dummy classifier makes completely random predictions for the labels. All of these are implemented in the code above, and their reports can be seen below.
As can be seen, these dummy classifiers can actually make somewhat accurate predictions.
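The real code keeps the existing pipeline and just swaps its final estimator; this stripped-down sketch uses the classifiers directly on synthetic data to show the three strategies:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.metrics import classification_report

# Imbalanced synthetic data, standing in for the encoded census features and income labels.
X, y = make_classification(n_samples=1000, weights=[0.75, 0.25], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

for strategy in ("stratified", "most_frequent", "uniform"):
    clf = DummyClassifier(strategy=strategy)
    clf.fit(X_train, y_train)
    print(strategy)
    print(classification_report(y_test, clf.predict(X_test)))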
March 13, 2019
Today I moved forward from dummy classifiers onto multi-layer perceptron (neural network) classifiers as well as decision tree classifiers.
I started with decision tree classifiers, learning from the link below.
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier
Once again, I changed the "Model Build" section of the census code. I changed it to construct several different decision trees that have different parameters. The parameters I changed were criterion, splitter, and max_depth. Criterion can be "gini" or "entropy" and describes the measure used to judge the split quality of the nodes in the tree. Splitter can be "best" or "random" and refers to how the decision tree decides how to split the nodes of the tree. Max_depth can be None or an integer and refers to how many levels of splits may exist in the tree. In the code below, I simply make classifiers with these parameters, train them with .fit(), and print the quality of each tree.
The output below shows the quality of each of the decision trees. It is clear that a complex machine learning approach like a decision tree is not as sensitive to these changes as the dummy classifier was to its strategy. The max_depth parameter seems to be the most influential, as too few levels can severely limit the ability of the tree.
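A sketch of that parameter sweep, again with synthetic data standing in for the census features:
from itertools import product

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Try every combination of the three parameters discussed above.
for criterion, splitter, max_depth in product(("gini", "entropy"), ("best", "random"), (None, 3)):
    tree = DecisionTreeClassifier(criterion=criterion, splitter=splitter, max_depth=max_depth)
    tree.fit(X_train, y_train)
    print(criterion, splitter, max_depth)
    print(classification_report(y_test, tree.predict(X_test)))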
After that exploration into decision trees, I explored neural networks in scikit-learn, specifically multi-layer perceptrons as described in the link below.
https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier
MLPs are even more complicated than decision trees, so I have only explored the activation and learning_rate parameters. Activation can be "identity", "logistic", "tanh", or "relu" and refers to the function used in each neuron to transform the value produced by the rest of the neuron. Learning_rate can be "constant", "invscaling", or "adaptive" and refers to how the rate at which the neural network updates its weights changes during training. In the code below, I simply make classifiers with these parameters, train them with .fit(), and print the quality of each classifier.
The quality of the classifiers is shown below. Changing these parameters does not yield significant differences for predicting labels in this dataset. Much like the decision trees shown before, the MLP classifiers are not as sensitive to these changes as the dummy classifiers were.
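A sketch of that sweep, with synthetic data in place of the census features; note that in scikit-learn the learning_rate setting only takes effect when solver='sgd', which may partly explain why it made little difference here:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

for activation in ("identity", "logistic", "tanh", "relu"):
    for learning_rate in ("constant", "invscaling", "adaptive"):
        # learning_rate is only used by the 'sgd' solver; with the default solver it is ignored.
        mlp = MLPClassifier(activation=activation, learning_rate=learning_rate, max_iter=500)
        mlp.fit(X_train, y_train)
        print(activation, learning_rate)
        print(classification_report(y_test, mlp.predict(X_test)))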
From exploring both decision trees and MLPs, I have found that changing a single parameter for these complicated classifiers is not very useful. My next step should be carefully identifying multiple parameters to change at once so I can make more classifiers that are more diverse and show different results in the charts like those shown above.
March 24, 2019
Today I went back to the digits dataset and began exploring more classifiers with it. For the most part, I used 70% of the dataset for training and 30% for testing the accuracy of the classifiers. As a reminder, the digits dataset is 1797 images, each an 8x8 pixel square containing a handwritten numerical digit. The purpose of applying machine learning to this dataset is to accurately decipher which number each handwritten digit is.
From what I learned of scikit-learn exploring census data, I was able to easily make a basic decision tree classifier with the code below. I loaded the digits dataset using the built in load_digits() function in the datasets module.
The code above prints several details about the digits dataset and finally prints a report of how well the decision tree predicts the test set of the digits dataset.
In the report, the individual precision and recall scores can be seen for each of the 10 different digits. The code for the decision tree only had to be slightly changed to work with a multi-layer perceptron; this code is below.
The classification report for the multi-layer perceptron is shown below.
The report shows that a basic multi-layer perceptron from scikit-learn performs better than a basic decision tree for this digits dataset.
Further exploration into the digits dataset may be helpful in preparing to use classifiers with IceCube data, since the IceCube data entries are essentially three-dimensional images while the digits data entries are two-dimensional images.
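The pattern described above can be sketched like this: a 70/30 split of the digits data, with the same training and reporting code run for both a decision tree and a multi-layer perceptron (the parameters here are defaults, not necessarily the ones I used):
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report

digits = load_digits()     # 1797 8x8 images, flattened to 64 features each

X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.3, random_state=0)   # 70% train / 30% test

for clf in (DecisionTreeClassifier(), MLPClassifier(max_iter=500)):
    clf.fit(X_train, y_train)
    print(clf.__class__.__name__)
    print(classification_report(y_test, clf.predict(X_test)))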
March 26, 2019
Today Professor Blaufuss sent Rebecca and me IceCube data. The data was pickled, meaning that in order to read it, it has to be unpickled. A basic snippet of code from Professor Blaufuss is able to read the file and print the entries within.
Inside open("coinc.pkl", "r"), the "r" was originally "rb". The "rb" produced the error "could not convert string to float", but this was easily fixed by changing "rb" to "r". I am not exactly sure why this works, but it has something to do with my using Windows on my machine rather than Linux, which Professor Blaufuss uses.
March 29, 2019
Today we met with Professor Blaufuss. We discussed the IceCube data that we will receive soon. This data is essentially the data from above, but much larger. We also got to see some visualizations of both notable neutrino events and insignificant events, like those that will be fed to our machine learning model. These images are below.
Actual notable event examples
Insignificant neutrino events
The arrow is an approximation of the path made by the neutrino. The size of the spheres refers to the magnitude of the charge detected by the DOMs, while the color represents the time that the charge is detected (red is earlier, blue is later).
April 1, 2019
Today I was able to download the data Professor Blaufuss discussed with us at our last meeting. I was easily able to read the data using the code below.
The code is almost identical to the code used to read the sample data from last week; only the file name has changed. While this code was functional, the size of the downloaded files was large enough to severely slow down Enthought Canopy. This shows that I can access the data. The next step is to actually use this data in machine learning.
April 4, 2019
Today I made a little more progress in making the IceCube data usable for machine learning. The data is sent to us in a .pkl format, meaning it has been pickled. In Python, pickling is the process of turning a hierarchical data structure into a stream of characters that can be written to a file. In our instance, the original data is a Python dictionary which has been pickled and which we then need to unpickle in order to access. Unpickling is simply the reverse of the pickling process, and it is very easy to do, as the following simple code shows.
In this instance, the data is stored in the "coinc_test.pkl" file. The code above reads the file and then unpickles it, turning it back into a Python dictionary.
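A minimal sketch of that unpickling step (on this Windows/Python 2 setup the file had to be opened with mode "r", as noted earlier; "rb" is the more usual choice elsewhere):
import pickle

# Read the pickled file and turn it back into the original Python dictionary.
with open("coinc_test.pkl", "r") as f:
    data = pickle.load(f)

print(type(data))   # should report a dict
print(len(data))    # how many entries the dictionary holds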
As far as data structures go, Rebecca and I believe we want to convert the python dictionary into a Pandas data frame. We will attempt this next based on information from the webpage below.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.from_dict.html
April 8, 2019
Today Rebecca and I dove deeper into constructing a Pandas dataframe from our Python dictionary. We found that pandas' built-in functions were not capable of making the dataframe exactly how we wanted it. So we wrote our own code that creates a list in which each element is itself a list. These internal lists are made up of 5161 elements. The first 5160 elements are each either 1 or 0; a 1 indicates that the corresponding IceCube DOM was hit during the event (the first element of the list corresponds to the first DOM, and so on). The last element is a boolean that indicates whether or not the event was a coincident event; in the instance below these are all True.
Once the list of lists is created, it can be made directly into a Pandas Dataframe using pd.DataFrame(). This dataframe is partially visible below.
The dataframe above can now be used for machine learning. Each row of the dataframe represents a single event, while each column represents a feature. In this case most of the features are simply whether or not a certain DOM was hit, but the last column is our target and indicates whether or not the event was a coincident event.
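A runnable sketch of that construction, with a tiny stand-in for the unpickled dictionary (the real data has more structure than a plain list of hit-DOM indices):
import pandas as pd

N_DOMS = 5160   # number of DOMs in the detector

# Stand-in for the unpickled data: each event here is just the list of DOM indices that were hit.
events = [[3, 17, 2045], [88, 89, 90, 4111]]

rows = []
for hit_doms in events:
    row = [0] * N_DOMS
    for dom_index in hit_doms:
        row[dom_index] = 1          # 1 means this DOM was hit during the event
    row.append(True)                # last element: coincident event or not (all True in this file)
    rows.append(row)

columns = ["dom_%d" % i for i in range(N_DOMS)] + ["coincident"]
df = pd.DataFrame(rows, columns=columns)
print(df.shape)                     # (number of events, 5161)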
April 10, 2019
Today Rebecca and I wanted to ensure that the dataframe we made before could be saved and read so that it would not need to be generated every single time the dataframe was needed (which is almost constantly). We did this using pickle via the code below. We did this with each of the datasets nearly identically, simply swapping the file to be the appropriate file for the dataset.
The code above saves the dataframe, which can then be read easily as is done below for each of the datasets.
The code above also includes print statements to print the dataframes being pulled from the pickle files. This is done to verify that the dataframes are indeed the dataframes we had before they were saved. Below is a small excerpt from the printed dataframes.
The dataframes are pretty large, so I have only included a small portion above. The dataframes read back in are the same as they were before being saved, meaning this saving process is valid and we will continue to use it.
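The save-and-reload step can be sketched with pickle like this (pandas' own to_pickle/read_pickle would work just as well); the file name and the tiny stand-in dataframe are illustrative:
import pickle
import pandas as pd

df = pd.DataFrame({"dom_0": [0, 1], "coincident": [True, True]})   # stand-in dataframe

# Save the dataframe once...
with open("coinc_frame.pkl", "wb") as f:
    pickle.dump(df, f)

# ...then read it back whenever it is needed.
with open("coinc_frame.pkl", "rb") as f:
    df = pickle.load(f)

print(df)   # quick check that the round trip preserved the dataframe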
April 15, 2019
Today Rebecca and I started to apply machine learning to the dataframe we just made. This involved several steps, which are in the code below. Essentially these steps are making the classifier, training the classifier, and finally using the classifier to make predictions. There are other steps between these larger ones, including retrieving the target values from the dataframe as well as the dataframe without those labels.
This code should allow for machine learning; we simply stopped here due to time. Soon we will move on from this to see if the dummy classifier worked and whether we can get a neural network to work as well.
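A sketch of those steps on the dataframe built earlier; the train/test split and the file name are just for illustration:
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import pandas as pd

df = pd.read_pickle("coinc_frame.pkl")        # the saved dataframe from before (illustrative file name)

# Separate the features (the DOM columns) from the target (the 'coincident' column).
X = df.drop(columns=["coincident"])
y = df["coincident"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

clf = DummyClassifier(strategy="stratified")  # make the classifier
clf.fit(X_train, y_train)                     # train it
print(classification_report(y_test, clf.predict(X_test)))   # judge its predictions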
April 16, 2019
Rebecca and I now tried to see if the dummy classifier worked before moving on to a neural network. The code below essentially uses scikit-learn's classification report functionality to judge the quality of our machine learning model.
The code above runs without error and produces the classification reports below.
Fortunately, the code is able to produce results, proving the machine learning models are working. Unfortunately, these models are clearly not very good. This is to be expected for the dummy classifier, but that is included merely for comparison. The MLP neural network is slightly better than the dummy classifier. This is good as it means that our data is able to teach a classifier something. Hopefully we can make significant improvements to this classifier or perhaps a decision tree classifier. Our next step is to dive deeper into making classifiers for our data much better.
April 19, 2019
Today we restructured our code so that we could generate larger Pandas dataframes. This meant generating the dataframe iteratively rather than keeping all of the data in memory at once. This was done simply by putting more code into our preexisting while loop that goes through the original pickle files. The code below works exactly the same way as the code from before, except the while loop means that the memory is only looking at one IceCube event at a time.
This code makes the necessary list that is fed into a dataframe constructor. This allowed us to create a dataframe with more than 40,000 events, each with 5160 features.
April 23, 2019
After using the code from April 19 to construct larger dataframes for the single events, we were able to run the code through the scikit-learn classifiers (dummy and neural network).
The classification report below is with the adjusted single test set of size 20000.
The classification report below has the same adjusted single test set from the previous screenshot, but with an adjusted single training set size of 5000.
It is clear that due to the large amount of single events compared to coincident events, the precision and recall for coincident events is extremely poor. Rebecca and I will work to refine our data sets and our classifiers to improve on these numbers.
April 26, 2019
Today Rebecca and I began wrapping up our research. So far we had a dummy classifier and a neural network, each trained with both a smaller set of data and a larger set of data. First we added a built-in decision tree classifier, then we generated dataframes that utilize the time values of the detected hits. We did this by replacing the 1's in the sparse matrix with the time value of when the corresponding DOM was hit. The 0's were also replaced with -1, because -1 is an unreachable time value for a hit DOM while 0 is not.
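The change to the row-building step can be sketched as follows; the hit list and its time values are purely illustrative:
N_DOMS = 5160
UNHIT = -1.0    # -1 is not a possible hit time, so it safely marks DOMs that saw nothing

# Illustrative event: (DOM index, hit time) pairs instead of just the indices that were hit.
hits = [(12, 9800.0), (13, 9812.5), (407, 10050.0)]
is_coincident = True

row = [UNHIT] * N_DOMS
for dom_index, hit_time in hits:
    row[dom_index] = hit_time       # the time value replaces the old 0/1 flag
row.append(is_coincident)           # the target column stays the same

print(row[10:15])                   # [-1.0, -1.0, 9800.0, 9812.5, -1.0]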
Below are the 4 sets of our final results thus far. We have yet to fully analyze this information.
This first set is the latest results we obtained. It represents training and testing with the most data. The data includes time values as well.
This second set of results is similar to the first, except the data is in the form of a sparse matrix of 0s and 1s, as we originally designed it.
This third set of results is also similar to the first set of results, except it is trained off of less data (The original amount we trained with).
This final set of results is similar to the third set, except it does not utilize the time values (like the second set of results).
Soon Rebecca and I will take a closer look at these results, analyze them, and draw final conclusions.
HONR268N
September 13, 2018
Today I ran through some simple commands that I have listed here and given simple explanations.
[cms-opendata@localhost ~]$ echo $0
bash
The echo $0 command prints the name of your shell; in this case it is bash.
[cms-opendata@localhost ~]$ echo $SHELL
/bin/bash
The echo $SHELL command prints the path where the default shell (bash) is located.
[cms-opendata@localhost ~]$ pwd
/home/cms-opendata
The pwd command shows the directory you are in.
[cms-opendata@localhost ~]$ ls
CMSSW_5_3_32 Desktop Downloads
The ls command lists the files and directories within the current directory.
[cms-opendata@localhost ~]$ ls -l
total 12
drwxr-xr-x 17 cms-opendata cms-opendata 4096 Aug 31 13:45 CMSSW_5_3_32
drwxr-xr-x 3 cms-opendata cms-opendata 4096 Sep 6 12:03 Desktop
drwx------ 2 cms-opendata cms-opendata 4096 Sep 6 12:10 Downloads
The ls command with the -l option lists the files and directories within the current directory with more information such as file size.
[cms-opendata@localhost ~]$ ls -l -h
total 12K
drwxr-xr-x 17 cms-opendata cms-opendata 4.0K Aug 31 13:45 CMSSW_5_3_32
drwxr-xr-x 3 cms-opendata cms-opendata 4.0K Sep 6 12:03 Desktop
drwx------ 2 cms-opendata cms-opendata 4.0K Sep 6 12:10 Downloads
The ls command with the -l and -h options does exactly what the prior command does, but with file sizes in a more readable format.
[cms-opendata@localhost ~]$ mkdir new_directory
The mkdir command makes a new directory with the name of whatever text comes after the command.
[cms-opendata@localhost ~]$ ls
CMSSW_5_3_32 Desktop Downloads new_directory
Performing the ls command at this step shows that the new directory has been created within the directory we are located in.
[cms-opendata@localhost ~]$ cd new_directory
The cd command changes the current directory, here I move to the new directory "new_directory".
[cms-opendata@localhost new_directory]$ pwd
/home/cms-opendata/new_directory
This use of the pwd command shows that I am in the new directory.
[cms-opendata@localhost new_directory]$ ls
Now when I use ls there is no visible output because nothing is inside the directory.
[cms-opendata@localhost new_directory]$ ls ..
CMSSW_5_3_32 Desktop Downloads new_directory
Using ".." refers to the directory that the current directory is within, essentially ".." refers to the directory one level broader.
[cms-opendata@localhost new_directory]$ cd ..
Using ".." with cd changes the directory to the directory one level up.
[cms-opendata@localhost ~]$ pwd
/home/cms-opendata
Using pwd now shows that the directory has in fact changed to the prior directory.
[cms-opendata@localhost ~]$ ls
CMSSW_5_3_32 Desktop Downloads new_directory
ls here shows that the new directory is still present.
[cms-opendata@localhost ~]$ rmdir new_directory
The rmdir command removes an empty directory as is done here.
[cms-opendata@localhost ~]$ ls
CMSSW_5_3_32 Desktop Downloads
Using ls again here shows that the new directory has been removed.
[cms-opendata@localhost ~]$ ls -CFx /usr/bin
ls here lists everything in the /usr/bin directory; the -CFx options arrange the output in columns, append indicators showing each entry's type, and sort the entries across rather than down.
[cms-opendata@localhost ~]$ ls -l /bin
ls here shows everything in the bin directory.
[cms-opendata@localhost ~]$ find /usr -name "*g++*" -print
This command finds and prints the files within the /usr directory whose names contain "g++".
[cms-opendata@localhost ~]$ which g++
/usr/bin/g++
The which command here finds where g++ is and outputs the directory.
[cms-opendata@localhost ~]$ ls /home/cms-opendata/
CMSSW_5_3_32 Desktop Downloads
ls here shows the directories in the home directory.
[cms-opendata@localhost ~]$ whoami
cms-opendata
The whoami command simply shows who the user is.
Today I also started learning about the online 3D event display.
The yellow curves are low-energy tracks, green tracks may indicate electrons, and dashed purple tracks indicate missing energy, often due to neutrinos. Clicking a track gives more information about what that track indicates, while shift-clicking allows comparison between two tracks.
September 14, 2018
Today I continued trying out commands in linux.
[cms-opendata@localhost ~]$ cd ~
[cms-opendata@localhost ~]$ mkdir TEST
[cms-opendata@localhost ~]$ cd TEST/
[cms-opendata@localhost TEST]$ touch file1 file2 test1 test2
[cms-opendata@localhost TEST]$ ls
file1 file2 test1 test2
[cms-opendata@localhost TEST]$ find . -name "test*"
./test2
./test1
[cms-opendata@localhost TEST]$ find . -name "file*"
./file1
./file2
[cms-opendata@localhost TEST]$ find . -name "*1"
./file1
./test1
[cms-opendata@localhost TEST]$ find . -name "*2"
./test2
./file2
[cms-opendata@localhost TEST]$ rm file1 file2 test1 test2
[cms-opendata@localhost TEST]$ ls
The touch command in bash seems to create new empty files named after the arguments of the command. We include the ‘.’ after the find commands so that ‘find’ searches through the current directory, because ‘.’ refers to the current directory.
The ‘*’ character serves the purpose of representing any string, hence the term wildcard. In the instances of “test*”, “file*”, “*1”, and “*2”, the placements of ‘*’ represent the possibility of any string being in its place; for example the find command, with “test*” as an input, returns test1 and test2.
The ‘rm’ command deletes files. The input of ‘rm’ is the name of the file to be deleted. This is different from ‘rmdir’ because ‘rmdir’ deletes empty directories exclusively, while ‘rm’ deletes files (and, with the -r option, directories that are not empty).
The command ‘rm *’ should never be run carelessly, because ‘rm’ deletes whatever its input is. If ‘*’ is used as the input to ‘rm’ then it will delete every file in the current directory, because ‘*’ can represent any string.
[cms-opendata@localhost TEST]$ echo "test"
test
[cms-opendata@localhost TEST]$ echo "test1" > log.txt
[cms-opendata@localhost TEST]$ ls
log.txt
[cms-opendata@localhost TEST]$ cat log.txt
test1
[cms-opendata@localhost TEST]$ echo "test2" > log.txt
[cms-opendata@localhost TEST]$ cat log.txt
test2
[cms-opendata@localhost TEST]$ echo "test3" >> log.txt
[cms-opendata@localhost TEST]$ cat log.txt
test2
test3
The ‘>’ operator takes the output of the prior command and puts it into the file after the ‘>’. If something is already in the file after the ‘>’ operator, then it will be replaced by what is being put in by the ‘>’ operation. The ‘>>’ operator is similar in that it also places the output of a prior command into a file. The difference between ‘>’ and ‘>>’ is that ‘>>’ adds onto the file instead of replacing the contents of the file.
A large danger in using the ‘>’ operator is that it removes whatever is in the target file. It would be easy to intend to use the ‘>>’ operator but accidentally use the ‘>’ operator and remove vital contents within a file.
[cms-opendata@localhost TEST]$ ls -l > log.txt
[cms-opendata@localhost TEST]$ cat log.txt
total 0
-rw-rw-r-- 1 cms-opendata cms-opendata 0 Sep 14 14:20 log.txt
[cms-opendata@localhost TEST]$ ls -l
total 4
-rw-rw-r-- 1 cms-opendata cms-opendata 70 Sep 14 14:20 log.txt
It would be expected that the directories and files (as well as information about them; this is the output of ls -l) within the current directory would be listed as a result of "ls -l > log.txt" and "cat log.txt". This is because “ls -l > log.txt” redirects the output of “ls -l” into log.txt and “cat log.txt” displays what is contained in log.txt, thus it displays the output of “ls -l > log.txt”. This is proven true in the commands above.
I did some research and found some useful linux commands that I then demonstrated within the terminal.
uname
The “uname” command displays information about the system. If the command is not given any options then it only displays the kernel name, but with the -a option, the “uname” command displays other information about the system (such as the network node hostname and the machine hardware name). The command run with no options and with the -a option is shown below.
history
The command “history” shows a list of the commands that have been executed within the terminal. The commands are numbered, and their output is not shown. Below is an example of “history” being run; it can be seen from the numbered commands that the output of the “history” command is larger than the terminal window in this case.
file
The command “file” takes a file as a target and returns the type of data within that file. As can be seen below, some of the data types that “file” can identify are directories, ASCII text, and empty files.
head/tail
The command “head” takes a file as an input and an option “-n” (where n is a natural number) and prints the first n lines of the file. In a similar fashion, the command “tail” takes the same inputs and options but prints the last n lines of the file instead of the first n lines. The screenshots below show “head” and “tail” being used on a large file containing the output of a run of the “history” command.
whatis
The command “whatis” accepts a command as an input and returns a (usually simple and short) definition of the command. Below are a few examples of “whatis” outputting definitions of commands, including an instance of it defining itself.
cp
The command “cp” can accept inputs wherein the first input is the name of the file to be copied while the second input becomes a new file that is a copy of the first file. In the instance below, file1 is copied onto another file called file2. It is demonstrated below that file2 did not exist until after the “cp” command.
whereis
The command “whereis” locates the source of a command it takes as an input. Similar to the “whatis” command, the “whereis” command can take itself as an input; this is demonstrated below.
alias
The command “alias” allows a command to be attached to a name that can then be called to run the attached command. This alias can be removed by the “unalias” command. This can save time as frequently used commands can be replaced by single characters, such as below where “cd ..” is attached to “x” using “alias”.
paste
The command “paste” combines the contents of multiple files as an output with multiple lines wherein each column consists of the contents of one of the target files of “paste”. Below I have used the “>” operator to put the output of “paste” into another file which then has its contents displayed using “cat”.
clear
The command “clear” seemingly empties the terminal of prior commands. If you scroll up the prior commands can still be viewed, so essentially “clear” just creates enough empty space to make the terminal appear empty without scrolling up. Below is a before and after demonstrating the effect that “clear” has on the terminal window.
Today I also learned how to access a text editor within the terminal called "emacs". The commands below brought me to the editor and allowed me to read the ASCII file I created.
[cms-opendata@localhost TEST]$ cd ~/TEST
[cms-opendata@localhost TEST]$ emacs -nw test.txt
[cms-opendata@localhost TEST]$
[cms-opendata@localhost TEST]$ ls
HWDIR file1 file2 file3 historyFile test.txt
[cms-opendata@localhost TEST]$ emacs -nw test.txt
[cms-opendata@localhost TEST]$ cat test.txt
Hello World, Hello emacs.
testing testing 1 2
[cms-opendata@localhost TEST]$ emacs -nw test.txt
[cms-opendata@localhost TEST]$
Many of the commands here (cd, ls, and cat) have been used frequently in earlier sequences so their use here is clear: cd simply changes the directory, ls shows the names of the files within the current directory, and cat outputs the contents of a file. The new command here is "emacs", which is used twice with the option "-nw", meaning emacs opens up within the terminal and not in a new window (nw). Within emacs I created an ASCII file; the emacs editor and the text can be seen below.
September 16, 2018
Today I continued using emacs and learning more about the different tools it provides.
In order to exit the emacs editor, hold control and press x then c. A prompt should appear asking if you would like to save; typing y saves the file and exits the emacs editor. Something important to note about using emacs to write a lot of text is that pressing enter starts a new line, but if you do not press enter then emacs will attempt to save everything on a single line, which can result in unwanted effects. Below is a comparison between a paragraph that was written without mindful use of new lines and a paragraph written using new lines to make sentences easier to read.
Poor Formatting
If the use of lines is not considered, words can split apart when viewing the text and make the text difficult to interpret.
Better Formatting
When you are aware of which lines words appear on, you can ensure that words do not get split apart as they were above. This makes the text much easier to read.
September 23, 2018
Today I experimented with using shell scripts to automate simple series of commands. In order to do this, variables must be created that stand in for the inputs of the shell script. These variables can then be used in commands written with the variables as inputs. The following is that first shell script.
This script is meant to search for files and directories, with the search term being the argument written in place of <argument>, and print the results. In the script, a variable is created and used within commands, but when the script is run the variable is replaced with whatever string is passed into the script. The comments have no effect on what the script does, but they do help to explain what the script does to anyone reading it.
However, the command does not seem to run correctly as it returns "Permission denied". This is demonstrated below as well as the technique used to solve this issue.
The chmod command changes the permissions that determine which users can execute or access certain files. This is clear in that at first the test script returned "Permission denied", but after using chmod the script runs correctly.
As an expansion to this last script, I made a similar script that does the same thing except it searches within the home directory and instead of printing the results to the terminal, the results are logged to a new file named after the argument. Below is the script within emacs.
Additionally here is an example of the script being used and printing the contents of the file created by the script containing the results of the find command within the script.
The output of this script, when the script is simply run by itself, is just the argument put into the script, because of the debugging statement I left in the script. However, behind the scenes the script is finding the files that have the argument within their names and putting that suppressed output into a file created with the name homesearch_<input>.txt. As in the above image, using the cat command on the file created by the script reveals that the script logged the names of two files with the argument in the name, one being a file made prior to the script running and the other being the file created by the script itself.
I also tried something called piping in Linux with a very basic example. In Linux, piping is done with | and takes the output of one command and supplies it as the input to another command. The command that produces the output comes before the | and the command taking that output as an input comes after the |. This can be used to create a “pipeline” of commands where several commands or programs are linked together, with the input of each being the output of the prior command. Below is a very simple example of using piping to channel the output of the history command to the less command so that the history can be moved through easily with the space bar.
Then I used sed -i -e 's/one/two/g' test.txt as a model to create a script that accepts two inputs, $THIS and $THAT, and replaces all of the $THIS words with $THAT words in test.txt. Below is that basic script.
If you use the script like so:
All of the occurrences of "was" will change into "is" in test.txt, so the following text changes from:
into this:
So, this script found all of the "was" words in test.txt and replaced them with "is". By changing the inputs, any word could be replaced with any other word.
October 2, 2018
The following work through October 9, 2018 is part of HW4.
Today I started using C++ in scripts I made using emacs. The following is a script (main.cpp) I made with C++ that outputs "Hello World!" to the terminal:
The g++ command compiles this and tells the compiler that this is C++ code. The ./a.out command runs the compiled program and produces the correct output:
Note: ./a.out runs the most recently compiled C++ executable (a.out is the compiler's default output name).
Using the find command we can find what versions of C++ are contained in my CERN Virtual Machine:
We can use g++ -dumpspecs and search for "*version:" to see we are using version 4.7.2
Header files declare sets of operations that are relevant for specific goals. Here I use ls and more to explore two of these header files:
October 5, 2018
I continued on working on scripts with C++.
The following is a script and its results demonstrating different types of variables within C++, specifically the int and double variable types.
As can be seen above, an 'int' variable stores a value that is purely an integer, while a 'double' variable stores a value with decimal places. Even when an int variable is assigned the result of an int times a double, the value is stored (truncated) as an integer.
The following is a script and its results demonstrating the ++ and -- operations.
As can be seen above, the ++ operation is a quick way to redefine a variable as its old value plus one, thus quickly adding 1 to the value of a variable. The -- operation is similar in that it is a quick way to redefine a variable as its old value minus one, thus quickly subtracting 1 from the value of a variable.
The following is a script and its results demonstrating the bool variable type in C++.
As can be seen above, the bool variable can only take 2 values, 1 and 0. These values correspond to true and false (1 is true and 0 is false). This is because bool is a boolean, meaning either true or false. This means that in C++ a bool can be assigned to a logic statement, and the bool variable will evaluate to either 1 or 0 based on the truth or falsehood of that logic statement.
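A small sketch illustrating the bool behavior described above (the logic statements are my own examples):

#include <iostream>

int main() {
    bool b = (3 > 2);             // the logic statement is true, so b holds 1
    std::cout << b << std::endl;  // prints 1
    b = (3 < 2);                  // the logic statement is false, so b holds 0
    std::cout << b << std::endl;  // prints 0
    return 0;
}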
The following is a script and its results demonstrating how while loops work in C++.
As can be seen above, while loops are essentially sections of code that are run through several times; the number of times the code is iterated depends on a logic statement that must be true for the while loop to run. This typically means that the logic statement must depend on a variable that is altered within the while loop, so that the while loop will have a definite end and will not run forever. In the above example the while loop depends on the variable n being more than 0; n starts at 10, n decreases by 1 each iteration of the while loop, and the value of n is printed each time the code within the while loop is run. This is why the results show n going from 10 to 1.
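A sketch of a while loop matching the description above (counting n from 10 down to 1):

#include <iostream>

int main() {
    int n = 10;
    while (n > 0) {                   // the loop keeps running while this is true
        std::cout << n << std::endl;  // print the current value of n
        n--;                          // change n so the loop eventually ends
    }
    return 0;
}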
The following is a script and its results demonstrating how for loops can be used in place of while loops. In fact the following for loop does the same thing as the while loop above.
As can be seen from the results of the two scripts above, despite swapping the while loop for a for loop, both scripts produce the same results. The first line of the for loop defines the variable that the loop depends on, gives the logic statement that must be satisfied to keep the loop going, and states how the variable changes in value after every iteration. In the above example, the loop variable starts at 10 and decreases by one every iteration, and the loop ends once the variable is no longer above 0. This makes the for loop equivalent to a while loop that uses a variable starting at 10, runs so long as the variable is positive, and decreases the variable by one within the loop body.
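A sketch of the equivalent for loop, assuming the same 10-to-1 countdown as above:

#include <iostream>

int main() {
    // start at 10, keep looping while n > 0, decrease n after each iteration
    for (int n = 10; n > 0; n--) {
        std::cout << n << std::endl;
    }
    return 0;
}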
The following is a script and its results that uses a while loop within another while loop.
By using a while loop within a while loop, the inner loop is run to completion, possibly over several iterations, each time the outer while loop goes through a single iteration. This is demonstrated in the results of the script, as the list for n adds a digit each row. Each row corresponds to an iteration of the outer loop and each digit to an iteration of the inner loop. This shows that as the outer loop goes through later iterations, the inner loop is run more and more times. This is useful for more complex computations where something is repeated more and more over time.
The following is a script and its results wherein the same results as the above script will be printed, but the script will use for loops instead of while loops.
As seen above, this script produces output identical to the script above it, but it plainly uses for loops instead of while loops. Just as before, each row is an iteration of the outer slow loop while each digit is an iteration of the inner fast loop. The key here is that for loops contain all of the necessary looping information in the header, so the body of the loop is just what the loop is intended to accomplish, while while loops require some of the looping information to sit within the body of the loop, or even before the loop.
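A sketch of nested loops producing the kind of output described above, where each row gains a digit (written with for loops; the equivalent nested while loops behave identically):

#include <iostream>

int main() {
    // outer "slow" loop: one row per iteration
    for (int outer = 1; outer <= 5; outer++) {
        // inner "fast" loop: runs more times on each later row
        for (int n = 1; n <= outer; n++) {
            std::cout << n << " ";
        }
        std::cout << std::endl;
    }
    return 0;
}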
October 14, 2018
Today I worked primarily on logic statements and pointers in C++.
The following script demonstrates how logic statements work together with while loops and if statements. In C++ a while loop includes a condition (a logic statement) that determines whether the code within the loop will run; so long as the condition is satisfied, the commands within the loop are run. If statements also use logic statements as conditionals. An if statement has a block of commands that runs if the condition is met and a block of code (within the else statement) that runs if the condition is not met.
The above code demonstrates an if statement as well as a while loop. First of all, the while loop condition is satisfied since n = 10, which is greater than or equal to 10, so the code within the loop runs. Note: the code within the while loop ends with return 0;, so the while loop will actually only run once. Within the while loop is an if statement that is satisfied so long as n is greater than 5, which is true since n = 10, so the cout in the if statement runs and the cout statement in the else branch is ignored. These results are made clear in the output of the script.
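A sketch of what such a script might look like, based on the description above (n = 10, a while loop that only runs once because of the return 0;, and an if/else on n > 5; the printed messages are my own):

#include <iostream>

int main() {
    int n = 10;
    while (n >= 10) {                 // satisfied since n = 10
        if (n > 5) {
            std::cout << "n is greater than 5" << std::endl;  // this branch runs
        } else {
            std::cout << "n is 5 or less" << std::endl;       // this branch is ignored
        }
        return 0;                     // ends the program, so the loop runs only once
    }
    return 0;
}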
The below script is meant to demonstrate how pointers function in C++.
This program demonstrates that variables are just names given to values stored at certain "addresses". The pointer variable is a way to save the address of a value in the form of a variable. This is done by adding * to the variable declaration (e.g. int, double). In this script we see that the variable p stores the address of the variable i using the & operator.
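A minimal sketch of the pointer idea described above (variable names follow the description; the values are my own):

#include <iostream>

int main() {
    int i = 7;
    int* p = &i;   // the & operator gives the address of i, which p stores
    std::cout << "value of i: " << i << std::endl;
    std::cout << "address stored in p: " << p << std::endl;
    std::cout << "value at that address: " << *p << std::endl;  // dereferencing p gives 7
    return 0;
}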
The following script does not use pointers but will be used to demonstrate a significant characteristic of pointers
This script simply shows that the two variables i and j do not affect each other's values even though j was initially defined as the value stored by i. In other words, j was given the value of i at first, but afterwards neither i nor j changes to match the other.
The script below, in comparison to the script above, shows how variables and corresponding pointers relate to one another dynamically.
The above script proves that a variable and its corresponding pointer variable remain connected to each other even when the value is changed. This means that the value of a variable i and the value at the address pointed to by pointer variable p will be the same even when one is changed. This is unlike two variables where one is defined in terms of the other.
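A small sketch contrasting the last two scripts, assuming the variable names i, j, and p from the descriptions above:

#include <iostream>

int main() {
    int i = 5;
    int j = i;      // j copies the value of i; afterwards they are independent
    int* p = &i;    // p stores the address of i itself

    i = 99;         // change i
    std::cout << j << std::endl;   // still 5: the copy did not follow the change
    std::cout << *p << std::endl;  // 99: the pointer still sees i's current value
    return 0;
}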
The below script demonstrates how to create a pointer variable without using a variable. This means creating a pointer variable that points to the address of a value that is not assigned to a variable.
The above script shows that using the 'new' construct allows a pointer variable to be created that points to an address of a specified value even though the value has not been assigned to a variable. The script shows that new int(5) can be used to create a pointer variable pointing to the address of 5. This value can even be changed without the address being changed.
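A sketch of the 'new' construct described above (the value 5 is taken from the description; the rest is my own):

#include <iostream>

int main() {
    int* p = new int(5);           // allocate an unnamed int holding 5; p stores its address
    std::cout << *p << std::endl;  // prints 5
    *p = 8;                        // change the value at that address
    std::cout << *p << std::endl;  // prints 8; the address held by p is unchanged
    delete p;                      // free the memory allocated with new
    return 0;
}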
The below script is a further look into how pointers work in C++. Specifically it shows that a pointer variable pointing to another pointer variable acts identically.
The above script shows how changing the value at the address of one pointer variable also changes the value at the address of any pointer variables pointing to the first pointer variable.
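One way to read the behavior described above is two pointer variables holding the same address, sketched below (variable names and values are my own):

#include <iostream>

int main() {
    int i = 5;
    int* p = &i;
    int* q = p;     // q is given the same address that p holds
    *p = 42;        // change the value at that address through p
    std::cout << *q << std::endl;  // prints 42: q sees the same change
    return 0;
}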
The below script shows examples of if and while logic statements. The code is explained in cout statements.
The script primarily runs through a while loop with the condition n <= 20. Prior to the while loop, n is defined as 1. Inside the while loop is an if statement with the condition n == 13 (== is used to compare two values in C++, since the usual = sign is used to assign variables). The commands in the if statement use cout to state the value of n and add 8 to n. The commands in the else section also state the value of n but add 1 to n rather than 8. This means the script counts up from 1 in increments of 1 until hitting 13, where the cout message changes and 8 is added to n, making n = 21, which causes the while loop to stop since 21 > 20.
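A sketch matching the description above (count up from 1, jump by 8 at n == 13 so the loop ends at 21; the cout wording is my own):

#include <iostream>

int main() {
    int n = 1;
    while (n <= 20) {
        if (n == 13) {
            std::cout << "n hit 13, jumping ahead from " << n << std::endl;
            n = n + 8;   // 13 + 8 = 21, which ends the loop
        } else {
            std::cout << "n is " << n << std::endl;
            n = n + 1;
        }
    }
    return 0;
}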
October 16, 2018
The script below outputs all of the files that were edited on the same day and month that the script was ran. It also outputs the values of the month and day.
The script uses an if statement to determine if the current day is 2 digits because that changes how the date would be processed. So if the day is less than 10 the program does not add a space to the day, but if the day is 10 or more a space is added so that the grep command will properly parse through the files and output the ones made in the same day and month.
The script below is an alteration of the above script that outputs all files edited within the current month rather than the current month and day. It is clearly much simpler than the above script.
Since the script only cares about the current month, the if statement from the prior script could be removed entirely, as could any variables for the day. This leaves the script as just the definition of a variable for the current month (using backticks around the date command), an echo command stating what the current month is, and a use of ls and grep to find and print all files edited within the current month.
October 28, 2018
Today I did a basic study of arrays in C++. The following code shows the nature of one-dimensional and two-dimensional arrays in C++ by creating arrays and iterating through them with while loops.
The output of the basic array script can be seen above. It is clear that array indices start at 0 rather than 1. Two-dimensional arrays are accessed via two indices, one for the "row" and another for the "column".
The script below performs the same commands as the script above but uses more comments to make the actions of the program more clear.
As is clearly seen in the output, the additional comments are only for readability and have no effect on the performance of the actual script.
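For reference, a small sketch of one- and two-dimensional arrays iterated with while loops, in the spirit of the scripts above (the array contents are my own):

#include <iostream>

int main() {
    int a[3] = {10, 20, 30};          // one-dimensional array; indices start at 0
    int i = 0;
    while (i < 3) {
        std::cout << a[i] << std::endl;
        i++;
    }

    int b[2][2] = {{1, 2}, {3, 4}};   // two-dimensional array: [row][column]
    int r = 0;
    while (r < 2) {
        int c = 0;
        while (c < 2) {
            std::cout << b[r][c] << " ";
            c++;
        }
        std::cout << std::endl;
        r++;
    }
    return 0;
}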
The script below is a simple demonstration of how, once <fstream> is included, files can be opened, edited, and closed within a script. The "prefix" myfile can have any name so long as it remains consistent throughout the script.
The output shows that performing this script has put the string "write some junk." into the file "example.txt".
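A minimal sketch of the <fstream> usage described above, assuming std::ofstream is the class being used:

#include <fstream>

int main() {
    std::ofstream myfile;          // "myfile" could be any name, as long as it is consistent
    myfile.open("example.txt");    // open (and create) the file
    myfile << "write some junk.";  // the string that ends up in example.txt
    myfile.close();                // close the file when done
    return 0;
}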
The next three scripts are actually intertwined with one another. The script below is the main script, "main.cpp", which outputs the dot product of two vectors and the scalar multiples of those two vectors with a scalar. The new aspect of this script is that the actual calculation of the dot product and scalar multiplication is done by separate scripts that are included at the top of this main script. This use of multiple scripts keeps the code organized and easier to read. This technique will be useful and necessary as our scripts get larger and larger.
Before looking at the output, let's look at the two other scripts included in the main script.
The first of these scripts is "dotprod.cpp", which is responsible for calculating the dot product of the two vectors given to it. This script is not very complicated, but it makes sense to keep it separate: dot products could be needed in multiple scripts, so having this script around is useful, and the main script is already complex, so splitting this piece off makes it a little cleaner.
The second script is "scalarmult.cpp" and is responsible for calculating the scalar multiple of a given vector with a given scalar. It is useful to make this a separate script for the same reasons as for dotprod.cpp.
The file "vectors.txt" is necessary as it provides the vector and scalar values used by the main script. For reference it is below.
Finally we will compile the main script and see that it works perfectly even with the work divided amongst three scripts.
It is easily verifiable that these are indeed the correct values for the dot product and scalar multiples. Note: it was not necessary to run g++ on any script besides main.cpp.
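Since only the screenshots show the real files, here is a simplified sketch of how such a three-file setup might be structured. Unlike the real main.cpp it hard-codes the vectors instead of reading them from vectors.txt, and the function names are my own guesses:

// dotprod.cpp - returns the dot product of two 3-component vectors
double dotprod(const double a[3], const double b[3]) {
    return a[0]*b[0] + a[1]*b[1] + a[2]*b[2];
}

// scalarmult.cpp - scales a 3-component vector in place by a scalar
void scalarmult(double v[3], double s) {
    for (int i = 0; i < 3; i++) v[i] *= s;
}

// main.cpp - the only file that needs to be passed to g++,
// because it #includes the other two
#include <iostream>
#include "dotprod.cpp"
#include "scalarmult.cpp"

int main() {
    double u[3] = {1, 2, 3};
    double v[3] = {4, 5, 6};
    double s = 2.0;

    std::cout << "dot product: " << dotprod(u, v) << std::endl;  // 1*4 + 2*5 + 3*6 = 32

    scalarmult(u, s);
    scalarmult(v, s);
    std::cout << "scaled u: " << u[0] << " " << u[1] << " " << u[2] << std::endl;
    std::cout << "scaled v: " << v[0] << " " << v[1] << " " << v[2] << std::endl;
    return 0;
}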
The script below makes use of the <math.h> C library in order to perform pseudo-random number generation and display the results on a histogram. Some new aspects of C++ are seen below, including the use of "const" to ensure that a declared variable cannot change value within the script, and the use of "cin" to take the user's input for the value of a variable.
Note the "int&" in the first argument of "getFlatRandom". We will soon observe the effects of removing the & from this argument. But for now let's look at the script and output unchanged.
Also note the seed which is called "inew" in the script and is equal to 2345 at this point. Later on we will change this value to something else and observe the results.
When the script is run, the user is prompted to enter a number of loop iterations. This is the number of values generated and plotted on the histogram. We will use 1000 unless we choose a different number in order to observe the effects this number has on the histogram. In the case of 1000 iterations the histogram looks strikingly uniform.
Now let's remove the & I noted earlier and observe the histogram to find out what the & actually does. The new histogram is below.
It appears that 0 has been selected for each of the 1000 iterations. This is clearly not intended and is the result of a logical error in using int rather than int&. Based on this new histogram, the issue seems to be that when "int" is used rather than "int&", getFlatRandom always produces the same value rather than a random one. This makes sense because "getFlatRandom" itself does not change, so the randomness must come from something changing in its arguments. This is precisely what the "&" in "int&" accomplishes: "int&" tells the function to use the actual variable passed in as an argument rather than a copy of its value. This means that after the function runs, the variable itself may have changed. This is the case in getFlatRandom: "int&" causes the passed-in variable itself to be updated within the function, so when the function is called again with the same argument, the value of that argument differs from the last call, different numbers can be produced by getFlatRandom, and randomness becomes possible. This is why removing the & causes the histogram to show constant selection instead of random selection.
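A small sketch of the difference between int and int& (this is not the real getFlatRandom; the updater and its name are made up purely to show the effect):

#include <iostream>

// with &, the function works on the caller's variable itself
void stepByReference(int& seed) {
    seed = seed * 7 + 3;
}

// without &, the function only gets a copy, so the caller's variable never changes
void stepByValue(int seed) {
    seed = seed * 7 + 3;
}

int main() {
    int a = 1;
    stepByReference(a);
    stepByReference(a);
    std::cout << a << std::endl;  // 73: each call kept updating a

    int b = 1;
    stepByValue(b);
    stepByValue(b);
    std::cout << b << std::endl;  // still 1: the copies were thrown away
    return 0;
}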
Now I will make a slight modification to the script so it outputs the first 10 "random" numbers generated. I did this by simply adding an if statement. The new script is shown below.
The script is mostly unchanged, but the result can be seen below.
I ran this modified code several times; the first three of those runs are below. For the sake of readability I did not include more, but the results never differed from what appears below.
It is clear that the same 10 numbers appear every single time the script is run. This is why the number generation is not random but pseudo-random. The histogram still appears mostly uniform, so this pseudo-random number generation must produce roughly the same count of each value over large sample sizes.
I changed the seed from 2345 to 3457. Once again I ran the script several times and put the first three outputs below, and once again the results never differed from what is seen below.
Now the first 10 numbers are different, but they still remain the same when the script is repeatedly run. Clearly this "pseudo-randomness" is somehow based on the seed, although it is unclear exactly how the seed correlates with the first ten random numbers.
Next we will examine the effects of changing the amount of iterations.
I have run the script three more times, this time with 10, 100, and 100000 iterations respectively.
The histogram gradually appears more uniform with larger numbers of iterations. At 10 iterations the histogram is not uniform whatsoever; at 100 iterations the histogram is nearly uniform but not quite; at 100000 iterations the histogram is completely uniform. This makes the pseudo-randomness appear random when given large sample sizes.
Now let's look at a built-in random number generator (when using <stdio.h>, <stdlib.h>, <time.h>, and <iostream>).
The output above is very simple and shows that the number selected randomly was 83.
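A minimal sketch of what such a built-in generator call might look like (the real script is in the screenshot; the 1-to-100 range here is only my assumption based on the output of 83):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <iostream>

int main() {
    srand(time(NULL));         // seed the built-in generator with the current time
    int r = rand() % 100 + 1;  // pseudo-random integer between 1 and 100
    std::cout << "random number: " << r << std::endl;
    return 0;
}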
The below script is root code called Resolutions.C and is intended to show the importance of resolution in experiments. The script simulates a certain number of collisions between bosons and produces a histogram of the measured mass of the mother particle from the collisions. The integer N below gives the number of bosons.
The file secretparameters.txt below is also necessary.
In order to run resolutions.C, one must open root like so:
Now we will examine the histograms from resolutions.C
Histogram for N=1
From this data alone I would guess that the mass of the mother particle is 76 as that is the only plot on the histogram. Since it is only one piece of data there is no real way to guess how different this is from the true mother mass; it could be completely accurate or completely wrong.
Histogram for N=10
From this data I would guess that the true mother mass is 84 as there are more lines there than anywhere else. Now that we have a range of data from about 80 to 130 (difference of 50), I would estimate that this measured mass could be about 25 off of the real mass (50/2).
Histogram for N=100
From this data I would guess that the true mother mass is 90 since 90 has more plotted on it than any other value. The density of the histogram is greater from around 70 to 110 (difference of 40) so I would say this measured mass is about 20 off from the real mass (40/2).
Histogram for N=1000
From this data I would guess the mother mass is about 95 because most of the data is between 90 and 100. There is a high density of data between 80 and 110 (difference of 30) so I would say the measured mass is about 15 off from the real mass (30/2).
Now we will change the secret parameters as follows and repeat our examination.
Histogram for N=1
From this data alone I would guess that the mass of the mother particle is 19 as that is the only plot on the histogram. Once again since it is only one piece of data there is no real way to guess how different this is from the true mother mass; it could be completely accurate or completely wrong.
Histogram for N=10
From this data I would guess the mother mass is about 5 because most of the data is at 5. The range of data goes from 0 to 30 (difference of 30) so I would guess this measured mass could be 15 off (30/2).
Histogram for N=100
From this data I would guess that the true mother mass is 1 since the highest density seems to surround 1. The density of the histogram is greatest from 0 to 20 (difference of 20) so I would say this measured mass is about 10 off from the real mass (20/2).
Histogram for N=1000
From this data I would guess the mother mass is about 12 because most of the data is between 0 and 25. There is a high density of data between 5 and 20 (difference of 15) so I would say the measured mass is about 7.5 off from the real mass (15/2).
I have learned that as you increase the resolution (by increasing N), more can be said about the reality behind the data, as you get a better idea of the chances that certain data will be produced. In these histograms, as N increases, the expected error on the measured mass decreases. Since quantum mechanics is probabilistic, resolution becomes even more important because probabilities are all that can be calculated.
It is also likely that the first number in secret parameters determines the actual mother mass since 91 is near all of the approximate masses for the first set of histograms and 1 is near all of the approximate masses for the second set of histograms.
November 10, 2018
Today I explored using MadGraph from the terminal. I downloaded MadGraph from http://madgraph.physics.illinois.edu/.
After unzipping the file I was able to start madgraph by doing ./bin/mg5_aMC.
Below is an example of generating a process within madgraph.
Below is how to display processes within madgraph.
Similarly, below is how to display particles.
Even more similarly, below is how to display multiparticles.
Below we add a second process with a W+ decaying leptonically.
By doing "output MY_FIRST_MG5_RUN" then "launch MY_FIRST_MG5_RUN", the results of the processes are calculated within a webpage. The end result of outputting and launching the processes above is below.
The above commands also place a zipped file of the generated events for pp→ ttbar called "unweighted_events.lhe.gz" in MY_FIRST_MG5_RUN/Events/run_01/. The zipped file is below.
The file above can be unzipped using gunzip. In order to convert the lhe file into a root file we will download lhe2root.py and perform the command below.
Now that the lhe file has been converted to a root file, it can be opened in root as shown below.
Then within root we can use TBrowser to view the data generated using madgraph. Below are all of the graphs generated from the above processes using madgraph and viewed through the TBrowser in root.
That is all for my current exploration of madgraph and converting lhe files into root files in order to view the data using TBrowser.
Now I will explore calculating the Z-boson and Higgs boson invariant masses using the TLorentzVector class and a provided file LHC-Higgs-Graviton.tgz.
After the LHC-Higgs-Graviton.tgz file is unzipped, four new root files will appear. Below I open up one of the new root files within root.
Below I perform the HZZ4LeptonsAnalysisReduced->MakeClass("HiggsAnalysis") command.
The command creates a new class called "HiggsAnalysis" that can be used to calculate the invariant masses of the Higgs boson and Z-boson. This also results in new files appearing, "HiggsAnalysis.C" and "HiggsAnalysis.h", that are directly used in calculating the Higgs invariant mass, as will be seen below.
Below is the code I am putting into HiggsAnalysis.C. It creates a histogram for Z-boson invariant mass but does not yet produce a histogram for the mass of Higgs.
I followed the instructions in red from the script above in order to properly run the script within root. Below are the steps: .L HiggsAnalysis.C, HiggsAnalysis t, and t.Loop().
The result of t.Loop() is shown above.
With those steps done, TBrowser can then be used to view the histogram from the script. Below is the current histogram resulting from the script.
The histogram above shows the invariant mass of the Z-boson.
The next steps are done in order to make the script produce histograms for the invariant mass of a second Z-boson and of the resulting Higgs boson, in addition to the previous histogram.
First we will go within the script and create two more TLorentzVectors el3 and el4 on top of the existing el1 and el2 TLorentzVectors. Below is what this looks like.
Then we will add the two new TLorentzVectors el3 and el4 to get a new vector TLorentzVector zCandidate2. Below is what this looks like.
Now we will add zCandidate and zCandidate2 to get a new TLorentzVector Higgs that will be used to calculate the invariant mass of Higgs Boson.
Next we will define a new histogram for the second Z-boson, Z2_ee, and another for the Higgs Boson, H_zz.
Then we will fill each of the new histograms with their proper masses (Z2_ee with the mass of zCandidate2 and H_zz with the mass of Higgs).
Now we will write each of the new histograms with ->Write();.
For reference, the entirety of the script is once again included below but now with all of the new changes.
Finally the script is able to produce a histogram for the invariant mass of the Higgs, along with a histogram for the second Z-boson. These new histograms can be seen below.
Note: The first histogram is still produced by the script but is identical to the histogram shown earlier.
The script now makes proper use of the TLorentzVector class in order to produce a histogram for the invariant mass of Higgs.
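For reference, below is a standalone sketch of the TLorentzVector bookkeeping described above, runnable as a plain root macro. It is not the real HiggsAnalysis.C: the lepton kinematics and histogram binning are made up, and only the names el1-el4, zCandidate, zCandidate2, Higgs, Z2_ee, and H_zz are taken from the steps above.

// masssketch.C - run inside root with: .x masssketch.C
#include "TLorentzVector.h"
#include "TH1F.h"
#include <iostream>

void masssketch() {
    // made-up lepton four-vectors (pt, eta, phi, m); the real values come from the tree branches
    TLorentzVector el1, el2, el3, el4;
    el1.SetPtEtaPhiM(40.0,  0.5,  0.1, 0.000511);
    el2.SetPtEtaPhiM(35.0, -0.3,  2.9, 0.000511);
    el3.SetPtEtaPhiM(25.0,  1.1, -1.5, 0.000511);
    el4.SetPtEtaPhiM(20.0, -0.8,  1.7, 0.000511);

    TLorentzVector zCandidate  = el1 + el2;                  // first Z candidate
    TLorentzVector zCandidate2 = el3 + el4;                  // second Z candidate
    TLorentzVector Higgs       = zCandidate + zCandidate2;   // Higgs candidate

    TH1F* Z2_ee = new TH1F("Z2_ee", "Second Z invariant mass", 100, 0, 200);
    TH1F* H_zz  = new TH1F("H_zz",  "Higgs invariant mass",    100, 0, 300);
    Z2_ee->Fill(zCandidate2.M());                            // .M() returns the invariant mass
    H_zz->Fill(Higgs.M());

    std::cout << "m(Z1) = " << zCandidate.M()
              << ", m(Z2) = " << zCandidate2.M()
              << ", m(H) = "  << Higgs.M() << std::endl;
}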
November 19, 2018
Today I worked on python within Google colaboratory to complete HW9.
Syntax, Variables, and Numbers
The following code examples are practice for syntax, variables and numbers in python.
The code below shows basic variable definitions along with addition and printing.
The code below shows how the area of a circle can be calculated using variables.
The code below shows how temporary variables can be used to switch the values of two variables. Trying to do so without a temp variable will result in both variables having the same value as shown below in method 1.
Below is another example of using a temporary variable to switch the values of variables, but this time between three variables.
Functions
The following code examples show using functions within python
The code below shows how to create a function that rounds a number to two decimal places using round
The code below shows how to create a function similar to that above, except it accepts a number argument that determines how many decimal places the number is rounded to.
Built-in functions min(), max(), sort(), sum(), len()
The code examples below show how to use built-in functions within python such as min(), max(), sort(), sum(), and len()
The example below has examples of using min(), max(), sort(), sum(), and len() on defined variables.
The example below shows using max within a function that computes the product of the maxes of two lists
Booleans and Conditionals
The following examples show how booleans and conditions can be used within python
Below are a few examples of conditionals that show which booleans result from them.
Below are specific examples of using conditionals with max, min, and sum
The code below shows an if statement paired with an else if and else statement that compares two variables. Below shows an example where a > b
The examples below show the code above, except they show the result when b > a and a == b
The code below shows a function that uses an if statement with else if and else to determine which of two strings is longer and return that string.
The code below shows a function that determines if cake can be made from given ingredients using a conditional using 'and'
Modulus operator ('%')
The example code below shows using the modulus operator % in python, which finds the remainder between two numbers, and how it can be utilized to find whether a number is odd.
Lists
The code examples below show how lists can be utilized in python
The code below shows how indices work in python, the indexing starts at 0 and lists within lists can be accessed using multiple brackets.
The code below shows a function that accesses the 2nd element of a list that is itself the 2nd element of another list.
The code below switches the first and last values in a given array, using -1 as an index that accesses the last element of an array.
Loops
Python has for loops and while loops which are shown below
For loops
The code below shows how to use a for loop within a function to access every element in a list
While loops
The code below shows how to use while loops to run a set of commands so long as a certain conditional is true
The code below counts the amount of numbers in a list that are divisible by 7 using for loops and modulus (%).
Dictionaries
The code examples below show how dictionaries work within python
The code example below shows two examples of dictionaries and how to use .get() to find the value associated with a key.
The code below shows how dictionaries can be used to represent a deck of cards, as well as how to use .fromkeys() to build a dictionary from a list of keys, .items() to view dictionary pairs, .keys() to find the keys in a dictionary, .values() to print the values of a dictionary, .pop() to remove elements from a dictionary, .setdefault() to add a key with a default value, and .clear() to clear a dictionary.
External Libraries
The code examples below show how to access external libraries from python, specifically math libraries
The code below shows how the math library can calculate pi, logarithms, the greatest common divisor, and cosine.
The code below shows how to create and plot a sine curve using an external math library.
The code below shows how to use external math libraries to create and show a scatter plot based on a random distribution.
The following code shows how to use external math libraries in python to plot t^3.
PYTHON FOR CMS ANALYSIS
The following code shows a bar graph of the number of dimuon events.
November 27, 2018
Today I worked with Google colaboratory to complete HW10.
I used the Google deep dream program within Google colaboratory to make interesting photos.
What does the Google deep dream program do?
Google's DeepDream program takes a photo and modifies it to resemble other visuals that have been shown to the program. The DeepDream program has seen countless images of things like dogs, faces, lizards, and much more. The program takes an image given by the user and searches it for anything that resembles the visuals that Google has fed into it. Once the program identifies part of the image that resembles something, such as a shadow that resembles a dog's face, it will modify that part of the image to look even more like what it resembles. This is done across the entire image; the result is an image with bizarre colors and shapes resembling things like dogs and eyeballs, even though the original image may not have appeared to resemble those things to the human eye. Since the original photo and the resulting photo can be so different, the result can be unsettling. Thanks to deep learning, Google's DeepDream is able to find the slightest resemblance of a visual in a given photo and modify the photo until that resemblance is clear to the human eye.
Below are the code blocks in Google colaboratory that run the Google deep dream program.
I put the image below into the DeepDream program and saved the result to Google Drive.
The picture below is the result of the deep dream program running on the above picture.
It is clear that the Google Deep Dream program can create interesting photos from just about any photo.
End of HONR268N