Famous Opening Quote
"Hello, how are you doing today?"
-Annabelle
Logbook:
September 13th, 2018
Bash practice:
echo $0
returns
bash
echo $SHELL
returns
/bin/bash
These commands tell you which shell the terminal is running and the path to its binary (useful if you need to know where it lives).
pwd : print working directory; not used often, as the prompt usually shows it to the left of where you enter commands
ls : prints all the files and subdirectories of your working directory (ls stands for "list")
ls -l : a more detailed version of ls, showing things like size and last-modified time
mkdir : makes a directory (folder) with the given name in the current folder
cd : change the current directory to the one listed
-can take absolute paths from root (/) or home (~)
-can take relative paths from the current directory
-can also take .. to move up out of the current folder
rmdir : removes the directory with the given name
ls -CFx /usr/bin : lists the contents of /usr/bin, which holds most of the commands available from the terminal
ls -l /bin : a detailed listing of /bin, the core system binaries (I recognize the echo, su, and mount commands)
HW3: Shell Scripts
chmod allows you to change the permissions on a file, for example giving anyone the ability to run an executable bash script (chmod 755 script) while allowing no one but the owner to change it.
ls -l allows you to see the file permissions of all the files in a directory.
The following program takes one input and writes the result of find ~ -name "*<the input>*" to the test.txt file:
#!/bin/bash
arg=$1
find ~ -name "*$arg*" > test.txt
To do a find and replace command using bash do the following:
In file named wordreplace:
#!/bin/bash
arg1=$1
arg2=$2
sed -i -e "s/$arg1/$arg2/g" test.txt
What this does is replace every occurrence of arg1 with arg2 in test.txt (the double quotes matter: the shell only expands the variables inside the sed expression when double-quoted). For the exercise this was run with:
Command:
./wordreplace was is
HW4 : C++
(This section is a little bare as I already know C from CMSC 216)
(So just some basic stuff, plus comments on the differences between C and C++)
#include <iostream> // includes the library for terminal output (and file streams)
using namespace std; // pulls in the std namespace, so we can write cout instead of std::cout

int main() { // main function, run when the executable is run
    cout << "Hello World!" << endl; // print hello world with a newline (\n works too)
    return 0; // exit the program and report that it ran correctly
}
find /usr/include/c++ -name "4*"
This shows the installed C++ header versions from GCC: 4.4.4 and 4.4.7.
I am using 4.4.7.
G++; increments the variable G by 1, same as G += 1;
G--; decrements the variable G by 1, same as G -= 1;
Booleans, declared bool, hold true and false values. (They are 0 and 1 in the background.)
while (boolean_or_condition) {
    // code to be executed repeatedly
}
Just like c syntax, which I have a lot of experience with from CMSC 216.
Differences to C:
Main method declaration: in C it is int main(void) {}; in C++ it is int main() {}. In essence, you don't need to specify void when you take no arguments.
Output: C uses printf and scanf for console output and input; C++ has these too, but usually uses cout << and cin >> instead.
For loops: in C++ you can declare variables inside the for loop header, for(int i=0; i<10; i++), while in C you have to write int i=0; for(i=0; i<10; i++).
“Your code is not doing what it should. Maybe it is even crashing. What do you do? Do you give up? Do you start randoming changing things? Of course ” -HW 4
On the danger of misleading quotes.
g++ compiles both .cpp and .c files; the difference is that g++ treats both as C++, while gcc compiles .c files as C.
g++ works like gcc in that -o filename code.cpp names the executable filename rather than the default a.out.
HW5:
Pointers work in C++ just like in C.
However, unlike C, C++ has a keyword called new.
The new keyword dynamically allocates an object and returns a pointer to it, letting you create the pointed-to value without first declaring a named variable.
Used:
int * p = new int(5);
Now p would point to an integer with the value of 5.
HW 6: October 30th
#include <iostream>
#include <fstream>
using namespace std;

int main() {
    ofstream myfile;
    myfile.open("example.txt");
    myfile << "write some junk.";
    myfile.close();
    return 0;
}
What this does is write "write some junk." into example.txt.
Basically, this works by replacing cout with another IOStream, in this case a file stream: you can use the same syntax as printing output to the console to add content to a file.
Computing
#include <iostream>
#include <math.h>
using namespace std;

// this function is the actual random number generator
// the code is adapted from the book Numerical Recipes for Fortran
// it is a pseudo-random number generator that relies on the
// integer overflow of its state variable
const int a = 7141;
const int c = 54773;
const int mmod = 256200;

double getFlatRandom(int& inew) {
    double mranflat = 0.;
    inew = inew % mmod;
    double aa = double(inew) / double(mmod);
    mranflat = aa;
    inew = a * inew + c;
    return mranflat;
}

// in this code, we call the pseudo-random number generator and learn some things
// about its properties by filling and then displaying a histogram
int main() {
    int num;
    cout << "Enter the number of loop iterations: ";
    cin >> num;
    int inew = 2345; // this is the "seed" for the random number generator
    // we put the results from the calls into a histogram that we can look at,
    // to learn some of the generator's properties. The histogram has 10 bins.
    int histo[10] = {0,0,0,0,0,0,0,0,0,0};
    double atmp;
    // call the random number generator num times and fill the histogram
    for (int i = 0; i < num; i++) {
        atmp = getFlatRandom(inew);  // call the random number generator
        histo[int(atmp*10)]++;       // increment the histogram bin the number falls within
    }
    // print the histogram to the screen
    for (int i = 0; i < 10; i++) {
        cout << i << ": ";
        for (int j = 0; j < int((double(100*histo[i])/double(num)) + 0.5); j++) {
            cout << "=";
        }
        cout << endl;
    }
    return 0;
}
By removing the &, inew is passed by value rather than by reference, so getFlatRandom is called with the same state every time and keeps returning the same "random" number.
getFlatRandom works through integer overflow, which I know well.
The numbers are called pseudo-random because the same seed and the same number of iterations always produce the same sequence. The distribution looks terrible for small numbers of iterations, but with 10000 iterations the generator is fairly flat; there are still better ones.
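As an aside, a generator of this form is called a linear congruential generator. Here is a minimal Python sketch of the same scheme (constants copied from the C++ above); returning the new state plays the role of the C++ reference parameter:
a, c, mmod = 7141, 54773, 256200   # same constants as the C++ code

def get_flat_random(inew):
    # returns (a value in [0, 1), the next state); same seed, same sequence
    inew = inew % mmod
    return inew / mmod, a * inew + c

state = 2345                       # the seed
for _ in range(3):
    value, state = get_flat_random(state)
    print(value)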
Calorimeter:
N=10000
Physics:
The LEP tunnel (google it) has a circumference of 27 km. Assuming the maximum field strength the best magnet experts can currently produce for a large bore magnet is 8.4 Tesla, what will the energy of the protons in the LHC beams be?
1 T = 1 J/(A·m²); 27 km = 27000 m
pc ≈ 0.3·q·B·r, with pc in GeV, B in tesla, r in meters, and q in units of the proton charge
pc = 0.3 × (27000/2π) × 8.4 GeV ≈ 10829 GeV ≈ 10.8 TeV (q = 1 for a proton)
E² = (pc)² + (mc²)², with mc² = 938.272 × 10⁶ eV = 0.938 GeV
E ≈ 10.8 TeV (the proton mass is negligible at this energy)
What if the tunnel were along the earth's equator? What would the maximum energy be?
pc = 0.3 × (40075000/2π) × 8.4 GeV ≈ 16073 TeV
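As a sanity check, the same calculation in Python (using pc ≈ 0.3·B[T]·r[m] GeV for unit charge, and mc² ≈ 0.938 GeV for the proton):
import math

B = 8.4                                  # tesla
mp = 0.938272                            # GeV, proton rest energy mc^2
for label, circumference in [("LEP tunnel", 27000.0), ("equator", 40075000.0)]:
    r = circumference / (2 * math.pi)    # meters
    pc = 0.3 * B * r                     # GeV
    E = math.sqrt(pc**2 + mp**2)         # GeV; the mass term is negligible here
    print(label, round(E / 1000, 1), "TeV")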
HW 7: November 12th
Done in class.
In essence we just installed MadGraph and used it a bit, which was handled through very detailed instructions on the HW page.
Write a brief description of major parts of the CMS detector. Beam pipe, tracker, calorimeter, solenoid, muon tracker. What are the physical dimensions of each part?
Beam pipe: 6.3 cm in diameter
tracker: 1.1 meter radius
Calorimeter, about 1.5 m radius
Solenoid about 1 meter radius
Muon tracker: 4+ meters
How do you change the display to show a view along the beam pipe?
Which color tracks are electrons?
Green
Which color tracks are muons?
Long red
Which color is missing energy represented with?
A purple dashed line
How are jets depicted in this visualization?
Yellow tubes
If a track has clockwise curvature (when looking at the x-y view, with x along the horizontal axis), is it positively or negatively charged?
It is positively charged
Study at least 2 events from each category in the files available with the display, and specify how many objects of each kind are seen in each event:
Electron.ig:Events/Run_146644/Event_718940510 [1 of 25]
Electrons: 1
Muons: 0
Photons: 1
Jets: 8
missing energy: 1
Vertices: 0
Electron.ig:Events/Run_146644/Event_718948926 [2 of 25]
Electrons: 1
Muons: 0
Photons: 1
Jets : 14
missing energy: 1
Vertices: 0
Mu.ig:Events/Run_146436/Event_90625265 [1 of 25]
Electrons: 0
Muons:2
Photons: 0
Jets: 0
missing energy: 0
Vertices: 0
Mu.ig:Events/Run_146436/Event_90626440 [2 of 25]
Electrons: 0
Muons: 2
Photons: 0
Jets : 15
missing energy: 0
Vertices: 0
4lepton.ig:Events/Run_178424/Event_666626491 [1 of 3]
Electrons: 0
Muons: 4
Photons: 0
Jets: 30+
missing energy: 1
Vertices: 0
4lepton.ig:Events/Run_193575/Event_400912970 [2 of 3]
Electrons: 4
Muons: 1
Photons: 4
Jets : 30+
missing energy:2
Vertices: 0
diphoton.ig:Events/Run_199319/Event_641436592 [1 of 10]
Electrons: 0
Muons:2
Photons: 2
Jets: 15+
missing energy: 1
Vertices: 0
diphoton.ig:Events/Run_199699/Event_336259924 [2 of 10]
Electrons: 0
Muons: 0
Photons: 2
Jets : 15+
missing energy:1
Vertices: 0
MinimumBias.ig:Events/Run_148031/Event_441572355 [1 of 25]
Electrons: 0
Muons:0
Photons: 0
Jets: 12
missing energy: 0
Vertices: 0
MinimumBias.ig:Events/Run_148031/Event_441584555 [2 of 25]
Electrons: 0
Muons: 0
Photons: 0
Jets : 0
missing energy:0
Vertices: 0
HW 8: November 12th
HW9
CODE:
pandas=10# no semicolons
zebras=20# declare vars like all other languages
total_animals_in_zoo=pandas+zebras
print(total_animals_in_zoo)#print stuff to console
pi=3.1415926535 #Approx.
diameter=10
area = pi*diameter*diameter/4
print(area)
a = [1, 2, 3]#list of ints
b = ['cat', 'dog', 'fish']# list of strings
a=b
b=a#this part fails, as the values of a have already been overwritten
print('Method 1')
print(a)#unlike other languages this will actually print all elements in the list
print(b)
a = [1, 2, 3]
b = ['cat', 'dog', 'fish']
temp = a#here a is saved so it is not overwritten
a = b
b = temp
print('Method 2')
print(a)
print(b)
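Worth noting (my addition): Python can also swap without any temp variable, using tuple unpacking.
a = [1, 2, 3]
b = ['cat', 'dog', 'fish']
a, b = b, a   # the right side is evaluated first, then unpacked
print('Method 3')
print(a)      # ['cat', 'dog', 'fish']
print(b)      # [1, 2, 3]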
dog_legs=0
human_legs=4
goldfish_legs=2
#code
temp = dog_legs
dog_legs=human_legs
human_legs = goldfish_legs
goldfish_legs=temp
#def is keyword to indicate function
def round_to_two_places(num):#function, if, for, and while headers end with a :
    return round(num, 2)#the function name is longer than the call it wraps
print(round_to_two_places(4.325456463))
def round_to_n_places(num, places):
    return round(num,places)#AMAZING
a=[1,5435,324,3,645365,23,4,5665,435,31, -543]
b=-10
c='cat'
print(c)
print(abs(b)) #absolute value
print(min(a)) #min value
print(max(a)) #max value
print(sum(a)) #sum
print(len(c)) #length of a string
print(sorted(a)) #sorts a list
a=[3243,645645,54346]
b=[7,3,2]
def product_of_maxes(a,b):
    return max(a)*max(b)
print(3==10)
print(len('abc')==3)
print(max([10,40,53]) < 25)
print (5 != 3)
#prints False, True, False, True
print(min([1,2,3])!=1)
print(max([4,5,6])<7)
print(sum([1,2,3,4])==10)
#prints False, True, True
a=10#b>a
b=20
#a=20#a>b
#b=10
#a=10#a=b
#b=10
if(a>b):
    print(a, 'is greater than', b)
elif(a<b):
    print(b, 'is greater than', a)
else:
    print("they're equal")
def longer_string(string1, string2):
    if(len(string1) > len(string2)):
        return string1
    elif(len(string1) < len(string2)):
        return string2
    else:
        return
print(longer_string('abc', 'jfkdla;s'))
print(longer_string('abcfdsafdasg', 'jfkdla;s'))
print(longer_string('abc', 'def'))
#prints jfkdla;s, abcfdsafdasg, and then None (equal lengths return nothing)
def can_make_cake(eggs, flour, milk, almond_milk):
    #I can make cake if I have some kind of milk AND flour AND eggs
    return (milk or almond_milk) and flour and eggs > 0
print('Can I make cake?')
#What is the following line implying?
print(can_make_cake(10, True, False, True))
# you can make a cake with 10 eggs, flour, and almond_milk, will print True
print(can_make_cake(0, True, False, True))
# you can't make a cake with no eggs, flour, and almond_milk, will print False
a = [-35423,-5432654,-352435,53252]
def abs_of_min_is_odd(a):
    if(abs(min(a))%2==0):
        print("Even")
    else:
        print("Odd")
print(abs_of_min_is_odd(a))#the function prints instead of returning, so this also prints None
def select_third_element(my_list):
    if(len(my_list) < 3):
        return None
    else:
        return my_list[2] #remember index 0, so third element is index 2
foo = ['a', 'b', 'c', 'd']
print(select_third_element(foo))#prints c
list1=[1,2,3]
list2=[4,5,6]
list_of_lists=[list1,list2]#[[1,2,3],[4,5,6]]
print(list_of_lists[1][2])#prints 6
print(list_of_lists[0])#prints [1,2,3]
print(list_of_lists[1])#prints [4,5,6]
print(list_of_lists[0][1])#prints 2
print(list_of_lists[0][2])#prints 3
print(list_of_lists[1][0])#prints 4
#like base 3: the two indices act like two base-3 digits
real_madrid=['zidane', 'ramos', 'marcelo']
barcelona=['valverde', 'messi', 'pique']
teams = [real_madrid, barcelona]
def losing_team_captain(team_list):
    return team_list[0][1]
print(losing_team_captain(teams))
standings=['mario', 'bowser', 'luigi','peach']
def purple_shell(racers):
    temp=racers[0]
    racers[0]=racers[len(racers)-1]
    racers[len(racers)-1]=temp
print(purple_shell(standings))#the function returns nothing, so this prints None
list=[1,21,7,-14,49,74,700]#note: this name shadows the built-in list type
def seven_counter(listy):
    r=0.0
    for i in listy:
        if i%7==0:
            r+=1
    return r/len(listy)
print(seven_counter(list))
deck={} #represents a deck of cards where the keys are the suits and values are lists of card values
hearts=[]
clubs=[]
diamonds=[]
spades=[]
values=['ace', 'two', 'three', 'four', 'five', 'six','seven', 'eight', 'nine', 'jack', 'king', 'queen']
def create_cards(suit):
    for value in values:
        suit.append(value)
create_cards(hearts)
create_cards(clubs)
create_cards(diamonds)
create_cards(spades)
##################################Add your code here according to the comments ##########################################
#Use the strings 'h', 's', 'c', 'd' as keys, and add the lists of values to the dictionary.
deck['h']=hearts
deck['s']=spades
deck['c']=clubs
deck['d']=diamonds
#Print a view of dictionary (key, value) pairs
for i in deck:
    for j in deck[i]:
        print(j+" of "+i)
#Print a view of all of the keys
print(deck.keys())
#Print a view of all of the (key, value) pairs; deck.values() would give just the values
print(deck.items())
#Remove all of the spades from the deck
del deck['s']
#Add a new entry to the dictionary with key 'j' and values 'joker1' and 'joker2'
deck['j']=['joker1','joker2']
#Clear the dictionary
deck.clear()
import math as m
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
data = {'a': np.arange(50),
'c': np.random.randint(0, 50, 50),
'd': np.random.randn(50)}
data['b'] = data['a'] + 10 * np.random.randn(50)
data['d'] = np.abs(data['d']) * 100
plt.scatter('a', 'b', c='c', s='d', data=data)
plt.xlabel('entry a')
plt.ylabel('entry b')
plt.show()# this outputs a scatter plot of the randomly generated data
# NOW MY OWN
data = {'a': np.arange(100),
'c': np.random.randint(10,30,100),
'd': data['c']}
data['b'] = np.square(data['a']-50)+np.random.randint(-20,20,100)
data['d'] = np.abs(data['d']) * 10
plt.scatter('a', 'b', c='c', s='d', data=data)
plt.xlabel('entry a')
plt.ylabel('entry b')
plt.show()
# imports some software packages we'll use
import pandas as pd
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
inline_rc = dict(mpl.rcParams)
# a hashtag tells the program "don't read the rest of the line"
# That way we can write "comments" to humans trying to figure out what the code does
two_u = pd.read_csv('https://github.com/adamlamee/HEP-data/raw/master/Double_Muon_Run2011A.csv')
# two_e = pd.read_csv('https://github.com/adamlamee/HEP-data/raw/master/Double_Electron_Run2011A.csv')
# one_u = pd.read_csv('https://github.com/adamlamee/HEP-data/raw/master/Single_Muon_Run2011A.csv')
# one_e = pd.read_csv('https://github.com/adamlamee/HEP-data/raw/master/Single_Electron_Run2011A.csv')
data = two_u
# The .head(n) command displays the first n rows of a file.
data.head(3)
# The .shape command displays the (number of rows , number of columns) in a file.
data.shape
# You can specify a column by dataset.columnName (e.g., two_u.E1)
# This makes a new column called "totalE" and fills it with (E1 + E2) for each event
data['totalE'] = data.E1 + data.E2
# This makes a new column called "Esquared" and fills it with E1^2 for each event
data['Esquared'] = data.E1**2
# makes the histogram
plt.hist(data.totalE, bins=10, range=[0,120], log=False)
plt.title("Dimuon Events")
plt.xlabel("x-axis label")
plt.ylabel("number of events")
OUTPUT:
30
78.5398163375
Method 1
['cat', 'dog', 'fish']
['cat', 'dog', 'fish']
Method 2
['cat', 'dog', 'fish']
[1, 2, 3]
4.33
cat
10
-543
645365
656743
3
[-543, 1, 3, 4, 23, 31, 324, 435, 5435, 5665, 645365]
False
True
False
True
False
True
True
20 is greater than 10
jfkdla;s
abcfdsafdasg
None
Can I make cake?
True
False
Even
None
c
6
[1, 2, 3]
[4, 5, 6]
2
3
4
ramos
None
0.7142857142857143
ace of h
two of h
three of h
four of h
five of h
six of h
seven of h
eight of h
nine of h
jack of h
king of h
queen of h
ace of s
two of s
three of s
four of s
five of s
six of s
seven of s
eight of s
nine of s
jack of s
king of s
queen of s
ace of c
two of c
three of c
four of c
five of c
six of c
seven of c
eight of c
nine of c
jack of c
king of c
queen of c
ace of d
two of d
three of d
four of d
five of d
six of d
seven of d
eight of d
nine of d
jack of d
king of d
queen of d
dict_keys(['h', 's', 'c', 'd'])
dict_items([('h', ['ace', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'jack', 'king', 'queen']), ('s', ['ace', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'jack', 'king', 'queen']), ('c', ['ace', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'jack', 'king', 'queen']), ('d', ['ace', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'jack', 'king', 'queen'])])
Text(0,0.5,'number of events')
HW10:
For HW10 we used Google Deep Dream to create some cool images.
Code here:
https://github.com/keras-team/keras/blob/master/examples/deep_dream.py
The image I ran the code on is below:
Then, after running the code in Colab, I ended up with:
Which certainly looks incredibly terrifying.
### ### ### ### ### ### ### ### ### ###
Beginning of Honr269L Logbook
### ### ### ### ### ### ### ### ### ###
Week 1:
In the first week I am just getting back into the college routine.
Sent an email to Bill Dorland after class on Thursday.
Friday was a snow day.
Week 2:
Got a response the next Wednesday and set up a meeting for Friday.
At Friday's meeting I met with Bill Dorland and Matt Landreman about the project.
Bill told me about how Matt has created several million fusion magnetic-containment designs, generated through a brute-force search.
The "stellarator" we saw last semester is one such example of a reactor design, but one that was designed by hand.
All of these designs are sitting as mostly unprocessed data, and my project will be figuring out how to use machine learning to see if any of the reactor designs cluster usefully around particular qualities.
Week 3:
On Monday all 3 of us met again. The main purpose of the meeting was for Matt to show us the format the data is in and what he did to generate it.
He generated the curve that defines each design using a small Fourier series, with a small discrete choice of constants for each level of the series.
The Fourier series creates a continuous curve, which is then evaluated for certain physical properties, all of which are logged in a format called NetCDF.
How to Read NetCDF:
Manually, from a terminal, using the command ncdump
In Python, using the netCDF4 library:
ex:
from netCDF4 import Dataset
dataset = Dataset('../data/quasisymmetry_out.20190105-01-012_nfp1_stellSym.nc')
print(dataset.dimensions.keys())
print(dataset.variables.keys())
sz = dataset.dimensions['N_scan'].size
for i in range(0, sz):
    print(dataset.variables['iotas'][i])
This prints out all the dimension sizes used in the file and what each stored variable is called, and then accesses and prints each element of the "iotas" variable from the data file Matt gave me.
Next, on Wednesday, I met with Bill again and we talked about ML and how I was going to start with it:
Start with something simple and make sure it works, so the result is going to make sense
Planning to use TensorFlow
Bill suggested using k-means clustering
Week 4:
Did not meet with Bill Dorland or Matt this week; Bill was not in on Monday, and school was canceled due to snow on Wednesday.
In the meantime I have been reading up on TensorFlow; much of it is above me for right now.
What I am trying to learn about is k-means clustering, and what I have determined is that it is not the right ML algorithm. What we want is to draw circles in the data that indicate clusters; however, k-means, to the best of my ability to understand, only separates data radially.
But I want an algorithm that can pick out circles like below.
So while I learned a bit about TensorFlow tensors, and how to move data from NetCDF into a tensor, I have not made progress towards something like the image above (which I made by taking an image provided by Matt and editing it in Paint).
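To check this, here is a small test using scikit-learn (my own sketch; the parameters are illustrative): k-means run on two concentric rings splits them with a straight boundary instead of finding the rings.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_circles

# two concentric rings of points
X, ring = make_circles(n_samples=200, factor=0.3, noise=0.05)
pred = KMeans(n_clusters=2, n_init=10).fit_predict(X)
# each k-means cluster mixes points from both rings, since the boundary
# between two k-means clusters is always a straight line
print(np.bincount(pred[ring == 0]), np.bincount(pred[ring == 1]))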
Week 5:
On Monday, we met and decided that because k-means doesn't work,
my goal should be to do research and find 3 possible models that might work better.
Model 1: Expectation-Maximization (EM)
EM is used to find (local) maximum-likelihood parameters of a statistical model in cases where the equations cannot be solved directly.
It generates new random data using the model and compares it to the actual data.
Requires knowing a form of the model already: polynomial, log, exponential.
Not necessarily clustering, but still useful for looking at the data; the catch is that it requires knowing something about what the data should look like beforehand, which is very hard.
Model 2: Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
A clustering algorithm with two picked parameters:
Radius
Min points
It picks a point at random; if the point has fewer than min-points neighbors within the radius, it is marked as noise and discarded. Otherwise it is a core point (red in the left picture) and begins a cluster, and the same test is applied to every point within the radius until all reachable points are connected. Edge points (yellow in the left picture) are points within the radius of a core point that are not core points themselves; they join the border of the cluster but do not extend it. (A quick scikit-learn sketch appears after the pros and cons below.)
Pros:
Handles any shape and number of groupings
Easy to implement
Cons:
Maybe not the best fit, given the density variations and holes in the data
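A minimal sketch of DBSCAN using scikit-learn (the eps and min_samples values here are illustrative, not tuned for the stellarator data):
import numpy as np
from sklearn.cluster import DBSCAN

np.random.seed(0)
X = np.vstack([np.random.normal(0, 0.3, (50, 2)),   # one dense blob
               np.random.normal(3, 0.3, (50, 2)),   # another dense blob
               np.random.uniform(-2, 5, (5, 2))])   # a few sparse outliers
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(set(labels))  # cluster ids; label -1 marks points discarded as noise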
Lastly,
Model 3: Affinity Propagation
Alternates updating two matrices: responsibility and availability.
(r(i,k) is element i,k of the responsibility matrix, which measures how well-suited point k is to serve as the exemplar, i.e. center, for point i)
(a(i,k) is element i,k of the availability matrix, which measures how appropriate it would be for point i to pick point k as its exemplar)
(s(i,k) is the "distance" between points i and k)
After the updates have converged, the points where r(i,i) + a(i,i) > 0 are the cluster centers (exemplars).
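For reference, the standard update rules (from Frey and Dueck's formulation, which is what the Wikipedia article gives) are:
r(i,k) ← s(i,k) − max over k′ ≠ k of [ a(i,k′) + s(i,k′) ]
a(i,k) ← min( 0, r(k,k) + Σ over i′ ∉ {i,k} of max(0, r(i′,k)) ), for i ≠ k
a(k,k) ← Σ over i′ ≠ k of max(0, r(i′,k))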
Pros:
Any number of groupings, in any position relative to each other
Fairly easy to implement?
Cons:
May not handle the holes in the data as well as DBSCAN (I think)
I will work on implementing model three over the weekend:
Week 6:
Found a library called scikit-learn, which contains example code for affinity propagation.
Combined with the example code, here is what I came up with:
from sklearn.cluster import AffinityPropagation
from sklearn import metrics
import matplotlib.pyplot as plt
from itertools import cycle
from sklearn.datasets.samples_generator import make_blobs
import random as r
import numpy as np
from netCDF4 import Dataset
import time
# #############################################################################
# Generate sample data
def weighted(n):
    x = 0.0
    for i in range(0, n):
        x += r.random()
    return x / n

Y = []
for i in range(0, 100):
    Y += [[weighted(16), weighted(16)]]
for i in range(0, 100):
    Y += [[2 + weighted(6), weighted(6)]]
Y = np.array(Y)
The above imports the necessary libraries and generates random data around (0,0) and (2,0); the weighted function gives a random distribution that is more bell-shaped the higher the input (it averages n uniform draws), and uniform at 1.
Below is the affinity propagation code from scikit-learn:
# Compute Affinity Propagation (DATAUSED is the array being clustered, e.g. Y above)
af = AffinityPropagation(convergence_iter=5000000).fit(DATAUSED)
cluster_centers_indices = af.cluster_centers_indices_
labels = af.labels_
n_clusters_ = len(cluster_centers_indices)
"""
print('Estimated number of clusters: %d' % n_clusters_)
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels_true, labels))
print("Completeness: %0.3f" % metrics.completeness_score(labels_true, labels))
print("V-measure: %0.3f" % metrics.v_measure_score(labels_true, labels))
print("Adjusted Rand Index: %0.3f"
% metrics.adjusted_rand_score(labels_true, labels))
print("Adjusted Mutual Information: %0.3f"
% metrics.adjusted_mutual_info_score(labels_true, labels))
print("Silhouette Coefficient: %0.3f"
% metrics.silhouette_score(X, labels, metric='sqeuclidean'))
"""
# #############################################################################
Below, the final data is plotted with colors according to the cluster each point belongs to; the commented-out code additionally marks the cluster centers and draws lines from each center to all of the points in its cluster.
plt.close('all')
plt.figure(1)
plt.clf()
plt.xlabel(xvar)  # xvar/yvar name the variables being plotted (set where the data is sampled)
plt.ylabel(yvar)
colors = cycle('bgrcmykbgrcmykbgrcmykbgrcmyk')
for k, col in zip(range(n_clusters_), colors):
    class_members = labels == k
    cluster_center = DATAUSED[cluster_centers_indices[k]]
    plt.plot(DATAUSED[class_members, 0], DATAUSED[class_members, 1], col + '.')
    #plt.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
    #         markeredgecolor='k', markersize=14)
    #for x in X[class_members]:
    #    plt.plot([cluster_center[0], x[0]], [cluster_center[1], x[1]], col)
plt.title('Points: ' + str(howmany) + '. Estimated number of clusters: %d' % n_clusters_)
plt.savefig('/home/annabelle/School/FusionStuff/' + str((int)(time.time()*100000)) + '.png', bbox_inches='tight')
plt.show()
Below is an example output of this code:
Next, I fed the data from Matt into the clustering algorithm, though as the data set is very large I only took a very small random selection from it.
The code clusters well, especially with small numbers of points; as the number of points increases, it can fail to converge the last cluster, as seen above in the bottom left.
After getting some good examples with the scikit-learn library, I will write my own implementation.
Week 7:
Initially I just translated the Wikipedia definition of the algorithm directly into Python; however, instead of converging, my two matrices were either unchanging or oscillating between two values.
I did some research and read http://academics.smcvt.edu/jtrono/Papers/SMCClustering%20Paper_PatrickRedmond.pdf.
There I found the fix: average each new matrix entry with its old value using a damping factor, which slows convergence but stops the oscillation.
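In symbols, with damping factor λ (I used λ = 1/2), the responsibility update becomes r_new(i,k) = λ·r_old(i,k) + (1 − λ)·[ s(i,k) − max over k′ ≠ k of ( a(i,k′) + s(i,k′) ) ], and the availability update is damped the same way.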
It took a lot of work, and throwing out the code and starting again once, but in the end I got the code below:
import matplotlib.pyplot as plt
from itertools import cycle
import random as r
import numpy as np
from netCDF4 import Dataset
import time,math,sys
dataset = Dataset('../data/quasisymmetry_out.20190105-01-019_nfp5_stellSym.nc')
sz = dataset.dimensions['N_scan'].size
Z = []
howmany = 150
xvar = 'iotas'
yvar = 'max_curvatures'
for i in range(0, howmany):
    rand = (int)(r.random() * sz)
    Z += [[dataset.variables[xvar][rand], dataset.variables[yvar][rand]]]
print("Data Read")
DATAUSED = Z
####################################################
s = [[0] * howmany for i in range(0, howmany)]
r = [[0] * howmany for i in range(0, howmany)]  # note: this rebinds r, which was the random module above
a = [[0] * howmany for i in range(0, howmany)]
The above section reads in the data to be processed, sets up the three matrices, and imports libraries, similar to before.
###########SETTING UP S ################################
def dist(x, y):
    dx = math.pow(x[0] - y[0], 2)
    dy = math.pow(x[1] - y[1], 2)
    return -math.sqrt(dx + dy)

for i in range(0, howmany):
    for k in range(0, howmany):
        s[i][k] = dist(DATAUSED[i], DATAUSED[k])
for k in range(0, howmany):
    s[k][k] = np.median(s)
This pre-calculates the distance between every pair of points, since it is needed many times; instead of recomputing it each time, it is calculated once and stored. The diagonal s(k,k) is set to the median similarity, which acts as each point's "preference" and controls roughly how many clusters form.
def updateR():
    newr = [[0] * howmany for i in range(0, howmany)]
    for i in range(0, howmany):
        for k in range(0, howmany):
            maxof = -sys.maxsize
            for kbar in range(0, howmany):
                if kbar != k:
                    maxof = max(maxof, s[i][kbar] + a[i][kbar])
            newr[i][k] = ((s[i][k] - maxof) + (r[i][k])) / 2
    return newr
This is a direct translation of the definition of the responsibility matrix from Wikipedia, with the damping average added; the paper linked above also contains pseudocode.
def updateA():
    newa = [[0] * howmany for i in range(0, howmany)]
    for i in range(0, howmany):
        for k in range(0, howmany):
            sum = 0
            if i != k:
                for ibar in range(0, howmany):
                    if ibar != i and ibar != k:
                        sum += max(0, r[ibar][k])
                newa[i][k] = ((min(0, r[k][k] + sum)) + a[i][k]) / 2
            else:
                for ibar in range(0, howmany):
                    if ibar != k:
                        sum += max(0, r[ibar][k])
                newa[k][k] = (sum + a[k][k]) / 2
    return newa
Likewise, this is a direct translation of the definition of the availability matrix from Wikipedia, again with damping.
def iterate(n):
    global r, a, s
    # printar(s)  # printar was a small debugging helper for dumping a matrix (definition not shown)
    i = 0
    while i < n:
        if i % 2 == 0:
            r = updateR()
        else:
            a = updateA()
        i += 1
        if i % 10 == 0:
            print(i)

iterate(5000)
This is the main loop of the program. Currently it is set to run a fixed number of times, but the algorithm could also run until the number of clusters does not change over some number of iterations.
def getExemplars():
    e = []
    for k in range(0, howmany):
        if (a[k][k] + r[k][k] > 0):
            e += [k]
    return e

e = getExemplars()
q = []
for i in e:
    q += [DATAUSED[i]]
Exemp = np.array(q)
print("clusters : "+str(Exemp))
def members(exemplars):
    allpointshash = [[] for item in range(0, howmany)]
    for i in range(0, howmany):
        max2 = r[i][0] + a[i][0]
        kmax = 0
        for j in range(0, howmany):
            if r[i][j] + a[i][j] > max2:
                max2 = r[i][j] + a[i][j]
                kmax = j
        allpointshash[kmax] += [i]
    return allpointshash

mem = members(getExemplars())
# printar(mem)  # debugging helper (definition not shown)
print(mem)
Together, these two functions figure out from the two matrices which points are centers, and, for each point that is not a center, which cluster it belongs to.
colors = cycle('bgrcmykbgrcmykbgrcmykbgrcmyk')
DATAUSED2=np.array(DATAUSED)
print(DATAUSED2.size)
############### PLOTTING ######################################
plt.close('all')
plt.figure(1)
plt.clf()
clus=0
plt.xlabel(xvar)
plt.ylabel(yvar)
for k, col in zip(range(len(mem)), colors):  # draw the clusters
    if mem[k] != []:
        clus += 1
        plt.plot(DATAUSED[k][0], DATAUSED[k][1], col + ".")  # cluster center
        for p in mem[k]:
            plt.plot([DATAUSED[p][0], DATAUSED[k][0]], [DATAUSED[p][1], DATAUSED[k][1]], col)  # lines between
            plt.plot(DATAUSED[p][0], DATAUSED[p][1], col + ".")  # other points
plt.title('Points: ' + str(howmany) + ". Clusters: " + str(len(Exemp)))
plt.savefig('~/FusionStuff/' + str((int)(time.time() * 100000)) + '.png', bbox_inches='tight')
plt.show()
#############################################################################
Leading to the final part, which, like before, graphs the result using matplotlib.
This code works excellently, producing the following graphs; my version is slower, but it does not suffer from the failure to cluster the final section that the scikit-learn algorithm had.
(The first is from before I added color.)
I did not meet on Monday as Bill was busy, but on Wednesday we talked about goals over break, and the plan was to do several things:
Make the algorithm faster by pre-calculating the max and sum of columns and rows
Figure out how long the algorithm really needs to run until the clustering is "good enough" (see the sketch after this list)
Begin looking at the newer data Matt has, and potentially use some code he has provided me to view 3D models of several designs, to understand the data better
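For the second goal, a rough sketch of the stopping rule (step() is a hypothetical stand-in for one damped updateR()/updateA() pass; getExemplars is the function from my code above):
stable, last = 0, None
while stable < 50:                     # stop once the exemplar set is
    step()                             # unchanged for 50 passes in a row
    current = frozenset(getExemplars())
    stable = stable + 1 if current == last else 0
    last = current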
Since being back from break:
Monday:
Informed Bill of the progress I made over break: improving my algorithm from O(n^3) to O(n^2) per iteration by pre-calculating the max and sum of columns and rows (sketched below).
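A numpy sketch of the idea (my own illustration, not the exact code): for the responsibility update, the inner max over k′ ≠ k only ever needs each row's largest and second-largest value of s + a, so those can be computed once per row instead of once per entry.
import numpy as np

def update_r_fast(s, a, r, lam=0.5):
    # s, a, r are n x n numpy arrays; lam is the damping factor
    m = s + a
    n = len(m)
    rows = np.arange(n)
    top = np.argmax(m, axis=1)        # column index of each row's max
    first = m[rows, top]              # each row's max
    m2 = m.copy()
    m2[rows, top] = -np.inf
    second = m2.max(axis=1)           # each row's second-largest value
    best = np.tile(first[:, None], (1, n))
    best[rows, top] = second          # excludes k itself from the max
    return lam * r + (1 - lam) * (s - best)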
Wednesday:
I was very busy with several tests, so did not meet with Bill
Monday:
Bill was sick, did not meet
Wednesday:
Met for a short time and arranged a longer working meeting for Sunday to improve affinity propagation, in order to hopefully have it deal with millions of data points at once. If we could not accomplish that, then we would switch to a new algorithm. As hw, Bill had me prepare
SUNDAY:
First I explained to Bill, in detail, all of the parts of affinity propagation, through the pseudocode of the algorithm's definition and through the code of my improvements.
What he noticed is that the first matrix, the similarity matrix, looks very much like a negative-definite adjacency matrix, which he had seen used with other things in quantum mechanics. Bill's guess was that the other two matrices are a way of approximating eigenvalues and eigenvectors.
So after some introduction/review of eigenvalues and eigenvectors, Bill set out for me to investigate whether eigenvalues or eigenvectors can be used to find clusters for some very small sets of data. Calculating eigenvalues and eigenvectors can be done in n log n, but it is very difficult to make any manual sense of what to do with those values and vectors on large sets of data.
On Monday, I met again but didn't have much to report; after today's meeting I have written some Mathematica code to generate random points in clusters, calculate the similarity matrix (plus some value along the diagonal), and then calculate the eigenvalues and eigenvectors.
While perhaps not the most efficient code, my code is below:
close = 6
r[] := RandomReal[{-1, 1}]/close (*one uniform offset; RandomReal[{-1,1}] draws a single real in [-1,1]*)
points = {
{1, 1}, {1 + r[], 1 + r[]}, {1 + r[], 1 + r[]},
{-1 + r[], -1 + r[]}, {-1 + r[], -1 + r[]}, {-1, -1},
{-1 + r[], 2 + r[]}, {-1 + r[], 2 + r[]}, {-1 + r[], 2 + r[]},
{2 + r[], -2 + r[]}, {2 + r[], -2 + r[]}, {2 + r[], -2 + r[]}
}
num = Length[points]; (*less than 26*)
(*set up a matrix of letter pairs*)
dists = Outer[List, Take[Alphabet[], num], Take[Alphabet[], num]] // Flatten[#, 1] &;
(*replace the elements of the matrix with the points*)
g[a_, b_] := Replace[a, First[b] -> Last[b], {2}];
dists = Last[FoldList[g, dists, MapThread[List, {Take[Alphabet[], num], points}]]];
(*replace the points with the respective distances*)
f[a_] := -1*EuclideanDistance[First[a], Last[a]];
dists = Map[f, dists, {1}];
dists = ArrayReshape[dists, {Length[points], Length[points]}];
(*add the diagonal in*)
mat = dists + IdentityMatrix[num]*0.1;
MatrixForm[mat];
eval = Eigenvalues[mat] // N
evec = Eigenvectors[mat] // N
(*try to compute the centers by taking the dot of the eigenvectors with the points; works sometimes*)
h[a_, b_] := a + First[b]*Last[b];
center[evector_] := Last[FoldList[h, 0, MapThread[List, {evector, points}]]]
ListPlot[points]
centers = Map[center, evec]
ListPlot[Take[centers*Abs[eval], 4]]
What I have found so far is that there are large eigenvalues equal in number to the number of clusters, and the rest of the eigenvalues are small in comparison. (Bill predicted this would be the case, and I have indeed observed it.)
I have also attempted to see whether I can get clusters out of the eigenvalues, and so far I have not been entirely successful. Plotting centers*|eigenvalues| is interesting, as it seems to give a fairly stable result (one that does not change with the small fluctuations in the randomly generated points).
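To double-check the observation outside of Mathematica, here is a quick numpy version of the same experiment (same construction: negative distance matrix plus 0.1 on the diagonal), printing the spectrum to see how many eigenvalues separate from the bulk:
import numpy as np

np.random.seed(1)
centers = [(1, 1), (-1, -1), (-1, 2), (2, -2)]
pts = np.vstack([c + 0.15 * np.random.randn(3, 2) for c in centers])
d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
mat = -d + 0.1 * np.eye(len(pts))
print(np.sort(np.linalg.eigvalsh(mat))[::-1])  # sorted eigenvalues, largest first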
Example points:
Centers*|eigenvalues|:
Certainly the points in this diagram for centers*|eigenvalues| are not cluster centers, but the fact that they remain constant over small iterations and different sets of data is very interesting.
Upon further investigation, plotting eval*centers/Map[Norm, centers] gives:
Which really does look like cluster centers, not exactly in magnitude but in direction!
On a different set of data:
eval*centers/Map[Norm, centers]
which looks about right as well, with two obvious clusters and then some other ones in the center
Next Week:
I updated my Mathematica code to plot in color, and also to assign points to clusters based on the centers found.
An example clustering is provided below:
What I have found is:
I don't have control over the number of clusters it splits into
It is not perfect; the vectors I generate do not point exactly where the clusters are
Is taking a random dot product with the points just going to tell you roughly where the points are?
Overall, interesting but not super useful, so what I did next was research how spectral clustering is usually done.
I found the following source: "Segmentation Using Eigenvectors: A Unifying View" by Yair Weiss.
It was very helpful; it told me several things:
Instead of using the biggest eigenvectors, use the second-smallest
Which would be a problem, as approximation methods for eigenvalues and eigenvectors go from biggest to smallest
But it also gives a way to calculate the second-smallest eigenvalue from the first two biggest ones
I have not been able to confirm this fact in Mathematica.
Instead of clustering on the coordinates of the points, cluster on the values in the eigenvector itself (toy example below)
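A toy numpy version of that last idea (my sketch of the general technique, using the graph-Laplacian form of spectral clustering rather than the paper's exact construction): the sign of the second-smallest eigenvector of the Laplacian splits two well-separated blobs.
import numpy as np

np.random.seed(2)
pts = np.vstack([0.2 * np.random.randn(10, 2) + c for c in [(0, 0), (3, 0)]])
d2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=2)
W = np.exp(-d2)                    # affinity matrix (Gaussian kernel)
L = np.diag(W.sum(axis=1)) - W     # unnormalized graph Laplacian
vals, vecs = np.linalg.eigh(L)     # eigh sorts eigenvalues in ascending order
fiedler = vecs[:, 1]               # the second-smallest eigenvector
print(fiedler > 0)                 # its sign separates the two blobs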
Week after that:
Matt Landreman sent us an invite to a talk on spectral clustering by David Bindel, in Kirwan Hall 3206, Tuesday, April 30, 2:00 pm, and Wednesday, May 1, 2:00 pm.
However, I had an appointment on Monday, so I only had time to meet with Bill briefly.
We decided to both go to the talk, as it is on spectral clustering, which is what we are focusing on right now.
Also, Bill provided me two papers about the skeletonization of a matrix to read over and understand:
"A Highly Accurate Solver for Stiff Ordinary Differential Equations" by Dan Kushnir and Vladimir Rokhlin
and
"On the Compression of Low Rank Matrices" by H. Cheng, Z. Gimbutas, P. G. Martinsson, and V. Rokhlin
On Wednesday Bill was ill, so we did not meet