It's been a long time coming, but I finally decided to make myself a simple website. I'd been interested in the idea of making a website for myself for a while, but with no real web development experience it seemed a bit daunting. However, I'd heard about Flask, an easy way of building a website using Python, so I looked up some materials online and came across this article. I chose it because I enjoyed the idea of a simple, clean website with all my social links. Before starting, I got myself acquainted with Flask by following the Flask quickstart guide. This gave me a better understanding of how Flask works, including how it handles non-static websites.
For this website, I used some simple HTML and CSS for the front end; nothing needed to be dynamic, so JavaScript wasn't necessary. The hardest part of the website was the SVG icons. I found the icons on IcoMoon and then edited the SVGs to change the colour, make it possible to insert a clickable URL into them, and keep the CSS working.
The website was built using Frozen-Flask, which freezes a Flask site into static files so it can be hosted on any traditional web server. The hosting provider I chose was Netlify. I chose Netlify because deploying was very simple, with customisable build parameters that interfaced well with Flask, and it made it very easy to change the DNS settings from the domain name provider. It also automatically redeploys every time a change is pushed to GitHub, meaning very little management of the website needs to be done after the first deploy.
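To give an idea of how little is involved, the freeze step looks roughly like this; the app below is a minimal sketch, not the actual source of this site:

# A minimal sketch of freezing a Flask site with Frozen-Flask;
# the routes here are illustrative, not this site's real code
from flask import Flask, render_template
from flask_frozen import Freezer

app = Flask(__name__)
freezer = Freezer(app)

@app.route('/')
def index():
    return render_template('index.html')

if __name__ == '__main__':
    # Writes a static copy of the site into the build/ directory,
    # which a host like Netlify can then serve directly
    freezer.freeze()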
Overall, I enjoyed this project, and the initial site only took a couple of days to make. I would like to take it further and learn more about Flask, including the possibilities of creating and deploying machine learning models with it.
You can find the source code here and the website at astatham.com.
Well, it's been a long time, but I'm happy to say I'm finally back. I've started university and all is going well, so I've finally started up some personal projects again. For this first project back I've been working with a dataset of around 350 Scandinavian chiropractors (found via Upwork). However, it wasn't really a dataset at all; it was, in fact, a website.
Recently I've been interested in data mining, so I thought I'd challenge myself to build a complete CSV (without gaps) of all of these chiropractors. I started by trying to find a way to get the contact details of a single person. For this I went to a regional webpage and collected the links to the individual webpages, using requests and BeautifulSoup4.
# Where l is a list of region webpages
import requests
import bs4 as bs

link_list = []
for url in l:
    response = requests.get(url)
    page_content = bs.BeautifulSoup(response.content, 'lxml')
    # Finds all <a> tags with rel="bookmark", which are the links to
    # each person's individual webpage in that region
    find = page_content.find_all("a", rel="bookmark")
    # Adds each individual webpage to a list with the append function
    for link in find:
        link_list.append(link['href'])
MNIST, or the "Modified National Institute of Standards and Technology" dataset, is an image classification dataset and the first step for many people getting into machine learning, myself included. Before this I had read articles and papers about machine learning, but I had never put finger to keyboard and gained any first-hand knowledge. I may not have written any original code for this, but I have read through and tried to understand all the concepts behind this simple machine vision dataset. I want to use this blog post mainly to describe my experience getting started with TensorFlow.
First off, installing TensorFlow. TensorFlow, made by Google, is the most widely used machine learning package. It derives its name from the tensor, defined as "a mathematical object analogous to but more general than a vector, represented by an array of components that are functions of the coordinates of a space", or in layman's terms a geometric object used to describe relationships between geometric vectors or scalars in a multidimensional array. A great video for understanding tensors is this one by Dan Fleisch, which covers even the most basic ideas about vectors.
TensorFlow is not the easiest thing to install; it requires additional frameworks meant for developers and researchers, a category I'm not sure I quite fall into, but I was doing my own research, so I downloaded the CUDA Toolkit 8.0 and cuDNN v6.0. After these failed to install the first time, a tactical reboot got everything installed, but alas it still wasn't working. I browsed YouTube for a while until I found a Windows 7 tutorial and discovered I had to drag and drop some files that were not specified on the website. After this, the install worked and I was up and running. Overall it took me longer than it should have because I was trying to brute-force it myself; there are many great resources out there to help.
After running the obligatory "Hello world" command in bash, I opened up my IDE and checked it worked, and sure enough it did. The next thing was to understand the basics of the MNIST dataset and what was going on behind the code. The dataset is split into three parts: the training data, the testing data and the validation data. The training data is the most abundant and the most crucial set because it's what the computer will be 'learning' from. The data itself is handwritten digits turned into 28x28 pixel grids, where a blank pixel is a 0 and a black pixel is a 1, and the labels are represented as one-hot vectors. This means the algorithm can understand the structure of the image.
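For reference, loading the dataset in the version of TensorFlow I was using looks roughly like this; it's a sketch based on the standard tutorial helper rather than anything clever of my own:

from tensorflow.examples.tutorials.mnist import input_data

# one_hot=True turns each label into a one-hot vector, e.g. 3 -> [0,0,0,1,0,0,0,0,0,0]
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

print(mnist.train.num_examples)       # 55,000 training images
print(mnist.validation.num_examples)  # 5,000 validation images
print(mnist.test.num_examples)        # 10,000 test images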
The algorithm uses a subclass of logistic regression called softmax regression; in this type of regression the commonly used sigmoid function is replaced with the softmax function. Inside TensorFlow, the model is defined simply as:
y = tf.nn.softmax(tf.matmul(x, W) + b)
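For context, x, W and b in that line are the image placeholder, the weight matrix and the bias vector; in the tutorial they are set up along these lines (a sketch assuming the 784-pixel flattened images and 10 digit classes):

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 784])  # flattened 28x28 pixel images
W = tf.Variable(tf.zeros([784, 10]))         # weights: one column per digit class
b = tf.Variable(tf.zeros([10]))              # biases: one per digit class
y = tf.nn.softmax(tf.matmul(x, W) + b)       # the model line from above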
This function will go through the training data and make a prediction. It is then given the answer and tries to correct itself using a cost, or loss, function. A commonly used one, and the one used in this example, is cross-entropy, defined as:
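In this notation the cross-entropy is the standard -sum(y'_i * log(y_i)), where y' is the true one-hot label and y is the prediction. In TensorFlow it looks roughly like this (continuing from the sketch above; y_ is assumed to be a placeholder holding the true labels):

y_ = tf.placeholder(tf.float32, [None, 10])  # one-hot true labels
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))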
This function tells our algorithm how wrong it was and helps adjust it towards the correct values.
We then use the backpropagation algorithm to determine how each variable affects the loss, and adjust the variables to minimise that loss using gradient descent.
The model is then run through the training data again and evaluated. After this the algorithm has finished, and in this scenario the accuracy should be around 92%; with minor tweaks it can be increased to around 97%.
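Putting the pieces together, the training and evaluation loop is roughly as follows (continuing from the sketches above; the batch size and step count are the tutorial's defaults rather than anything I tuned):

train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(1000):
        # Train on small random batches of 100 images at a time
        batch_xs, batch_ys = mnist.train.next_batch(100)
        sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})

    # Compare the predicted digit (argmax of y) with the true digit (argmax of y_)
    correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
    print(sess.run(accuracy, feed_dict={x: mnist.test.images, y_: mnist.test.labels}))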
To conclude, this took a lot of research and a lot of grappling with maths well above my level, but some of the resources I have been using have really helped. I recently enrolled in the Khan Academy linear algebra course on vectors, and I've been listening to the machine learning podcast by OCDevel, which has given me a greater understanding of logistic regression. I am still far away from writing my own machine learning code, but I believe this is a good stepping stone towards understanding more about one of the fastest growing fields out there.
Thank you very much for reading, and I'll catch you on my next journey.
Well, I can say I took full advantage of my Christmas break by doing nothing productive, but with a day back into studies I'm feeling much more motivated to continue the blog, so keep your eyes open for more coming soon. This project has been in the works for nearly a month now, and I can see why it is constantly said that 90% of data science is data wrangling and fixing data. As I self-collected this data (again), it took a long time; however, I still sincerely believe the outcome, and the knowledge that I've learnt something from this project, is enough for me to say I'd do it all over again. This is one of the main reasons I feel so attracted to data science: the sheer feeling of finishing a long project.
Now, let's get into the project itself.
This week's project was to analyse my bank data. I've noticed recently that I've really lost track of my money, and this seemed like a good opportunity to see how I spend it. However, when I tried to download my yearly statement in anything but PDF, it got corrupted. So instead of getting the data straight from a CSV, I had to manually input tens of pages of pure data into a spreadsheet and categorize them. At first I hated this, but once I got into a rhythm it wasn't too bad, although it still took multiple hours to do. There were two different sections of the spreadsheet: one for seeing the differences in money and categorizing each spend, and one for converting into a CSV file. This is the spreadsheet:
My first idea was to see how the differences in spending and earning compared each month, so I manually created a small pandas data frame. Then all I needed to do was plot it. For this I decided to use seaborn; I have had some practice with seaborn before, so this wasn't too difficult. I also experimented with palettes inside seaborn for a while and created a few of my own, but as the data didn't need to be diverging I used a built-in one called 'Spectral', and I think it works quite well.
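The shape of that step was roughly the following (the numbers here are placeholders rather than my real bank figures):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Placeholder figures, not my actual monthly differences
df = pd.DataFrame({
    'Month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun'],
    'Difference': [120, 95, 60, -20, -45, 30],
})

sns.barplot(x='Month', y='Difference', data=df, palette='Spectral')
plt.title('Difference between money in and money out per month')
plt.show()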
From this visualization I can see good saving at the start of the year, then a decrease in summer when I had a lot more time for activities with friends and family, as well as a large amount coming in during November, my birthday month.
However, I wanted to delve deeper into what I actually spent my money on, so I downloaded the spreadsheet I had made on page 2 and converted it into a CSV. I came across a problem straight away: I had pound (£) signs in front of the money, a rookie mistake. I quickly removed them in the spreadsheet and downloaded a new CSV, but I was still running into a problem, and it took me a few hours of brainstorming at work to figure out the issue. Then I had an epiphany: I had previously removed a comma from the data and had a hunch the column was still stored as an object type because of it. Checking with x.dtypes confirmed my suspicions. I changed the type of the column, double-checked the types, and they were all float values, which I rounded to integers for ease of plotting.
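The fix was roughly along these lines (the file and column names are illustrative, not the real spreadsheet's):

import pandas as pd

x = pd.read_csv('spending.csv')   # illustrative filename
print(x.dtypes)                   # the money column shows up as 'object'

# Strip any leftover £ signs and thousands commas, then convert to numbers
x['Amount'] = (x['Amount'].astype(str)
                          .str.replace('£', '')
                          .str.replace(',', '')
                          .astype(float)
                          .round()
                          .astype(int))
print(x.dtypes)                   # now an integer column, ready to plot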
A note about this data: I hadn't been paid for December, and I had spent a lot more on gifts that hadn't yet come through, yet the negative difference is still similar.
I also removed the overall total from the data for now, as it wasn't necessary for the visualizations.
I wanted to do this next set of visualizations in a FacetGrid, as I don't have much experience with them, but after four hours of trying I couldn't find a way to make it work. Every time I got closer, a new error would pop up. I'm not deterred, though, and I aim to work with FacetGrids in future projects. For this section I wanted to see if there were any trends between the month and the spend.
There's not much we can take from these visualizations other than that I have spent a lot more on games since getting my VR headset.
I next wanted to create a FacetGrid of all the months that would look like this. Alas, as stated before, this didn't work, so instead I made one large graph with every month on it; however, it is hard to interpret the data from it. I also decided to learn how to create a cmap, something I somehow hadn't done before but which turned out to be rather easy. The colour scheme still needs work, but it does the job. The code for this and the cmap can be found here.
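For anyone curious, the cmap idea boils down to something like this (the colours and numbers below are placeholders rather than my final scheme or data):

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import LinearSegmentedColormap

# Build a colormap from a list of colours
my_cmap = LinearSegmentedColormap.from_list('spending', ['#2b83ba', '#ffffbf', '#d7191c'])

# Example use: give each of 12 bars a colour sampled along the gradient
months = np.arange(12)
heights = np.random.rand(12) * 100   # placeholder spend per month
plt.bar(months, heights, color=my_cmap(months / months.max()))
plt.show()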
I also wanted to combine seeing which categories I spent on with an overall spend, so I decided to create a stacked bar graph. I followed a YouTube tutorial for this and edited it for my own needs, along with some prior knowledge from week one's project. The code for this is far larger than expected and the colour scheme is blinding, but it does the job... and my IDE was freezing because I had too many visualizations open.
I believe this can be done more easily with a loop; I'm going to investigate that and update afterwards, but for now the code looks like this.
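As for the loop idea, a sketch of how it might look (the categories and numbers are placeholders, not my actual code or data):

import numpy as np
import matplotlib.pyplot as plt

months = ['Jan', 'Feb', 'Mar', 'Apr']
# Placeholder spending per category per month
spending = {
    'Food':      [50, 60, 55, 40],
    'Games':     [20, 10, 70, 30],
    'Transport': [30, 25, 35, 20],
}

# Each category is drawn on top of the running total of the ones before it
bottom = np.zeros(len(months))
for category, values in spending.items():
    plt.bar(months, values, bottom=bottom, label=category)
    bottom += np.array(values)

plt.legend()
plt.title('Spending by category, stacked per month')
plt.show()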
Overall, I think this project helped me gain some skills with different colours and designs, as well as understand some more fundamental concepts of Python itself.
The notebook can be found here, and the raw data can be found here.
I hope you enjoyed reading. I hope to get into simple ML and some other types of graphs, as well as delving into Tableau and R, in the coming months.
Sorry, this one's nearly a week late. It's been a hectic few weeks with exams and work, so the data science side hustle, unfortunately, has been sitting in the wings. However, I have no exams left and only a few days of lectures, then two weeks off, meaning I can get on the grind once more! This means the first book review should be up next week; as a sneak preview, the book I will be reviewing is "The Art of Data Science".
Anyway, I digress; into the data we delve!
For this week I wanted to build on last week's data, so with my exams coming up I decided to compare the dataset of an exam week with that of a normal week. I wanted a bit more insight into some specific factors through visualization and simple mathematics, so I chose three key areas: study, sleep and leisure time, and how they compared between the weeks. Firstly, I modified the code from last week's visualization to get the two comparisons. The code for this week can be found here, and the code for last week can be found here.
On the left we can see the normal week and on the right the exam week. What we can see directly is that there is a lot more study going on in the exam week compared to the normal week, except for Fridays, where the constraints of part-time work limit the amount of study that can be done. Saturday and Sunday are similar between the weeks, except that the exam week has much more sleep (I somehow slept for 10+ hours each day, still not sure how). We can also see that leisure decreases in the exam week.
Next I decided to create normal (non-stacked) bar graphs to visualize how the data for the three specific areas stated before compares when viewed side by side. These are the graphs:
The code for these is relatively simple and just uses basic matplotlib bar plots.
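Something along these lines (the hours are placeholders, not my logged values):

import numpy as np
import matplotlib.pyplot as plt

days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
# Placeholder hours of study, not the real logged values
normal_week = [4, 3, 5, 4, 2, 3, 6]
exam_week = [7, 6, 8, 7, 3, 9, 4]

positions = np.arange(len(days))
width = 0.4
plt.bar(positions - width / 2, normal_week, width, label='Normal week')
plt.bar(positions + width / 2, exam_week, width, label='Exam week')
plt.xticks(positions, days)
plt.ylabel('Hours of study')
plt.legend()
plt.show()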
I next created two data frames to more easily see my total time spent on each activity:
print("Normal week : ",df2.sum(axis=0))
print("Exam week: ",df.sum(axis=0))
Next I wanted to find the average time spent on each activity, so I divided each activity's column (such as "Studies") by 7 (the days of the week), summed the divided data points, and rounded to two decimal places. The code can be found here; it is all very similar, so it does not need to be pasted for each set of data (a rough sketch of the idea follows the summary below). I then summarized my findings as:
The 7 day average of time spent studying on an exam week is 6.29 hours
The 7 day average of time spent studying on a normal week is 3.86 hours
The 7 day average of time spent on leisure on an exam week is 4.86 hours
The 7 day average of time spent on leisure on a normal week is 6.79 hours
The 7 day average of time spent sleeping on an exam week is 7.57 hours
The 7 day average of time spent sleeping on a normal week is 6.79 hours
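As promised above, here is a rough sketch of the averaging step, assuming a data frame with one row per day and a column per activity (the hours are illustrative, chosen only to land near the exam-week averages above):

import pandas as pd

# One row per day of the exam week; illustrative hours only
df = pd.DataFrame({
    'Studies': [7, 6, 8, 7, 3, 6, 7],
    'Leisure': [5, 5, 4, 5, 6, 4, 5],
    'Sleep':   [7, 7, 7, 7, 7, 9, 9],
})

# Divide each day's value by 7, sum across the week, round to two decimal places
study_average = round((df['Studies'] / 7).sum(), 2)
print('The 7 day average of time spent studying on an exam week is', study_average, 'hours')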
From this you can make some inferences: for example, the percentage difference between studying in an exam week and a normal week is on average 47.9% ((difference / mean) x 100). This could be done for all of them, but this is data science, not maths, so you know we're going to be using a visualization.
The first visualization I tried was in matplotlib, but I ran into numerous problems and the code felt inefficient for such a small amount of data. So I investigated different methods of data visualization. I considered seaborn, but in the end I came across plotly, and with very little, very readable code I could create graphs quickly. Using plotly I created three graphs that make it very easy to see the differences between activities.
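The plotly code was roughly this shape (offline mode, with placeholder weekly totals rather than my exact figures):

import plotly.graph_objs as go
import plotly.offline as pyo

# Placeholder weekly totals in hours, not my exact figures
activities = ['Study', 'Leisure', 'Sleep']
data = [
    go.Bar(x=activities, y=[44, 34, 53], name='Exam week'),
    go.Bar(x=activities, y=[27, 48, 48], name='Normal week'),
]
layout = go.Layout(title='Total hours per activity', barmode='group')
pyo.plot(go.Figure(data=data, layout=layout), filename='week_comparison.html')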
Overall, this week I feel I learnt how to create graphs with matplotlib more independently, although it still needs a lot of work as I know the visualization potential is much greater. I also found that plotly is a great web-based tool for creating simple visualizations very quickly, and for simple projects like this I will be using it more. I also feel I have a more effective grasp of some core Python functions and how to lay them out.
All the code can be found here as an ipynb, or here as raw data.
And be sure to check back sometime in the next week to follow along with the journey.
For week one's project, I wanted to start off easy. I got this idea in early October from a Reddit thread: I wanted to see how I spent my time, so I decided to try something similar. After creating a Google Sheet, I tried to make a percentage bar graph of my time spent. From reading the Reddit thread I took some points for improvement over the original and decided to keep sleep time in. The original visualization seemed to be made in Sheets, so I gave that a go and the results were... close to catastrophic. I couldn't get anything to work at all and the data just wouldn't cooperate, so I gave up on it and put it to the side.
After a few months and a deeper understanding of Python, I decided to come back to it. A quick search for percentage stacked bar charts in Python turned up some great base code. All credit for that code goes to Chris Albon and his great blog, which I intend to read more of when I get further into machine learning. I modified the code to suit my needs: mostly creating more dictionaries, adding some more bars to the chart, and making sure the percentages worked. After all this I had some great working code, built on matplotlib and numpy arrays, that looked like this:
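A stripped-down sketch of the same idea (the activities and hours are placeholders, and the pattern follows Chris Albon's approach rather than my exact code):

import numpy as np
import matplotlib.pyplot as plt

days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri']
# Placeholder hours per activity per day, not the real logged data
raw = {
    'Sleep':   np.array([8.0, 7.0, 7.0, 6.0, 9.0]),
    'Study':   np.array([5.0, 6.0, 4.0, 7.0, 2.0]),
    'Leisure': np.array([4.0, 4.0, 6.0, 4.0, 8.0]),
}

# Convert each day's hours into a percentage of that day's tracked total,
# then stack the bars so each day adds up to 100%
totals = sum(raw.values())
bottom = np.zeros(len(days))
for activity, hours in raw.items():
    percent = hours / totals * 100
    plt.bar(days, percent, bottom=bottom, label=activity)
    bottom += percent

plt.ylabel('% of tracked time')
plt.legend()
plt.show()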
You can find the code here