Introducing Python libraries such as NumPy, Pandas, Matplotlib, and Seaborn

Loading Python libraries:

Several Python libraries are needed for storing, organizing, and visualizing the data:

  1. NumPy: 'import numpy as np'

  2. Pandas: 'import pandas as pd'

  3. Matplotlib: 'import matplotlib.pyplot as plt'

  4. Seaborn: 'import seaborn as sns'

Upload the dataset for use with NumPy arrays and Pandas dataframes:

To upload the Breast Cancer dataset file to Google Colab, click on Files.

Click on Upload to session storage and choose the Breast-Cancer-Data.csv file.

When the file has uploaded successfully, it should be visible in the tree of files and directories.

Load the file into a NumPy array or Pandas dataframe:

To load the file into a NumPy array, the 'genfromtxt' function can be used. The full call is:

'data = np.genfromtxt('Breast-Cancer-Data.csv',delimiter=',',skip_header=1)'

In the above command, the first argument is the filename and the second, 'delimiter', is the character that separates the values, which is a comma here. The 'skip_header' argument tells NumPy how many lines at the top of the file to skip as the header.
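As a minimal sketch of what these arguments do, the same call can be tried on a small in-memory text instead of the uploaded file. The column names and values below are made up for illustration and are not taken from Breast-Cancer-Data.csv:

```python
import io
import numpy as np

# Hypothetical stand-in for the CSV file: one header row and two data rows.
csv_text = "radius,texture,label\n17.99,10.38,1\n13.54,14.36,0\n"

# genfromtxt accepts a file-like object as well as a filename.
data = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1)

print(data.shape)  # two rows, three columns: the header row was skipped
print(data.dtype)  # float64 is the default data type
```

With 'skip_header=0' the header row would be parsed as data as well, and its text entries would show up as 'nan' values.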

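One problem with using NumPy arrays for datasets with mixed data types is that we need to set the data type to None ('dtype=None') so that NumPy infers a type for each column, and we also need to tell NumPy the number of header lines and the delimiter, as in the command above. Without 'dtype=None', 'genfromtxt' reads every column as a float, and non-numeric values become 'nan'. One of the best alternatives to NumPy arrays is the Pandas dataframe, which can load the data as: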

'df = pd.read_csv('Breast-Cancer-Data.csv')'

Compared with the NumPy command above, we only need to give the filename to the 'read_csv' function: it reads the header from the first line, infers the data type of each column, and uses a comma as the delimiter by default.
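The difference can be seen by loading the same mixed-type text both ways. This is only a sketch: the column names ('diagnosis', 'radius', 'texture') and the values are assumptions for illustration, not the actual contents of the dataset file:

```python
import io
import numpy as np
import pandas as pd

# Hypothetical mixed-type stand-in for the CSV file (text column plus numeric columns).
csv_text = "diagnosis,radius,texture\nM,17.99,10.38\nB,13.54,14.36\n"

# NumPy: dtype=None infers a type per column and names=True reads the header
# row as field names. The result is a 1-D structured array, not a 2-D array.
arr = np.genfromtxt(io.StringIO(csv_text), delimiter=",", dtype=None,
                    names=True, encoding="utf-8")
print(arr.dtype.names)  # field names taken from the header row
print(arr["radius"])    # a column is accessed by its field name

# Pandas: the header and the per-column types are picked up with no extra arguments.
df = pd.read_csv(io.StringIO(csv_text))
print(df.dtypes)
print(df["radius"].mean())
```

The NumPy version needs four extra arguments and returns a structured array, while the Pandas version returns an ordinary table with named columns.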

Matplotlib for scatter plot visualization:

Now we can plot the NumPy array created from the breast cancer file by using 'plt.plot(x,y,'o')', where x is the array of x-axis values, y is the array of y-axis values, and 'o' is the circle marker, which draws unconnected points as in a scatter plot.
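A minimal sketch of this call is shown here. The small array stands in for the one returned by 'genfromtxt', and its values and the choice of columns are assumptions for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical stand-in for the loaded array: rows are samples, columns are features.
data = np.array([[17.99, 10.38],
                 [13.54, 14.36],
                 [20.57, 17.77]])

# Transposing turns columns into rows, so each feature can be unpacked by position.
x, y = data.T

# "o" draws circle markers with no connecting line between the points.
lines = plt.plot(x, y, "o")
```

In Colab the figure is displayed below the cell once it finishes running.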

We can also use the Pandas dataframe to create the scatter plot.

In comparison to the NumPy array, columns can be selected by name, and the data is much easier to access because we don't need to transpose it ('data.T' creates the transpose of the data array). Note that this visualization is not complete, because we have not added the x and y labels or marked the class of each data point; we will see how to implement these in the next sections.
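The dataframe version of the plot could look like the following sketch. The dataframe here stands in for the one returned by 'read_csv', and the column names 'radius' and 'texture' are assumptions, so they should be replaced with names that actually appear in the file:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for the loaded dataframe; column names are assumptions.
df = pd.DataFrame({"radius": [17.99, 13.54, 20.57],
                   "texture": [10.38, 14.36, 17.77]})

# Columns are selected by name, so no transpose or positional indexing is needed.
lines = plt.plot(df["radius"], df["texture"], "o")
```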

The final notebook is available here: https://colab.research.google.com/drive/1M-_Rs0evA14Ewxwbym-doWbZMtSWMwnb?usp=sharing