GOAL : Install Python and learn the basic operations.
CH1 : Learning how to use NumPy.
CH2 : Dataset loading and creation.
Learning experience:
CH1 taught a lot of NumPy functions, building a solid foundation for future learning.
CH2 taught scikit-learn, pandas, SQL, and S3 approaches to loading and creating data or datasets.
Working environment:
OS: Windows 11 Home
CPU: Intel i9-13900K
GPU: NVIDIA RTX 4090
Python version: 3.12.2
Development environment: Jupyter Notebook
1.0 Introduction
This chapter covers the most common NumPy operations we’re likely to run into while working on machine learning workflows.
1.1 Creating a Vector
This section creates a vector using NumPy.
np.array : a NumPy function for creating an array.
The code above generates two vectors, vector_row (horizontal) and vector_column (vertical); the output is shown below.
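A minimal sketch of the two vectors (the values are illustrative):

import numpy as np

# Create a vector as a row
vector_row = np.array([1, 2, 3])

# Create a vector as a column
vector_column = np.array([[1],
                          [2],
                          [3]])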
Note : if you see "No module named 'numpy'", run pip install numpy .
1.2 Creating a Matrix
This section creates a matrix, again with NumPy.
The code above generates a matrix with three rows and two columns (a column of 1s and a column of 2s); the output is shown below.
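A minimal sketch of the matrix described above:

import numpy as np

# Three rows, two columns: a column of 1s and a column of 2s
matrix = np.array([[1, 2],
                   [1, 2],
                   [1, 2]])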
Another way to create a matrix with NumPy is np.mat. However, the matrix data structure is not recommended for two reasons: first, arrays are the de facto standard data structure of NumPy; second, the vast majority of NumPy operations return arrays, not matrix objects.
1.3 Creating a Sparse Matrix
This section creates a sparse matrix and represents it efficiently.
sparse.csr_matrix : a SciPy class that stores a matrix in compressed sparse row (CSR) format.
In 15 creates a matrix and converts it with sparse.csr_matrix; the output shown in In 17 means that row 1, column 1 of the matrix is 1, and row 2, column 0 is 3.
In 18 creates a much larger matrix, yet sparse.csr_matrix produces the same output. That is, adding zero elements did not change the size of the sparse representation.
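A minimal sketch matching the description above:

import numpy as np
from scipy import sparse

# A matrix that is mostly zeros
matrix = np.array([[0, 0],
                   [0, 1],
                   [3, 0]])

# CSR format stores only the nonzero elements and their positions
matrix_sparse = sparse.csr_matrix(matrix)
print(matrix_sparse)
#   (1, 1)    1
#   (2, 0)    3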
Note : if you see "No module named 'scipy'", run pip install scipy .
1.4 Preallocating NumPy Arrays
This section preallocates arrays of a given size filled with some value.
np.zeros : numpy.zeros(shape, dtype=float, order='C', *, like=None) [2]
np.full : numpy.full(shape, fill_value, dtype=None, order='C', *, like=None) [3]
In 21 creates a vector of shape (1,5) with all zeros; the output is shown below.
In 22 creates a matrix of shape (3,3) filled with the value 1; the output is shown below.
In 25 creates a matrix of shape (4,4) filled with the value 2; the output is shown below.
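A minimal sketch of the three cells described above:

import numpy as np

vector = np.zeros((1, 5))         # 1x5 vector of zeros
matrix_ones = np.full((3, 3), 1)  # 3x3 matrix filled with 1
matrix_twos = np.full((4, 4), 2)  # 4x4 matrix filled with 2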
1.5 Selecting Elements
This section selects one or more elements in a vector or matrix.
In 26 creates a vector and a matrix.
In 27~34 show the selection results (see the sketch after the note below):
vector[2] is 3 ,
matrix[1,1] is 5 ,
vector[:] selects the entire vector ,
vector[:3] selects everything up to and including the third element ,
vector[3:] selects everything after the third element ,
vector[-1] selects the last element ,
vector[::-1] reverses the vector ,
matrix[:2,:] selects the first two rows and all columns of the matrix ,
matrix[:,1:2] selects all rows and the second column .
Note: Like most things in Python, NumPy arrays are zero-indexed, meaning that the index of the first element is 0, not 1.
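A minimal sketch of these selections (the array values match the results quoted above):

import numpy as np

vector = np.array([1, 2, 3, 4, 5, 6])
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

vector[2]       # 3
matrix[1, 1]    # 5
vector[:]       # the whole vector
vector[:3]      # first three elements
vector[3:]      # everything after the third element
vector[-1]      # last element
vector[::-1]    # reversed vector
matrix[:2, :]   # first two rows, all columns
matrix[:, 1:2]  # all rows, second column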
1.6 Describing a Matrix
This section describes the shape, size, and dimensions of a matrix.
shape : shows the matrix shape (rows, columns) .
size : shows the number of elements in the matrix .
ndim : shows the number of dimensions of the matrix .
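A minimal sketch (the matrix values are illustrative):

import numpy as np

matrix = np.array([[1, 2, 3, 4],
                   [5, 6, 7, 8],
                   [9, 10, 11, 12]])

matrix.shape  # (3, 4)
matrix.size   # 12
matrix.ndim   # 2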
1.7 Applying Functions over Each Element
This section applies a function to all elements in an array.
np.vectorize : class numpy.vectorize(pyfunc=np._NoValue, otypes=None, doc=None, excluded=None, cache=False, signature=None) [4]
The term "lambda" in programming refers to an anonymous function, meaning a function without a name. In Python, lambda functions are defined using the keyword "lambda" followed by the parameters and the expression to be evaluated.
In the context of the expression add_100 = lambda i: i + 100, it's creating a lambda function that takes one argument i and returns the result of adding 100 to it. This lambda function is then assigned to the variable add_100. [5]
In 48 create a 3x3 matrix , and create a function(add_100) that adds 100 to something , and apply function to all elements in matrix .
In 49 using broadcasting , but broadcasting does not work for all shapes and situations, but it is a common way of applying simple operations over all elements of a NumPy array.
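A minimal sketch of both approaches:

import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

# Vectorize the lambda and apply it to every element
add_100 = lambda i: i + 100
vectorized_add_100 = np.vectorize(add_100)
vectorized_add_100(matrix)

# Broadcasting gives the same result more directly
matrix + 100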
1.8 Finding the Maximum and Minimum Values
This section finds the maximum and minimum values in a matrix.
np.max : find the max element .
np.min : find the min element .
In 50 creates a matrix and uses the max and min functions to find the largest and smallest elements in the matrix. In 52 adds the axis parameter to find the maximum in each column, and In 53 uses axis = 1 to find the maximum in each row.
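A minimal sketch (the matrix values are illustrative):

import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

np.max(matrix)          # 9
np.min(matrix)          # 1
np.max(matrix, axis=0)  # maximum in each column
np.max(matrix, axis=1)  # maximum in each row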
Note : setting axis = 2 on a 2-D array raises AxisError: axis 2 is out of bounds for array of dimension 2 .
1.9 Calculating the Average, Variance, and Standard Deviation
This section calculates the average, variance, and standard deviation of an array.
np.mean : find the mean of all the matrix elements .
np.var : find the variance of all the matrix elements .
np.std : find the standard deviation of all the matrix elements .
In 56 creates a matrix and uses mean, var, and std to calculate those values.
In 59 shows that the axis parameter can likewise find the mean of each column or row.
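A minimal sketch (the matrix values are illustrative):

import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

np.mean(matrix)          # mean of all elements
np.var(matrix)           # variance
np.std(matrix)           # standard deviation
np.mean(matrix, axis=0)  # mean of each column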
1.10 Reshaping Arrays
This section reshapes an array without changing the element values.
matrix.reshape : matrix.reshape(shape, order='C') [6]
In 60 creates a 4x3 matrix and uses the reshape function to change it to 2x6 without changing the element values; the only requirement is that the original and new matrix contain the same number of elements. In 62, -1 means "as many as needed", so reshape(1, -1) means one row and as many columns as needed. In 63, if we provide a single integer, reshape returns a one-dimensional array of that length. In 64 reshapes back to 4x3, and we can see that no value has changed.
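A minimal sketch of these cells:

import numpy as np

matrix = np.arange(1, 13).reshape(4, 3)  # 4x3 matrix of 1..12
matrix.reshape(2, 6)   # same 12 elements in a 2x6 shape
matrix.reshape(1, -1)  # one row, as many columns as needed
matrix.reshape(12)     # one-dimensional array of length 12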
1.11 Transposing a Vector or Matrix
This section transposes a vector or matrix.
matrix.T : Returns the transpose of the matrix. Does not conjugate! For the complex conjugate transpose, use .H. [7]
In 65 creates a matrix and uses the T attribute to transpose it; the output is shown below. In 66, technically, a vector can't be transposed because it is just a collection of values. In 67, however, it is common to refer to transposing a vector as converting a row vector to a column vector (notice the second pair of brackets) or vice versa.
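A minimal sketch (the values are illustrative):

import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

matrix.T                          # transpose of the matrix
np.array([1, 2, 3, 4, 5, 6]).T    # a 1-D vector is unchanged
np.array([[1, 2, 3, 4, 5, 6]]).T  # a row vector becomes a column vector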
1.12 Flattening a Matrix
This section transforms a matrix into a one-dimensional array.
matrix.flatten : Return a flattened copy of the matrix. All N elements of the matrix are placed into a single row. [8]
np.ravel : Return a contiguous flattened array. A 1-D array, containing the elements of the input, is returned. A copy is made only if needed. [9]
In 68 creates a 3x3 matrix and uses flatten to produce a one-dimensional array; the output is shown below. In 69 shows another way to get a similar result. In 70 uses ravel: ravel operates on the original object where possible and is therefore slightly faster, and it also lets us flatten lists of arrays, which we can't do with the flatten method. This operation is useful for flattening very large arrays and speeding up code, and we can see the two matrices a and b flattened into one dimension.
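A minimal sketch of all three approaches:

import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

matrix.flatten()       # flattened copy as a 1-D array
matrix.reshape(1, -1)  # similar result as a 1xN array

# ravel can also flatten a list of arrays
matrix_a = np.array([[1, 2], [3, 4]])
matrix_b = np.array([[5, 6], [7, 8]])
np.ravel([matrix_a, matrix_b])  # array([1, 2, 3, 4, 5, 6, 7, 8])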
1.13 Finding the Rank of a Matrix
This section finds the rank of a matrix.
np.linalg.matrix_rank : Return matrix rank of array using SVD method . Rank of the array is the number of singular values of the array that are greater than tol. [10]
In 74 finds the rank of the matrix in a single call.
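A minimal sketch (the matrix values are illustrative; two identical columns make the rank 2):

import numpy as np

matrix = np.array([[1, 1, 1],
                   [1, 1, 10],
                   [1, 1, 15]])

np.linalg.matrix_rank(matrix)  # 2, since the first two columns are identical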
1.14 Getting the Diagonal of a Matrix
This section gets the diagonal of a matrix.
matrix.diagonal() : numpy.diagonal(a, offset=0, axis1=0, axis2=1) , Return specified diagonals. [11]
In 75 finds the diagonal elements with matrix.diagonal(). In 76~77 adds the offset parameter to get the diagonal one above or one below the main diagonal. The output is shown below.
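A minimal sketch (the matrix values are illustrative):

import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

matrix.diagonal()           # main diagonal: [1, 5, 9]
matrix.diagonal(offset=1)   # one above the main diagonal: [2, 6]
matrix.diagonal(offset=-1)  # one below the main diagonal: [4, 8]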
1.15 Calculating the Trace of a Matrix
This section calculates the trace of a matrix.
matrix.trace() : numpy.trace(a, offset=0, axis1=0, axis2=1, dtype=None, out=None) , Return the sum along diagonals of the array. [12]
In 78 uses trace to find the trace of the matrix; In 79 gets the same result by summing the diagonal.
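A minimal sketch (the matrix values are illustrative):

import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

matrix.trace()          # 15
sum(matrix.diagonal())  # equivalent: 1 + 5 + 9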
1.16 Calculating Dot Products
This section will calculate the dot product of two vectors.
np.dot : numpy.dot(a, b, out=None) , Dot product of two arrays. [13]
In 80 creates two vectors and uses np.dot to calculate their dot product; the output is shown below. In 81, the @ operator (Python 3.5+) gives the same result.
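A minimal sketch (the vector values are illustrative):

import numpy as np

vector_a = np.array([1, 2, 3])
vector_b = np.array([4, 5, 6])

np.dot(vector_a, vector_b)  # 32
vector_a @ vector_b         # same result (Python 3.5+)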
1.17 Adding and Subtracting Matrices
This section will add or subtract two matrices .
np.add : numpy.add(x1, x2, /, out=None, *, where=True, casting='same_kind', order='K', dtype=None, subok=True[, signature, extobj]) = <ufunc 'add'> , Add arguments element-wise. [14]
np.subtract : numpy.subtract(x1, x2, /, out=None, *, where=True, casting='same_kind', order='K', dtype=None, subok=True[, signature, extobj]) = <ufunc 'subtract'> , Subtract arguments, element-wise. [15]
In 82 creates two matrices and uses the np.add and np.subtract functions to calculate the results; the output is shown below. In 84, alternatively, we can simply use the + and - operators; the output is shown below.
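A minimal sketch (the matrix values are illustrative):

import numpy as np

matrix_a = np.array([[1, 1], [2, 2]])
matrix_b = np.array([[3, 3], [4, 4]])

np.add(matrix_a, matrix_b)       # element-wise addition
np.subtract(matrix_a, matrix_b)  # element-wise subtraction
matrix_a + matrix_b              # equivalent operators
matrix_a - matrix_b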
1.18 Multiplying Matrices
This section will multiply two matrices.
In 85 creates two matrices and uses np.dot to multiply them; the output is shown below. In 86, the "@" operator does the same. In 87, the "*" operator instead multiplies the two matrices element-wise; the output is shown below.
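A minimal sketch (the matrix values are illustrative):

import numpy as np

matrix_a = np.array([[1, 1], [1, 2]])
matrix_b = np.array([[1, 3], [1, 2]])

np.dot(matrix_a, matrix_b)  # matrix multiplication
matrix_a @ matrix_b         # same result
matrix_a * matrix_b         # element-wise multiplication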
1.19 Inverting a Matrix
This section will calculate the inverse of a square matrix.
np.linalg.inv : linalg.inv(a) computes the (multiplicative) inverse of a matrix. [16]
In 88 creates a matrix and uses linalg.inv to calculate its inverse; the output is shown below. In 89 verifies the result by multiplying the matrix by its inverse; the output is the identity matrix, shown below.
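A minimal sketch (the matrix values are illustrative):

import numpy as np

matrix = np.array([[1, 4],
                   [2, 5]])

inverse = np.linalg.inv(matrix)
matrix @ inverse  # identity matrix (up to floating-point error)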
1.20 Generating Random Values
This section will generate pseudorandom values .
np.random.seed : Reseed the singleton RandomState instance. [17]
np.random.random : Return random floats in the half-open interval [0.0, 1.0). Alias for random_sample to ease forward-porting to the new random API. [18]
np.random.randint : Return random integers from low (inclusive) to high (exclusive). Return random integers from the “discrete uniform” distribution of the specified dtype in the “half-open” interval [low, high). If high is None (the default), then results are from [0, low). [19]
np.random.normal : Draw random samples from a normal (Gaussian) distribution. [20]
np.random.logistic : Draw samples from a logistic distribution. [21]
np.random.uniform : Draw samples from a uniform distribution. [22]
In 90 sets a random seed and generates three random floats between 0.0 and 1.0; the output is shown below. In 92 generates three random integers between 0 and 10. In 93 draws three numbers from a normal distribution with mean 0.0 and standard deviation 1.0. In 94 draws three numbers from a logistic distribution with mean 0.0 and scale 1.0. In 95 draws three numbers greater than or equal to 1.0 and less than 2.0. The output is shown below.
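A minimal sketch of these cells:

import numpy as np

np.random.seed(0)                # fix the seed for reproducibility
np.random.random(3)              # three floats in [0.0, 1.0)
np.random.randint(0, 11, 3)      # three integers between 0 and 10
np.random.normal(0.0, 1.0, 3)    # normal distribution, mean 0.0, std 1.0
np.random.logistic(0.0, 1.0, 3)  # logistic distribution, loc 0.0, scale 1.0
np.random.uniform(1.0, 2.0, 3)   # three floats in [1.0, 2.0)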
Note : the same seed always generates the same output, so we can set a random seed whenever we want reproducible results.
2.0 Introduction
The first step in any machine learning endeavor is to get the raw data into our system. This chapter covers loading data from various sources such as CSV files and SQL databases, as well as generating simulated data for experimentation. We primarily focus on the pandas library for loading external data and scikit-learn for generating simulated data.
2.1 Loading a Sample Dataset
This section will load a preexisting sample dataset from the scikit-learn library.
datasets.load_digits() : sklearn.datasets.load_digits(*, n_class=10, return_X_y=False, as_frame=False) , Load and return the digits dataset (classification). Each datapoint is a 8x8 image of a digit. [23]
In 3 loads a dataset from scikit-learn using the datasets module, creates a feature matrix and target vector, and views the feature data of the first observation; the output is shown below. The output is a one-dimensional array of 64 elements representing a handwritten digit: each element is the brightness value of one pixel, ranging from 0 to 15, arranged row-wise from the top-left corner, moving left to right and top to bottom.
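A minimal sketch of the cell described above:

from sklearn import datasets

digits = datasets.load_digits()
features = digits.data  # feature matrix
target = digits.target  # target vector
features[0]             # 64 pixel intensities (0-15) of the first digit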
Two popular sample datasets in scikit-learn are:
load_iris
Contains 150 observations on the measurements of iris flowers. It is a good dataset for exploring classification algorithms.
load_digits
Contains 1,797 observations from images of handwritten digits. It is a good dataset for teaching image classification.
To see more details on any of these datasets, print the DESCR attribute (In 7).
Note : if you see "No module named 'sklearn'", run pip install scikit-learn .
2.2 Creating a Simulated Dataset
This section will generate a dataset of simulated data.
make_regression : sklearn.datasets.make_regression(n_samples=100, n_features=100, *, n_informative=10, n_targets=1, bias=0.0, effective_rank=None, tail_strength=0.5, noise=0.0, shuffle=True, coef=False, random_state=None) , Generate a random regression problem.
The input set can either be well conditioned (by default) or have a low rank-fat tail singular profile. See make_low_rank_matrix for more details.
The output is generated by applying a (potentially biased) random linear regression model with n_informative nonzero regressors to the previously generated input and some gaussian centered noise with some adjustable scale. [24]
make_classification : sklearn.datasets.make_classification(n_samples=100, n_features=20, *, n_informative=2, n_redundant=2, n_repeated=0, n_classes=2, n_clusters_per_class=2, weights=None, flip_y=0.01, class_sep=1.0, hypercube=True, shift=0.0, scale=1.0, shuffle=True, random_state=None) , Generate a random n-class classification problem. [25]
make_blobs : sklearn.datasets.make_blobs(n_samples=100, n_features=2, *, centers=None, cluster_std=1.0, center_box=(-10.0, 10.0), shuffle=True, random_state=None, return_centers=False) , Generate isotropic Gaussian blobs for clustering. [26]
In 8 creates a simulated dataset for a regression problem, including a feature matrix, a target vector, and the true coefficients, using the following parameters:
n_samples: The number of samples to generate.
n_features: The number of features, i.e., the number of columns in the feature matrix.
n_informative: The number of informative features, which have a true effect on the target values.
n_targets: The number of target values, typically 1.
noise: The standard deviation of random noise added to the target values.
coef: If set to True, returns the true regression coefficients used in generating the simulated data.
random_state: The seed for the random number generation to ensure reproducibility of results. [5]
After generating the feature matrix and target vector, the print function displays the feature matrix and target vector for the first three samples. The output is shown below.
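A minimal sketch of the make_regression call (the parameter values are illustrative):

from sklearn.datasets import make_regression

features, target, coefficients = make_regression(n_samples=100,
                                                 n_features=3,
                                                 n_informative=3,
                                                 n_targets=1,
                                                 noise=0.0,
                                                 coef=True,
                                                 random_state=1)
print('Feature Matrix\n', features[:3])
print('Target Vector\n', target[:3])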
In 9 creates simulated data for a binary classification problem, including a feature matrix and a target vector, using the following parameters:
n_samples: The number of samples to generate.
n_features: The number of features, i.e., the number of columns in the feature matrix.
n_informative: The number of informative features, which have a true effect on the target values.
n_redundant: The number of redundant features, which are derived from the informative features.
n_classes: The number of classes for the target values.
weights: The weights for each class, used to control the relative proportions of each class in the generated dataset.
random_state: The seed for the random number generation to ensure reproducibility of results. [5]
After generating the feature matrix and target vector, the print function displays the feature matrix and target vector for the first three samples. The output is shown below.
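A minimal sketch of the make_classification call (the parameter values are illustrative):

from sklearn.datasets import make_classification

features, target = make_classification(n_samples=100,
                                       n_features=3,
                                       n_informative=3,
                                       n_redundant=0,
                                       n_classes=2,
                                       weights=[.25, .75],
                                       random_state=1)
print('Feature Matrix\n', features[:3])
print('Target Vector\n', target[:3])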
In 10 creates simulated data containing multiple clusters, each centered on a Gaussian distribution, using the following parameters:
n_samples: The number of samples to generate.
n_features: The number of features, i.e., the number of columns in the feature matrix.
centers: The number or positions of the cluster centers.
cluster_std: The standard deviation within each cluster, controlling the spread of data within each cluster.
shuffle: Whether to shuffle the generated samples.
random_state: The seed for the random number generation to ensure reproducibility of results. [5]
After generating the feature matrix and target vector, the print function displays the feature matrix and target vector for the first three samples. The output is shown below.
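A minimal sketch of the make_blobs call (the parameter values are illustrative):

from sklearn.datasets import make_blobs

features, target = make_blobs(n_samples=100,
                              n_features=2,
                              centers=3,
                              cluster_std=0.5,
                              shuffle=True,
                              random_state=1)
print('Feature Matrix\n', features[:3])
print('Target Vector\n', target[:3])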
In 13 uses the scatter function from the Matplotlib library to create a scatter plot. Each sample in the feature matrix is represented as a point in a two-dimensional plane, colored according to its classification label in the target vector. Each row of the feature matrix is one sample, where the first and second columns give the x and y coordinates, and the target vector determines the cluster or class of each sample. Finally, plt.show() displays the scatter plot. The output is shown below.
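A minimal sketch of the plot (regenerating the blobs so the snippet is self-contained):

import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

features, target = make_blobs(n_samples=100, n_features=2, centers=3,
                              cluster_std=0.5, random_state=1)

# Color each point by its cluster label
plt.scatter(features[:, 0], features[:, 1], c=target)
plt.show()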
NOTE : if you see "No module named 'matplotlib'", run pip install matplotlib .
2.3 Loading a CSV File
This section imports a comma-separated values (CSV) file.
pd.read_csv : Read a comma-separated values (csv) file into a DataFrame. [27]
In 16 uses the read_csv function from the pandas library to load a dataset from a specified URL, then uses the head function to view the first two rows of the dataset.
Before loading CSV files, it's helpful to preview their contents to understand the structure and necessary parameters for loading.
The read_csv function in Pandas offers over 30 parameters, primarily designed to handle various CSV formats.
CSV files are typically comma-separated, but can also use other delimiters like tabs (TSV files), specified by the sep parameter.
Often, the first line of a CSV file contains column headers, which can be specified using the header parameter. If absent, header=None is used.
The read_csv function returns a pandas DataFrame, a versatile and widely used object for analyzing tabular data. The output is shown below.
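A minimal sketch (the URL is a placeholder for any CSV location):

import pandas as pd

url = 'https://example.com/data.csv'  # placeholder CSV URL
dataframe = pd.read_csv(url)
dataframe.head(2)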
NOTE : if you see "No module named 'pandas'", run pip install pandas .
2.4 Loading an Excel File
This section is going to import an Excel spreadsheet.
pd.read_excel : Read an Excel file into a pandas DataFrame. [28]
In 20 uses the read_excel function from the pandas library to load an Excel file from a specified URL into a DataFrame named dataframe. The parameter sheet_name=0 gives the index of the sheet to load (starting from 0), meaning the first sheet, and header=0 says the first row of the file holds the column names. These parameters ensure the file is loaded correctly. Finally, head(3) views the first three rows of the DataFrame, allowing a quick inspection of the structure and content of the data. The output is shown below.
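A minimal sketch (the URL is a placeholder for any Excel file location):

import pandas as pd

url = 'https://example.com/data.xlsx'  # placeholder Excel URL
dataframe = pd.read_excel(url, sheet_name=0, header=0)
dataframe.head(3)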
NOTE : if you see "No module named 'openpyxl'", run pip install openpyxl .
2.5 Loading a JSON File
This section will load a JSON file for data preprocessing.
pd.read_json : Convert a JSON string to pandas object. [29]
In 21 uses the read_json function from the pandas library to load a JSON file from a specified URL into a DataFrame named dataframe. The parameter orient='columns' specifies the orientation of the JSON data, telling pandas to treat the JSON keys as column labels. This parameter ensures the file is loaded correctly. Finally, head(2) views the first two rows of the DataFrame, allowing a quick inspection of the structure and content of the data. The output is shown below.
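A minimal sketch (the URL is a placeholder for any JSON file location):

import pandas as pd

url = 'https://example.com/data.json'  # placeholder JSON URL
dataframe = pd.read_json(url, orient='columns')
dataframe.head(2)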
NOTE : The orient parameter might take some experimenting to figure out which argument (split, records, index, columns, or values) is the right one.
2.6 Loading a Parquet File
This section will load a Parquet file.
pd.read_parquet : Load a parquet object from the file path, returning a DataFrame. [30]
In 25 uses the read_parquet function from the pandas library to load a Parquet file from a specified URL into a DataFrame named dataframe. The only argument is the url giving the location of the Parquet file; read_parquet identifies and loads its contents automatically. Finally, head(2) views the first two rows of the DataFrame, allowing a quick inspection of the structure and content of the data. The output is shown below.
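A minimal sketch (the URL is a placeholder for any Parquet file location):

import pandas as pd

url = 'https://example.com/data.parquet'  # placeholder Parquet URL
dataframe = pd.read_parquet(url)
dataframe.head(2)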
NOTE : if you see "Missing optional dependency 'pyarrow'. pyarrow is required for parquet support", run pip install pyarrow (or pip install fastparquet to use the fastparquet engine instead).
Parquet is a popular data storage format in the big data space. It is often used with big data tools such as Hadoop and Spark.
2.7 Loading an Avro File
This section will load an Avro file into a pandas DataFrame.
pdx.read_avro : Read the records from Avro file and fit them into pandas DataFrame using fastavro. [31]
In 28 first uses the get function from the requests library to download an Avro file from a specified URL, then saves it to the local file system with open. Next, the read_avro function from the pandavro library reads the downloaded Avro file into a DataFrame named dataframe. Finally, head(2) views the first two rows of the DataFrame, allowing a quick inspection of the structure and content of the data. The output is shown below.
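A minimal sketch (the URL is a placeholder for any Avro file location):

import requests
import pandavro as pdx

url = 'https://example.com/data.avro'  # placeholder Avro URL
r = requests.get(url)
with open('data.avro', 'wb') as f:
    f.write(r.content)

dataframe = pdx.read_avro('data.avro')
dataframe.head(2)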
NOTE : if you see "No module named 'pandavro'", run pip install pandavro .
If you work with large data systems, you’re likely to run into one of these formats in the near future.
2.8 Querying a SQLite Database
This section will load data from a database using structured query language (SQL).
create_engine() : Create a connection to the database . [32]
pd.read_sql_query : Read SQL query into a DataFrame. [33]
In 36, as the book [1] notes, this is one of a few recipes that will not run without extra code: create_engine('sqlite:///sample.db') assumes the SQLite database sample.db already exists. I hadn't created that file, so it failed.
So in In 35 I tried to find the file location and opened the file; it was empty, so I used ChatGPT to help me create it. In 39 is the code ChatGPT provided. First, the create_engine function from the SQLAlchemy library creates a connection to an SQLite database and assigns it to the variable database_connection. Then, the pandas DataFrame function creates a DataFrame containing the data to be inserted. Next, the data is inserted into a database table named "data"; if the table already exists it is replaced (if_exists='replace'), and otherwise a new table is created. Finally, the dispose method closes the connection to the database. In 40 then reproduces the same result as the book [1].
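A minimal sketch of that workaround (the table name and columns are illustrative):

import pandas as pd
from sqlalchemy import create_engine

# Create (or open) the SQLite database file sample.db
database_connection = create_engine('sqlite:///sample.db')

# Insert a small DataFrame into a table named "data"
df = pd.DataFrame({'first_name': ['Jason', 'Molly'],
                   'last_name': ['Miller', 'Jacobson']})
df.to_sql('data', database_connection, if_exists='replace', index=False)
database_connection.dispose()

# The book's recipe now works as written
engine = create_engine('sqlite:///sample.db')
dataframe = pd.read_sql_query('SELECT * FROM data', engine)
dataframe.head(2)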
NOTE : if you see "No module named 'sqlalchemy'", run pip install sqlalchemy .
2.9 Querying a Remote SQL Database
This section will connect to, and read from, a remote SQL database.
In 1 first uses the connect function from the PyMySQL library to establish a connection to a MySQL database, assigned to the variable conn; the hostname, username, password, and database name must be specified. Then, the read_sql function from the pandas library executes an SQL query and loads the result into a DataFrame named dataframe; here, the query "select * from data" retrieves all columns of the table named "data". Finally, head(2) views the first two rows of the DataFrame, allowing a quick inspection of the structure and content of the data. The output is shown below.
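A minimal sketch (the connection details are placeholders for a local or dockerized MySQL instance):

import pymysql
import pandas as pd

conn = pymysql.connect(host='localhost',  # placeholder connection details
                       user='root',
                       password='',
                       db='db')
dataframe = pd.read_sql('select * from data', conn)
dataframe.head(2)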
NOTE : a Docker container is needed to simulate the remote database. [34]
2.10 Loading Data from a Google Sheet
This section will read in data directly from a Google Sheet.
In 46 downloads a CSV export of a Google Sheet from a specified URL and reads it into a pandas DataFrame, displaying the first two rows of the data.
First, the read_csv function from the pandas library reads the CSV file from the specified URL into a DataFrame named dataframe.
Then, head(2) views the first two rows of the DataFrame, allowing a quick inspection of the structure and content of the data. The output is shown below.
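A minimal sketch (the sheet ID is a placeholder):

import pandas as pd

# /export?format=csv turns the sheet URL into a CSV download endpoint
url = 'https://docs.google.com/spreadsheets/d/<sheet-id>/export?format=csv'
dataframe = pd.read_csv(url)
dataframe.head(2)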
NOTE : The /export?format=csv query parameter at the end of the URL above creates an endpoint from which we can either download the file or read it into pandas.
2.11 Loading Data from an S3 Bucket
This section will read a CSV file from an S3 bucket you have access to.
In 49 first uses the read_csv function from the pandas library to read the CSV file from the specified S3 path into a DataFrame named dataframe. The storage_options parameter supplies the AWS credentials: the access key ID and the secret access key. Then, head(2) views the first two rows of the DataFrame, allowing a quick inspection of the structure and content of the data. The output is shown below.
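A minimal sketch (the S3 path and credentials are placeholders):

import pandas as pd

dataframe = pd.read_csv('s3://my-bucket/data.csv',  # placeholder S3 path
                        storage_options={'key': 'ACCESS_KEY_ID',
                                         'secret': 'SECRET_ACCESS_KEY'})
dataframe.head(2)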
NOTE : install s3fs to access S3 : pip install s3fs .
It's worth noting that public objects also have HTTP URLs from which the files can be downloaded directly, such as the one for this CSV file.
2.12 Loading Unstructured Data
This section will load unstructured data like text or images.
In 50 first uses the get method from the requests library to download the text file from the specified URL, storing the response in the variable r. Then, a file text.txt is opened for writing ('wb' indicating binary mode) and the downloaded content is written into it with the write method. Next, text.txt is opened again for reading ('r' indicating read-only mode) and its content is read into the variable text with the read method. Finally, the print function prints out the content of text, i.e., the downloaded file.
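A minimal sketch (the URL is a placeholder for any text file location):

import requests

url = 'https://example.com/text.txt'  # placeholder text URL
r = requests.get(url)
with open('text.txt', 'wb') as f:
    f.write(r.content)
with open('text.txt', 'r') as f:
    text = f.read()
print(text)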
References :
[1] Machine Learning with Python Cookbook, 2nd ed., by Kyle Gallatin and Chris Albon, O'Reilly, 2023. Chapters 1 and 2.
[2] https://numpy.org/doc/stable/reference/generated/numpy.zeros.html#numpy-zeros
[3] https://numpy.org/doc/stable/reference/generated/numpy.full.html
[4] https://numpy.org/doc/stable/reference/generated/numpy.vectorize.html
[5] ChatGPT 3.5, https://chat.openai.com/
[6] https://numpy.org/doc/stable/reference/generated/numpy.matrix.reshape.html
[7] https://numpy.org/doc/stable/reference/generated/numpy.matrix.T.html
[8] https://numpy.org/doc/stable/reference/generated/numpy.matrix.flatten.html
[9] https://numpy.org/doc/stable/reference/generated/numpy.ravel.html
[10] https://numpy.org/doc/stable/reference/generated/numpy.linalg.matrix_rank.html
[11] https://numpy.org/doc/stable/reference/generated/numpy.diagonal.html
[12] https://numpy.org/doc/stable/reference/generated/numpy.trace.html
[13] https://numpy.org/doc/stable/reference/generated/numpy.dot.html
[14] https://numpy.org/doc/stable/reference/generated/numpy.add.html
[15] https://numpy.org/doc/stable/reference/generated/numpy.subtract.html
[16] https://numpy.org/doc/stable/reference/generated/numpy.linalg.inv.html
[17] https://numpy.org/doc/stable/reference/random/generated/numpy.random.seed.html
[18] https://numpy.org/doc/stable/reference/random/generated/numpy.random.random.html
[19] https://numpy.org/doc/stable/reference/random/generated/numpy.random.randint.html
[20] https://numpy.org/doc/stable/reference/random/generated/numpy.random.normal.html
[21] https://numpy.org/doc/stable/reference/random/generated/numpy.random.logistic.html
[22] https://numpy.org/doc/stable/reference/random/generated/numpy.random.uniform.html
[23] https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html
[24] https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_regression.html
[25] https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html
[26] https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html
[27] https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
[28] https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html
[29] https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html
[30] https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_parquet.html
[31] https://github.com/ynqa/pandavro
[32] https://docs.sqlalchemy.org/en/20/core/engines.html
[33] https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql_query.html