Data Science & Machine Learning

Click on the underlined text for the relevant YouTube videos, links, and Python projects.

Introduction to Statistical Learning with Python

ML algorithms such as boosting, support vector machines, and deep learning models can offer high flexibility and predictive accuracy when trained on complex datasets. However, as a model becomes more flexible, it often becomes less interpretable, which prevents us from deriving useful relations between the predictors and the predictions. In some practical applications we need exactly this information, i.e., we are interested in inference: understanding the relationship between the features and the target values. In that case, one might prefer simple, interpretable models for regression and classification tasks, since they capture such relationships more transparently.

To become more familiar with the Python data science stack, I have been working out solutions (see the GitHub repo) to the exercises in the Introduction to Statistical Learning with Python (ISLP) book. ISLP is well suited to this purpose: it takes an intuitive approach while keeping a good dose of theoretical and mathematical background for the material covered. So far the project covers the first four chapters of ISLP, which introduce many fundamental concepts in data science/ML within the context of interpretable models: diagnostic tools for assessing the goodness of fit of ML models, model tuning for the bias/variance trade-off, and Bayesian statistics and probability in the context of classification. The solutions use the standard Python data science libraries: numpy, pandas, statsmodels, scikit-learn, matplotlib, and seaborn.
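To make the idea of inference concrete, here is a minimal sketch (not taken from the repo) that fits a simple linear regression with numpy and computes the standard errors and t-statistics of the coefficients. The synthetic data, the true coefficients (1.0 and 2.0), and all variable names are assumptions chosen for illustration; the OLS summary in statsmodels reports the same quantities.

```python
import numpy as np

# Synthetic data (an assumption for illustration): y = 1 + 2*x + noise
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)

# Design matrix with an intercept column
X = np.column_stack([np.ones(n), x])

# Ordinary least squares estimate of the coefficients
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Unbiased estimate of the noise variance from the residuals
residuals = y - X @ beta
dof = n - X.shape[1]
sigma2 = residuals @ residuals / dof

# Standard errors come from the diagonal of sigma^2 * (X^T X)^{-1}
cov = sigma2 * np.linalg.inv(X.T @ X)
std_err = np.sqrt(np.diag(cov))

# A large |t| suggests the predictor is related to the target
t_stats = beta / std_err

for name, b, se, t in zip(["intercept", "slope"], beta, std_err, t_stats):
    print(f"{name}: estimate={b:.3f}, std_err={se:.3f}, t={t:.1f}")
```

The point of the exercise is that each coefficient has a direct reading (the expected change in the target per unit change in the predictor) together with a measure of its uncertainty, which is what a more flexible black-box model typically does not give us.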

Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a widely used statistical technique that extracts the most relevant components (the directions of largest variance) from a large dataset. For this reason, it is generally seen as a dimensionality reduction tool whose output can be fed into your favorite ML model. The Principal Component Pursuit (PCP) algorithm is a robust variant of PCA (sometimes referred to as robust PCA) that can handle data with corruption, outliers, or missing entries. Here is a short introductory Python notebook that discusses PCP, including its application to electricity prices in power markets, where it is used to separate the rare spikes (foreground) from the more relevant seasonal data (background).
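The low-rank plus sparse split behind PCP can be sketched in a few lines of numpy. The version below is one common way to solve PCP, an augmented Lagrangian iteration that alternates singular value thresholding with entrywise soft thresholding; it is not necessarily the implementation used in the notebook. The function names, the default parameters (lam = 1/sqrt(max(m, n)) and mu = m*n / (4 * ||M||_1), which are the commonly cited choices), and the synthetic test matrix are all assumptions for illustration; real electricity price data is not used here.

```python
import numpy as np

def shrink(X, tau):
    """Entrywise soft thresholding: pulls every entry toward zero by tau."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svd_threshold(X, tau):
    """Soft-threshold the singular values of X (promotes low rank)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * shrink(s, tau)) @ Vt

def pcp(M, max_iter=1000, tol=1e-7):
    """Split M into a low-rank part L and a sparse part S with M ~ L + S."""
    m, n = M.shape
    lam = 1.0 / np.sqrt(max(m, n))        # weight of the sparse term
    mu = m * n / (4.0 * np.abs(M).sum())  # penalty parameter
    S = np.zeros_like(M)
    Y = np.zeros_like(M)                  # Lagrange multipliers
    norm_M = np.linalg.norm(M)
    for _ in range(max_iter):
        L = svd_threshold(M - S + Y / mu, 1.0 / mu)
        S = shrink(M - L + Y / mu, lam / mu)
        residual = M - L - S
        Y = Y + mu * residual
        if np.linalg.norm(residual) <= tol * norm_M:
            break
    return L, S

# Synthetic example (an assumption): rank-2 "background" plus 5% large "spikes"
rng = np.random.default_rng(0)
size = 60
L_true = rng.normal(size=(size, 2)) @ rng.normal(size=(2, size))
S_true = np.zeros((size, size))
mask = rng.random((size, size)) < 0.05
S_true[mask] = 10.0 * rng.choice([-1.0, 1.0], size=mask.sum())
M = L_true + S_true

L_hat, S_hat = pcp(M)
rel_err = np.linalg.norm(L_hat - L_true) / np.linalg.norm(L_true)
print(f"relative error of the recovered low-rank part: {rel_err:.2e}")
```

In the electricity price setting described above, the rows and columns of M would hold the price observations (for example days by hours, which is an assumed layout), with L playing the role of the seasonal background and S the rare spikes.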

BigQuery and SQL

Structured Query Language, or SQL, is a powerful query language that enables data scientists and analysts to navigate large databases and extract useful insights from them. BigQuery is Google's serverless, cloud-based data warehouse service that lets you run SQL queries on big datasets. Using the Cloud Client Libraries for the BigQuery API, we can efficiently query large datasets from Python. I have been practicing SQL queries with BigQuery, authenticating with a service account key file. Check out this GitHub repo for this introductory project, which is based on Kaggle's SQL tutorials: Intro to SQL and Advanced SQL.
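As a rough illustration of that workflow, the sketch below builds a small aggregate query and runs it through the google-cloud-bigquery client, authenticating with a service account key file. The public table (bigquery-public-data.usa_names.usa_1910_2013), the helper names, and the key file path are assumptions for illustration and are not taken from the repo; running the query requires the google-cloud-bigquery package and valid credentials.

```python
def build_top_names_query(limit=5):
    """Return a SQL string for the most common names in an assumed public table."""
    return f"""
        SELECT name, SUM(number) AS total
        FROM `bigquery-public-data.usa_names.usa_1910_2013`
        GROUP BY name
        ORDER BY total DESC
        LIMIT {int(limit)}
    """

def run_query(sql, key_path):
    """Run sql on BigQuery using a service account key file at key_path."""
    # Imported here so the query builder above works without the package
    from google.cloud import bigquery

    client = bigquery.Client.from_service_account_json(key_path)
    rows = client.query(sql).result()  # blocks until the query job finishes
    return [dict(row.items()) for row in rows]

# Hypothetical usage (the key file name is a placeholder):
# for row in run_query(build_top_names_query(), "service-account-key.json"):
#     print(row)
```

Keeping the SQL in its own function makes it easy to inspect or test the query text without touching the network, while the client call stays in one place.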

Note: It turns out that VS Code has a nice extension for running BigQuery queries; see the following article for details.