SCHEDULE & CURRICULUM

SCHEDULE

SCHEDULE SUMMARY: The data science summer school runs from Monday, September 9th to Friday, September 20th. Each day follows a similar structure: an opening, a lecture or exercise, a coffee break, a second lecture or exercise, and a lunch break, followed by additional lectures, exercises, or competitions in the afternoon. The final days include guest lectures from experts in the field.

Schedule - Data Science Summer School 2024

*This schedule is subject to revision.

PREREQUISITE

To ensure a smooth learning experience, participants are expected to have basic knowledge of Python programming and data manipulation. Familiarity with concepts such as data types, control structures, and functions, as well as with libraries like NumPy and Pandas, will be beneficial.
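
As a quick self-check of these prerequisites, the short sketch below exercises basic Python, NumPy, and Pandas in a few lines; the column names and values are invented for illustration and are not part of the course materials.

```python
import numpy as np
import pandas as pd

# Basic Python: a function with a simple control structure
def label(score):
    return "high" if score >= 0.5 else "low"

# NumPy: an array and a summary statistic
scores = np.array([0.2, 0.7, 0.9, 0.4])
print(scores.mean())

# Pandas: a small DataFrame and a column-wise apply
df = pd.DataFrame({"name": ["a", "b", "c", "d"], "score": scores})
df["label"] = df["score"].apply(label)
print(df)
```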

Participants should also have a Google account, as the course will heavily utilize Google Colab for coding exercises and collaboration.

ATTACHED MATERIALS

To help participants prepare for the course, we have provided a set of materials that cover the prerequisite topics. These materials include tutorials and sample code on Python programming and data engineering.

Participants are encouraged to review these materials before the course begins to refresh their knowledge and ensure they have the necessary foundation.

*In Google Colab, go to "File" > "Save a copy in Drive". This way, you can modify and interact with the code yourself.


CODING ENVIRONMENT

Throughout the course, we will be using Google Colab as our primary coding environment. Google Colab is a cloud-based Jupyter Notebook platform that allows participants to write, execute, and share Python code seamlessly. Each session will have a dedicated Google Colab notebook prepared by the instructor, which will contain the lecture slides, relevant code examples, and exercises related to the topic being covered.

HANDS-ON EXERCISES

To reinforce the concepts learned during the lectures, participants will engage in hands-on coding exercises within the Google Colab notebooks. These exercises will be designed to challenge participants' understanding and provide practical experience in applying the techniques covered in each session. 

The exercises will range from basic implementations to more complex problem-solving tasks, allowing participants to gradually build their skills and confidence.

KAGGLE COMPETITIONS

For the advanced topics that involve Kaggle competitions, we will use the Kaggle platform to create live, private competitions limited to the classroom. These competitions will provide participants with the opportunity to apply their knowledge to real-world datasets and compete with their peers in a controlled environment. 

To facilitate participation, the instructor will share a baseline model implemented in Google Colab, serving as a starting point for participants to build upon and improve their solutions.

COLLABORATION & SUPPORT

Throughout the course, we will use Google Classroom for classroom management alongside Google Colab as the coding environment.

As noted under Coding Environment above, each session has a dedicated Colab notebook prepared by the instructor containing the relevant code examples, explanations, and exercises for that topic.

TECHNICAL REQUIREMENTS

To fully participate in the course, participants are required to bring their own personal computers with a stable internet connection. Our university provides the eduroam service; for those without eduroam access, a separate WiFi network will be available. 

An updated web browser (Chrome recommended) and a Google account for accessing Google Colab are necessary. Participants are responsible for ensuring their computers meet the minimum specifications.

COVERED TOPICS

TOPIC 1  Introduction to Data Science (Lecture) 

The Introduction to Data Science session provides a comprehensive overview of the field, exploring its fundamentals, process, and real-world applications. The session will also cover the data science lifecycle, introducing the key stages involved in solving complex problems using data. By the end of this session, participants will have a solid foundation in data science concepts and appreciate its transformative potential in today's data-rich world.

🎯 Participants will gain insights into the roles and responsibilities of data scientists, understanding how they contribute to data-driven decision-making across various industries.

TOPIC 2  Data Preprocessing (Lecture + Coding Exercise)

 Data preprocessing is a crucial step in the data science pipeline, ensuring that the data is clean, reliable, and suitable for analysis. This session covers essential techniques such as data cleaning, handling missing values, data transformation, and feature scaling. The importance of data preprocessing will be emphasized, as it directly impacts the accuracy and reliability of the insights derived from the data.

🎯 Participants will learn how to identify and address common data quality issues, preparing datasets for subsequent analysis and modeling.
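
As a hedged illustration of what such preprocessing can look like in code (the actual course notebooks may use different data and steps), the sketch below imputes missing values and scales features with pandas and scikit-learn on a made-up table:

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Toy dataset with a missing value (illustrative only)
df = pd.DataFrame({"age": [25, 32, None, 41],
                   "income": [40_000, 52_000, 61_000, 58_000]})

# Handle missing values by imputing the column median
imputer = SimpleImputer(strategy="median")
X = imputer.fit_transform(df)

# Feature scaling: standardize each column to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled)
```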

TOPIC 3  Data Exploration (Lecture + Coding Exercise)

Data exploration, also known as exploratory data analysis (EDA), is a critical phase in understanding and extracting insights from datasets. This session introduces techniques for summarizing and visualizing data, including calculating descriptive statistics, creating informative plots, and identifying patterns, relationships, and anomalies. The session will emphasize the role of data visualization in communicating insights effectively to stakeholders.

 🎯 Participants will learn how to use EDA to gain a deeper understanding of their data, formulate hypotheses, and guide further analysis.
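
A minimal EDA sketch, assuming pandas and matplotlib (likely but not guaranteed to be the libraries used in the course notebooks) and an invented column:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Illustrative dataset; in class a real CSV would be loaded instead
df = pd.DataFrame({"height_cm": [160, 172, 181, 168, 175, 190, 158, 177]})

# Descriptive statistics: count, mean, std, quartiles, min/max
print(df.describe())

# A simple plot to inspect the distribution
df["height_cm"].hist(bins=5)
plt.xlabel("height (cm)")
plt.ylabel("frequency")
plt.show()
```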

TOPIC 4  Classification Algorithms (Lecture + Coding Exercise)

Classification algorithms are widely used in data science for predicting categorical outcomes. This session provides an introduction to fundamental classification techniques, including logistic regression, decision trees, naive Bayes, and support vector machines. The session will cover the process of building and evaluating classification models, including model training and performance metrics. Participants will appreciate the importance of classification in various domains, such as spam detection, customer churn prediction, and medical diagnosis.

🎯 Participants will learn the principles behind each algorithm, understanding their strengths, weaknesses, and application scenarios.
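
For a rough sense of how such models are built and evaluated, the sketch below trains two of the named classifiers with scikit-learn on a built-in dataset; the dataset and settings are illustrative assumptions, not the course data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Built-in binary classification dataset standing in for the course data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale features (helps the logistic regression solver converge)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Train two of the discussed classifiers and compare accuracy on held-out data
for model in (LogisticRegression(), DecisionTreeClassifier(random_state=0)):
    model.fit(X_train, y_train)
    print(type(model).__name__, accuracy_score(y_test, model.predict(X_test)))
```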

TOPIC 5  Regression Algorithms (Lecture + Coding Exercise)

 Regression algorithms are essential for predicting continuous outcomes and understanding the relationships between variables. This session covers popular regression techniques, including linear regression and regularization methods.  The session will discuss the assumptions and limitations of regression models and introduce techniques for handling non-linear relationships and multicollinearity. Regression analysis finds applications in various fields, such as sales forecasting, price estimation, and risk assessment.

🎯 Participants will learn how to build and interpret regression models, assess their performance, and make predictions.
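
As an illustrative sketch (using scikit-learn and one of its built-in datasets as assumptions, not the course material), ordinary least squares and a regularized ridge model can be compared like this:

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error

# Built-in regression dataset used purely for illustration
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Ordinary least squares versus a regularized (ridge) model
for model in (LinearRegression(), Ridge(alpha=1.0)):
    model.fit(X_train, y_train)
    print(type(model).__name__, mean_squared_error(y_test, model.predict(X_test)))
```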

TOPIC 6  Ensemble Learning (Lecture + Coding Exercise)

Ensemble learning is a powerful approach that combines multiple models to improve predictive performance. This session explores popular ensemble techniques, including bagging (e.g., random forests), boosting (e.g., AdaBoost, gradient boosting), and stacking. The session will cover the concepts of model diversity, bias-variance trade-off, and the advantages of ensemble learning over single models. Ensemble methods have proven successful in various domains, such as credit risk assessment, fraud detection, and recommendation systems.

🎯 Participants will learn how ensemble methods leverage the strengths of individual models to create more robust and accurate predictions.
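
A minimal sketch of bagging versus boosting, assuming scikit-learn and a built-in dataset rather than the course data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

X, y = load_breast_cancer(return_X_y=True)

# Bagging (random forest) and boosting (gradient boosting),
# compared by cross-validated accuracy
for model in (RandomForestClassifier(n_estimators=200, random_state=0),
              GradientBoostingClassifier(random_state=0)):
    scores = cross_val_score(model, X, y, cv=5)
    print(type(model).__name__, scores.mean())
```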

TOPIC 7  Model Optimization (Lecture + Coding Exercise) 

Model optimization is the process of fine-tuning models to achieve the best possible performance while avoiding overfitting. This session covers techniques for preventing data leakage and optimizing model performance, including hyperparameter tuning and cross-validation. The session will emphasize the importance of model selection and validation in ensuring the reliability and robustness of the trained models.

🎯  Participants will learn how to build a machine learning pipeline and assess model generalization using cross-validation.
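
One hedged way to combine these ideas in code, assuming scikit-learn, is a pipeline whose hyperparameters are tuned with cross-validated grid search; the model and parameter grid below are illustrative choices only:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

# Putting scaling inside the pipeline keeps preprocessing within each
# training fold, which helps avoid data leakage during validation
pipe = Pipeline([("scale", StandardScaler()), ("svm", SVC())])

# Hyperparameter tuning with cross-validation
grid = GridSearchCV(pipe, {"svm__C": [0.1, 1, 10]}, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```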

TOPIC 8  Multiclass Classification (Lecture + Coding Exercise)

Multiclass classification extends binary classification techniques to handle problems with more than two classes. This session covers strategies for multiclass classification, such as one-vs-rest, one-vs-one, and multinomial logistic regression. The session will discuss the challenges and considerations specific to multiclass problems, such as class separability and computational complexity. Multiclass classification finds applications in various domains, including image classification, text categorization, and sentiment analysis.

🎯  Participants will learn how to adapt binary classification algorithms to multiclass scenarios, handle class imbalance, and evaluate the performance of multiclass classifiers.
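
As a hedged sketch of two of these strategies, assuming scikit-learn and its built-in three-class iris dataset (not the competition data):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import cross_val_score

# Three-class dataset standing in for the competition data
X, y = load_iris(return_X_y=True)

# One-vs-rest wrapper versus the solver's built-in multinomial handling
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
multinomial = LogisticRegression(max_iter=1000)

for name, model in [("one-vs-rest", ovr), ("multinomial", multinomial)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())
```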

Kaggle Competition 1  Multiclass Classification: In this competition, participants will tackle a challenging multiclass classification problem where the goal is to predict the correct class among more than 10 possible categories. The dataset will be carefully curated to include a diverse set of features and a large number of samples, allowing participants to explore various multiclass classification techniques. The competition will emphasize the importance of handling class imbalance, feature selection, and model evaluation metrics specific to multiclass scenarios.

🎯  Participants will need to apply their knowledge of algorithms such as one-vs-rest, one-vs-one, and multinomial logistic regression to build accurate and efficient classifiers.

TOPIC 9  Data Simplification (Lecture + Coding Exercise) 

Data simplification techniques aim to reduce the complexity and dimensionality of high-dimensional datasets while preserving the most informative features. This session explores dimensionality reduction methods, such as principal component analysis (PCA), t-SNE, and feature selection techniques. The session will discuss the trade-offs between information retention and dimensionality reduction and provide guidelines for selecting appropriate techniques based on the dataset characteristics and project goals.

🎯  Participants will learn how these techniques can improve model performance, reduce computational overhead, and facilitate data visualization. 
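
A brief PCA sketch, assuming scikit-learn and its built-in digits dataset purely for illustration:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# 64-dimensional digit images reduced to 2 components for visualization
X, y = load_digits(return_X_y=True)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

# How much of the original variance the two components retain
print(pca.explained_variance_ratio_.sum())
print(X_2d.shape)
```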

Kaggle Competition 2  Dimensionality Reduction: This competition focuses on the application of dimensionality reduction techniques to improve the performance and interpretability of predictive models. The provided dataset will contain a large number of related and unrelated variables, challenging participants to identify the most informative features and reduce the data's dimensionality. The competition will emphasize the trade-offs between dimensionality reduction and model performance, encouraging participants to find the optimal balance.

🎯  Participants will explore techniques such as principal component analysis (PCA), t-SNE, and feature selection methods to transform the high-dimensional data into a lower-dimensional representation while preserving the most relevant information.

TOPIC 10  Clustering (Lecture) 

Clustering is an unsupervised learning technique that aims to group similar data points together based on their inherent patterns and similarities. This session explores various clustering algorithms, such as k-means, hierarchical clustering, and density-based clustering (e.g., DBSCAN). The session will discuss the applications of clustering in customer segmentation, anomaly detection, and data compression, providing guidelines for selecting appropriate algorithms based on data characteristics and desired outcomes.

🎯  Participants will learn how to apply clustering techniques to discover the underlying structure of unlabeled datasets, handle challenges like determining the optimal number of clusters and noisy data points, and interpret the results. 
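
A minimal sketch of choosing the number of clusters with the silhouette score, assuming scikit-learn and synthetic data rather than the competition dataset:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data with a number of groups unknown to the model
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Try several cluster counts and use the silhouette score to compare them
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))
```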

Kaggle Competition 3  Clustering: In this unsupervised learning competition, participants will work with an unlabeled dataset to discover inherent patterns and groups within the data. The goal is to develop clustering algorithms that can automatically partition the data into meaningful clusters without prior knowledge of the ground truth. The competition will focus on selecting the appropriate number of clusters, handling noisy or outlier data points, and evaluating the quality of the resulting clusters using internal and external validation measures.

🎯  Participants will explore various clustering techniques, such as k-means, hierarchical clustering, and density-based clustering (e.g., DBSCAN), to identify the underlying structure of the data.

TOPIC 11  Anomaly Detection (Lecture) 

Anomaly detection focuses on identifying rare or unusual instances within a dataset that deviate significantly from the norm. This session explores various anomaly detection techniques, including statistical methods such as the Z-score, density-based approaches like Local Outlier Factor (LOF), tree-based methods such as Isolation Forest, and machine learning algorithms specifically designed for this task, such as One-Class SVM and autoencoders. The session will provide insights into the challenges and considerations in designing effective anomaly detection systems and guide participants in selecting suitable algorithms based on the characteristics of the data and the desired outcomes.

🎯  Participants will learn how to select appropriate features, handle imbalanced data, and evaluate the performance of anomaly detection models using metrics such as precision, recall, and F1-score.
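
As a hedged example of one of the named methods, the sketch below runs an Isolation Forest on synthetic data with a few injected outliers; the contamination rate is an illustrative assumption:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Mostly "normal" points plus a few injected outliers
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outliers = rng.uniform(low=-6, high=6, size=(5, 2))
X = np.vstack([normal, outliers])

# Isolation Forest flags points that are easy to isolate as anomalies (-1)
model = IsolationForest(contamination=0.03, random_state=0).fit(X)
labels = model.predict(X)
print("flagged as anomalies:", (labels == -1).sum())
```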

Kaggle Competition 4  Anomaly Detection: The anomaly detection competition challenges participants to identify unusual or rare instances within a dataset. The provided data will contain a mix of normal and anomalous samples, and the task is to develop models that can accurately flag the anomalous instances. The competition will emphasize the importance of selecting appropriate features, handling imbalanced data, and evaluating the performance of anomaly detection models using metrics such as precision, recall, and F1-score.

🎯  Participants will explore statistical methods, density-based approaches, and machine learning algorithms specifically designed for anomaly detection.

TOPIC 12  Neural Networks (Lecture) 

Neural networks and deep learning have revolutionized various fields, including computer vision, natural language processing, and speech recognition. This session provides an introduction to the fundamentals of neural networks, covering basic architectures, activation functions, and training algorithms. The session will discuss the role of deep learning in solving challenging problems and provide hands-on experience in building and training neural networks using popular deep learning frameworks.

🎯 Participants will learn how neural networks can model complex non-linear relationships and discover hierarchical representations from data. 
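
A minimal sketch of such a network, assuming TensorFlow/Keras as the framework (the course does not commit to a specific one) and the built-in MNIST digits as a stand-in dataset:

```python
import tensorflow as tf

# Built-in handwritten-digit images, scaled to the [0, 1] range
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# A minimal feed-forward network: one hidden layer with a ReLU activation
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# A couple of epochs is enough to see accuracy improve
model.fit(x_train, y_train, epochs=2, validation_data=(x_test, y_test))
```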

Kaggle Competition 5  Neural Network: This competition focuses on building a simple image classification model using a well-known dataset. Participants will have the opportunity to apply their knowledge of neural networks and deep learning techniques to develop accurate and efficient classifiers. The dataset will consist of labeled images belonging to multiple classes, and participants will need to design and train neural network architectures to predict the correct class for each image. The competition will cover topics such as data preprocessing, network architecture design, hyperparameter tuning, and model evaluation. 

🎯 Participants will gain hands-on experience in using popular deep learning frameworks and techniques to solve real-world image classification problems.

TOPIC 13  Text Mining (Lecture) 

Text mining and natural language processing (NLP) techniques enable the extraction of meaningful insights from unstructured textual data. This session covers essential techniques for processing and analyzing text data, including text preprocessing, feature extraction, sentiment analysis, and topic modeling. The session will highlight the importance of text mining in domains such as social media analysis, customer feedback analysis, and content recommendation systems.

🎯 Participants will learn how to convert raw text into structured representations, extract relevant features, and apply machine learning algorithms for various NLP tasks. 
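
As a hedged sketch of turning raw text into features and classifying it, assuming scikit-learn's TF-IDF vectorizer and a tiny invented corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny illustrative corpus with binary sentiment-style labels
texts = ["great product, loved it",
         "terrible and disappointing",
         "absolutely loved the quality",
         "awful experience, would not recommend"]
labels = [1, 0, 1, 0]

# Convert raw text into a structured (TF-IDF) representation
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# Fit a simple classifier on the extracted features
clf = MultinomialNB().fit(X, labels)
print(clf.predict(vectorizer.transform(["loved this, great quality"])))
```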

Kaggle Competition 6  Text Mining: In this competition, participants will tackle a binary classification problem based on textual data. The dataset will consist of a collection of text documents, each labeled as belonging to one of two classes. The goal is to develop text mining and natural language processing (NLP) techniques to accurately classify the documents. The competition will emphasize text-specific challenges, such as handling large vocabularies, dealing with noisy and unstructured data, and capturing semantic information.

 🎯 Participants will explore techniques such as text preprocessing, feature extraction, sentiment analysis, and topic modeling to convert the raw text into structured representations suitable for machine learning algorithms.

TOPIC 14  Time Series Forecasting (Lecture)

Time series analysis and forecasting are crucial for understanding and predicting patterns in data that evolve over time. This session introduces techniques for analyzing and modeling time series data, including time series decomposition, exponential smoothing, ARIMA models, and evaluation metrics. The session will discuss the challenges and considerations in time series analysis, such as handling missing data, selecting appropriate models, and assessing forecast performance. 

🎯 Participants will learn how to identify trends, seasonality, and irregularities in time series data and develop models to make accurate forecasts. 
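
A brief forecasting sketch, assuming statsmodels' ARIMA implementation and a synthetic monthly series; the order (1, 1, 1) is an arbitrary illustrative choice:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Synthetic monthly series with a trend plus noise (illustrative only)
rng = np.random.default_rng(0)
values = np.arange(48) * 0.5 + rng.normal(scale=1.0, size=48)
series = pd.Series(values, index=pd.date_range("2020-01-01", periods=48, freq="MS"))

# Fit a simple ARIMA model and forecast a short horizon
model = ARIMA(series, order=(1, 1, 1)).fit()
print(model.forecast(steps=6))
```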

Kaggle Competition 7  Time Series Forecasting: The time series forecasting competition challenges participants to predict future values based on historical time series data. The dataset will contain a series of observations recorded at regular intervals, and the task is to develop models that can accurately forecast values for a short time horizon. The competition will focus on preprocessing time series data, handling missing values, selecting appropriate features, and evaluating the performance of forecasting models using metrics such as mean absolute error (MAE) and root mean squared error (RMSE).

🎯 Participants will explore techniques such as time series decomposition, exponential smoothing, ARIMA models, and machine learning algorithms adapted for time series data. 

GUEST LECTURES (TALKS)

The guest lecture series brings data science experts to share their insights, experiences, and real-world applications of data science. These lectures provide participants with exposure to diverse perspectives, challenges, and trends in various domains. Guest speakers will discuss case studies, share best practices, and highlight the impact of data science in their respective fields.

🎯 Participants will have the opportunity to engage with the speakers, ask questions, and gain valuable insights into the practical aspects of data science in industry settings.