Introduction to Data Science | Data Mining
IC 1109 | IT-501D Introduction to Data Science and R Language
Data science is an interdisciplinary field that integrates domain expertise, computer science, and statistical methods to extract meaningful insights from data.
Data analysis matters most within specific science, business, or engineering contexts. The ACM/IEEE data science curriculum guidelines emphasize:
Foundational knowledge in programming, data structures, algorithms, and statistics.
Practical applications of data management, machine learning, and data visualization.
Ethical considerations in data privacy, security, and societal impacts.
Contextual learning, where students apply data science skills to specific domains.
Larose, D. T. (2015). Data mining and predictive analytics. John Wiley & Sons.
Larose, D. T., & Larose, C. D. (2014). Discovering knowledge in data: An introduction to data mining (2nd ed.). John Wiley & Sons.
Larose, D. T. (2005). Discovering knowledge in data: An introduction to data mining. John Wiley & Sons.
Unit 1: Introduction to Data Science (Week 1)
What is Data Science?
Multi-disciplinary nature (CS + Stats/ML + Domain)
Hypothesis-Driven vs. Data-Driven paradigms
Core mindset: Always ask questions
Types of Data
Structured vs. Unstructured
Quantitative vs. Categorical
Big Data and the 3Vs (Volume, Velocity, Variety)
Common Data Sources and Formats
CSV, JSON, XML, SQL dumps, APIs, Web scraping, etc.
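The CSV and JSON formats above can be parsed with Python's standard library alone; a minimal sketch, with sample records invented for illustration:

```python
import csv
import io
import json

# A tiny CSV payload, as it might arrive from a file or an API
# (records invented for illustration).
csv_text = "name,score\nasha,91\nravi,84\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
# Note: csv hands every field back as a string.

# The same records as JSON, the usual exchange format for web APIs;
# here numeric fields keep their types.
json_text = '[{"name": "asha", "score": 91}, {"name": "ravi", "score": 84}]'
records = json.loads(json_text)

print(rows[0]["name"], records[1]["score"])   # asha 84
```

The string-vs-typed difference between the two parsers is a common source of cleaning work later in the pipeline.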
Unit 2: Core Types of Data Science Problems (Week 1–2)
Classification Problems
Regression Problems
Unit 3: Probability & Statistics Foundation (Week 2–3)
Probability Overview
Experiments, Sample Space, Events
Probability rules, Random Variables, Expected Value
Independence, Conditional Probability, Bayes’ Theorem
Joint & Marginal Probability
PDF vs. CDF
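Bayes' Theorem from the list above becomes concrete with the classic screening-test example; the prevalence, sensitivity, and specificity numbers below are invented for illustration:

```python
# Bayes' theorem on a screening test (all figures invented).
p_disease = 0.01                 # prior: 1% prevalence
p_pos_given_disease = 0.95       # sensitivity
p_pos_given_healthy = 0.10       # false-positive rate (1 - specificity)

# Marginal probability of a positive test, by the law of total probability.
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Posterior: P(disease | positive test)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))   # about 0.088
```

Despite a 95%-sensitive test, the posterior is under 9% because the prior is so small, which is exactly the intuition Bayes' Theorem formalizes.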
Descriptive Statistics
Measures of Centrality (Mean, Geometric Mean, Median, Mode)
Measures of Variability (Variance, Standard Deviation)
Interpreting Variance
Correlation (Pearson & Spearman) – “Correlation ≠ Causation”
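The centrality and variability measures above are easy to check with Python's `statistics` module; a short sketch on a toy sample (all numbers invented for illustration):

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # toy sample

mean = statistics.mean(data)                 # 5
median = statistics.median(data)             # 4.5
mode = statistics.mode(data)                 # 4
variance = statistics.pvariance(data)        # population variance: 4
stdev = statistics.pstdev(data)              # 2
gmean = math.prod(data) ** (1 / len(data))   # geometric mean

# Pearson correlation from its definition: cov(x, y) / (sd_x * sd_y).
x = [1, 2, 3, 4]
y = [2, 4, 6, 8]                             # y is perfectly linear in x
mx, my = statistics.mean(x), statistics.mean(y)
r = (sum((a - mx) * (b - my) for a, b in zip(x, y))
     / math.sqrt(sum((a - mx) ** 2 for a in x)
                 * sum((b - my) ** 2 for b in y)))
print(mean, median, mode, variance, stdev, round(r, 3))
```

Here r = 1.0 because y is an exact linear function of x; a strong r still says nothing by itself about causation.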
Unit 4: Classic Statistical Distributions (Week 3)
Binomial Distribution
Normal (Gaussian) Distribution & 68-95-99.7 rule
Poisson Distribution
Power-Law Distributions & Pareto Principle (80/20 rule)
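The binomial and Poisson probability mass functions, and the 68-95-99.7 rule for the normal distribution, can all be checked numerically with nothing beyond Python's `math` module; a minimal sketch:

```python
import math

# Binomial pmf: C(n, k) * p^k * (1 - p)^(n - k)
def binom_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

# Poisson pmf: lam^k * e^(-lam) / k!
def poisson_pmf(k, lam):
    return lam**k * math.exp(-lam) / math.factorial(k)

# Normal CDF via the error function, which lets us verify
# the 68-95-99.7 rule directly.
def normal_cdf(x, mu=0.0, sigma=1.0):
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

within_1sd = normal_cdf(1) - normal_cdf(-1)   # about 0.683
within_2sd = normal_cdf(2) - normal_cdf(-2)   # about 0.954
print(round(within_1sd, 3), round(within_2sd, 3))
```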
Unit 5: Data Preprocessing Pipeline (Week 4–5)
Data Cleaning
Errors vs. Artifacts
Data Compatibility & Unit Conversion
Missing Value Imputation techniques
Outlier Detection
Feature Engineering
Continuous features: Scaling, Binning, Log transforms, Interactions, PCA
Categorical features: Label/Ordinal encoding, One-Hot, Hashing, Embeddings
Sampling Techniques
Inverse Transform Sampling
Monte Carlo methods
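Both sampling techniques above fit in a few lines of Python; a sketch assuming an Exponential(lam) target for inverse transform sampling and the quarter-circle area estimate of pi for Monte Carlo:

```python
import math
import random

random.seed(0)   # fixed seed so the run is reproducible

# Inverse transform sampling: if U ~ Uniform(0, 1), then
# X = -ln(1 - U) / lam follows an Exponential(lam) distribution,
# because that expression inverts the exponential CDF.
def sample_exponential(lam):
    u = random.random()
    return -math.log(1 - u) / lam

draws = [sample_exponential(2.0) for _ in range(100_000)]
print(round(sum(draws) / len(draws), 2))   # approaches 1/lam = 0.5

# Monte Carlo: the fraction of uniform points in the unit square
# that land inside the quarter circle estimates pi/4.
n = 100_000
inside = sum(random.random() ** 2 + random.random() ** 2 <= 1
             for _ in range(n))
pi_est = 4 * inside / n
print(round(pi_est, 2))   # close to pi
```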
Unit 6: Modeling Fundamentals (Week 6)
Why we model: Prediction vs. Inference
Error components (Reducible vs. Irreducible)
Modeling Philosophies
Occam’s Razor
Bias-Variance Trade-off
No Free Lunch Theorem
Nate Silver style: Think probabilistically, update with new evidence, seek consensus
Unit 7: Modeling Taxonomy & Evaluation (Week 7)
Types of Models
Parametric vs. Non-parametric
Supervised vs. Unsupervised
Black-box vs. Interpretable
Deterministic vs. Stochastic
Evaluation Metrics
Classification: Accuracy, Precision, Recall, F1, ROC-AUC
Regression: MSE, RMSE, MAE, R²
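The classification metrics above all reduce to counts of true/false positives and negatives; a from-scratch sketch on an invented toy result:

```python
# Toy binary-classification result (labels invented for illustration).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

precision = tp / (tp + fp)            # of predicted positives, how many are real
recall = tp / (tp + fn)               # of real positives, how many were found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(precision, recall, f1, accuracy)   # 0.75 each on this toy data
```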
Evaluation Environment
Train/Validation/Test splits
Cross-Validation (k-Fold, LOOCV)
Bootstrapping
Handling small datasets
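k-Fold splitting itself is only a few lines; a from-scratch index-splitting sketch (function name and structure are my own, not from a library):

```python
import random

# Shuffle the indices once, deal them into k folds, then use each
# fold as the test set while the rest form the training set.
def k_fold_indices(n, k, seed=0):
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

for train, test in k_fold_indices(10, 5):
    assert not set(train) & set(test)        # train/test are disjoint
    assert len(train) + len(test) == 10      # together they cover the data
```

Setting k = n gives leave-one-out cross-validation (LOOCV) as a special case.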
Unit 8: Supervised Learning – Regression & Classification (Week 8–9)
Linear Regression
Ordinary Least Squares (closed-form & Gradient Descent)
Regularization (Lasso, Ridge)
Feature selection & dimensionality reduction
Evaluation (RSE, R², p-values)
Logistic Regression
Maximum Likelihood Estimation
Handling imbalanced & multi-class problems
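For the one-feature case, the OLS closed form reduces to slope = cov(x, y) / var(x); a sketch on toy data that follow y = 2x + 1 exactly, so the fit should recover those coefficients:

```python
# Toy data generated from y = 2x + 1 (invented for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Closed-form OLS for simple linear regression.
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
intercept = y_bar - slope * x_bar

print(slope, intercept)   # 2.0 1.0
```

With more features the same idea generalizes to the matrix normal equations, and gradient descent minimizes the identical squared-error objective iteratively.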
Unit 9: Instance-based & Clustering Methods (Week 10)
Distance Metrics (Euclidean, Manhattan, Cosine, KL, JS)
k-Nearest Neighbors (k-NN)
Algorithm & optimization (LSH, KD-Trees)
Clustering
K-Means (including choosing K)
Hierarchical Clustering & Dendrograms (Linkage types)
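The k-NN classifier above fits in a dozen lines; a bare-bones sketch over 2-D points with Euclidean distance, the training data invented for illustration:

```python
import math
from collections import Counter

# Two tiny clusters of labelled 2-D points (invented for illustration).
train = [((1.0, 1.0), "a"), ((1.2, 0.8), "a"),
         ((5.0, 5.0), "b"), ((5.2, 5.1), "b")]

def knn_predict(query, k=3):
    # Sort training points by Euclidean distance to the query,
    # then take a majority vote among the k nearest labels.
    nearest = sorted(train, key=lambda item: math.dist(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict((1.1, 0.9)))   # "a"
print(knn_predict((5.1, 5.0)))   # "b"
```

The brute-force sort here is O(n log n) per query; KD-Trees and LSH from the outline exist precisely to avoid scanning every training point.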
Unit 10: Tree-based & Ensemble Methods (Week 11)
Decision Trees (advantages & disadvantages)
Ensemble Techniques
Bagging & Random Forests
Boosting (concept)
Naive Bayes Classifier
Support Vector Machines (overview)
Principal Component Analysis (PCA)
Unit 11: Deep Learning & Big Data Basics (Week 12)
Introduction to Deep Learning
Key architectures: DNN, CNN, RNN, GAN, Transfer Learning
Core concepts: Neurons, Activation functions, Backpropagation, Dropout, etc.
TensorFlow basics (Tensors, Variables)
Big Data & Hadoop Ecosystem (high-level)
HDFS, MapReduce, YARN
Hive, Pig, Spark, HBase, Kafka, Sqoop, Oozie
Unit 12: Supporting Tools & Concepts (Week 13–14)
Essential SQL
SELECT, WHERE, GROUP BY, JOINs, Subqueries
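All of these clauses can be practiced against an in-memory SQLite database from Python; a sketch with tables and rows invented for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE students (id INTEGER, name TEXT);
    CREATE TABLE marks (student_id INTEGER, subject TEXT, score INTEGER);
    INSERT INTO students VALUES (1, 'asha'), (2, 'ravi');
    INSERT INTO marks VALUES (1, 'ds', 91), (1, 'sql', 85), (2, 'ds', 78);
""")

# JOIN the two tables, filter with WHERE, aggregate with GROUP BY.
rows = con.execute("""
    SELECT s.name, AVG(m.score)
    FROM students s
    JOIN marks m ON m.student_id = s.id
    WHERE m.score >= 80
    GROUP BY s.name
""").fetchall()

print(rows)   # [('asha', 88.0)] -- ravi's only mark fails the WHERE filter
```

Note that WHERE filters rows before grouping; filtering on the aggregate itself would need HAVING instead.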
Python Data Structures
Lists, Tuples, Dictionaries, Sets
Collections (deque, Counter), heapq
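A quick tour of the three specialized containers named above, with toy values invented for illustration:

```python
import heapq
from collections import Counter, deque

# deque: O(1) appends and pops at both ends -- a natural queue.
q = deque([1, 2, 3])
q.appendleft(0)   # deque([0, 1, 2, 3])
q.pop()           # removes 3 from the right end
print(list(q))    # [0, 1, 2]

# Counter: a frequency table in one line.
freq = Counter("mississippi")
print(freq["s"], freq["i"])   # 4 4

# heapq: a binary min-heap layered over a plain list.
heap = [5, 1, 4]
heapq.heapify(heap)
heapq.heappush(heap, 0)
smallest = heapq.heappop(heap)
print(smallest)   # 0
```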
Machine Learning Terminology Recap
Overfitting, Regularization, Hyperparameters, Cross-entropy, A/B testing, etc.
Final Week
Capstone Project / Revision
Recommended Resources & Further Learning Paths
Assignment and Internal Assessment
Programming Assignment (R/Python) (Deadline for regular students: 6 December 2023)
Activity (Additional topics: Web Search, Big Data, Machine Learning)
Classroom materials, resources, and the cheat sheet compiled by Maverick Lin (http://mavericklin.com)
Programming Assignment (R/Python), running from the beginning of the semester, to be submitted
LAB Assignments Reference
Sample Video Submitted: Priyanshi Soni