1. Course Overview
Data Science combines statistical analysis, programming, and domain knowledge to extract insights from data. This comprehensive course will equip students with hands-on skills in Python, statistics, machine learning, data visualization, and big data tools. By the end of the program, participants will be able to build, evaluate, and deploy data‐driven models, as well as work on real‐world datasets and projects.
2. Target Audience & Prerequisites
Target Audience: Graduates, IT professionals, analysts, or anyone interested in building a career in Data Science.
Prerequisites:
Basic knowledge of programming (preferably in Python or any other language).
Familiarity with high‐school level mathematics (algebra, probability, and statistics).
Eagerness to learn and work on practical, real‐world datasets.
3. Learning Outcomes
By the end of this course, students will be able to:
Write clean, efficient Python code for data manipulation and analysis.
Perform exploratory data analysis (EDA) to uncover patterns and insights.
Apply statistical concepts and probability distributions to real‐world datasets.
Build, evaluate, and tune machine learning models (supervised & unsupervised).
Implement deep learning models using frameworks such as TensorFlow/Keras.
Work with big data tools (Hadoop, Spark) for large‐scale data processing.
Use data visualization libraries (Matplotlib, Seaborn, Plotly) and BI tools (Tableau/Power BI).
Deploy a machine learning model as a REST API or simple web app.
Complete an end-to-end capstone project on a real dataset (e.g., customer segmentation, sales forecasting, or NLP).
4. Software & Tools Covered
Programming Languages: Python (primary), R (overview)
IDE / Notebook: Jupyter Notebook, Google Colab, VS Code
Libraries / Frameworks:
Data Manipulation: Pandas, NumPy
Visualization: Matplotlib, Seaborn, Plotly
Machine Learning: Scikit-learn
Deep Learning: TensorFlow (Keras API) / PyTorch (overview)
NLP: NLTK, spaCy (basic)
Big Data: Apache Hadoop (HDFS, MapReduce), Apache Spark (PySpark)
Databases: MySQL / PostgreSQL (basic querying), NoSQL overview (MongoDB)
BI Tools: Tableau / Power BI (basic)
Version Control: Git & GitHub (basic)
Deployment: Flask / FastAPI (basic), Heroku / Streamlit (overview)
5. Detailed Module Descriptions
Module 1: Introduction to Data Science & Python Environment
What is Data Science?
Definition, History, Applications (Healthcare, Finance, Retail)
Data Science vs. Data Analytics vs. Business Intelligence
Data Science Lifecycle (CRISP-DM)
Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment
Setting Up Environment
Installing Anaconda (Python 3.x), Jupyter Notebook
IDEs: VS Code / PyCharm (overview)
Basic Git & GitHub Workflow: clone, commit, push, pull, branch
“Hello Data Science” Script
Writing and executing the first Python script
Basic Jupyter Notebook workflows (cells, magic commands)
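A first script in this module can be as simple as the sketch below (contents are illustrative; the classroom version may differ):

```python
# hello_datascience.py -- a first script: summarize a tiny in-memory dataset
import statistics

temperatures = [21.5, 23.0, 22.4, 24.1, 20.8]  # sample sensor readings

print("Hello, Data Science!")
print("Observations:", len(temperatures))
print(f"Mean: {statistics.mean(temperatures):.2f}")
print(f"Std deviation: {statistics.stdev(temperatures):.2f}")
```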
Module 2: Python for Data Science
Python Fundamentals
Variables, Data Types (int, float, string, boolean)
Control Flow: if/else, for loops, while loops
Functions, Arguments, Return Values
Data Structures
Lists, Tuples, Dictionaries, Sets
List comprehensions, Dictionary comprehensions
File Handling & Exception Handling
Reading/writing text files, CSV handling (built-in csv module)
Try/Except blocks, custom exception raising
Virtual Environments & Package Management
conda create / conda activate / pip basics
Installing common data science packages
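The sketch below ties several of these topics together: the built-in csv module, a list comprehension, and try/except handling (the file name scores.csv is illustrative):

```python
# read_scores.py -- CSV reading with the built-in csv module plus exception handling
import csv

def load_scores(path):
    """Return a list of numeric scores from a one-column CSV file."""
    try:
        with open(path, newline="") as f:
            reader = csv.reader(f)
            return [float(row[0]) for row in reader if row]  # list comprehension
    except FileNotFoundError:
        print(f"No such file: {path}")
        return []
    except ValueError as exc:
        raise ValueError(f"Non-numeric value in {path}") from exc

scores = load_scores("scores.csv")  # illustrative file name
print({"count": len(scores), "passed": [s for s in scores if s >= 40]})
```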
Module 3: Data Manipulation with Pandas and NumPy
NumPy Basics
Creating ndarrays, Indexing, Slicing, Boolean Masking
Universal Functions (ufuncs), Broadcasting
Pandas Data Structures
Series vs. DataFrame
Reading/Writing CSV, Excel, JSON, HTML
DataFrame Operations
Selecting & Filtering Rows/Columns
Adding / Dropping Columns, Renaming
Handling Missing Values (dropna, fillna), Data Imputation Strategies
GroupBy Operations: Aggregations, Transformations, Filters
DateTime Handling
Converting strings to datetime, extracting date parts (year, month, day)
Resampling, Time Series indexing
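A short sketch of the Pandas operations covered in this module (the data is made up for illustration):

```python
# pandas_basics.py -- missing values, groupby aggregation, and datetime handling
import pandas as pd

df = pd.DataFrame({
    "date":  ["2024-01-05", "2024-01-12", "2024-02-03", "2024-02-20"],
    "store": ["A", "A", "B", "B"],
    "sales": [100.0, None, 250.0, 300.0],
})

df["date"] = pd.to_datetime(df["date"])               # string -> datetime
df["month"] = df["date"].dt.month                     # extract a date part
df["sales"] = df["sales"].fillna(df["sales"].mean())  # simple mean imputation

# GroupBy aggregation: total and average sales per store
summary = df.groupby("store")["sales"].agg(["sum", "mean"])
print(summary)
```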
Module 4: Exploratory Data Analysis (EDA) & Visualization
Descriptive Statistics
Measures of Central Tendency (mean, median, mode)
Measures of Dispersion (variance, standard deviation, IQR)
Skewness & Kurtosis interpretation
Univariate Analysis
Histograms, Density Plots, Box Plots
Bivariate & Multivariate Analysis
Scatter Plots, Pair Plots, Correlation Matrices
Heatmaps to visualize correlation
Visualization Libraries
Matplotlib: Creating figures, subplots, customizing axes, labels, legends
Seaborn: High-level plotting (bar plots, violin plots, joint plots, lmplot)
Plotly Express (Overview): Interactive charts (scatter, line, bar)
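A minimal EDA sketch using Seaborn's bundled tips dataset (assumes seaborn and matplotlib are installed; load_dataset fetches the data on first use):

```python
# eda_plots.py -- univariate and bivariate EDA on Seaborn's tips dataset
import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")
print(tips["total_bill"].describe())   # central tendency & dispersion

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
sns.histplot(tips["total_bill"], kde=True, ax=axes[0])           # univariate
sns.scatterplot(data=tips, x="total_bill", y="tip", ax=axes[1])  # bivariate
sns.heatmap(tips.corr(numeric_only=True), annot=True, ax=axes[2])  # correlations
plt.tight_layout()
plt.show()
```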
Module 5: Statistical Foundations for Data Science
Probability Basics
Sample space, events, addition & multiplication rules
Conditional probability, Bayes’ theorem
Probability Distributions
Discrete: Bernoulli, Binomial, Poisson
Continuous: Uniform, Normal, Exponential
Using SciPy to compute pmf/pdf/cdf
Sampling & Sampling Distributions
Law of Large Numbers, Central Limit Theorem
Hypothesis Testing
Formulating null and alternative hypotheses
Type I vs. Type II Errors, Significance Level (α), p-value interpretation
t-tests (One-sample, Two-sample), chi-square test, ANOVA (conceptual)
Confidence Intervals
CI for mean, proportion
A/B Testing Overview
Setup, sample size considerations, interpreting results
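A sketch of these ideas with SciPy, using simulated data (group means and sample sizes are illustrative):

```python
# stats_demo.py -- distributions, a two-sample t-test, and a confidence interval
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# P(X <= 1.96) for a standard normal -- the CDF of a continuous distribution
print(stats.norm.cdf(1.96))   # ~0.975

# Two-sample t-test on simulated control/treatment groups (A/B-test style)
control   = rng.normal(loc=50, scale=5, size=100)
treatment = rng.normal(loc=52, scale=5, size=100)
t_stat, p_value = stats.ttest_ind(control, treatment)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # reject H0 if p < alpha (e.g., 0.05)

# 95% confidence interval for the control-group mean
ci = stats.t.interval(0.95, df=len(control) - 1,
                      loc=control.mean(), scale=stats.sem(control))
print("95% CI for mean:", ci)
```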
Module 6: Supervised Machine Learning (Regression & Classification)
ML Workflow
Training set, validation set, test set
Cross-validation (k-fold, stratified)
Regression Techniques
Simple Linear Regression: Cost function, gradient descent (conceptual)
Multiple Linear Regression, Feature Scaling
Regularization: Ridge, Lasso, ElasticNet (why regularize, λ tuning)
Classification Techniques
Logistic Regression: Sigmoid function, cost function, decision boundary
K-Nearest Neighbors: Distance metrics, choosing k
Decision Tree Classifier: Gini impurity, entropy
Ensemble: Random Forest (bagging) overview
Model Evaluation Metrics
Regression: RMSE, MAE, R²
Classification: Confusion Matrix, Accuracy, Precision, Recall, F1-Score, ROC Curve & AUC
Overfitting / Underfitting
Bias-variance tradeoff, learning curves
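The workflow above in miniature with scikit-learn, using a built-in dataset (the model choice is illustrative):

```python
# supervised_ml.py -- train/test split, cross-validation, and evaluation metrics
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Feature scaling + logistic regression combined in one pipeline
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5-fold cross-validation on the training set
scores = cross_val_score(model, X_train, y_train, cv=5)
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")

model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))  # precision/recall/F1
```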
Module 7: Unsupervised Learning & Clustering
Clustering Concepts
Supervised vs. unsupervised learning
K-Means: Elbow method for choosing k, inertia
Hierarchical Clustering: Agglomerative vs. Divisive, dendrograms (conceptual)
DBSCAN (density-based); when to use vs. k-means
Dimensionality Reduction (overview)
PCA: Eigenvalues, explained variance
t-SNE: Visualizing high-dimensional data
Association Rule Mining (overview)
Support, Confidence, Lift, Apriori Algorithm basics
Cluster Interpretation
Segment naming, profiling clusters
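A minimal k-means sketch on synthetic blobs, including the elbow-method loop (all parameters are illustrative):

```python
# clustering_demo.py -- k-means with the elbow method on synthetic data
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Elbow method: inertia (within-cluster sum of squares) for k = 1..8
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(f"k={k}: inertia={km.inertia_:.1f}")   # look for the 'elbow'

labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)
print("Cluster sizes:", [list(labels).count(c) for c in range(4)])
```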
Module 8: Advanced Machine Learning & Model Tuning
Ensemble Learning
Bagging vs. Boosting vs. Stacking (conceptual)
AdaBoost: how sequentially reweighting misclassified samples reduces bias
Gradient Boosting: shallow trees (often stumps) as weak learners, learning rate
XGBoost (overview), LightGBM, CatBoost (conceptual)
Model Selection & Hyperparameter Tuning
GridSearchCV vs. RandomizedSearchCV, Bayesian optimization (mention only)
Nested Cross-Validation (overview)
Feature Engineering
Creating new features, polynomial features, interaction terms
Handling categorical variables: One-hot, Label encoding
Feature Selection Techniques
Filter methods (correlation threshold, chi-square)
Wrapper methods (Recursive Feature Elimination)
Embedded methods (Lasso, tree-based importance)
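A small GridSearchCV sketch (the parameter grid is illustrative, not a recommended search space):

```python
# tuning_demo.py -- hyperparameter search with GridSearchCV
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_wine(return_X_y=True)

param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 5, 10],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid, cv=5, scoring="accuracy", n_jobs=-1)
search.fit(X, y)

print("Best params:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.3f}")
# Tree-based feature importances (an embedded selection method)
print(search.best_estimator_.feature_importances_.round(3))
```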
Module 9: Introduction to Deep Learning
Neural Network Basics
Perceptron vs. Multilayer Perceptron (MLP)
Activation Functions: Sigmoid, ReLU, Tanh, Softmax
Forward & Backpropagation (conceptual)
Keras / TensorFlow
Installing TensorFlow, Using Keras High-Level API
Building Sequential Models: Adding layers, compiling with optimizers (Adam, SGD), loss functions (binary_crossentropy, categorical_crossentropy)
Training & Validation: Epochs, batch size, callbacks (EarlyStopping)
Convolutional Neural Networks (CNN) (Overview)
Conv layers, pooling layers, flatten & dense layers
Common architectures (LeNet, AlexNet, VGG – conceptual)
Recurrent Neural Networks (RNN) & LSTM (Overview)
When to use RNNs (sequence data), vanishing gradients, LSTM/GRU basics
Preventing Overfitting
Dropout, Batch Normalization, Data Augmentation (for images)
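A compact Keras sketch covering the Sequential API, dropout, and EarlyStopping, trained on synthetic data (architecture and hyperparameters are illustrative):

```python
# keras_demo.py -- a small MLP with Keras, early stopping, and dropout
import numpy as np
from tensorflow import keras

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20)).astype("float32")
y = (X.sum(axis=1) > 0).astype("float32")    # synthetic binary labels

model = keras.Sequential([
    keras.layers.Input(shape=(20,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dropout(0.2),               # regularization against overfitting
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

early_stop = keras.callbacks.EarlyStopping(patience=3, restore_best_weights=True)
model.fit(X, y, epochs=50, batch_size=32,
          validation_split=0.2, callbacks=[early_stop], verbose=0)
print(model.evaluate(X, y, verbose=0))       # [loss, accuracy]
```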
Module 10: Natural Language Processing (NLP) & Time Series (Overview)
Text Preprocessing
Tokenization, Stopword Removal, Stemming vs. Lemmatization
Lowercasing, Punctuation Removal, Regular Expressions in text cleaning
Feature Extraction for Text
Bag of Words, TF-IDF vectorization
Word Embeddings: Word2Vec, GloVe (overview)
Basic NLP Models
Building a Sentiment Analysis Pipeline (Multinomial Naïve Bayes, Logistic Regression)
Named Entity Recognition (NER) – spaCy (demonstration)
Time Series Analysis (Conceptual)
Components: Trend, Seasonality, Noise
Plotting Time Series, Moving Averages, Rolling Statistics
Introduction to ARIMA (AutoRegressive Integrated Moving Average) – Conceptual
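A toy sentiment-analysis pipeline of the kind built in this module (the six-sentence corpus is purely illustrative; labs would use a real labeled dataset):

```python
# sentiment_demo.py -- TF-IDF features + Multinomial Naive Bayes
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts  = ["great movie, loved it", "terrible plot, waste of time",
          "wonderful acting", "boring and bad", "enjoyed every minute",
          "awful, would not recommend"]
labels = [1, 0, 1, 0, 1, 0]   # 1 = positive, 0 = negative

pipeline = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
pipeline.fit(texts, labels)

print(pipeline.predict(["what a great and wonderful film",
                        "boring, terrible waste"]))   # expect [1 0]
```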
Module 11: Big Data Fundamentals & PySpark
Big Data Ecosystem
Challenges of Big Data (Volume, Velocity, Variety, Veracity)
Hadoop Architecture: HDFS, NameNode & DataNode, YARN, MapReduce (conceptual)
Introduction to HDFS: Basic commands (ls, put, get, cat)
Apache Spark Basics
Spark vs. Hadoop: In-memory computation, RDD vs. DataFrame
Setting up PySpark (local mode), SparkSession
PySpark DataFrame Operations
Reading/Writing DataFrames (CSV, Parquet)
Schema, DataFrame Transformations (select, filter, groupBy, agg)
SQL Queries on Spark DataFrames (createOrReplaceTempView)
Spark MLlib (Overview)
ML Pipelines, Transformers & Estimators (conceptual)
Example: Building a simple regression model with Spark MLlib
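A minimal PySpark sketch (local mode, assuming a working Spark installation) showing a SparkSession, a groupBy aggregation, and the SQL equivalent via a temp view:

```python
# pyspark_demo.py -- SparkSession, DataFrame transformations, and Spark SQL
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").appName("demo").getOrCreate()

df = spark.createDataFrame(
    [("A", 100.0), ("A", 150.0), ("B", 250.0), ("B", 300.0)],
    ["store", "sales"])

# DataFrame API: groupBy + aggregation
df.groupBy("store").agg(F.sum("sales").alias("total_sales")).show()

# Equivalent SQL query via a temporary view
df.createOrReplaceTempView("sales")
spark.sql("SELECT store, AVG(sales) AS avg_sales FROM sales GROUP BY store").show()

spark.stop()
```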
Module 12: Data Visualization & Business Intelligence Tools
Advanced Visualization Techniques
Customizing Seaborn Plots: Themes, Contexts, Palettes
Interactive Plots with Plotly Express: Hover info, dropdowns, animations
Storytelling: Choosing the right chart for your data (bar vs. line vs. scatter vs. box)
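An interactive Plotly Express sketch using its bundled Gapminder sample data:

```python
# plotly_demo.py -- an interactive scatter plot with Plotly Express
import plotly.express as px

df = px.data.gapminder().query("year == 2007")   # bundled sample dataset

fig = px.scatter(
    df, x="gdpPercap", y="lifeExp",
    size="pop", color="continent",
    hover_name="country", log_x=True,
    title="Life expectancy vs. GDP per capita (2007)")
fig.show()   # opens an interactive chart in the notebook/browser
```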
Tableau / Power BI Fundamentals
Connecting to Data Sources (Excel, CSV, SQL)
Creating Basic Charts: Bar chart, Line chart, Pie chart, Heatmap, Scatter plot
Building Dashboards: Filters, Parameters, Interactivity
Publishing Dashboards to Tableau Public / Power BI Service (Overview)
Data Storytelling Best Practices
Structuring a narrative around data insights
Visual design principles: color, layout, annotations
Module 13: Model Deployment & MLOps Basics
Creating a REST API with Flask / FastAPI
Setting up a virtual environment, installing Flask / FastAPI
Writing endpoints: /predict, /healthcheck
Serializing models with Pickle / Joblib
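A minimal Flask prediction API of the kind built here (model.pkl is an illustrative path to a model serialized with Joblib):

```python
# app.py -- a minimal prediction API (assumes model.pkl was saved with joblib)
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("model.pkl")   # illustrative path to a serialized model

@app.route("/healthcheck")
def healthcheck():
    return jsonify(status="ok")

@app.route("/predict", methods=["POST"])
def predict():
    # expects JSON like {"features": [[5.1, 3.5, 1.4, 0.2]]}
    features = request.get_json()["features"]
    prediction = model.predict(features).tolist()
    return jsonify(prediction=prediction)

if __name__ == "__main__":
    app.run(debug=True)
```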
Basic MLOps Concepts
Model Versioning (Git, DVC – overview)
Continuous Integration / Continuous Deployment (CI/CD) for ML (conceptual)
Monitoring & Logging (overview)
Containerization (Overview)
Why Docker? Dockerfile basics, building & running containers (demonstration)
Cloud Deployment (Overview)
Deploying to Heroku: Creating Procfile, requirements.txt, pushing to Heroku
Streamlit: Building a quick interactive dashboard, Streamlit sharing
Module 14: Capstone Project (End-to-End)
Project Teams & Dataset Selection
Form groups of 2–3 students. Each team proposes a problem statement (e.g., sales forecasting, churn prediction, image classification, recommendation system).
Obtain and share the dataset (public sources: Kaggle, UCI, or local Pune datasets if available).
Phase 1: Problem Definition & EDA
Clearly define objectives, success metrics, and constraints.
Perform thorough EDA: Data cleaning, feature engineering, visualization.
Phase 2: Model Building & Evaluation
Select appropriate algorithms (regression, classification, clustering, etc.).
Train/test split, cross-validation, hyperparameter tuning.
Evaluate performance metrics and perform error analysis.
Phase 3: Model Deployment & Reporting
Serialize the final model (Pickle/Joblib), build a minimal Flask / Streamlit app for prediction.
Create a presentation deck (5–7 slides) summarizing problem, approach, findings, and demo.
Final Demo & Evaluation
Each group presents to the instructor and peers.
Code review feedback, suggestions for improvement.
Grading Criteria: Technical correctness, code quality, visualization clarity, deployment functionality, presentation.
6. Assignments & Evaluation
Weekly Assignments:
Hands-On Labs & Jupyter Notebook submissions for each module.
Mini-projects (e.g., EDA report, basic ML model).
Quizzes:
Short MCQs or coding quizzes at the end of key modules (Statistics, ML, Deep Learning).
Capstone Project:
Group-based; counts for 30% of the final grade.
Presentation & demonstration during the final week.
Attendance & Participation:
Active participation in classroom discussions, labs, and Q&A sessions (10%).
7. Recommended Reading & Resources
Books
“Python for Data Analysis” by Wes McKinney (Pandas-focused)
“Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow” by Aurélien Géron
“An Introduction to Statistical Learning” by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani (free PDF available)
“Deep Learning with Python” by François Chollet (focus on Keras)
Online Tutorials & Documentation
Official documentation for Pandas, NumPy, Scikit-learn, TensorFlow
Kaggle Notebooks (for practice datasets and kernels)
Coursera & edX courses (for supplementary learning)
Datasets for Practice
UCI Machine Learning Repository (iris, wine, adult demographic, etc.)
Kaggle Datasets (Titanic, IMDB, MNIST, Avito Demand Prediction)
Government of India Open Data Portal (datasets related to Pune area, weather, transport, etc.)
8. Certification & Job Assistance
Course Completion Certificate:
Awarded to participants who complete all assignments and quizzes and achieve at least 75% attendance.
Letter of Recommendation available upon request (for outstanding performers).
Placement Support:
Resume review & mock interviews (TechBodhi placement team).
Introduction to partner companies in Pune for internships & entry‐level Data Science roles.
Post-Course Resources:
Access to alumni group on WhatsApp / Telegram for networking, doubt clearing, and job leads.
Three months of weekend “Doubt Solving Sessions”.
9. Why Learn at TechBodhi, Pune?
Hands-On Learning: Emphasis on practical labs, real datasets, and industry‐relevant projects.
Experienced Faculty: Instructors with 10+ years of experience in Data Science & Analytics.
Small Batch Size: Maximum of 20 students ensures personalized attention and mentorship.
Pearson VUE Exam Center Onsite: Convenient for learners who wish to take certification exams (e.g., Microsoft DP-100, IBM Data Science).
Industry Connections: Tie-ups with Pune-based startups and IT firms for internships and placements.
Updated Curriculum: Content regularly reviewed to align with latest trends (AutoML, Explainable AI, MLOps).