Lecturer: Dr Santosh Kumar Nukavarapu
Data analytics and data modeling are in great demand in the era of big data. The first half of the course focuses on data analytics tools and on how to use them to analyze data and support decision-making based on insights from the data. We will study specialized systems and algorithms developed to work with data at scale, including parallel database systems, Apache Spark, and its contemporaries. The second half of the course is closer to data mining: it explores models and algorithms that extract value from data and introduces algorithms designed for processing big data. It also covers algorithms that work with different types of data (static and streaming) and that can be applied to real-world problems in domains such as recommendations and graphs. Finally, we will discuss generative AI technology and the emerging shift to data-centric models in the context of big data.
This project explores real estate transactions in Connecticut from 2001 to 2022 using a dataset of over 1 million records. My objective is to use Exploratory Data Analysis (EDA) to understand how sale prices compare to assessed values, identify market trends, uncover pricing anomalies, and explore variation by property type. The analysis was done in Google Colab.
Note: This project is only an exploratory analysis, not a detailed analysis of what really happened. The aim is to help understand the issues that arise when working with a large dataset.
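One way to compare sale prices with assessed values and surface pricing anomalies is to compute a per-record ratio and flag extreme values with Tukey's IQR rule. The sketch below is only an illustration under stated assumptions: the field names `assessed_value` and `sale_amount`, the sample records, and the 1.5 × IQR threshold are hypothetical and are not taken from the project notebook.

```python
from statistics import quantiles


def sales_ratios(records):
    """Return assessed_value / sale_amount for records with a positive sale amount.

    The dictionary keys are hypothetical names chosen for this sketch.
    """
    return [
        r["assessed_value"] / r["sale_amount"]
        for r in records
        if r["sale_amount"] > 0  # skip zero-dollar sales to avoid division by zero
    ]


def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Made-up records for illustration only.
sample_records = [
    {"assessed_value": 140_000, "sale_amount": 200_000},
    {"assessed_value": 105_000, "sale_amount": 150_000},
    {"assessed_value": 210_000, "sale_amount": 300_000},
    {"assessed_value": 180_000, "sale_amount": 250_000},
    {"assessed_value": 130_000, "sale_amount": 200_000},
    {"assessed_value": 136_000, "sale_amount": 200_000},
    {"assessed_value": 150_000, "sale_amount": 200_000},
    {"assessed_value": 250_000, "sale_amount": 10_000},  # suspiciously cheap sale
    {"assessed_value": 90_000, "sale_amount": 0},        # dropped by sales_ratios
]

ratios = sales_ratios(sample_records)
print(iqr_outliers(ratios))
```

On a dataset of over a million rows the same idea would normally be expressed with a dataframe library; plain Python is used here only to keep the sketch self-contained.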
This project involves setting up cloud accounts such as Qdrant, the Alpaca API (financial news), and Comet ML, and cloning open-source code from GitHub to run on the ODU Wahab cluster.
In this project, I built an end-to-end supervised machine-learning pipeline that predicts whether a LendingClub loan will be fully paid or charged off. Using 381k historical loan records and loan-level features, I first cleaned and transformed the raw data (e.g., converting percentage strings and employment-length text to numeric values, one-hot encoding categorical variables, and standard-scaling numeric features). I then trained and compared four classifiers (Logistic Regression, Random Forest, Support Vector Machine, and K-Nearest Neighbors) using a stratified 80/20 train-test split and class-imbalance-aware metrics.
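The cleaning steps named above can be sketched in a few small functions. This is a minimal illustration, not the project's actual code: the function names are hypothetical, and the assumed input formats (`"13.56%"`, `"10+ years"`, `"< 1 year"`, `"n/a"`) and the choice to map "less than one year" to 0 are assumptions made for the example.

```python
import re


def parse_percent(text):
    """Convert a string such as '13.56%' to the float 13.56; blank input gives None."""
    cleaned = text.strip().rstrip("%").strip()
    return float(cleaned) if cleaned else None


def parse_emp_length(text):
    """Map employment-length text to whole years.

    Assumed conventions: '< 1 year' -> 0, '10+ years' -> 10,
    and text without digits (e.g. 'n/a') -> None.
    """
    if text is None:
        return None
    match = re.search(r"\d+", text)
    if match is None:
        return None
    return 0 if text.strip().startswith("<") else int(match.group())


def one_hot(values):
    """One-hot encode a list of category labels.

    Returns (rows, categories), where each row has a 1 in the column of
    its category and the columns follow the sorted category order.
    """
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories


print(parse_percent("13.56%"), parse_emp_length("10+ years"))
print(one_hot(["RENT", "OWN", "RENT"]))
```

Missing values come back as `None` rather than a filled-in number, which leaves the imputation choice to a later, explicit step.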
Random Forest delivered the best trade-off between recall for defaulted loans and overall discrimination (ROC AUC ≈ 0.70), correctly identifying about 73% of defaults while keeping false alarms manageable. Feature importance analysis highlighted interest rate, debt-to-income ratio, and revolving-credit utilization as key risk drivers. The study underscores the value of rigorous preprocessing and model comparison in credit-risk analytics and suggests future gains from SMOTE resampling, hyperparameter tuning, and explainable-AI tools such as SHAP.
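To make the evaluation setup concrete, the sketch below shows what a stratified 80/20 split and a recall calculation for the minority class amount to. It is a plain-Python illustration under assumptions: the function names, the label encoding (1 for a default, 0 for a fully paid loan), and the fixed seed are hypothetical, and a real pipeline would more likely rely on a machine-learning library's built-in utilities.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split row indices so each class keeps about the same share in both parts.

    Returns (train_indices, test_indices), each sorted.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within each class
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


def recall(y_true, y_pred, positive=1):
    """Share of actual positives that the model also predicted as positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Toy labels: 80 fully paid (0) and 20 defaulted (1) loans.
labels = [0] * 80 + [1] * 20
train, test = stratified_split(labels)
print(len(train), len(test), sum(labels[i] for i in test))
```

Recall on the default class is the metric behind a statement such as "identifying about 73% of defaults": it counts only actual defaults, so a model cannot score well simply by predicting the majority class.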
In this project, I developed a large-scale product recommendation system using Apache Spark’s Alternating Least Squares (ALS) algorithm. Leveraging the publicly available RetailRocket dataset, which contains millions of user–item interactions, I built a collaborative filtering model to generate personalized top-N product recommendations and predict a user’s next likely purchase.
The system was implemented and tuned on Databricks Community Edition, achieving scalable performance despite the dataset’s extreme sparsity. This project demonstrated the power of big data tools in applied microeconomics and e-commerce analytics, combining data preprocessing, model tuning, and evaluation metrics such as HitRate@K and MAP@K to deliver actionable insights for personalization at scale.
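HitRate@K and MAP@K have several variants in the literature, so the sketch below fixes one common definition of each purely for illustration. The function names and the toy data are hypothetical, each ranked list is assumed to contain no duplicate items, and average precision is normalized by min(K, number of relevant items); the project itself may have used a different variant.

```python
def hit_rate_at_k(recommended, relevant, k):
    """Fraction of users with at least one relevant item in their top-k list.

    recommended: dict of user -> ranked list of item ids (best first)
    relevant:    dict of user -> set of held-out relevant item ids
    Users with no relevant items are ignored.
    """
    users = [u for u in relevant if relevant[u]]
    if not users:
        return 0.0
    hits = sum(1 for u in users if set(recommended.get(u, [])[:k]) & relevant[u])
    return hits / len(users)


def average_precision_at_k(ranked, relevant_items, k):
    """Average precision over the top-k ranks, normalized by min(k, #relevant)."""
    hits, score = 0, 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant_items:
            hits += 1
            score += hits / rank  # precision at this rank
    denom = min(k, len(relevant_items))
    return score / denom if denom else 0.0


def map_at_k(recommended, relevant, k):
    """Mean of average_precision_at_k over users with at least one relevant item."""
    users = [u for u in relevant if relevant[u]]
    if not users:
        return 0.0
    total = sum(
        average_precision_at_k(recommended.get(u, []), relevant[u], k) for u in users
    )
    return total / len(users)


# Toy example with two users.
recs = {"u1": ["a", "b", "c"], "u2": ["d", "e", "f"]}
held_out = {"u1": {"b"}, "u2": {"x"}}
print(hit_rate_at_k(recs, held_out, k=3), map_at_k(recs, held_out, k=3))
```

HitRate@K only asks whether any relevant item appears in the top K, while MAP@K also rewards placing relevant items nearer the top, so the two are usually reported together.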