This project explores fundamental concepts in data science and machine learning using the Iris dataset and libraries like Pandas, NumPy, and Matplotlib within the Google Colab environment. It covers topics ranging from basic Python function creation to data manipulation, analysis, and visualization.
The project begins with a simple programming exercise to implement a function that identifies numbers divisible by both 2 and 3 within a given range. This serves as an introduction to Python function syntax and control flow.
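A minimal sketch of such a function (the name and range bounds here are illustrative, not taken from the original notebook):

```python
def divisible_by_2_and_3(start, end):
    """Return the numbers in [start, end] divisible by both 2 and 3 (i.e., by 6)."""
    return [n for n in range(start, end + 1) if n % 2 == 0 and n % 3 == 0]

print(divisible_by_2_and_3(1, 30))  # [6, 12, 18, 24, 30]
```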
Next, the project delves into data analysis using Pandas. The Iris dataset is loaded and analyzed to answer questions regarding species representation, descriptive statistics, and correlations between different features. This section provides hands-on experience with Pandas data structures and functions.
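The analysis might look roughly like the following, assuming the scikit-learn copy of the Iris dataset (the notebook's actual loading code is not shown here):

```python
import pandas as pd
from sklearn.datasets import load_iris

# Load the Iris dataset into a DataFrame.
iris = load_iris(as_frame=True)
df = iris.frame
df["species"] = iris.target_names[iris.target]

print(df["species"].value_counts())                    # species representation
print(df.describe())                                   # descriptive statistics
print(df.drop(columns=["species", "target"]).corr())   # pairwise feature correlations
```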
NumPy, a fundamental library for numerical operations, is introduced subsequently. The project involves reshaping and plotting an array of binary values to visualize patterns as alternating light and dark shades. This reinforces the concept of data manipulation and visual representation using NumPy and Matplotlib.
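The exact array used in the notebook is not shown, but one plausible construction is a 1-D array of alternating 0s and 1s, reshaped into a 2-D grid and rendered with a grayscale colormap:

```python
import numpy as np
import matplotlib.pyplot as plt

values = np.tile([0, 1], 32)   # 64 alternating binary values
grid = values.reshape(8, 8)    # reshape into an 8x8 grid

plt.imshow(grid, cmap="gray")  # alternating light and dark stripes
plt.title("Alternating binary pattern")
plt.show()
```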
Matplotlib is further explored to visualize the concept of exponential decay. The project challenges the user to create a plot illustrating the depreciating value of a house over time, providing practical experience in plotting data and manipulating graph elements using Matplotlib.
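A sketch of such a plot, using hypothetical figures (a $500,000 house losing 5% of its value per year):

```python
import numpy as np
import matplotlib.pyplot as plt

initial_value = 500_000   # hypothetical purchase price
decay_rate = 0.05         # hypothetical 5% annual depreciation
years = np.arange(0, 31)
value = initial_value * (1 - decay_rate) ** years   # exponential decay

plt.plot(years, value)
plt.xlabel("Years")
plt.ylabel("House value ($)")
plt.title("Exponential decay of a house's value over time")
plt.grid(True)
plt.show()
```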
The project also includes a section on 3-way holdout splitting, partitioning data into training, validation, and test sets, which is essential for evaluating machine learning model performance; it also briefly mentions the use of Midjourney.
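A common way to perform a 3-way holdout split with scikit-learn is to call train_test_split twice; the 60/20/20 ratio below is an example choice, not necessarily the one used in the project:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First carve out a 20% test set, then split the remainder into training
# and validation sets; 0.25 of the remaining 80% gives 60/20/20 overall.
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 90 30 30
```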
Overall, this project demonstrates a well-rounded approach to introducing core concepts in data science and machine learning. By combining practical exercises with widely used libraries within the Google Colab environment, the project equips users with valuable skills for conducting data analysis, visualization, and preliminary machine-learning tasks.
Sentiment Analysis of Movie Reviews: Unveiling Emotions in Text
In today's digital age, where opinions are constantly shared online, understanding the sentiment expressed in text data has become crucial. Sentiment analysis, a branch of Natural Language Processing (NLP), aims to decipher the emotions and opinions behind written words. This project delves into the realm of sentiment analysis by focusing on classifying movie reviews as positive or negative, offering valuable insights into how machines can learn to understand human emotions.
The project leverages the vast Internet Movie Database (IMDb) dataset, which contains 50,000 movie reviews pre-labeled with sentiment polarity. This dataset provides a rich training ground for developing a machine learning model capable of automatically predicting the sentiment of a given review.
The journey begins with setting up the environment and performing basic data exploration. We utilize essential libraries like Pandas for data manipulation, NLTK for text processing, and scikit-learn for machine learning tasks. By examining the dataset's structure and content, we gain a preliminary understanding of the task at hand.
Next, we tackle the challenge of preparing the text data for analysis. Since computers cannot directly comprehend human language, we need to preprocess the text using various NLP techniques. This involves tokenization, removing punctuation and stop words, and converting the text to lowercase. Stemming, a crucial step that reduces words to their root form, further enhances the model's performance.
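A minimal sketch of this preprocessing pipeline using NLTK (the helper name and exact ordering of steps are assumptions based on the description above):

```python
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download("punkt")
nltk.download("stopwords")

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(review):
    """Lowercase, tokenize, drop punctuation and stop words, then stem."""
    tokens = word_tokenize(review.lower())
    tokens = [t for t in tokens if t not in string.punctuation and t not in stop_words]
    return " ".join(stemmer.stem(t) for t in tokens)

print(preprocess("This movie was absolutely wonderful, I loved it!"))
```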
Once the text data is preprocessed, we employ count vectorization, a method that transforms text into numerical vectors by counting the frequency of each word in a document. This representation enables the machine learning model to process and learn from the text data. We then divide the data into training and testing sets, utilizing a Naive Bayes classifier to train our model.
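The vectorization, split, and training steps could be sketched as follows; the tiny in-line reviews are placeholders for the real IMDb data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Placeholder data: preprocessed review strings and their 0/1 sentiments.
reviews = ["great movi love it", "terribl wast of time", "wonder act superb", "bore plot aw"]
labels = [1, 0, 1, 0]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(reviews)   # bag-of-words count matrix

X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.25, random_state=42)

model = MultinomialNB()
model.fit(X_train, y_train)
```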
After training, we evaluate the model's performance using metrics such as accuracy. This step allows us to assess how well the model generalizes to unseen data. To gain deeper insights into the model's behavior, we employ visualizations like word clouds, which highlight the most frequent words associated with positive and negative reviews.
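Continuing the sketch above, evaluation and a word cloud of the positive-review vocabulary might look like this (the wordcloud package is assumed to be installed):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from wordcloud import WordCloud

# Accuracy on the held-out test set.
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))

# Word cloud built from the positive reviews in the placeholder data.
positive_text = " ".join(r for r, l in zip(reviews, labels) if l == 1)
wc = WordCloud(background_color="white").generate(positive_text)
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```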
Finally, the project encourages collaborative learning by introducing group work. Each group member is tasked with exploring different classifiers, fostering a comprehensive understanding of how various models handle sentiment analysis tasks. This collaboration leads to a more nuanced perspective on the topic and enhances the learning experience.
In conclusion, this project provides a hands-on introduction to the world of sentiment analysis and its application in classifying movie reviews. By combining data science, machine learning, and NLP techniques, we gain valuable insights into how machines can learn to decipher human emotions hidden within textual data. The project's emphasis on collaborative learning further strengthens the understanding of various approaches to sentiment analysis. As we continue to refine and develop such models, the potential to unlock deeper meaning and emotion within textual data is boundless.
This project delves into the crucial area of animal health and conservation, leveraging the power of data science and machine learning to predict the risk of an animal's condition becoming dangerous. The project utilizes a dataset from Kaggle, encompassing various animal species and their associated symptoms. The primary objective is to develop a predictive model that can accurately identify animals at risk of dying based on observed symptoms, contributing to bio-heritage conservation and animal welfare.
Data Exploration and Preprocessing:
The initial phase involves importing the dataset and conducting exploratory data analysis (EDA). This includes examining the dataset's structure, data types, and identifying potential issues like missing values. The dataset undergoes thorough cleaning, including handling inconsistencies in animal names, addressing repetitive symptoms, removing special characters, and correcting spelling errors. These preprocessing steps are crucial for improving the quality and reliability of the data, ultimately enhancing model performance.
Addressing Dataset Imbalance:
The project acknowledges the challenge of an imbalanced dataset, where one class (e.g., dangerous vs. not dangerous) is significantly underrepresented compared to the other. This imbalance can lead to biased model predictions. To mitigate this issue, the project employs Random Over Sampling, a technique that randomly duplicates examples from the minority class, creating a more balanced dataset for training the model effectively.
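Random oversampling is typically done with the imbalanced-learn package; the snippet below uses synthetic stand-in data, since the real features come from the animal-symptom dataset:

```python
from collections import Counter
from imblearn.over_sampling import RandomOverSampler
from sklearn.datasets import make_classification

# Stand-in imbalanced data (roughly a 90% / 10% class split).
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print("Before:", Counter(y))

ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X, y)   # duplicates minority-class rows
print("After:", Counter(y_resampled))               # classes now equally represented
```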
Model Building and Evaluation:
A Random Forest Classifier is employed as the primary model for predicting animal danger levels. The model is trained on the preprocessed and balanced dataset, and its performance is evaluated using metrics like accuracy, precision, recall, and F1-score. The project also explores other machine learning models, including Logistic Regression, Gradient Boosting Classifier, XGBoost, SVM Classifier, KNN Classifier, Decision Tree Classifier, LightGBM, and AdaBoostClassifier, to compare their predictive capabilities and potentially identify a superior model.
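A compact way to train and compare several of these models, continuing from the resampled data above (XGBoost and LightGBM would be added to the dictionary in the same way):

```python
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_resampled, y_resampled, test_size=0.2, random_state=42)

models = {
    "Random Forest": RandomForestClassifier(random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    "AdaBoost": AdaBoostClassifier(random_state=42),
}

for name, clf in models.items():
    clf.fit(X_train, y_train)
    print(name)
    print(classification_report(y_test, clf.predict(X_test)))  # accuracy, precision, recall, F1
```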
ROC Curve Analysis:
The Receiver Operating Characteristic (ROC) curve is used to visualize the performance of the classification models. The ROC curve plots the true positive rate against the false positive rate, providing a comprehensive assessment of the model's ability to distinguish between the two classes. The area under the ROC curve (AUC) serves as a quantitative measure of the model's overall performance.
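Continuing from the comparison above, the ROC curve and AUC for the Random Forest could be computed like this:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import auc, roc_curve

# Probability estimates for the positive class from the fitted Random Forest.
y_score = models["Random Forest"].predict_proba(X_test)[:, 1]

fpr, tpr, _ = roc_curve(y_test, y_score)
roc_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, label=f"Random Forest (AUC = {roc_auc:.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="Chance")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```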
Conclusion:
This project demonstrates the application of data science and machine learning techniques to address a real-world challenge in animal health and conservation. The rigorous data preprocessing, handling of dataset imbalance, and exploration of multiple machine learning models contribute to the development of a robust predictive model. The evaluation metrics and ROC curve analysis provide valuable insights into the model's performance, ensuring its effectiveness in identifying animals at risk. The findings of this project have the potential to significantly improve animal welfare and aid in bio-heritage conservation efforts. By providing a predictive tool for identifying animals in danger, the project empowers wildlife professionals and conservationists to take timely and informed actions, ultimately contributing to the preservation of biodiversity and the well-being of animal populations.
The project outlined in the code aims to tackle the pervasive issue of fake news by building a detection model. Leveraging the power of Natural Language Processing (NLP) and Machine Learning (ML), the project seeks to classify news articles as either "real" or "fake." This essay will explore the key steps involved in this project, from data preprocessing to model evaluation, and discuss its significance in the fight against misinformation.
The project begins with importing essential libraries such as Pandas, NLTK, and Scikit-learn, providing the necessary tools for data manipulation, text processing, and model building. The dataset, obtained from a GitHub repository, is then loaded and explored using functions like head(), describe(), and value_counts() to gain initial insights into its structure and content. Data cleaning is a crucial step, addressing missing values and converting text to a consistent format for further analysis.
The core of the project lies in text preprocessing using NLP techniques. The code demonstrates the application of tokenization, punctuation removal, lowercasing, and stop word removal. These steps transform raw text into a structured format suitable for machine learning algorithms. Notably, PorterStemmer() is applied to reduce words to their root form (stemming), further enhancing the model's ability to generalize.
After preprocessing, the data is split into training and testing sets to evaluate the model's performance on unseen data. CountVectorizer() is employed to convert text into numerical representations, creating a feature matrix for the machine learning models. LabelBinarizer() transforms the categorical labels ("real" or "fake") into a binary format.
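LabelBinarizer maps the two string labels onto 0 and 1; classes are sorted alphabetically, so "fake" becomes 0 and "real" becomes 1:

```python
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
y = lb.fit_transform(["real", "fake", "fake", "real"])
print(lb.classes_)   # ['fake' 'real']
print(y.ravel())     # [1 0 0 1]
```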
Two machine learning models are utilized for classification: Multinomial Naive Bayes and Support Vector Machine (SVM). The code demonstrates the training, prediction, and accuracy evaluation of both models. The results provide insights into the effectiveness of each model in identifying fake news.
Finally, a WordCloud visualization is generated to highlight the most frequent words in "real" news articles, offering a visual representation of the dataset's characteristics.
In conclusion, this project showcases a comprehensive approach to fake news detection, encompassing data preprocessing, NLP techniques, machine learning models, and visualization. By accurately classifying news articles, the project contributes to combating the spread of misinformation and promoting a more informed society. While the project focuses on a specific dataset and models, it serves as a valuable example for future research and development in the field of fake news detection.
A Practical Introduction to SQL: Exploring the New York City Airbnb Dataset
This project serves as a practical introduction to the world of SQL using the engaging and real-world New York City Airbnb dataset. The central objective is to equip users with a foundational understanding of basic SQL syntax and data analysis techniques through hands-on experience with a relevant example.
The project guides users through a series of essential steps. It begins by establishing a connection to an SQLite database using the sqlite3 library in Python, creating a dedicated environment for storing and querying the data. Subsequently, the Airbnb dataset, initially stored in a CSV file, is seamlessly loaded into a Pandas DataFrame and then imported into the SQLite database, making it readily accessible for SQL operations.
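A sketch of this setup step, with hypothetical file and table names:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect("airbnb.db")    # hypothetical database file
df = pd.read_csv("AB_NYC_2019.csv")    # hypothetical CSV file name

# Write the DataFrame into a SQL table so it can be queried with SQL.
df.to_sql("listings", conn, if_exists="replace", index=False)
```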
The heart of the project lies in the formulation and execution of various SQL queries designed to explore the rich Airbnb data. These queries cover a wide range of tasks, including retrieving specific columns, counting the total number of records, calculating average prices and review numbers, filtering listings based on criteria like neighborhood and price, and grouping data to gain insightful perspectives into different facets of the listings.
Beyond query execution, the project emphasizes the importance of responsible database management by highlighting the necessity of closing the connection upon completion. This practice releases the resources held by the connection and promotes a tidy, efficient workflow.
Through this immersive experience, users gain invaluable practical skills in utilizing fundamental SQL commands such as SELECT, FROM, WHERE, GROUP BY, ORDER BY, COUNT, and AVG. They learn to craft queries that effectively answer specific questions about the data, for instance, identifying the top neighborhoods by average price or determining the number of listings for each room type.
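Two queries of this kind, continuing from the connection above (column names such as neighbourhood, price, and room_type are assumptions about the dataset's schema):

```python
# Top neighborhoods by average price.
top_neighbourhoods = pd.read_sql_query("""
    SELECT neighbourhood, AVG(price) AS avg_price
    FROM listings
    GROUP BY neighbourhood
    ORDER BY avg_price DESC
    LIMIT 5;
""", conn)

# Number of listings per room type.
room_type_counts = pd.read_sql_query("""
    SELECT room_type, COUNT(*) AS n_listings
    FROM listings
    GROUP BY room_type;
""", conn)

print(top_neighbourhoods)
print(room_type_counts)

conn.close()   # release the database connection when finished
```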
The project's utilization of a real-world dataset greatly enhances its relevance and applicability. By interacting with genuine data, users gain a deeper understanding of how SQL can be effectively employed in real-world scenarios. This project serves as a robust foundation for future explorations into the world of SQL and data analysis techniques, empowering users to confidently navigate the complexities of data-driven decision-making.
In conclusion, this project offers a comprehensive and practical introduction to SQL, equipping users with the essential skills and knowledge necessary to analyze real-world data with confidence. Its emphasis on hands-on experience, real-world relevance, and responsible database management practices creates a valuable learning opportunity for individuals seeking to unlock the power of SQL in their data analysis endeavors.
This project delves into the fascinating realm of sentiment analysis, applying Natural Language Processing (NLP) and machine learning techniques to classify movie reviews as positive or negative. The primary goal is to build a model to predict the sentiment expressed in a given review accurately. This essay outlines the project's key steps, from data preparation to model evaluation, highlighting the significance of each stage.
The project utilizes the IMDb movie reviews dataset, a rich collection of 50,000 pre-labeled reviews. The initial phase involves data preprocessing, which is crucial for transforming raw text into a machine-readable format. Using the NLTK library, the reviews undergo tokenization, punctuation removal, lowercasing, and stop word elimination. This cleaning process ensures that only relevant words are considered for analysis. Further refinement is achieved through stemming, reducing words to their root form using the PorterStemmer.
Once the data is prepared, it's divided into training and testing sets. This split allows the model to learn from a portion of the data (training set) and then be evaluated on unseen data (testing set), ensuring its generalization ability. CountVectorizer converts the text into a numerical representation, creating a matrix of word frequencies. The sentiment labels are also transformed into numerical form using LabelBinarizer.
With the data ready, a Multinomial Naive Bayes classifier is chosen for model building. This classifier is well-suited for text classification tasks due to its ability to handle discrete features like word counts. The model is trained on the training data, and its performance is assessed using the accuracy score on the testing data. Achieving an accuracy of around 70-80% demonstrates the model's effectiveness in sentiment prediction.
The project further explores the use of alternative classifiers, encouraging experimentation and comparison. Each team member investigates a different classifier and discusses their findings and insights. This collaborative approach promotes a deeper understanding of various machine learning models and their suitability for sentiment analysis.
Data visualization is crucial for revealing the data's underlying patterns. Word clouds are generated to visualize the most frequent words associated with positive and negative reviews, providing valuable insight into the language used to express each sentiment. Additionally, a visualization of the sentiment distribution offers a clear overview of the dataset's balance between positive and negative reviews.
In conclusion, this project provides a comprehensive introduction to sentiment analysis, covering data preparation, model building, evaluation, and visualization techniques. By applying NLP and machine learning, it successfully develops a model capable of predicting sentiment in movie reviews. The project also encourages the exploration of different classifiers and data visualization methods, fostering a deeper understanding of the field. This practical experience equips participants with valuable skills in data science and machine learning, applicable to a wide range of real-world problems.
Unveiling User Preferences: A Movie Recommendation System Using Singular Value Decomposition
In today's digital age, where an abundance of choices often leads to decision paralysis, recommendation systems have emerged as invaluable tools for navigating the vast sea of content. This project delves into the realm of movie recommendations, leveraging the power of Singular Value Decomposition (SVD) to predict user preferences and provide tailored suggestions.
The project utilizes the widely recognized MovieLens 100k dataset, a rich collection of user ratings on movies, to train and evaluate the recommendation model. SVD, a powerful matrix factorization technique, is employed to uncover latent features representing underlying user preferences and movie characteristics. By decomposing the user-item rating matrix, SVD identifies hidden patterns that drive rating behavior, allowing the model to predict unknown preferences.
The process involves several key steps. First, the dataset is preprocessed to handle missing values and convert timestamps into a readable format. The data is then split into training and testing subsets to evaluate the model's performance. Hyperparameter tuning is performed to optimize the SVD algorithm, ensuring the most accurate predictions possible.
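One common way to implement these steps is with the scikit-surprise library, which ships MovieLens 100k as a built-in dataset; whether the project used this particular library is an assumption:

```python
from surprise import SVD, Dataset, accuracy
from surprise.model_selection import GridSearchCV, train_test_split

data = Dataset.load_builtin("ml-100k")

# Hyperparameter tuning over a small example grid.
param_grid = {"n_factors": [50, 100], "lr_all": [0.002, 0.005], "reg_all": [0.02, 0.1]}
gs = GridSearchCV(SVD, param_grid, measures=["rmse"], cv=3)
gs.fit(data)

# Train the best model on a train/test split and evaluate it.
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)
algo = SVD(**gs.best_params["rmse"])
algo.fit(trainset)
accuracy.rmse(algo.test(testset))

# Predict a single user-movie rating (raw ids are strings in ml-100k).
print(algo.predict(uid="196", iid="302").est)
```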
The trained model predicts ratings for user-movie pairs, providing a basis for personalized recommendations. By recommending movies with the highest predicted ratings, the system aims to match users with content they are likely to enjoy. This personalized approach enhances user satisfaction and engagement, creating a more enjoyable movie-watching experience.
To further refine the recommendations, the predicted ratings are rounded to the nearest value on the dataset's 1-to-5 rating scale. This ensures that the suggestions align with the familiar rating categories, making them clear and easy to interpret.
The project demonstrates the effectiveness of SVD in building accurate and personalized movie recommendation systems. It provides valuable insights into the underlying factors that influence user preferences, highlighting the potential of recommendation systems in enhancing content discovery and user satisfaction.
By applying data science principles and employing powerful algorithms like SVD, this project showcases the ability to transform raw data into valuable insights that drive personalized recommendations. The resulting movie recommendation system has the potential to revolutionize the way users discover and engage with content, ultimately making the movie-watching experience more enjoyable and fulfilling.
Data Cleaning: Preparing the Netflix Dataset for Analysis
In the ever-evolving landscape of data science, the process of data cleaning stands as a fundamental pillar, ensuring the quality and reliability of any subsequent analysis. Before embarking on the exciting journey of extracting insights and knowledge from data, it is crucial to transform raw data into a pristine and usable format. This project delves into the intricacies of data cleaning, utilizing the popular Netflix dataset as a case study.
The Netflix dataset, a treasure trove of information about movies and TV shows, presents a unique opportunity to explore various data-cleaning techniques. However, like any real-world dataset, it is not immune to imperfections. Missing values, duplicate entries, inconsistent data types, and formatting variations can all hinder the effectiveness of data analysis. This project aims to address these challenges, ultimately preparing the Netflix dataset for further exploration and analysis.
The journey begins with data loading and exploration, where we familiarize ourselves with the dataset's structure and contents. Using the Python programming language and the Pandas library, we import the dataset and conduct initial investigations to understand its overall composition. This step lays the foundation for the subsequent cleaning process.
Next, we tackle the issue of missing values, a common problem in datasets. We identify columns with missing data and carefully consider the implications for our analysis. Depending on the nature and extent of missingness, we employ appropriate strategies, such as imputation or removal. Imputation involves filling in missing values using statistical methods or logical deductions, while removal involves discarding rows or columns with excessive missing data.
Duplicate entries, another potential source of data inaccuracy, are addressed in the next stage. We identify and remove any duplicate rows in the dataset, ensuring that each data point is unique and representative. This step helps maintain data integrity and avoids potential biases in analysis.
Data type standardization is crucial for seamless data manipulation and analysis. We convert data types to appropriate formats, such as transforming dates to datetime objects. This ensures consistency and facilitates the application of various data analysis techniques.
Finally, we address inconsistencies in data entries, which can arise from variations in capitalization, formatting, or data entry errors. We identify and resolve these inconsistencies, ensuring uniformity and reliability in the dataset.
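A condensed sketch of these cleaning steps (the file and column names below follow the Kaggle version of the dataset but should be treated as assumptions):

```python
import pandas as pd

df = pd.read_csv("netflix_titles.csv")

# Missing values: fill categorical gaps, drop rows missing critical fields.
df["director"] = df["director"].fillna("Unknown")
df = df.dropna(subset=["date_added"])

# Duplicates: keep each record only once.
df = df.drop_duplicates()

# Data types: parse dates into proper datetime objects.
df["date_added"] = pd.to_datetime(df["date_added"].str.strip())

# Inconsistencies: normalize whitespace and capitalization in text columns.
df["type"] = df["type"].str.strip().str.title()
```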
Through this meticulous data cleaning process, we transform the raw Netflix dataset into a pristine and analysis-ready form. The resulting dataset is a valuable asset for further data exploration, enabling us to uncover insights into trends, patterns, and relationships within the vast Netflix content library.
In conclusion, data cleaning is an indispensable step in any data science project. This project, focused on the Netflix dataset, provided hands-on experience in applying various data-cleaning techniques. By addressing missing values, duplicates, data types, and inconsistencies, we ensured data quality and reliability, ultimately preparing the dataset for insightful analysis. The skills and knowledge gained from this project will undoubtedly contribute to the success of future data science endeavors.
This project delves into the NYC OpenData Motor Vehicle Collisions dataset to understand factors contributing to crashes and propose recommendations for improving road safety, particularly for vulnerable road users. Utilizing Python and libraries like Pandas, Matplotlib, Seaborn, and Folium, the project encompasses data preparation, ethical considerations, preprocessing, exploration, time series analysis, and geospatial analysis.
Data Preparation and Exploration: The initial phase involved importing necessary libraries, loading the dataset, and conducting preliminary data exploration. Descriptive statistics and visualizations, such as bar charts, were used to identify the top contributing factors to crashes and the most frequently involved vehicle types. This analysis highlighted the importance of driver attentiveness, yielding right-of-way, and maintaining safe following distances.
Time Series Analysis: To understand trends over time, the project incorporated time series analysis. Examining crash frequencies per hour of the day revealed a peak during the afternoon rush hour, suggesting factors like fatigue and distractions might be at play. Additionally, analyzing monthly crash data revealed a significant decrease during the COVID-19 pandemic, attributed to reduced traffic volume and travel restrictions.
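The hourly analysis could be sketched as follows (the file and column names follow the NYC OpenData export but are assumptions here):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("Motor_Vehicle_Collisions_-_Crashes.csv")

# Extract the hour from the crash time (e.g., "14:30" -> 14) and count crashes per hour.
df["hour"] = pd.to_datetime(df["CRASH TIME"], format="%H:%M").dt.hour
crashes_per_hour = df.groupby("hour").size()

crashes_per_hour.plot(kind="bar")
plt.xlabel("Hour of day")
plt.ylabel("Number of crashes")
plt.title("Crash frequency by hour")
plt.show()
```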
Geospatial Analysis: Geospatial analysis was conducted to identify high-risk areas. Bar charts revealed Brooklyn as the borough with the highest number of crashes, potentially due to its higher population density and traffic volume. Heatmaps and severity maps pinpointed specific intersections and regions with concentrations of crashes, injuries, and fatalities.
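Continuing from the DataFrame above, a Folium heatmap of crash locations might be built like this:

```python
import folium
from folium.plugins import HeatMap

# Drop rows without coordinates and build a list of [lat, lon] pairs.
coords = df[["LATITUDE", "LONGITUDE"]].dropna().values.tolist()

m = folium.Map(location=[40.7128, -74.0060], zoom_start=11)   # centered on NYC
HeatMap(coords, radius=8).add_to(m)
m.save("crash_heatmap.html")
```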
Research Question and Insights: Building on the initial analysis, the project focused on a specific research question: which borough sees the most crashes, injuries, and fatalities? This question was chosen to identify areas requiring targeted safety interventions. Visualizations supported the initial observation that Brooklyn had the highest number of crashes, injuries, and fatalities, highlighting the need for focused efforts in this borough.
Recommendations: Based on the project's findings, several recommendations were proposed to the Department of Transportation and the Federal Highway Administration:
Targeted interventions in high-risk areas: Focusing on specific intersections and boroughs like Brooklyn where crashes are most frequent.
Public awareness campaigns: To emphasize driver attentiveness, yielding right-of-way, and maintaining safe following distances.
Infrastructure improvements: Consider engineering solutions to improve safety at identified intersections.
Continued data collection and analysis: For ongoing monitoring and identification of emerging trends.
Conclusion: This project demonstrated the power of data science and visualization techniques in understanding transportation safety challenges and generating actionable insights. The analysis provided a comprehensive view of crash patterns, enabling evidence-based recommendations to enhance road safety and reduce the risks for all road users. By addressing the identified factors and implementing the proposed recommendations, we can work towards creating safer roads for everyone.
This project delves into the fascinating world of financial markets using the power of data science. By leveraging Python libraries like yfinance, pandas, and scikit-learn, we embarked on a journey to analyze historical stock data for Apple Inc. (AAPL) and develop a basic trading strategy.
The project was structured into several milestones, each building upon the previous one. It began with retrieving historical stock data using yfinance and performing initial data cleaning tasks. This crucial step ensured the quality and reliability of our analysis.
Next, we conducted exploratory data analysis (EDA) to gain insights into the stock's price trends. Visualizations, such as line plots, helped us observe the stock's historical performance and identify potential patterns. Statistical summaries provided a quantitative understanding of the data's central tendencies and variability.
To delve deeper into trend analysis, we calculated simple moving averages (SMAs) for different time periods. Plotting these SMAs alongside the closing price allowed us to visualize the stock's movement and identify potential buy and sell signals.
Building upon this foundation, we implemented a basic moving average crossover trading strategy. This strategy generated signals based on the crossover points of the 10-day and 50-day SMAs. We visualized these signals on a plot to understand where buy and sell actions would occur according to the strategy.
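A sketch of the moving averages and crossover signals described in the last two paragraphs (the date range is an example choice):

```python
import yfinance as yf
import matplotlib.pyplot as plt

data = yf.download("AAPL", start="2020-01-01", end="2023-12-31")
data.columns = data.columns.get_level_values(0)   # flatten columns if yfinance returns a MultiIndex

# Simple moving averages over 10 and 50 trading days.
data["SMA10"] = data["Close"].rolling(window=10).mean()
data["SMA50"] = data["Close"].rolling(window=50).mean()

# Crossover signal: 1 while the short SMA is above the long SMA, else 0;
# the points where it changes are the buy (+1) and sell (-1) signals.
data["signal"] = (data["SMA10"] > data["SMA50"]).astype(int)
data["crossover"] = data["signal"].diff()

data[["Close", "SMA10", "SMA50"]].plot(figsize=(12, 6), title="AAPL moving average crossover")
plt.show()
```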
To evaluate the strategy's hypothetical performance, we performed backtesting. This involved simulating the execution of the strategy using historical data and calculating the portfolio's total value over time. This provided insights into the strategy's potential profitability and risks.
Finally, we introduced machine learning to predict future price movements. We engineered features from historical price data and trained a Random Forest Classifier model. By evaluating the model's performance using metrics like accuracy and a confusion matrix, we gained insights into its predictive capabilities.
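A sketch of the feature engineering and classification step, continuing from the data DataFrame above (the specific features are assumptions, not necessarily those used in the project):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Simple lagged-return features; the target is whether the next day's close is higher.
data["return_1d"] = data["Close"].pct_change()
data["return_5d"] = data["Close"].pct_change(5)
data["target"] = (data["Close"].shift(-1) > data["Close"]).astype(int)
features = data[["return_1d", "return_5d", "SMA10", "SMA50", "target"]].dropna()

# Chronological split so the test period lies strictly after the training period.
split = int(len(features) * 0.8)
train, test = features.iloc[:split], features.iloc[split:]
X_train, y_train = train.drop(columns="target"), train["target"]
X_test, y_test = test.drop(columns="target"), test["target"]

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```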
Through visualizations like the confusion matrix heatmap and feature importance plot, we further enhanced our understanding of the model's behavior and the factors influencing its predictions.
In conclusion, this project provided a hands-on experience in applying data science techniques to financial market analysis. From data retrieval and cleaning to strategy development and machine learning, we explored various aspects of stock market analysis. The insights gained from this project can serve as a foundation for further exploration and development of more complex trading strategies. By combining data, code, and visualizations, we uncovered valuable information hidden within the financial markets, empowering us to make more informed decisions in this dynamic and exciting field.