"The successful man will profit from his mistakes
and try again in a different way."
by Dale Carnegie
Visualization & dashboard
A comprehensive accident analysis dashboard built in Excel from 2021 and 2022 data. Covering more than 3 million entries, it focuses on key KPIs related to accident casualties and presents them visually to support in-depth analysis.
Skills: Excel
A comprehensive visualization of campaign performance. The dashboard shows sales trends over time, broken down by category. It also uses predictive forecasting to anticipate future trends.
Skills: Tableau
A combination of medical knowledge and data-driven clarity. This dashboard brings together global COVID-19 data in a visual exploration of confirmed cases and recoveries by country, supporting informed strategies and decisions in the fight against the pandemic.
Skills: Python, Plotly
Power of statistics
To determine whether there is a statistically significant difference in mean video view count between verified and unverified TikTok accounts.
Using a two-sample t-test, a hypothesis test commonly used in A/B testing
Because the p-value is extremely small (far below the 5% significance level), the null hypothesis is rejected: there is a statistically significant difference in the mean video view count between verified and unverified accounts on TikTok.
Skills: Probability distribution, Hypothesis testing & Sampling
Python | Pandas | SciPy
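The test described above can be sketched with SciPy roughly as follows. This is only an illustration, not the project's code: the view counts are synthetic, and the group means, spreads, and the names `verified_views` and `unverified_views` are assumptions. It uses Welch's variant (`equal_var=False`), which does not assume equal variances in the two groups.

```python
import numpy as np
from scipy import stats

# Synthetic view counts (assumed values, for illustration only).
rng = np.random.default_rng(0)
verified_views = rng.normal(loc=1200, scale=300, size=400)
unverified_views = rng.normal(loc=1000, scale=300, size=400)

alpha = 0.05  # significance level of 5%

# Welch's two-sample t-test: H0 says both groups have the same mean view count.
t_stat, p_value = stats.ttest_ind(verified_views, unverified_views, equal_var=False)

reject_null = p_value < alpha
print(f"t = {t_stat:.2f}, p = {p_value:.3g}, reject H0: {reject_null}")
```

On real data the two arrays would come from the view-count column filtered by verification status.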
Check the continuous distribution of the AQI column in the dataset.
Applying Statistics & Visualization
The AQI data distribution is approximately normal. Z-scores show that the West Phoenix site has notably worse air quality than the other sites, which calls for further examination and increased resource allocation for improvement.
Skills: Empirical Rule, Z-Score & Hist Plot
Python | Pandas | SciPy | Matplotlib
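A minimal sketch of the two checks named above, the empirical rule and z-scores, might look like this. The AQI values are synthetic and the helper names `empirical_rule_coverage` and `z_scores` are hypothetical. For roughly normal data, about 68%, 95%, and 99.7% of values should fall within 1, 2, and 3 standard deviations of the mean.

```python
import numpy as np

def empirical_rule_coverage(values):
    """Share of values within 1, 2, and 3 standard deviations of the mean."""
    values = np.asarray(values, dtype=float)
    mean, std = values.mean(), values.std(ddof=1)
    return {k: float(np.mean(np.abs(values - mean) <= k * std)) for k in (1, 2, 3)}

def z_scores(values):
    """Number of sample standard deviations each value lies from the mean."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std(ddof=1)

# Synthetic, normally distributed readings (assumed, for illustration only).
rng = np.random.default_rng(1)
aqi = rng.normal(loc=2.0, scale=0.7, size=1000)

coverage = empirical_rule_coverage(aqi)
# Indices of readings more than 3 standard deviations from the mean.
outliers = np.flatnonzero(np.abs(z_scores(aqi)) > 3)
print(coverage, outliers)
```

Readings with an absolute z-score above 3 are the ones worth a closer look.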
Constructing a confidence interval from the AQI dataset for RRE operations
Applying Sampling, EDA & Statistical Test
The 95% confidence interval computed from the sample data is [10.36, 13.88]. Because the entire interval lies above 10, the data suggest at the 95% confidence level that California's population mean AQI is greater than 10.
Skills: Sampling, MOE, P-value & Box Plot
Python | Pandas | SciPy | Seaborn
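The interval construction can be illustrated as sample mean plus or minus margin of error, where the margin is a critical value times the standard error. The sketch below uses only the standard library and a normal (z) critical value. The sample values and the name `mean_confidence_interval` are assumptions, and for small samples a t critical value would be the more careful choice.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, confidence=0.95):
    """Confidence interval for the population mean using a z critical value."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    center = mean(sample)
    margin = z * stdev(sample) / sqrt(len(sample))  # margin of error (MOE)
    return center - margin, center + margin

# Made-up AQI readings, for illustration only.
sample_aqi = [10, 12, 14, 11, 13]
lower, upper = mean_confidence_interval(sample_aqi)
print(f"95% CI: [{lower:.2f}, {upper:.2f}]")
```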
Regression models
Identifying the linear relationship between sales from ads and the streaming services budget, considering both continuous and categorical variables.
I modeled the data with Ordinary Least Squares (OLS) regression and validated the multiple regression assumptions using diagnostic plots and the variance inflation factor (VIF).
Using TV and Radio as the independent variables results in a multiple linear regression model with R²=0.904. In other words, the model explains 90.4% of the variation in Sales, which suggests a strong relationship between advertising spend on TV and Radio and the sales generated.
Skills: OLS, EDA, Residuals, Q-Q Plot, Scatter Plot, VIF.
Python | Pandas | SciPy | Seaborn | Matplotlib
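To make R² and VIF concrete, here is a small sketch using plain NumPy least squares. It is not the project's code: the `tv`, `radio`, and `sales` arrays are synthetic, and the helpers `r_squared` and `vif` are hypothetical names. The VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on the remaining ones. Values near 1 indicate little multicollinearity.

```python
import numpy as np

def r_squared(X, y):
    """R² of an OLS fit of y on X (an intercept column is added here)."""
    X1 = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    residuals = y - X1 @ coef
    ss_res = float(residuals @ residuals)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot

def vif(X):
    """VIF per column: 1 / (1 - R²) from regressing it on the other columns."""
    return [
        1.0 / (1.0 - r_squared(np.delete(X, j, axis=1), X[:, j]))
        for j in range(X.shape[1])
    ]

# Synthetic advertising data (assumed coefficients and noise, illustration only).
rng = np.random.default_rng(42)
tv = rng.uniform(10, 100, size=200)
radio = rng.uniform(0, 50, size=200)
sales = 3.0 * tv + 1.5 * radio + rng.normal(0, 20, size=200)

X = np.column_stack([tv, radio])
model_r2 = r_squared(X, sales)
vifs = vif(X)
print(f"R² = {model_r2:.3f}, VIF = {[round(v, 2) for v in vifs]}")
```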
Determine whether inflight entertainment is a statistically significant predictor of airline customer satisfaction, using satisfaction as the categorical dependent variable and inflight entertainment as the independent variable.
I split the dataset into training and test sets, trained a logistic regression classifier, and evaluated its performance using a Confusion Matrix to calculate metrics like Accuracy, Recall, Precision, and F1 Score.
Customers who rated in-flight entertainment highly were more likely to be satisfied. The logistic model achieved an accuracy of 80.2%, an improvement over the 54.7% baseline given by the dataset's customer satisfaction rate. Interpretation of the Confusion Matrix shed light on false positives and false negatives.
Skills: Logistic Regression, Confusion Matrix, Recall, Accuracy.
Python | Pandas | Scikit-learn | SciPy | Seaborn | Matplotlib
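The metrics mentioned above all derive from the four cells of a binary confusion matrix. The following standard-library sketch shows the arithmetic. The labels are made up and the function names `confusion_counts` and `classification_metrics` are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 computed from the confusion matrix."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up labels: 1 = satisfied, 0 = not satisfied (illustration only).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Comparing accuracy against the majority-class rate, as the write-up does, shows how much the model adds over always predicting the most common label.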
I developed a churn prediction model for Waze to help prevent churn and facilitate business growth. The insights gained from this model will enable Waze's leadership to optimize their retention strategy, enhance the user experience, and make data-driven decisions regarding product development.
I split the dataset into training and test sets, trained a logistic regression classifier on the predictor variables, and evaluated its performance using a Confusion Matrix to calculate important metrics such as Accuracy, Recall, Precision, and F1 Score.
The churn prediction model achieved 82.37% accuracy. Surprisingly, the 'km_per_driving_day' variable showed the strongest positive correlation with churn but ranked as the second-least-important variable in the model. This insight can help Waze optimize their retention strategy and user experience.
Skills: Logistic Regression, Confusion Matrix, Recall, Accuracy.
Python | Pandas | Scikit-learn | SciPy | Seaborn | Matplotlib
Machine learning models
I developed a revenue predictor for taxi drivers in New York that predicts whether a customer will leave a tip. It is a binary classification model that uses a range of factors to forecast customer behavior.
I trained Random Forest and XGBoost classifier models on separate training and test sets, tuned them with cross-validation, and compared them using evaluation metrics including the F1 score.
The XGBoost classifier was the top performer, with an accuracy of 62.56% and an F1 score of 35.78%. Its most important predictors were 'predicted fare', 'mean distance', and 'mean duration'. The model can inform taxi drivers' decisions and help optimize revenue strategies.
Skills: Random Forest, XGBoost Classifier, GridSearchCV, Evaluation Metrics.
Python | Pandas | Scikit-learn | XGBoost | Matplotlib
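The cross-validated tuning step can be sketched with scikit-learn's `GridSearchCV`, scored on F1. This is an assumption-laden illustration rather than the project's pipeline: the data come from `make_classification`, only a Random Forest is tuned, and the parameter grid values are arbitrary. An XGBoost classifier could be tuned the same way, since it follows the scikit-learn estimator interface.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary classification data standing in for the taxi trips.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Small, arbitrary grid (assumed values) searched with 3-fold cross-validation.
param_grid = {"n_estimators": [20, 50], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid, scoring="f1", cv=3
)
search.fit(X_train, y_train)

# Evaluate the refit best model on the held-out test set.
test_f1 = f1_score(y_test, search.predict(X_test))
print(search.best_params_, round(test_f1, 3))
```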
Developed a Python-based solution for clustering FIFA 22 players using K-Means. I loaded player data, implemented K-Means from scratch, and provided an alternative using Scikit-Learn.
I authenticated in Google Colab, loaded player data from Google BigQuery, implemented K-Means manually with NumPy, and visualized the process with PCA. I also provided a Scikit-Learn alternative and evaluated the clusters with inertia and silhouette scores.
My project enables users to cluster FIFA players based on attributes for insights into performance and potential. It offers flexibility with clustering methods and helps users make data-driven decisions in the game.
Skills: SQL, K-Means, PCA, Silhouette score.
Python | Pandas | Scikit-learn | BigQuery | Seaborn
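A from-scratch K-Means in NumPy could look something like the sketch below (Lloyd's algorithm: assign each point to its nearest centroid, then move each centroid to the mean of its points). It is not the project's implementation: the `kmeans` function is a hypothetical name and the two synthetic 2-D blobs stand in for player attributes.

```python
import numpy as np

def kmeans(points, k, n_iters=100, seed=0):
    """Basic K-Means: returns (labels, centroids) for an (n, d) array of points."""
    rng = np.random.default_rng(seed)
    # Start from k distinct points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Distance from every point to every centroid, shape (n, k).
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Move each centroid to the mean of its points (keep it if it has none).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # assignments have stabilized
        centroids = new_centroids
    return labels, centroids

# Two well-separated synthetic blobs (illustration only).
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[10.0, 10.0], scale=0.5, size=(50, 2))
points = np.vstack([blob_a, blob_b])

labels, centroids = kmeans(points, k=2)
print(centroids)
```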
Developed a Python-based, content-based movie recommendation system using TF-IDF and cosine similarity. Users input movie titles and receive real-time recommendations based on movie similarity.
I loaded movie data and implemented a search engine that cleans movie titles and calculates similarity. Interactive widgets allow user input, and a recommendation function finds similar movies based on user preferences.
The project enhances the movie-watching experience by providing personalized movie recommendations. The user-friendly interface fosters engagement. It showcases the application of NLP techniques for data exploration.
Skills: TfidfVectorizer, widgets, ipython
Python | Pandas | Scikit-learn | NumPy
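The title search at the core of such a system can be sketched with scikit-learn's `TfidfVectorizer` and `cosine_similarity`. The four-title list is a toy stand-in for the real movie data, and the helper names `clean_title` and `search` as well as the `ngram_range` setting are assumptions, not the project's actual code.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def clean_title(title):
    """Lowercase the title and strip everything except letters, digits, spaces."""
    return re.sub(r"[^a-zA-Z0-9 ]", "", title).lower()

# Toy title list (illustration only).
titles = ["Toy Story (1995)", "Toy Story 2 (1999)", "Jumanji (1995)", "Heat (1995)"]

# Unigrams and bigrams so that multi-word titles match more precisely.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
tfidf = vectorizer.fit_transform([clean_title(t) for t in titles])

def search(query, top_n=2):
    """Return the top_n titles most similar to the query by cosine similarity."""
    query_vec = vectorizer.transform([clean_title(query)])
    scores = cosine_similarity(query_vec, tfidf).flatten()
    best = scores.argsort()[::-1][:top_n]
    return [titles[i] for i in best]

print(search("toy story"))
```

In an interactive notebook, a text widget's change event would call `search` and display the returned titles.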
I executed the "NBA MVPs Prediction with Machine Learning" project, involving three core phases dedicated to predicting NBA MVP award winners.
I collected data using web scraping, focused on data cleaning and preparation, and employed various machine learning models to create a robust prediction system.
The NBA MVP prediction tool's accuracy improved from 67% to 75%. Beyond offering valuable insights to basketball enthusiasts, the project invites contributions for ongoing improvements.
Skills: Random Forest, Ridge Regression, Linear Regression, Support Vector Machine, Evaluation Metrics.
Python | Pandas | Scikit-learn | Selenium | BeautifulSoup | Requests | Matplotlib
If you like my work, drop a comment in the mailbox.