This project analyzes Yelp data to uncover trends in local business performance and user behavior across Arizona. Using distributed computing frameworks such as Hadoop and Apache Spark (via PySpark), the project involved merging multiple large-scale JSON datasets, transforming them into a queryable format (Parquet), and running Spark SQL queries to extract insights. Two milestones were completed: the first focused on business-level analysis (e.g., identifying high-rated categories and locations), and the second on user-level behavior (e.g., activity levels, review sentiment, and community influence). Results were visualized and reported in Jupyter notebooks.
Big Data Tools: Hadoop, Apache Spark, PySpark
SQL & Data Processing: Used Spark SQL with complex multi-dataset queries
Data Engineering: Converted JSON files to Parquet, filtered & merged datasets
Business Analytics: Identified best-performing business types & customer preferences
User Behavior Analysis: Analyzed reviews, user influence, sentiment trends
Visualization & Reporting: Generated reports with charts and clear insights