Data Engineer, Brillio
In this project, an Extract, Transform, Load (ETL) pipeline was built with Apache Spark to process raw JSON data hosted in an Amazon S3 bucket. The objective was to transform the data for efficient querying in Amazon Redshift.
The first step involved uploading the raw JSON files to an Amazon S3 source bucket. An Amazon Elastic MapReduce (EMR) cluster was then initialized to process this data with a Spark job, which fetched the source data and applied the transformations required by the project. After the necessary transformations, the results were partitioned and written back to an S3 bucket as Parquet files, a columnar storage format optimized for big data processing workloads.
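As a rough sketch of what this transform step could look like in PySpark (the bucket paths, column names, and partition key below are illustrative placeholders rather than the actual project values):

    # Minimal PySpark sketch of the transform step; paths and columns are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("json-to-parquet-etl").getOrCreate()

    # Read the raw JSON files from the source S3 bucket
    raw = spark.read.json("s3://example-source-bucket/raw/")

    # Example transformations: type casting, deriving a partition column, de-duplication
    cleaned = (
        raw.withColumn("event_ts", F.to_timestamp("event_ts"))
           .withColumn("event_date", F.to_date("event_ts"))
           .dropna(subset=["event_id"])
           .dropDuplicates(["event_id"])
    )

    # Write the results back to S3 as partitioned Parquet
    (cleaned.write
            .mode("overwrite")
            .partitionBy("event_date")
            .parquet("s3://example-target-bucket/curated/events/"))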
The next phase involved loading the transformed, partitioned Parquet files into Amazon Redshift. With its high-performance SQL querying capabilities, Redshift enabled in-depth analysis of the transformed data.
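A load of this kind can be expressed as a Redshift COPY command issued over a standard connection. The snippet below is an illustrative sketch that assumes the psycopg2 driver (Redshift speaks the PostgreSQL wire protocol) and uses placeholder cluster, table, and IAM role values; the target table is assumed to exist with columns matching the Parquet schema.

    # Illustrative load of the Parquet output into Redshift via COPY; all
    # connection details, the table name, and the IAM role ARN are placeholders.
    import psycopg2

    copy_sql = """
        COPY analytics.events
        FROM 's3://example-target-bucket/curated/events/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-copy-role'
        FORMAT AS PARQUET;
    """

    conn = psycopg2.connect(
        host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
        port=5439, dbname="analytics", user="etl_user", password="***",
    )
    with conn, conn.cursor() as cur:
        cur.execute(copy_sql)  # commits on successful exit of the connection context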
Data Engineer, Brillio
The project revolved around the collection and cleaning of various tweet datasets for sentiment analysis. Once the datasets were cleaned, the sentiment labels already present in them were normalized. The primary objective was to compare these normalized sentiments with the system's predicted sentiments for the same set of tweets.
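The normalization step might look roughly like the sketch below, assuming the source datasets used mixed label conventions (for example "pos", "positive", or a numeric code such as 4) that had to be mapped onto a single scale; the sample data and mapping are purely illustrative.

    # Sketch of sentiment-label normalization; sample rows and the label map are assumptions.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("sentiment-normalization").getOrCreate()

    # Tiny sample standing in for the real tweet datasets
    tweets = spark.createDataFrame(
        [("1", "Great day!", "pos"), ("2", "Awful service", "0"), ("3", "Meh", "Neutral")],
        ["tweet_id", "text", "sentiment_raw"],
    )

    label_map = {"pos": "positive", "positive": "positive", "4": "positive",
                 "neg": "negative", "negative": "negative", "0": "negative",
                 "neu": "neutral", "neutral": "neutral", "2": "neutral"}
    mapping_expr = F.create_map([F.lit(x) for kv in label_map.items() for x in kv])

    # Map every raw label onto the common positive/negative/neutral scale
    tweets = tweets.withColumn("sentiment_normalized",
                               mapping_expr[F.lower(F.col("sentiment_raw"))])
    tweets.show()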
This comparison served a dual purpose: it validated the accuracy of the existing sentiment analysis system and produced a refined, enriched dataset ready for further exploration and analysis.
As part of this project, I was responsible for overseeing and executing the data cleaning process, a crucial step in any data analysis pipeline. My task was not only to remove noise from the data but also to preserve its integrity and retain useful information.
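A simplified PySpark sketch of this cleaning stage is shown below; the input path, column names, and filter rules are assumptions for illustration, not the exact rules applied in the project.

    # Sketch of the tweet-cleaning step; path, columns, and rules are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("tweet-cleaning").getOrCreate()

    raw_tweets = spark.read.json("s3://example-bucket/tweets/raw/")  # placeholder path

    cleaned_tweets = (
        raw_tweets
        .dropna(subset=["tweet_id", "text"])                            # drop rows missing key fields
        .dropDuplicates(["tweet_id"])                                   # remove duplicate tweets
        .withColumn("text", F.regexp_replace("text", r"http\S+", ""))   # strip URLs
        .withColumn("text", F.regexp_replace("text", r"@\w+", ""))      # strip @mentions
        .withColumn("text", F.trim(F.lower(F.col("text"))))             # normalize case and whitespace
        .filter(F.length("text") > 0)                                   # keep tweets with remaining content
    )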
Additionally, I used PySpark to compare the original tweet sentiments with the computed sentiments. This task required a keen eye for detail and a solid understanding of sentiment analysis concepts. The goal was to achieve a high degree of agreement between the original and computed sentiments.
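The comparison itself can be sketched as a join on tweet id followed by an agreement metric; the paths, DataFrame names, and column names below are hypothetical.

    # Sketch of comparing normalized original labels against predicted labels.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("sentiment-comparison").getOrCreate()

    # Placeholder inputs: normalized original labels and the system's predictions
    originals = spark.read.parquet("s3://example-bucket/tweets/normalized/")
    predictions = spark.read.parquet("s3://example-bucket/tweets/predicted/")

    compared = (
        originals.join(predictions, on="tweet_id", how="inner")
                 .withColumn("match",
                             (F.col("sentiment_normalized") == F.col("sentiment_predicted"))
                             .cast("int"))
    )

    # Overall agreement rate, plus a per-label breakdown to spot systematic mismatches
    compared.agg(F.avg("match").alias("agreement_rate")).show()
    compared.groupBy("sentiment_normalized", "sentiment_predicted").count().show()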
Through careful data cleaning and precise sentiment comparison, we were able to improve our sentiment analysis model, leading to more reliable insights from our tweet datasets.