In this project, you are given a dataset of 4 different data tables in the CSV format: dirty_tweets.csv, clean_tweets.csv, user.csv, and geo.csv. You need to perform 4 different analytical tasks on this dataset and visualize the results in Texera.
1. After logging into https://texera-ds4all.ics.uci.edu/, go to the "Datasets" tab. You should see a public dataset named "DS4Everyone-project1".
2. You can click the dataset to preview its content. In the dataset explorer, you can view the data formats and example rows.
3. Click the "Workflows" tab, and click the icon to create a new workflow.
Task 1. Clean the dirty_tweets Dataset
Remove an Extra Column: The dirty_tweets table includes a column named create_time, which is not required for our analysis. Remove this column so that the table has the same schema as the clean_tweets table.
Filter Out NaN Values: Inspect the favorite_count column for any "NaN" (which stands for "Not a Number") values and remove rows containing "NaN" in this field.
Remove Duplicates: Eliminate any duplicate rows in the dirty_tweets table to ensure each tweet is unique.
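The project itself is done with Texera operators, but the three cleaning steps above can be cross-checked outside Texera with a minimal pandas sketch. The column names create_time and favorite_count come from the task description; the tweet_id and text columns, the sample rows, and the function name clean_tweets_table are assumptions made only for illustration.

```python
import pandas as pd


def clean_tweets_table(dirty: pd.DataFrame) -> pd.DataFrame:
    """Apply the three Task 1 cleaning steps to a dirty_tweets-like table."""
    # Step 1: drop the extra create_time column (ignored if it is already absent).
    cleaned = dirty.drop(columns=["create_time"], errors="ignore")
    # Step 2: keep only rows whose favorite_count parses as a number.
    # The literal string "NaN" parses to a missing value and is filtered out.
    # Note: this also drops any other non-numeric value, which is an assumption.
    is_number = pd.to_numeric(cleaned["favorite_count"], errors="coerce").notna()
    cleaned = cleaned[is_number]
    # Step 3: remove exact duplicate rows and renumber the index.
    return cleaned.drop_duplicates().reset_index(drop=True)


# Hypothetical sample rows standing in for dirty_tweets.csv.
sample_dirty = pd.DataFrame(
    {
        "tweet_id": [1, 1, 2, 3],
        "text": ["hello", "hello", "broken row", "bye"],
        "favorite_count": ["10", "10", "NaN", "7"],
        "create_time": ["t1", "t1", "t2", "t3"],
    }
)
cleaned_sample = clean_tweets_table(sample_dirty)
```

On the sample, the duplicate of tweet 1 and the row with "NaN" are removed, leaving tweets 1 and 3 without the create_time column.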
Task 2. Merge Datasets
Ensure Schema Consistency: Verify that both dirty_tweets and clean_tweets datasets have the same schema. Utilize the Typecast Operator to adjust their field types as necessary, so that both tables have the same structure and data types.
Merge the two tables: Use the Union Operator to combine the two datasets into one.
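As a rough analogue of the Typecast and Union steps, the pandas sketch below casts each column of the cleaned table to the type used in the clean table and then appends the rows. It is not the Texera implementation: the column names other than favorite_count, the sample data, and the function name union_tweets are hypothetical, and it assumes a union that keeps every row from both inputs.

```python
import pandas as pd


def union_tweets(cleaned_dirty: pd.DataFrame, clean: pd.DataFrame) -> pd.DataFrame:
    """Align cleaned_dirty to clean's schema, then stack the two tables."""
    # Reorder columns to match clean and cast each one to clean's dtype
    # (the pandas counterpart of a typecast step).
    aligned = cleaned_dirty[list(clean.columns)].astype(clean.dtypes.to_dict())
    # Append the rows of both tables; every row from both inputs is kept.
    return pd.concat([clean, aligned], ignore_index=True)


# Hypothetical sample tables: favorite_count is text in one and integer in the other.
sample_clean = pd.DataFrame(
    {"tweet_id": [1, 2], "text": ["a", "b"], "favorite_count": [1, 2]}
)
sample_cleaned_dirty = pd.DataFrame(
    {"text": ["c", "d"], "favorite_count": ["5", "6"], "tweet_id": [3, 4]}
)
merged_sample = union_tweets(sample_cleaned_dirty, sample_clean)
```

Casting before the union matters because stacking a text column on top of an integer column would otherwise produce a mixed-type column that later numeric filters cannot use reliably.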
Task 3. Inspection of the first 500 tweets:
Show the creation times and contents of the first 500 tweets.
Example output:
Task 4. Word Cloud for popular tweets:
Pick tweets with over 2,500 favorites.
Create a word cloud for their text. The following is an example output.
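For a quick sanity check of the Task 4 word cloud outside Texera, the sketch below computes the word frequencies that a word cloud would typically use to size its words. It assumes the merged table has favorite_count and text columns; the tokenization rule (lowercase runs of letters and apostrophes), the sample rows, and the function name popular_word_counts are illustrative choices, not part of the assignment.

```python
import re
from collections import Counter

import pandas as pd


def popular_word_counts(tweets: pd.DataFrame, min_favorites: int = 2500) -> Counter:
    """Count words in the text of tweets with more than min_favorites favorites."""
    # Strictly greater than the threshold, matching "over 2,500 favorites".
    popular = tweets[tweets["favorite_count"] > min_favorites]
    counts = Counter()
    for text in popular["text"]:
        # Lowercase the text and split it into simple word tokens.
        counts.update(re.findall(r"[a-z']+", str(text).lower()))
    return counts


# Hypothetical sample rows standing in for the merged tweets table.
sample_tweets = pd.DataFrame(
    {
        "text": ["Happy happy day", "sad day", "Great day", "edge"],
        "favorite_count": [3000, 100, 2501, 2500],
    }
)
word_counts = popular_word_counts(sample_tweets)
```

In the sample, only the first and third tweets pass the filter, so "happy" and "day" each appear twice and "great" once; the tweet with exactly 2,500 favorites is excluded.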
Task 5. Visualizing the trend of "happy" tweets:
Compute the monthly count of tweets that include the word "happy" in their text.
Sort the counts by their month.
Plot a line chart showing the trend of tweets mentioning "happy". The following is an example output.
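The Task 5 filter, monthly count, and sort can also be sketched in pandas to verify the numbers behind the line chart. The creation-time column name created_at is an assumption (the real column name should be checked in the dataset explorer), as are the sample rows and the function name monthly_happy_counts. The match here is a case-insensitive substring test, so a word such as "unhappy" would also count.

```python
import pandas as pd


def monthly_happy_counts(tweets: pd.DataFrame, time_col: str = "created_at") -> pd.Series:
    """Return the number of tweets containing "happy" per month, sorted by month."""
    # Keep tweets whose text contains "happy", ignoring case and missing text.
    has_happy = tweets["text"].str.contains("happy", case=False, na=False)
    happy = tweets[has_happy]
    # Convert each creation time to its calendar month (for example 2020-01).
    months = pd.to_datetime(happy[time_col]).dt.to_period("M")
    # Count tweets per month and order the result chronologically.
    return happy.groupby(months).size().sort_index()


# Hypothetical sample rows; created_at is an assumed column name.
sample_tweets = pd.DataFrame(
    {
        "created_at": ["2020-03-10", "2020-01-05", "2020-02-01", "2020-01-20"],
        "text": ["happy again", "so happy", "sad", "Happy days"],
    }
)
happy_by_month = monthly_happy_counts(sample_tweets)
```

The resulting series is indexed by month, so it can be handed to any plotting library to draw a trend line comparable to the Texera chart.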
Task 6. Count of influential tweets by state:
Select tweets from users with more than 5,000 followers.
Count the number of tweets per state.
Sort the per-state count in descending order.
Select the top 5 states.
Visualize the state name and counts in a bar chart. The following is an example output.
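The Task 6 steps (filter, count per state, sort descending, keep the top 5) can be sketched in pandas as follows. The sketch assumes the tweets have already been joined with the user and geo tables into a single table; the column names followers_count and state, the sample rows, and the function name top_states are hypothetical, and the join keys are deliberately left out because they depend on the actual schemas.

```python
import pandas as pd


def top_states(joined: pd.DataFrame, min_followers: int = 5000, top_n: int = 5) -> pd.Series:
    """Return the top_n states by number of tweets from influential users."""
    # Keep tweets whose author has strictly more than min_followers followers.
    influential = joined[joined["followers_count"] > min_followers]
    # Count tweets per state, sort from largest to smallest, keep the first top_n.
    counts = influential.groupby("state").size()
    return counts.sort_values(ascending=False).head(top_n)


# Hypothetical joined rows: one row per tweet with its author's followers and state.
rows = []
for state, n in {"CA": 6, "NY": 5, "TX": 4, "FL": 3, "IL": 2, "OH": 1}.items():
    rows += [{"state": state, "followers_count": 6000}] * n
# Users with exactly 5,000 followers do not qualify as "more than 5,000".
rows += [{"state": "WA", "followers_count": 5000}] * 7
top5 = top_states(pd.DataFrame(rows))
```

In the sample, WA has the most tweets overall but none from users above the threshold, so it does not appear in the result.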