Data Visualization and Analysis:
A word cloud is a visual representation of text data, where the size of each word corresponds to its frequency or importance within the text. It provides a quick and intuitive way to identify the most common words in a document or corpus.
Word clouds help highlight key themes, trends, or topics, making it easier for readers to grasp the main ideas or characteristics of the text at a glance.
The two word clouds shown above correspond to the initial true and fake news datasets, before the two were concatenated.
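Since word size in a word cloud tracks word frequency, the underlying computation can be sketched as a simple frequency count. The snippet below is a minimal illustration using only the Python standard library; the stop-word list, the sample sentence, and the function name word_frequencies are assumptions made for this example and are not taken from the project.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "on"}

def word_frequencies(text, top_n=5):
    """Return the top_n most frequent non-stop-words as (word, count) pairs."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(top_n)

# Made-up sample text, not drawn from the dataset.
sample = "The president said the election results are final. Election officials agreed."
print(word_frequencies(sample, top_n=3))
```

A word cloud renderer would then map each count to a font size, so the most frequent word appears largest.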
This scatter plot shows the relationship between the length of each text document and the length of its title. Each dot represents one document; the horizontal axis shows the number of words in the text and the vertical axis shows the number of words in the title. The plot shows a positive correlation: longer documents tend to have longer titles, and shorter documents tend to have shorter titles. However, title length varies considerably among documents of similar length, so some short documents have long titles and some long documents have short titles. Overall, title length is a rough indicator of text length, but not a reliable predictor.
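One way to put a number on the relationship described here is the Pearson correlation coefficient. The following is a minimal sketch with invented word counts; the values and the function name pearson_r are assumptions for illustration, not results from the dataset.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up word counts for five documents (illustration only).
text_words = [120, 300, 450, 800, 1000]
title_words = [6, 9, 8, 12, 11]

r = pearson_r(text_words, title_words)
print(f"Pearson r = {r:.2f}")
```

A value close to 1 would indicate a strong positive linear relationship, while a value closer to 0 would match the wide scatter among documents of similar length.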
This bar graph shows the label distribution (real or fake) in the dataset. The x-axis shows the label and the y-axis shows the number of samples with that label. The dataset contains more records labeled fake than records labeled true.
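The counts behind such a bar graph, and the degree of class imbalance, can be computed in a few lines. This is a sketch with a made-up list of labels; the label strings and the variable names are assumptions for illustration.

```python
from collections import Counter

# Made-up labels (illustration only); the real dataset is much larger.
labels = ["fake", "true", "fake", "fake", "true"]

counts = Counter(labels)
total = sum(counts.values())

for label, n in counts.most_common():
    print(f"{label}: {n} samples ({n / total:.0%})")

# Ratio of the majority class to the minority class.
imbalance_ratio = max(counts.values()) / min(counts.values())
print(f"imbalance ratio: {imbalance_ratio:.2f}")
```

A ratio well above 1 is a signal to consider stratified splitting or class weighting when a classifier is trained later.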
This plot shows the average title length, in characters, for four languages: English, German, Spanish, and French. English titles are the longest, averaging 60 characters. German titles are somewhat shorter, averaging 55 characters, and Spanish titles are the shortest, averaging 35 characters. These are averages, so individual titles within each language will vary. Even so, the plot suggests that titles tend to be shorter in Spanish than in English, French, or German.
This histogram shows the distribution of the number of sentences per text. The x-axis shows the number of sentences, and the y-axis shows the number of texts with that many sentences. Most texts contain between 1 and 5 sentences, and fewer texts fall outside that range.
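The histogram's bins can be reproduced by counting sentences per text and grouping the counts into ranges. The sketch below uses a deliberately naive sentence splitter and a bin width of 5; both choices, along with the sample texts and function names, are assumptions for illustration rather than the method used in the project.

```python
import re
from collections import Counter

def count_sentences(text):
    """Naive count: split on runs of '.', '!' or '?' and ignore empty pieces."""
    parts = re.split(r"[.!?]+", text)
    return sum(1 for p in parts if p.strip())

def bin_label(n, width=5):
    """Map a sentence count to a range label such as '1-5' or '6-10'."""
    if n == 0:
        return "0"
    start = ((n - 1) // width) * width + 1
    return f"{start}-{start + width - 1}"

# Made-up texts (illustration only).
texts = ["One. Two. Three.", "Only one sentence here", "A! B? C. D. E. F."]

histogram = Counter(bin_label(count_sentences(t)) for t in texts)
print(dict(histogram))
```

The naive splitter miscounts abbreviations such as "U.S.", so a proper sentence tokenizer would be preferable on real news text.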
This pie chart shows the percentage of news articles in the dataset that included an image at the time the data was extracted. Here, 1 indicates that the article has an image and 0 indicates that it does not. Articles with images make up the larger share.
This bar graph shows the ten authors who contributed the most articles to the dataset. The x-axis shows the author and the y-axis shows the number of articles. "No Author" has the most contributions, which means that articles without a listed author form the largest single group, followed by "Activist Post" and "Edjenner". The remaining authors contributed considerably fewer articles. The graph gives a general sense of the relative contributions of the different authors.
This bar graph shows the average sentiment polarity for each article type. The x-axis shows the article type: bias, conspiracy, fake, hate, junksci (junk science), satire, and state. The y-axis shows the average sentiment polarity. Satire has the highest average polarity, followed by hate and junk science. Bias, conspiracy, and fake news have the lowest average polarity. Note that sentiment polarity ranges from -1 (negative) to 1 (positive), so an average that is low but still above zero indicates slightly positive sentiment overall.
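The per-type averages in such a plot come from grouping individual polarity scores by article type. The sketch below assumes the polarity of each article has already been computed by some sentiment tool (the source does not say which one); the scores and the function name mean_by_group are made up for illustration.

```python
from collections import defaultdict

def mean_by_group(records):
    """Average the second element of each (group, value) pair per group."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for group, value in records:
        totals[group] += value
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}

# Made-up (article type, polarity) pairs; polarity lies in [-1, 1].
records = [
    ("satire", 0.20), ("satire", 0.10),
    ("bias", 0.02), ("bias", 0.04),
    ("hate", 0.12),
]

averages = mean_by_group(records)
for article_type, avg in sorted(averages.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{article_type}: {avg:+.2f}")
```

Printing the sign makes it easy to see whether a low average is still positive or has dropped below zero.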
This bar graph shows the distribution of word counts for news in different languages. The x-axis shows the number of words, and the y-axis shows the percentage of titles with that number of words.
This bar graph shows the distribution of news articles across eight categories. The x-axis shows the category, and the y-axis shows the number of articles. The category with the most articles is "Politic News," followed by "World News" and "News". The categories with the fewest articles are "US-News" and "Middle-east News."
This box plot compares title lengths for the real and fake labels. The fake news portion of the data contains a number of outliers, which indicates that some fake news titles are unusually long. This insight may prove useful later on.
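Box plots commonly flag a point as an outlier when it lies more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies that rule to made-up title lengths; the 1.5 factor is the usual convention and is assumed here, since the source does not state the plot's settings, and the function name iqr_outliers is hypothetical.

```python
from statistics import quantiles

def iqr_outliers(values, factor=1.5):
    """Return the values lying outside [Q1 - factor*IQR, Q3 + factor*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - factor * iqr
    upper = q3 + factor * iqr
    return [v for v in values if v < lower or v > upper]

# Made-up title lengths in words (illustration only).
fake_title_lengths = [8, 9, 10, 10, 11, 12, 12, 13, 14, 40]

outliers = iqr_outliers(fake_title_lengths)
print(outliers)
```

Listing the flagged titles this way makes it possible to inspect them directly before deciding whether title length is worth using as a feature.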