AUTOMATIC FAKE NEWS DETECTION
PROJECT DELIVERY PHASE I
INTRODUCTION
The rapid growth of fake news on online social networks, which are accessible to anyone worldwide with a single click, has changed the way information is absorbed, and the appeal of social media as a news source means that a large volume of news is now broadcast there. People are witnessing hate crimes around the world that stem from the acceptance of false and fraudulent information, yet no one takes responsibility for that falsified information. Disputed news continues to propagate because people use social media to spread fake news for their own profit, benefit, and entertainment [1]. This influences people's ideology and can lead to changed views or hatred between people of different cultures.
With the rise of social media, fake news can spread faster than a forest fire. The driving force behind this false information is not newspapers but newer channels such as Facebook, Instagram, YouTube, and podcasts [2]. Even knowing about these problems, tracking down every news item or piece of information is very difficult: language is never stable and varies from person to person, so the text is hard for any automated method to interpret. Many machine learning algorithms fail because of this constraint [3].
In this project I aim to determine which news is fake by applying classification techniques. The initial steps include data cleaning, natural language processing, topic modelling, and exploratory data analysis (EDA).
The dataset I am using was acquired from IEEE DataPort and is available at https://ieee-dataport.org/open-access/fnid-fake-news-inference-dataset. It includes 15212 training samples, 1058 validation samples, and 1054 test samples, and its classes are “real” and “fake”. The dataset has 8 columns:
id: matches the id in the PolitiFact website API (unique for each sample).
date: the time each article was published on the PolitiFact website.
speaker: the person or organization to whom the statement relates.
statement: a claim published in the media by a person or an organization that has been investigated in the PolitiFact article.
sources: the sources used to analyze each statement.
paragraph_based_content: content stored as paragraphs in a list.
fullText_based_content: full text formed by joining the paragraphs.
label: the class for each sample.
I will explore the data further while working on it to see what kinds of news or articles it contains, which will determine the EDA.
I plan to build a fake news detection system using data preprocessing, supervised machine learning, and deep learning models.
1. Data preprocessing – NLP techniques prepare the data for the next phase. The preprocessing steps are as follows:
a. Tokenized input
b. Stemming
c. Text cleaning
d. Final preprocessed text
2. Supervised machine learning model – to build the detection system I will start with basic machine learning approaches such as logistic regression and support vector machines. The results of these approaches will be studied using TF-IDF features.
3. Deep learning models – in the belief that the supervised machine learning techniques above will give good results, I will also experiment on the dataset with deep learning models.
a. LSTM – an LSTM will be used on the idea that it can address the long-term dependency problem.
b. CNN – the model will consist of an embedding layer, a convolution layer with 3 convolutions, a max-pooling layer, and a fully connected network.
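The preprocessing steps in item 1 can be sketched in plain Python as follows. This is only a minimal illustration: the function names, the tiny stop-word set, and the suffix-stripping rule are assumptions made for the example, and the toy stemmer merely stands in for a library stemmer. The sketch cleans the text before tokenizing it, which is one possible ordering of the listed steps.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real run would use a fuller one.
STOPWORDS = {"the", "a", "an", "is", "are", "was", "of", "to", "in", "and", "that"}

def clean_text(text):
    """Lowercase, drop URLs, and replace every non-alphanumeric character with a space."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def tokenize(text):
    """Split cleaned text on whitespace."""
    return text.split()

def toy_stem(token):
    """Crude suffix stripping; an assumption standing in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Clean, tokenize, remove stop words, and stem."""
    tokens = tokenize(clean_text(text))
    return [toy_stem(t) for t in tokens if t not in STOPWORDS]

print(preprocess("The senator claimed that taxes are rising! https://example.com"))
```

The crude stemmer over-strips some words (for example, "rising" becomes "ris"), which is one reason a proper stemmer or lemmatizer is preferable in practice.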
Many algorithms have been introduced to detect fake news and misleading information, but a more reliable algorithm or model is urgently needed: fake news has already cost many lives, and hatred spreads easily on social media. It affects people's sentiments, and because people are very sensitive about their religion and the ethics they follow, misleading information can have severe consequences.
Most fake news detection models are developed using supervised machine learning and identify whether a news item is real or fake by comparing the input with data labelled as real or fake. William Yang Wang, in “Liar, Liar Pants on Fire”: A New Benchmark Dataset for Fake News Detection [3], used logistic regression, SVM, and CNNs with hyperparameter tuning and obtained 20-30% accuracy on the validation set; in addition, the Bi-LSTM overfitted the data. Toward Automatic Fake News Classification [4] trained a deep learning model using the Fake News Challenge (FNC) dataset for the veracity-based (IR-DL) submodule and the University of Washington Fake News Dataset (UW) for the style-based module. The accuracy of prediction was 67.1% for ternary classification and 72.12% for binary classification.
In Fake news detection in social media, Kelly Stahl [5] compares the SVM and Naïve Bayes classifiers, both of which are supervised learning techniques that classify data efficiently. Because neither algorithm was sufficiently promising on its own, the paper proposes combining the two classifiers for more accurate results. The paper also notes that the SVM method tends to be very accurate and performs extremely well on datasets that are smaller and more concise.
[1] Sadeghi, F., Bidgoly, A. J., & Amirkhani, H. Fake News Detection on Social Media using a Natural Language Inference Approach. https://assets.researchsquare.com/files/rs-107893/v1_stamped.pdf
[2] Á.I.R. (2019, September 29). Fake News Detection Using Deep Learning. arXiv.org. https://arxiv.org/pdf/1910.03496.pdf
[3] Wang, W. Y. (2017, July 4). “Liar, Liar Pants on Fire”: A New Benchmark Dataset for Fake News Detection. Aclweb.Org. https://www.aclweb.org/anthology/P17-2067.pdf
[4] Ghosh, S. (2019). Toward Automatic Fake News Classification. Scholarspace.Manoa.Hawaii.Edu. https://scholarspace.manoa.hawaii.edu/bitstream/10125/59664/0224.pdf
[5] Stahl, K. (2018, May 15). Fake news detection in social media. Csustan.Edu. https://www.csustan.edu/sites/default/files/groups/University%20Honors%20Program/Journals/02_stahl.pdf
PROJECT DELIVERY PHASE II
In phase 2, I carried out exploratory data analysis (EDA) on the fake news dataset described above and applied supervised machine learning algorithms. The preprocessing consists of the following steps:
a. Tokenized input
b. Text cleaning
c. Lemmatization
d. Final preprocessed text
First, I used regular expressions to clean both the training and the testing data. Second, I tokenized the data and generated a word cloud, which shows how frequently each word is used. Political words dominated, which indicates that the data is focused on political news, where people are mostly seen agreeing or disagreeing with political positions.
To see more clearly which words were used most, I printed the top 20 words in the dataset (see the graph below). The word 'say' does not add any value to the data, so I will remove 'say' and the text '000' during the final cleaning in the final phase, by hard-coding them into the stop-word list, and prettify the data.
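A top-word count of this kind can be sketched with the standard library alone. The helper name `top_words` and the toy token lists are assumptions for illustration; the extra stop words are the two uninformative tokens named above.

```python
from collections import Counter

# The two tokens identified above as adding no value to the data.
EXTRA_STOPWORDS = {"say", "000"}

def top_words(token_lists, k=20, extra_stopwords=EXTRA_STOPWORDS):
    """Count tokens across all documents, skipping the extra stop words."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(t for t in tokens if t not in extra_stopwords)
    return counts.most_common(k)

# Toy, already-tokenized documents (hypothetical data).
docs = [
    ["say", "tax", "cut", "000"],
    ["tax", "bill", "say"],
    ["health", "tax", "bill"],
]
print(top_words(docs, k=2))
```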
I was curious to know which sites or links the articles and news items drew on, since this shows what influences people, what they watch most, and what is most easily accessible. The analysis shows that the sources are dominated by the social media applications most commonly used today: as the image below shows, youtube.com is at the top of the leaderboard, followed by facebook.com and then twitter.com. These applications can therefore influence a whole population, which puts us in a position where we need to address what is happening on them.
Next, I compared the top 4 news sources using a seaborn histogram. The histogram shows that most of the statements were 250-500 words long and very few exceeded 700-800 words.
The lengths of real and fake news are almost the same, which implies that it is hard to tell from the length of a statement alone whether it is real or fake. Fake news can be as concise and short as real news.
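The length comparison can also be checked numerically rather than only by eye. The following is a minimal sketch; the function name `length_summary` and the toy statements are assumptions, and word counts are approximated by splitting on whitespace.

```python
from statistics import mean, median

def length_summary(statements, labels):
    """Return {label: (mean_word_count, median_word_count)}."""
    lengths = {}
    for text, label in zip(statements, labels):
        lengths.setdefault(label, []).append(len(text.split()))
    return {label: (mean(v), median(v)) for label, v in lengths.items()}

# Hypothetical toy data, not taken from the dataset.
statements = ["a b c", "a b c d e", "x y z w", "p q r s"]
labels = ["real", "real", "fake", "fake"]
print(length_summary(statements, labels))
```

If the per-label means and medians are close, as in this toy case, length alone carries little signal for separating the classes.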
Further, to build the detection system I started with basic machine learning approaches: LinearSVC, Logistic Regression, Decision Tree, and MultinomialNB. I used two approaches for applying the algorithms to the dataset:
In the first approach, I took only the training data CSV file, split it into training and testing sets with an 80:20 ratio, and applied MultinomialNB as a check. The accuracy was only 61.5%, so I decided to use the separate training and testing data from IEEE DataPort instead.
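An 80:20 hold-out split of this kind can be sketched as follows. The helper name and the fixed seed are assumptions for the example; a library routine could be used instead.

```python
import random

def train_test_split_80_20(rows, test_ratio=0.2, seed=42):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = int(round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(10))  # stand-in for the rows of the training CSV
train, test = train_test_split_80_20(rows)
print(len(train), len(test))
```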
In the second approach, I used two separate files for training and testing. These files were cleaned and preprocessed in the EDA step of the project and were imported from the drive where I stored the processed data. I then applied all four algorithms to this data to compare the accuracy of each.
The second approach also includes a comparison between the 3 columns listed below:
Statement: A claim published in the media by a person or an organization and has been investigated in the PolitiFact article.
Paragraph_based_content: content stored as paragraphs in a list.
FullText_based_content: full text formed by joining the paragraphs.
The intent behind using 3 different columns was to compare how the models react to longer text that carries more context: statement has the fewest words, followed by the paragraph content and then the whole article. An article is the whole news item; the paragraph is extracted from it and contains the sentence used in the statement column.
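To make the per-column comparison concrete, the sketch below trains a very small multinomial Naive Bayes classifier on whichever text column is named. It is a hand-written stand-in on raw word counts, not the library MultinomialNB used in the project; the class name, the helper `fit_on_column`, and the toy rows are assumptions, while the column names follow the dataset description.

```python
import math
from collections import Counter, defaultdict

class TinyMultinomialNB:
    """Multinomial Naive Bayes over token counts with add-one smoothing."""

    def fit(self, token_lists, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for tokens, label in zip(token_lists, labels):
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict_one(self, tokens):
        n_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, n_label in self.class_counts.items():
            total = sum(self.word_counts[label].values())
            score = math.log(n_label / n_docs)  # log prior
            for t in tokens:
                if t in self.vocab:  # ignore words never seen in training
                    score += math.log((self.word_counts[label][t] + 1) / (total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

def fit_on_column(rows, column):
    """rows: list of dicts holding the text columns and a 'label' key."""
    X = [row[column].lower().split() for row in rows]
    y = [row["label"] for row in rows]
    return TinyMultinomialNB().fit(X, y)

# Hypothetical toy rows; the real data would also carry the two longer columns.
rows = [
    {"statement": "tax cut helps workers", "label": "real"},
    {"statement": "budget report shows tax cut", "label": "real"},
    {"statement": "aliens control the senate", "label": "fake"},
    {"statement": "secret aliens rig votes", "label": "fake"},
]
model = fit_on_column(rows, "statement")
print(model.predict_one("aliens rig the budget".split()))
```

Calling `fit_on_column` once per column (`statement`, `paragraph_based_content`, `fullText_based_content`) and scoring each model on the test file gives the kind of comparison described above.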
Confusion matrices for each of the models used:
1. LinearSVC - accuracy (67.36%)
2. Logistic Regression - accuracy (68.31%)
3. Decision Tree - accuracy (61.10%)
4. MultinomialNB - accuracy (71.82%)
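For reference, accuracy is derived from a binary confusion matrix as the share of correct predictions. The sketch below assumes "fake" is treated as the positive class; the function names and the toy labels are illustrative only.

```python
def confusion_counts(y_true, y_pred, positive="fake"):
    """Count true/false positives and negatives for a binary task."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts

def accuracy(counts):
    """Fraction of predictions on the diagonal of the confusion matrix."""
    return (counts["tp"] + counts["tn"]) / sum(counts.values())

# Hypothetical labels and predictions.
y_true = ["fake", "fake", "real", "real", "real"]
y_pred = ["fake", "real", "real", "real", "fake"]
cm = confusion_counts(y_true, y_pred)
print(cm, accuracy(cm))
```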
PROJECT DELIVERY PHASE III
1D Convolution and LSTM for text data in Fake News detection dataset
Research on fake news detection requires a lot of variation in the datasets. To model the statements in the dataset, I built one model consisting of 1D convolution layers and another model in which a convolutional layer is followed by LSTM layers.
Neural networks take numeric values or vectors as input and perform the required mathematical calculations. The trainable parameters, such as weights and biases, are adjusted during training to reduce the loss function and so improve the accuracy of the model. Each layer combines its inputs with its weights and biases and passes the result through an activation function, which produces the output for the next layer. In my model, the 1D CNN layer produces feature values and passes them to the LSTM, which is the next layer. There are many activation functions; I used 'relu' in the hidden layers and 'softmax' in the last dense layer. ReLU mitigates the vanishing gradient problem, so it suited my model best; the other activation function I tried, 'tanh', did not reach even 50% accuracy. The images below depict 2 of the 4 models that I trained.
The dataset I used contains a wide variety of text, so the model needs some form of memory. One type of RNN that provides memory cells is the LSTM (long short-term memory). An LSTM carries additional information (the cell state) alongside the current vectors and updates it at each step, which is designed to limit the loss of information over long sequences. The model I built is Sequential.
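How weights, biases, and an activation function combine in one layer can be sketched in plain Python. This is an illustration under assumed toy weights, not the trained model; the helper names are hypothetical.

```python
import math

def relu(x):
    """ReLU: pass positive values through, clamp negatives to zero."""
    return max(0.0, x)

def softmax(values):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def dense_forward(inputs, weights, biases):
    """Weighted sum plus bias for each neuron, before any activation."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

# Toy input and parameters (assumed values).
x = [1.0, -2.0]
weights = [[1.0, 0.25], [-1.0, 0.5]]
biases = [0.0, 0.5]

hidden = [relu(z) for z in dense_forward(x, weights, biases)]  # hidden layer with ReLU
probs = softmax(hidden)  # softmax as used in a final layer
print(hidden, probs)
```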
The image below shows that an LSTM cell has three different gates:
Input gate - controls how much of the current input is written to the cell state.
Output gate - controls how much of the cell state is passed on as the output (hidden state) of the cell.
Forget gate - discards information in the cell state that is not relevant to training the model.
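The interaction of the three gates can be sketched with a scalar version of the standard LSTM update. Real layers use weight matrices and vectors; the scalar parameters and their names here are assumptions chosen to keep the example short.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One time step of a scalar LSTM cell; p holds the (assumed) parameters."""
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate value
    c = f * c_prev + i * g  # keep part of the old state, add part of the new
    h = o * math.tanh(c)    # expose part of the state as output
    return h, c

names = ["wf", "uf", "bf", "wi", "ui", "bi", "wo", "uo", "bo", "wg", "ug", "bg"]
zero_params = {name: 0.0 for name in names}

# With all parameters at zero every gate is 0.5, so half the old state survives.
h, c = lstm_step(x=0.0, h_prev=0.0, c_prev=2.0, p=zero_params)
print(h, c)
```

A large positive forget-gate bias pushes `f` toward 1, which is how the cell retains information over many steps.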
PROPOSED MODEL LAYERS
Max Pooling - I used max pooling to downsample the feature maps, which reduces the number of computations and, in turn, the number of parameters needed in later layers. My model uses 2 max pooling layers. [1]
Dense Layer - Each neuron in this layer receives input from every output of the previous layer, which is why the layer is called densely connected. The dense layers map the extracted features to the final prediction. My models use 2-4 dense layers depending on the model: 2 for the CNN and 4 for the LSTM.
Drop Out - This layer regularizes the network by randomly setting the outputs of hidden units from the previous layer to 0 during training. I used 1 dropout layer with the rate set to 0.2.
Activation Function - I used the ReLU activation function because it does not activate all the neurons at the same time. It also brings non-linearity into the model.
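The layer operations described above can be sketched on a single 1D sequence in plain Python. This is a simplified illustration with one channel and an assumed kernel; a deep learning library would normally provide these layers, and the function names are hypothetical.

```python
import random

def conv1d(sequence, kernel):
    """Valid (no padding), stride-1 1D convolution as used in CNN layers."""
    k = len(kernel)
    return [sum(sequence[i + j] * kernel[j] for j in range(k))
            for i in range(len(sequence) - k + 1)]

def max_pool1d(sequence, pool_size=2):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(sequence[i:i + pool_size])
            for i in range(0, len(sequence) - pool_size + 1, pool_size)]

def dropout(values, rate=0.2, rng=None, training=True):
    """Inverted dropout: zero each value with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        return list(values)
    rng = rng or random.Random(0)
    keep = 1.0 - rate
    return [v / keep if rng.random() >= rate else 0.0 for v in values]

features = conv1d([1, 3, 2, 5, 4, 6], kernel=[1, 1])  # assumed toy input and kernel
pooled = max_pool1d(features, pool_size=2)
print(features, pooled)
print(dropout([1.0, 2.0, 3.0, 4.0, 5.0], rate=0.2))
```

Pooling halves the sequence length here, which is the reduction in computation mentioned above; dropout is only active during training and leaves values unchanged at inference time.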
OUTCOMES
Result 1: The first model was trained on the training dataset only; no other dataset was taken into account. The model consists of one 1D convolution layer and 2 dense layers.
Result 2: The second model was trained on the combined training and testing datasets to take all the data into account. The model consists of one 1D convolution layer, 2 LSTM layers, and 4 dense layers. It was compiled with the binary_crossentropy loss and the adam optimizer.
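For reference, the binary cross-entropy loss averages the negative log-likelihood of the true labels under the predicted probabilities. The sketch below is a plain-Python version; the clipping constant `eps` is an assumption added to avoid taking the logarithm of zero.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to keep log() finite
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# A maximally uncertain model (p = 0.5 everywhere) scores ln(2).
print(binary_crossentropy([1, 0], [0.5, 0.5]))
```

A loss that stays near ln(2), about 0.693, indicates a binary model that has learned little beyond chance, which is consistent with accuracies close to 50%.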
FUTURE WORK
Further approaches, such as incorporating BERT and fine-tuning it, could increase the accuracy of the models.
BERT is a pretrained language model that produces contextual embeddings. Since my data consists of text, BERT can serve as a sentence encoder that helps the model understand the context of the data properly. In my opinion it can outperform the models I have used, as it is bidirectional and does a commendable job of handling unlabelled text.
Fine-tuning: only limited exploratory analysis has been done so far to improve the accuracy. There are some fine-tuning strategies that I would like to include. The first is preprocessing long texts, because BERT accepts a maximum sequence length of 512 tokens. The second is using the 12-layer encoder with a pooling layer [2]. Lastly, I would fine-tune BERT with different learning parameters to tackle its overfitting problem [3].
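One simple way to fit long articles into the 512-token limit is to keep the beginning and the end of the text. The sketch below illustrates that idea only; the head size of 128, the function name, and the use of pre-split tokens in place of BERT's own subword tokenizer are all assumptions.

```python
def truncate_head_tail(tokens, max_len=512, head=128):
    """Keep the first `head` tokens and fill the rest of the budget from the end."""
    budget = max_len - 2  # reserve room for the [CLS] and [SEP] special tokens
    if len(tokens) <= budget:
        kept = list(tokens)
    else:
        head = min(head, budget)
        tail = budget - head
        kept = list(tokens[:head]) + (list(tokens[-tail:]) if tail > 0 else [])
    return ["[CLS]"] + kept + ["[SEP]"]

long_text = [str(i) for i in range(1000)]  # stand-in for a tokenized long article
encoded = truncate_head_tail(long_text)
print(len(encoded), encoded[:3], encoded[-3:])
```

Other choices, such as keeping only the first tokens or splitting the article into several chunks, would fit the same limit; which works best for this dataset would have to be tested.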
CONCLUSION
To conclude, the dataset had three text columns, two of which contained a large number of words. The supervised machine learning algorithms performed better on the longer text columns than on the column with fewer words; for example, logistic regression reached 68% accuracy on the shorter text and 80.4% on the longer text.
Of the four supervised learning models, logistic regression performed best on each text column, whereas the decision tree performed slightly worse than the other models.
The neural networks performed below average, with an accuracy of 60% for the 1D CNN and 50.61% for the LSTM model. Therefore, the most accurate models are logistic regression and LinearSVC.
REFERENCES (PHASE III)
[1] Kaliyar, R. K., Goswami, A., & Narang, P. (2021). FakeBERT: Fake news detection in social media with a BERT-based deep learning approach. Multimedia Tools and Applications, 80(8), 11765-11788. doi:10.1007/s11042-020-10183-2
[2] Devlin J, Chang M-W, Lee K, Toutanova K (2019) BERT: Pre-training of deep bidirectional transformers for language understanding. In: NAACL-HLT (1)
[3] Tenney I, Das D, Pavlick E (2019) BERT rediscovers the classical NLP pipeline. In: Proceedings of the 57th annual meeting of the association for computational linguistics
[4] Ajao O, Bhowmik D, Zargari S (2018) Fake news identification on twitter with hybrid CNN and RNN models. In: Proceedings of the 9th international conference on social media and society, pp. 226-230
[5] Sadeghi, F., Bidgoly, A. J., & Amirkhani, H. Fake News Detection on Social Media using a Natural Language Inference Approach. https://assets.researchsquare.com/files/rs-107893/v1_stamped.pdf
[6] Á.I.R. (2019, September 29). Fake News Detection Using Deep Learning. arXiv.org. https://arxiv.org/pdf/1910.03496.pdf
[7] Wang, W. Y. (2017, July 4). “Liar, Liar Pants on Fire”: A New Benchmark Dataset for Fake News Detection. Aclweb.Org. https://www.aclweb.org/anthology/P17-2067.pdf
[8] Ghosh, S. (2019). Toward Automatic Fake News Classification. Scholarspace.Manoa.Hawaii.Edu. https://scholarspace.manoa.hawaii.edu/bitstream/10125/59664/0224.pdf
[9] Stahl, K. (2018, May 15). Fake news detection in social media. Csustan.Edu. https://www.csustan.edu/sites/default/files/groups/University%20Honors%20Program/Journals/02_stahl.pdf